From b843f498ca2eebc4a08bf87176338076488f3656 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Mon, 22 Jun 2026 08:43:30 +0000
Subject: [PATCH] docs(recon): document voice-detect and face-detect ggml backends

Document the new standalone C++/ggml biometric backends as the
recommended/default option for face and voice recognition, keeping the
existing Python insightface / speaker-recognition backends framed as the
legacy path.

- features/face-recognition.md: add a face-detect (ggml) backend section
  with the gallery entries (buffalo-l/m/s non-commercial, yunet-sface
  Apache-2.0), licensing, and verify/detect/analyze quickstart.
- features/voice-recognition.md: add a voice-detect (ggml) backend
  section with the gallery entries (ecapa-tdnn, wespeaker-resnet34,
  eres2net, campplus speaker recognizers; emotion-wav2vec2
  non-commercial analyze head) and quickstart.
- reference/compatibility-table.md: add face-detect.cpp and
  voice-detect.cpp rows to the Vision, Detection & Recognition table.

Signed-off-by: Ettore Di Giacinto
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---
 docs/content/features/face-recognition.md     | 93 +++++++++++++++++--
 docs/content/features/voice-recognition.md    | 90 ++++++++++++++++--
 docs/content/reference/compatibility-table.md |  2 +
 3 files changed, 170 insertions(+), 15 deletions(-)

diff --git a/docs/content/features/face-recognition.md b/docs/content/features/face-recognition.md
index ecc3e7213..7bddc702f 100644
--- a/docs/content/features/face-recognition.md
+++ b/docs/content/features/face-recognition.md
@@ -7,16 +7,93 @@ url = "/features/face-recognition/"
 
 ![Face recognition: 1:N match against a vector store, with an anti-spoofing liveness gate that can veto a verification](/images/diagrams/face-recognition-flow.png)
 
-LocalAI supports face recognition through the `insightface` backend:
-face verification (1:1), face identification (1:N) against a built-in
-vector store, face embedding, face detection, demographic analysis
-(age / gender), and antispoofing / liveness detection.
+LocalAI supports face recognition: face verification (1:1), face
+identification (1:N) against a built-in vector store, face embedding,
+face detection, demographic analysis (age / gender), and antispoofing /
+liveness detection.
 
-The backend ships **two interchangeable engines** under one image, each
-paired with a distinct gallery entry so users can pick by license and
-accuracy needs.
+The same `/v1/face/*` HTTP API is served by two backends:
 
-## Licensing — read this first
+- **`face-detect` (recommended, default).** A standalone C++/ggml
+  engine ([face-detect.cpp](https://github.com/mudler/face-detect.cpp)):
+  no Python, no onnxruntime, no torch runtime. Each gallery entry is a
+  single self-describing GGUF. This is the recommended option for new
+  deployments.
+- **`insightface` (Python).** The original ONNX Runtime backend. Still
+  supported; see [the Python backend](#insightface-python-backend) below.
+
+Both backends expose the identical wire format, so the API examples on
+this page work with either - only the gallery entry name (the `model`
+field) changes.
+
+## face-detect (ggml) backend
+
+The `face-detect` backend reads the detector and recognizer architecture
+(`facedetect.arch`) directly from the GGUF metadata, so installing a
+gallery entry is all that is needed to select an engine. It drives the
+Embeddings / Detect / FaceVerify / FaceAnalyze gRPC rpcs behind the
+`/v1/face/{embed,verify,analyze,detect,register,identify,forget}`
+endpoints.
+
+### Licensing - read this first
+
+| Gallery entry | Detector + recognizer | Embedding dim | License |
+|---|---|---|---|
+| `face-detect-buffalo-l` | SCRFD-10GF + ArcFace R50 + GenderAge | 512 | **Non-commercial research only** (upstream insightface weights) |
+| `face-detect-buffalo-m` | SCRFD-2.5GF + ArcFace R50 + GenderAge | 512 | **Non-commercial research only** |
+| `face-detect-buffalo-s` | SCRFD-500MF + MBF + GenderAge | 512 | **Non-commercial research only** |
+| `face-detect-yunet-sface` | YuNet + SFace (OpenCV Zoo) | 128 | **Apache 2.0 - commercial-safe** |
+
+The insightface buffalo packs (buffalo_l / buffalo_m / buffalo_s) are
+released by the upstream maintainers for **non-commercial research use
+only**. Pick the `face-detect-yunet-sface` entry for production /
+commercial deployments.
+
+### Quickstart
+
+Install the commercial-safe entry (recommended for copy-paste):
+
+```bash
+local-ai models install face-detect-yunet-sface
+```
+
+Verify that two images depict the same person:
+
+```bash
+curl -sX POST http://localhost:8080/v1/face/verify \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "face-detect-yunet-sface",
+    "img1": "https://example.com/alice_1.jpg",
+    "img2": "https://example.com/alice_2.jpg"
+  }'
+```
+
+Detect faces and analyze demographics (buffalo entries populate
+age / gender; YuNet + SFace returns regions only):
+
+```bash
+curl -sX POST http://localhost:8080/v1/face/detect \
+  -H "Content-Type: application/json" \
+  -d '{"model": "face-detect-buffalo-l", "img": "https://example.com/group.jpg"}'
+
+curl -sX POST http://localhost:8080/v1/face/analyze \
+  -H "Content-Type: application/json" \
+  -d '{"model": "face-detect-buffalo-l", "img": "https://example.com/alice.jpg"}'
+```
+
+The 1:N register / identify / forget workflow and the rest of the API
+are identical to the [API reference](#api-reference) below - just pass a
+`face-detect-*` model name. The per-engine verify thresholds are ~0.35
+for the buffalo ArcFace/MBF recognizers and ~0.363 for SFace.
+
+## insightface (Python) backend
+
+The `insightface` backend ships **two interchangeable engines** under
+one image, each paired with a distinct gallery entry so users can pick
+by license and accuracy needs.
+
+### Licensing - read this first
 
 | Gallery entry | Detector + recognizer | Size | License |
 |---|---|---|---|
diff --git a/docs/content/features/voice-recognition.md b/docs/content/features/voice-recognition.md
index 20728a28f..aed5d5bf6 100644
--- a/docs/content/features/voice-recognition.md
+++ b/docs/content/features/voice-recognition.md
@@ -7,16 +7,92 @@ url = "/features/voice-recognition/"
 
 ![Voice recognition: register, identify, and forget voiceprints in a vector store, for 1:1 verify or 1:N identify](/images/diagrams/voice-recognition-flow.png)
 
-LocalAI supports voice (speaker) recognition through the
-`speaker-recognition` backend: speaker verification (1:1), speaker
-identification (1:N) against a built-in vector store, speaker
-embedding, and demographic analysis (age / gender / emotion from
-voice).
+LocalAI supports voice (speaker) recognition: speaker verification
+(1:1), speaker identification (1:N) against a built-in vector store,
+speaker embedding, and demographic analysis (age / gender / emotion
+from voice).
 
 The audio analog to [Face Recognition](/features/face-recognition/),
-following the same two-engine pattern under one image.
+served over the same `/v1/voice/*` HTTP API by two backends:
 
-## Engines
+- **`voice-detect` (recommended, default).** A standalone C++/ggml
+  engine ([voice-detect.cpp](https://github.com/mudler/voice-detect.cpp)):
+  no Python, no onnxruntime, no torch runtime. Each gallery entry is a
+  single self-describing GGUF. This is the recommended option for new
+  deployments.
+- **`speaker-recognition` (Python).** The original SpeechBrain / ONNX
+  backend. Still supported; see [the Python backend](#speaker-recognition-python-backend)
+  below.
+
+Both backends expose the identical wire format, so the API examples on
+this page work with either - only the gallery entry name (the `model`
+field) changes.
+
+## voice-detect (ggml) backend
+
+The `voice-detect` backend reads the embedding (or analysis)
+architecture (`voicedetect.arch`) directly from the GGUF metadata, so
+installing a gallery entry is all that is needed to select an engine. It
+drives the VoiceEmbed / VoiceVerify / VoiceAnalyze gRPC rpcs behind the
+`/v1/voice/{embed,verify,analyze,register,identify,forget}` endpoints.
+
+### Gallery entries
+
+| Gallery entry | Model | Embedding dim | License |
+|---|---|---|---|
+| `voice-detect-ecapa-tdnn` | SpeechBrain ECAPA-TDNN (VoxCeleb) | 192 | **Apache 2.0 - commercial-safe** |
+| `voice-detect-wespeaker-resnet34` | WeSpeaker ResNet34 (VoxCeleb) | 256 | CC-BY-4.0 |
+| `voice-detect-eres2net` | 3D-Speaker ERes2Net (VoxCeleb) | 192 | **Apache 2.0 - commercial-safe** |
+| `voice-detect-campplus` | 3D-Speaker CAM++ (VoxCeleb) | 192 | **Apache 2.0 - commercial-safe** |
+| `voice-detect-emotion-wav2vec2` | audEERING wav2vec2 (age / gender / emotion) | analyze head | **CC-BY-NC-SA-4.0 - non-commercial** |
+
+The four speaker-recognition entries drive verify / embed / identify.
+`voice-detect-emotion-wav2vec2` is the analysis head behind
+`/v1/voice/analyze` (continuous age estimate plus gender and emotion
+class scores) and is **non-commercial / research use only**.
+
+### Quickstart
+
+Install the default entry (recommended for copy-paste):
+
+```bash
+local-ai models install voice-detect-ecapa-tdnn
+```
+
+Verify that two audio clips were spoken by the same person:
+
+```bash
+curl -sX POST http://localhost:8080/v1/voice/verify \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "voice-detect-ecapa-tdnn",
+    "audio1": "https://example.com/alice_1.wav",
+    "audio2": "https://example.com/alice_2.wav"
+  }'
+```
+
+Analyze age / gender / emotion (install the analyze entry first):
+
+```bash
+local-ai models install voice-detect-emotion-wav2vec2
+
+curl -sX POST http://localhost:8080/v1/voice/analyze \
+  -H "Content-Type: application/json" \
+  -d '{"model": "voice-detect-emotion-wav2vec2", "audio": "https://example.com/alice.wav"}'
+```
+
+The 1:N register / identify / forget workflow and the rest of the API
+are identical to the [API reference](#api-reference) below - just pass a
+`voice-detect-*` model name. The default verify threshold is ~0.25 for
+the ECAPA-TDNN / ERes2Net / CAM++ recognizers and ~0.30 for WeSpeaker
+ResNet34.
+
+## speaker-recognition (Python) backend
+
+The `speaker-recognition` backend follows the same two-engine pattern
+under one image.
+
+### Engines
 
 | Gallery entry | Model | Size | License |
 |---|---|---|---|
diff --git a/docs/content/reference/compatibility-table.md b/docs/content/reference/compatibility-table.md
index 21971ff45..0e9551b3b 100644
--- a/docs/content/reference/compatibility-table.md
+++ b/docs/content/reference/compatibility-table.md
@@ -97,6 +97,8 @@ All backends listed here can be installed on demand from the [Backend Gallery]({
 | [locate-anything.cpp](https://github.com/mudler/locate-anything.cpp) | Open-vocabulary object detection and visual grounding (LocateAnything-3B) in C/C++ using GGML | CPU, CUDA 12/13, Intel SYCL, Vulkan, Jetson L4T |
 | [depth-anything.cpp](https://github.com/mudler/depth-anything.cpp) | Depth Anything 3 monocular metric depth + camera pose in C/C++ using GGML | CPU, CUDA 12/13, Intel SYCL, Vulkan, Jetson L4T |
 | [sam3.cpp](https://github.com/PABannier/sam3.cpp) | Segment Anything (SAM 3/2/EdgeTAM) with text/point/box prompts in C/C++ using GGML | CPU, CUDA 12/13, Intel SYCL, Vulkan, Jetson L4T |
+| [face-detect.cpp](https://github.com/mudler/face-detect.cpp) | Native face detection, recognition, embedding, demographics and anti-spoofing (SCRFD/ArcFace, YuNet/SFace) in C/C++ using GGML | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
+| [voice-detect.cpp](https://github.com/mudler/voice-detect.cpp) | Native speaker (voice) recognition and voice analysis (ECAPA-TDNN, WeSpeaker, ERes2Net, CAM++, wav2vec2) in C/C++ using GGML | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
 | [insightface](https://github.com/deepinsight/insightface) | Face verification, embedding, and anti-spoofing liveness (ONNX Runtime) | CPU, CUDA 12 |
 | [speaker-recognition](https://speechbrain.github.io/) | Speaker (voice) recognition via SpeechBrain ECAPA-TDNN | CPU, CUDA 12, Metal |
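
Both feature pages in this patch state that the two backends share one wire format and that only the `model` field changes. A minimal Python sketch of that claim follows; it assembles the verify request bodies shown in the quickstarts without sending them. The helper name `verify_request`, the `BASE_URL` constant, and the `"face"` / `"voice"` selector are assumptions made for illustration. Only the endpoint paths, JSON field names, and gallery entry names are taken from the patch.

```python
import json

# Assumed local address, copied from the quickstart curl commands.
BASE_URL = "http://localhost:8080"

# JSON field names per modality, as they appear in the documented verify examples.
_VERIFY_FIELDS = {
    "face": ("img1", "img2"),
    "voice": ("audio1", "audio2"),
}


def verify_request(kind: str, model: str, first: str, second: str) -> tuple[str, str]:
    """Return (url, JSON body) for a 1:1 verify call.

    Hypothetical helper: it only builds the request, it does not send it.
    """
    if kind not in _VERIFY_FIELDS:
        raise ValueError(f"unknown kind: {kind!r}")
    key_a, key_b = _VERIFY_FIELDS[kind]
    body = {"model": model, key_a: first, key_b: second}
    return f"{BASE_URL}/v1/{kind}/verify", json.dumps(body)


if __name__ == "__main__":
    # Switching gallery entries changes only the "model" value in the body.
    for model in ("face-detect-yunet-sface", "face-detect-buffalo-l"):
        url, body = verify_request(
            "face",
            model,
            "https://example.com/alice_1.jpg",
            "https://example.com/alice_2.jpg",
        )
        print(url, body)
```

The returned pair can be passed to any HTTP client; the response schema is not covered by this patch, so the sketch stops at request construction.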