LocalAI/docs/content/features/voice-recognition.md at c6170b875d02b7b007ed4631a59835fc7e364feb

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-22 15:49:12 -04:00

Files

Ettore Di Giacinto b843f498ca docs(recon): document voice-detect and face-detect ggml backends

Document the new standalone C++/ggml biometric backends as the
recommended/default option for face and voice recognition, keeping the
existing Python insightface / speaker-recognition backends framed as the
legacy path.

- features/face-recognition.md: add a face-detect (ggml) backend section
  with the gallery entries (buffalo-l/m/s non-commercial, yunet-sface
  Apache-2.0), licensing, and verify/detect/analyze quickstart.
- features/voice-recognition.md: add a voice-detect (ggml) backend
  section with the gallery entries (ecapa-tdnn, wespeaker-resnet34,
  eres2net, campplus speaker recognizers; emotion-wav2vec2 non-commercial
  analyze head) and quickstart.
- reference/compatibility-table.md: add face-detect.cpp and
  voice-detect.cpp rows to the Vision, Detection & Recognition table.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

2026-06-22 08:43:30 +00:00

11 KiB

Raw Blame History

+++ disableToc = false title = "Voice Recognition" weight = 15 url = "/features/voice-recognition/" +++

LocalAI supports voice (speaker) recognition: speaker verification (1:1), speaker identification (1:N) against a built-in vector store, speaker embedding, and demographic analysis (age / gender / emotion from voice).

The audio analog to Face Recognition, served over the same /v1/voice/* HTTP API by two backends:

voice-detect (recommended, default). A standalone C++/ggml engine (voice-detect.cpp): no Python, no onnxruntime, no torch runtime. Each gallery entry is a single self-describing GGUF. This is the recommended option for new deployments.
speaker-recognition (Python). The original SpeechBrain / ONNX backend. Still supported; see the Python backend below.

Both backends expose the identical wire format, so the API examples on this page work with either - only the gallery entry name (the model field) changes.

voice-detect (ggml) backend

The voice-detect backend reads the embedding (or analysis) architecture (voicedetect.arch) directly from the GGUF metadata, so installing a gallery entry is all that is needed to select an engine. It drives the VoiceEmbed / VoiceVerify / VoiceAnalyze gRPC rpcs behind the /v1/voice/{embed,verify,analyze,register,identify,forget} endpoints.

Gallery entries

Gallery entry	Model	Embedding dim	License
`voice-detect-ecapa-tdnn`	SpeechBrain ECAPA-TDNN (VoxCeleb)	192	Apache 2.0 - commercial-safe
`voice-detect-wespeaker-resnet34`	WeSpeaker ResNet34 (VoxCeleb)	256	CC-BY-4.0
`voice-detect-eres2net`	3D-Speaker ERes2Net (VoxCeleb)	192	Apache 2.0 - commercial-safe
`voice-detect-campplus`	3D-Speaker CAM++ (VoxCeleb)	192	Apache 2.0 - commercial-safe
`voice-detect-emotion-wav2vec2`	audEERING wav2vec2 (age / gender / emotion)	analyze head	CC-BY-NC-SA-4.0 - non-commercial

The four speaker-recognition entries drive verify / embed / identify. voice-detect-emotion-wav2vec2 is the analysis head behind /v1/voice/analyze (continuous age estimate plus gender and emotion class scores) and is non-commercial / research use only.

Quickstart

Install the default entry (recommended for copy-paste):

local-ai models install voice-detect-ecapa-tdnn

Verify that two audio clips were spoken by the same person:

curl -sX POST http://localhost:8080/v1/voice/verify \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-detect-ecapa-tdnn",
    "audio1": "https://example.com/alice_1.wav",
    "audio2": "https://example.com/alice_2.wav"
  }'

Analyze age / gender / emotion (install the analyze entry first):

local-ai models install voice-detect-emotion-wav2vec2

curl -sX POST http://localhost:8080/v1/voice/analyze \
  -H "Content-Type: application/json" \
  -d '{"model": "voice-detect-emotion-wav2vec2", "audio": "https://example.com/alice.wav"}'

The 1:N register / identify / forget workflow and the rest of the API are identical to the API reference below - just pass a voice-detect-* model name. The default verify threshold is ~0.25 for the ECAPA-TDNN / ERes2Net / CAM++ recognizers and ~0.30 for WeSpeaker ResNet34.

speaker-recognition (Python) backend

The speaker-recognition backend follows the same two-engine pattern under one image.

Engines

Gallery entry	Model	Size	License
`speechbrain-ecapa-tdnn`	ECAPA-TDNN on VoxCeleb (SpeechBrain)	~17 MB	Apache 2.0 — commercial-safe
`wespeaker-resnet34`	WeSpeaker ResNet34 ONNX	~26 MB	Apache 2.0 — commercial-safe

Both entries are commercial-safe Apache-2.0. SpeechBrain is the default — it's a lightweight pure-PyTorch checkpoint that auto- downloads on first use. The wespeaker-resnet34 entry wires the direct-ONNX path for CPU-only deployments that don't want the torch runtime.

Quickstart

Install the default backend and model:

local-ai models install speechbrain-ecapa-tdnn

Verify that two audio clips were spoken by the same person:

curl -sX POST http://localhost:8080/v1/voice/verify \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "audio1": "https://example.com/alice_1.wav",
    "audio2": "https://example.com/alice_2.wav"
  }'

Response:

{
  "verified": true,
  "distance": 0.18,
  "threshold": 0.25,
  "confidence": 28.0,
  "model": "speechbrain-ecapa-tdnn",
  "processing_time_ms": 340.0
}

1:N identification workflow (register → identify → forget)

Same flow as face recognition, same in-memory vector store under the hood.

curl -sX POST http://localhost:8080/v1/voice/register \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "name": "Alice",
    "audio": "https://example.com/alice.wav"
  }'
# → {"id": "b2f...", "name": "Alice", "registered_at": "2026-04-22T..."}

Identify an unknown probe:

curl -sX POST http://localhost:8080/v1/voice/identify \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "audio": "https://example.com/unknown.wav",
    "top_k": 5
  }'
# → {"matches": [{"id":"b2f...","name":"Alice","distance":0.19,"match":true,...}]}

Remove a speaker by ID:

curl -sX POST http://localhost:8080/v1/voice/forget \
  -d '{"id": "b2f..."}'
# → 204 No Content

{{% notice warning %}} Storage caveat. The default vector store is in-memory. All registered speakers are lost when LocalAI restarts. Persistent storage (pgvector) is a tracked future enhancement shared with face recognition — the voice-recognition HTTP API is designed to swap the backing store without changing the wire format. {{% /notice %}}

API reference

`POST /v1/voice/verify` (1:1)

field	type	description
`model`	string	gallery entry name (e.g. `speechbrain-ecapa-tdnn`)
`audio1`, `audio2`	string	URL, base64, or data-URI of an audio file
`threshold`	float, optional	cosine-distance cutoff; default 0.25 for ECAPA-TDNN
`anti_spoofing`	bool, optional	reserved — unused in the current release

Returns verified, distance, threshold, confidence, model, and processing_time_ms.

`POST /v1/voice/analyze`

Returns demographic attributes (age, gender, emotion) inferred from speech:

field	type	description
`model`	string	gallery entry
`audio`	string	URL / base64 / data-URI
`actions`	string[]	subset of `["age","gender","emotion"]`; empty = all supported

Emotion is inferred from the SUPERB emotion-recognition checkpoint (superb/wav2vec2-base-superb-er, Apache 2.0) — 4-way categorical neutral / happy / angry / sad. The model auto-downloads on the first analyze call.

Age and gender are opt-in: no standard-transformers checkpoint with a clean classifier head is shipped as the default. The high-accuracy Audeering age/gender model uses a custom multi-task head that AutoModelForAudioClassification doesn't load safely (the age weights are silently dropped and the classifier is re-initialised with random values). To enable age/gender, set age_gender_model:<repo> in the model YAML's options: pointing at a checkpoint with a vanilla Wav2Vec2ForSequenceClassification head. Override the emotion default similarly via emotion_model:. Set either to an empty string to disable that head.

If a head fails to load (offline, disk full, transformers missing), the engine degrades gracefully: it still returns the attributes it could compute. When nothing can be computed the backend returns 501 Unimplemented.

Analyze is supported by both speechbrain-ecapa-tdnn and wespeaker-resnet34 — the speaker recognizer and the analysis head are independent.

`POST /v1/voice/register` (1:N enrollment)

field	type	description
`model`	string	voice recognition model
`audio`	string	speaker audio to enroll
`name`	string	human-readable label
`labels`	map[string]string, optional	arbitrary metadata
`store`	string, optional	vector store model; defaults to local-store

Returns {id, name, registered_at}. The id is an opaque UUID used by /v1/voice/identify and /v1/voice/forget.

`POST /v1/voice/identify` (1:N recognition)

field	type	description
`model`	string	voice recognition model
`audio`	string	probe audio
`top_k`	int, optional	max matches to return; default 5
`threshold`	float, optional	cosine-distance cutoff; default 0.25
`store`	string, optional	vector store model

Returns a list of matches sorted by ascending distance, each with id, name, labels, distance, confidence, and match (distance ≤ threshold).

`POST /v1/voice/forget`

field	type	description
`id`	string	ID returned by `/v1/voice/register`

Returns 204 No Content on success, 404 Not Found if the ID is unknown.

`POST /v1/voice/embed`

Returns the L2-normalized speaker embedding vector.

field	type	description
`model`	string	voice model
`audio`	string	URL / base64 / data-URI

Returns {embedding: float[], dim: int, model: string}. Dimension depends on the recognizer: 192 for ECAPA-TDNN, 256 for WeSpeaker ResNet34.

Note: the OpenAI-compatible /v1/embeddings endpoint is intentionally text-only — it does nothing useful with audio input. Use /v1/voice/embed for audio.

Audio input

Audio is materialised by the HTTP layer to a temporary WAV file before the gRPC call. All audio fields accept:

http:// / https:// URLs (downloaded server-side, subject to ValidateExternalURL safety checks).
Raw base64 (no prefix).
Data URIs (data:audio/wav;base64,...).

The backend itself always receives a filesystem path — the same convention the Whisper / Voxtral transcription backends use.

Threshold reference

Recognizer	Cosine-distance threshold
ECAPA-TDNN (SpeechBrain, VoxCeleb)	~0.25
WeSpeaker ResNet34	~0.30
3D-Speaker ERes2Net	~0.28

Pass threshold explicitly when switching recognizers — the per-model default only applies when omitted.

Face Recognition — the image analog; the two share a registry design.
Audio to Text — transcription (Whisper, Voxtral, faster-whisper). Runs in addition to, not instead of, voice recognition.
Stores — the generic vector store powering both the face and voice 1:N recognition pipelines.
Embeddings — text-only OpenAI-compatible embedding endpoint; for audio embeddings use /v1/voice/embed.

11 KiB Raw Blame History

voice-detect (ggml) backend

Gallery entries

Quickstart

speaker-recognition (Python) backend

Engines

Quickstart

1:N identification workflow (register → identify → forget)

API reference

POST /v1/voice/verify (1:1)

POST /v1/voice/analyze

POST /v1/voice/register (1:N enrollment)

POST /v1/voice/identify (1:N recognition)

POST /v1/voice/forget

POST /v1/voice/embed

Audio input

Threshold reference

Related features

11 KiB

Raw Blame History

`POST /v1/voice/verify` (1:1)

`POST /v1/voice/analyze`

`POST /v1/voice/register` (1:N enrollment)

`POST /v1/voice/identify` (1:N recognition)

`POST /v1/voice/forget`

`POST /v1/voice/embed`