LocalAI/docs/content/features/voice-recognition.md
Ettore Di Giacinto 181ebb6df4 feat: voice recognition (#9500)
* feat(voice-recognition): add /v1/voice/{verify,analyze,embed} + speaker-recognition backend

Audio analog to face recognition. Adds three gRPC RPCs
(VoiceVerify / VoiceAnalyze / VoiceEmbed), their Go service and HTTP
layers, a new FLAG_SPEAKER_RECOGNITION capability flag, and a Python
backend scaffold under backend/python/speaker-recognition/ wrapping
SpeechBrain ECAPA-TDNN with a parallel OnnxDirectEngine for
WeSpeaker / 3D-Speaker ONNX exports.

The kokoros Rust backend gets matching unimplemented trait stubs —
tonic's async_trait has no defaults, so adding an RPC without Rust
stubs breaks the build (same regression fixed by eb01c772 for face).

Swagger, /api/instructions, and the auth RouteFeatureRegistry /
APIFeatures list are updated so the endpoints surface everywhere a
client or admin UI looks.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): add 1:N identify + register/forget endpoints

Mirrors the face-recognition register/identify/forget surface. New
package core/services/voicerecognition/ carries a Registry interface
and a local-store-backed implementation (same in-memory vector-store
plumbing facerecognition uses, separate instance so the embedding
spaces stay isolated).

Handlers under /v1/voice/{register,identify,forget} reuse
backend.VoiceEmbed to compute the probe vector, then delegate the
nearest-neighbour search to the registry. Default cosine-distance
threshold is tuned for ECAPA-TDNN on VoxCeleb (0.25, EER ~1.9%).

As with the face registry, the current backing is in-memory only — a
pgvector implementation is a future constructor-level swap.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): gallery, docs, CI and e2e coverage

- backend/index.yaml: speaker-recognition backend entry + CPU and
  CUDA-12 image variants (plus matching development variants).
- gallery/index.yaml: speechbrain-ecapa-tdnn (default) and
  wespeaker-resnet34 model entries. The WeSpeaker SHA-256 is a
  deliberate placeholder — the HF URI must be curl'd and its hash
  filled in before the entry installs.
- docs/content/features/voice-recognition.md: API reference + quickstart,
  mirrors the face-recognition docs.
- React UI: CAP_SPEAKER_RECOGNITION flag export (consumers follow face's
  precedent — no dedicated tab yet).
- tests/e2e-backends: voice_embed / voice_verify / voice_analyze specs.
  Helper resolveFaceFixture is reused as-is — the only thing face/voice
  share is "download a file into workDir", so no need for a new helper.
- Makefile: docker-build-speaker-recognition + test-extra-backend-
  speaker-recognition-{ecapa,all} targets. Audio fixtures default to
  VCTK p225/p226 samples from HuggingFace.
- CI: test-extra.yml grows a tests-speaker-recognition-grpc job
  mirroring insightface. backend.yml matrix gains CPU + CUDA-12 image
  build entries — scripts/changed-backends.js auto-picks these up.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): wire a working /v1/voice/analyze head

Adds AnalysisHead: a lazy-loading age / gender / emotion inference
wrapper that plugs into both SpeechBrainEngine and OnnxDirectEngine.

Defaults to two open-licence HuggingFace checkpoints:
  - audeering/wav2vec2-large-robust-24-ft-age-gender (Apache 2.0) —
    age regression + 3-way gender (female / male / child).
  - superb/wav2vec2-base-superb-er (Apache 2.0) — 4-way emotion.

Both are optional and degrade gracefully when transformers or the
model can't be loaded — the engine raises NotImplementedError so the
gRPC layer returns 501 instead of a generic 500.

Emotion classes pass through from the model (neutral/happy/angry/sad
on the default checkpoint); the e2e test now accepts any non-empty
dominant gender so custom age_gender_model overrides don't fail it.

Adds transformers to the backend's CPU and CUDA-12 requirements.

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): pin real WeSpeaker ResNet34 ONNX SHA-256

Replaces the placeholder hash in gallery/index.yaml with the actual
SHA-256 (7bb2f06e…) of the upstream
Wespeaker/wespeaker-voxceleb-resnet34-LM ONNX at ~25MB. `local-ai
models install wespeaker-resnet34` now succeeds.

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): soundfile loader + honest analyze default

Two issues surfaced on first end-to-end smoke with the actual backend
image:

1. torchaudio.load in torchaudio 2.8+ requires the torchcodec package
   for audio decoding. Switch SpeechBrainEngine._load_waveform to the
   already-present soundfile (listed in requirements.txt) plus a numpy
   linear resample to 16kHz. Drops a heavy ffmpeg-linked dep and the
   codepath we never exercise (torchaudio's ffmpeg backend).

2. The AnalysisHead was defaulting to audeering/wav2vec2-large-robust-
   24-ft-age-gender, but AutoModelForAudioClassification silently
   mangles that checkpoint — it reports the age head weights as
   UNEXPECTED and re-initialises the classifier head with random
   values, so the "gender" output is noise and there is no age output
   at all. Make age/gender opt-in instead (empty default; users wire
   a cleanly-loadable Wav2Vec2ForSequenceClassification checkpoint via
   age_gender_model: option). Emotion keeps its working Superb default.
   Also broaden _infer_age_gender's tensor-shape handling and catch
   runtime exceptions so a dodgy age/gender head never takes down the
   whole analyze call.

Docs and README updated to match the new policy.

Verified with the branch-scoped gallery on localhost:
- voice/embed    → 192-d ECAPA-TDNN vector
- voice/verify   → same-clip dist≈6e-08 verified=true; cross-speaker
                   dist 0.76–0.99 verified=false (as expected)
- voice/register/identify/forget → round-trip works, 404 on unknown id
- voice/analyze  → emotion populated, age/gender omitted (opt-in)

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): real CI audio fixtures + fixture-agnostic verify spec

Two issues surfaced after CI actually ran the speaker-recognition e2e
target (I'd curl-tested against a running server but hadn't run the
make target locally):

1. The default BACKEND_TEST_VOICE_AUDIO_* URLs pointed at
   huggingface.co/datasets/CSTR-Edinburgh/vctk paths that return 404
   (the dataset is gated). Swap them for the speechbrain test samples
   served from github.com/speechbrain/speechbrain/raw/develop/ —
   public, no auth, correct 16kHz mono format.

2. The VoiceVerify spec required d(file1,file2) < 0.4, assuming
   file1/file2 were same-speaker. The speechbrain samples are three
   different speakers (example1/2/5), and there is no easy un-gated
   source of true same-speaker audio pairs (VoxCeleb/VCTK/LibriSpeech
   are all license- or size-gated for CI use). Replace the ceiling
   check with a relative-ordering assertion: d(pair) > d(same-clip)
   for both file2 and file3 — that's enough to prove the embeddings
   encode speaker info, and it works with any three non-identical
   clips. Actual speaker ordering d(1,2) vs d(1,3) is logged but not
   asserted.

Local run: 4/4 voice specs pass (Health, LoadModel, VoiceEmbed,
VoiceVerify) on the built backend image. 12 non-voice specs skipped
as expected.

Assisted-by: Claude:claude-opus-4-7

* fix(ci): checkout with submodules in the reusable backend_build workflow

The kokoros Rust backend build fails with

    failed to read .../sources/Kokoros/kokoros/Cargo.toml: No such file

because the reusable backend_build.yml workflow's actions/checkout
step was missing `submodules: true`. Dockerfile.rust does `COPY .
/LocalAI`, and without the submodule files the subsequent `cargo
build` can't find the vendored Kokoros crate.

The bug pre-dates this PR — scripts/changed-backends.js only triggers
the kokoros image job when something under backend/rust/kokoros or
the shared proto changes, so master had been coasting past it. The
voice-recognition proto addition re-broke it.

Other checkouts in backend.yml (llama-cpp-darwin) and test-extra.yml
(insightface, kokoros, speaker-recognition) already pass
`submodules: true`; this brings the shared backend image builder in
line.

Assisted-by: Claude:claude-opus-4-7
2026-04-23 12:07:14 +02:00


+++
disableToc = false
title = "Voice Recognition"
weight = 15
url = "/features/voice-recognition/"
+++
LocalAI supports voice (speaker) recognition through the
`speaker-recognition` backend: speaker verification (1:1), speaker
identification (1:N) against a built-in vector store, speaker
embedding, and demographic analysis (age / gender / emotion from
voice).
It is the audio analog of [Face Recognition](/features/face-recognition/)
and follows the same two-engine pattern under one image.
## Engines
| Gallery entry | Model | Size | License |
|---|---|---|---|
| `speechbrain-ecapa-tdnn` | ECAPA-TDNN on VoxCeleb (SpeechBrain) | ~17 MB | **Apache 2.0 — commercial-safe** |
| `wespeaker-resnet34` | WeSpeaker ResNet34 ONNX | ~26 MB | **Apache 2.0 — commercial-safe** |
Both entries are commercial-safe Apache-2.0. SpeechBrain is the
default: it's a lightweight pure-PyTorch checkpoint that
auto-downloads on first use. The `wespeaker-resnet34` entry wires the
direct-ONNX path for CPU-only deployments that don't want the torch
runtime.
## Quickstart
Install the default backend and model:
```bash
local-ai models install speechbrain-ecapa-tdnn
```
Verify that two audio clips were spoken by the same person:
```bash
curl -sX POST http://localhost:8080/v1/voice/verify \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "audio1": "https://example.com/alice_1.wav",
    "audio2": "https://example.com/alice_2.wav"
  }'
```
Response:
```json
{
  "verified": true,
  "distance": 0.18,
  "threshold": 0.25,
  "confidence": 28.0,
  "model": "speechbrain-ecapa-tdnn",
  "processing_time_ms": 340.0
}
```
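The same call can be made from Python. The following is a minimal sketch using only the standard library; it assumes a LocalAI instance listening on `http://localhost:8080`, and `build_verify_payload` / `verify` are illustrative helper names rather than part of any LocalAI SDK.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: default local instance


def build_verify_payload(model, audio1, audio2, threshold=None):
    """Build the JSON body for POST /v1/voice/verify."""
    payload = {"model": model, "audio1": audio1, "audio2": audio2}
    # Send a threshold only when overriding the default cutoff.
    if threshold is not None:
        payload["threshold"] = threshold
    return payload


def verify(model, audio1, audio2, threshold=None, base_url=BASE_URL):
    """POST the payload and return the decoded JSON response as a dict."""
    body = json.dumps(build_verify_payload(model, audio1, audio2, threshold))
    req = urllib.request.Request(
        f"{base_url}/v1/voice/verify",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Against a running server, `verify("speechbrain-ecapa-tdnn", url_a, url_b)` should return a dict with the fields shown in the response above (`verified`, `distance`, `threshold`, and so on).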
## 1:N identification workflow (register → identify → forget)
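The flow is the same as for face recognition and runs on the same
in-memory vector-store plumbing, using a separate store instance so
that voice and face embeddings stay isolated.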
1. Register known speakers:
```bash
curl -sX POST http://localhost:8080/v1/voice/register \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "name": "Alice",
    "audio": "https://example.com/alice.wav"
  }'
# → {"id": "b2f...", "name": "Alice", "registered_at": "2026-04-22T..."}
```
2. Identify an unknown probe:
```bash
curl -sX POST http://localhost:8080/v1/voice/identify \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "audio": "https://example.com/unknown.wav",
    "top_k": 5
  }'
# → {"matches": [{"id":"b2f...","name":"Alice","distance":0.19,"match":true,...}]}
```
3. Remove a speaker by ID:
```bash
curl -sX POST http://localhost:8080/v1/voice/forget \
  -H "Content-Type: application/json" \
  -d '{"id": "b2f..."}'
# → 204 No Content
```
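The three steps above can be scripted as one round trip. The sketch below is a hypothetical thin client (`VoiceRegistryClient` and `http_post` are illustrative names, not a LocalAI API); the transport is injectable so the flow can be exercised without a running server.

```python
import json
import urllib.request


def http_post(url, payload):
    """Default transport: POST JSON, return (status, parsed body or None)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        raw = resp.read()
        return resp.status, (json.loads(raw) if raw else None)


class VoiceRegistryClient:
    """Hypothetical wrapper over the register/identify/forget endpoints."""

    def __init__(self, model, base_url="http://localhost:8080", post=http_post):
        self.model = model
        self.base_url = base_url.rstrip("/")
        self.post = post  # injectable so the flow can be tested offline

    def register(self, name, audio):
        """Enroll a speaker and return the opaque ID assigned by the server."""
        _, body = self.post(
            f"{self.base_url}/v1/voice/register",
            {"model": self.model, "name": name, "audio": audio},
        )
        return body["id"]

    def identify(self, audio, top_k=5):
        """Return only the candidates the server flagged as a match."""
        _, body = self.post(
            f"{self.base_url}/v1/voice/identify",
            {"model": self.model, "audio": audio, "top_k": top_k},
        )
        return [m for m in body["matches"] if m["match"]]

    def forget(self, speaker_id):
        """Remove a speaker; True when the server answers 204 No Content."""
        status, _ = self.post(
            f"{self.base_url}/v1/voice/forget", {"id": speaker_id}
        )
        return status == 204
```

With the default `urllib` transport, a `404 Not Found` from `/v1/voice/forget` surfaces as `urllib.error.HTTPError` rather than a `False` return value, so callers that expect unknown IDs should catch it.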
{{% alert icon="⚠️" color="warning" %}}
**Storage caveat.** The default vector store is in-memory. All
registered speakers are lost when LocalAI restarts. Persistent storage
(pgvector) is a tracked future enhancement shared with face
recognition — the voice-recognition HTTP API is designed to swap the
backing store without changing the wire format.
{{% /alert %}}
## API reference
### `POST /v1/voice/verify` (1:1)
| field | type | description |
|---|---|---|
| `model` | string | gallery entry name (e.g. `speechbrain-ecapa-tdnn`) |
| `audio1`, `audio2` | string | URL, base64, or data-URI of an audio file |
| `threshold` | float, optional | cosine-distance cutoff; default 0.25 for ECAPA-TDNN |
| `anti_spoofing` | bool, optional | reserved — unused in the current release |
Returns `verified`, `distance`, `threshold`, `confidence`, `model`,
and `processing_time_ms`.
### `POST /v1/voice/analyze`
Returns demographic attributes (age, gender, emotion) inferred from
speech:
| field | type | description |
|---|---|---|
| `model` | string | gallery entry |
| `audio` | string | URL / base64 / data-URI |
| `actions` | string[] | subset of `["age","gender","emotion"]`; empty = all supported |
Emotion is inferred from the SUPERB emotion-recognition checkpoint
(`superb/wav2vec2-base-superb-er`, Apache 2.0) — 4-way categorical
neutral / happy / angry / sad. The model auto-downloads on the first
analyze call.
Age and gender are **opt-in**: no standard-transformers checkpoint
with a clean classifier head is shipped as the default. The
high-accuracy Audeering age/gender model uses a custom multi-task
head that `AutoModelForAudioClassification` doesn't load safely
(the age weights are silently dropped and the classifier is
re-initialised with random values). To enable age/gender, set
`age_gender_model:<repo>` in the model YAML's `options:` pointing at
a checkpoint with a vanilla `Wav2Vec2ForSequenceClassification`
head. Override the emotion default similarly via `emotion_model:`.
Set either to an empty string to disable that head.
If a head fails to load (offline, disk full, `transformers`
missing), the engine degrades gracefully and still returns the
attributes it could compute. When nothing can be computed, the backend
returns `501 Not Implemented`.
Analyze is supported by both `speechbrain-ecapa-tdnn` and
`wespeaker-resnet34` — the speaker recognizer and the analysis head
are independent.
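A client can restrict the request to the heads it needs and treat the 501 case as "no analysis available". The following is a sketch under those assumptions; `build_analyze_payload` and `analyze` are illustrative names, and the response is returned as-is because its exact shape is not described here.

```python
import json
import urllib.error
import urllib.request

SUPPORTED_ACTIONS = ("age", "gender", "emotion")


def build_analyze_payload(model, audio, actions=()):
    """Build the body for POST /v1/voice/analyze.

    An empty `actions` sequence asks for every attribute the backend supports.
    """
    unknown = [a for a in actions if a not in SUPPORTED_ACTIONS]
    if unknown:
        raise ValueError(f"unsupported actions: {unknown}")
    return {"model": model, "audio": audio, "actions": list(actions)}


def analyze(model, audio, actions=(), base_url="http://localhost:8080"):
    """Return the decoded response, or None when no head could run (HTTP 501)."""
    payload = build_analyze_payload(model, audio, actions)
    req = urllib.request.Request(
        f"{base_url}/v1/voice/analyze",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 501:
            return None  # nothing could be computed by the backend
        raise
```

Validating `actions` on the client side is a convenience only; the server remains the authority on which heads are actually loaded.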
### `POST /v1/voice/register` (1:N enrollment)
| field | type | description |
|---|---|---|
| `model` | string | voice recognition model |
| `audio` | string | speaker audio to enroll |
| `name` | string | human-readable label |
| `labels` | map[string]string, optional | arbitrary metadata |
| `store` | string, optional | vector store model; defaults to local-store |
Returns `{id, name, registered_at}`. The `id` is an opaque UUID used
by `/v1/voice/identify` and `/v1/voice/forget`.
### `POST /v1/voice/identify` (1:N recognition)
| field | type | description |
|---|---|---|
| `model` | string | voice recognition model |
| `audio` | string | probe audio |
| `top_k` | int, optional | max matches to return; default 5 |
| `threshold` | float, optional | cosine-distance cutoff; default 0.25 |
| `store` | string, optional | vector store model |
Returns a list of matches sorted by ascending distance, each with
`id`, `name`, `labels`, `distance`, `confidence`, and `match`
(`distance ≤ threshold`).
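The ranking semantics described above (ascending cosine distance, `top_k` cap, `match` when `distance ≤ threshold`) can be modelled in a few lines. This is a conceptual sketch only, not LocalAI's registry code; the `enrolled` mapping and the function names are assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def identify(probe, enrolled, top_k=5, threshold=0.25):
    """Rank enrolled speakers by distance to the probe embedding.

    `enrolled` maps a speaker id to a (name, embedding) tuple.
    """
    ranked = sorted(
        (
            {"id": sid, "name": name, "distance": cosine_distance(probe, emb)}
            for sid, (name, emb) in enrolled.items()
        ),
        key=lambda m: m["distance"],
    )[:top_k]
    for m in ranked:
        # A candidate counts as a match when it is within the cutoff.
        m["match"] = m["distance"] <= threshold
    return ranked
```

Note that `top_k` caps the list before the threshold is applied, so the response can contain near neighbours with `match` set to false.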
### `POST /v1/voice/forget`
| field | type | description |
|---|---|---|
| `id` | string | ID returned by `/v1/voice/register` |
Returns `204 No Content` on success, `404 Not Found` if the ID is
unknown.
### `POST /v1/voice/embed`
Returns the L2-normalized speaker embedding vector.
| field | type | description |
|---|---|---|
| `model` | string | voice model |
| `audio` | string | URL / base64 / data-URI |
Returns `{embedding: float[], dim: int, model: string}`. Dimension
depends on the recognizer: 192 for ECAPA-TDNN, 256 for WeSpeaker
ResNet34.
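Because the returned vectors are L2-normalized, the cosine distance between two embeddings reduces to one minus their dot product. The sketch below checks a response dict of the shape described above and compares two vectors; the helper names are illustrative.

```python
import math


def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def check_embedding(resp):
    """Validate an /v1/voice/embed response dict and return its vector."""
    vec = resp["embedding"]
    if len(vec) != resp["dim"]:
        raise ValueError("dim does not match embedding length")
    if not math.isclose(l2_norm(vec), 1.0, rel_tol=1e-3):
        raise ValueError("embedding is not L2-normalized")
    return vec


def cosine_distance_normalized(a, b):
    """Cosine distance for unit-length vectors: 1 - dot(a, b)."""
    if len(a) != len(b):
        # e.g. a 192-d ECAPA-TDNN vector compared with a 256-d WeSpeaker one
        raise ValueError("dimension mismatch: embeddings from different recognizers")
    return 1.0 - sum(x * y for x, y in zip(a, b))
```

Embeddings from different recognizers live in different spaces, so the dimension check is a cheap guard against mixing them by accident.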
> **Note:** the OpenAI-compatible `/v1/embeddings` endpoint is
> intentionally text-only — it does nothing useful with audio input.
> Use `/v1/voice/embed` for audio.
## Audio input
Audio is materialised by the HTTP layer to a temporary WAV file
before the gRPC call. All audio fields accept:
- `http://` / `https://` URLs (downloaded server-side, subject to
`ValidateExternalURL` safety checks).
- Raw base64 (no prefix).
- Data URIs (`data:audio/wav;base64,...`).
The backend itself always receives a filesystem path — the same
convention the Whisper / Voxtral transcription backends use.
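For local files, the base64 and data-URI forms can be produced with the standard library. The sketch below generates a short silent WAV in memory purely so the example is self-contained; the helper names are illustrative.

```python
import base64
import io
import wave


def to_raw_base64(audio_bytes):
    """Raw base64 string, with no prefix."""
    return base64.b64encode(audio_bytes).decode("ascii")


def to_data_uri(audio_bytes, mime="audio/wav"):
    """Data URI of the form data:audio/wav;base64,..."""
    return f"data:{mime};base64,{to_raw_base64(audio_bytes)}"


def silent_wav(seconds=0.1, rate=16000):
    """Generate a tiny 16 kHz mono WAV in memory, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit PCM
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


clip = silent_wav()
payload = {"model": "speechbrain-ecapa-tdnn", "audio": to_data_uri(clip)}
```

For a real recording, replace `silent_wav()` with the file contents, for example `pathlib.Path("alice.wav").read_bytes()`.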
## Threshold reference
| Recognizer | Cosine-distance threshold |
|---|---|
| ECAPA-TDNN (SpeechBrain, VoxCeleb) | ~0.25 |
| WeSpeaker ResNet34 | ~0.30 |
| 3D-Speaker ERes2Net | ~0.28 |
Pass `threshold` explicitly when switching recognizers — the per-model
default only applies when omitted.
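One way to follow this advice is to keep the table in client code and attach the cutoff to every request. The following is a sketch: the values are copied from the table above, the keys are the two gallery entry names from this page, and `with_threshold` is a hypothetical helper.

```python
# Values copied from the threshold table; keys are gallery entry names.
DEFAULT_THRESHOLDS = {
    "speechbrain-ecapa-tdnn": 0.25,
    "wespeaker-resnet34": 0.30,
}


def with_threshold(payload, thresholds=DEFAULT_THRESHOLDS):
    """Return a copy of `payload` with an explicit threshold for its model.

    An existing threshold is kept, and unknown models are left untouched
    so that the server-side default applies.
    """
    out = dict(payload)
    if "threshold" not in out and out.get("model") in thresholds:
        out["threshold"] = thresholds[out["model"]]
    return out
```

The table values are approximate starting points, so a deployment with its own evaluation data may prefer to override them per model.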
## Related features
- [Face Recognition](/features/face-recognition/) — the image analog;
the two share a registry design.
- [Audio to Text](/features/audio-to-text/) — transcription (Whisper,
Voxtral, faster-whisper). Runs in addition to, not instead of,
voice recognition.
- [Stores](/features/stores/) — the generic vector store powering
both the face and voice 1:N recognition pipelines.
- [Embeddings](/features/embeddings/) — text-only OpenAI-compatible
embedding endpoint; for audio embeddings use `/v1/voice/embed`.