LocalAI/docs/content/features/voice-recognition.md at eebf08ff1dbbad17cd644777ac1dd832af6528fd

+++
disableToc = false
title = "Voice Recognition"
weight = 15
url = "/features/voice-recognition/"
+++

LocalAI supports voice (speaker) recognition through the `speaker-recognition` backend: speaker verification (1:1), speaker identification (1:N) against a built-in vector store, speaker embedding, and demographic analysis (age / gender / emotion from voice).

It is the audio analog of Face Recognition and follows the same two-engine pattern under one image.

## Engines

| Gallery entry | Model | Size | License |
|---|---|---|---|
| `speechbrain-ecapa-tdnn` | ECAPA-TDNN on VoxCeleb (SpeechBrain) | ~17 MB | Apache 2.0 (commercial-safe) |
| `wespeaker-resnet34` | WeSpeaker ResNet34 ONNX | ~26 MB | Apache 2.0 (commercial-safe) |

Both entries are licensed under Apache 2.0 and are commercial-safe. SpeechBrain is the default: it is a lightweight pure-PyTorch checkpoint that auto-downloads on first use. The `wespeaker-resnet34` entry wires the direct-ONNX path for CPU-only deployments that don't want the torch runtime.

## Quickstart

Install the default backend and model:

```bash
local-ai models install speechbrain-ecapa-tdnn
```

Verify that two audio clips were spoken by the same person:

```bash
curl -sX POST http://localhost:8080/v1/voice/verify \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "audio1": "https://example.com/alice_1.wav",
    "audio2": "https://example.com/alice_2.wav"
  }'
```

Response:

```json
{
  "verified": true,
  "distance": 0.18,
  "threshold": 0.25,
  "confidence": 28.0,
  "model": "speechbrain-ecapa-tdnn",
  "processing_time_ms": 340.0
}
```
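The response carries both the measured `distance` and the `threshold` it was compared against, so a client can re-apply a stricter or looser cutoff without calling the server again. The sketch below assumes that verify uses the same rule documented for identify (`distance ≤ threshold`); the helper name `is_same_speaker` is hypothetical.

```python
import json
from typing import Optional

# Sample response copied from the quickstart above.
SAMPLE = json.loads("""
{
  "verified": true,
  "distance": 0.18,
  "threshold": 0.25,
  "confidence": 28.0,
  "model": "speechbrain-ecapa-tdnn",
  "processing_time_ms": 340.0
}
""")


def is_same_speaker(response: dict, threshold: Optional[float] = None) -> bool:
    """Re-apply the cosine-distance cutoff on the client side.

    Assumption: a lower distance means more similar voices, and a pair
    is accepted when distance <= threshold.
    """
    cutoff = response["threshold"] if threshold is None else threshold
    return response["distance"] <= cutoff


print(is_same_speaker(SAMPLE))        # uses the threshold reported by the server
print(is_same_speaker(SAMPLE, 0.15))  # stricter, caller-chosen cutoff
```

With the sample values, the server-reported cutoff of 0.25 accepts the pair, while a stricter cutoff of 0.15 rejects it.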

## 1:N identification workflow (register → identify → forget)

The flow is the same as for face recognition and uses the same in-memory vector store under the hood.

Register a known speaker:

```bash
curl -sX POST http://localhost:8080/v1/voice/register \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "name": "Alice",
    "audio": "https://example.com/alice.wav"
  }'
# → {"id": "b2f...", "name": "Alice", "registered_at": "2026-04-22T..."}
```

Identify an unknown probe:

```bash
curl -sX POST http://localhost:8080/v1/voice/identify \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "audio": "https://example.com/unknown.wav",
    "top_k": 5
  }'
# → {"matches": [{"id":"b2f...","name":"Alice","distance":0.19,"match":true,...}]}
```

Remove a speaker by ID:

```bash
curl -sX POST http://localhost:8080/v1/voice/forget \
  -H "Content-Type: application/json" \
  -d '{"id": "b2f..."}'
# → 204 No Content
```
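The same three calls can be driven from a script. The following is a minimal sketch using only the Python standard library; `BASE_URL`, `build_request`, `call`, and `enroll_identify_forget` are hypothetical names, and the base address is simply the one used in the curl examples. Error handling is left out: an unknown ID on forget would surface as `urllib.error.HTTPError`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: same address as the curl examples


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for a /v1/voice/* endpoint without sending it."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path: str, payload: dict):
    """Send the request and return (status, parsed JSON or None for an empty body)."""
    with urllib.request.urlopen(build_request(path, payload)) as resp:
        body = resp.read()
        return resp.status, (json.loads(body) if body else None)


def enroll_identify_forget(model: str, name: str, enroll_audio: str, probe_audio: str):
    """Run register -> identify -> forget and return the identify matches."""
    _, registered = call(
        "/v1/voice/register", {"model": model, "name": name, "audio": enroll_audio}
    )
    _, identified = call(
        "/v1/voice/identify", {"model": model, "audio": probe_audio, "top_k": 5}
    )
    # Clean up so the in-memory store does not keep the test speaker.
    call("/v1/voice/forget", {"id": registered["id"]})
    return identified["matches"]
```

Keeping `build_request` separate from `call` makes it possible to inspect the exact URL, headers, and body before anything is sent to a running LocalAI instance.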

{{% notice warning %}}
Storage caveat. The default vector store is in-memory. All registered speakers are lost when LocalAI restarts. Persistent storage (pgvector) is a tracked future enhancement shared with face recognition; the voice-recognition HTTP API is designed to swap the backing store without changing the wire format.
{{% /notice %}}

## API reference

### `POST /v1/voice/verify` (1:1)

| Field | Type | Description |
|---|---|---|
| `model` | string | gallery entry name (e.g. `speechbrain-ecapa-tdnn`) |
| `audio1`, `audio2` | string | URL, base64, or data-URI of an audio file |
| `threshold` | float, optional | cosine-distance cutoff; default 0.25 for ECAPA-TDNN |
| `anti_spoofing` | bool, optional | reserved; unused in the current release |

Returns `verified`, `distance`, `threshold`, `confidence`, `model`, and `processing_time_ms`.

### `POST /v1/voice/analyze`

Returns demographic attributes (age, gender, emotion) inferred from speech:

| Field | Type | Description |
|---|---|---|
| `model` | string | gallery entry |
| `audio` | string | URL / base64 / data-URI |
| `actions` | string[] | subset of `["age","gender","emotion"]`; empty = all supported |

Emotion is inferred by the SUPERB emotion-recognition checkpoint (`superb/wav2vec2-base-superb-er`, Apache 2.0), a 4-way categorical classifier: neutral / happy / angry / sad. The model auto-downloads on the first analyze call.

Age and gender are opt-in: no standard-transformers checkpoint with a clean classifier head is shipped as the default. The high-accuracy Audeering age/gender model uses a custom multi-task head that `AutoModelForAudioClassification` doesn't load safely (the age weights are silently dropped and the classifier is re-initialised with random values). To enable age/gender, set `age_gender_model:<repo>` in the `options:` of the model YAML, pointing at a checkpoint with a vanilla `Wav2Vec2ForSequenceClassification` head. Override the emotion default the same way via `emotion_model:`. Set either to an empty string to disable that head.

If a head fails to load (offline, disk full, transformers missing), the engine degrades gracefully: it still returns the attributes it could compute. When nothing can be computed, the backend returns `501 Unimplemented`.

Analyze is supported by both `speechbrain-ecapa-tdnn` and `wespeaker-resnet34`, because the speaker recognizer and the analysis head are independent.
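Because `actions` must be a subset of the three documented values, it can help to validate the list before sending the request. The sketch below only builds the request body; `build_analyze_payload` is a hypothetical helper, and the response shape is not assumed here since it is not specified above.

```python
SUPPORTED_ACTIONS = ("age", "gender", "emotion")


def build_analyze_payload(model: str, audio: str, actions=()) -> dict:
    """Build the JSON body for POST /v1/voice/analyze.

    An empty `actions` list is passed through unchanged, which the
    endpoint documents as "all supported".
    """
    unknown = [a for a in actions if a not in SUPPORTED_ACTIONS]
    if unknown:
        raise ValueError(f"unsupported actions: {unknown}")
    return {"model": model, "audio": audio, "actions": list(actions)}


payload = build_analyze_payload(
    "speechbrain-ecapa-tdnn",
    "https://example.com/alice.wav",
    actions=["emotion"],  # only the head that is enabled by default
)
print(payload)
```

Requesting only `emotion` avoids depending on the opt-in age/gender head when no `age_gender_model` has been configured.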

### `POST /v1/voice/register` (1:N enrollment)

| Field | Type | Description |
|---|---|---|
| `model` | string | voice recognition model |
| `audio` | string | speaker audio to enroll |
| `name` | string | human-readable label |
| `labels` | map[string]string, optional | arbitrary metadata |
| `store` | string, optional | vector store model; defaults to `local-store` |

Returns `{id, name, registered_at}`. The `id` is an opaque UUID used by `/v1/voice/identify` and `/v1/voice/forget`.

### `POST /v1/voice/identify` (1:N recognition)

| Field | Type | Description |
|---|---|---|
| `model` | string | voice recognition model |
| `audio` | string | probe audio |
| `top_k` | int, optional | max matches to return; default 5 |
| `threshold` | float, optional | cosine-distance cutoff; default 0.25 |
| `store` | string, optional | vector store model |

Returns a list of matches sorted by ascending distance, each with `id`, `name`, `labels`, `distance`, `confidence`, and `match` (distance ≤ threshold).
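A client typically wants a single answer rather than the full ranked list. The sketch below picks the closest accepted candidate from a `matches` array; `best_match` is a hypothetical helper and the sample entries are made up for illustration, using only fields named above.

```python
from typing import Optional


def best_match(matches: list) -> Optional[dict]:
    """Return the accepted match with the smallest distance, or None.

    Does not rely on the server-side ordering, so it also works on a
    list that has been filtered or merged on the client.
    """
    accepted = [m for m in matches if m["match"]]
    if not accepted:
        return None
    return min(accepted, key=lambda m: m["distance"])


# Hypothetical response content, for illustration only.
sample_matches = [
    {"id": "id-1", "name": "Alice", "distance": 0.19, "match": True},
    {"id": "id-2", "name": "Bob", "distance": 0.41, "match": False},
]

winner = best_match(sample_matches)
print(winner["name"] if winner else "unknown speaker")
```

Returning `None` when no candidate has `match` set keeps the "unknown speaker" case explicit instead of silently falling back to the nearest enrolled voice.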

### `POST /v1/voice/forget`

| Field | Type | Description |
|---|---|---|
| `id` | string | ID returned by `/v1/voice/register` |

Returns `204 No Content` on success, `404 Not Found` if the ID is unknown.

### `POST /v1/voice/embed`

Returns the L2-normalized speaker embedding vector.

| Field | Type | Description |
|---|---|---|
| `model` | string | voice model |
| `audio` | string | URL / base64 / data-URI |

Returns `{embedding: float[], dim: int, model: string}`. Dimension depends on the recognizer: 192 for ECAPA-TDNN, 256 for WeSpeaker ResNet34.

Note: the OpenAI-compatible `/v1/embeddings` endpoint is intentionally text-only and does nothing useful with audio input. Use `/v1/voice/embed` for audio.
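Embeddings fetched from `/v1/voice/embed` can be compared on the client. For unit-length vectors, cosine similarity is just the dot product, so the cosine distance is one minus that value. The sketch below assumes the server's `distance` follows this common definition (1 minus cosine similarity); the function names are hypothetical, and the vectors are re-normalized defensively.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two embeddings of equal dimension."""
    if len(a) != len(b):
        # e.g. a 192-dim ECAPA-TDNN vector against a 256-dim WeSpeaker one
        raise ValueError("embeddings must have the same dimension")
    ua, ub = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(ua, ub))


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

The dimension check matters in practice: embeddings from different recognizers are not comparable, so vectors should only be compared when they come from the same `model`.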

## Audio input

Audio is materialised by the HTTP layer to a temporary WAV file before the gRPC call. All audio fields accept:

- `http://` / `https://` URLs (downloaded server-side, subject to `ValidateExternalURL` safety checks).
- Raw base64 (no prefix).
- Data URIs (`data:audio/wav;base64,...`).

The backend itself always receives a filesystem path, the same convention the Whisper / Voxtral transcription backends use.
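When the audio lives on the client machine rather than behind a URL, it has to be sent inline in one of the two base64 forms listed above. A minimal sketch of both encodings follows; the helper names are hypothetical and the four-byte input is just the `RIFF` magic that starts a WAV file, not a playable clip.

```python
import base64


def to_base64(audio_bytes: bytes) -> str:
    """Encode raw audio bytes as plain base64 (no prefix)."""
    return base64.b64encode(audio_bytes).decode("ascii")


def to_data_uri(audio_bytes: bytes, mime: str = "audio/wav") -> str:
    """Encode raw audio bytes as a data URI."""
    return f"data:{mime};base64,{to_base64(audio_bytes)}"


header = b"RIFF"  # placeholder bytes; real code would read a WAV file from disk
print(to_base64(header))
print(to_data_uri(header))
```

Either string can be used as the value of an `audio`, `audio1`, or `audio2` field in the request bodies.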

## Threshold reference

| Recognizer | Cosine-distance threshold |
|---|---|
| ECAPA-TDNN (SpeechBrain, VoxCeleb) | ~0.25 |
| WeSpeaker ResNet34 | ~0.30 |
| 3D-Speaker ERes2Net | ~0.28 |

Pass `threshold` explicitly when switching recognizers; the per-model default only applies when it is omitted.
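One way to follow this advice is to keep the cutoffs in client configuration and always send them. The sketch below maps the two gallery entries from the Engines table to the approximate values in the table above; `CLIENT_THRESHOLDS` and `verify_payload` are hypothetical names, and the numbers are starting points rather than tuned values.

```python
from typing import Optional

# Approximate cutoffs taken from the threshold table, keyed by gallery entry.
CLIENT_THRESHOLDS = {
    "speechbrain-ecapa-tdnn": 0.25,
    "wespeaker-resnet34": 0.30,
}


def verify_payload(model: str, audio1: str, audio2: str,
                   threshold: Optional[float] = None) -> dict:
    """Build a /v1/voice/verify body that always states its threshold when known."""
    payload = {"model": model, "audio1": audio1, "audio2": audio2}
    chosen = threshold if threshold is not None else CLIENT_THRESHOLDS.get(model)
    if chosen is not None:
        payload["threshold"] = chosen
    return payload


print(verify_payload("wespeaker-resnet34", "a.wav", "b.wav"))
```

For a model that is not in the mapping and has no explicit value, the field is left out so the server-side default applies.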

## Related features

- Face Recognition: the image analog; the two share a registry design.
- Audio to Text: transcription (Whisper, Voxtral, faster-whisper). Runs in addition to, not instead of, voice recognition.
- Stores: the generic vector store powering both the face and voice 1:N recognition pipelines.
- Embeddings: text-only OpenAI-compatible embedding endpoint; for audio embeddings use `/v1/voice/embed`.
