feat: voice recognition (#9500)

* feat(voice-recognition): add /v1/voice/{verify,analyze,embed} + speaker-recognition backend

Audio analog to face recognition. Adds three gRPC RPCs
(VoiceVerify / VoiceAnalyze / VoiceEmbed), their Go service and HTTP
layers, a new FLAG_SPEAKER_RECOGNITION capability flag, and a Python
backend scaffold under backend/python/speaker-recognition/ wrapping
SpeechBrain ECAPA-TDNN with a parallel OnnxDirectEngine for
WeSpeaker / 3D-Speaker ONNX exports.
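For orientation, a rough Python sketch of driving the new RPCs through the generated stubs (the address and audio path are illustrative, not part of this change):

import grpc
import backend_pb2
import backend_pb2_grpc

# Connect to a running speaker-recognition backend (address is illustrative).
channel = grpc.insecure_channel("localhost:50051")
stub = backend_pb2_grpc.BackendStub(channel)

# Load the default SpeechBrain ECAPA-TDNN recognizer.
stub.LoadModel(backend_pb2.ModelOptions(
    Model="speechbrain/spkrec-ecapa-voxceleb",
    Options=["engine:speechbrain"],
))

# Audio fields are filesystem paths, same convention as TranscriptRequest.dst.
resp = stub.VoiceEmbed(backend_pb2.VoiceEmbedRequest(audio="/tmp/example1.wav"))
print(len(resp.embedding), resp.model)  # 192 floats for ECAPA-TDNN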

The kokoros Rust backend gets matching unimplemented trait stubs —
tonic's async_trait has no defaults, so adding an RPC without Rust
stubs breaks the build (same regression fixed by eb01c772 for face).

Swagger, /api/instructions, and the auth RouteFeatureRegistry /
APIFeatures list are updated so the endpoints surface everywhere a
client or admin UI looks.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): add 1:N identify + register/forget endpoints

Mirrors the face-recognition register/identify/forget surface. New
package core/services/voicerecognition/ carries a Registry interface
and a local-store-backed implementation (same in-memory vector-store
plumbing facerecognition uses, separate instance so the embedding
spaces stay isolated).

Handlers under /v1/voice/{register,identify,forget} reuse
backend.VoiceEmbed to compute the probe vector, then delegate the
nearest-neighbour search to the registry. Default cosine-distance
threshold is tuned for ECAPA-TDNN on VoxCeleb (0.25, EER ~1.9%).
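The decision rule itself is tiny; a numpy-only sketch (the two vectors stand in for backend.VoiceEmbed outputs):

import numpy as np

def cosine_distance(a, b):
    # distance = 1 - cosine_similarity, the same metric the backend reports
    a = np.asarray(a, dtype=np.float32).reshape(-1)
    b = np.asarray(b, dtype=np.float32).reshape(-1)
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    if denom == 0.0:
        return 1.0  # degenerate embedding: treat as maximally distant
    return float(1.0 - np.dot(a, b) / denom)

THRESHOLD = 0.25  # ECAPA-TDNN on VoxCeleb default (EER ~1.9%)
probe, enrolled = np.random.rand(192), np.random.rand(192)  # placeholder vectors
verified = cosine_distance(probe, enrolled) <= THRESHOLD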

As with the face registry, the current backing is in-memory only — a
pgvector implementation is a future constructor-level swap.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): gallery, docs, CI and e2e coverage

- backend/index.yaml: speaker-recognition backend entry + CPU and
  CUDA-12 image variants (plus matching development variants).
- gallery/index.yaml: speechbrain-ecapa-tdnn (default) and
  wespeaker-resnet34 model entries. The WeSpeaker SHA-256 is a
  deliberate placeholder — the HF URI must be curl'd and its hash
  filled in before the entry installs.
- docs/content/features/voice-recognition.md: API reference + quickstart,
  mirrors the face-recognition docs.
- React UI: CAP_SPEAKER_RECOGNITION flag export (consumers follow face's
  precedent — no dedicated tab yet).
- tests/e2e-backends: voice_embed / voice_verify / voice_analyze specs.
  Helper resolveFaceFixture is reused as-is — the only thing face/voice
  share is "download a file into workDir", so no need for a new helper.
- Makefile: docker-build-speaker-recognition + test-extra-backend-
  speaker-recognition-{ecapa,all} targets. Audio fixtures default to
  VCTK p225/p226 samples from HuggingFace.
- CI: test-extra.yml grows a tests-speaker-recognition-grpc job
  mirroring insightface. backend.yml matrix gains CPU + CUDA-12 image
  build entries — scripts/changed-backends.js auto-picks these up.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): wire a working /v1/voice/analyze head

Adds AnalysisHead: a lazy-loading age / gender / emotion inference
wrapper that plugs into both SpeechBrainEngine and OnnxDirectEngine.

Defaults to two open-licence HuggingFace checkpoints:
  - audeering/wav2vec2-large-robust-24-ft-age-gender (Apache 2.0) —
    age regression + 3-way gender (female / male / child).
  - superb/wav2vec2-base-superb-er (Apache 2.0) — 4-way emotion.

Both are optional and degrade gracefully when transformers or the
model can't be loaded; if neither head loads, the engine raises
NotImplementedError and the gRPC layer returns UNIMPLEMENTED, which
surfaces as a 501 instead of a generic 500.
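A condensed sketch of that degradation path (the shape of it, not the full AnalysisHead):

from typing import Any

def load_emotion_head(model_id="superb/wav2vec2-base-superb-er"):
    # Lazy, best-effort load: record the failure instead of raising.
    try:
        from transformers import pipeline
        return pipeline("audio-classification", model=model_id), None
    except Exception as exc:  # transformers missing, download failed, ...
        return None, f"{type(exc).__name__}: {exc}"

def analyze(audio_path: str) -> dict[str, Any]:
    head, err = load_emotion_head()
    attrs: dict[str, Any] = {}
    if head is not None:
        scores = head(audio_path, top_k=8)
        attrs["emotion"] = {r["label"].lower(): float(r["score"]) for r in scores}
    if not attrs:
        # Nothing loaded: the gRPC layer turns this into UNIMPLEMENTED / 501.
        raise NotImplementedError("no analyze head could be loaded")
    return attrs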

Emotion classes pass through from the model (neutral/happy/angry/sad
on the default checkpoint); the e2e test now accepts any non-empty
dominant gender so custom age_gender_model overrides don't fail it.

Adds transformers to the backend's CPU and CUDA-12 requirements.

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): pin real WeSpeaker ResNet34 ONNX SHA-256

Replaces the placeholder hash in gallery/index.yaml with the actual
SHA-256 (7bb2f06e…) of the upstream
Wespeaker/wespeaker-voxceleb-resnet34-LM ONNX at ~25MB. `local-ai
models install wespeaker-resnet34` now succeeds.
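For reference, a sketch of how such a hash is produced for a gallery files: entry (the filename is a placeholder; download the upstream ONNX first):

import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

# Placeholder path: point this at the downloaded WeSpeaker ResNet34 export.
print(sha256_of("wespeaker-resnet34.onnx"))  # should start with 7bb2f06e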

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): soundfile loader + honest analyze default

Two issues surfaced on first end-to-end smoke with the actual backend
image:

1. torchaudio.load in torchaudio 2.8+ requires the torchcodec package
   for audio decoding. Switch SpeechBrainEngine._load_waveform to the
   already-present soundfile (listed in requirements.txt) plus a numpy
   linear resample to 16kHz; a standalone sketch of the loader follows
   this list. Drops a heavy ffmpeg-linked dep and the codepath we never
   exercise (torchaudio's ffmpeg backend).

2. The AnalysisHead was defaulting to audeering/wav2vec2-large-robust-
   24-ft-age-gender, but AutoModelForAudioClassification silently
   mangles that checkpoint — it reports the age head weights as
   UNEXPECTED and re-initialises the classifier head with random
   values, so the "gender" output is noise and there is no age output
   at all. Make age/gender opt-in instead (empty default; users wire
   a cleanly-loadable Wav2Vec2ForSequenceClassification checkpoint via
   the age_gender_model: option). Emotion keeps its working Superb default.
   Also broaden _infer_age_gender's tensor-shape handling and catch
   runtime exceptions so a dodgy age/gender head never takes down the
   whole analyze call.
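The loader swap from item 1 as a standalone sketch, roughly what _load_waveform now does (soundfile decode, mono downmix, numpy linear resample to 16kHz):

import numpy as np
import soundfile as sf

def load_waveform_16k(path: str) -> np.ndarray:
    audio, sr = sf.read(path, always_2d=False)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)  # downmix to mono
    audio = np.asarray(audio, dtype=np.float32)
    if sr != 16000:
        # Cheap linear resample: fine for 44.1/48kHz -> 16kHz speech.
        n = int(round(len(audio) * 16000 / sr))
        audio = np.interp(
            np.linspace(0, len(audio), n, endpoint=False),
            np.arange(len(audio)),
            audio,
        ).astype(np.float32)
    return audio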

Docs and README updated to match the new policy.

Verified with the branch-scoped gallery on localhost:
- voice/embed    → 192-d ECAPA-TDNN vector
- voice/verify   → same-clip dist≈6e-08 verified=true; cross-speaker
                   dist 0.76–0.99 verified=false (as expected)
- voice/register/identify/forget → round-trip works, 404 on unknown id
- voice/analyze  → emotion populated, age/gender omitted (opt-in)

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): real CI audio fixtures + fixture-agnostic verify spec

Two issues surfaced after CI actually ran the speaker-recognition e2e
target (I'd curl-tested against a running server but hadn't run the
make target locally):

1. The default BACKEND_TEST_VOICE_AUDIO_* URLs pointed at
   huggingface.co/datasets/CSTR-Edinburgh/vctk paths that return 404
   (the dataset is gated). Swap them for the speechbrain test samples
   served from github.com/speechbrain/speechbrain/raw/develop/ —
   public, no auth, correct 16kHz mono format.

2. The VoiceVerify spec required d(file1,file2) < 0.4, assuming
   file1/file2 were same-speaker. The speechbrain samples are three
   different speakers (example1/2/5), and there is no easy un-gated
   source of true same-speaker audio pairs (VoxCeleb/VCTK/LibriSpeech
   are all license- or size-gated for CI use). Replace the ceiling
   check with a relative-ordering assertion, d(pair) > d(same-clip)
   for both file2 and file3 (sketched after this list): that is enough
   to prove the embeddings encode speaker info, and it works with any
   three non-identical clips. Actual speaker ordering d(1,2) vs d(1,3)
   is logged but not asserted.
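The assertion from item 2, sketched in Python (the real spec lives in the Go e2e suite; stub and cosine_distance come from the earlier sketches, and the file names are the speechbrain fixtures):

def embed(path: str) -> list[float]:
    # One VoiceEmbed round-trip against the running backend.
    return list(stub.VoiceEmbed(backend_pb2.VoiceEmbedRequest(audio=path)).embedding)

d_same  = cosine_distance(embed("example1.wav"), embed("example1.wav"))   # ~0
d_pair2 = cosine_distance(embed("example1.wav"), embed("example2.flac"))
d_pair3 = cosine_distance(embed("example1.wav"), embed("example5.wav"))

# Relative ordering is all the spec asserts: any three non-identical
# clips must sit strictly further apart than a clip compared to itself.
assert d_pair2 > d_same and d_pair3 > d_same
print(f"d(1,2)={d_pair2:.3f} d(1,3)={d_pair3:.3f}")  # logged, not asserted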

Local run: 4/4 voice specs pass (Health, LoadModel, VoiceEmbed,
VoiceVerify) on the built backend image. 12 non-voice specs skipped
as expected.

Assisted-by: Claude:claude-opus-4-7

* fix(ci): checkout with submodules in the reusable backend_build workflow

The kokoros Rust backend build fails with

    failed to read .../sources/Kokoros/kokoros/Cargo.toml: No such file

because the reusable backend_build.yml workflow's actions/checkout
step was missing `submodules: true`. Dockerfile.rust does `COPY .
/LocalAI`, and without the submodule files the subsequent `cargo
build` can't find the vendored Kokoros crate.

The bug pre-dates this PR — scripts/changed-backends.js only triggers
the kokoros image job when something under backend/rust/kokoros or
the shared proto changes, so master had been coasting past it. The
voice-recognition proto addition re-broke it.

Other checkouts in backend.yml (llama-cpp-darwin) and test-extra.yml
(insightface, kokoros, speaker-recognition) already pass
`submodules: true`; this brings the shared backend image builder in
line.

Assisted-by: Claude:claude-opus-4-7
Ettore Di Giacinto
2026-04-23 12:07:14 +02:00
committed by GitHub
parent 1c59165d63
commit 181ebb6df4
53 changed files with 3747 additions and 6 deletions

View File

@@ -724,6 +724,19 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-speaker-recognition'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "speaker-recognition"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -2653,6 +2666,20 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
# speaker-recognition (voice/speaker biometrics)
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-speaker-recognition'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "speaker-recognition"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'intel'
cuda-major-version: ""
cuda-minor-version: ""

View File

@@ -108,6 +108,8 @@ jobs:
- name: Checkout
uses: actions/checkout@v6
with:
submodules: true
- name: Release space from worker
if: inputs.runs-on == 'ubuntu-latest'

View File

@@ -39,6 +39,7 @@ jobs:
voxtral: ${{ steps.detect.outputs.voxtral }}
kokoros: ${{ steps.detect.outputs.kokoros }}
insightface: ${{ steps.detect.outputs.insightface }}
speaker-recognition: ${{ steps.detect.outputs.speaker-recognition }}
steps:
- name: Checkout repository
uses: actions/checkout@v6
@@ -778,3 +779,29 @@ jobs:
- name: Build insightface backend image and run both model configurations
run: |
make test-extra-backend-insightface-all
tests-speaker-recognition-grpc:
needs: detect-changes
if: needs.detect-changes.outputs.speaker-recognition == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
make build-essential curl ca-certificates git tar
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.26.0'
- name: Free disk space
run: |
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
df -h
- name: Build speaker-recognition backend image and run the ECAPA-TDNN configuration
run: |
make test-extra-backend-speaker-recognition-all

View File

@@ -1,5 +1,5 @@
# Disable parallel execution for backend builds
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad
GOCMD=go
GOTEST=$(GOCMD) test
@@ -435,6 +435,7 @@ prepare-test-extra: protogen-python
$(MAKE) -C backend/python/trl
$(MAKE) -C backend/python/tinygrad
$(MAKE) -C backend/python/insightface
$(MAKE) -C backend/python/speaker-recognition
$(MAKE) -C backend/rust/kokoros kokoros-grpc
test-extra: prepare-test-extra
@@ -459,6 +460,7 @@ test-extra: prepare-test-extra
$(MAKE) -C backend/python/trl test
$(MAKE) -C backend/python/tinygrad test
$(MAKE) -C backend/python/insightface test
$(MAKE) -C backend/python/speaker-recognition test
$(MAKE) -C backend/rust/kokoros test
##
@@ -713,6 +715,41 @@ test-extra-backend-insightface-all: \
test-extra-backend-insightface-buffalo-sc \
test-extra-backend-insightface-opencv
## speaker-recognition — voice (speaker) biometrics.
##
## Audio fixtures default to the speechbrain test samples served
## straight from their GitHub repo — public, no auth needed, and they
## ship as 16kHz mono WAV/FLAC which is exactly what the engine wants.
## example{1,2,5} are three different speakers; the suite treats
## example1 as the "same-clip twin" probe (verify(clip, clip) must
## return distance≈0) and the other two as cross-speaker ceilings.
## Override with BACKEND_TEST_VOICE_AUDIO_{1,2,3}_FILE for offline runs.
VOICE_AUDIO_1_URL ?= https://github.com/speechbrain/speechbrain/raw/develop/tests/samples/single-mic/example1.wav
VOICE_AUDIO_2_URL ?= https://github.com/speechbrain/speechbrain/raw/develop/tests/samples/single-mic/example2.flac
VOICE_AUDIO_3_URL ?= https://github.com/speechbrain/speechbrain/raw/develop/tests/samples/single-mic/example5.wav
## ECAPA-TDNN via SpeechBrain — default CI configuration. Auto-downloads
## the checkpoint from HuggingFace on first LoadModel (bundled in the
## backend image pip install). 192-d embeddings, cosine-distance based.
## The e2e suite drives LoadModel directly so we don't rely on LocalAI's
## gallery flow here.
test-extra-backend-speaker-recognition-ecapa: docker-build-speaker-recognition
BACKEND_IMAGE=local-ai-backend:speaker-recognition \
BACKEND_TEST_MODEL_NAME=speechbrain/spkrec-ecapa-voxceleb \
BACKEND_TEST_OPTIONS=engine:speechbrain,source:speechbrain/spkrec-ecapa-voxceleb \
BACKEND_TEST_CAPS=health,load,voice_embed,voice_verify \
BACKEND_TEST_VOICE_AUDIO_1_URL=$(VOICE_AUDIO_1_URL) \
BACKEND_TEST_VOICE_AUDIO_2_URL=$(VOICE_AUDIO_2_URL) \
BACKEND_TEST_VOICE_AUDIO_3_URL=$(VOICE_AUDIO_3_URL) \
BACKEND_TEST_VOICE_VERIFY_DISTANCE_CEILING=0.4 \
$(MAKE) test-extra-backend
## Aggregate — today there's only one voice config; the target exists
## so the CI workflow matches the insightface-all naming convention and
## can grow to include WeSpeaker / 3D-Speaker later.
test-extra-backend-speaker-recognition-all: \
test-extra-backend-speaker-recognition-ecapa
## sglang mirrors the vllm setup: HuggingFace model id, same tiny Qwen,
## tool-call extraction via sglang's native qwen parser. CPU builds use
## sglang's upstream pyproject_cpu.toml recipe (see backend/python/sglang/install.sh).
@@ -859,6 +896,7 @@ BACKEND_FASTER_WHISPER = faster-whisper|python|.|false|true
BACKEND_COQUI = coqui|python|.|false|true
BACKEND_RFDETR = rfdetr|python|.|false|true
BACKEND_INSIGHTFACE = insightface|python|.|false|true
BACKEND_SPEAKER_RECOGNITION = speaker-recognition|python|.|false|true
BACKEND_KITTEN_TTS = kitten-tts|python|.|false|true
BACKEND_NEUTTS = neutts|python|.|false|true
BACKEND_KOKORO = kokoro|python|.|false|true
@@ -931,6 +969,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_FASTER_WHISPER)))
$(eval $(call generate-docker-build-target,$(BACKEND_COQUI)))
$(eval $(call generate-docker-build-target,$(BACKEND_RFDETR)))
$(eval $(call generate-docker-build-target,$(BACKEND_INSIGHTFACE)))
$(eval $(call generate-docker-build-target,$(BACKEND_SPEAKER_RECOGNITION)))
$(eval $(call generate-docker-build-target,$(BACKEND_KITTEN_TTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_NEUTTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_KOKORO)))
@@ -965,7 +1004,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
docker-save-%: backend-images
docker save local-ai-backend:$* -o backend-images/$*.tar
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface docker-build-speaker-recognition
########################################################
### Mock Backend for E2E Tests

View File

@@ -26,6 +26,9 @@ service Backend {
rpc Detect(DetectOptions) returns (DetectResponse) {}
rpc FaceVerify(FaceVerifyRequest) returns (FaceVerifyResponse) {}
rpc FaceAnalyze(FaceAnalyzeRequest) returns (FaceAnalyzeResponse) {}
rpc VoiceVerify(VoiceVerifyRequest) returns (VoiceVerifyResponse) {}
rpc VoiceAnalyze(VoiceAnalyzeRequest) returns (VoiceAnalyzeResponse) {}
rpc VoiceEmbed(VoiceEmbedRequest) returns (VoiceEmbedResponse) {}
rpc StoresSet(StoresSetOptions) returns (Result) {}
rpc StoresDelete(StoresDeleteOptions) returns (Result) {}
@@ -528,6 +531,57 @@ message FaceAnalyzeResponse {
repeated FaceAnalysis faces = 1;
}
// --- Voice (speaker) recognition messages ---
//
// Analogous to the Face* messages above, but for speaker biometrics.
// Audio fields accept a filesystem path (same convention as
// TranscriptRequest.dst). The HTTP layer materialises base64 / URL /
// data-URI inputs to a temp file before calling the gRPC backend.
message VoiceVerifyRequest {
string audio1 = 1; // path to first audio clip
string audio2 = 2; // path to second audio clip
float threshold = 3; // cosine-distance threshold; 0 = use backend default
bool anti_spoofing = 4; // reserved for future AASIST bolt-on
}
message VoiceVerifyResponse {
bool verified = 1;
float distance = 2; // 1 - cosine_similarity
float threshold = 3;
float confidence = 4; // 0-100
string model = 5; // e.g. "speechbrain/spkrec-ecapa-voxceleb"
float processing_time_ms = 6;
}
message VoiceAnalyzeRequest {
string audio = 1; // path to audio clip
repeated string actions = 2; // subset of ["age","gender","emotion"]; empty = all-supported
}
message VoiceAnalysis {
float start = 1; // segment start time in seconds (0 if single-utterance)
float end = 2; // segment end time in seconds
float age = 3;
string dominant_gender = 4;
map<string, float> gender = 5;
string dominant_emotion = 6;
map<string, float> emotion = 7;
}
message VoiceAnalyzeResponse {
repeated VoiceAnalysis segments = 1;
}
message VoiceEmbedRequest {
string audio = 1; // path to audio clip
}
message VoiceEmbedResponse {
repeated float embedding = 1;
string model = 2;
}
message ToolFormatMarkers {
string format_type = 1; // "json_native", "tag_with_json", "tag_with_tagged"

View File

@@ -3773,3 +3773,64 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-insightface"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-insightface
# speaker-recognition (voice/speaker biometrics) — Apache-2.0 stack
- &speakerrecognition
name: "speaker-recognition"
alias: "speaker-recognition"
# SpeechBrain is Apache-2.0. WeSpeaker / 3D-Speaker ONNX exports are
# Apache-2.0. The backend itself ships only Python deps — all model
# weights flow through LocalAI's gallery download mechanism (or
# SpeechBrain's built-in HF auto-download at first LoadModel).
license: apache-2.0
description: |
Speaker (voice) recognition backend — the audio analog to
insightface. Wraps SpeechBrain ECAPA-TDNN (default engine, 192-d
embeddings, ~1.9% EER on VoxCeleb) plus an OnnxDirectEngine for
pre-exported WeSpeaker / 3D-Speaker ONNX models.
Exposes speaker verification (/v1/voice/verify), speaker embedding
(/v1/voice/embed), speaker analysis (/v1/voice/analyze), and 1:N
speaker identification (/v1/voice/{register,identify,forget}).
Registrations use LocalAI's built-in vector store — same in-memory
backing the face-recognition registry uses, separate instance.
urls:
- https://speechbrain.github.io/
- https://github.com/wenet-e2e/wespeaker
- https://github.com/modelscope/3D-Speaker
tags:
- voice-recognition
- speaker-verification
- speaker-embedding
- gpu
- cpu
capabilities:
default: "cpu-speaker-recognition"
nvidia: "cuda12-speaker-recognition"
nvidia-cuda-12: "cuda12-speaker-recognition"
- !!merge <<: *speakerrecognition
name: "speaker-recognition-development"
capabilities:
default: "cpu-speaker-recognition-development"
nvidia: "cuda12-speaker-recognition-development"
nvidia-cuda-12: "cuda12-speaker-recognition-development"
- !!merge <<: *speakerrecognition
name: "cpu-speaker-recognition"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-speaker-recognition"
mirrors:
- localai/localai-backends:latest-cpu-speaker-recognition
- !!merge <<: *speakerrecognition
name: "cuda12-speaker-recognition"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-speaker-recognition"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-speaker-recognition
- !!merge <<: *speakerrecognition
name: "cpu-speaker-recognition-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-speaker-recognition"
mirrors:
- localai/localai-backends:master-cpu-speaker-recognition
- !!merge <<: *speakerrecognition
name: "cuda12-speaker-recognition-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-speaker-recognition"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-speaker-recognition

View File

@@ -0,0 +1,13 @@
.DEFAULT_GOAL := install
.PHONY: install
install:
bash install.sh
.PHONY: protogen-clean
protogen-clean:
$(RM) backend_pb2_grpc.py backend_pb2.py
.PHONY: clean
clean: protogen-clean
rm -rf venv __pycache__

View File

@@ -0,0 +1,40 @@
# speaker-recognition
Speaker (voice) recognition backend for LocalAI. The audio analog to
`insightface` — produces speaker embeddings and supports 1:1 voice
verification and voice demographic analysis.
## Engines
- **SpeechBrainEngine** (default): ECAPA-TDNN trained on VoxCeleb.
192-d L2-normalised embeddings, cosine distance for verification.
Auto-downloads from HuggingFace on first LoadModel.
- **OnnxDirectEngine**: Any pre-exported ONNX speaker encoder
(WeSpeaker ResNet, 3D-Speaker ERes2Net, CAM++, …). Model path comes
from the gallery `files:` entry.
Engine selection is gallery-driven: if the model config provides
`model_path:` / `onnx:` the ONNX engine is used, otherwise the
SpeechBrain engine.
## Endpoints
- `POST /v1/voice/verify` — 1:1 same-speaker check.
- `POST /v1/voice/embed` — extract a speaker embedding vector.
- `POST /v1/voice/analyze` — voice demographics, loaded lazily on
the first analyze call:
- **Emotion** (default, opt-out): `superb/wav2vec2-base-superb-er`
(Apache-2.0), 4-way categorical (neutral / happy / angry / sad).
- **Age + gender** (opt-in): no default — wire a checkpoint with a
standard `Wav2Vec2ForSequenceClassification` head via
`age_gender_model:<repo>` in options. The Audeering
age-gender model is *not* usable as a drop-in because its
multi-task head isn't loadable via `AutoModelForAudioClassification`.
Both heads are optional. When nothing loads, the engine returns 501.
## Audio input
Audio is materialised by the HTTP layer to a temp wav before calling
the gRPC backend. Accepted input forms on the HTTP side: URL, data-URI,
or raw base64. The backend itself always receives a filesystem path.

View File

@@ -0,0 +1,205 @@
#!/usr/bin/env python3
"""gRPC server for the LocalAI speaker-recognition backend.
Implements Health / LoadModel / Status plus the voice-specific methods:
VoiceVerify, VoiceAnalyze, VoiceEmbed. The heavy lifting lives in
engines.py — this file is just the gRPC plumbing, mirroring the
insightface backend's two-engine split (SpeechBrain + OnnxDirect).
"""
from __future__ import annotations
import argparse
import os
import signal
import sys
import time
from concurrent import futures
import backend_pb2
import backend_pb2_grpc
import grpc
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "common"))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "common"))
from grpc_auth import get_auth_interceptors # noqa: E402
from engines import SpeakerEngine, build_engine # noqa: E402
_ONE_DAY = 60 * 60 * 24
MAX_WORKERS = int(os.environ.get("PYTHON_GRPC_MAX_WORKERS", "1"))
# ECAPA-TDNN on VoxCeleb is the reference. Threshold is tuned for
# cosine distance (1 - cosine_similarity). Clients may override.
DEFAULT_VERIFY_THRESHOLD = 0.25
def _parse_options(raw: list[str]) -> dict[str, str]:
out: dict[str, str] = {}
for entry in raw:
if ":" not in entry:
continue
k, v = entry.split(":", 1)
out[k.strip()] = v.strip()
return out
class BackendServicer(backend_pb2_grpc.BackendServicer):
def __init__(self) -> None:
self.engine: SpeakerEngine | None = None
self.engine_name: str = ""
self.model_name: str = ""
self.verify_threshold: float = DEFAULT_VERIFY_THRESHOLD
def Health(self, request, context):
return backend_pb2.Reply(message=bytes("OK", "utf-8"))
def LoadModel(self, request, context):
options = _parse_options(list(request.Options))
# Surface LocalAI's models directory (ModelPath) so engines can
# anchor relative paths and auto-download into a writable spot
# alongside every other gallery-managed asset.
options["_model_path"] = request.ModelPath or ""
try:
engine, engine_name = build_engine(request.Model, options)
except Exception as exc: # noqa: BLE001
return backend_pb2.Result(success=False, message=f"engine init failed: {exc}")
self.engine = engine
self.engine_name = engine_name
self.model_name = request.Model
threshold_opt = options.get("verify_threshold")
if threshold_opt:
try:
self.verify_threshold = float(threshold_opt)
except ValueError:
pass
return backend_pb2.Result(success=True, message=f"loaded {engine_name}")
def Status(self, request, context):
state = backend_pb2.StatusResponse.State.READY if self.engine else backend_pb2.StatusResponse.State.UNINITIALIZED
return backend_pb2.StatusResponse(state=state)
def _require_engine(self, context) -> SpeakerEngine | None:
if self.engine is None:
context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
context.set_details("no speaker-recognition model loaded")
return None
return self.engine
def VoiceVerify(self, request, context):
engine = self._require_engine(context)
if engine is None:
return backend_pb2.VoiceVerifyResponse()
if not request.audio1 or not request.audio2:
context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
context.set_details("audio1 and audio2 are required")
return backend_pb2.VoiceVerifyResponse()
threshold = request.threshold if request.threshold > 0 else self.verify_threshold
started = time.time()
try:
distance = engine.compare(request.audio1, request.audio2)
except Exception as exc: # noqa: BLE001
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(f"voice verify failed: {exc}")
return backend_pb2.VoiceVerifyResponse()
elapsed_ms = (time.time() - started) * 1000.0
# Confidence goes linearly from 100 at distance=0 to 0 at distance=threshold.
confidence = max(0.0, min(100.0, (1.0 - distance / threshold) * 100.0))
return backend_pb2.VoiceVerifyResponse(
verified=distance <= threshold,
distance=distance,
threshold=threshold,
confidence=confidence,
model=self.model_name,
processing_time_ms=elapsed_ms,
)
def VoiceEmbed(self, request, context):
engine = self._require_engine(context)
if engine is None:
return backend_pb2.VoiceEmbedResponse()
if not request.audio:
context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
context.set_details("audio is required")
return backend_pb2.VoiceEmbedResponse()
try:
vec = engine.embed(request.audio)
except Exception as exc: # noqa: BLE001
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(f"voice embed failed: {exc}")
return backend_pb2.VoiceEmbedResponse()
return backend_pb2.VoiceEmbedResponse(embedding=list(vec), model=self.model_name)
def VoiceAnalyze(self, request, context):
engine = self._require_engine(context)
if engine is None:
return backend_pb2.VoiceAnalyzeResponse()
if not request.audio:
context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
context.set_details("audio is required")
return backend_pb2.VoiceAnalyzeResponse()
actions = list(request.actions) or ["age", "gender", "emotion"]
try:
segments = engine.analyze(request.audio, actions)
except NotImplementedError:
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details(f"analyze not supported by {self.engine_name}")
return backend_pb2.VoiceAnalyzeResponse()
except Exception as exc: # noqa: BLE001
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(f"voice analyze failed: {exc}")
return backend_pb2.VoiceAnalyzeResponse()
proto_segments = []
for seg in segments:
proto_segments.append(
backend_pb2.VoiceAnalysis(
start=seg.get("start", 0.0),
end=seg.get("end", 0.0),
age=seg.get("age", 0.0),
dominant_gender=seg.get("dominant_gender", ""),
gender=seg.get("gender", {}),
dominant_emotion=seg.get("dominant_emotion", ""),
emotion=seg.get("emotion", {}),
)
)
return backend_pb2.VoiceAnalyzeResponse(segments=proto_segments)
def serve(address: str) -> None:
interceptors = get_auth_interceptors()
server = grpc.server(
futures.ThreadPoolExecutor(max_workers=MAX_WORKERS),
interceptors=interceptors,
options=[
("grpc.max_send_message_length", 128 * 1024 * 1024),
("grpc.max_receive_message_length", 128 * 1024 * 1024),
],
)
backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
server.add_insecure_port(address)
server.start()
print("speaker-recognition backend listening on", address, flush=True)
def _stop(*_):
server.stop(0)
sys.exit(0)
signal.signal(signal.SIGTERM, _stop)
signal.signal(signal.SIGINT, _stop)
try:
while True:
time.sleep(_ONE_DAY)
except KeyboardInterrupt:
server.stop(0)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--addr", default="localhost:50051")
args = parser.parse_args()
serve(args.addr)

View File

@@ -0,0 +1,387 @@
"""Speaker-recognition engines.
Two engines are offered, mirroring the insightface backend's split:
* SpeechBrainEngine: full PyTorch / SpeechBrain path. Uses the
ECAPA-TDNN recipe trained on VoxCeleb; 192-d L2-normalized
embeddings, cosine distance for verification. Auto-downloads the
checkpoint into LocalAI's models directory on first LoadModel.
* OnnxDirectEngine: CPU-friendly fallback that runs pre-exported
ONNX speaker encoders (WeSpeaker ResNet34, 3D-Speaker ERes2Net,
CAM++, etc.). Model paths come from the model config — the gallery
`files:` flow drops them into the models directory.
Engine selection follows the same gallery-driven convention face
recognition uses (insightface commits 9c6da0f7 / 405fec0b): the
Python backend reads `engine` / `model_path` / `checkpoint` from the
options dict and picks an engine accordingly.
"""
from __future__ import annotations
import os
from typing import Any, Iterable, Protocol
class SpeakerEngine(Protocol):
"""Interface both concrete engines satisfy."""
name: str
def embed(self, audio_path: str) -> list[float]: # pragma: no cover - interface
...
def compare(self, audio1: str, audio2: str) -> float: # pragma: no cover
...
def analyze(self, audio_path: str, actions: Iterable[str]) -> list[dict[str, Any]]: # pragma: no cover
...
def _cosine_distance(a, b) -> float:
import numpy as np
va = np.asarray(a, dtype=np.float32).reshape(-1)
vb = np.asarray(b, dtype=np.float32).reshape(-1)
na = float(np.linalg.norm(va))
nb = float(np.linalg.norm(vb))
if na == 0.0 or nb == 0.0:
return 1.0
return float(1.0 - np.dot(va, vb) / (na * nb))
class AnalysisHead:
"""Age / gender / emotion head, lazy-loaded on first analyze call.
Wraps two open-licence HuggingFace checkpoints:
* audeering/wav2vec2-large-robust-24-ft-age-gender — age
regression (0-100 years) + 3-way gender (female/male/child).
Apache 2.0.
* superb/wav2vec2-base-superb-er — 4-way emotion classification
(neutral / happy / angry / sad). Apache 2.0.
Either model is optional — the head degrades gracefully to only the
attributes it could load. Override the checkpoint with the
`age_gender_model` / `emotion_model` option if you want something
else. Set either to an empty string to disable that head.
"""
# Age + gender is OFF by default: the high-accuracy Apache-2.0
# checkpoint (Audeering wav2vec2-large-robust-24-ft-age-gender) uses a
# custom multi-task head that AutoModelForAudioClassification silently
# mangles — it drops the age weights as UNEXPECTED and re-initialises
# the classifier head with random values, so the output is noise. Users
# who have a cleanly loadable age/gender classifier can opt in with
# `age_gender_model:<repo>` in options. The emotion default below
# (superb/wav2vec2-base-superb-er) loads via the standard audio-
# classification pipeline with no such caveat.
DEFAULT_AGE_GENDER_MODEL = ""
DEFAULT_EMOTION_MODEL = "superb/wav2vec2-base-superb-er"
AGE_GENDER_LABELS = ("female", "male", "child")
def __init__(self, options: dict[str, str]):
self._options = options
self._age_gender = None
self._age_gender_processor = None
self._age_gender_loaded = False
self._age_gender_error: str | None = None
self._emotion = None
self._emotion_loaded = False
self._emotion_error: str | None = None
# --- age / gender -------------------------------------------------
def _ensure_age_gender(self):
if self._age_gender_loaded:
return
self._age_gender_loaded = True
model_id = self._options.get(
"age_gender_model", self.DEFAULT_AGE_GENDER_MODEL
)
if not model_id:
self._age_gender_error = "disabled"
return
try:
# Late imports — torch / transformers are heavy and only
# pulled in when the analyze head actually runs.
import torch # type: ignore
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification # type: ignore
self._torch = torch
self._age_gender_processor = AutoFeatureExtractor.from_pretrained(model_id)
self._age_gender = AutoModelForAudioClassification.from_pretrained(model_id)
self._age_gender.eval()
except Exception as exc: # noqa: BLE001
self._age_gender_error = f"{type(exc).__name__}: {exc}"
def _infer_age_gender(self, waveform_16k) -> dict[str, Any]:
self._ensure_age_gender()
if self._age_gender is None:
return {}
import numpy as np
try:
inputs = self._age_gender_processor(
waveform_16k, sampling_rate=16000, return_tensors="pt"
)
with self._torch.no_grad():
outputs = self._age_gender(**inputs)
# Audeering's checkpoint is published with a custom head: the
# official recipe exposes `(hidden_states, logits_age, logits_gender)`.
# AutoModelForAudioClassification flattens that into a single
# `logits` tensor of shape [batch, 4] — [age_regression, female, male, child].
# Fall back gracefully when the shape is different (e.g. a
# user-supplied age_gender_model checkpoint that returns a proper tuple).
hidden = getattr(outputs, "logits", outputs)
age_years = None
gender_logits = None
if isinstance(hidden, (tuple, list)) and len(hidden) >= 2:
age_years = float(hidden[0].squeeze().item()) * 100.0
gender_logits = hidden[1]
else:
flat = hidden.squeeze()
if flat.ndim == 1 and flat.numel() >= 4:
age_years = float(flat[0].item()) * 100.0
gender_logits = flat[1:4]
elif flat.ndim == 1 and flat.numel() == 1:
age_years = float(flat.item()) * 100.0
if age_years is None and gender_logits is None:
return {}
result: dict[str, Any] = {}
if age_years is not None:
result["age"] = age_years
if gender_logits is not None:
probs = self._torch.softmax(gender_logits, dim=-1).cpu().numpy()
probs = np.asarray(probs).reshape(-1)
gender_map = {
label: float(probs[i])
for i, label in enumerate(self.AGE_GENDER_LABELS[: len(probs)])
}
result["gender"] = gender_map
if gender_map:
dom = max(gender_map.items(), key=lambda kv: kv[1])[0]
result["dominant_gender"] = {
"female": "Female",
"male": "Male",
"child": "Child",
}.get(dom, dom.capitalize())
return result
except Exception as exc: # noqa: BLE001
# Analyze is a best-effort feature — never take down the
# whole analyze call because the age/gender head had a bad
# day. Mark the failure so the emotion branch still runs.
self._age_gender_error = f"runtime: {type(exc).__name__}: {exc}"
return {}
# --- emotion ------------------------------------------------------
def _ensure_emotion(self):
if self._emotion_loaded:
return
self._emotion_loaded = True
model_id = self._options.get("emotion_model", self.DEFAULT_EMOTION_MODEL)
if not model_id:
self._emotion_error = "disabled"
return
try:
from transformers import pipeline # type: ignore
self._emotion = pipeline("audio-classification", model=model_id)
except Exception as exc: # noqa: BLE001
self._emotion_error = f"{type(exc).__name__}: {exc}"
def _infer_emotion(self, audio_path: str) -> dict[str, Any]:
self._ensure_emotion()
if self._emotion is None:
return {}
try:
raw = self._emotion(audio_path, top_k=8)
except Exception as exc: # noqa: BLE001
# Second-line defense: don't fail the whole analyze call
# over a runtime inference hiccup.
self._emotion_error = f"runtime: {type(exc).__name__}: {exc}"
return {}
emotion_map = {row["label"].lower(): float(row["score"]) for row in raw}
if not emotion_map:
return {}
dom = max(emotion_map.items(), key=lambda kv: kv[1])[0]
return {"emotion": emotion_map, "dominant_emotion": dom}
# --- orchestrator -------------------------------------------------
def analyze(self, audio_path: str, waveform_16k, actions: Iterable[str]) -> dict[str, Any]:
wanted = {a.strip().lower() for a in actions} if actions else {"age", "gender", "emotion"}
result: dict[str, Any] = {}
if "age" in wanted or "gender" in wanted:
ag = self._infer_age_gender(waveform_16k)
if "age" in wanted and "age" in ag:
result["age"] = ag["age"]
if "gender" in wanted:
if "gender" in ag:
result["gender"] = ag["gender"]
if "dominant_gender" in ag:
result["dominant_gender"] = ag["dominant_gender"]
if "emotion" in wanted:
em = self._infer_emotion(audio_path)
result.update(em)
return result
class SpeechBrainEngine:
"""ECAPA-TDNN via SpeechBrain. Auto-downloads on first use."""
name = "speechbrain-ecapa-tdnn"
def __init__(self, model_name: str, options: dict[str, str]):
# Late imports so the module can be introspected / tested
# without torch / speechbrain being installed.
from speechbrain.inference.speaker import EncoderClassifier # type: ignore
source = options.get("source") or model_name or "speechbrain/spkrec-ecapa-voxceleb"
savedir = options.get("_model_path") or os.environ.get("HF_HOME") or "./pretrained_models"
self._model = EncoderClassifier.from_hparams(source=source, savedir=savedir)
self._analysis = AnalysisHead(options)
def _load_waveform(self, path: str):
# Use soundfile + torch directly — torchaudio.load in torchaudio
# 2.8+ requires the torchcodec package for decoding, which adds
# another heavy ffmpeg-linked dep. soundfile covers WAV/FLAC
# which is what we care about here.
import numpy as np
import soundfile as sf # type: ignore
import torch # type: ignore
audio, sr = sf.read(path, always_2d=False)
if audio.ndim > 1:
audio = audio.mean(axis=1)
audio = np.asarray(audio, dtype=np.float32)
if sr != 16000:
# Simple linear resample — good enough for 16kHz downsampling
# from 44.1/48kHz, and we expect 16kHz inputs in practice.
ratio = 16000 / float(sr)
n = int(round(len(audio) * ratio))
audio = np.interp(
np.linspace(0, len(audio), n, endpoint=False),
np.arange(len(audio)),
audio,
).astype(np.float32)
return torch.from_numpy(audio).unsqueeze(0) # [1, T]
def embed(self, audio_path: str) -> list[float]:
waveform = self._load_waveform(audio_path)
vec = self._model.encode_batch(waveform).squeeze().detach().cpu().numpy()
return [float(x) for x in vec]
def compare(self, audio1: str, audio2: str) -> float:
return _cosine_distance(self.embed(audio1), self.embed(audio2))
def analyze(self, audio_path: str, actions):
# Age / gender / emotion aren't produced by ECAPA-TDNN itself;
# delegate to AnalysisHead which wraps separate Apache-2.0
# checkpoints. Returns a single segment spanning the clip —
# segmentation / diarisation is a future enhancement.
waveform = self._load_waveform(audio_path)
mono = waveform.squeeze().detach().cpu().numpy()
attrs = self._analysis.analyze(audio_path, mono, actions)
if not attrs:
raise NotImplementedError(
"analyze head failed to load — install transformers + torch or pass age_gender_model/emotion_model options"
)
duration = float(mono.shape[-1]) / 16000.0 if mono.size else 0.0
return [dict(start=0.0, end=duration, **attrs)]
class OnnxDirectEngine:
"""Run a pre-exported ONNX speaker encoder (WeSpeaker / 3D-Speaker)."""
name = "onnx-direct"
def __init__(self, model_name: str, options: dict[str, str]):
import onnxruntime as ort # type: ignore
# The gallery is expected to have dropped the ONNX file under
# the models directory; accept either an absolute path or a
# filename relative to _model_path.
onnx_path = options.get("model_path") or options.get("onnx")
if not onnx_path:
raise ValueError("OnnxDirectEngine requires `model_path: <file.onnx>` in options")
if not os.path.isabs(onnx_path):
onnx_path = os.path.join(options.get("_model_path", ""), onnx_path)
if not os.path.isfile(onnx_path):
raise FileNotFoundError(f"ONNX model not found: {onnx_path}")
providers = options.get("providers")
if providers:
provider_list = [p.strip() for p in providers.split(",") if p.strip()]
else:
provider_list = ["CPUExecutionProvider"]
self._session = ort.InferenceSession(onnx_path, providers=provider_list)
self._input_name = self._session.get_inputs()[0].name
self._expected_sr = int(options.get("sample_rate", "16000"))
self._analysis = AnalysisHead(options)
def _load_waveform(self, path: str):
import numpy as np
import soundfile as sf # type: ignore
audio, sr = sf.read(path, always_2d=False)
if sr != self._expected_sr:
# Cheap linear resample — good enough for sanity; callers
# should pre-resample for production.
ratio = self._expected_sr / float(sr)
n = int(round(len(audio) * ratio))
audio = np.interp(
np.linspace(0, len(audio), n, endpoint=False),
np.arange(len(audio)),
audio,
)
if audio.ndim > 1:
audio = audio.mean(axis=1)
return audio.astype("float32")
def embed(self, audio_path: str) -> list[float]:
import numpy as np
audio = self._load_waveform(audio_path)
feed = audio.reshape(1, -1)
out = self._session.run(None, {self._input_name: feed})
vec = np.asarray(out[0]).reshape(-1)
return [float(x) for x in vec]
def compare(self, audio1: str, audio2: str) -> float:
return _cosine_distance(self.embed(audio1), self.embed(audio2))
def analyze(self, audio_path: str, actions):
# AnalysisHead expects 16kHz mono; _load_waveform already
# resamples to self._expected_sr. If the user configured a
# non-16k expected rate, resample one more time for analyze.
audio = self._load_waveform(audio_path)
if self._expected_sr != 16000:
import numpy as np
ratio = 16000 / float(self._expected_sr)
n = int(round(len(audio) * ratio))
audio = np.interp(
np.linspace(0, len(audio), n, endpoint=False),
np.arange(len(audio)),
audio,
).astype("float32")
attrs = self._analysis.analyze(audio_path, audio, actions)
if not attrs:
raise NotImplementedError(
"analyze head failed to load — install transformers + torch or pass age_gender_model/emotion_model options"
)
duration = float(len(audio)) / 16000.0 if len(audio) else 0.0
return [dict(start=0.0, end=duration, **attrs)]
def build_engine(model_name: str, options: dict[str, str]) -> tuple[SpeakerEngine, str]:
"""Pick an engine based on the options. ONNX path takes priority:
if the gallery has dropped a `model_path:` or `onnx:` option, run
the direct ONNX engine. Otherwise, fall back to SpeechBrain.
"""
engine_kind = (options.get("engine") or "").lower()
if engine_kind == "onnx" or options.get("model_path") or options.get("onnx"):
return OnnxDirectEngine(model_name, options), OnnxDirectEngine.name
return SpeechBrainEngine(model_name, options), SpeechBrainEngine.name

View File

@@ -0,0 +1,19 @@
#!/bin/bash
set -e
backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
source $backend_dir/common/libbackend.sh
else
source $backend_dir/../common/libbackend.sh
fi
installRequirements
# No pre-baked model weights. Weights flow through LocalAI's gallery
# `files:` mechanism — see gallery entries for speechbrain-ecapa-tdnn
# and WeSpeaker / 3D-Speaker ONNX packs. SpeechBrain's
# EncoderClassifier.from_hparams also knows how to auto-download from
# HuggingFace into the configured savedir (we point it at ModelPath),
# so the first LoadModel call bootstraps the checkpoint if the gallery
# flow wasn't used.

View File

@@ -0,0 +1,5 @@
torch
torchaudio
speechbrain
transformers
onnxruntime

View File

@@ -0,0 +1,5 @@
torch
torchaudio
speechbrain
transformers
onnxruntime-gpu

View File

@@ -0,0 +1,5 @@
grpcio==1.71.0
protobuf
grpcio-tools
numpy
soundfile

View File

@@ -0,0 +1,9 @@
#!/bin/bash
backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
source $backend_dir/common/libbackend.sh
else
source $backend_dir/../common/libbackend.sh
fi
startBackend $@

View File

@@ -0,0 +1,78 @@
"""Unit tests for the speaker-recognition gRPC backend.
The servicer is instantiated in-process (no gRPC channel) and driven
directly. The default path exercises SpeechBrain's ECAPA-TDNN — the
first run downloads the checkpoint into a temp savedir. Tests are
skipped gracefully when the heavy optional dependencies (torch /
speechbrain / onnxruntime) are not installed, so the gRPC plumbing
can still be verified on a bare image.
"""
from __future__ import annotations
import importlib
import os
import sys
import tempfile
import unittest
sys.path.insert(0, os.path.dirname(__file__))
import backend_pb2 # noqa: E402
from backend import BackendServicer # noqa: E402
def _have(*mods: str) -> bool:
for m in mods:
if importlib.util.find_spec(m) is None:
return False
return True
class _FakeCtx:
"""Minimal stand-in for a gRPC servicer context."""
def __init__(self) -> None:
self.code = None
self.details = ""
def set_code(self, c):
self.code = c
def set_details(self, d):
self.details = d
class ServicerPlumbingTest(unittest.TestCase):
"""Checks that LoadModel returns a clear error when no engine deps
are installed, and that Voice* calls on an uninitialised servicer
surface FAILED_PRECONDITION — both verifying the gRPC wiring
without requiring SpeechBrain or ONNX at test time."""
def test_pre_load_voice_calls_are_rejected(self):
svc = BackendServicer()
ctx = _FakeCtx()
svc.VoiceVerify(backend_pb2.VoiceVerifyRequest(audio1="/tmp/a.wav", audio2="/tmp/b.wav"), ctx)
self.assertEqual(str(ctx.code), "StatusCode.FAILED_PRECONDITION")
def test_load_without_deps_fails_cleanly(self):
svc = BackendServicer()
req = backend_pb2.ModelOptions(Model="speechbrain/spkrec-ecapa-voxceleb", ModelPath="")
result = svc.LoadModel(req, _FakeCtx())
# Either the deps are installed and it loaded, or they aren't
# and we got a structured error instead of a crash.
self.assertTrue(result.success or "engine init failed" in result.message)
@unittest.skipUnless(_have("speechbrain", "torch", "torchaudio"), "speechbrain / torch missing")
class SpeechBrainEngineSmokeTest(unittest.TestCase):
def test_load_and_embed(self):
svc = BackendServicer()
with tempfile.TemporaryDirectory() as td:
req = backend_pb2.ModelOptions(Model="speechbrain/spkrec-ecapa-voxceleb", ModelPath=td)
result = svc.LoadModel(req, _FakeCtx())
self.assertTrue(result.success, result.message)
if __name__ == "__main__":
unittest.main()

View File

@@ -0,0 +1,11 @@
#!/bin/bash
set -e
backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
source $backend_dir/common/libbackend.sh
else
source $backend_dir/../common/libbackend.sh
fi
runUnittests

View File

@@ -386,6 +386,27 @@ impl Backend for KokorosService {
Err(Status::unimplemented("Not supported"))
}
async fn voice_verify(
&self,
_: Request<backend::VoiceVerifyRequest>,
) -> Result<Response<backend::VoiceVerifyResponse>, Status> {
Err(Status::unimplemented("Not supported"))
}
async fn voice_analyze(
&self,
_: Request<backend::VoiceAnalyzeRequest>,
) -> Result<Response<backend::VoiceAnalyzeResponse>, Status> {
Err(Status::unimplemented("Not supported"))
}
async fn voice_embed(
&self,
_: Request<backend::VoiceEmbedRequest>,
) -> Result<Response<backend::VoiceEmbedResponse>, Status> {
Err(Status::unimplemented("Not supported"))
}
async fn stores_set(
&self,
_: Request<backend::StoresSetOptions>,

View File

@@ -14,6 +14,7 @@ import (
"github.com/mudler/LocalAI/core/services/facerecognition"
"github.com/mudler/LocalAI/core/services/galleryop"
"github.com/mudler/LocalAI/core/services/nodes"
"github.com/mudler/LocalAI/core/services/voicerecognition"
"github.com/mudler/LocalAI/core/templates"
pkggrpc "github.com/mudler/LocalAI/pkg/grpc"
"github.com/mudler/LocalAI/pkg/model"
@@ -29,6 +30,12 @@ import (
// family per deployment; we keep the door open instead.
const faceEmbeddingDim = 0
// voiceEmbeddingDim is the expected dimension for speaker embeddings.
// 0 so the Registry accepts whatever dim the loaded recognizer
// produces — ECAPA-TDNN is 192, WeSpeaker ResNet34 is 256, 3D-Speaker
// ERes2Net is 192, CAM++ is 512.
const voiceEmbeddingDim = 0
type Application struct {
backendLoader *config.ModelConfigLoader
modelLoader *model.ModelLoader
@@ -39,6 +46,7 @@ type Application struct {
agentJobService *agentpool.AgentJobService
agentPoolService atomic.Pointer[agentpool.AgentPoolService]
faceRegistry facerecognition.Registry
voiceRegistry voicerecognition.Registry
authDB *gorm.DB
watchdogMutex sync.Mutex
watchdogStop chan bool
@@ -78,6 +86,14 @@ func newApplication(appConfig *config.ApplicationConfig) *Application {
}
app.faceRegistry = facerecognition.NewStoreRegistry(faceStoreResolver, "", faceEmbeddingDim)
// Voice (speaker) recognition registry — same plumbing, separate
// registry so embedding spaces stay isolated (a face vector and a
// speaker vector are not comparable).
voiceStoreResolver := func(_ context.Context, storeName string) (pkggrpc.Backend, error) {
return corebackend.StoreBackend(ml, appConfig, storeName, "")
}
app.voiceRegistry = voicerecognition.NewStoreRegistry(voiceStoreResolver, "", voiceEmbeddingDim)
return app
}
@@ -130,6 +146,14 @@ func (a *Application) FaceRegistry() facerecognition.Registry {
return a.faceRegistry
}
// VoiceRegistry returns the voice (speaker) recognition registry used
// for 1:N identification. Same in-memory local-store backing as
// FaceRegistry but a separate instance — voice embeddings live in
// their own vector space.
func (a *Application) VoiceRegistry() voicerecognition.Registry {
return a.voiceRegistry
}
// AuthDB returns the auth database connection, or nil if auth is not enabled.
func (a *Application) AuthDB() *gorm.DB {
return a.authDB

View File

@@ -0,0 +1,58 @@
package backend
import (
"context"
"fmt"
"time"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/trace"
"github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/model"
)
func VoiceAnalyze(
audio string,
actions []string,
loader *model.ModelLoader,
appConfig *config.ApplicationConfig,
modelConfig config.ModelConfig,
) (*proto.VoiceAnalyzeResponse, error) {
opts := ModelOptions(modelConfig, appConfig)
voiceModel, err := loader.Load(opts...)
if err != nil {
recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
return nil, err
}
if voiceModel == nil {
return nil, fmt.Errorf("could not load voice recognition model")
}
var startTime time.Time
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
startTime = time.Now()
}
res, err := voiceModel.VoiceAnalyze(context.Background(), &proto.VoiceAnalyzeRequest{
Audio: audio,
Actions: actions,
})
if appConfig.EnableTracing {
errStr := ""
if err != nil {
errStr = err.Error()
}
trace.RecordBackendTrace(trace.BackendTrace{
Timestamp: startTime,
Duration: time.Since(startTime),
Type: trace.BackendTraceVoiceAnalyze,
ModelName: modelConfig.Name,
Backend: modelConfig.Backend,
Error: errStr,
})
}
return res, err
}

View File

@@ -0,0 +1,66 @@
package backend
import (
"context"
"fmt"
"time"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/trace"
"github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/model"
)
// VoiceEmbed returns a speaker embedding (typically 192-d for ECAPA-TDNN)
// for the audio file at audioPath. Unlike ModelEmbedding (which is
// OpenAI-compatible and text-only), this call takes an audio path and
// returns the backend's speaker-encoder output.
func VoiceEmbed(
audioPath string,
loader *model.ModelLoader,
appConfig *config.ApplicationConfig,
modelConfig config.ModelConfig,
) (*proto.VoiceEmbedResponse, error) {
opts := ModelOptions(modelConfig, appConfig)
voiceModel, err := loader.Load(opts...)
if err != nil {
recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
return nil, err
}
if voiceModel == nil {
return nil, fmt.Errorf("could not load voice recognition model")
}
var startTime time.Time
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
startTime = time.Now()
}
res, err := voiceModel.VoiceEmbed(context.Background(), &proto.VoiceEmbedRequest{
Audio: audioPath,
})
if appConfig.EnableTracing {
errStr := ""
if err != nil {
errStr = err.Error()
}
trace.RecordBackendTrace(trace.BackendTrace{
Timestamp: startTime,
Duration: time.Since(startTime),
Type: trace.BackendTraceVoiceEmbed,
ModelName: modelConfig.Name,
Backend: modelConfig.Backend,
Error: errStr,
})
}
if err != nil {
return nil, err
}
if res == nil || len(res.Embedding) == 0 {
return nil, fmt.Errorf("voice embedding returned empty vector (no speech detected?)")
}
return res, nil
}

View File

@@ -0,0 +1,61 @@
package backend
import (
"context"
"fmt"
"time"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/trace"
"github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/model"
)
func VoiceVerify(
audio1, audio2 string,
threshold float32,
antiSpoofing bool,
loader *model.ModelLoader,
appConfig *config.ApplicationConfig,
modelConfig config.ModelConfig,
) (*proto.VoiceVerifyResponse, error) {
opts := ModelOptions(modelConfig, appConfig)
voiceModel, err := loader.Load(opts...)
if err != nil {
recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
return nil, err
}
if voiceModel == nil {
return nil, fmt.Errorf("could not load voice recognition model")
}
var startTime time.Time
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
startTime = time.Now()
}
res, err := voiceModel.VoiceVerify(context.Background(), &proto.VoiceVerifyRequest{
Audio1: audio1,
Audio2: audio2,
Threshold: threshold,
AntiSpoofing: antiSpoofing,
})
if appConfig.EnableTracing {
errStr := ""
if err != nil {
errStr = err.Error()
}
trace.RecordBackendTrace(trace.BackendTrace{
Timestamp: startTime,
Duration: time.Since(startTime),
Type: trace.BackendTraceVoiceVerify,
ModelName: modelConfig.Name,
Backend: modelConfig.Backend,
Error: errStr,
})
}
return res, err
}

View File

@@ -588,7 +588,8 @@ const (
FLAG_VAD ModelConfigUsecase = 0b010000000000
FLAG_VIDEO ModelConfigUsecase = 0b100000000000
FLAG_DETECTION ModelConfigUsecase = 0b1000000000000
FLAG_FACE_RECOGNITION ModelConfigUsecase = 0b10000000000000
FLAG_SPEAKER_RECOGNITION ModelConfigUsecase = 0b100000000000000
// Common Subsets
FLAG_LLM ModelConfigUsecase = FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT
@@ -612,7 +613,8 @@ func GetAllModelConfigUsecases() map[string]ModelConfigUsecase {
"FLAG_LLM": FLAG_LLM,
"FLAG_VIDEO": FLAG_VIDEO,
"FLAG_DETECTION": FLAG_DETECTION,
"FLAG_FACE_RECOGNITION": FLAG_FACE_RECOGNITION,
"FLAG_FACE_RECOGNITION": FLAG_FACE_RECOGNITION,
"FLAG_SPEAKER_RECOGNITION": FLAG_SPEAKER_RECOGNITION,
}
}
@@ -653,7 +655,7 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
nonTextGenBackends := []string{
"whisper", "piper", "kokoro",
"diffusers", "stablediffusion", "stablediffusion-ggml",
"rerankers", "silero-vad", "rfdetr", "insightface",
"rerankers", "silero-vad", "rfdetr", "insightface", "speaker-recognition",
"transformers-musicgen", "ace-step", "acestep-cpp",
}
@@ -743,6 +745,13 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
}
}
if (u & FLAG_SPEAKER_RECOGNITION) == FLAG_SPEAKER_RECOGNITION {
speakerBackends := []string{"speaker-recognition"}
if !slices.Contains(speakerBackends, c.Backend) {
return false
}
}
if (u & FLAG_SOUND_GENERATION) == FLAG_SOUND_GENERATION {
soundGenBackends := []string{"transformers-musicgen", "ace-step", "acestep-cpp", "mock-backend"}
if !slices.Contains(soundGenBackends, c.Backend) {

View File

@@ -65,6 +65,14 @@ var RouteFeatureRegistry = []RouteFeature{
{"POST", "/v1/face/identify", FeatureFaceRecognition},
{"POST", "/v1/face/forget", FeatureFaceRecognition},
// Voice (speaker) recognition
{"POST", "/v1/voice/verify", FeatureVoiceRecognition},
{"POST", "/v1/voice/analyze", FeatureVoiceRecognition},
{"POST", "/v1/voice/embed", FeatureVoiceRecognition},
{"POST", "/v1/voice/register", FeatureVoiceRecognition},
{"POST", "/v1/voice/identify", FeatureVoiceRecognition},
{"POST", "/v1/voice/forget", FeatureVoiceRecognition},
// Video
{"POST", "/video", FeatureVideo},
@@ -160,5 +168,6 @@ func APIFeatureMetas() []FeatureMeta {
{FeatureMCP, "MCP", true},
{FeatureStores, "Stores", true},
{FeatureFaceRecognition, "Face Recognition", true},
{FeatureVoiceRecognition, "Voice Recognition", true},
}
}

View File

@@ -52,6 +52,7 @@ const (
FeatureMCP = "mcp"
FeatureStores = "stores"
FeatureFaceRecognition = "face_recognition"
FeatureVoiceRecognition = "voice_recognition"
)
// AgentFeatures lists agent-related features (default OFF).
@@ -65,7 +66,7 @@ var APIFeatures = []string{
FeatureChat, FeatureImages, FeatureAudioSpeech, FeatureAudioTranscription,
FeatureVAD, FeatureDetection, FeatureVideo, FeatureEmbeddings, FeatureSound,
FeatureRealtime, FeatureRerank, FeatureTokenize, FeatureMCP, FeatureStores,
FeatureFaceRecognition,
FeatureFaceRecognition, FeatureVoiceRecognition,
}
// AllFeatures lists all known features (used by UI and validation).

View File

@@ -79,6 +79,12 @@ var instructionDefs = []instructionDef{
Tags: []string{"face-recognition"},
Intro: "The /v1/face/register, /identify, and /forget endpoints build on a vector store — registrations are in-memory by default and lost on restart. Use /v1/face/embed for a raw embedding; /v1/embeddings is OpenAI-compatible and text-only.",
},
{
Name: "voice-recognition",
Description: "Speaker verification (1:1), embedding, and demographic analysis from voice",
Tags: []string{"voice-recognition"},
Intro: "Voice (speaker) recognition — the audio analog to /v1/face/*. Use /v1/voice/verify for 1:1 speaker comparison, /v1/voice/identify for 1:N match against the registered store, /v1/voice/{register,forget} to manage that store, /v1/voice/embed for a raw speaker-encoder vector, and /v1/voice/analyze for age / gender / emotion inferred from speech. Registrations are in-memory by default and lost on restart. Audio inputs accept URL, base64, or data-URI; /v1/embeddings remains text-only.",
},
}
// swaggerState holds parsed swagger spec data, initialised once.

View File

@@ -0,0 +1,82 @@
package localai
import (
"encoding/base64"
"fmt"
"io"
"net/http"
"os"
"regexp"
"strings"
"time"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/pkg/utils"
)
var audioDataURIPattern = regexp.MustCompile(`^data:([^;]+);base64,`)
var audioDownloadClient = http.Client{Timeout: 30 * time.Second}
// decodeAudioInput materialises a URL / data-URI / raw-base64 audio
// payload to a temporary file and returns its path plus a cleanup
// function. Voice backends expect a filesystem path (same convention
// as TranscriptRequest.dst) — callers must defer the returned cleanup
// so the temp file does not leak.
//
// Bad inputs (invalid URL, undecodable base64, non-audio payload) are
// surfaced as 400 Bad Request rather than 500 so API consumers can
// distinguish a client mistake from a server failure.
func decodeAudioInput(s string) (string, func(), error) {
if s == "" {
return "", func() {}, echo.NewHTTPError(http.StatusBadRequest, "audio input is empty")
}
var raw []byte
switch {
case strings.HasPrefix(s, "http://") || strings.HasPrefix(s, "https://"):
if err := utils.ValidateExternalURL(s); err != nil {
return "", func() {}, echo.NewHTTPError(http.StatusBadRequest, fmt.Sprintf("invalid audio URL: %v", err))
}
resp, err := audioDownloadClient.Get(s)
if err != nil {
return "", func() {}, echo.NewHTTPError(http.StatusBadRequest, fmt.Sprintf("audio download failed: %v", err))
}
defer resp.Body.Close()
raw, err = io.ReadAll(resp.Body)
if err != nil {
return "", func() {}, echo.NewHTTPError(http.StatusBadRequest, fmt.Sprintf("audio download read failed: %v", err))
}
default:
payload := s
if m := audioDataURIPattern.FindString(s); m != "" {
payload = strings.Replace(s, m, "", 1)
}
decoded, err := base64.StdEncoding.DecodeString(payload)
if err != nil {
return "", func() {}, echo.NewHTTPError(http.StatusBadRequest, fmt.Sprintf("invalid audio base64: %v", err))
}
raw = decoded
}
if len(raw) == 0 {
return "", func() {}, echo.NewHTTPError(http.StatusBadRequest, "audio input decoded to zero bytes")
}
f, err := os.CreateTemp("", "localai-voice-*.wav")
if err != nil {
return "", func() {}, err
}
path := f.Name()
cleanup := func() { _ = os.Remove(path) }
if _, err := f.Write(raw); err != nil {
f.Close()
cleanup()
return "", func() {}, err
}
if err := f.Close(); err != nil {
cleanup()
return "", func() {}, err
}
return path, cleanup, nil
}

View File

@@ -0,0 +1,60 @@
package localai
import (
"net/http"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/middleware"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/xlog"
)
// VoiceAnalyzeEndpoint returns demographic attributes inferred from speech.
// @Summary Analyze demographic attributes (age, gender, emotion) from a voice clip.
// @Tags voice-recognition
// @Param request body schema.VoiceAnalyzeRequest true "query params"
// @Success 200 {object} schema.VoiceAnalyzeResponse "Response"
// @Router /v1/voice/analyze [post]
func VoiceAnalyzeEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) echo.HandlerFunc {
return func(c echo.Context) error {
input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.VoiceAnalyzeRequest)
if !ok || input.Model == "" {
return echo.ErrBadRequest
}
cfg, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
if !ok || cfg == nil {
return echo.ErrBadRequest
}
audio, cleanup, err := decodeAudioInput(input.Audio)
if err != nil {
return err
}
defer cleanup()
xlog.Debug("VoiceAnalyze", "model", cfg.Name, "backend", cfg.Backend, "actions", input.Actions)
res, err := backend.VoiceAnalyze(audio, input.Actions, ml, appConfig, *cfg)
if err != nil {
return mapBackendError(err)
}
response := schema.VoiceAnalyzeResponse{
Segments: make([]schema.VoiceAnalysis, len(res.GetSegments())),
}
for i, s := range res.GetSegments() {
response.Segments[i] = schema.VoiceAnalysis{
Start: s.GetStart(),
End: s.GetEnd(),
Age: s.GetAge(),
DominantGender: s.GetDominantGender(),
Gender: s.GetGender(),
DominantEmotion: s.GetDominantEmotion(),
Emotion: s.GetEmotion(),
}
}
return c.JSON(http.StatusOK, response)
}
}

View File

@@ -0,0 +1,54 @@
package localai
import (
"net/http"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/middleware"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/xlog"
)
// VoiceEmbedEndpoint extracts a speaker embedding vector from an audio clip.
//
// Distinct from /v1/embeddings, which is OpenAI-compatible and text-only
// by contract. Use this endpoint when you need a speaker-encoder output
// (typically 192-d for ECAPA-TDNN, 256-d for ResNet/WeSpeaker).
//
// @Summary Extract a speaker embedding from an audio clip.
// @Tags voice-recognition
// @Param request body schema.VoiceEmbedRequest true "query params"
// @Success 200 {object} schema.VoiceEmbedResponse "Response"
// @Router /v1/voice/embed [post]
func VoiceEmbedEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) echo.HandlerFunc {
return func(c echo.Context) error {
input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.VoiceEmbedRequest)
if !ok || input.Model == "" {
return echo.ErrBadRequest
}
cfg, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
if !ok || cfg == nil {
return echo.ErrBadRequest
}
audio, cleanup, err := decodeAudioInput(input.Audio)
if err != nil {
return err
}
defer cleanup()
xlog.Debug("VoiceEmbed", "model", cfg.Name, "backend", cfg.Backend)
res, err := backend.VoiceEmbed(audio, ml, appConfig, *cfg)
if err != nil {
return mapBackendError(err)
}
return c.JSON(http.StatusOK, schema.VoiceEmbedResponse{
Embedding: res.GetEmbedding(),
Dim: len(res.GetEmbedding()),
Model: res.GetModel(),
})
}
}

View File

@@ -0,0 +1,45 @@
package localai
import (
"errors"
"net/http"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/http/middleware"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/core/services/voicerecognition"
"github.com/mudler/xlog"
)
// VoiceForgetEndpoint removes a previously-registered speaker by ID.
// @Summary Remove a previously-registered speaker by ID.
// @Tags voice-recognition
// @Param request body schema.VoiceForgetRequest true "query params"
// @Success 204 "No Content"
// @Router /v1/voice/forget [post]
func VoiceForgetEndpoint(registry voicerecognition.Registry) echo.HandlerFunc {
return func(c echo.Context) error {
input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.VoiceForgetRequest)
if !ok {
// Forget doesn't load a model — fall back to a raw bind when
// the request extractor hasn't run (route registered without
// SetModelAndConfig).
input = new(schema.VoiceForgetRequest)
if err := c.Bind(input); err != nil {
return echo.ErrBadRequest
}
}
if input.ID == "" {
return echo.NewHTTPError(http.StatusBadRequest, "id is required")
}
xlog.Debug("VoiceForget", "id", input.ID)
if err := registry.Forget(c.Request().Context(), input.ID); err != nil {
if errors.Is(err, voicerecognition.ErrNotFound) {
return echo.NewHTTPError(http.StatusNotFound, err.Error())
}
return err
}
return c.NoContent(http.StatusNoContent)
}
}

View File

@@ -0,0 +1,82 @@
package localai
import (
"cmp"
"net/http"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/middleware"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/core/services/voicerecognition"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/xlog"
)
// defaultVoiceIdentifyThreshold is the cosine-distance cutoff applied
// when the client does not specify one. Tuned for ECAPA-TDNN on
// VoxCeleb (EER ~1.9%). Other recognizers (WeSpeaker, ERes2Net) may
// need overrides.
const defaultVoiceIdentifyThreshold = float32(0.25)
// VoiceIdentifyEndpoint runs 1:N identification against the registered store.
// @Summary Identify a speaker against the registered database (1:N recognition).
// @Tags voice-recognition
// @Param request body schema.VoiceIdentifyRequest true "query params"
// @Success 200 {object} schema.VoiceIdentifyResponse "Response"
// @Router /v1/voice/identify [post]
func VoiceIdentifyEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig, registry voicerecognition.Registry) echo.HandlerFunc {
return func(c echo.Context) error {
input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.VoiceIdentifyRequest)
if !ok || input.Model == "" {
return echo.ErrBadRequest
}
cfg, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
if !ok || cfg == nil {
return echo.ErrBadRequest
}
audio, cleanup, err := decodeAudioInput(input.Audio)
if err != nil {
return err
}
defer cleanup()
topK := cmp.Or(input.TopK, 5)
threshold := cmp.Or(input.Threshold, defaultVoiceIdentifyThreshold)
xlog.Debug("VoiceIdentify", "model", cfg.Name, "topK", topK, "threshold", threshold)
embed, err := backend.VoiceEmbed(audio, ml, appConfig, *cfg)
if err != nil {
return mapBackendError(err)
}
matches, err := registry.Identify(c.Request().Context(), embed.GetEmbedding(), topK)
if err != nil {
return err
}
response := schema.VoiceIdentifyResponse{
Matches: make([]schema.VoiceIdentifyMatch, len(matches)),
}
for i, m := range matches {
confidence := (1 - m.Distance/threshold) * 100
if confidence < 0 {
confidence = 0
}
if confidence > 100 {
confidence = 100
}
response.Matches[i] = schema.VoiceIdentifyMatch{
ID: m.ID,
Name: m.Metadata.Name,
Labels: m.Metadata.Labels,
Distance: m.Distance,
Confidence: confidence,
Match: m.Distance <= threshold,
}
}
return c.JSON(http.StatusOK, response)
}
}

View File

@@ -0,0 +1,61 @@
package localai
import (
"net/http"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/middleware"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/core/services/voicerecognition"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/xlog"
)
// VoiceRegisterEndpoint enrolls a speaker into the 1:N identification store.
// @Summary Register a speaker for 1:N identification.
// @Tags voice-recognition
// @Param request body schema.VoiceRegisterRequest true "query params"
// @Success 200 {object} schema.VoiceRegisterResponse "Response"
// @Router /v1/voice/register [post]
func VoiceRegisterEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig, registry voicerecognition.Registry) echo.HandlerFunc {
return func(c echo.Context) error {
input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.VoiceRegisterRequest)
if !ok || input.Model == "" {
return echo.ErrBadRequest
}
cfg, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
if !ok || cfg == nil {
return echo.ErrBadRequest
}
if input.Name == "" {
return echo.NewHTTPError(http.StatusBadRequest, "name is required")
}
audio, cleanup, err := decodeAudioInput(input.Audio)
if err != nil {
return err
}
defer cleanup()
xlog.Debug("VoiceRegister", "model", cfg.Name, "name", input.Name)
res, err := backend.VoiceEmbed(audio, ml, appConfig, *cfg)
if err != nil {
return mapBackendError(err)
}
stored, err := registry.Register(c.Request().Context(), res.GetEmbedding(), voicerecognition.Metadata{
Name: input.Name,
Labels: input.Labels,
})
if err != nil {
return err
}
return c.JSON(http.StatusOK, schema.VoiceRegisterResponse{
ID: stored.ID,
Name: stored.Name,
RegisteredAt: stored.RegisteredAt,
})
}
}

View File

@@ -0,0 +1,59 @@
package localai
import (
"net/http"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/middleware"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/xlog"
)
// VoiceVerifyEndpoint compares two audio clips and reports whether they were
// spoken by the same person.
// @Summary Verify that two audio clips were spoken by the same person.
// @Tags voice-recognition
// @Param request body schema.VoiceVerifyRequest true "query params"
// @Success 200 {object} schema.VoiceVerifyResponse "Response"
// @Router /v1/voice/verify [post]
func VoiceVerifyEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) echo.HandlerFunc {
return func(c echo.Context) error {
input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.VoiceVerifyRequest)
if !ok || input.Model == "" {
return echo.ErrBadRequest
}
cfg, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
if !ok || cfg == nil {
return echo.ErrBadRequest
}
audio1, cleanup1, err := decodeAudioInput(input.Audio1)
if err != nil {
return err
}
defer cleanup1()
audio2, cleanup2, err := decodeAudioInput(input.Audio2)
if err != nil {
return err
}
defer cleanup2()
xlog.Debug("VoiceVerify", "model", cfg.Name, "backend", cfg.Backend)
res, err := backend.VoiceVerify(audio1, audio2, input.Threshold, input.AntiSpoofing, ml, appConfig, *cfg)
if err != nil {
return mapBackendError(err)
}
return c.JSON(http.StatusOK, schema.VoiceVerifyResponse{
Verified: res.GetVerified(),
Distance: res.GetDistance(),
Threshold: res.GetThreshold(),
Confidence: res.GetConfidence(),
Model: res.GetModel(),
ProcessingTimeMs: res.GetProcessingTimeMs(),
})
}
}

View File

@@ -13,3 +13,4 @@ export const CAP_VAD = 'FLAG_VAD'
export const CAP_VIDEO = 'FLAG_VIDEO'
export const CAP_DETECTION = 'FLAG_DETECTION'
export const CAP_FACE_RECOGNITION = 'FLAG_FACE_RECOGNITION'
export const CAP_SPEAKER_RECOGNITION = 'FLAG_SPEAKER_RECOGNITION'

View File

@@ -120,6 +120,28 @@ func RegisterLocalAIRoutes(router *echo.Echo,
// Forget does not load a face model — it only needs the registry.
router.POST("/v1/face/forget", localai.FaceForgetEndpoint(app.FaceRegistry()))
// Voice (speaker) recognition endpoints
voiceMw := []echo.MiddlewareFunc{
requestExtractor.BuildFilteredFirstAvailableDefaultModel(config.BuildUsecaseFilterFn(config.FLAG_SPEAKER_RECOGNITION)),
}
router.POST("/v1/voice/verify",
localai.VoiceVerifyEndpoint(cl, ml, appConfig),
append(voiceMw, requestExtractor.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.VoiceVerifyRequest) }))...)
router.POST("/v1/voice/analyze",
localai.VoiceAnalyzeEndpoint(cl, ml, appConfig),
append(voiceMw, requestExtractor.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.VoiceAnalyzeRequest) }))...)
router.POST("/v1/voice/embed",
localai.VoiceEmbedEndpoint(cl, ml, appConfig),
append(voiceMw, requestExtractor.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.VoiceEmbedRequest) }))...)
router.POST("/v1/voice/register",
localai.VoiceRegisterEndpoint(cl, ml, appConfig, app.VoiceRegistry()),
append(voiceMw, requestExtractor.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.VoiceRegisterRequest) }))...)
router.POST("/v1/voice/identify",
localai.VoiceIdentifyEndpoint(cl, ml, appConfig, app.VoiceRegistry()),
append(voiceMw, requestExtractor.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.VoiceIdentifyRequest) }))...)
// Forget does not load a voice model — it only needs the registry.
router.POST("/v1/voice/forget", localai.VoiceForgetEndpoint(app.VoiceRegistry()))
ttsHandler := localai.TTSEndpoint(cl, ml, appConfig)
router.POST("/tts",
ttsHandler,

View File

@@ -290,6 +290,110 @@ type FaceForgetRequest struct {
Store string `json:"store,omitempty"`
}
// ─── Voice (speaker) recognition ───────────────────────────────────
//
// VoiceVerifyRequest compares two audio clips and reports whether they
// were spoken by the same speaker. Audio1/Audio2 accept URL, base64,
// or data-URI (the HTTP layer materialises the bytes to a temp file
// before calling the gRPC backend).
type VoiceVerifyRequest struct {
BasicModelRequest
Audio1 string `json:"audio1"`
Audio2 string `json:"audio2"`
Threshold float32 `json:"threshold,omitempty"`
AntiSpoofing bool `json:"anti_spoofing,omitempty"`
}
type VoiceVerifyResponse struct {
Verified bool `json:"verified"`
Distance float32 `json:"distance"`
Threshold float32 `json:"threshold"`
Confidence float32 `json:"confidence"`
Model string `json:"model"`
ProcessingTimeMs float32 `json:"processing_time_ms,omitempty"`
}
// VoiceAnalyzeRequest asks the backend for demographic attributes
// (age, gender, emotion) inferred from the audio clip.
type VoiceAnalyzeRequest struct {
BasicModelRequest
Audio string `json:"audio"`
Actions []string `json:"actions,omitempty"` // subset of {"age","gender","emotion"}
}
type VoiceAnalyzeResponse struct {
Segments []VoiceAnalysis `json:"segments"`
}
type VoiceAnalysis struct {
Start float32 `json:"start"`
End float32 `json:"end"`
Age float32 `json:"age,omitempty"`
DominantGender string `json:"dominant_gender,omitempty"`
Gender map[string]float32 `json:"gender,omitempty"`
DominantEmotion string `json:"dominant_emotion,omitempty"`
Emotion map[string]float32 `json:"emotion,omitempty"`
}
// VoiceEmbedRequest extracts a speaker embedding from an audio clip.
// Distinct from /v1/embeddings (OpenAI-compatible, text-only) — this
// endpoint accepts URL / base64 / data-URI audio inputs.
type VoiceEmbedRequest struct {
BasicModelRequest
Audio string `json:"audio"`
}
type VoiceEmbedResponse struct {
Embedding []float32 `json:"embedding"`
Dim int `json:"dim"`
Model string `json:"model,omitempty"`
}
// VoiceRegisterRequest enrolls a speaker into the 1:N identification store.
type VoiceRegisterRequest struct {
BasicModelRequest
Audio string `json:"audio"`
Name string `json:"name"`
Labels map[string]string `json:"labels,omitempty"`
Store string `json:"store,omitempty"`
}
type VoiceRegisterResponse struct {
ID string `json:"id"`
Name string `json:"name"`
RegisteredAt time.Time `json:"registered_at"`
}
// VoiceIdentifyRequest runs 1:N recognition: embed the probe and
// return the top-K nearest registered speakers.
type VoiceIdentifyRequest struct {
BasicModelRequest
Audio string `json:"audio"`
TopK int `json:"top_k,omitempty"`
Threshold float32 `json:"threshold,omitempty"`
Store string `json:"store,omitempty"`
}
type VoiceIdentifyResponse struct {
Matches []VoiceIdentifyMatch `json:"matches"`
}
type VoiceIdentifyMatch struct {
ID string `json:"id"`
Name string `json:"name"`
Labels map[string]string `json:"labels,omitempty"`
Distance float32 `json:"distance"`
Confidence float32 `json:"confidence"`
Match bool `json:"match"`
}
// VoiceForgetRequest removes a previously-registered speaker by ID.
type VoiceForgetRequest struct {
BasicModelRequest
ID string `json:"id"`
Store string `json:"store,omitempty"`
}
type ImportModelRequest struct {
URI string `json:"uri"`
Preferences json.RawMessage `json:"preferences,omitempty"`

View File

@@ -174,6 +174,15 @@ func (c *fakeBackendClient) FaceVerify(_ context.Context, _ *pb.FaceVerifyReques
func (c *fakeBackendClient) FaceAnalyze(_ context.Context, _ *pb.FaceAnalyzeRequest, _ ...ggrpc.CallOption) (*pb.FaceAnalyzeResponse, error) {
return nil, nil
}
func (c *fakeBackendClient) VoiceVerify(_ context.Context, _ *pb.VoiceVerifyRequest, _ ...ggrpc.CallOption) (*pb.VoiceVerifyResponse, error) {
return nil, nil
}
func (c *fakeBackendClient) VoiceAnalyze(_ context.Context, _ *pb.VoiceAnalyzeRequest, _ ...ggrpc.CallOption) (*pb.VoiceAnalyzeResponse, error) {
return nil, nil
}
func (c *fakeBackendClient) VoiceEmbed(_ context.Context, _ *pb.VoiceEmbedRequest, _ ...ggrpc.CallOption) (*pb.VoiceEmbedResponse, error) {
return nil, nil
}
func (c *fakeBackendClient) AudioTranscription(_ context.Context, _ *pb.TranscriptRequest, _ ...ggrpc.CallOption) (*pb.TranscriptResult, error) {
return nil, nil
}

View File

@@ -99,6 +99,18 @@ func (f *fakeGRPCBackend) FaceAnalyze(_ context.Context, _ *pb.FaceAnalyzeReques
return &pb.FaceAnalyzeResponse{}, nil
}
func (f *fakeGRPCBackend) VoiceVerify(_ context.Context, _ *pb.VoiceVerifyRequest, _ ...ggrpc.CallOption) (*pb.VoiceVerifyResponse, error) {
return &pb.VoiceVerifyResponse{}, nil
}
func (f *fakeGRPCBackend) VoiceAnalyze(_ context.Context, _ *pb.VoiceAnalyzeRequest, _ ...ggrpc.CallOption) (*pb.VoiceAnalyzeResponse, error) {
return &pb.VoiceAnalyzeResponse{}, nil
}
func (f *fakeGRPCBackend) VoiceEmbed(_ context.Context, _ *pb.VoiceEmbedRequest, _ ...ggrpc.CallOption) (*pb.VoiceEmbedResponse, error) {
return &pb.VoiceEmbedResponse{}, nil
}
func (f *fakeGRPCBackend) AudioTranscription(_ context.Context, _ *pb.TranscriptRequest, _ ...ggrpc.CallOption) (*pb.TranscriptResult, error) {
return &pb.TranscriptResult{}, nil
}

View File

@@ -0,0 +1,58 @@
// Package voicerecognition provides a swappable backing store for
// speaker embeddings and the 1:N identification pipeline on top of it.
//
// Mirrors the facerecognition package — the audio analog. The current
// implementation (NewStoreRegistry) is backed by LocalAI's in-memory
// local-store gRPC backend, so all registrations are lost on restart.
//
// TODO: share a persistent pgvector-backed implementation with
// facerecognition once the first one lands. The Registry interface
// here is intentionally identical in shape, so a shared generic
// biometric registry can replace both without HTTP-handler churn.
package voicerecognition
import (
"context"
"errors"
"time"
)
// Registry stores speaker embeddings keyed by an opaque ID and
// supports approximate similarity search. Implementations are expected
// to be safe for concurrent use.
type Registry interface {
// Register stores a speaker embedding alongside its metadata.
// Returns the stored metadata with ID and RegisteredAt populated.
Register(ctx context.Context, embedding []float32, meta Metadata) (Metadata, error)
// Identify returns up to topK matches for the probe embedding,
// sorted by ascending distance (closest first).
Identify(ctx context.Context, probe []float32, topK int) ([]Match, error)
// Forget removes a previously-registered embedding by ID.
// Returns ErrNotFound if the ID is unknown.
Forget(ctx context.Context, id string) error
}
// Metadata is the user-supplied payload stored alongside a speaker embedding.
type Metadata struct {
// ID is populated by the registry at Register time; callers must not set it.
ID string `json:"id"`
Name string `json:"name"`
Labels map[string]string `json:"labels,omitempty"`
RegisteredAt time.Time `json:"registered_at"`
}
// Match is a single result from Identify, ranked by similarity.
type Match struct {
ID string
Metadata Metadata
Distance float32 // 1 - cosine_similarity; lower = closer
}
// Sentinel errors; callers should compare with errors.Is.
var (
ErrNotFound = errors.New("voicerecognition: id not found")
ErrEmptyEmbedding = errors.New("voicerecognition: embedding is empty")
ErrDimensionMismatch = errors.New("voicerecognition: embedding dimension mismatch")
)

View File

@@ -0,0 +1,138 @@
package voicerecognition
import (
"context"
"encoding/json"
"fmt"
"sort"
"sync"
"time"
"github.com/google/uuid"
"github.com/mudler/LocalAI/pkg/grpc"
"github.com/mudler/LocalAI/pkg/store"
)
// StoreResolver resolves a named vector store to a gRPC backend. The
// HTTP handler layer wires this to backend.StoreBackend so the
// registry stays decoupled from ModelLoader plumbing.
type StoreResolver func(ctx context.Context, storeName string) (grpc.Backend, error)
// NewStoreRegistry returns a Registry backed by LocalAI's generic
// StoresSet / StoresFind / StoresDelete gRPC surface.
//
// storeName selects which vector-store model to use (defaults to the
// local-store Go backend). `dim` is the expected embedding dimension;
// pass 0 to accept whatever dimension arrives (useful when the voice
// backend exposes recognizers of different sizes, e.g. ECAPA-TDNN at
// 192 vs ResNet at 256).
func NewStoreRegistry(resolve StoreResolver, storeName string, dim int) Registry {
return &storeRegistry{
resolve: resolve,
storeName: storeName,
dim: dim,
}
}
type storeRegistry struct {
resolve StoreResolver
storeName string
dim int
// TODO(postgres): the local-store gRPC surface keys by embedding
// vector and exposes no "list all" method, so we cannot delete by
// ID without remembering the embedding. This in-memory index is
// rebuilt on every Register and lost on restart — acceptable while
// the only implementation is itself in-memory.
idIndex sync.Map // map[string][]float32
}
func (r *storeRegistry) Register(ctx context.Context, embedding []float32, meta Metadata) (Metadata, error) {
if len(embedding) == 0 {
return Metadata{}, ErrEmptyEmbedding
}
if r.dim != 0 && len(embedding) != r.dim {
return Metadata{}, fmt.Errorf("%w: expected %d, got %d", ErrDimensionMismatch, r.dim, len(embedding))
}
backend, err := r.resolve(ctx, r.storeName)
if err != nil {
return Metadata{}, fmt.Errorf("voicerecognition: resolve store: %w", err)
}
meta.ID = uuid.NewString()
if meta.RegisteredAt.IsZero() {
meta.RegisteredAt = time.Now().UTC()
}
payload, err := json.Marshal(meta)
if err != nil {
return Metadata{}, fmt.Errorf("voicerecognition: marshal metadata: %w", err)
}
if err := store.SetSingle(ctx, backend, embedding, payload); err != nil {
return Metadata{}, fmt.Errorf("voicerecognition: set: %w", err)
}
embCopy := append([]float32(nil), embedding...)
r.idIndex.Store(meta.ID, embCopy)
return meta, nil
}
func (r *storeRegistry) Identify(ctx context.Context, probe []float32, topK int) ([]Match, error) {
if len(probe) == 0 {
return nil, ErrEmptyEmbedding
}
if r.dim != 0 && len(probe) != r.dim {
return nil, fmt.Errorf("%w: expected %d, got %d", ErrDimensionMismatch, r.dim, len(probe))
}
if topK <= 0 {
topK = 5
}
backend, err := r.resolve(ctx, r.storeName)
if err != nil {
return nil, fmt.Errorf("voicerecognition: resolve store: %w", err)
}
_, values, similarities, err := store.Find(ctx, backend, probe, topK)
if err != nil {
return nil, fmt.Errorf("voicerecognition: find: %w", err)
}
matches := make([]Match, 0, len(values))
for i, raw := range values {
var meta Metadata
if err := json.Unmarshal(raw, &meta); err != nil {
// Shared stores may contain non-voice records; skip them.
continue
}
matches = append(matches, Match{
ID: meta.ID,
Metadata: meta,
Distance: 1 - similarities[i],
})
}
sort.SliceStable(matches, func(i, j int) bool { return matches[i].Distance < matches[j].Distance })
return matches, nil
}
func (r *storeRegistry) Forget(ctx context.Context, id string) error {
raw, ok := r.idIndex.Load(id)
if !ok {
return ErrNotFound
}
embedding := raw.([]float32)
backend, err := r.resolve(ctx, r.storeName)
if err != nil {
return fmt.Errorf("voicerecognition: resolve store: %w", err)
}
if err := store.DeleteSingle(ctx, backend, embedding); err != nil {
return fmt.Errorf("voicerecognition: delete: %w", err)
}
r.idIndex.Delete(id)
return nil
}

View File

@@ -26,6 +26,9 @@ const (
BackendTraceDetection BackendTraceType = "detection"
BackendTraceFaceVerify BackendTraceType = "face_verify"
BackendTraceFaceAnalyze BackendTraceType = "face_analyze"
BackendTraceVoiceVerify BackendTraceType = "voice_verify"
BackendTraceVoiceAnalyze BackendTraceType = "voice_analyze"
BackendTraceVoiceEmbed BackendTraceType = "voice_embed"
BackendTraceModelLoad BackendTraceType = "model_load"
)

View File

@@ -0,0 +1,247 @@
+++
disableToc = false
title = "Voice Recognition"
weight = 15
url = "/features/voice-recognition/"
+++
LocalAI supports voice (speaker) recognition through the
`speaker-recognition` backend: speaker verification (1:1), speaker
identification (1:N) against a built-in vector store, speaker
embedding, and demographic analysis (age / gender / emotion from
voice).
It is the audio analog to [Face Recognition](/features/face-recognition/)
and follows the same two-engine pattern under one image.
## Engines
| Gallery entry | Model | Size | License |
|---|---|---|---|
| `speechbrain-ecapa-tdnn` | ECAPA-TDNN on VoxCeleb (SpeechBrain) | ~17 MB | **Apache 2.0 — commercial-safe** |
| `wespeaker-resnet34` | WeSpeaker ResNet34 ONNX | ~26 MB | **Apache 2.0 — commercial-safe** |
Both entries are commercial-safe Apache 2.0. SpeechBrain is the
default — it's a lightweight pure-PyTorch checkpoint that
auto-downloads on first use. The `wespeaker-resnet34` entry wires
the direct-ONNX path for CPU-only deployments that don't want the
torch runtime.
## Quickstart
Install the default backend and model:
```bash
local-ai models install speechbrain-ecapa-tdnn
```
Verify that two audio clips were spoken by the same person:
```bash
curl -sX POST http://localhost:8080/v1/voice/verify \
-H "Content-Type: application/json" \
-d '{
"model": "speechbrain-ecapa-tdnn",
"audio1": "https://example.com/alice_1.wav",
"audio2": "https://example.com/alice_2.wav"
}'
```
Response:
```json
{
"verified": true,
"distance": 0.18,
"threshold": 0.25,
"confidence": 28.0,
"model": "speechbrain-ecapa-tdnn",
"processing_time_ms": 340.0
}
```
## 1:N identification workflow (register → identify → forget)
Same flow as face recognition, same in-memory vector store under the
hood.
1. Register known speakers:
```bash
curl -sX POST http://localhost:8080/v1/voice/register \
-H "Content-Type: application/json" \
-d '{
"model": "speechbrain-ecapa-tdnn",
"name": "Alice",
"audio": "https://example.com/alice.wav"
}'
# → {"id": "b2f...", "name": "Alice", "registered_at": "2026-04-22T..."}
```
2. Identify an unknown probe:
```bash
curl -sX POST http://localhost:8080/v1/voice/identify \
-H "Content-Type: application/json" \
-d '{
"model": "speechbrain-ecapa-tdnn",
"audio": "https://example.com/unknown.wav",
"top_k": 5
}'
# → {"matches": [{"id":"b2f...","name":"Alice","distance":0.19,"match":true,...}]}
```
3. Remove a speaker by ID:
```bash
curl -sX POST http://localhost:8080/v1/voice/forget \
-H "Content-Type: application/json" \
-d '{"id": "b2f..."}'
# → 204 No Content
```
{{% alert icon="⚠️" color="warning" %}}
**Storage caveat.** The default vector store is in-memory. All
registered speakers are lost when LocalAI restarts. Persistent storage
(pgvector) is a tracked future enhancement shared with face
recognition — the voice-recognition HTTP API is designed to swap the
backing store without changing the wire format.
{{% /alert %}}
## API reference
### `POST /v1/voice/verify` (1:1)
| field | type | description |
|---|---|---|
| `model` | string | gallery entry name (e.g. `speechbrain-ecapa-tdnn`) |
| `audio1`, `audio2` | string | URL, base64, or data-URI of an audio file |
| `threshold` | float, optional | cosine-distance cutoff; default 0.25 for ECAPA-TDNN |
| `anti_spoofing` | bool, optional | reserved — unused in the current release |
Returns `verified`, `distance`, `threshold`, `confidence`, `model`,
and `processing_time_ms`.
### `POST /v1/voice/analyze`
Returns demographic attributes (age, gender, emotion) inferred from
speech:
| field | type | description |
|---|---|---|
| `model` | string | gallery entry |
| `audio` | string | URL / base64 / data-URI |
| `actions` | string[] | subset of `["age","gender","emotion"]`; empty = all supported |
Emotion is inferred from the SUPERB emotion-recognition checkpoint
(`superb/wav2vec2-base-superb-er`, Apache 2.0) — 4-way categorical
neutral / happy / angry / sad. The model auto-downloads on the first
analyze call.
Age and gender are **opt-in**: no standard-transformers checkpoint
with a clean classifier head is shipped as the default. The
high-accuracy Audeering age/gender model uses a custom multi-task
head that `AutoModelForAudioClassification` doesn't load safely
(the age weights are silently dropped and the classifier is
re-initialised with random values). To enable age/gender, set
`age_gender_model:<repo>` in the model YAML's `options:` pointing at
a checkpoint with a vanilla `Wav2Vec2ForSequenceClassification`
head. Override the emotion default similarly via `emotion_model:`.
Set either to an empty string to disable that head.
If a head fails to load (offline, disk full, `transformers`
missing), the engine degrades gracefully: it still returns the
attributes it could compute. When nothing can be computed, the
backend returns `501 Not Implemented`.
Analyze is supported by both `speechbrain-ecapa-tdnn` and
`wespeaker-resnet34` — the speaker recognizer and the analysis head
are independent.
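A minimal analyze call might look like this (the audio URL is a
placeholder; with the stock configuration only the emotion head is
active, so `actions` is restricted accordingly):

```bash
curl -sX POST http://localhost:8080/v1/voice/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "audio": "https://example.com/alice.wav",
    "actions": ["emotion"]
  }'
# → {"segments": [{"start": 0, "end": ..., "dominant_emotion": "...", "emotion": {...}}]}
```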
### `POST /v1/voice/register` (1:N enrollment)
| field | type | description |
|---|---|---|
| `model` | string | voice recognition model |
| `audio` | string | speaker audio to enroll |
| `name` | string | human-readable label |
| `labels` | map[string]string, optional | arbitrary metadata |
| `store` | string, optional | vector store model; defaults to local-store |
Returns `{id, name, registered_at}`. The `id` is an opaque UUID used
by `/v1/voice/identify` and `/v1/voice/forget`.
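Registration accepts arbitrary `labels`, which are stored alongside
the embedding and echoed back by `/v1/voice/identify`. A sketch
(placeholder URL and label values):

```bash
curl -sX POST http://localhost:8080/v1/voice/register \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "name": "Alice",
    "labels": {"team": "support", "locale": "en-GB"},
    "audio": "https://example.com/alice.wav"
  }'
```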
### `POST /v1/voice/identify` (1:N recognition)
| field | type | description |
|---|---|---|
| `model` | string | voice recognition model |
| `audio` | string | probe audio |
| `top_k` | int, optional | max matches to return; default 5 |
| `threshold` | float, optional | cosine-distance cutoff; default 0.25 |
| `store` | string, optional | vector store model |
Returns a list of matches sorted by ascending distance, each with
`id`, `name`, `labels`, `distance`, `confidence`, and `match`
(`distance ≤ threshold`).
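Confidence is derived from the match distance relative to the active
threshold and clamped to the 0 to 100 range, so overriding
`threshold` also rescales the reported confidences. A call setting
both knobs (placeholder URL):

```bash
curl -sX POST http://localhost:8080/v1/voice/identify \
  -H "Content-Type: application/json" \
  -d '{
    "model": "wespeaker-resnet34",
    "audio": "https://example.com/unknown.wav",
    "top_k": 3,
    "threshold": 0.30
  }'
```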
### `POST /v1/voice/forget`
| field | type | description |
|---|---|---|
| `id` | string | ID returned by `/v1/voice/register` |
Returns `204 No Content` on success, `404 Not Found` if the ID is
unknown.
### `POST /v1/voice/embed`
Returns the L2-normalized speaker embedding vector.
| field | type | description |
|---|---|---|
| `model` | string | voice model |
| `audio` | string | URL / base64 / data-URI |
Returns `{embedding: float[], dim: int, model: string}`. Dimension
depends on the recognizer: 192 for ECAPA-TDNN, 256 for WeSpeaker
ResNet34.
> **Note:** the OpenAI-compatible `/v1/embeddings` endpoint is
> intentionally text-only — it does nothing useful with audio input.
> Use `/v1/voice/embed` for audio.
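For example (placeholder URL; the embedding values depend on the
clip):

```bash
curl -sX POST http://localhost:8080/v1/voice/embed \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "audio": "https://example.com/alice.wav"
  }'
# → {"embedding": [ ... ], "dim": 192, "model": "..."}
```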
## Audio input
Audio is materialised by the HTTP layer to a temporary WAV file
before the gRPC call. All audio fields accept:
- `http://` / `https://` URLs (downloaded server-side, subject to
`ValidateExternalURL` safety checks).
- Raw base64 (no prefix).
- Data URIs (`data:audio/wav;base64,...`).
The backend itself always receives a filesystem path — the same
convention the Whisper / Voxtral transcription backends use.
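A sketch of sending a local file as raw base64 or as a data URI (the
filename is only an example; `tr -d '\n'` keeps the payload on one
line):

```bash
AUDIO_B64=$(base64 < sample.wav | tr -d '\n')

# Raw base64 (no prefix)
curl -sX POST http://localhost:8080/v1/voice/embed \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"speechbrain-ecapa-tdnn\", \"audio\": \"${AUDIO_B64}\"}"

# Data URI
curl -sX POST http://localhost:8080/v1/voice/embed \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"speechbrain-ecapa-tdnn\", \"audio\": \"data:audio/wav;base64,${AUDIO_B64}\"}"
```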
## Threshold reference
| Recognizer | Cosine-distance threshold |
|---|---|
| ECAPA-TDNN (SpeechBrain, VoxCeleb) | ~0.25 |
| WeSpeaker ResNet34 | ~0.30 |
| 3D-Speaker ERes2Net | ~0.28 |
Pass `threshold` explicitly when switching recognizers: the
server-side default (0.25, tuned for ECAPA-TDNN) only applies when
the field is omitted, as in the example below.
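For instance, a verify call against the ONNX recognizer with the
suggested cutoff (URLs are placeholders):

```bash
curl -sX POST http://localhost:8080/v1/voice/verify \
  -H "Content-Type: application/json" \
  -d '{
    "model": "wespeaker-resnet34",
    "audio1": "https://example.com/a.wav",
    "audio2": "https://example.com/b.wav",
    "threshold": 0.30
  }'
```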
## Related features
- [Face Recognition](/features/face-recognition/) — the image analog;
the two share a registry design.
- [Audio to Text](/features/audio-to-text/) — transcription (Whisper,
Voxtral, faster-whisper). Runs in addition to, not instead of,
voice recognition.
- [Stores](/features/stores/) — the generic vector store powering
both the face and voice 1:N recognition pipelines.
- [Embeddings](/features/embeddings/) — text-only OpenAI-compatible
embedding endpoint; for audio embeddings use `/v1/voice/embed`.

View File

@@ -3993,6 +3993,57 @@
- filename: face_recognition_sface_2021dec_int8.onnx
sha256: 2b0e941e6f16cc048c20aee0c8e31f569118f65d702914540f7bfdc14048d78a
uri: https://github.com/opencv/opencv_zoo/raw/main/models/face_recognition_sface/face_recognition_sface_2021dec_int8.onnx
- &speechbrain_ecapa_tdnn
name: "speechbrain-ecapa-tdnn"
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
license: apache-2.0
description: |
Speaker (voice) recognition with SpeechBrain's ECAPA-TDNN trained
on VoxCeleb. 192-d L2-normalised embeddings, ~1.9% Equal Error
Rate on VoxCeleb1-O. Apache 2.0 — commercial-safe.
The checkpoint is auto-downloaded from HuggingFace on first
LoadModel (no separate weight file in gallery `files:`). Points at
the upstream SpeechBrain HF repo directly — same bytes every
deployment.
tags: [voice-recognition, speaker-verification, speaker-embedding, commercial-ok, cpu, gpu]
urls:
- https://speechbrain.github.io/
- https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb
overrides:
backend: speaker-recognition
parameters: {model: speechbrain/spkrec-ecapa-voxceleb}
options:
- "engine:speechbrain"
- "source:speechbrain/spkrec-ecapa-voxceleb"
known_usecases: [speaker_recognition]
- &wespeaker_resnet34
name: "wespeaker-resnet34"
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
license: apache-2.0
description: |
Speaker recognition with WeSpeaker's ResNet34 trained on VoxCeleb,
exported to ONNX. 256-d embeddings, CPU-friendly — avoids the
PyTorch runtime entirely (onnxruntime only). Apache 2.0.
Pair with the `speaker-recognition` backend's OnnxDirectEngine.
Use when ECAPA-TDNN's torch dependency is undesirable (small
images, edge deployments).
tags: [voice-recognition, speaker-verification, speaker-embedding, commercial-ok, edge, cpu]
urls:
- https://github.com/wenet-e2e/wespeaker
overrides:
backend: speaker-recognition
parameters: {model: wespeaker_voxceleb_resnet34.onnx}
options:
- "engine:onnx"
- "model_path:wespeaker_voxceleb_resnet34.onnx"
- "sample_rate:16000"
known_usecases: [speaker_recognition]
files:
- filename: wespeaker_voxceleb_resnet34.onnx
sha256: 7bb2f06e9df17cdf1ef14ee8a15ab08ed28e8d0ef5054ee135741560df2ec068
uri: https://huggingface.co/Wespeaker/wespeaker-voxceleb-resnet34-LM/resolve/main/voxceleb_resnet34_LM.onnx
- &rfdetr
name: "rfdetr-base"
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"

View File

@@ -56,6 +56,9 @@ type Backend interface {
Detect(ctx context.Context, in *pb.DetectOptions, opts ...grpc.CallOption) (*pb.DetectResponse, error)
FaceVerify(ctx context.Context, in *pb.FaceVerifyRequest, opts ...grpc.CallOption) (*pb.FaceVerifyResponse, error)
FaceAnalyze(ctx context.Context, in *pb.FaceAnalyzeRequest, opts ...grpc.CallOption) (*pb.FaceAnalyzeResponse, error)
VoiceVerify(ctx context.Context, in *pb.VoiceVerifyRequest, opts ...grpc.CallOption) (*pb.VoiceVerifyResponse, error)
VoiceAnalyze(ctx context.Context, in *pb.VoiceAnalyzeRequest, opts ...grpc.CallOption) (*pb.VoiceAnalyzeResponse, error)
VoiceEmbed(ctx context.Context, in *pb.VoiceEmbedRequest, opts ...grpc.CallOption) (*pb.VoiceEmbedResponse, error)
AudioTranscription(ctx context.Context, in *pb.TranscriptRequest, opts ...grpc.CallOption) (*pb.TranscriptResult, error)
AudioTranscriptionStream(ctx context.Context, in *pb.TranscriptRequest, f func(chunk *pb.TranscriptStreamResponse), opts ...grpc.CallOption) error
TokenizeString(ctx context.Context, in *pb.PredictOptions, opts ...grpc.CallOption) (*pb.TokenizationResponse, error)

View File

@@ -89,6 +89,18 @@ func (llm *Base) FaceAnalyze(*pb.FaceAnalyzeRequest) (pb.FaceAnalyzeResponse, er
return pb.FaceAnalyzeResponse{}, fmt.Errorf("unimplemented")
}
func (llm *Base) VoiceVerify(*pb.VoiceVerifyRequest) (pb.VoiceVerifyResponse, error) {
return pb.VoiceVerifyResponse{}, fmt.Errorf("unimplemented")
}
func (llm *Base) VoiceAnalyze(*pb.VoiceAnalyzeRequest) (pb.VoiceAnalyzeResponse, error) {
return pb.VoiceAnalyzeResponse{}, fmt.Errorf("unimplemented")
}
func (llm *Base) VoiceEmbed(*pb.VoiceEmbedRequest) (pb.VoiceEmbedResponse, error) {
return pb.VoiceEmbedResponse{}, fmt.Errorf("unimplemented")
}
func (llm *Base) TokenizeString(opts *pb.PredictOptions) (pb.TokenizationResponse, error) {
return pb.TokenizationResponse{}, fmt.Errorf("unimplemented")
}

View File

@@ -616,6 +616,60 @@ func (c *Client) FaceAnalyze(ctx context.Context, in *pb.FaceAnalyzeRequest, opt
return client.FaceAnalyze(ctx, in, opts...)
}
func (c *Client) VoiceVerify(ctx context.Context, in *pb.VoiceVerifyRequest, opts ...grpc.CallOption) (*pb.VoiceVerifyResponse, error) {
if !c.parallel {
c.opMutex.Lock()
defer c.opMutex.Unlock()
}
c.setBusy(true)
defer c.setBusy(false)
c.wdMark()
defer c.wdUnMark()
conn, err := c.dial()
if err != nil {
return nil, err
}
defer conn.Close()
client := pb.NewBackendClient(conn)
return client.VoiceVerify(ctx, in, opts...)
}
func (c *Client) VoiceAnalyze(ctx context.Context, in *pb.VoiceAnalyzeRequest, opts ...grpc.CallOption) (*pb.VoiceAnalyzeResponse, error) {
if !c.parallel {
c.opMutex.Lock()
defer c.opMutex.Unlock()
}
c.setBusy(true)
defer c.setBusy(false)
c.wdMark()
defer c.wdUnMark()
conn, err := c.dial()
if err != nil {
return nil, err
}
defer conn.Close()
client := pb.NewBackendClient(conn)
return client.VoiceAnalyze(ctx, in, opts...)
}
func (c *Client) VoiceEmbed(ctx context.Context, in *pb.VoiceEmbedRequest, opts ...grpc.CallOption) (*pb.VoiceEmbedResponse, error) {
if !c.parallel {
c.opMutex.Lock()
defer c.opMutex.Unlock()
}
c.setBusy(true)
defer c.setBusy(false)
c.wdMark()
defer c.wdUnMark()
conn, err := c.dial()
if err != nil {
return nil, err
}
defer conn.Close()
client := pb.NewBackendClient(conn)
return client.VoiceEmbed(ctx, in, opts...)
}
func (c *Client) AudioEncode(ctx context.Context, in *pb.AudioEncodeRequest, opts ...grpc.CallOption) (*pb.AudioEncodeResult, error) {
if !c.parallel {
c.opMutex.Lock()

View File

@@ -79,6 +79,18 @@ func (e *embedBackend) FaceAnalyze(ctx context.Context, in *pb.FaceAnalyzeReques
return e.s.FaceAnalyze(ctx, in)
}
func (e *embedBackend) VoiceVerify(ctx context.Context, in *pb.VoiceVerifyRequest, opts ...grpc.CallOption) (*pb.VoiceVerifyResponse, error) {
return e.s.VoiceVerify(ctx, in)
}
func (e *embedBackend) VoiceAnalyze(ctx context.Context, in *pb.VoiceAnalyzeRequest, opts ...grpc.CallOption) (*pb.VoiceAnalyzeResponse, error) {
return e.s.VoiceAnalyze(ctx, in)
}
func (e *embedBackend) VoiceEmbed(ctx context.Context, in *pb.VoiceEmbedRequest, opts ...grpc.CallOption) (*pb.VoiceEmbedResponse, error) {
return e.s.VoiceEmbed(ctx, in)
}
func (e *embedBackend) AudioTranscription(ctx context.Context, in *pb.TranscriptRequest, opts ...grpc.CallOption) (*pb.TranscriptResult, error) {
return e.s.AudioTranscription(ctx, in)
}

View File

@@ -19,6 +19,9 @@ type AIModel interface {
Detect(*pb.DetectOptions) (pb.DetectResponse, error)
FaceVerify(*pb.FaceVerifyRequest) (pb.FaceVerifyResponse, error)
FaceAnalyze(*pb.FaceAnalyzeRequest) (pb.FaceAnalyzeResponse, error)
VoiceVerify(*pb.VoiceVerifyRequest) (pb.VoiceVerifyResponse, error)
VoiceAnalyze(*pb.VoiceAnalyzeRequest) (pb.VoiceAnalyzeResponse, error)
VoiceEmbed(*pb.VoiceEmbedRequest) (pb.VoiceEmbedResponse, error)
AudioTranscription(*pb.TranscriptRequest) (pb.TranscriptResult, error)
AudioTranscriptionStream(*pb.TranscriptRequest, chan *pb.TranscriptStreamResponse) error
TTS(*pb.TTSRequest) error

View File

@@ -175,6 +175,42 @@ func (s *server) FaceAnalyze(ctx context.Context, in *pb.FaceAnalyzeRequest) (*p
return &res, nil
}
func (s *server) VoiceVerify(ctx context.Context, in *pb.VoiceVerifyRequest) (*pb.VoiceVerifyResponse, error) {
if s.llm.Locking() {
s.llm.Lock()
defer s.llm.Unlock()
}
res, err := s.llm.VoiceVerify(in)
if err != nil {
return nil, err
}
return &res, nil
}
func (s *server) VoiceAnalyze(ctx context.Context, in *pb.VoiceAnalyzeRequest) (*pb.VoiceAnalyzeResponse, error) {
if s.llm.Locking() {
s.llm.Lock()
defer s.llm.Unlock()
}
res, err := s.llm.VoiceAnalyze(in)
if err != nil {
return nil, err
}
return &res, nil
}
func (s *server) VoiceEmbed(ctx context.Context, in *pb.VoiceEmbedRequest) (*pb.VoiceEmbedResponse, error) {
if s.llm.Locking() {
s.llm.Lock()
defer s.llm.Unlock()
}
res, err := s.llm.VoiceEmbed(in)
if err != nil {
return nil, err
}
return &res, nil
}
func (s *server) AudioTranscription(ctx context.Context, in *pb.TranscriptRequest) (*pb.TranscriptResult, error) {
if s.llm.Locking() {
s.llm.Lock()

View File

@@ -1166,6 +1166,25 @@ const docTemplate = `{
}
}
},
"/backends/known": {
"get": {
"tags": [
"backends"
],
"summary": "List all known Backends (importer registry + curated pref-only + installed-on-disk)",
"responses": {
"200": {
"description": "Response",
"schema": {
"type": "array",
"items": {
"$ref": "#/definitions/schema.KnownBackend"
}
}
}
}
}
},
"/backends/upgrade/{name}": {
"post": {
"tags": [
@@ -2261,6 +2280,165 @@ const docTemplate = `{
}
}
},
"/v1/voice/analyze": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Analyze demographic attributes (age, gender, emotion) from a voice clip.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceAnalyzeRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceAnalyzeResponse"
}
}
}
}
},
"/v1/voice/embed": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Extract a speaker embedding from an audio clip.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceEmbedRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceEmbedResponse"
}
}
}
}
},
"/v1/voice/forget": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Remove a previously-registered speaker by ID.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceForgetRequest"
}
}
],
"responses": {
"204": {
"description": "No Content"
}
}
}
},
"/v1/voice/identify": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Identify a speaker against the registered database (1:N recognition).",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceIdentifyRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceIdentifyResponse"
}
}
}
}
},
"/v1/voice/register": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Register a speaker for 1:N identification.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceRegisterRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceRegisterResponse"
}
}
}
}
},
"/v1/voice/verify": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Verify that two audio clips were spoken by the same person.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceVerifyRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceVerifyResponse"
}
}
}
}
},
"/vad": {
"post": {
"consumes": [
@@ -3850,6 +4028,27 @@ const docTemplate = `{
}
}
},
"schema.KnownBackend": {
"type": "object",
"properties": {
"auto_detect": {
"type": "boolean"
},
"description": {
"type": "string"
},
"installed": {
"description": "Installed is true when the backend is currently present on disk — i.e. it\nappears in gallery.ListSystemBackends(systemState). Importer-registered or\ncurated pref-only backends default to false unless they also show up on\ndisk. The import form uses this to warn users that submitting an import\nmay trigger an automatic backend download.",
"type": "boolean"
},
"modality": {
"type": "string"
},
"name": {
"type": "string"
}
}
},
"schema.LogprobContent": {
"type": "object",
"properties": {
@@ -5098,6 +5297,248 @@ const docTemplate = `{
}
}
},
"schema.VoiceAnalysis": {
"type": "object",
"properties": {
"age": {
"type": "number"
},
"dominant_emotion": {
"type": "string"
},
"dominant_gender": {
"type": "string"
},
"emotion": {
"type": "object",
"additionalProperties": {
"type": "number",
"format": "float32"
}
},
"end": {
"type": "number"
},
"gender": {
"type": "object",
"additionalProperties": {
"type": "number",
"format": "float32"
}
},
"start": {
"type": "number"
}
}
},
"schema.VoiceAnalyzeRequest": {
"type": "object",
"properties": {
"actions": {
"description": "subset of {\"age\",\"gender\",\"emotion\"}",
"type": "array",
"items": {
"type": "string"
}
},
"audio": {
"type": "string"
},
"model": {
"type": "string"
}
}
},
"schema.VoiceAnalyzeResponse": {
"type": "object",
"properties": {
"segments": {
"type": "array",
"items": {
"$ref": "#/definitions/schema.VoiceAnalysis"
}
}
}
},
"schema.VoiceEmbedRequest": {
"type": "object",
"properties": {
"audio": {
"type": "string"
},
"model": {
"type": "string"
}
}
},
"schema.VoiceEmbedResponse": {
"type": "object",
"properties": {
"dim": {
"type": "integer"
},
"embedding": {
"type": "array",
"items": {
"type": "number"
}
},
"model": {
"type": "string"
}
}
},
"schema.VoiceForgetRequest": {
"type": "object",
"properties": {
"id": {
"type": "string"
},
"model": {
"type": "string"
},
"store": {
"type": "string"
}
}
},
"schema.VoiceIdentifyMatch": {
"type": "object",
"properties": {
"confidence": {
"type": "number"
},
"distance": {
"type": "number"
},
"id": {
"type": "string"
},
"labels": {
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"match": {
"type": "boolean"
},
"name": {
"type": "string"
}
}
},
"schema.VoiceIdentifyRequest": {
"type": "object",
"properties": {
"audio": {
"type": "string"
},
"model": {
"type": "string"
},
"store": {
"type": "string"
},
"threshold": {
"type": "number"
},
"top_k": {
"type": "integer"
}
}
},
"schema.VoiceIdentifyResponse": {
"type": "object",
"properties": {
"matches": {
"type": "array",
"items": {
"$ref": "#/definitions/schema.VoiceIdentifyMatch"
}
}
}
},
"schema.VoiceRegisterRequest": {
"type": "object",
"properties": {
"audio": {
"type": "string"
},
"labels": {
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"model": {
"type": "string"
},
"name": {
"type": "string"
},
"store": {
"type": "string"
}
}
},
"schema.VoiceRegisterResponse": {
"type": "object",
"properties": {
"id": {
"type": "string"
},
"name": {
"type": "string"
},
"registered_at": {
"type": "string"
}
}
},
"schema.VoiceVerifyRequest": {
"type": "object",
"properties": {
"anti_spoofing": {
"type": "boolean"
},
"audio1": {
"type": "string"
},
"audio2": {
"type": "string"
},
"model": {
"type": "string"
},
"threshold": {
"type": "number"
}
}
},
"schema.VoiceVerifyResponse": {
"type": "object",
"properties": {
"confidence": {
"type": "number"
},
"distance": {
"type": "number"
},
"model": {
"type": "string"
},
"processing_time_ms": {
"type": "number"
},
"threshold": {
"type": "number"
},
"verified": {
"type": "boolean"
}
}
},
"schema.WebhookConfig": {
"type": "object",
"properties": {

View File

@@ -1163,6 +1163,25 @@
}
}
},
"/backends/known": {
"get": {
"tags": [
"backends"
],
"summary": "List all known Backends (importer registry + curated pref-only + installed-on-disk)",
"responses": {
"200": {
"description": "Response",
"schema": {
"type": "array",
"items": {
"$ref": "#/definitions/schema.KnownBackend"
}
}
}
}
}
},
"/backends/upgrade/{name}": {
"post": {
"tags": [
@@ -2258,6 +2277,165 @@
}
}
},
"/v1/voice/analyze": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Analyze demographic attributes (age, gender, emotion) from a voice clip.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceAnalyzeRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceAnalyzeResponse"
}
}
}
}
},
"/v1/voice/embed": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Extract a speaker embedding from an audio clip.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceEmbedRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceEmbedResponse"
}
}
}
}
},
"/v1/voice/forget": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Remove a previously-registered speaker by ID.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceForgetRequest"
}
}
],
"responses": {
"204": {
"description": "No Content"
}
}
}
},
"/v1/voice/identify": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Identify a speaker against the registered database (1:N recognition).",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceIdentifyRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceIdentifyResponse"
}
}
}
}
},
"/v1/voice/register": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Register a speaker for 1:N identification.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceRegisterRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceRegisterResponse"
}
}
}
}
},
"/v1/voice/verify": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Verify that two audio clips were spoken by the same person.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceVerifyRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceVerifyResponse"
}
}
}
}
},
"/vad": {
"post": {
"consumes": [
@@ -3847,6 +4025,27 @@
}
}
},
"schema.KnownBackend": {
"type": "object",
"properties": {
"auto_detect": {
"type": "boolean"
},
"description": {
"type": "string"
},
"installed": {
"description": "Installed is true when the backend is currently present on disk — i.e. it\nappears in gallery.ListSystemBackends(systemState). Importer-registered or\ncurated pref-only backends default to false unless they also show up on\ndisk. The import form uses this to warn users that submitting an import\nmay trigger an automatic backend download.",
"type": "boolean"
},
"modality": {
"type": "string"
},
"name": {
"type": "string"
}
}
},
"schema.LogprobContent": {
"type": "object",
"properties": {
@@ -5095,6 +5294,248 @@
}
}
},
"schema.VoiceAnalysis": {
"type": "object",
"properties": {
"age": {
"type": "number"
},
"dominant_emotion": {
"type": "string"
},
"dominant_gender": {
"type": "string"
},
"emotion": {
"type": "object",
"additionalProperties": {
"type": "number",
"format": "float32"
}
},
"end": {
"type": "number"
},
"gender": {
"type": "object",
"additionalProperties": {
"type": "number",
"format": "float32"
}
},
"start": {
"type": "number"
}
}
},
"schema.VoiceAnalyzeRequest": {
"type": "object",
"properties": {
"actions": {
"description": "subset of {\"age\",\"gender\",\"emotion\"}",
"type": "array",
"items": {
"type": "string"
}
},
"audio": {
"type": "string"
},
"model": {
"type": "string"
}
}
},
"schema.VoiceAnalyzeResponse": {
"type": "object",
"properties": {
"segments": {
"type": "array",
"items": {
"$ref": "#/definitions/schema.VoiceAnalysis"
}
}
}
},
"schema.VoiceEmbedRequest": {
"type": "object",
"properties": {
"audio": {
"type": "string"
},
"model": {
"type": "string"
}
}
},
"schema.VoiceEmbedResponse": {
"type": "object",
"properties": {
"dim": {
"type": "integer"
},
"embedding": {
"type": "array",
"items": {
"type": "number"
}
},
"model": {
"type": "string"
}
}
},
"schema.VoiceForgetRequest": {
"type": "object",
"properties": {
"id": {
"type": "string"
},
"model": {
"type": "string"
},
"store": {
"type": "string"
}
}
},
"schema.VoiceIdentifyMatch": {
"type": "object",
"properties": {
"confidence": {
"type": "number"
},
"distance": {
"type": "number"
},
"id": {
"type": "string"
},
"labels": {
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"match": {
"type": "boolean"
},
"name": {
"type": "string"
}
}
},
"schema.VoiceIdentifyRequest": {
"type": "object",
"properties": {
"audio": {
"type": "string"
},
"model": {
"type": "string"
},
"store": {
"type": "string"
},
"threshold": {
"type": "number"
},
"top_k": {
"type": "integer"
}
}
},
"schema.VoiceIdentifyResponse": {
"type": "object",
"properties": {
"matches": {
"type": "array",
"items": {
"$ref": "#/definitions/schema.VoiceIdentifyMatch"
}
}
}
},
"schema.VoiceRegisterRequest": {
"type": "object",
"properties": {
"audio": {
"type": "string"
},
"labels": {
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"model": {
"type": "string"
},
"name": {
"type": "string"
},
"store": {
"type": "string"
}
}
},
"schema.VoiceRegisterResponse": {
"type": "object",
"properties": {
"id": {
"type": "string"
},
"name": {
"type": "string"
},
"registered_at": {
"type": "string"
}
}
},
"schema.VoiceVerifyRequest": {
"type": "object",
"properties": {
"anti_spoofing": {
"type": "boolean"
},
"audio1": {
"type": "string"
},
"audio2": {
"type": "string"
},
"model": {
"type": "string"
},
"threshold": {
"type": "number"
}
}
},
"schema.VoiceVerifyResponse": {
"type": "object",
"properties": {
"confidence": {
"type": "number"
},
"distance": {
"type": "number"
},
"model": {
"type": "string"
},
"processing_time_ms": {
"type": "number"
},
"threshold": {
"type": "number"
},
"verified": {
"type": "boolean"
}
}
},
"schema.WebhookConfig": {
"type": "object",
"properties": {

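The register/identify pair defined above is the 1:N counterpart to verify. A sketch of that round trip under the same assumptions (localhost base URL, path/URL audio strings, hypothetical file and speaker names); the field names come from schema.VoiceRegisterRequest, schema.VoiceRegisterResponse and schema.VoiceIdentifyRequest/Response, while the postJSON helper is purely illustrative.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// postJSON marshals in, POSTs it as JSON and decodes the reply into out.
func postJSON(url string, in, out any) error {
	body, err := json.Marshal(in)
	if err != nil {
		return err
	}
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return json.NewDecoder(resp.Body).Decode(out)
}

func main() {
	base := "http://localhost:8080" // assumed LocalAI address

	// 1. Enroll a speaker (schema.VoiceRegisterRequest -> schema.VoiceRegisterResponse).
	var reg struct {
		ID           string `json:"id"`
		Name         string `json:"name"`
		RegisteredAt string `json:"registered_at"`
	}
	if err := postJSON(base+"/v1/voice/register", map[string]any{
		"audio":  "enroll.wav", // hypothetical path/URL, as the e2e fixtures use
		"name":   "alice",
		"labels": map[string]string{"team": "demo"},
	}, &reg); err != nil {
		panic(err)
	}
	fmt.Printf("registered %s as id=%s\n", reg.Name, reg.ID)

	// 2. Probe an unknown clip against the registry (schema.VoiceIdentifyRequest).
	var ident struct {
		Matches []struct {
			ID       string  `json:"id"`
			Name     string  `json:"name"`
			Distance float64 `json:"distance"`
			Match    bool    `json:"match"`
		} `json:"matches"`
	}
	if err := postJSON(base+"/v1/voice/identify", map[string]any{
		"audio": "probe.wav", // hypothetical clip to identify
		"top_k": 3,
	}, &ident); err != nil {
		panic(err)
	}
	for _, m := range ident.Matches {
		fmt.Printf("%s (%s): distance=%.3f match=%v\n", m.Name, m.ID, m.Distance, m.Match)
	}
}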

@@ -1038,6 +1038,25 @@ definitions:
description: '"reasoning", "tool_call", "tool_result", "status"'
type: string
type: object
schema.KnownBackend:
properties:
auto_detect:
type: boolean
description:
type: string
installed:
description: |-
Installed is true when the backend is currently present on disk — i.e. it
appears in gallery.ListSystemBackends(systemState). Importer-registered or
curated pref-only backends default to false unless they also show up on
disk. The import form uses this to warn users that submitting an import
may trigger an automatic backend download.
type: boolean
modality:
type: string
name:
type: string
type: object
schema.LogprobContent:
properties:
bytes:
@@ -1901,6 +1920,164 @@ definitions:
description: output width in pixels
type: integer
type: object
schema.VoiceAnalysis:
properties:
age:
type: number
dominant_emotion:
type: string
dominant_gender:
type: string
emotion:
additionalProperties:
format: float32
type: number
type: object
end:
type: number
gender:
additionalProperties:
format: float32
type: number
type: object
start:
type: number
type: object
schema.VoiceAnalyzeRequest:
properties:
actions:
description: subset of {"age","gender","emotion"}
items:
type: string
type: array
audio:
type: string
model:
type: string
type: object
schema.VoiceAnalyzeResponse:
properties:
segments:
items:
$ref: '#/definitions/schema.VoiceAnalysis'
type: array
type: object
schema.VoiceEmbedRequest:
properties:
audio:
type: string
model:
type: string
type: object
schema.VoiceEmbedResponse:
properties:
dim:
type: integer
embedding:
items:
type: number
type: array
model:
type: string
type: object
schema.VoiceForgetRequest:
properties:
id:
type: string
model:
type: string
store:
type: string
type: object
schema.VoiceIdentifyMatch:
properties:
confidence:
type: number
distance:
type: number
id:
type: string
labels:
additionalProperties:
type: string
type: object
match:
type: boolean
name:
type: string
type: object
schema.VoiceIdentifyRequest:
properties:
audio:
type: string
model:
type: string
store:
type: string
threshold:
type: number
top_k:
type: integer
type: object
schema.VoiceIdentifyResponse:
properties:
matches:
items:
$ref: '#/definitions/schema.VoiceIdentifyMatch'
type: array
type: object
schema.VoiceRegisterRequest:
properties:
audio:
type: string
labels:
additionalProperties:
type: string
type: object
model:
type: string
name:
type: string
store:
type: string
type: object
schema.VoiceRegisterResponse:
properties:
id:
type: string
name:
type: string
registered_at:
type: string
type: object
schema.VoiceVerifyRequest:
properties:
anti_spoofing:
type: boolean
audio1:
type: string
audio2:
type: string
model:
type: string
threshold:
type: number
type: object
schema.VoiceVerifyResponse:
properties:
confidence:
type: number
distance:
type: number
model:
type: string
processing_time_ms:
type: number
threshold:
type: number
verified:
type: boolean
type: object
schema.WebhookConfig:
properties:
headers:
@@ -2688,6 +2865,18 @@ paths:
summary: Returns the job status
tags:
- backends
/backends/known:
get:
responses:
"200":
description: Response
schema:
items:
$ref: '#/definitions/schema.KnownBackend'
type: array
summary: List all known Backends (importer registry + curated pref-only + installed-on-disk)
tags:
- backends
/backends/upgrade/{name}:
post:
parameters:
@@ -3392,6 +3581,107 @@ paths:
summary: Tokenize the input.
tags:
- tokenize
/v1/voice/analyze:
post:
parameters:
- description: query params
in: body
name: request
required: true
schema:
$ref: '#/definitions/schema.VoiceAnalyzeRequest'
responses:
"200":
description: Response
schema:
$ref: '#/definitions/schema.VoiceAnalyzeResponse'
summary: Analyze demographic attributes (age, gender, emotion) from a voice
clip.
tags:
- voice-recognition
/v1/voice/embed:
post:
parameters:
- description: query params
in: body
name: request
required: true
schema:
$ref: '#/definitions/schema.VoiceEmbedRequest'
responses:
"200":
description: Response
schema:
$ref: '#/definitions/schema.VoiceEmbedResponse'
summary: Extract a speaker embedding from an audio clip.
tags:
- voice-recognition
/v1/voice/forget:
post:
parameters:
- description: query params
in: body
name: request
required: true
schema:
$ref: '#/definitions/schema.VoiceForgetRequest'
responses:
"204":
description: No Content
summary: Remove a previously-registered speaker by ID.
tags:
- voice-recognition
/v1/voice/identify:
post:
parameters:
- description: query params
in: body
name: request
required: true
schema:
$ref: '#/definitions/schema.VoiceIdentifyRequest'
responses:
"200":
description: Response
schema:
$ref: '#/definitions/schema.VoiceIdentifyResponse'
summary: Identify a speaker against the registered database (1:N recognition).
tags:
- voice-recognition
/v1/voice/register:
post:
parameters:
- description: query params
in: body
name: request
required: true
schema:
$ref: '#/definitions/schema.VoiceRegisterRequest'
responses:
"200":
description: Response
schema:
$ref: '#/definitions/schema.VoiceRegisterResponse'
summary: Register a speaker for 1:N identification.
tags:
- voice-recognition
/v1/voice/verify:
post:
parameters:
- description: query params
in: body
name: request
required: true
schema:
$ref: '#/definitions/schema.VoiceVerifyRequest'
responses:
"200":
description: Response
schema:
$ref: '#/definitions/schema.VoiceVerifyResponse'
summary: Verify that two audio clips were spoken by the same person.
tags:
- voice-recognition
/vad:
post:
consumes:

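Before the e2e additions below, a short sketch of the quantity voiceVerifyCeiling is compared against: the cosine distance between two speaker embeddings such as /v1/voice/embed returns. The 1 - similarity form is the conventional definition; whether the backend normalises its embeddings or computes the distance in exactly this way is an assumption of the sketch, which only means to show why same-clip pairs land near zero and cross-speaker pairs higher.

package sketch

import "math"

// cosineDistance returns 1 - cosine similarity of two speaker embeddings
// (e.g. the "embedding" arrays from two /v1/voice/embed responses).
// Assumes both vectors share the same dimension ("dim" in the response).
func cosineDistance(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 1 // degenerate zero vector; treat as maximally distant
	}
	return 1 - dot/(math.Sqrt(na)*math.Sqrt(nb))
}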

@@ -88,6 +88,9 @@ const (
capFaceEmbed = "face_embed"
capFaceVerify = "face_verify"
capFaceAnalyze = "face_analyze"
capVoiceEmbed = "voice_embed"
capVoiceVerify = "voice_verify"
capVoiceAnalyze = "voice_analyze"
defaultPrompt = "The capital of France is"
streamPrompt = "Once upon a time"
@@ -137,6 +140,14 @@ var _ = Describe("Backend container", Ordered, func() {
faceFile1 string
faceFile2 string
faceFile3 string
// Voice fixtures: two clips of the same speaker + one different speaker.
voiceFile1 string
voiceFile2 string
voiceFile3 string
// voiceVerifyCeiling is the upper-bound cosine distance for a
// same-speaker pair; varies with the recognizer (ECAPA-TDNN
// runs close to 0.2, WeSpeaker around 0.3).
voiceVerifyCeiling float32
// verifyCeiling is the upper-bound cosine distance for a
// same-person pair; each model configuration can override it via
// BACKEND_TEST_VERIFY_DISTANCE_CEILING because SFace's distance
@@ -218,6 +229,13 @@ var _ = Describe("Backend container", Ordered, func() {
faceFile3 = resolveFaceFixture(workDir, "BACKEND_TEST_FACE_IMAGE_3", "face_b.jpg")
verifyCeiling = envFloat32("BACKEND_TEST_VERIFY_DISTANCE_CEILING", defaultVerifyDistanceCeil)
// Voice fixtures for the voice-recognition specs. Same resolver
// as faces — the helper is content-agnostic.
voiceFile1 = resolveFaceFixture(workDir, "BACKEND_TEST_VOICE_AUDIO_1", "voice_a_1.wav")
voiceFile2 = resolveFaceFixture(workDir, "BACKEND_TEST_VOICE_AUDIO_2", "voice_a_2.wav")
voiceFile3 = resolveFaceFixture(workDir, "BACKEND_TEST_VOICE_AUDIO_3", "voice_b.wav")
voiceVerifyCeiling = envFloat32("BACKEND_TEST_VOICE_VERIFY_DISTANCE_CEILING", 0.4)
// Pick a free port and launch the backend.
port, err := freeport.GetFreePort()
Expect(err).NotTo(HaveOccurred())
@@ -668,6 +686,107 @@ var _ = Describe("Backend container", Ordered, func() {
}
GinkgoWriter.Printf("face_analyze: %d faces\n", len(res.GetFaces()))
})
// ─── voice (speaker) recognition specs ──────────────────────────────
It("produces speaker embeddings via VoiceEmbed", func() {
if !caps[capVoiceEmbed] {
Skip("voice_embed capability not enabled")
}
Expect(voiceFile1).NotTo(BeEmpty(), "BACKEND_TEST_VOICE_AUDIO_1_FILE or _URL must be set")
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
defer cancel()
res, err := client.VoiceEmbed(ctx, &pb.VoiceEmbedRequest{Audio: voiceFile1})
Expect(err).NotTo(HaveOccurred())
vec := res.GetEmbedding()
Expect(vec).NotTo(BeEmpty(), "VoiceEmbed returned empty vector")
GinkgoWriter.Printf("voice_embed: dim=%d\n", len(vec))
})
It("verifies speakers via VoiceVerify", func() {
if !caps[capVoiceVerify] {
Skip("voice_verify capability not enabled")
}
Expect(voiceFile1).NotTo(BeEmpty(), "BACKEND_TEST_VOICE_AUDIO_1_FILE or _URL must be set")
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
defer cancel()
// Same clip twice — expected verified=true with very small distance.
same, err := client.VoiceVerify(ctx, &pb.VoiceVerifyRequest{
Audio1: voiceFile1, Audio2: voiceFile1, Threshold: voiceVerifyCeiling,
})
Expect(err).NotTo(HaveOccurred())
Expect(same.GetVerified()).To(BeTrue(), "same clip should verify: dist=%.3f", same.GetDistance())
Expect(same.GetDistance()).To(BeNumerically("<", 0.05),
"identical-clip distance should be near zero, got %.3f", same.GetDistance())
GinkgoWriter.Printf("voice_verify(same): dist=%.3f confidence=%.1f\n", same.GetDistance(), same.GetConfidence())
// Cross-pair distance — assert relative ordering: d(file1,file3) > d(same).
// We don't require the fixtures to contain true same-speaker pairs —
// good same-speaker audio is hard to source un-gated. The RPC
// correctness is pinned by the same-clip check above; the pair
// distances here are about asserting the embedding actually encodes
// speaker info (ordering changes with speaker identity).
var d12, d13 float32
if voiceFile3 != "" {
res, err := client.VoiceVerify(ctx, &pb.VoiceVerifyRequest{
Audio1: voiceFile1, Audio2: voiceFile3, Threshold: voiceVerifyCeiling,
})
if err != nil {
GinkgoWriter.Printf("voice_verify(1vs3): skipped — %v\n", err)
} else {
d13 = res.GetDistance()
Expect(d13).To(BeNumerically(">", same.GetDistance()),
"cross-clip distance %.3f should exceed same-clip distance %.3f", d13, same.GetDistance())
GinkgoWriter.Printf("voice_verify(1vs3): dist=%.3f verified=%v\n", d13, res.GetVerified())
}
}
if voiceFile2 != "" {
res, err := client.VoiceVerify(ctx, &pb.VoiceVerifyRequest{
Audio1: voiceFile1, Audio2: voiceFile2, Threshold: voiceVerifyCeiling,
})
if err != nil {
GinkgoWriter.Printf("voice_verify(1vs2): skipped — %v\n", err)
} else {
d12 = res.GetDistance()
Expect(d12).To(BeNumerically(">", same.GetDistance()),
"cross-clip distance %.3f should exceed same-clip distance %.3f", d12, same.GetDistance())
GinkgoWriter.Printf("voice_verify(1vs2): dist=%.3f verified=%v\n", d12, res.GetVerified())
}
}
// If both pair distances were computed, record their ordering.
// We log rather than assert: the ordering depends on the specific
// fixtures used, and the CI defaults are not guaranteed to contain a
// true same-speaker pair.
if d12 > 0 && d13 > 0 {
GinkgoWriter.Printf("voice_verify ordering: d(1,2)=%.3f d(1,3)=%.3f\n", d12, d13)
}
})
It("analyzes voice via VoiceAnalyze", func() {
if !caps[capVoiceAnalyze] {
Skip("voice_analyze capability not enabled")
}
Expect(voiceFile1).NotTo(BeEmpty(), "BACKEND_TEST_VOICE_AUDIO_1_FILE or _URL must be set")
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
defer cancel()
res, err := client.VoiceAnalyze(ctx, &pb.VoiceAnalyzeRequest{Audio: voiceFile1})
Expect(err).NotTo(HaveOccurred())
Expect(res.GetSegments()).NotTo(BeEmpty(), "VoiceAnalyze returned no segments")
for _, s := range res.GetSegments() {
Expect(s.GetAge()).To(BeNumerically(">", 0), "age should be populated by analyze-capable engines")
// Audeering's age-gender head outputs female / male / child;
// LocalAI capitalises those to Female / Male / Child. Custom
// checkpoints wired via the age_gender_model option may use
// different labels, so accept anything non-empty.
Expect(s.GetDominantGender()).NotTo(BeEmpty())
}
GinkgoWriter.Printf("voice_analyze: %d segments\n", len(res.GetSegments()))
})
})
// extractImage runs `docker create` + `docker export` to materialise the image