feat: voice recognition (#9500)
* feat(voice-recognition): add /v1/voice/{verify,analyze,embed} + speaker-recognition backend
The audio analog to face recognition. Adds three gRPC RPCs
(VoiceVerify / VoiceAnalyze / VoiceEmbed), their Go service and HTTP
layers, a new FLAG_SPEAKER_RECOGNITION capability flag, and a Python
backend scaffold under backend/python/speaker-recognition/ wrapping
SpeechBrain ECAPA-TDNN with a parallel OnnxDirectEngine for
WeSpeaker / 3D-Speaker ONNX exports.
The kokoros Rust backend gets matching unimplemented trait stubs:
tonic's async_trait-generated service trait has no default method
implementations, so adding an RPC without Rust stubs breaks the build
(the same regression eb01c772 fixed for face).
Swagger, /api/instructions, and the auth RouteFeatureRegistry /
APIFeatures list are updated so the endpoints surface everywhere a
client or admin UI looks.
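For a quick smoke test of the new surface, a minimal sketch from
Python (the JSON field names are assumed to mirror the gRPC
VoiceVerifyRequest fields; the model name and clip URLs are examples):

    import requests  # assumed available; any HTTP client works

    resp = requests.post(
        "http://localhost:8080/v1/voice/verify",
        json={
            "model": "speechbrain-ecapa-tdnn",      # example gallery model name
            "audio1": "https://example.com/a.wav",  # URL, data-URI or base64
            "audio2": "https://example.com/b.wav",
            "threshold": 0,  # 0 = use the backend default
        },
        timeout=120,
    )
    print(resp.json())  # expect verified / distance / threshold / confidence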
Assisted-by: Claude:claude-opus-4-7
* feat(voice-recognition): add 1:N identify + register/forget endpoints
Mirrors the face-recognition register/identify/forget surface. New
package core/services/voicerecognition/ carries a Registry interface
and a local-store-backed implementation (same in-memory vector-store
plumbing facerecognition uses, separate instance so the embedding
spaces stay isolated).
Handlers under /v1/voice/{register,identify,forget} reuse
backend.VoiceEmbed to compute the probe vector, then delegate the
nearest-neighbour search to the registry. Default cosine-distance
threshold is tuned for ECAPA-TDNN on VoxCeleb (0.25, EER ~1.9%).
As with the face registry, the current backing is in-memory only — a
pgvector implementation is a future constructor-level swap.
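A sketch of the intended 1:N round trip (the request shapes here are
assumptions mirroring the face-recognition surface; the "id" field
name is hypothetical):

    import requests

    base = "http://localhost:8080/v1/voice"
    clip = "https://example.com/alice.wav"  # hypothetical fixture
    model = "speechbrain-ecapa-tdnn"        # example gallery model name

    requests.post(f"{base}/register", json={"model": model, "id": "alice", "audio": clip})
    # probe embedding via backend.VoiceEmbed, then nearest-neighbour in the
    # registry; cosine distance <= 0.25 counts as a match for ECAPA-TDNN
    print(requests.post(f"{base}/identify", json={"model": model, "audio": clip}).json())
    requests.post(f"{base}/forget", json={"id": "alice"})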
Assisted-by: Claude:claude-opus-4-7
* feat(voice-recognition): gallery, docs, CI and e2e coverage
- backend/index.yaml: speaker-recognition backend entry + CPU and
CUDA-12 image variants (plus matching development variants).
- gallery/index.yaml: speechbrain-ecapa-tdnn (default) and
wespeaker-resnet34 model entries. The WeSpeaker SHA-256 is a
deliberate placeholder — the HF URI must be curl'd and its hash
filled in before the entry installs.
- docs/content/features/voice-recognition.md: API reference + quickstart,
mirrors the face-recognition docs.
- React UI: CAP_SPEAKER_RECOGNITION flag export (consumers follow face's
precedent — no dedicated tab yet).
- tests/e2e-backends: voice_embed / voice_verify / voice_analyze specs.
Helper resolveFaceFixture is reused as-is — the only thing face/voice
share is "download a file into workDir", so no need for a new helper.
- Makefile: docker-build-speaker-recognition +
test-extra-backend-speaker-recognition-{ecapa,all} targets. Audio
fixtures default to VCTK p225/p226 samples from HuggingFace.
- CI: test-extra.yml grows a tests-speaker-recognition-grpc job
mirroring insightface. backend.yml matrix gains CPU + CUDA-12 image
build entries — scripts/changed-backends.js auto-picks these up.
Assisted-by: Claude:claude-opus-4-7
* feat(voice-recognition): wire a working /v1/voice/analyze head
Adds AnalysisHead: a lazy-loading age / gender / emotion inference
wrapper that plugs into both SpeechBrainEngine and OnnxDirectEngine.
Defaults to two open-licence HuggingFace checkpoints:
- audeering/wav2vec2-large-robust-24-ft-age-gender (Apache 2.0) —
age regression + 3-way gender (female / male / child).
- superb/wav2vec2-base-superb-er (Apache 2.0) — 4-way emotion.
Both are optional and degrade gracefully when transformers or the
model can't be loaded — the engine raises NotImplementedError so the
gRPC layer returns 501 instead of a generic 500.
Emotion classes pass through from the model (neutral/happy/angry/sad
on the default checkpoint); the e2e test now accepts any non-empty
dominant gender so custom age_gender_model overrides don't fail it.
Adds transformers to the backend's CPU and CUDA-12 requirements.
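Per the VoiceAnalysis proto fields, a successful analyze comes back
roughly like this (sketch; the HTTP shape is assumed to mirror the
proto, and the clip URL is an example):

    import requests

    r = requests.post(
        "http://localhost:8080/v1/voice/analyze",
        json={
            "model": "speechbrain-ecapa-tdnn",        # example model name
            "audio": "https://example.com/clip.wav",  # example clip
            "actions": ["emotion"],                   # subset of age/gender/emotion
        },
        timeout=120,
    )
    for seg in r.json()["segments"]:  # a single segment spanning the clip today
        print(seg["dominant_emotion"], seg["emotion"])  # e.g. "neutral", {class: score}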
Assisted-by: Claude:claude-opus-4-7
* fix(voice-recognition): pin real WeSpeaker ResNet34 ONNX SHA-256
Replaces the placeholder hash in gallery/index.yaml with the actual
SHA-256 (7bb2f06e…) of the upstream
Wespeaker/wespeaker-voxceleb-resnet34-LM ONNX at ~25MB. `local-ai
models install wespeaker-resnet34` now succeeds.
Assisted-by: Claude:claude-opus-4-7
* fix(voice-recognition): soundfile loader + honest analyze default
Two issues surfaced on first end-to-end smoke with the actual backend
image:
1. torchaudio.load in torchaudio 2.8+ requires the torchcodec package
for audio decoding. Switch SpeechBrainEngine._load_waveform to the
already-present soundfile (listed in requirements.txt) plus a numpy
linear resample to 16kHz. Drops a heavy ffmpeg-linked dep and the
codepath we never exercise (torchaudio's ffmpeg backend).
2. The AnalysisHead was defaulting to
audeering/wav2vec2-large-robust-24-ft-age-gender, but
AutoModelForAudioClassification silently
mangles that checkpoint — it reports the age head weights as
UNEXPECTED and re-initialises the classifier head with random
values, so the "gender" output is noise and there is no age output
at all. Make age/gender opt-in instead (empty default; users wire
a cleanly-loadable Wav2Vec2ForSequenceClassification checkpoint via
age_gender_model: option). Emotion keeps its working Superb default.
Also broaden _infer_age_gender's tensor-shape handling and catch
runtime exceptions so a dodgy age/gender head never takes down the
whole analyze call.
Docs and README updated to match the new policy.
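For reference, the opt-in arrives as one of the "key:value" strings
LoadModel already parses; wiring it up looks like this (the checkpoint
name is hypothetical):

    # Options list as LocalAI passes it to the backend's LoadModel.
    options = [
        "engine:speechbrain",
        "source:speechbrain/spkrec-ecapa-voxceleb",
        "age_gender_model:my-org/w2v2-age-gender",  # hypothetical opt-in checkpoint
    ]
    # Mirrors the backend's _parse_options: split on the first ":" only,
    # so values are free to contain "/" (HF repo ids).
    parsed = dict(o.split(":", 1) for o in options if ":" in o)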
Verified with the branch-scoped gallery on localhost:
- voice/embed → 192-d ECAPA-TDNN vector
- voice/verify → same-clip dist≈6e-08 verified=true; cross-speaker
dist 0.76–0.99 verified=false (as expected)
- voice/register/identify/forget → round-trip works, 404 on unknown id
- voice/analyze → emotion populated, age/gender omitted (opt-in)
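The same-clip ~0 / cross-speaker 0.76-0.99 split follows directly from
the cosine-distance definition the backend uses; a toy check with
synthetic vectors (not real embeddings):

    import numpy as np

    def cosine_distance(a, b):  # 1 - cosine_similarity, as in the backend
        a = np.asarray(a, np.float32)
        b = np.asarray(b, np.float32)
        return float(1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    rng = np.random.default_rng(0)
    e = rng.normal(size=192)                         # stand-in for a 192-d ECAPA vector
    print(cosine_distance(e, e))                     # ~0, like the same-clip probe
    print(cosine_distance(e, rng.normal(size=192)))  # ~1 for unrelated vectors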
Assisted-by: Claude:claude-opus-4-7
* fix(voice-recognition): real CI audio fixtures + fixture-agnostic verify spec
Two issues surfaced after CI actually ran the speaker-recognition e2e
target (I'd curl-tested against a running server but hadn't run the
make target locally):
1. The default BACKEND_TEST_VOICE_AUDIO_* URLs pointed at
huggingface.co/datasets/CSTR-Edinburgh/vctk paths that return 404
(the dataset is gated). Swap them for the speechbrain test samples
served from github.com/speechbrain/speechbrain/raw/develop/ —
public, no auth, correct 16kHz mono format.
2. The VoiceVerify spec required d(file1,file2) < 0.4, assuming
file1/file2 were same-speaker. The speechbrain samples are three
different speakers (example1/2/5), and there is no easy un-gated
source of true same-speaker audio pairs (VoxCeleb/VCTK/LibriSpeech
are all license- or size-gated for CI use). Replace the ceiling
check with a relative-ordering assertion: d(pair) > d(same-clip)
for both file2 and file3 — that's enough to prove the embeddings
encode speaker info, and it works with any three non-identical
clips. Actual speaker ordering d(1,2) vs d(1,3) is logged but not
asserted.
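In Python terms the new assertion is just this (a sketch of the spec
logic, not the actual test code):

    def assert_speaker_signal(dist, clip1, clip2, clip3):
        # dist: any pairwise cosine-distance function over audio clips
        same = dist(clip1, clip1)          # the same-clip twin probe, ~0
        assert dist(clip1, clip2) > same   # holds for any non-identical clips
        assert dist(clip1, clip3) > same
        # d(1,2) vs d(1,3) ordering is logged but deliberately not asserted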
Local run: 4/4 voice specs pass (Health, LoadModel, VoiceEmbed,
VoiceVerify) on the built backend image. 12 non-voice specs skipped
as expected.
Assisted-by: Claude:claude-opus-4-7
* fix(ci): checkout with submodules in the reusable backend_build workflow
The kokoros Rust backend build fails with
failed to read .../sources/Kokoros/kokoros/Cargo.toml: No such file
because the reusable backend_build.yml workflow's actions/checkout
step was missing `submodules: true`. Dockerfile.rust does `COPY .
/LocalAI`, and without the submodule files the subsequent `cargo
build` can't find the vendored Kokoros crate.
The bug pre-dates this PR — scripts/changed-backends.js only triggers
the kokoros image job when something under backend/rust/kokoros or
the shared proto changes, so master had been coasting past it. The
voice-recognition proto addition re-broke it.
Other checkouts in backend.yml (llama-cpp-darwin) and test-extra.yml
(insightface, kokoros, speaker-recognition) already pass
`submodules: true`; this brings the shared backend image builder in
line.
Assisted-by: Claude:claude-opus-4-7
Committed by GitHub
Parent: 1c59165d63
Commit: 181ebb6df4
.github/workflows/backend.yml (vendored, 27 changed lines)

@@ -724,6 +724,19 @@ jobs:
           dockerfile: "./backend/Dockerfile.python"
           context: "./"
           ubuntu-version: '2404'
+        - build-type: 'cublas'
+          cuda-major-version: "12"
+          cuda-minor-version: "8"
+          platforms: 'linux/amd64'
+          tag-latest: 'auto'
+          tag-suffix: '-gpu-nvidia-cuda-12-speaker-recognition'
+          runs-on: 'ubuntu-latest'
+          base-image: "ubuntu:24.04"
+          skip-drivers: 'false'
+          backend: "speaker-recognition"
+          dockerfile: "./backend/Dockerfile.python"
+          context: "./"
+          ubuntu-version: '2404'
         - build-type: 'cublas'
           cuda-major-version: "12"
           cuda-minor-version: "8"

@@ -2653,6 +2666,20 @@ jobs:
           dockerfile: "./backend/Dockerfile.python"
           context: "./"
           ubuntu-version: '2404'
+        # speaker-recognition (voice/speaker biometrics)
+        - build-type: ''
+          cuda-major-version: ""
+          cuda-minor-version: ""
+          platforms: 'linux/amd64,linux/arm64'
+          tag-latest: 'auto'
+          tag-suffix: '-cpu-speaker-recognition'
+          runs-on: 'ubuntu-latest'
+          base-image: "ubuntu:24.04"
+          skip-drivers: 'false'
+          backend: "speaker-recognition"
+          dockerfile: "./backend/Dockerfile.python"
+          context: "./"
+          ubuntu-version: '2404'
         - build-type: 'intel'
           cuda-major-version: ""
           cuda-minor-version: ""
.github/workflows/backend_build.yml (vendored, 2 changed lines)

@@ -108,6 +108,8 @@ jobs:
 
       - name: Checkout
         uses: actions/checkout@v6
+        with:
+          submodules: true
 
       - name: Release space from worker
         if: inputs.runs-on == 'ubuntu-latest'
.github/workflows/test-extra.yml (vendored, 27 changed lines)

@@ -39,6 +39,7 @@ jobs:
       voxtral: ${{ steps.detect.outputs.voxtral }}
       kokoros: ${{ steps.detect.outputs.kokoros }}
       insightface: ${{ steps.detect.outputs.insightface }}
+      speaker-recognition: ${{ steps.detect.outputs.speaker-recognition }}
     steps:
       - name: Checkout repository
         uses: actions/checkout@v6

@@ -778,3 +779,29 @@
       - name: Build insightface backend image and run both model configurations
         run: |
           make test-extra-backend-insightface-all
+
+  tests-speaker-recognition-grpc:
+    needs: detect-changes
+    if: needs.detect-changes.outputs.speaker-recognition == 'true' || needs.detect-changes.outputs.run-all == 'true'
+    runs-on: ubuntu-latest
+    timeout-minutes: 90
+    steps:
+      - name: Clone
+        uses: actions/checkout@v6
+        with:
+          submodules: true
+      - name: Dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y --no-install-recommends \
+            make build-essential curl ca-certificates git tar
+      - name: Setup Go
+        uses: actions/setup-go@v5
+        with:
+          go-version: '1.26.0'
+      - name: Free disk space
+        run: |
+          sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
+          df -h
+      - name: Build speaker-recognition backend image and run the ECAPA-TDNN configuration
+        run: |
+          make test-extra-backend-speaker-recognition-all
Makefile (43 changed lines)

@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad
 
 GOCMD=go
 GOTEST=$(GOCMD) test

@@ -435,6 +435,7 @@ prepare-test-extra: protogen-python
 	$(MAKE) -C backend/python/trl
 	$(MAKE) -C backend/python/tinygrad
 	$(MAKE) -C backend/python/insightface
+	$(MAKE) -C backend/python/speaker-recognition
 	$(MAKE) -C backend/rust/kokoros kokoros-grpc
 
 test-extra: prepare-test-extra

@@ -459,6 +460,7 @@ test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/trl test
 	$(MAKE) -C backend/python/tinygrad test
 	$(MAKE) -C backend/python/insightface test
+	$(MAKE) -C backend/python/speaker-recognition test
 	$(MAKE) -C backend/rust/kokoros test
 
 ##

@@ -713,6 +715,41 @@ test-extra-backend-insightface-all: \
 	test-extra-backend-insightface-buffalo-sc \
 	test-extra-backend-insightface-opencv
 
+## speaker-recognition — voice (speaker) biometrics.
+##
+## Audio fixtures default to the speechbrain test samples served
+## straight from their GitHub repo — public, no auth needed, and they
+## ship as 16kHz mono WAV/FLAC which is exactly what the engine wants.
+## example{1,2,5} are three different speakers; the suite treats
+## example1 as the "same-image twin" probe (verify(clip, clip) must
+## return distance≈0) and the other two as cross-speaker ceilings.
+## Override with BACKEND_TEST_VOICE_AUDIO_{1,2,3}_FILE for offline runs.
+VOICE_AUDIO_1_URL ?= https://github.com/speechbrain/speechbrain/raw/develop/tests/samples/single-mic/example1.wav
+VOICE_AUDIO_2_URL ?= https://github.com/speechbrain/speechbrain/raw/develop/tests/samples/single-mic/example2.flac
+VOICE_AUDIO_3_URL ?= https://github.com/speechbrain/speechbrain/raw/develop/tests/samples/single-mic/example5.wav
+
+## ECAPA-TDNN via SpeechBrain — default CI configuration. Auto-downloads
+## the checkpoint from HuggingFace on first LoadModel (bundled in the
+## backend image pip install). 192-d embeddings, cosine-distance based.
+## The e2e suite drives LoadModel directly so we don't rely on LocalAI's
+## gallery flow here.
+test-extra-backend-speaker-recognition-ecapa: docker-build-speaker-recognition
+	BACKEND_IMAGE=local-ai-backend:speaker-recognition \
+	BACKEND_TEST_MODEL_NAME=speechbrain/spkrec-ecapa-voxceleb \
+	BACKEND_TEST_OPTIONS=engine:speechbrain,source:speechbrain/spkrec-ecapa-voxceleb \
+	BACKEND_TEST_CAPS=health,load,voice_embed,voice_verify \
+	BACKEND_TEST_VOICE_AUDIO_1_URL=$(VOICE_AUDIO_1_URL) \
+	BACKEND_TEST_VOICE_AUDIO_2_URL=$(VOICE_AUDIO_2_URL) \
+	BACKEND_TEST_VOICE_AUDIO_3_URL=$(VOICE_AUDIO_3_URL) \
+	BACKEND_TEST_VOICE_VERIFY_DISTANCE_CEILING=0.4 \
+	$(MAKE) test-extra-backend
+
+## Aggregate — today there's only one voice config; the target exists
+## so the CI workflow matches the insightface-all naming convention and
+## can grow to include WeSpeaker / 3D-Speaker later.
+test-extra-backend-speaker-recognition-all: \
+	test-extra-backend-speaker-recognition-ecapa
+
 ## sglang mirrors the vllm setup: HuggingFace model id, same tiny Qwen,
 ## tool-call extraction via sglang's native qwen parser. CPU builds use
 ## sglang's upstream pyproject_cpu.toml recipe (see backend/python/sglang/install.sh).

@@ -859,6 +896,7 @@ BACKEND_FASTER_WHISPER = faster-whisper|python|.|false|true
 BACKEND_COQUI = coqui|python|.|false|true
 BACKEND_RFDETR = rfdetr|python|.|false|true
 BACKEND_INSIGHTFACE = insightface|python|.|false|true
+BACKEND_SPEAKER_RECOGNITION = speaker-recognition|python|.|false|true
 BACKEND_KITTEN_TTS = kitten-tts|python|.|false|true
 BACKEND_NEUTTS = neutts|python|.|false|true
 BACKEND_KOKORO = kokoro|python|.|false|true

@@ -931,6 +969,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_FASTER_WHISPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_COQUI)))
 $(eval $(call generate-docker-build-target,$(BACKEND_RFDETR)))
 $(eval $(call generate-docker-build-target,$(BACKEND_INSIGHTFACE)))
+$(eval $(call generate-docker-build-target,$(BACKEND_SPEAKER_RECOGNITION)))
 $(eval $(call generate-docker-build-target,$(BACKEND_KITTEN_TTS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_NEUTTS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_KOKORO)))

@@ -965,7 +1004,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar
 
-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface docker-build-speaker-recognition
 
 ########################################################
 ### Mock Backend for E2E Tests
backend/backend.proto

@@ -26,6 +26,9 @@ service Backend {
   rpc Detect(DetectOptions) returns (DetectResponse) {}
   rpc FaceVerify(FaceVerifyRequest) returns (FaceVerifyResponse) {}
   rpc FaceAnalyze(FaceAnalyzeRequest) returns (FaceAnalyzeResponse) {}
+  rpc VoiceVerify(VoiceVerifyRequest) returns (VoiceVerifyResponse) {}
+  rpc VoiceAnalyze(VoiceAnalyzeRequest) returns (VoiceAnalyzeResponse) {}
+  rpc VoiceEmbed(VoiceEmbedRequest) returns (VoiceEmbedResponse) {}
 
   rpc StoresSet(StoresSetOptions) returns (Result) {}
   rpc StoresDelete(StoresDeleteOptions) returns (Result) {}

@@ -528,6 +531,57 @@ message FaceAnalyzeResponse {
   repeated FaceAnalysis faces = 1;
 }
 
+// --- Voice (speaker) recognition messages ---
+//
+// Analogous to the Face* messages above, but for speaker biometrics.
+// Audio fields accept a filesystem path (same convention as
+// TranscriptRequest.dst). The HTTP layer materialises base64 / URL /
+// data-URI inputs to a temp file before calling the gRPC backend.
+
+message VoiceVerifyRequest {
+  string audio1 = 1;       // path to first audio clip
+  string audio2 = 2;       // path to second audio clip
+  float threshold = 3;     // cosine-distance threshold; 0 = use backend default
+  bool anti_spoofing = 4;  // reserved for future AASIST bolt-on
+}
+
+message VoiceVerifyResponse {
+  bool verified = 1;
+  float distance = 2;   // 1 - cosine_similarity
+  float threshold = 3;
+  float confidence = 4; // 0-100
+  string model = 5;     // e.g. "speechbrain/spkrec-ecapa-voxceleb"
+  float processing_time_ms = 6;
+}
+
+message VoiceAnalyzeRequest {
+  string audio = 1;            // path to audio clip
+  repeated string actions = 2; // subset of ["age","gender","emotion"]; empty = all-supported
+}
+
+message VoiceAnalysis {
+  float start = 1; // segment start time in seconds (0 if single-utterance)
+  float end = 2;   // segment end time in seconds
+  float age = 3;
+  string dominant_gender = 4;
+  map<string, float> gender = 5;
+  string dominant_emotion = 6;
+  map<string, float> emotion = 7;
+}
+
+message VoiceAnalyzeResponse {
+  repeated VoiceAnalysis segments = 1;
+}
+
+message VoiceEmbedRequest {
+  string audio = 1; // path to audio clip
+}
+
+message VoiceEmbedResponse {
+  repeated float embedding = 1;
+  string model = 2;
+}
+
 message ToolFormatMarkers {
   string format_type = 1; // "json_native", "tag_with_json", "tag_with_tagged"
backend/index.yaml

@@ -3773,3 +3773,64 @@
     uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-insightface"
     mirrors:
       - localai/localai-backends:master-gpu-nvidia-cuda-12-insightface
+
+# speaker-recognition (voice/speaker biometrics) — Apache-2.0 stack
+- &speakerrecognition
+  name: "speaker-recognition"
+  alias: "speaker-recognition"
+  # SpeechBrain is Apache-2.0. WeSpeaker / 3D-Speaker ONNX exports are
+  # Apache-2.0. The backend itself ships only Python deps — all model
+  # weights flow through LocalAI's gallery download mechanism (or
+  # SpeechBrain's built-in HF auto-download at first LoadModel).
+  license: apache-2.0
+  description: |
+    Speaker (voice) recognition backend — the audio analog to
+    insightface. Wraps SpeechBrain ECAPA-TDNN (default engine, 192-d
+    embeddings, ~1.9% EER on VoxCeleb) plus an OnnxDirectEngine for
+    pre-exported WeSpeaker / 3D-Speaker ONNX models.
+
+    Exposes speaker verification (/v1/voice/verify), speaker embedding
+    (/v1/voice/embed), speaker analysis (/v1/voice/analyze), and 1:N
+    speaker identification (/v1/voice/{register,identify,forget}).
+    Registrations use LocalAI's built-in vector store — same in-memory
+    backing the face-recognition registry uses, separate instance.
+  urls:
+    - https://speechbrain.github.io/
+    - https://github.com/wenet-e2e/wespeaker
+    - https://github.com/modelscope/3D-Speaker
+  tags:
+    - voice-recognition
+    - speaker-verification
+    - speaker-embedding
+    - gpu
+    - cpu
+  capabilities:
+    default: "cpu-speaker-recognition"
+    nvidia: "cuda12-speaker-recognition"
+    nvidia-cuda-12: "cuda12-speaker-recognition"
+- !!merge <<: *speakerrecognition
+  name: "speaker-recognition-development"
+  capabilities:
+    default: "cpu-speaker-recognition-development"
+    nvidia: "cuda12-speaker-recognition-development"
+    nvidia-cuda-12: "cuda12-speaker-recognition-development"
+- !!merge <<: *speakerrecognition
+  name: "cpu-speaker-recognition"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-speaker-recognition"
+  mirrors:
+    - localai/localai-backends:latest-cpu-speaker-recognition
+- !!merge <<: *speakerrecognition
+  name: "cuda12-speaker-recognition"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-speaker-recognition"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-12-speaker-recognition
+- !!merge <<: *speakerrecognition
+  name: "cpu-speaker-recognition-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-speaker-recognition"
+  mirrors:
+    - localai/localai-backends:master-cpu-speaker-recognition
+- !!merge <<: *speakerrecognition
+  name: "cuda12-speaker-recognition-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-speaker-recognition"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-12-speaker-recognition
backend/python/speaker-recognition/Makefile (new file, 13 lines)

@@ -0,0 +1,13 @@
.DEFAULT_GOAL := install

.PHONY: install
install:
	bash install.sh

.PHONY: protogen-clean
protogen-clean:
	$(RM) backend_pb2_grpc.py backend_pb2.py

.PHONY: clean
clean: protogen-clean
	rm -rf venv __pycache__
backend/python/speaker-recognition/README.md (new file, 40 lines)

@@ -0,0 +1,40 @@
# speaker-recognition

Speaker (voice) recognition backend for LocalAI. The audio analog to
`insightface` — produces speaker embeddings and supports 1:1 voice
verification and voice demographic analysis.

## Engines

- **SpeechBrainEngine** (default): ECAPA-TDNN trained on VoxCeleb.
  192-d L2-normalised embeddings, cosine distance for verification.
  Auto-downloads from HuggingFace on first LoadModel.
- **OnnxDirectEngine**: Any pre-exported ONNX speaker encoder
  (WeSpeaker ResNet, 3D-Speaker ERes2Net, CAM++, …). Model path comes
  from the gallery `files:` entry.

Engine selection is gallery-driven: if the model config provides
`model_path:` / `onnx:` the ONNX engine is used, otherwise the
SpeechBrain engine.

## Endpoints

- `POST /v1/voice/verify` — 1:1 same-speaker check.
- `POST /v1/voice/embed` — extract a speaker embedding vector.
- `POST /v1/voice/analyze` — voice demographics, loaded lazily on
  the first analyze call:
  - **Emotion** (default, opt-out): `superb/wav2vec2-base-superb-er`
    (Apache-2.0), 4-way categorical (neutral / happy / angry / sad).
  - **Age + gender** (opt-in): no default — wire a checkpoint with a
    standard `Wav2Vec2ForSequenceClassification` head via
    `age_gender_model:<repo>` in options. The Audeering
    age-gender model is *not* usable as a drop-in because its
    multi-task head isn't loadable via `AutoModelForAudioClassification`.

Both heads are optional. When nothing loads, the engine returns 501.

## Audio input

Audio is materialised by the HTTP layer to a temp wav before calling
the gRPC backend. Accepted input forms on the HTTP side: URL, data-URI,
or raw base64. The backend itself always receives a filesystem path.
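A raw-base64 call against /v1/voice/embed, as a sketch (the HTTP field
name is assumed to mirror the gRPC `audio` field; the model name is an
example):

    import base64
    import requests

    with open("probe.wav", "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    r = requests.post(
        "http://localhost:8080/v1/voice/embed",
        json={"model": "speechbrain-ecapa-tdnn", "audio": b64},
        timeout=120,
    )
    print(len(r.json()["embedding"]))  # 192-d for the default ECAPA-TDNN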
backend/python/speaker-recognition/backend.py (new file, 205 lines)

@@ -0,0 +1,205 @@
#!/usr/bin/env python3
"""gRPC server for the LocalAI speaker-recognition backend.

Implements Health / LoadModel / Status plus the voice-specific methods:
VoiceVerify, VoiceAnalyze, VoiceEmbed. The heavy lifting lives in
engines.py — this file is just the gRPC plumbing, mirroring the
insightface backend's two-engine split (SpeechBrain + OnnxDirect).
"""
from __future__ import annotations

import argparse
import os
import signal
import sys
import time
from concurrent import futures

import backend_pb2
import backend_pb2_grpc
import grpc

sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "common"))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "common"))
from grpc_auth import get_auth_interceptors  # noqa: E402

from engines import SpeakerEngine, build_engine  # noqa: E402

_ONE_DAY = 60 * 60 * 24
MAX_WORKERS = int(os.environ.get("PYTHON_GRPC_MAX_WORKERS", "1"))

# ECAPA-TDNN on VoxCeleb is the reference. Threshold is tuned for
# cosine distance (1 - cosine_similarity). Clients may override.
DEFAULT_VERIFY_THRESHOLD = 0.25


def _parse_options(raw: list[str]) -> dict[str, str]:
    out: dict[str, str] = {}
    for entry in raw:
        if ":" not in entry:
            continue
        k, v = entry.split(":", 1)
        out[k.strip()] = v.strip()
    return out


class BackendServicer(backend_pb2_grpc.BackendServicer):
    def __init__(self) -> None:
        self.engine: SpeakerEngine | None = None
        self.engine_name: str = ""
        self.model_name: str = ""
        self.verify_threshold: float = DEFAULT_VERIFY_THRESHOLD

    def Health(self, request, context):
        return backend_pb2.Reply(message=bytes("OK", "utf-8"))

    def LoadModel(self, request, context):
        options = _parse_options(list(request.Options))
        # Surface LocalAI's models directory (ModelPath) so engines can
        # anchor relative paths and auto-download into a writable spot
        # alongside every other gallery-managed asset.
        options["_model_path"] = request.ModelPath or ""
        try:
            engine, engine_name = build_engine(request.Model, options)
        except Exception as exc:  # noqa: BLE001
            return backend_pb2.Result(success=False, message=f"engine init failed: {exc}")

        self.engine = engine
        self.engine_name = engine_name
        self.model_name = request.Model

        threshold_opt = options.get("verify_threshold")
        if threshold_opt:
            try:
                self.verify_threshold = float(threshold_opt)
            except ValueError:
                pass
        return backend_pb2.Result(success=True, message=f"loaded {engine_name}")

    def Status(self, request, context):
        state = backend_pb2.StatusResponse.State.READY if self.engine else backend_pb2.StatusResponse.State.UNINITIALIZED
        return backend_pb2.StatusResponse(state=state)

    def _require_engine(self, context) -> SpeakerEngine | None:
        if self.engine is None:
            context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
            context.set_details("no speaker-recognition model loaded")
            return None
        return self.engine

    def VoiceVerify(self, request, context):
        engine = self._require_engine(context)
        if engine is None:
            return backend_pb2.VoiceVerifyResponse()
        if not request.audio1 or not request.audio2:
            context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
            context.set_details("audio1 and audio2 are required")
            return backend_pb2.VoiceVerifyResponse()

        threshold = request.threshold if request.threshold > 0 else self.verify_threshold
        started = time.time()
        try:
            distance = engine.compare(request.audio1, request.audio2)
        except Exception as exc:  # noqa: BLE001
            context.set_code(grpc.StatusCode.INTERNAL)
            context.set_details(f"voice verify failed: {exc}")
            return backend_pb2.VoiceVerifyResponse()

        elapsed_ms = (time.time() - started) * 1000.0
        # Confidence goes linearly from 100 at distance=0 to 0 at distance=threshold.
        confidence = max(0.0, min(100.0, (1.0 - distance / threshold) * 100.0))
        return backend_pb2.VoiceVerifyResponse(
            verified=distance <= threshold,
            distance=distance,
            threshold=threshold,
            confidence=confidence,
            model=self.model_name,
            processing_time_ms=elapsed_ms,
        )

    def VoiceEmbed(self, request, context):
        engine = self._require_engine(context)
        if engine is None:
            return backend_pb2.VoiceEmbedResponse()
        if not request.audio:
            context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
            context.set_details("audio is required")
            return backend_pb2.VoiceEmbedResponse()
        try:
            vec = engine.embed(request.audio)
        except Exception as exc:  # noqa: BLE001
            context.set_code(grpc.StatusCode.INTERNAL)
            context.set_details(f"voice embed failed: {exc}")
            return backend_pb2.VoiceEmbedResponse()
        return backend_pb2.VoiceEmbedResponse(embedding=list(vec), model=self.model_name)

    def VoiceAnalyze(self, request, context):
        engine = self._require_engine(context)
        if engine is None:
            return backend_pb2.VoiceAnalyzeResponse()
        if not request.audio:
            context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
            context.set_details("audio is required")
            return backend_pb2.VoiceAnalyzeResponse()

        actions = list(request.actions) or ["age", "gender", "emotion"]
        try:
            segments = engine.analyze(request.audio, actions)
        except NotImplementedError:
            context.set_code(grpc.StatusCode.UNIMPLEMENTED)
            context.set_details(f"analyze not supported by {self.engine_name}")
            return backend_pb2.VoiceAnalyzeResponse()
        except Exception as exc:  # noqa: BLE001
            context.set_code(grpc.StatusCode.INTERNAL)
            context.set_details(f"voice analyze failed: {exc}")
            return backend_pb2.VoiceAnalyzeResponse()

        proto_segments = []
        for seg in segments:
            proto_segments.append(
                backend_pb2.VoiceAnalysis(
                    start=seg.get("start", 0.0),
                    end=seg.get("end", 0.0),
                    age=seg.get("age", 0.0),
                    dominant_gender=seg.get("dominant_gender", ""),
                    gender=seg.get("gender", {}),
                    dominant_emotion=seg.get("dominant_emotion", ""),
                    emotion=seg.get("emotion", {}),
                )
            )
        return backend_pb2.VoiceAnalyzeResponse(segments=proto_segments)


def serve(address: str) -> None:
    interceptors = get_auth_interceptors()
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=MAX_WORKERS),
        interceptors=interceptors,
        options=[
            ("grpc.max_send_message_length", 128 * 1024 * 1024),
            ("grpc.max_receive_message_length", 128 * 1024 * 1024),
        ],
    )
    backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
    server.add_insecure_port(address)
    server.start()
    print("speaker-recognition backend listening on", address, flush=True)

    def _stop(*_):
        server.stop(0)
        sys.exit(0)

    signal.signal(signal.SIGTERM, _stop)
    signal.signal(signal.SIGINT, _stop)
    try:
        while True:
            time.sleep(_ONE_DAY)
    except KeyboardInterrupt:
        server.stop(0)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--addr", default="localhost:50051")
    args = parser.parse_args()
    serve(args.addr)
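A sketch of exercising this server straight from the generated Python
stubs (the address and audio path are examples; this bypasses the HTTP
layer, so audio must already be a local file):

    import grpc
    import backend_pb2
    import backend_pb2_grpc

    channel = grpc.insecure_channel("localhost:50051")
    stub = backend_pb2_grpc.BackendStub(channel)
    stub.LoadModel(backend_pb2.ModelOptions(Model="speechbrain/spkrec-ecapa-voxceleb"))
    resp = stub.VoiceEmbed(backend_pb2.VoiceEmbedRequest(audio="/tmp/probe.wav"))
    print(len(resp.embedding))  # 192 for ECAPA-TDNN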
backend/python/speaker-recognition/engines.py (new file, 387 lines)

@@ -0,0 +1,387 @@
"""Speaker-recognition engines.

Two engines are offered, mirroring the insightface backend's split:

* SpeechBrainEngine: full PyTorch / SpeechBrain path. Uses the
  ECAPA-TDNN recipe trained on VoxCeleb; 192-d L2-normalized
  embeddings, cosine distance for verification. Auto-downloads the
  checkpoint into LocalAI's models directory on first LoadModel.

* OnnxDirectEngine: CPU-friendly fallback that runs pre-exported
  ONNX speaker encoders (WeSpeaker ResNet34, 3D-Speaker ERes2Net,
  CAM++, etc.). Model paths come from the model config — the gallery
  `files:` flow drops them into the models directory.

Engine selection follows the same gallery-driven convention face
recognition uses (insightface commits 9c6da0f7 / 405fec0b): the
Python backend reads `engine` / `model_path` / `checkpoint` from the
options dict and picks an engine accordingly.
"""
from __future__ import annotations

import os
from typing import Any, Iterable, Protocol


class SpeakerEngine(Protocol):
    """Interface both concrete engines satisfy."""

    name: str

    def embed(self, audio_path: str) -> list[float]:  # pragma: no cover - interface
        ...

    def compare(self, audio1: str, audio2: str) -> float:  # pragma: no cover
        ...

    def analyze(self, audio_path: str, actions: Iterable[str]) -> list[dict[str, Any]]:  # pragma: no cover
        ...


def _cosine_distance(a, b) -> float:
    import numpy as np

    va = np.asarray(a, dtype=np.float32).reshape(-1)
    vb = np.asarray(b, dtype=np.float32).reshape(-1)
    na = float(np.linalg.norm(va))
    nb = float(np.linalg.norm(vb))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return float(1.0 - np.dot(va, vb) / (na * nb))


class AnalysisHead:
    """Age / gender / emotion head, lazy-loaded on first analyze call.

    Wraps two open-licence HuggingFace checkpoints:

    * audeering/wav2vec2-large-robust-24-ft-age-gender — age
      regression (0–100 years) + 3-way gender (female/male/child).
      Apache 2.0.
    * superb/wav2vec2-base-superb-er — 4-way emotion classification
      (neutral / happy / angry / sad). Apache 2.0.

    Either model is optional — the head degrades gracefully to only the
    attributes it could load. Override the checkpoint with the
    `age_gender_model` / `emotion_model` option if you want something
    else. Set either to an empty string to disable that head.
    """

    # Age + gender is OFF by default: the high-accuracy Apache-2.0
    # checkpoint (Audeering wav2vec2-large-robust-24-ft-age-gender) uses a
    # custom multi-task head that AutoModelForAudioClassification silently
    # mangles — it drops the age weights as UNEXPECTED and re-initialises
    # the classifier head with random values, so the output is noise. Users
    # who have a cleanly loadable age/gender classifier can opt in with
    # `age_gender_model:<repo>` in options. The emotion default below
    # (superb/wav2vec2-base-superb-er) loads via the standard audio-
    # classification pipeline with no such caveat.
    DEFAULT_AGE_GENDER_MODEL = ""
    DEFAULT_EMOTION_MODEL = "superb/wav2vec2-base-superb-er"
    AGE_GENDER_LABELS = ("female", "male", "child")

    def __init__(self, options: dict[str, str]):
        self._options = options
        self._age_gender = None
        self._age_gender_processor = None
        self._age_gender_loaded = False
        self._age_gender_error: str | None = None
        self._emotion = None
        self._emotion_loaded = False
        self._emotion_error: str | None = None

    # --- age / gender -------------------------------------------------
    def _ensure_age_gender(self):
        if self._age_gender_loaded:
            return
        self._age_gender_loaded = True
        model_id = self._options.get(
            "age_gender_model", self.DEFAULT_AGE_GENDER_MODEL
        )
        if not model_id:
            self._age_gender_error = "disabled"
            return
        try:
            # Late imports — torch / transformers are heavy and only
            # pulled in when the analyze head actually runs.
            import torch  # type: ignore
            from transformers import AutoFeatureExtractor, AutoModelForAudioClassification  # type: ignore

            self._torch = torch
            self._age_gender_processor = AutoFeatureExtractor.from_pretrained(model_id)
            self._age_gender = AutoModelForAudioClassification.from_pretrained(model_id)
            self._age_gender.eval()
        except Exception as exc:  # noqa: BLE001
            self._age_gender_error = f"{type(exc).__name__}: {exc}"

    def _infer_age_gender(self, waveform_16k) -> dict[str, Any]:
        self._ensure_age_gender()
        if self._age_gender is None:
            return {}
        import numpy as np

        try:
            inputs = self._age_gender_processor(
                waveform_16k, sampling_rate=16000, return_tensors="pt"
            )
            with self._torch.no_grad():
                outputs = self._age_gender(**inputs)

            # Audeering's checkpoint is published with a custom head: the
            # official recipe exposes `(hidden_states, logits_age, logits_gender)`.
            # AutoModelForAudioClassification flattens that into a single
            # `logits` tensor of shape [batch, 4] — [age_regression, female, male, child].
            # Fall back gracefully when the shape is different (e.g. a
            # user-supplied age_gender_model checkpoint that returns a proper tuple).
            hidden = getattr(outputs, "logits", outputs)
            age_years = None
            gender_logits = None
            if isinstance(hidden, (tuple, list)) and len(hidden) >= 2:
                age_years = float(hidden[0].squeeze().item()) * 100.0
                gender_logits = hidden[1]
            else:
                flat = hidden.squeeze()
                if flat.ndim == 1 and flat.numel() >= 4:
                    age_years = float(flat[0].item()) * 100.0
                    gender_logits = flat[1:4]
                elif flat.ndim == 1 and flat.numel() == 1:
                    age_years = float(flat.item()) * 100.0

            if age_years is None and gender_logits is None:
                return {}

            result: dict[str, Any] = {}
            if age_years is not None:
                result["age"] = age_years
            if gender_logits is not None:
                probs = self._torch.softmax(gender_logits, dim=-1).cpu().numpy()
                probs = np.asarray(probs).reshape(-1)
                gender_map = {
                    label: float(probs[i])
                    for i, label in enumerate(self.AGE_GENDER_LABELS[: len(probs)])
                }
                result["gender"] = gender_map
                if gender_map:
                    dom = max(gender_map.items(), key=lambda kv: kv[1])[0]
                    result["dominant_gender"] = {
                        "female": "Female",
                        "male": "Male",
                        "child": "Child",
                    }.get(dom, dom.capitalize())
            return result
        except Exception as exc:  # noqa: BLE001
            # Analyze is a best-effort feature — never take down the
            # whole analyze call because the age/gender head had a bad
            # day. Mark the failure so the emotion branch still runs.
            self._age_gender_error = f"runtime: {type(exc).__name__}: {exc}"
            return {}

    # --- emotion ------------------------------------------------------
    def _ensure_emotion(self):
        if self._emotion_loaded:
            return
        self._emotion_loaded = True
        model_id = self._options.get("emotion_model", self.DEFAULT_EMOTION_MODEL)
        if not model_id:
            self._emotion_error = "disabled"
            return
        try:
            from transformers import pipeline  # type: ignore

            self._emotion = pipeline("audio-classification", model=model_id)
        except Exception as exc:  # noqa: BLE001
            self._emotion_error = f"{type(exc).__name__}: {exc}"

    def _infer_emotion(self, audio_path: str) -> dict[str, Any]:
        self._ensure_emotion()
        if self._emotion is None:
            return {}
        try:
            raw = self._emotion(audio_path, top_k=8)
        except Exception as exc:  # noqa: BLE001
            # Second-line defense: don't fail the whole analyze call
            # over a runtime inference hiccup.
            self._emotion_error = f"runtime: {type(exc).__name__}: {exc}"
            return {}
        emotion_map = {row["label"].lower(): float(row["score"]) for row in raw}
        if not emotion_map:
            return {}
        dom = max(emotion_map.items(), key=lambda kv: kv[1])[0]
        return {"emotion": emotion_map, "dominant_emotion": dom}

    # --- orchestrator -------------------------------------------------
    def analyze(self, audio_path: str, waveform_16k, actions: Iterable[str]) -> dict[str, Any]:
        wanted = {a.strip().lower() for a in actions} if actions else {"age", "gender", "emotion"}
        result: dict[str, Any] = {}
        if "age" in wanted or "gender" in wanted:
            ag = self._infer_age_gender(waveform_16k)
            if "age" in wanted and "age" in ag:
                result["age"] = ag["age"]
            if "gender" in wanted:
                if "gender" in ag:
                    result["gender"] = ag["gender"]
                if "dominant_gender" in ag:
                    result["dominant_gender"] = ag["dominant_gender"]
        if "emotion" in wanted:
            em = self._infer_emotion(audio_path)
            result.update(em)
        return result


class SpeechBrainEngine:
    """ECAPA-TDNN via SpeechBrain. Auto-downloads on first use."""

    name = "speechbrain-ecapa-tdnn"

    def __init__(self, model_name: str, options: dict[str, str]):
        # Late imports so the module can be introspected / tested
        # without torch / speechbrain being installed.
        from speechbrain.inference.speaker import EncoderClassifier  # type: ignore

        source = options.get("source") or model_name or "speechbrain/spkrec-ecapa-voxceleb"
        savedir = options.get("_model_path") or os.environ.get("HF_HOME") or "./pretrained_models"
        self._model = EncoderClassifier.from_hparams(source=source, savedir=savedir)
        self._analysis = AnalysisHead(options)

    def _load_waveform(self, path: str):
        # Use soundfile + torch directly — torchaudio.load in torchaudio
        # 2.8+ requires the torchcodec package for decoding, which adds
        # another heavy ffmpeg-linked dep. soundfile covers WAV/FLAC
        # which is what we care about here.
        import numpy as np
        import soundfile as sf  # type: ignore
        import torch  # type: ignore

        audio, sr = sf.read(path, always_2d=False)
        if audio.ndim > 1:
            audio = audio.mean(axis=1)
        audio = np.asarray(audio, dtype=np.float32)
        if sr != 16000:
            # Simple linear resample — good enough for 16kHz downsampling
            # from 44.1/48kHz, and we expect 16kHz inputs in practice.
            ratio = 16000 / float(sr)
            n = int(round(len(audio) * ratio))
            audio = np.interp(
                np.linspace(0, len(audio), n, endpoint=False),
                np.arange(len(audio)),
                audio,
            ).astype(np.float32)
        return torch.from_numpy(audio).unsqueeze(0)  # [1, T]

    def embed(self, audio_path: str) -> list[float]:
        waveform = self._load_waveform(audio_path)
        vec = self._model.encode_batch(waveform).squeeze().detach().cpu().numpy()
        return [float(x) for x in vec]

    def compare(self, audio1: str, audio2: str) -> float:
        return _cosine_distance(self.embed(audio1), self.embed(audio2))

    def analyze(self, audio_path: str, actions):
        # Age / gender / emotion aren't produced by ECAPA-TDNN itself;
        # delegate to AnalysisHead which wraps separate Apache-2.0
        # checkpoints. Returns a single segment spanning the clip —
        # segmentation / diarisation is a future enhancement.
        waveform = self._load_waveform(audio_path)
        mono = waveform.squeeze().detach().cpu().numpy()
        attrs = self._analysis.analyze(audio_path, mono, actions)
        if not attrs:
            raise NotImplementedError(
                "analyze head failed to load — install transformers + torch or pass age_gender_model/emotion_model options"
            )
        duration = float(mono.shape[-1]) / 16000.0 if mono.size else 0.0
        return [dict(start=0.0, end=duration, **attrs)]


class OnnxDirectEngine:
    """Run a pre-exported ONNX speaker encoder (WeSpeaker / 3D-Speaker)."""

    name = "onnx-direct"

    def __init__(self, model_name: str, options: dict[str, str]):
        import onnxruntime as ort  # type: ignore

        # The gallery is expected to have dropped the ONNX file under
        # the models directory; accept either an absolute path or a
        # filename relative to _model_path.
        onnx_path = options.get("model_path") or options.get("onnx")
        if not onnx_path:
            raise ValueError("OnnxDirectEngine requires `model_path: <file.onnx>` in options")
        if not os.path.isabs(onnx_path):
            onnx_path = os.path.join(options.get("_model_path", ""), onnx_path)
        if not os.path.isfile(onnx_path):
            raise FileNotFoundError(f"ONNX model not found: {onnx_path}")

        providers = options.get("providers")
        if providers:
            provider_list = [p.strip() for p in providers.split(",") if p.strip()]
        else:
            provider_list = ["CPUExecutionProvider"]
        self._session = ort.InferenceSession(onnx_path, providers=provider_list)
        self._input_name = self._session.get_inputs()[0].name
        self._expected_sr = int(options.get("sample_rate", "16000"))
        self._analysis = AnalysisHead(options)

    def _load_waveform(self, path: str):
        import numpy as np
        import soundfile as sf  # type: ignore

        audio, sr = sf.read(path, always_2d=False)
        if sr != self._expected_sr:
            # Cheap linear resample — good enough for sanity; callers
            # should pre-resample for production.
            ratio = self._expected_sr / float(sr)
            n = int(round(len(audio) * ratio))
            audio = np.interp(
                np.linspace(0, len(audio), n, endpoint=False),
                np.arange(len(audio)),
                audio,
            )
        if audio.ndim > 1:
            audio = audio.mean(axis=1)
        return audio.astype("float32")

    def embed(self, audio_path: str) -> list[float]:
        import numpy as np

        audio = self._load_waveform(audio_path)
        feed = audio.reshape(1, -1)
        out = self._session.run(None, {self._input_name: feed})
        vec = np.asarray(out[0]).reshape(-1)
        return [float(x) for x in vec]

    def compare(self, audio1: str, audio2: str) -> float:
        return _cosine_distance(self.embed(audio1), self.embed(audio2))

    def analyze(self, audio_path: str, actions):
        # AnalysisHead expects 16kHz mono; _load_waveform already
        # resamples to self._expected_sr. If the user configured a
        # non-16k expected rate, resample one more time for analyze.
        audio = self._load_waveform(audio_path)
        if self._expected_sr != 16000:
            import numpy as np

            ratio = 16000 / float(self._expected_sr)
            n = int(round(len(audio) * ratio))
            audio = np.interp(
                np.linspace(0, len(audio), n, endpoint=False),
                np.arange(len(audio)),
                audio,
            ).astype("float32")
        attrs = self._analysis.analyze(audio_path, audio, actions)
        if not attrs:
            raise NotImplementedError(
                "analyze head failed to load — install transformers + torch or pass age_gender_model/emotion_model options"
            )
        duration = float(len(audio)) / 16000.0 if len(audio) else 0.0
        return [dict(start=0.0, end=duration, **attrs)]


def build_engine(model_name: str, options: dict[str, str]) -> tuple[SpeakerEngine, str]:
    """Pick an engine based on the options. ONNX path takes priority:
    if the gallery has dropped a `model_path:` or `onnx:` option, run
    the direct ONNX engine. Otherwise, fall back to SpeechBrain.
    """
    engine_kind = (options.get("engine") or "").lower()
    if engine_kind == "onnx" or options.get("model_path") or options.get("onnx"):
        return OnnxDirectEngine(model_name, options), OnnxDirectEngine.name
    return SpeechBrainEngine(model_name, options), SpeechBrainEngine.name
backend/python/speaker-recognition/install.sh (new executable file, 19 lines)

@@ -0,0 +1,19 @@
#!/bin/bash
set -e

backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
    source $backend_dir/common/libbackend.sh
else
    source $backend_dir/../common/libbackend.sh
fi

installRequirements

# No pre-baked model weights. Weights flow through LocalAI's gallery
# `files:` mechanism — see gallery entries for speechbrain-ecapa-tdnn
# and WeSpeaker / 3D-Speaker ONNX packs. SpeechBrain's
# EncoderClassifier.from_hparams also knows how to auto-download from
# HuggingFace into the configured savedir (we point it at ModelPath),
# so the first LoadModel call bootstraps the checkpoint if the gallery
# flow wasn't used.
backend/python/speaker-recognition/requirements-cpu.txt (new file, 5 lines)

@@ -0,0 +1,5 @@
torch
torchaudio
speechbrain
transformers
onnxruntime

@@ -0,0 +1,5 @@
torch
torchaudio
speechbrain
transformers
onnxruntime-gpu
backend/python/speaker-recognition/requirements.txt (new file, 5 lines)

@@ -0,0 +1,5 @@
grpcio==1.71.0
protobuf
grpcio-tools
numpy
soundfile
backend/python/speaker-recognition/run.sh (new executable file, 9 lines)

@@ -0,0 +1,9 @@
#!/bin/bash
backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
    source $backend_dir/common/libbackend.sh
else
    source $backend_dir/../common/libbackend.sh
fi

startBackend $@
backend/python/speaker-recognition/test.py (new file, 78 lines)

@@ -0,0 +1,78 @@
"""Unit tests for the speaker-recognition gRPC backend.

The servicer is instantiated in-process (no gRPC channel) and driven
directly. The default path exercises SpeechBrain's ECAPA-TDNN — the
first run downloads the checkpoint into a temp savedir. Tests are
skipped gracefully when the heavy optional dependencies (torch /
speechbrain / onnxruntime) are not installed, so the gRPC plumbing
can still be verified on a bare image.
"""
from __future__ import annotations

import importlib.util  # explicit: a bare `import importlib` does not guarantee importlib.util
import os
import sys
import tempfile
import unittest

sys.path.insert(0, os.path.dirname(__file__))

import backend_pb2  # noqa: E402

from backend import BackendServicer  # noqa: E402


def _have(*mods: str) -> bool:
    for m in mods:
        if importlib.util.find_spec(m) is None:
            return False
    return True


class _FakeCtx:
    """Minimal stand-in for a gRPC servicer context."""

    def __init__(self) -> None:
        self.code = None
        self.details = ""

    def set_code(self, c):
        self.code = c

    def set_details(self, d):
        self.details = d


class ServicerPlumbingTest(unittest.TestCase):
    """Checks that LoadModel returns a clear error when no engine deps
    are installed, and that Voice* calls on an uninitialised servicer
    surface FAILED_PRECONDITION — both verifying the gRPC wiring
    without requiring SpeechBrain or ONNX at test time."""

    def test_pre_load_voice_calls_are_rejected(self):
        svc = BackendServicer()
        ctx = _FakeCtx()
        svc.VoiceVerify(backend_pb2.VoiceVerifyRequest(audio1="/tmp/a.wav", audio2="/tmp/b.wav"), ctx)
        self.assertEqual(str(ctx.code), "StatusCode.FAILED_PRECONDITION")

    def test_load_without_deps_fails_cleanly(self):
        svc = BackendServicer()
        req = backend_pb2.ModelOptions(Model="speechbrain/spkrec-ecapa-voxceleb", ModelPath="")
        result = svc.LoadModel(req, _FakeCtx())
        # Either the deps are installed and it loaded, or they aren't
        # and we got a structured error instead of a crash.
        self.assertTrue(result.success or "engine init failed" in result.message)


@unittest.skipUnless(_have("speechbrain", "torch", "torchaudio"), "speechbrain / torch missing")
class SpeechBrainEngineSmokeTest(unittest.TestCase):
    def test_load_and_embed(self):
        svc = BackendServicer()
        with tempfile.TemporaryDirectory() as td:
            req = backend_pb2.ModelOptions(Model="speechbrain/spkrec-ecapa-voxceleb", ModelPath=td)
            result = svc.LoadModel(req, _FakeCtx())
            self.assertTrue(result.success, result.message)


if __name__ == "__main__":
    unittest.main()
11
backend/python/speaker-recognition/test.sh
Executable file
@@ -0,0 +1,11 @@
#!/bin/bash
set -e

backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
    source $backend_dir/common/libbackend.sh
else
    source $backend_dir/../common/libbackend.sh
fi

runUnittests
@@ -386,6 +386,27 @@ impl Backend for KokorosService {
        Err(Status::unimplemented("Not supported"))
    }

    async fn voice_verify(
        &self,
        _: Request<backend::VoiceVerifyRequest>,
    ) -> Result<Response<backend::VoiceVerifyResponse>, Status> {
        Err(Status::unimplemented("Not supported"))
    }

    async fn voice_analyze(
        &self,
        _: Request<backend::VoiceAnalyzeRequest>,
    ) -> Result<Response<backend::VoiceAnalyzeResponse>, Status> {
        Err(Status::unimplemented("Not supported"))
    }

    async fn voice_embed(
        &self,
        _: Request<backend::VoiceEmbedRequest>,
    ) -> Result<Response<backend::VoiceEmbedResponse>, Status> {
        Err(Status::unimplemented("Not supported"))
    }

    async fn stores_set(
        &self,
        _: Request<backend::StoresSetOptions>,
@@ -14,6 +14,7 @@ import (
    "github.com/mudler/LocalAI/core/services/facerecognition"
    "github.com/mudler/LocalAI/core/services/galleryop"
    "github.com/mudler/LocalAI/core/services/nodes"
    "github.com/mudler/LocalAI/core/services/voicerecognition"
    "github.com/mudler/LocalAI/core/templates"
    pkggrpc "github.com/mudler/LocalAI/pkg/grpc"
    "github.com/mudler/LocalAI/pkg/model"
@@ -29,6 +30,12 @@ import (
// family per deployment; we keep the door open instead.
const faceEmbeddingDim = 0

// voiceEmbeddingDim is the expected dimension for speaker embeddings.
// 0 so the Registry accepts whatever dim the loaded recognizer
// produces — ECAPA-TDNN is 192, WeSpeaker ResNet34 is 256, 3D-Speaker
// ERes2Net is 192, CAM++ is 512.
const voiceEmbeddingDim = 0

type Application struct {
    backendLoader *config.ModelConfigLoader
    modelLoader   *model.ModelLoader
@@ -39,6 +46,7 @@ type Application struct {
    agentJobService  *agentpool.AgentJobService
    agentPoolService atomic.Pointer[agentpool.AgentPoolService]
    faceRegistry     facerecognition.Registry
    voiceRegistry    voicerecognition.Registry
    authDB           *gorm.DB
    watchdogMutex    sync.Mutex
    watchdogStop     chan bool
@@ -78,6 +86,14 @@ func newApplication(appConfig *config.ApplicationConfig) *Application {
    }
    app.faceRegistry = facerecognition.NewStoreRegistry(faceStoreResolver, "", faceEmbeddingDim)

    // Voice (speaker) recognition registry — same plumbing, separate
    // registry so embedding spaces stay isolated (a face vector and a
    // speaker vector are not comparable).
    voiceStoreResolver := func(_ context.Context, storeName string) (pkggrpc.Backend, error) {
        return corebackend.StoreBackend(ml, appConfig, storeName, "")
    }
    app.voiceRegistry = voicerecognition.NewStoreRegistry(voiceStoreResolver, "", voiceEmbeddingDim)

    return app
}

@@ -130,6 +146,14 @@ func (a *Application) FaceRegistry() facerecognition.Registry {
    return a.faceRegistry
}

// VoiceRegistry returns the voice (speaker) recognition registry used
// for 1:N identification. Same in-memory local-store backing as
// FaceRegistry but a separate instance — voice embeddings live in
// their own vector space.
func (a *Application) VoiceRegistry() voicerecognition.Registry {
    return a.voiceRegistry
}

// AuthDB returns the auth database connection, or nil if auth is not enabled.
func (a *Application) AuthDB() *gorm.DB {
    return a.authDB
58
core/backend/voice_analyze.go
Normal file
@@ -0,0 +1,58 @@
package backend

import (
    "context"
    "fmt"
    "time"

    "github.com/mudler/LocalAI/core/config"
    "github.com/mudler/LocalAI/core/trace"
    "github.com/mudler/LocalAI/pkg/grpc/proto"
    "github.com/mudler/LocalAI/pkg/model"
)

func VoiceAnalyze(
    audio string,
    actions []string,
    loader *model.ModelLoader,
    appConfig *config.ApplicationConfig,
    modelConfig config.ModelConfig,
) (*proto.VoiceAnalyzeResponse, error) {
    opts := ModelOptions(modelConfig, appConfig)
    voiceModel, err := loader.Load(opts...)
    if err != nil {
        recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
        return nil, err
    }
    if voiceModel == nil {
        return nil, fmt.Errorf("could not load voice recognition model")
    }

    var startTime time.Time
    if appConfig.EnableTracing {
        trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
        startTime = time.Now()
    }

    res, err := voiceModel.VoiceAnalyze(context.Background(), &proto.VoiceAnalyzeRequest{
        Audio:   audio,
        Actions: actions,
    })

    if appConfig.EnableTracing {
        errStr := ""
        if err != nil {
            errStr = err.Error()
        }
        trace.RecordBackendTrace(trace.BackendTrace{
            Timestamp: startTime,
            Duration:  time.Since(startTime),
            Type:      trace.BackendTraceVoiceAnalyze,
            ModelName: modelConfig.Name,
            Backend:   modelConfig.Backend,
            Error:     errStr,
        })
    }

    return res, err
}
66
core/backend/voice_embed.go
Normal file
@@ -0,0 +1,66 @@
package backend

import (
    "context"
    "fmt"
    "time"

    "github.com/mudler/LocalAI/core/config"
    "github.com/mudler/LocalAI/core/trace"
    "github.com/mudler/LocalAI/pkg/grpc/proto"
    "github.com/mudler/LocalAI/pkg/model"
)

// VoiceEmbed returns a speaker embedding (typically 192-d for ECAPA-TDNN)
// for the audio file at audioPath. Unlike ModelEmbedding (which is
// OpenAI-compatible and text-only), this call takes an audio path and
// returns the backend's speaker-encoder output.
func VoiceEmbed(
    audioPath string,
    loader *model.ModelLoader,
    appConfig *config.ApplicationConfig,
    modelConfig config.ModelConfig,
) (*proto.VoiceEmbedResponse, error) {
    opts := ModelOptions(modelConfig, appConfig)
    voiceModel, err := loader.Load(opts...)
    if err != nil {
        recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
        return nil, err
    }
    if voiceModel == nil {
        return nil, fmt.Errorf("could not load voice recognition model")
    }

    var startTime time.Time
    if appConfig.EnableTracing {
        trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
        startTime = time.Now()
    }

    res, err := voiceModel.VoiceEmbed(context.Background(), &proto.VoiceEmbedRequest{
        Audio: audioPath,
    })

    if appConfig.EnableTracing {
        errStr := ""
        if err != nil {
            errStr = err.Error()
        }
        trace.RecordBackendTrace(trace.BackendTrace{
            Timestamp: startTime,
            Duration:  time.Since(startTime),
            Type:      trace.BackendTraceVoiceEmbed,
            ModelName: modelConfig.Name,
            Backend:   modelConfig.Backend,
            Error:     errStr,
        })
    }

    if err != nil {
        return nil, err
    }
    if res == nil || len(res.Embedding) == 0 {
        return nil, fmt.Errorf("voice embedding returned empty vector (no speech detected?)")
    }
    return res, nil
}
61
core/backend/voice_verify.go
Normal file
@@ -0,0 +1,61 @@
package backend

import (
    "context"
    "fmt"
    "time"

    "github.com/mudler/LocalAI/core/config"
    "github.com/mudler/LocalAI/core/trace"
    "github.com/mudler/LocalAI/pkg/grpc/proto"
    "github.com/mudler/LocalAI/pkg/model"
)

func VoiceVerify(
    audio1, audio2 string,
    threshold float32,
    antiSpoofing bool,
    loader *model.ModelLoader,
    appConfig *config.ApplicationConfig,
    modelConfig config.ModelConfig,
) (*proto.VoiceVerifyResponse, error) {
    opts := ModelOptions(modelConfig, appConfig)
    voiceModel, err := loader.Load(opts...)
    if err != nil {
        recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
        return nil, err
    }
    if voiceModel == nil {
        return nil, fmt.Errorf("could not load voice recognition model")
    }

    var startTime time.Time
    if appConfig.EnableTracing {
        trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
        startTime = time.Now()
    }

    res, err := voiceModel.VoiceVerify(context.Background(), &proto.VoiceVerifyRequest{
        Audio1:       audio1,
        Audio2:       audio2,
        Threshold:    threshold,
        AntiSpoofing: antiSpoofing,
    })

    if appConfig.EnableTracing {
        errStr := ""
        if err != nil {
            errStr = err.Error()
        }
        trace.RecordBackendTrace(trace.BackendTrace{
            Timestamp: startTime,
            Duration:  time.Since(startTime),
            Type:      trace.BackendTraceVoiceVerify,
            ModelName: modelConfig.Name,
            Backend:   modelConfig.Backend,
            Error:     errStr,
        })
    }

    return res, err
}
@@ -588,7 +588,8 @@ const (
    FLAG_VAD       ModelConfigUsecase = 0b010000000000
    FLAG_VIDEO     ModelConfigUsecase = 0b100000000000
    FLAG_DETECTION ModelConfigUsecase = 0b1000000000000
    FLAG_FACE_RECOGNITION    ModelConfigUsecase = 0b10000000000000
    FLAG_SPEAKER_RECOGNITION ModelConfigUsecase = 0b100000000000000

    // Common Subsets
    FLAG_LLM ModelConfigUsecase = FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT
@@ -612,7 +613,8 @@ func GetAllModelConfigUsecases() map[string]ModelConfigUsecase {
        "FLAG_LLM":       FLAG_LLM,
        "FLAG_VIDEO":     FLAG_VIDEO,
        "FLAG_DETECTION": FLAG_DETECTION,
        "FLAG_FACE_RECOGNITION":    FLAG_FACE_RECOGNITION,
        "FLAG_SPEAKER_RECOGNITION": FLAG_SPEAKER_RECOGNITION,
    }
}

@@ -653,7 +655,7 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
    nonTextGenBackends := []string{
        "whisper", "piper", "kokoro",
        "diffusers", "stablediffusion", "stablediffusion-ggml",
        "rerankers", "silero-vad", "rfdetr", "insightface", "speaker-recognition",
        "transformers-musicgen", "ace-step", "acestep-cpp",
    }

@@ -743,6 +745,13 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
        }
    }

    if (u & FLAG_SPEAKER_RECOGNITION) == FLAG_SPEAKER_RECOGNITION {
        speakerBackends := []string{"speaker-recognition"}
        if !slices.Contains(speakerBackends, c.Backend) {
            return false
        }
    }

    if (u & FLAG_SOUND_GENERATION) == FLAG_SOUND_GENERATION {
        soundGenBackends := []string{"transformers-musicgen", "ace-step", "acestep-cpp", "mock-backend"}
        if !slices.Contains(soundGenBackends, c.Backend) {
@@ -65,6 +65,14 @@ var RouteFeatureRegistry = []RouteFeature{
    {"POST", "/v1/face/identify", FeatureFaceRecognition},
    {"POST", "/v1/face/forget", FeatureFaceRecognition},

    // Voice (speaker) recognition
    {"POST", "/v1/voice/verify", FeatureVoiceRecognition},
    {"POST", "/v1/voice/analyze", FeatureVoiceRecognition},
    {"POST", "/v1/voice/embed", FeatureVoiceRecognition},
    {"POST", "/v1/voice/register", FeatureVoiceRecognition},
    {"POST", "/v1/voice/identify", FeatureVoiceRecognition},
    {"POST", "/v1/voice/forget", FeatureVoiceRecognition},

    // Video
    {"POST", "/video", FeatureVideo},

@@ -160,5 +168,6 @@ func APIFeatureMetas() []FeatureMeta {
        {FeatureMCP, "MCP", true},
        {FeatureStores, "Stores", true},
        {FeatureFaceRecognition, "Face Recognition", true},
        {FeatureVoiceRecognition, "Voice Recognition", true},
    }
}
@@ -52,6 +52,7 @@ const (
    FeatureMCP              = "mcp"
    FeatureStores           = "stores"
    FeatureFaceRecognition  = "face_recognition"
    FeatureVoiceRecognition = "voice_recognition"
)

// AgentFeatures lists agent-related features (default OFF).
@@ -65,7 +66,7 @@ var APIFeatures = []string{
    FeatureChat, FeatureImages, FeatureAudioSpeech, FeatureAudioTranscription,
    FeatureVAD, FeatureDetection, FeatureVideo, FeatureEmbeddings, FeatureSound,
    FeatureRealtime, FeatureRerank, FeatureTokenize, FeatureMCP, FeatureStores,
    FeatureFaceRecognition, FeatureVoiceRecognition,
}

// AllFeatures lists all known features (used by UI and validation).
@@ -79,6 +79,12 @@ var instructionDefs = []instructionDef{
        Tags:  []string{"face-recognition"},
        Intro: "The /v1/face/register, /identify, and /forget endpoints build on a vector store — registrations are in-memory by default and lost on restart. Use /v1/face/embed for a raw embedding; /v1/embeddings is OpenAI-compatible and text-only.",
    },
    {
        Name:        "voice-recognition",
        Description: "Speaker verification (1:1), embedding, and demographic analysis from voice",
        Tags:        []string{"voice-recognition"},
        Intro:       "Voice (speaker) recognition — the audio analog to /v1/face/*. Use /v1/voice/verify for 1:1 speaker comparison, /v1/voice/identify for 1:N match against the registered store, /v1/voice/{register,forget} to manage that store, /v1/voice/embed for a raw speaker-encoder vector, and /v1/voice/analyze for age / gender / emotion inferred from speech. Registrations are in-memory by default and lost on restart. Audio inputs accept URL, base64, or data-URI; /v1/embeddings remains text-only.",
    },
}

// swaggerState holds parsed swagger spec data, initialised once.
82
core/http/endpoints/localai/audio.go
Normal file
@@ -0,0 +1,82 @@
package localai

import (
    "encoding/base64"
    "fmt"
    "io"
    "net/http"
    "os"
    "regexp"
    "strings"
    "time"

    "github.com/labstack/echo/v4"
    "github.com/mudler/LocalAI/pkg/utils"
)

var audioDataURIPattern = regexp.MustCompile(`^data:([^;]+);base64,`)

var audioDownloadClient = http.Client{Timeout: 30 * time.Second}

// decodeAudioInput materialises a URL / data-URI / raw-base64 audio
// payload to a temporary file and returns its path plus a cleanup
// function. Voice backends expect a filesystem path (same convention
// as TranscriptRequest.dst) — callers must defer the returned cleanup
// so the temp file does not leak.
//
// Bad inputs (invalid URL, undecodable base64, non-audio payload) are
// surfaced as 400 Bad Request rather than 500 so API consumers can
// distinguish a client mistake from a server failure.
func decodeAudioInput(s string) (string, func(), error) {
    if s == "" {
        return "", func() {}, echo.NewHTTPError(http.StatusBadRequest, "audio input is empty")
    }

    var raw []byte
    switch {
    case strings.HasPrefix(s, "http://") || strings.HasPrefix(s, "https://"):
        if err := utils.ValidateExternalURL(s); err != nil {
            return "", func() {}, echo.NewHTTPError(http.StatusBadRequest, fmt.Sprintf("invalid audio URL: %v", err))
        }
        resp, err := audioDownloadClient.Get(s)
        if err != nil {
            return "", func() {}, echo.NewHTTPError(http.StatusBadRequest, fmt.Sprintf("audio download failed: %v", err))
        }
        defer resp.Body.Close()
        raw, err = io.ReadAll(resp.Body)
        if err != nil {
            return "", func() {}, echo.NewHTTPError(http.StatusBadRequest, fmt.Sprintf("audio download read failed: %v", err))
        }
    default:
        payload := s
        if m := audioDataURIPattern.FindString(s); m != "" {
            payload = strings.Replace(s, m, "", 1)
        }
        decoded, err := base64.StdEncoding.DecodeString(payload)
        if err != nil {
            return "", func() {}, echo.NewHTTPError(http.StatusBadRequest, fmt.Sprintf("invalid audio base64: %v", err))
        }
        raw = decoded
    }

    if len(raw) == 0 {
        return "", func() {}, echo.NewHTTPError(http.StatusBadRequest, "audio input decoded to zero bytes")
    }

    f, err := os.CreateTemp("", "localai-voice-*.wav")
    if err != nil {
        return "", func() {}, err
    }
    path := f.Name()
    cleanup := func() { _ = os.Remove(path) }
    if _, err := f.Write(raw); err != nil {
        f.Close()
        cleanup()
        return "", func() {}, err
    }
    if err := f.Close(); err != nil {
        cleanup()
        return "", func() {}, err
    }
    return path, cleanup, nil
}
60
core/http/endpoints/localai/voice_analyze.go
Normal file
@@ -0,0 +1,60 @@
package localai

import (
    "net/http"

    "github.com/labstack/echo/v4"
    "github.com/mudler/LocalAI/core/backend"
    "github.com/mudler/LocalAI/core/config"
    "github.com/mudler/LocalAI/core/http/middleware"
    "github.com/mudler/LocalAI/core/schema"
    "github.com/mudler/LocalAI/pkg/model"
    "github.com/mudler/xlog"
)

// VoiceAnalyzeEndpoint returns demographic attributes inferred from speech.
// @Summary Analyze demographic attributes (age, gender, emotion) from a voice clip.
// @Tags voice-recognition
// @Param request body schema.VoiceAnalyzeRequest true "query params"
// @Success 200 {object} schema.VoiceAnalyzeResponse "Response"
// @Router /v1/voice/analyze [post]
func VoiceAnalyzeEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) echo.HandlerFunc {
    return func(c echo.Context) error {
        input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.VoiceAnalyzeRequest)
        if !ok || input.Model == "" {
            return echo.ErrBadRequest
        }
        cfg, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
        if !ok || cfg == nil {
            return echo.ErrBadRequest
        }

        audio, cleanup, err := decodeAudioInput(input.Audio)
        if err != nil {
            return err
        }
        defer cleanup()

        xlog.Debug("VoiceAnalyze", "model", cfg.Name, "backend", cfg.Backend, "actions", input.Actions)
        res, err := backend.VoiceAnalyze(audio, input.Actions, ml, appConfig, *cfg)
        if err != nil {
            return mapBackendError(err)
        }

        response := schema.VoiceAnalyzeResponse{
            Segments: make([]schema.VoiceAnalysis, len(res.GetSegments())),
        }
        for i, s := range res.GetSegments() {
            response.Segments[i] = schema.VoiceAnalysis{
                Start:           s.GetStart(),
                End:             s.GetEnd(),
                Age:             s.GetAge(),
                DominantGender:  s.GetDominantGender(),
                Gender:          s.GetGender(),
                DominantEmotion: s.GetDominantEmotion(),
                Emotion:         s.GetEmotion(),
            }
        }
        return c.JSON(http.StatusOK, response)
    }
}
54
core/http/endpoints/localai/voice_embed.go
Normal file
@@ -0,0 +1,54 @@
package localai

import (
    "net/http"

    "github.com/labstack/echo/v4"
    "github.com/mudler/LocalAI/core/backend"
    "github.com/mudler/LocalAI/core/config"
    "github.com/mudler/LocalAI/core/http/middleware"
    "github.com/mudler/LocalAI/core/schema"
    "github.com/mudler/LocalAI/pkg/model"
    "github.com/mudler/xlog"
)

// VoiceEmbedEndpoint extracts a speaker embedding vector from an audio clip.
//
// Distinct from /v1/embeddings, which is OpenAI-compatible and text-only
// by contract. Use this endpoint when you need a speaker-encoder output
// (typically 192-d for ECAPA-TDNN, 256-d for ResNet/WeSpeaker).
//
// @Summary Extract a speaker embedding from an audio clip.
// @Tags voice-recognition
// @Param request body schema.VoiceEmbedRequest true "query params"
// @Success 200 {object} schema.VoiceEmbedResponse "Response"
// @Router /v1/voice/embed [post]
func VoiceEmbedEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) echo.HandlerFunc {
    return func(c echo.Context) error {
        input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.VoiceEmbedRequest)
        if !ok || input.Model == "" {
            return echo.ErrBadRequest
        }
        cfg, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
        if !ok || cfg == nil {
            return echo.ErrBadRequest
        }

        audio, cleanup, err := decodeAudioInput(input.Audio)
        if err != nil {
            return err
        }
        defer cleanup()

        xlog.Debug("VoiceEmbed", "model", cfg.Name, "backend", cfg.Backend)
        res, err := backend.VoiceEmbed(audio, ml, appConfig, *cfg)
        if err != nil {
            return mapBackendError(err)
        }
        return c.JSON(http.StatusOK, schema.VoiceEmbedResponse{
            Embedding: res.GetEmbedding(),
            Dim:       len(res.GetEmbedding()),
            Model:     res.GetModel(),
        })
    }
}
45
core/http/endpoints/localai/voice_forget.go
Normal file
@@ -0,0 +1,45 @@
package localai

import (
    "errors"
    "net/http"

    "github.com/labstack/echo/v4"
    "github.com/mudler/LocalAI/core/http/middleware"
    "github.com/mudler/LocalAI/core/schema"
    "github.com/mudler/LocalAI/core/services/voicerecognition"
    "github.com/mudler/xlog"
)

// VoiceForgetEndpoint removes a previously-registered speaker by ID.
// @Summary Remove a previously-registered speaker by ID.
// @Tags voice-recognition
// @Param request body schema.VoiceForgetRequest true "query params"
// @Success 204 "No Content"
// @Router /v1/voice/forget [post]
func VoiceForgetEndpoint(registry voicerecognition.Registry) echo.HandlerFunc {
    return func(c echo.Context) error {
        input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.VoiceForgetRequest)
        if !ok {
            // Forget doesn't load a model — fall back to a raw bind when
            // the request extractor hasn't run (route registered without
            // SetModelAndConfig).
            input = new(schema.VoiceForgetRequest)
            if err := c.Bind(input); err != nil {
                return echo.ErrBadRequest
            }
        }
        if input.ID == "" {
            return echo.NewHTTPError(http.StatusBadRequest, "id is required")
        }

        xlog.Debug("VoiceForget", "id", input.ID)
        if err := registry.Forget(c.Request().Context(), input.ID); err != nil {
            if errors.Is(err, voicerecognition.ErrNotFound) {
                return echo.NewHTTPError(http.StatusNotFound, err.Error())
            }
            return err
        }
        return c.NoContent(http.StatusNoContent)
    }
}
82
core/http/endpoints/localai/voice_identify.go
Normal file
@@ -0,0 +1,82 @@
package localai

import (
    "cmp"
    "net/http"

    "github.com/labstack/echo/v4"
    "github.com/mudler/LocalAI/core/backend"
    "github.com/mudler/LocalAI/core/config"
    "github.com/mudler/LocalAI/core/http/middleware"
    "github.com/mudler/LocalAI/core/schema"
    "github.com/mudler/LocalAI/core/services/voicerecognition"
    "github.com/mudler/LocalAI/pkg/model"
    "github.com/mudler/xlog"
)

// defaultVoiceIdentifyThreshold is the cosine-distance cutoff applied
// when the client does not specify one. Tuned for ECAPA-TDNN on
// VoxCeleb (EER ~1.9%). Other recognizers (WeSpeaker, ERes2Net) may
// need overrides.
const defaultVoiceIdentifyThreshold = float32(0.25)

// VoiceIdentifyEndpoint runs 1:N identification against the registered store.
// @Summary Identify a speaker against the registered database (1:N recognition).
// @Tags voice-recognition
// @Param request body schema.VoiceIdentifyRequest true "query params"
// @Success 200 {object} schema.VoiceIdentifyResponse "Response"
// @Router /v1/voice/identify [post]
func VoiceIdentifyEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig, registry voicerecognition.Registry) echo.HandlerFunc {
    return func(c echo.Context) error {
        input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.VoiceIdentifyRequest)
        if !ok || input.Model == "" {
            return echo.ErrBadRequest
        }
        cfg, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
        if !ok || cfg == nil {
            return echo.ErrBadRequest
        }

        audio, cleanup, err := decodeAudioInput(input.Audio)
        if err != nil {
            return err
        }
        defer cleanup()

        topK := cmp.Or(input.TopK, 5)
        threshold := cmp.Or(input.Threshold, defaultVoiceIdentifyThreshold)

        xlog.Debug("VoiceIdentify", "model", cfg.Name, "topK", topK, "threshold", threshold)
        embed, err := backend.VoiceEmbed(audio, ml, appConfig, *cfg)
        if err != nil {
            return mapBackendError(err)
        }

        matches, err := registry.Identify(c.Request().Context(), embed.GetEmbedding(), topK)
        if err != nil {
            return err
        }

        response := schema.VoiceIdentifyResponse{
            Matches: make([]schema.VoiceIdentifyMatch, len(matches)),
        }
        for i, m := range matches {
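            // Heuristic confidence: map distance linearly onto [0, 100],
            // so a distance of 0 scores 100 and a distance equal to the
            // threshold scores 0, clamping values outside that range.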
            confidence := (1 - m.Distance/threshold) * 100
            if confidence < 0 {
                confidence = 0
            }
            if confidence > 100 {
                confidence = 100
            }
            response.Matches[i] = schema.VoiceIdentifyMatch{
                ID:         m.ID,
                Name:       m.Metadata.Name,
                Labels:     m.Metadata.Labels,
                Distance:   m.Distance,
                Confidence: confidence,
                Match:      m.Distance <= threshold,
            }
        }
        return c.JSON(http.StatusOK, response)
    }
}
61
core/http/endpoints/localai/voice_register.go
Normal file
@@ -0,0 +1,61 @@
package localai

import (
    "net/http"

    "github.com/labstack/echo/v4"
    "github.com/mudler/LocalAI/core/backend"
    "github.com/mudler/LocalAI/core/config"
    "github.com/mudler/LocalAI/core/http/middleware"
    "github.com/mudler/LocalAI/core/schema"
    "github.com/mudler/LocalAI/core/services/voicerecognition"
    "github.com/mudler/LocalAI/pkg/model"
    "github.com/mudler/xlog"
)

// VoiceRegisterEndpoint enrolls a speaker into the 1:N identification store.
// @Summary Register a speaker for 1:N identification.
// @Tags voice-recognition
// @Param request body schema.VoiceRegisterRequest true "query params"
// @Success 200 {object} schema.VoiceRegisterResponse "Response"
// @Router /v1/voice/register [post]
func VoiceRegisterEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig, registry voicerecognition.Registry) echo.HandlerFunc {
    return func(c echo.Context) error {
        input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.VoiceRegisterRequest)
        if !ok || input.Model == "" {
            return echo.ErrBadRequest
        }
        cfg, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
        if !ok || cfg == nil {
            return echo.ErrBadRequest
        }
        if input.Name == "" {
            return echo.NewHTTPError(http.StatusBadRequest, "name is required")
        }

        audio, cleanup, err := decodeAudioInput(input.Audio)
        if err != nil {
            return err
        }
        defer cleanup()

        xlog.Debug("VoiceRegister", "model", cfg.Name, "name", input.Name)
        res, err := backend.VoiceEmbed(audio, ml, appConfig, *cfg)
        if err != nil {
            return mapBackendError(err)
        }

        stored, err := registry.Register(c.Request().Context(), res.GetEmbedding(), voicerecognition.Metadata{
            Name:   input.Name,
            Labels: input.Labels,
        })
        if err != nil {
            return err
        }
        return c.JSON(http.StatusOK, schema.VoiceRegisterResponse{
            ID:           stored.ID,
            Name:         stored.Name,
            RegisteredAt: stored.RegisteredAt,
        })
    }
}
59
core/http/endpoints/localai/voice_verify.go
Normal file
@@ -0,0 +1,59 @@
package localai

import (
    "net/http"

    "github.com/labstack/echo/v4"
    "github.com/mudler/LocalAI/core/backend"
    "github.com/mudler/LocalAI/core/config"
    "github.com/mudler/LocalAI/core/http/middleware"
    "github.com/mudler/LocalAI/core/schema"
    "github.com/mudler/LocalAI/pkg/model"
    "github.com/mudler/xlog"
)

// VoiceVerifyEndpoint compares two audio clips and reports whether they were
// spoken by the same person.
// @Summary Verify that two audio clips were spoken by the same person.
// @Tags voice-recognition
// @Param request body schema.VoiceVerifyRequest true "query params"
// @Success 200 {object} schema.VoiceVerifyResponse "Response"
// @Router /v1/voice/verify [post]
func VoiceVerifyEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) echo.HandlerFunc {
    return func(c echo.Context) error {
        input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.VoiceVerifyRequest)
        if !ok || input.Model == "" {
            return echo.ErrBadRequest
        }
        cfg, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
        if !ok || cfg == nil {
            return echo.ErrBadRequest
        }

        audio1, cleanup1, err := decodeAudioInput(input.Audio1)
        if err != nil {
            return err
        }
        defer cleanup1()
        audio2, cleanup2, err := decodeAudioInput(input.Audio2)
        if err != nil {
            return err
        }
        defer cleanup2()

        xlog.Debug("VoiceVerify", "model", cfg.Name, "backend", cfg.Backend)
        res, err := backend.VoiceVerify(audio1, audio2, input.Threshold, input.AntiSpoofing, ml, appConfig, *cfg)
        if err != nil {
            return mapBackendError(err)
        }

        return c.JSON(http.StatusOK, schema.VoiceVerifyResponse{
            Verified:         res.GetVerified(),
            Distance:         res.GetDistance(),
            Threshold:        res.GetThreshold(),
            Confidence:       res.GetConfidence(),
            Model:            res.GetModel(),
            ProcessingTimeMs: res.GetProcessingTimeMs(),
        })
    }
}
1
core/http/react-ui/src/utils/capabilities.js
vendored
@@ -13,3 +13,4 @@ export const CAP_VAD = 'FLAG_VAD'
export const CAP_VIDEO = 'FLAG_VIDEO'
export const CAP_DETECTION = 'FLAG_DETECTION'
export const CAP_FACE_RECOGNITION = 'FLAG_FACE_RECOGNITION'
export const CAP_SPEAKER_RECOGNITION = 'FLAG_SPEAKER_RECOGNITION'
@@ -120,6 +120,28 @@ func RegisterLocalAIRoutes(router *echo.Echo,
    // Forget does not load a face model — it only needs the registry.
    router.POST("/v1/face/forget", localai.FaceForgetEndpoint(app.FaceRegistry()))

    // Voice (speaker) recognition endpoints
    voiceMw := []echo.MiddlewareFunc{
        requestExtractor.BuildFilteredFirstAvailableDefaultModel(config.BuildUsecaseFilterFn(config.FLAG_SPEAKER_RECOGNITION)),
    }
    router.POST("/v1/voice/verify",
        localai.VoiceVerifyEndpoint(cl, ml, appConfig),
        append(voiceMw, requestExtractor.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.VoiceVerifyRequest) }))...)
    router.POST("/v1/voice/analyze",
        localai.VoiceAnalyzeEndpoint(cl, ml, appConfig),
        append(voiceMw, requestExtractor.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.VoiceAnalyzeRequest) }))...)
    router.POST("/v1/voice/embed",
        localai.VoiceEmbedEndpoint(cl, ml, appConfig),
        append(voiceMw, requestExtractor.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.VoiceEmbedRequest) }))...)
    router.POST("/v1/voice/register",
        localai.VoiceRegisterEndpoint(cl, ml, appConfig, app.VoiceRegistry()),
        append(voiceMw, requestExtractor.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.VoiceRegisterRequest) }))...)
    router.POST("/v1/voice/identify",
        localai.VoiceIdentifyEndpoint(cl, ml, appConfig, app.VoiceRegistry()),
        append(voiceMw, requestExtractor.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.VoiceIdentifyRequest) }))...)
    // Forget does not load a voice model — it only needs the registry.
    router.POST("/v1/voice/forget", localai.VoiceForgetEndpoint(app.VoiceRegistry()))

    ttsHandler := localai.TTSEndpoint(cl, ml, appConfig)
    router.POST("/tts",
        ttsHandler,
@@ -290,6 +290,110 @@ type FaceForgetRequest struct {
    Store string `json:"store,omitempty"`
}

// ─── Voice (speaker) recognition ───────────────────────────────────
//
// VoiceVerifyRequest compares two audio clips and reports whether they
// were spoken by the same speaker. Audio1/Audio2 accept URL, base64,
// or data-URI (the HTTP layer materialises the bytes to a temp file
// before calling the gRPC backend).
type VoiceVerifyRequest struct {
    BasicModelRequest
    Audio1       string  `json:"audio1"`
    Audio2       string  `json:"audio2"`
    Threshold    float32 `json:"threshold,omitempty"`
    AntiSpoofing bool    `json:"anti_spoofing,omitempty"`
}

type VoiceVerifyResponse struct {
    Verified         bool    `json:"verified"`
    Distance         float32 `json:"distance"`
    Threshold        float32 `json:"threshold"`
    Confidence       float32 `json:"confidence"`
    Model            string  `json:"model"`
    ProcessingTimeMs float32 `json:"processing_time_ms,omitempty"`
}

// VoiceAnalyzeRequest asks the backend for demographic attributes
// (age, gender, emotion) inferred from the audio clip.
type VoiceAnalyzeRequest struct {
    BasicModelRequest
    Audio   string   `json:"audio"`
    Actions []string `json:"actions,omitempty"` // subset of {"age","gender","emotion"}
}

type VoiceAnalyzeResponse struct {
    Segments []VoiceAnalysis `json:"segments"`
}

type VoiceAnalysis struct {
    Start           float32            `json:"start"`
    End             float32            `json:"end"`
    Age             float32            `json:"age,omitempty"`
    DominantGender  string             `json:"dominant_gender,omitempty"`
    Gender          map[string]float32 `json:"gender,omitempty"`
    DominantEmotion string             `json:"dominant_emotion,omitempty"`
    Emotion         map[string]float32 `json:"emotion,omitempty"`
}

// VoiceEmbedRequest extracts a speaker embedding from an audio clip.
// Distinct from /v1/embeddings (OpenAI-compatible, text-only) — this
// endpoint accepts URL / base64 / data-URI audio inputs.
type VoiceEmbedRequest struct {
    BasicModelRequest
    Audio string `json:"audio"`
}

type VoiceEmbedResponse struct {
    Embedding []float32 `json:"embedding"`
    Dim       int       `json:"dim"`
    Model     string    `json:"model,omitempty"`
}

// VoiceRegisterRequest enrolls a speaker into the 1:N identification store.
type VoiceRegisterRequest struct {
    BasicModelRequest
    Audio  string            `json:"audio"`
    Name   string            `json:"name"`
    Labels map[string]string `json:"labels,omitempty"`
    Store  string            `json:"store,omitempty"`
}

type VoiceRegisterResponse struct {
    ID           string    `json:"id"`
    Name         string    `json:"name"`
    RegisteredAt time.Time `json:"registered_at"`
}

// VoiceIdentifyRequest runs 1:N recognition: embed the probe and
// return the top-K nearest registered speakers.
type VoiceIdentifyRequest struct {
    BasicModelRequest
    Audio     string  `json:"audio"`
    TopK      int     `json:"top_k,omitempty"`
    Threshold float32 `json:"threshold,omitempty"`
    Store     string  `json:"store,omitempty"`
}

type VoiceIdentifyResponse struct {
    Matches []VoiceIdentifyMatch `json:"matches"`
}

type VoiceIdentifyMatch struct {
    ID         string            `json:"id"`
    Name       string            `json:"name"`
    Labels     map[string]string `json:"labels,omitempty"`
    Distance   float32           `json:"distance"`
    Confidence float32           `json:"confidence"`
    Match      bool              `json:"match"`
}

// VoiceForgetRequest removes a previously-registered speaker by ID.
type VoiceForgetRequest struct {
    BasicModelRequest
    ID    string `json:"id"`
    Store string `json:"store,omitempty"`
}

type ImportModelRequest struct {
    URI         string          `json:"uri"`
    Preferences json.RawMessage `json:"preferences,omitempty"`
@@ -174,6 +174,15 @@ func (c *fakeBackendClient) FaceVerify(_ context.Context, _ *pb.FaceVerifyReques
func (c *fakeBackendClient) FaceAnalyze(_ context.Context, _ *pb.FaceAnalyzeRequest, _ ...ggrpc.CallOption) (*pb.FaceAnalyzeResponse, error) {
    return nil, nil
}
func (c *fakeBackendClient) VoiceVerify(_ context.Context, _ *pb.VoiceVerifyRequest, _ ...ggrpc.CallOption) (*pb.VoiceVerifyResponse, error) {
    return nil, nil
}
func (c *fakeBackendClient) VoiceAnalyze(_ context.Context, _ *pb.VoiceAnalyzeRequest, _ ...ggrpc.CallOption) (*pb.VoiceAnalyzeResponse, error) {
    return nil, nil
}
func (c *fakeBackendClient) VoiceEmbed(_ context.Context, _ *pb.VoiceEmbedRequest, _ ...ggrpc.CallOption) (*pb.VoiceEmbedResponse, error) {
    return nil, nil
}
func (c *fakeBackendClient) AudioTranscription(_ context.Context, _ *pb.TranscriptRequest, _ ...ggrpc.CallOption) (*pb.TranscriptResult, error) {
    return nil, nil
}
@@ -99,6 +99,18 @@ func (f *fakeGRPCBackend) FaceAnalyze(_ context.Context, _ *pb.FaceAnalyzeReques
    return &pb.FaceAnalyzeResponse{}, nil
}

func (f *fakeGRPCBackend) VoiceVerify(_ context.Context, _ *pb.VoiceVerifyRequest, _ ...ggrpc.CallOption) (*pb.VoiceVerifyResponse, error) {
    return &pb.VoiceVerifyResponse{}, nil
}

func (f *fakeGRPCBackend) VoiceAnalyze(_ context.Context, _ *pb.VoiceAnalyzeRequest, _ ...ggrpc.CallOption) (*pb.VoiceAnalyzeResponse, error) {
    return &pb.VoiceAnalyzeResponse{}, nil
}

func (f *fakeGRPCBackend) VoiceEmbed(_ context.Context, _ *pb.VoiceEmbedRequest, _ ...ggrpc.CallOption) (*pb.VoiceEmbedResponse, error) {
    return &pb.VoiceEmbedResponse{}, nil
}

func (f *fakeGRPCBackend) AudioTranscription(_ context.Context, _ *pb.TranscriptRequest, _ ...ggrpc.CallOption) (*pb.TranscriptResult, error) {
    return &pb.TranscriptResult{}, nil
}
58
core/services/voicerecognition/registry.go
Normal file
@@ -0,0 +1,58 @@
// Package voicerecognition provides a swappable backing store for
// speaker embeddings and the 1:N identification pipeline on top of it.
//
// Mirrors the facerecognition package — the audio analog. The current
// implementation (NewStoreRegistry) is backed by LocalAI's in-memory
// local-store gRPC backend, so all registrations are lost on restart.
//
// TODO: share a persistent pgvector-backed implementation with
// facerecognition once the first one lands. The Registry interface
// here is intentionally identical in shape, so a shared generic
// biometric registry can replace both without HTTP-handler churn.
package voicerecognition

import (
    "context"
    "errors"
    "time"
)

// Registry stores speaker embeddings keyed by an opaque ID and
// supports approximate similarity search. Implementations are expected
// to be safe for concurrent use.
type Registry interface {
    // Register stores a speaker embedding alongside its metadata.
    // Returns the stored metadata with ID and RegisteredAt populated.
    Register(ctx context.Context, embedding []float32, meta Metadata) (Metadata, error)

    // Identify returns up to topK matches for the probe embedding,
    // sorted by ascending distance (closest first).
    Identify(ctx context.Context, probe []float32, topK int) ([]Match, error)

    // Forget removes a previously-registered embedding by ID.
    // Returns ErrNotFound if the ID is unknown.
    Forget(ctx context.Context, id string) error
}

// Metadata is the user-supplied payload stored alongside a speaker embedding.
type Metadata struct {
    // ID is populated by the registry at Register time; callers must not set it.
    ID           string            `json:"id"`
    Name         string            `json:"name"`
    Labels       map[string]string `json:"labels,omitempty"`
    RegisteredAt time.Time         `json:"registered_at"`
}

// Match is a single result from Identify, ranked by similarity.
type Match struct {
    ID       string
    Metadata Metadata
    Distance float32 // 1 - cosine_similarity; lower = closer
}

// Sentinel errors; callers should compare with errors.Is.
var (
    ErrNotFound          = errors.New("voicerecognition: id not found")
    ErrEmptyEmbedding    = errors.New("voicerecognition: embedding is empty")
    ErrDimensionMismatch = errors.New("voicerecognition: embedding dimension mismatch")
)
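// Usage sketch (names illustrative, error handling elided):
//
//     meta, _ := registry.Register(ctx, embedding, voicerecognition.Metadata{Name: "Alice"})
//     matches, _ := registry.Identify(ctx, probe, 5) // closest first
//     _ = registry.Forget(ctx, meta.ID)              // ErrNotFound if unknown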
138
core/services/voicerecognition/store_registry.go
Normal file
@@ -0,0 +1,138 @@
package voicerecognition

import (
    "context"
    "encoding/json"
    "fmt"
    "sort"
    "sync"
    "time"

    "github.com/google/uuid"

    "github.com/mudler/LocalAI/pkg/grpc"
    "github.com/mudler/LocalAI/pkg/store"
)

// StoreResolver resolves a named vector store to a gRPC backend. The
// HTTP handler layer wires this to backend.StoreBackend so the
// registry stays decoupled from ModelLoader plumbing.
type StoreResolver func(ctx context.Context, storeName string) (grpc.Backend, error)

// NewStoreRegistry returns a Registry backed by LocalAI's generic
// StoresSet / StoresFind / StoresDelete gRPC surface.
//
// storeName selects which vector-store model to use (defaults to the
// local-store Go backend). `dim` is the expected embedding dimension;
// pass 0 to accept whatever dimension arrives (useful when the voice
// backend exposes recognizers of different sizes, e.g. ECAPA-TDNN at
// 192 vs ResNet at 256).
func NewStoreRegistry(resolve StoreResolver, storeName string, dim int) Registry {
    return &storeRegistry{
        resolve:   resolve,
        storeName: storeName,
        dim:       dim,
    }
}

type storeRegistry struct {
    resolve   StoreResolver
    storeName string
    dim       int

    // TODO(postgres): the local-store gRPC surface keys by embedding
    // vector and exposes no "list all" method, so we cannot delete by
    // ID without remembering the embedding. This in-memory index is
    // rebuilt on every Register and lost on restart — acceptable while
    // the only implementation is itself in-memory.
    idIndex sync.Map // map[string][]float32
}

func (r *storeRegistry) Register(ctx context.Context, embedding []float32, meta Metadata) (Metadata, error) {
    if len(embedding) == 0 {
        return Metadata{}, ErrEmptyEmbedding
    }
    if r.dim != 0 && len(embedding) != r.dim {
        return Metadata{}, fmt.Errorf("%w: expected %d, got %d", ErrDimensionMismatch, r.dim, len(embedding))
    }

    backend, err := r.resolve(ctx, r.storeName)
    if err != nil {
        return Metadata{}, fmt.Errorf("voicerecognition: resolve store: %w", err)
    }

    meta.ID = uuid.NewString()
    if meta.RegisteredAt.IsZero() {
        meta.RegisteredAt = time.Now().UTC()
    }

    payload, err := json.Marshal(meta)
    if err != nil {
        return Metadata{}, fmt.Errorf("voicerecognition: marshal metadata: %w", err)
    }

    if err := store.SetSingle(ctx, backend, embedding, payload); err != nil {
        return Metadata{}, fmt.Errorf("voicerecognition: set: %w", err)
    }

    embCopy := append([]float32(nil), embedding...)
    r.idIndex.Store(meta.ID, embCopy)
    return meta, nil
}

func (r *storeRegistry) Identify(ctx context.Context, probe []float32, topK int) ([]Match, error) {
    if len(probe) == 0 {
        return nil, ErrEmptyEmbedding
    }
    if r.dim != 0 && len(probe) != r.dim {
        return nil, fmt.Errorf("%w: expected %d, got %d", ErrDimensionMismatch, r.dim, len(probe))
    }
    if topK <= 0 {
        topK = 5
    }

    backend, err := r.resolve(ctx, r.storeName)
    if err != nil {
        return nil, fmt.Errorf("voicerecognition: resolve store: %w", err)
    }

    _, values, similarities, err := store.Find(ctx, backend, probe, topK)
    if err != nil {
        return nil, fmt.Errorf("voicerecognition: find: %w", err)
    }

    matches := make([]Match, 0, len(values))
    for i, raw := range values {
        var meta Metadata
        if err := json.Unmarshal(raw, &meta); err != nil {
            // Shared stores may contain non-voice records; skip them.
            continue
        }
        matches = append(matches, Match{
            ID:       meta.ID,
            Metadata: meta,
            Distance: 1 - similarities[i],
        })
    }

    sort.SliceStable(matches, func(i, j int) bool { return matches[i].Distance < matches[j].Distance })
    return matches, nil
}

func (r *storeRegistry) Forget(ctx context.Context, id string) error {
    raw, ok := r.idIndex.Load(id)
    if !ok {
        return ErrNotFound
    }
    embedding := raw.([]float32)

    backend, err := r.resolve(ctx, r.storeName)
    if err != nil {
        return fmt.Errorf("voicerecognition: resolve store: %w", err)
    }
    if err := store.DeleteSingle(ctx, backend, embedding); err != nil {
        return fmt.Errorf("voicerecognition: delete: %w", err)
    }
    r.idIndex.Delete(id)
    return nil
}
@@ -26,6 +26,9 @@ const (
    BackendTraceDetection    BackendTraceType = "detection"
    BackendTraceFaceVerify   BackendTraceType = "face_verify"
    BackendTraceFaceAnalyze  BackendTraceType = "face_analyze"
    BackendTraceVoiceVerify  BackendTraceType = "voice_verify"
    BackendTraceVoiceAnalyze BackendTraceType = "voice_analyze"
    BackendTraceVoiceEmbed   BackendTraceType = "voice_embed"
    BackendTraceModelLoad    BackendTraceType = "model_load"
)

247
docs/content/features/voice-recognition.md
Normal file
@@ -0,0 +1,247 @@
+++
disableToc = false
title = "Voice Recognition"
weight = 15
url = "/features/voice-recognition/"
+++

LocalAI supports voice (speaker) recognition through the
`speaker-recognition` backend: speaker verification (1:1), speaker
identification (1:N) against a built-in vector store, speaker
embedding, and demographic analysis (age / gender / emotion from
voice).

It is the audio analog to [Face Recognition](/features/face-recognition/),
following the same two-engine pattern under one image.

## Engines

| Gallery entry | Model | Size | License |
|---|---|---|---|
| `speechbrain-ecapa-tdnn` | ECAPA-TDNN on VoxCeleb (SpeechBrain) | ~17 MB | **Apache 2.0 — commercial-safe** |
| `wespeaker-resnet34` | WeSpeaker ResNet34 ONNX | ~26 MB | **Apache 2.0 — commercial-safe** |

Both entries are commercial-safe Apache-2.0. SpeechBrain is the
default — it's a lightweight pure-PyTorch checkpoint that
auto-downloads on first use. The `wespeaker-resnet34` entry wires the
direct-ONNX path for CPU-only deployments that don't want the torch
runtime.

## Quickstart

Install the default backend and model:

```bash
local-ai models install speechbrain-ecapa-tdnn
```

Verify that two audio clips were spoken by the same person:

```bash
curl -sX POST http://localhost:8080/v1/voice/verify \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "audio1": "https://example.com/alice_1.wav",
    "audio2": "https://example.com/alice_2.wav"
  }'
```

Response:

```json
{
  "verified": true,
  "distance": 0.18,
  "threshold": 0.25,
  "confidence": 28.0,
  "model": "speechbrain-ecapa-tdnn",
  "processing_time_ms": 340.0
}
```

## 1:N identification workflow (register → identify → forget)

Same flow as face recognition, same in-memory vector store under the
hood.

1. Register known speakers:

   ```bash
   curl -sX POST http://localhost:8080/v1/voice/register \
     -H "Content-Type: application/json" \
     -d '{
       "model": "speechbrain-ecapa-tdnn",
       "name": "Alice",
       "audio": "https://example.com/alice.wav"
     }'
   # → {"id": "b2f...", "name": "Alice", "registered_at": "2026-04-22T..."}
   ```

2. Identify an unknown probe:

   ```bash
   curl -sX POST http://localhost:8080/v1/voice/identify \
     -H "Content-Type: application/json" \
     -d '{
       "model": "speechbrain-ecapa-tdnn",
       "audio": "https://example.com/unknown.wav",
       "top_k": 5
     }'
   # → {"matches": [{"id":"b2f...","name":"Alice","distance":0.19,"match":true,...}]}
   ```

3. Remove a speaker by ID:

   ```bash
   curl -sX POST http://localhost:8080/v1/voice/forget \
     -d '{"id": "b2f..."}'
   # → 204 No Content
   ```

{{% alert icon="⚠️" color="warning" %}}
**Storage caveat.** The default vector store is in-memory. All
registered speakers are lost when LocalAI restarts. Persistent storage
(pgvector) is a tracked future enhancement shared with face
recognition — the voice-recognition HTTP API is designed to swap the
backing store without changing the wire format.
{{% /alert %}}

## API reference

### `POST /v1/voice/verify` (1:1)

| field | type | description |
|---|---|---|
| `model` | string | gallery entry name (e.g. `speechbrain-ecapa-tdnn`) |
| `audio1`, `audio2` | string | URL, base64, or data-URI of an audio file |
| `threshold` | float, optional | cosine-distance cutoff; default 0.25 for ECAPA-TDNN |
| `anti_spoofing` | bool, optional | reserved — unused in the current release |

Returns `verified`, `distance`, `threshold`, `confidence`, `model`,
and `processing_time_ms`.

### `POST /v1/voice/analyze`

Returns demographic attributes (age, gender, emotion) inferred from
speech:

| field | type | description |
|---|---|---|
| `model` | string | gallery entry |
| `audio` | string | URL / base64 / data-URI |
| `actions` | string[] | subset of `["age","gender","emotion"]`; empty = all supported |

Emotion is inferred from the SUPERB emotion-recognition checkpoint
(`superb/wav2vec2-base-superb-er`, Apache 2.0) — 4-way categorical
neutral / happy / angry / sad. The model auto-downloads on the first
analyze call.

Age and gender are **opt-in**: no standard-transformers checkpoint
with a clean classifier head is shipped as the default. The
high-accuracy Audeering age/gender model uses a custom multi-task
head that `AutoModelForAudioClassification` doesn't load safely
(the age weights are silently dropped and the classifier is
re-initialised with random values). To enable age/gender, set
`age_gender_model:<repo>` in the model YAML's `options:` pointing at
a checkpoint with a vanilla `Wav2Vec2ForSequenceClassification`
head. Override the emotion default similarly via `emotion_model:`.
Set either to an empty string to disable that head.

If a head fails to load (offline, disk full, `transformers`
missing), the engine degrades gracefully: it still returns the
attributes it could compute. When nothing can be computed the backend
returns `501 Unimplemented`.

Analyze is supported by both `speechbrain-ecapa-tdnn` and
`wespeaker-resnet34` — the speaker recognizer and the analysis head
are independent.
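
A minimal emotion-only call, sketched with a placeholder URL (the
response line is illustrative, not captured output):

```bash
curl -sX POST http://localhost:8080/v1/voice/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "audio": "https://example.com/alice.wav",
    "actions": ["emotion"]
  }'
# → {"segments": [{"start": 0, "end": 3.2, "dominant_emotion": "neutral", "emotion": {...}}]}
```
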
### `POST /v1/voice/register` (1:N enrollment)
|
||||
|
||||
| field | type | description |
|
||||
|---|---|---|
|
||||
| `model` | string | voice recognition model |
|
||||
| `audio` | string | speaker audio to enroll |
|
||||
| `name` | string | human-readable label |
|
||||
| `labels` | map[string]string, optional | arbitrary metadata |
|
||||
| `store` | string, optional | vector store model; defaults to local-store |
|
||||
|
||||
Returns `{id, name, registered_at}`. The `id` is an opaque UUID used
|
||||
by `/v1/voice/identify` and `/v1/voice/forget`.
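
For example (clip URL and labels are illustrative):

```bash
curl -s http://localhost:8080/v1/voice/register \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "audio": "https://example.com/alice.wav",
    "name": "Alice",
    "labels": {"team": "support"}
  }'
# → {"id":"b2f...","name":"Alice","registered_at":"..."}
```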

### `POST /v1/voice/identify` (1:N recognition)

| field | type | description |
|---|---|---|
| `model` | string | voice recognition model |
| `audio` | string | probe audio |
| `top_k` | int, optional | max matches to return; default 5 |
| `threshold` | float, optional | cosine-distance cutoff; default 0.25 |
| `store` | string, optional | vector store model |

Returns a list of matches sorted by ascending distance, each with
`id`, `name`, `labels`, `distance`, `confidence`, and `match`
(`distance ≤ threshold`).

### `POST /v1/voice/forget`

| field | type | description |
|---|---|---|
| `id` | string | ID returned by `/v1/voice/register` |

Returns `204 No Content` on success, `404 Not Found` if the ID is
unknown.

### `POST /v1/voice/embed`

Returns the L2-normalized speaker embedding vector.

| field | type | description |
|---|---|---|
| `model` | string | voice model |
| `audio` | string | URL / base64 / data-URI |

Returns `{embedding: float[], dim: int, model: string}`. Dimension
depends on the recognizer: 192 for ECAPA-TDNN, 256 for WeSpeaker
ResNet34.
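
A quick way to confirm which dimensionality a model produces (URL is a placeholder; `jq` assumed available):

```bash
curl -s http://localhost:8080/v1/voice/embed \
  -H "Content-Type: application/json" \
  -d '{"model": "speechbrain-ecapa-tdnn", "audio": "https://example.com/sample.wav"}' \
  | jq '{dim, model}'
# → {"dim": 192, "model": "speechbrain-ecapa-tdnn"}
```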

> **Note:** the OpenAI-compatible `/v1/embeddings` endpoint is
> intentionally text-only — it does nothing useful with audio input.
> Use `/v1/voice/embed` for audio.

## Audio input

Audio is materialised by the HTTP layer to a temporary WAV file
before the gRPC call. All audio fields accept:

- `http://` / `https://` URLs (downloaded server-side, subject to
  `ValidateExternalURL` safety checks).
- Raw base64 (no prefix).
- Data URIs (`data:audio/wav;base64,...`).

The backend itself always receives a filesystem path — the same
convention the Whisper / Voxtral transcription backends use.
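
To send a local file, encode it into a data URI first — a minimal sketch (GNU `base64` shown; macOS `base64` doesn't wrap lines, so drop `-w0` there):

```bash
# Embed a local WAV by inlining it as a data URI.
AUDIO="data:audio/wav;base64,$(base64 -w0 sample.wav)"
curl -s http://localhost:8080/v1/voice/embed \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"speechbrain-ecapa-tdnn\", \"audio\": \"${AUDIO}\"}"
```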

## Threshold reference

| Recognizer | Cosine-distance threshold |
|---|---|
| ECAPA-TDNN (SpeechBrain, VoxCeleb) | ~0.25 |
| WeSpeaker ResNet34 | ~0.30 |
| 3D-Speaker ERes2Net | ~0.28 |

Pass `threshold` explicitly when switching recognizers — the per-model
default only applies when omitted.
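
For example, verifying with WeSpeaker at its looser cutoff (clip URLs are placeholders):

```bash
curl -s http://localhost:8080/v1/voice/verify \
  -H "Content-Type: application/json" \
  -d '{
    "model": "wespeaker-resnet34",
    "audio1": "https://example.com/a.wav",
    "audio2": "https://example.com/b.wav",
    "threshold": 0.30
  }'
```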

## Related features

- [Face Recognition](/features/face-recognition/) — the image analog;
  the two share a registry design.
- [Audio to Text](/features/audio-to-text/) — transcription (Whisper,
  Voxtral, faster-whisper). Runs in addition to, not instead of,
  voice recognition.
- [Stores](/features/stores/) — the generic vector store powering
  both the face and voice 1:N recognition pipelines.
- [Embeddings](/features/embeddings/) — text-only OpenAI-compatible
  embedding endpoint; for audio embeddings use `/v1/voice/embed`.

@@ -3993,6 +3993,57 @@
    - filename: face_recognition_sface_2021dec_int8.onnx
      sha256: 2b0e941e6f16cc048c20aee0c8e31f569118f65d702914540f7bfdc14048d78a
      uri: https://github.com/opencv/opencv_zoo/raw/main/models/face_recognition_sface/face_recognition_sface_2021dec_int8.onnx
- &speechbrain_ecapa_tdnn
  name: "speechbrain-ecapa-tdnn"
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
  license: apache-2.0
  description: |
    Speaker (voice) recognition with SpeechBrain's ECAPA-TDNN trained
    on VoxCeleb. 192-d L2-normalised embeddings, ~1.9% Equal Error
    Rate on VoxCeleb1-O. APACHE 2.0 — commercial-safe.

    The checkpoint is auto-downloaded from HuggingFace on first
    LoadModel (no separate weight file in gallery `files:`). Points at
    the upstream SpeechBrain HF repo directly — same bytes every
    deployment.
  tags: [voice-recognition, speaker-verification, speaker-embedding, commercial-ok, cpu, gpu]
  urls:
    - https://speechbrain.github.io/
    - https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb
  overrides:
    backend: speaker-recognition
    parameters: {model: speechbrain/spkrec-ecapa-voxceleb}
    options:
      - "engine:speechbrain"
      - "source:speechbrain/spkrec-ecapa-voxceleb"
    known_usecases: [speaker_recognition]
- &wespeaker_resnet34
  name: "wespeaker-resnet34"
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
  license: apache-2.0
  description: |
    Speaker recognition with WeSpeaker's ResNet34 trained on VoxCeleb,
    exported to ONNX. 256-d embeddings, CPU-friendly — avoids the
    PyTorch runtime entirely (onnxruntime only). APACHE 2.0.

    Pair with the `speaker-recognition` backend's OnnxDirectEngine.
    Use when ECAPA-TDNN's torch dependency is undesirable (small
    images, edge deployments).
  tags: [voice-recognition, speaker-verification, speaker-embedding, commercial-ok, edge, cpu]
  urls:
    - https://github.com/wenet-e2e/wespeaker
  overrides:
    backend: speaker-recognition
    parameters: {model: wespeaker_voxceleb_resnet34.onnx}
    options:
      - "engine:onnx"
      - "model_path:wespeaker_voxceleb_resnet34.onnx"
      - "sample_rate:16000"
    known_usecases: [speaker_recognition]
  files:
    - filename: wespeaker_voxceleb_resnet34.onnx
      sha256: 7bb2f06e9df17cdf1ef14ee8a15ab08ed28e8d0ef5054ee135741560df2ec068
      uri: https://huggingface.co/Wespeaker/wespeaker-voxceleb-resnet34-LM/resolve/main/voxceleb_resnet34_LM.onnx
- &rfdetr
  name: "rfdetr-base"
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"

@@ -56,6 +56,9 @@ type Backend interface {
    Detect(ctx context.Context, in *pb.DetectOptions, opts ...grpc.CallOption) (*pb.DetectResponse, error)
    FaceVerify(ctx context.Context, in *pb.FaceVerifyRequest, opts ...grpc.CallOption) (*pb.FaceVerifyResponse, error)
    FaceAnalyze(ctx context.Context, in *pb.FaceAnalyzeRequest, opts ...grpc.CallOption) (*pb.FaceAnalyzeResponse, error)
    VoiceVerify(ctx context.Context, in *pb.VoiceVerifyRequest, opts ...grpc.CallOption) (*pb.VoiceVerifyResponse, error)
    VoiceAnalyze(ctx context.Context, in *pb.VoiceAnalyzeRequest, opts ...grpc.CallOption) (*pb.VoiceAnalyzeResponse, error)
    VoiceEmbed(ctx context.Context, in *pb.VoiceEmbedRequest, opts ...grpc.CallOption) (*pb.VoiceEmbedResponse, error)
    AudioTranscription(ctx context.Context, in *pb.TranscriptRequest, opts ...grpc.CallOption) (*pb.TranscriptResult, error)
    AudioTranscriptionStream(ctx context.Context, in *pb.TranscriptRequest, f func(chunk *pb.TranscriptStreamResponse), opts ...grpc.CallOption) error
    TokenizeString(ctx context.Context, in *pb.PredictOptions, opts ...grpc.CallOption) (*pb.TokenizationResponse, error)

@@ -89,6 +89,18 @@ func (llm *Base) FaceAnalyze(*pb.FaceAnalyzeRequest) (pb.FaceAnalyzeResponse, er
    return pb.FaceAnalyzeResponse{}, fmt.Errorf("unimplemented")
}

func (llm *Base) VoiceVerify(*pb.VoiceVerifyRequest) (pb.VoiceVerifyResponse, error) {
    return pb.VoiceVerifyResponse{}, fmt.Errorf("unimplemented")
}

func (llm *Base) VoiceAnalyze(*pb.VoiceAnalyzeRequest) (pb.VoiceAnalyzeResponse, error) {
    return pb.VoiceAnalyzeResponse{}, fmt.Errorf("unimplemented")
}

func (llm *Base) VoiceEmbed(*pb.VoiceEmbedRequest) (pb.VoiceEmbedResponse, error) {
    return pb.VoiceEmbedResponse{}, fmt.Errorf("unimplemented")
}

func (llm *Base) TokenizeString(opts *pb.PredictOptions) (pb.TokenizationResponse, error) {
    return pb.TokenizationResponse{}, fmt.Errorf("unimplemented")
}

@@ -616,6 +616,60 @@ func (c *Client) FaceAnalyze(ctx context.Context, in *pb.FaceAnalyzeRequest, opt
    return client.FaceAnalyze(ctx, in, opts...)
}

func (c *Client) VoiceVerify(ctx context.Context, in *pb.VoiceVerifyRequest, opts ...grpc.CallOption) (*pb.VoiceVerifyResponse, error) {
    if !c.parallel {
        c.opMutex.Lock()
        defer c.opMutex.Unlock()
    }
    c.setBusy(true)
    defer c.setBusy(false)
    c.wdMark()
    defer c.wdUnMark()
    conn, err := c.dial()
    if err != nil {
        return nil, err
    }
    defer conn.Close()
    client := pb.NewBackendClient(conn)
    return client.VoiceVerify(ctx, in, opts...)
}

func (c *Client) VoiceAnalyze(ctx context.Context, in *pb.VoiceAnalyzeRequest, opts ...grpc.CallOption) (*pb.VoiceAnalyzeResponse, error) {
    if !c.parallel {
        c.opMutex.Lock()
        defer c.opMutex.Unlock()
    }
    c.setBusy(true)
    defer c.setBusy(false)
    c.wdMark()
    defer c.wdUnMark()
    conn, err := c.dial()
    if err != nil {
        return nil, err
    }
    defer conn.Close()
    client := pb.NewBackendClient(conn)
    return client.VoiceAnalyze(ctx, in, opts...)
}

func (c *Client) VoiceEmbed(ctx context.Context, in *pb.VoiceEmbedRequest, opts ...grpc.CallOption) (*pb.VoiceEmbedResponse, error) {
    if !c.parallel {
        c.opMutex.Lock()
        defer c.opMutex.Unlock()
    }
    c.setBusy(true)
    defer c.setBusy(false)
    c.wdMark()
    defer c.wdUnMark()
    conn, err := c.dial()
    if err != nil {
        return nil, err
    }
    defer conn.Close()
    client := pb.NewBackendClient(conn)
    return client.VoiceEmbed(ctx, in, opts...)
}

func (c *Client) AudioEncode(ctx context.Context, in *pb.AudioEncodeRequest, opts ...grpc.CallOption) (*pb.AudioEncodeResult, error) {
    if !c.parallel {
        c.opMutex.Lock()

@@ -79,6 +79,18 @@ func (e *embedBackend) FaceAnalyze(ctx context.Context, in *pb.FaceAnalyzeReques
    return e.s.FaceAnalyze(ctx, in)
}

func (e *embedBackend) VoiceVerify(ctx context.Context, in *pb.VoiceVerifyRequest, opts ...grpc.CallOption) (*pb.VoiceVerifyResponse, error) {
    return e.s.VoiceVerify(ctx, in)
}

func (e *embedBackend) VoiceAnalyze(ctx context.Context, in *pb.VoiceAnalyzeRequest, opts ...grpc.CallOption) (*pb.VoiceAnalyzeResponse, error) {
    return e.s.VoiceAnalyze(ctx, in)
}

func (e *embedBackend) VoiceEmbed(ctx context.Context, in *pb.VoiceEmbedRequest, opts ...grpc.CallOption) (*pb.VoiceEmbedResponse, error) {
    return e.s.VoiceEmbed(ctx, in)
}

func (e *embedBackend) AudioTranscription(ctx context.Context, in *pb.TranscriptRequest, opts ...grpc.CallOption) (*pb.TranscriptResult, error) {
    return e.s.AudioTranscription(ctx, in)
}

@@ -19,6 +19,9 @@ type AIModel interface {
    Detect(*pb.DetectOptions) (pb.DetectResponse, error)
    FaceVerify(*pb.FaceVerifyRequest) (pb.FaceVerifyResponse, error)
    FaceAnalyze(*pb.FaceAnalyzeRequest) (pb.FaceAnalyzeResponse, error)
    VoiceVerify(*pb.VoiceVerifyRequest) (pb.VoiceVerifyResponse, error)
    VoiceAnalyze(*pb.VoiceAnalyzeRequest) (pb.VoiceAnalyzeResponse, error)
    VoiceEmbed(*pb.VoiceEmbedRequest) (pb.VoiceEmbedResponse, error)
    AudioTranscription(*pb.TranscriptRequest) (pb.TranscriptResult, error)
    AudioTranscriptionStream(*pb.TranscriptRequest, chan *pb.TranscriptStreamResponse) error
    TTS(*pb.TTSRequest) error

@@ -175,6 +175,42 @@ func (s *server) FaceAnalyze(ctx context.Context, in *pb.FaceAnalyzeRequest) (*p
    return &res, nil
}

func (s *server) VoiceVerify(ctx context.Context, in *pb.VoiceVerifyRequest) (*pb.VoiceVerifyResponse, error) {
    if s.llm.Locking() {
        s.llm.Lock()
        defer s.llm.Unlock()
    }
    res, err := s.llm.VoiceVerify(in)
    if err != nil {
        return nil, err
    }
    return &res, nil
}

func (s *server) VoiceAnalyze(ctx context.Context, in *pb.VoiceAnalyzeRequest) (*pb.VoiceAnalyzeResponse, error) {
    if s.llm.Locking() {
        s.llm.Lock()
        defer s.llm.Unlock()
    }
    res, err := s.llm.VoiceAnalyze(in)
    if err != nil {
        return nil, err
    }
    return &res, nil
}

func (s *server) VoiceEmbed(ctx context.Context, in *pb.VoiceEmbedRequest) (*pb.VoiceEmbedResponse, error) {
    if s.llm.Locking() {
        s.llm.Lock()
        defer s.llm.Unlock()
    }
    res, err := s.llm.VoiceEmbed(in)
    if err != nil {
        return nil, err
    }
    return &res, nil
}

func (s *server) AudioTranscription(ctx context.Context, in *pb.TranscriptRequest) (*pb.TranscriptResult, error) {
    if s.llm.Locking() {
        s.llm.Lock()

swagger/docs.go
@@ -1166,6 +1166,25 @@ const docTemplate = `{
            }
        }
    },
    "/backends/known": {
        "get": {
            "tags": [
                "backends"
            ],
            "summary": "List all known Backends (importer registry + curated pref-only + installed-on-disk)",
            "responses": {
                "200": {
                    "description": "Response",
                    "schema": {
                        "type": "array",
                        "items": {
                            "$ref": "#/definitions/schema.KnownBackend"
                        }
                    }
                }
            }
        }
    },
    "/backends/upgrade/{name}": {
        "post": {
            "tags": [

@@ -2261,6 +2280,165 @@ const docTemplate = `{
            }
        }
    },
    "/v1/voice/analyze": {
        "post": {
            "tags": [
                "voice-recognition"
            ],
            "summary": "Analyze demographic attributes (age, gender, emotion) from a voice clip.",
            "parameters": [
                {
                    "description": "query params",
                    "name": "request",
                    "in": "body",
                    "required": true,
                    "schema": {
                        "$ref": "#/definitions/schema.VoiceAnalyzeRequest"
                    }
                }
            ],
            "responses": {
                "200": {
                    "description": "Response",
                    "schema": {
                        "$ref": "#/definitions/schema.VoiceAnalyzeResponse"
                    }
                }
            }
        }
    },
    "/v1/voice/embed": {
        "post": {
            "tags": [
                "voice-recognition"
            ],
            "summary": "Extract a speaker embedding from an audio clip.",
            "parameters": [
                {
                    "description": "query params",
                    "name": "request",
                    "in": "body",
                    "required": true,
                    "schema": {
                        "$ref": "#/definitions/schema.VoiceEmbedRequest"
                    }
                }
            ],
            "responses": {
                "200": {
                    "description": "Response",
                    "schema": {
                        "$ref": "#/definitions/schema.VoiceEmbedResponse"
                    }
                }
            }
        }
    },
    "/v1/voice/forget": {
        "post": {
            "tags": [
                "voice-recognition"
            ],
            "summary": "Remove a previously-registered speaker by ID.",
            "parameters": [
                {
                    "description": "query params",
                    "name": "request",
                    "in": "body",
                    "required": true,
                    "schema": {
                        "$ref": "#/definitions/schema.VoiceForgetRequest"
                    }
                }
            ],
            "responses": {
                "204": {
                    "description": "No Content"
                }
            }
        }
    },
    "/v1/voice/identify": {
        "post": {
            "tags": [
                "voice-recognition"
            ],
            "summary": "Identify a speaker against the registered database (1:N recognition).",
            "parameters": [
                {
                    "description": "query params",
                    "name": "request",
                    "in": "body",
                    "required": true,
                    "schema": {
                        "$ref": "#/definitions/schema.VoiceIdentifyRequest"
                    }
                }
            ],
            "responses": {
                "200": {
                    "description": "Response",
                    "schema": {
                        "$ref": "#/definitions/schema.VoiceIdentifyResponse"
                    }
                }
            }
        }
    },
    "/v1/voice/register": {
        "post": {
            "tags": [
                "voice-recognition"
            ],
            "summary": "Register a speaker for 1:N identification.",
            "parameters": [
                {
                    "description": "query params",
                    "name": "request",
                    "in": "body",
                    "required": true,
                    "schema": {
                        "$ref": "#/definitions/schema.VoiceRegisterRequest"
                    }
                }
            ],
            "responses": {
                "200": {
                    "description": "Response",
                    "schema": {
                        "$ref": "#/definitions/schema.VoiceRegisterResponse"
                    }
                }
            }
        }
    },
    "/v1/voice/verify": {
        "post": {
            "tags": [
                "voice-recognition"
            ],
            "summary": "Verify that two audio clips were spoken by the same person.",
            "parameters": [
                {
                    "description": "query params",
                    "name": "request",
                    "in": "body",
                    "required": true,
                    "schema": {
                        "$ref": "#/definitions/schema.VoiceVerifyRequest"
                    }
                }
            ],
            "responses": {
                "200": {
                    "description": "Response",
                    "schema": {
                        "$ref": "#/definitions/schema.VoiceVerifyResponse"
                    }
                }
            }
        }
    },
    "/vad": {
        "post": {
            "consumes": [

@@ -3850,6 +4028,27 @@ const docTemplate = `{
            }
        }
    },
    "schema.KnownBackend": {
        "type": "object",
        "properties": {
            "auto_detect": {
                "type": "boolean"
            },
            "description": {
                "type": "string"
            },
            "installed": {
                "description": "Installed is true when the backend is currently present on disk — i.e. it\nappears in gallery.ListSystemBackends(systemState). Importer-registered or\ncurated pref-only backends default to false unless they also show up on\ndisk. The import form uses this to warn users that submitting an import\nmay trigger an automatic backend download.",
                "type": "boolean"
            },
            "modality": {
                "type": "string"
            },
            "name": {
                "type": "string"
            }
        }
    },
    "schema.LogprobContent": {
        "type": "object",
        "properties": {

@@ -5098,6 +5297,248 @@ const docTemplate = `{
            }
        }
    },
    "schema.VoiceAnalysis": {
        "type": "object",
        "properties": {
            "age": {
                "type": "number"
            },
            "dominant_emotion": {
                "type": "string"
            },
            "dominant_gender": {
                "type": "string"
            },
            "emotion": {
                "type": "object",
                "additionalProperties": {
                    "type": "number",
                    "format": "float32"
                }
            },
            "end": {
                "type": "number"
            },
            "gender": {
                "type": "object",
                "additionalProperties": {
                    "type": "number",
                    "format": "float32"
                }
            },
            "start": {
                "type": "number"
            }
        }
    },
    "schema.VoiceAnalyzeRequest": {
        "type": "object",
        "properties": {
            "actions": {
                "description": "subset of {\"age\",\"gender\",\"emotion\"}",
                "type": "array",
                "items": {
                    "type": "string"
                }
            },
            "audio": {
                "type": "string"
            },
            "model": {
                "type": "string"
            }
        }
    },
    "schema.VoiceAnalyzeResponse": {
        "type": "object",
        "properties": {
            "segments": {
                "type": "array",
                "items": {
                    "$ref": "#/definitions/schema.VoiceAnalysis"
                }
            }
        }
    },
    "schema.VoiceEmbedRequest": {
        "type": "object",
        "properties": {
            "audio": {
                "type": "string"
            },
            "model": {
                "type": "string"
            }
        }
    },
    "schema.VoiceEmbedResponse": {
        "type": "object",
        "properties": {
            "dim": {
                "type": "integer"
            },
            "embedding": {
                "type": "array",
                "items": {
                    "type": "number"
                }
            },
            "model": {
                "type": "string"
            }
        }
    },
    "schema.VoiceForgetRequest": {
        "type": "object",
        "properties": {
            "id": {
                "type": "string"
            },
            "model": {
                "type": "string"
            },
            "store": {
                "type": "string"
            }
        }
    },
    "schema.VoiceIdentifyMatch": {
        "type": "object",
        "properties": {
            "confidence": {
                "type": "number"
            },
            "distance": {
                "type": "number"
            },
            "id": {
                "type": "string"
            },
            "labels": {
                "type": "object",
                "additionalProperties": {
                    "type": "string"
                }
            },
            "match": {
                "type": "boolean"
            },
            "name": {
                "type": "string"
            }
        }
    },
    "schema.VoiceIdentifyRequest": {
        "type": "object",
        "properties": {
            "audio": {
                "type": "string"
            },
            "model": {
                "type": "string"
            },
            "store": {
                "type": "string"
            },
            "threshold": {
                "type": "number"
            },
            "top_k": {
                "type": "integer"
            }
        }
    },
    "schema.VoiceIdentifyResponse": {
        "type": "object",
        "properties": {
            "matches": {
                "type": "array",
                "items": {
                    "$ref": "#/definitions/schema.VoiceIdentifyMatch"
                }
            }
        }
    },
    "schema.VoiceRegisterRequest": {
        "type": "object",
        "properties": {
            "audio": {
                "type": "string"
            },
            "labels": {
                "type": "object",
                "additionalProperties": {
                    "type": "string"
                }
            },
            "model": {
                "type": "string"
            },
            "name": {
                "type": "string"
            },
            "store": {
                "type": "string"
            }
        }
    },
    "schema.VoiceRegisterResponse": {
        "type": "object",
        "properties": {
            "id": {
                "type": "string"
            },
            "name": {
                "type": "string"
            },
            "registered_at": {
                "type": "string"
            }
        }
    },
    "schema.VoiceVerifyRequest": {
        "type": "object",
        "properties": {
            "anti_spoofing": {
                "type": "boolean"
            },
            "audio1": {
                "type": "string"
            },
            "audio2": {
                "type": "string"
            },
            "model": {
                "type": "string"
            },
            "threshold": {
                "type": "number"
            }
        }
    },
    "schema.VoiceVerifyResponse": {
        "type": "object",
        "properties": {
            "confidence": {
                "type": "number"
            },
            "distance": {
                "type": "number"
            },
            "model": {
                "type": "string"
            },
            "processing_time_ms": {
                "type": "number"
            },
            "threshold": {
                "type": "number"
            },
            "verified": {
                "type": "boolean"
            }
        }
    },
    "schema.WebhookConfig": {
        "type": "object",
        "properties": {

@@ -1038,6 +1038,25 @@ definitions:
        description: '"reasoning", "tool_call", "tool_result", "status"'
        type: string
    type: object
  schema.KnownBackend:
    properties:
      auto_detect:
        type: boolean
      description:
        type: string
      installed:
        description: |-
          Installed is true when the backend is currently present on disk — i.e. it
          appears in gallery.ListSystemBackends(systemState). Importer-registered or
          curated pref-only backends default to false unless they also show up on
          disk. The import form uses this to warn users that submitting an import
          may trigger an automatic backend download.
        type: boolean
      modality:
        type: string
      name:
        type: string
    type: object
  schema.LogprobContent:
    properties:
      bytes:

@@ -1901,6 +1920,164 @@ definitions:
      description: output width in pixels
      type: integer
    type: object
  schema.VoiceAnalysis:
    properties:
      age:
        type: number
      dominant_emotion:
        type: string
      dominant_gender:
        type: string
      emotion:
        additionalProperties:
          format: float32
          type: number
        type: object
      end:
        type: number
      gender:
        additionalProperties:
          format: float32
          type: number
        type: object
      start:
        type: number
    type: object
  schema.VoiceAnalyzeRequest:
    properties:
      actions:
        description: subset of {"age","gender","emotion"}
        items:
          type: string
        type: array
      audio:
        type: string
      model:
        type: string
    type: object
  schema.VoiceAnalyzeResponse:
    properties:
      segments:
        items:
          $ref: '#/definitions/schema.VoiceAnalysis'
        type: array
    type: object
  schema.VoiceEmbedRequest:
    properties:
      audio:
        type: string
      model:
        type: string
    type: object
  schema.VoiceEmbedResponse:
    properties:
      dim:
        type: integer
      embedding:
        items:
          type: number
        type: array
      model:
        type: string
    type: object
  schema.VoiceForgetRequest:
    properties:
      id:
        type: string
      model:
        type: string
      store:
        type: string
    type: object
  schema.VoiceIdentifyMatch:
    properties:
      confidence:
        type: number
      distance:
        type: number
      id:
        type: string
      labels:
        additionalProperties:
          type: string
        type: object
      match:
        type: boolean
      name:
        type: string
    type: object
  schema.VoiceIdentifyRequest:
    properties:
      audio:
        type: string
      model:
        type: string
      store:
        type: string
      threshold:
        type: number
      top_k:
        type: integer
    type: object
  schema.VoiceIdentifyResponse:
    properties:
      matches:
        items:
          $ref: '#/definitions/schema.VoiceIdentifyMatch'
        type: array
    type: object
  schema.VoiceRegisterRequest:
    properties:
      audio:
        type: string
      labels:
        additionalProperties:
          type: string
        type: object
      model:
        type: string
      name:
        type: string
      store:
        type: string
    type: object
  schema.VoiceRegisterResponse:
    properties:
      id:
        type: string
      name:
        type: string
      registered_at:
        type: string
    type: object
  schema.VoiceVerifyRequest:
    properties:
      anti_spoofing:
        type: boolean
      audio1:
        type: string
      audio2:
        type: string
      model:
        type: string
      threshold:
        type: number
    type: object
  schema.VoiceVerifyResponse:
    properties:
      confidence:
        type: number
      distance:
        type: number
      model:
        type: string
      processing_time_ms:
        type: number
      threshold:
        type: number
      verified:
        type: boolean
    type: object
  schema.WebhookConfig:
    properties:
      headers:

@@ -2688,6 +2865,18 @@ paths:
      summary: Returns the job status
      tags:
        - backends
  /backends/known:
    get:
      responses:
        "200":
          description: Response
          schema:
            items:
              $ref: '#/definitions/schema.KnownBackend'
            type: array
      summary: List all known Backends (importer registry + curated pref-only + installed-on-disk)
      tags:
        - backends
  /backends/upgrade/{name}:
    post:
      parameters:

@@ -3392,6 +3581,107 @@ paths:
      summary: Tokenize the input.
      tags:
        - tokenize
  /v1/voice/analyze:
    post:
      parameters:
        - description: query params
          in: body
          name: request
          required: true
          schema:
            $ref: '#/definitions/schema.VoiceAnalyzeRequest'
      responses:
        "200":
          description: Response
          schema:
            $ref: '#/definitions/schema.VoiceAnalyzeResponse'
      summary: Analyze demographic attributes (age, gender, emotion) from a voice
        clip.
      tags:
        - voice-recognition
  /v1/voice/embed:
    post:
      parameters:
        - description: query params
          in: body
          name: request
          required: true
          schema:
            $ref: '#/definitions/schema.VoiceEmbedRequest'
      responses:
        "200":
          description: Response
          schema:
            $ref: '#/definitions/schema.VoiceEmbedResponse'
      summary: Extract a speaker embedding from an audio clip.
      tags:
        - voice-recognition
  /v1/voice/forget:
    post:
      parameters:
        - description: query params
          in: body
          name: request
          required: true
          schema:
            $ref: '#/definitions/schema.VoiceForgetRequest'
      responses:
        "204":
          description: No Content
      summary: Remove a previously-registered speaker by ID.
      tags:
        - voice-recognition
  /v1/voice/identify:
    post:
      parameters:
        - description: query params
          in: body
          name: request
          required: true
          schema:
            $ref: '#/definitions/schema.VoiceIdentifyRequest'
      responses:
        "200":
          description: Response
          schema:
            $ref: '#/definitions/schema.VoiceIdentifyResponse'
      summary: Identify a speaker against the registered database (1:N recognition).
      tags:
        - voice-recognition
  /v1/voice/register:
    post:
      parameters:
        - description: query params
          in: body
          name: request
          required: true
          schema:
            $ref: '#/definitions/schema.VoiceRegisterRequest'
      responses:
        "200":
          description: Response
          schema:
            $ref: '#/definitions/schema.VoiceRegisterResponse'
      summary: Register a speaker for 1:N identification.
      tags:
        - voice-recognition
  /v1/voice/verify:
    post:
      parameters:
        - description: query params
          in: body
          name: request
          required: true
          schema:
            $ref: '#/definitions/schema.VoiceVerifyRequest'
      responses:
        "200":
          description: Response
          schema:
            $ref: '#/definitions/schema.VoiceVerifyResponse'
      summary: Verify that two audio clips were spoken by the same person.
      tags:
        - voice-recognition
  /vad:
    post:
      consumes:

@@ -88,6 +88,9 @@ const (
    capFaceEmbed = "face_embed"
    capFaceVerify = "face_verify"
    capFaceAnalyze = "face_analyze"
    capVoiceEmbed = "voice_embed"
    capVoiceVerify = "voice_verify"
    capVoiceAnalyze = "voice_analyze"

    defaultPrompt = "The capital of France is"
    streamPrompt = "Once upon a time"

@@ -137,6 +140,14 @@ var _ = Describe("Backend container", Ordered, func() {
        faceFile1 string
        faceFile2 string
        faceFile3 string
        // Voice fixtures: two clips of the same speaker + one different speaker.
        voiceFile1 string
        voiceFile2 string
        voiceFile3 string
        // voiceVerifyCeiling is the upper-bound cosine distance for a
        // same-speaker pair; varies with the recognizer (ECAPA-TDNN
        // runs close to 0.2, WeSpeaker around 0.3).
        voiceVerifyCeiling float32
        // verifyCeiling is the upper-bound cosine distance for a
        // same-person pair; each model configuration can override it via
        // BACKEND_TEST_VERIFY_DISTANCE_CEILING because SFace's distance

@@ -218,6 +229,13 @@ var _ = Describe("Backend container", Ordered, func() {
        faceFile3 = resolveFaceFixture(workDir, "BACKEND_TEST_FACE_IMAGE_3", "face_b.jpg")
        verifyCeiling = envFloat32("BACKEND_TEST_VERIFY_DISTANCE_CEILING", defaultVerifyDistanceCeil)

        // Voice fixtures for the voice-recognition specs. Same resolver
        // as faces — the helper is content-agnostic.
        voiceFile1 = resolveFaceFixture(workDir, "BACKEND_TEST_VOICE_AUDIO_1", "voice_a_1.wav")
        voiceFile2 = resolveFaceFixture(workDir, "BACKEND_TEST_VOICE_AUDIO_2", "voice_a_2.wav")
        voiceFile3 = resolveFaceFixture(workDir, "BACKEND_TEST_VOICE_AUDIO_3", "voice_b.wav")
        voiceVerifyCeiling = envFloat32("BACKEND_TEST_VOICE_VERIFY_DISTANCE_CEILING", 0.4)

        // Pick a free port and launch the backend.
        port, err := freeport.GetFreePort()
        Expect(err).NotTo(HaveOccurred())

@@ -668,6 +686,107 @@ var _ = Describe("Backend container", Ordered, func() {
        }
        GinkgoWriter.Printf("face_analyze: %d faces\n", len(res.GetFaces()))
    })

    // ─── voice (speaker) recognition specs ──────────────────────────────

    It("produces speaker embeddings via VoiceEmbed", func() {
        if !caps[capVoiceEmbed] {
            Skip("voice_embed capability not enabled")
        }
        Expect(voiceFile1).NotTo(BeEmpty(), "BACKEND_TEST_VOICE_AUDIO_1_FILE or _URL must be set")

        ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
        defer cancel()
        res, err := client.VoiceEmbed(ctx, &pb.VoiceEmbedRequest{Audio: voiceFile1})
        Expect(err).NotTo(HaveOccurred())
        vec := res.GetEmbedding()
        Expect(vec).NotTo(BeEmpty(), "VoiceEmbed returned empty vector")
        GinkgoWriter.Printf("voice_embed: dim=%d\n", len(vec))
    })

    It("verifies speakers via VoiceVerify", func() {
        if !caps[capVoiceVerify] {
            Skip("voice_verify capability not enabled")
        }
        Expect(voiceFile1).NotTo(BeEmpty(), "BACKEND_TEST_VOICE_AUDIO_1_FILE or _URL must be set")

        ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
        defer cancel()

        // Same clip twice — expected verified=true with very small distance.
        same, err := client.VoiceVerify(ctx, &pb.VoiceVerifyRequest{
            Audio1: voiceFile1, Audio2: voiceFile1, Threshold: voiceVerifyCeiling,
        })
        Expect(err).NotTo(HaveOccurred())
        Expect(same.GetVerified()).To(BeTrue(), "same clip should verify: dist=%.3f", same.GetDistance())
        Expect(same.GetDistance()).To(BeNumerically("<", 0.05),
            "identical-clip distance should be near zero, got %.3f", same.GetDistance())
        GinkgoWriter.Printf("voice_verify(same): dist=%.3f confidence=%.1f\n", same.GetDistance(), same.GetConfidence())

        // Cross-pair distance — assert relative ordering: d(file1,file3) > d(same).
        // We don't require the fixtures to contain true same-speaker pairs —
        // good same-speaker audio is hard to source un-gated. The RPC
        // correctness is pinned by the same-clip check above; the pair
        // distances here are about asserting the embedding actually encodes
        // speaker info (ordering changes with speaker identity).
        var d12, d13 float32
        if voiceFile3 != "" {
            res, err := client.VoiceVerify(ctx, &pb.VoiceVerifyRequest{
                Audio1: voiceFile1, Audio2: voiceFile3, Threshold: voiceVerifyCeiling,
            })
            if err != nil {
                GinkgoWriter.Printf("voice_verify(1vs3): skipped — %v\n", err)
            } else {
                d13 = res.GetDistance()
                Expect(d13).To(BeNumerically(">", same.GetDistance()),
                    "cross-clip distance %.3f should exceed same-clip distance %.3f", d13, same.GetDistance())
                GinkgoWriter.Printf("voice_verify(1vs3): dist=%.3f verified=%v\n", d13, res.GetVerified())
            }
        }

        if voiceFile2 != "" {
            res, err := client.VoiceVerify(ctx, &pb.VoiceVerifyRequest{
                Audio1: voiceFile1, Audio2: voiceFile2, Threshold: voiceVerifyCeiling,
            })
            if err != nil {
                GinkgoWriter.Printf("voice_verify(1vs2): skipped — %v\n", err)
            } else {
                d12 = res.GetDistance()
                Expect(d12).To(BeNumerically(">", same.GetDistance()),
                    "cross-clip distance %.3f should exceed same-clip distance %.3f", d12, same.GetDistance())
                GinkgoWriter.Printf("voice_verify(1vs2): dist=%.3f verified=%v\n", d12, res.GetVerified())
            }
        }

        // If both pair distances were computed, record their ordering.
        // We log rather than assert: ordering depends on the specific
        // fixtures used, and CI defaults point at three different speakers.
        if d12 > 0 && d13 > 0 {
            GinkgoWriter.Printf("voice_verify ordering: d(1,2)=%.3f d(1,3)=%.3f\n", d12, d13)
        }
    })

    It("analyzes voice via VoiceAnalyze", func() {
        if !caps[capVoiceAnalyze] {
            Skip("voice_analyze capability not enabled")
        }
        Expect(voiceFile1).NotTo(BeEmpty(), "BACKEND_TEST_VOICE_AUDIO_1_FILE or _URL must be set")

        ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
        defer cancel()
        res, err := client.VoiceAnalyze(ctx, &pb.VoiceAnalyzeRequest{Audio: voiceFile1})
        Expect(err).NotTo(HaveOccurred())
        Expect(res.GetSegments()).NotTo(BeEmpty(), "VoiceAnalyze returned no segments")
        for _, s := range res.GetSegments() {
            Expect(s.GetAge()).To(BeNumerically(">", 0), "age should be populated by analyze-capable engines")
            // Audeering's age-gender head outputs female / male / child;
            // LocalAI capitalises those to Female / Male / Child. Custom
            // checkpoints wired via the age_gender_model option may use
            // different labels, so accept anything non-empty.
            Expect(s.GetDominantGender()).NotTo(BeEmpty())
        }
        GinkgoWriter.Printf("voice_analyze: %d segments\n", len(res.GetSegments()))
    })
})

// extractImage runs `docker create` + `docker export` to materialise the image