feat: voice recognition (#9500)

* feat(voice-recognition): add /v1/voice/{verify,analyze,embed} + speaker-recognition backend

Audio analog to face recognition. Adds three gRPC RPCs
(VoiceVerify / VoiceAnalyze / VoiceEmbed), their Go service and HTTP
layers, a new FLAG_SPEAKER_RECOGNITION capability flag, and a Python
backend scaffold under backend/python/speaker-recognition/ wrapping
SpeechBrain ECAPA-TDNN with a parallel OnnxDirectEngine for
WeSpeaker / 3D-Speaker ONNX exports.
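For orientation, a rough Python sketch of driving the new RPCs through the generated stubs (the address and audio path are illustrative, not part of this change):

import grpc
import backend_pb2
import backend_pb2_grpc

# Connect to a running speaker-recognition backend (address is illustrative).
channel = grpc.insecure_channel("localhost:50051")
stub = backend_pb2_grpc.BackendStub(channel)

# Load the default SpeechBrain ECAPA-TDNN recognizer.
stub.LoadModel(backend_pb2.ModelOptions(
    Model="speechbrain/spkrec-ecapa-voxceleb",
    Options=["engine:speechbrain"],
))

# Audio fields are filesystem paths, same convention as TranscriptRequest.dst.
resp = stub.VoiceEmbed(backend_pb2.VoiceEmbedRequest(audio="/tmp/example1.wav"))
print(len(resp.embedding), resp.model)  # 192 floats for ECAPA-TDNN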

The kokoros Rust backend gets matching unimplemented trait stubs —
tonic's async_trait has no defaults, so adding an RPC without Rust
stubs breaks the build (same regression fixed by eb01c772 for face).

Swagger, /api/instructions, and the auth RouteFeatureRegistry /
APIFeatures list are updated so the endpoints surface everywhere a
client or admin UI looks.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): add 1:N identify + register/forget endpoints

Mirrors the face-recognition register/identify/forget surface. New
package core/services/voicerecognition/ carries a Registry interface
and a local-store-backed implementation (same in-memory vector-store
plumbing facerecognition uses, separate instance so the embedding
spaces stay isolated).

Handlers under /v1/voice/{register,identify,forget} reuse
backend.VoiceEmbed to compute the probe vector, then delegate the
nearest-neighbour search to the registry. Default cosine-distance
threshold is tuned for ECAPA-TDNN on VoxCeleb (0.25, EER ~1.9%).
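The decision rule itself is tiny; a numpy-only sketch (the two vectors stand in for backend.VoiceEmbed outputs):

import numpy as np

def cosine_distance(a, b):
    # distance = 1 - cosine_similarity, the same metric the backend reports
    a = np.asarray(a, dtype=np.float32).reshape(-1)
    b = np.asarray(b, dtype=np.float32).reshape(-1)
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    if denom == 0.0:
        return 1.0  # degenerate embedding: treat as maximally distant
    return float(1.0 - np.dot(a, b) / denom)

THRESHOLD = 0.25  # ECAPA-TDNN on VoxCeleb default (EER ~1.9%)
probe, enrolled = np.random.rand(192), np.random.rand(192)  # placeholder vectors
verified = cosine_distance(probe, enrolled) <= THRESHOLD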

As with the face registry, the current backing is in-memory only — a
pgvector implementation is a future constructor-level swap.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): gallery, docs, CI and e2e coverage

- backend/index.yaml: speaker-recognition backend entry + CPU and
  CUDA-12 image variants (plus matching development variants).
- gallery/index.yaml: speechbrain-ecapa-tdnn (default) and
  wespeaker-resnet34 model entries. The WeSpeaker SHA-256 is a
  deliberate placeholder — the HF URI must be curl'd and its hash
  filled in before the entry installs.
- docs/content/features/voice-recognition.md: API reference + quickstart,
  mirrors the face-recognition docs.
- React UI: CAP_SPEAKER_RECOGNITION flag export (consumers follow face's
  precedent — no dedicated tab yet).
- tests/e2e-backends: voice_embed / voice_verify / voice_analyze specs.
  Helper resolveFaceFixture is reused as-is — the only thing face/voice
  share is "download a file into workDir", so no need for a new helper.
- Makefile: docker-build-speaker-recognition + test-extra-backend-
  speaker-recognition-{ecapa,all} targets. Audio fixtures default to
  VCTK p225/p226 samples from HuggingFace.
- CI: test-extra.yml grows a tests-speaker-recognition-grpc job
  mirroring insightface. backend.yml matrix gains CPU + CUDA-12 image
  build entries — scripts/changed-backends.js auto-picks these up.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): wire a working /v1/voice/analyze head

Adds AnalysisHead: a lazy-loading age / gender / emotion inference
wrapper that plugs into both SpeechBrainEngine and OnnxDirectEngine.

Defaults to two open-licence HuggingFace checkpoints:
  - audeering/wav2vec2-large-robust-24-ft-age-gender (Apache 2.0) —
    age regression + 3-way gender (female / male / child).
  - superb/wav2vec2-base-superb-er (Apache 2.0) — 4-way emotion.

Both are optional and degrade gracefully when transformers or the
model can't be loaded; if neither head loads, the engine raises
NotImplementedError and the gRPC layer returns UNIMPLEMENTED, which
surfaces as a 501 instead of a generic 500.
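A condensed sketch of that degradation path (the shape of it, not the full AnalysisHead):

from typing import Any

def load_emotion_head(model_id="superb/wav2vec2-base-superb-er"):
    # Lazy, best-effort load: record the failure instead of raising.
    try:
        from transformers import pipeline
        return pipeline("audio-classification", model=model_id), None
    except Exception as exc:  # transformers missing, download failed, ...
        return None, f"{type(exc).__name__}: {exc}"

def analyze(audio_path: str) -> dict[str, Any]:
    head, err = load_emotion_head()
    attrs: dict[str, Any] = {}
    if head is not None:
        scores = head(audio_path, top_k=8)
        attrs["emotion"] = {r["label"].lower(): float(r["score"]) for r in scores}
    if not attrs:
        # Nothing loaded: the gRPC layer turns this into UNIMPLEMENTED / 501.
        raise NotImplementedError("no analyze head could be loaded")
    return attrs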

Emotion classes pass through from the model (neutral/happy/angry/sad
on the default checkpoint); the e2e test now accepts any non-empty
dominant gender so custom age_gender_model overrides don't fail it.

Adds transformers to the backend's CPU and CUDA-12 requirements.

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): pin real WeSpeaker ResNet34 ONNX SHA-256

Replaces the placeholder hash in gallery/index.yaml with the actual
SHA-256 (7bb2f06e…) of the upstream
Wespeaker/wespeaker-voxceleb-resnet34-LM ONNX at ~25MB. `local-ai
models install wespeaker-resnet34` now succeeds.
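For reference, a sketch of how such a hash is produced for a gallery files: entry (the filename is a placeholder; download the upstream ONNX first):

import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

# Placeholder path: point this at the downloaded WeSpeaker ResNet34 export.
print(sha256_of("wespeaker-resnet34.onnx"))  # should start with 7bb2f06e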

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): soundfile loader + honest analyze default

Two issues surfaced on first end-to-end smoke with the actual backend
image:

1. torchaudio.load in torchaudio 2.8+ requires the torchcodec package
   for audio decoding. Switch SpeechBrainEngine._load_waveform to the
   already-present soundfile (listed in requirements.txt) plus a numpy
   linear resample to 16kHz; a standalone sketch of the loader follows
   this list. Drops a heavy ffmpeg-linked dep and the codepath we never
   exercise (torchaudio's ffmpeg backend).

2. The AnalysisHead was defaulting to audeering/wav2vec2-large-robust-
   24-ft-age-gender, but AutoModelForAudioClassification silently
   mangles that checkpoint — it reports the age head weights as
   UNEXPECTED and re-initialises the classifier head with random
   values, so the "gender" output is noise and there is no age output
   at all. Make age/gender opt-in instead (empty default; users wire
   a cleanly-loadable Wav2Vec2ForSequenceClassification checkpoint via
   the age_gender_model: option). Emotion keeps its working Superb default.
   Also broaden _infer_age_gender's tensor-shape handling and catch
   runtime exceptions so a dodgy age/gender head never takes down the
   whole analyze call.
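The loader swap from item 1 as a standalone sketch, roughly what _load_waveform now does (soundfile decode, mono downmix, numpy linear resample to 16kHz):

import numpy as np
import soundfile as sf

def load_waveform_16k(path: str) -> np.ndarray:
    audio, sr = sf.read(path, always_2d=False)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)  # downmix to mono
    audio = np.asarray(audio, dtype=np.float32)
    if sr != 16000:
        # Cheap linear resample: fine for 44.1/48kHz -> 16kHz speech.
        n = int(round(len(audio) * 16000 / sr))
        audio = np.interp(
            np.linspace(0, len(audio), n, endpoint=False),
            np.arange(len(audio)),
            audio,
        ).astype(np.float32)
    return audio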

Docs and README updated to match the new policy.

Verified with the branch-scoped gallery on localhost:
- voice/embed    → 192-d ECAPA-TDNN vector
- voice/verify   → same-clip dist≈6e-08 verified=true; cross-speaker
                   dist 0.76–0.99 verified=false (as expected)
- voice/register/identify/forget → round-trip works, 404 on unknown id
- voice/analyze  → emotion populated, age/gender omitted (opt-in)

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): real CI audio fixtures + fixture-agnostic verify spec

Two issues surfaced after CI actually ran the speaker-recognition e2e
target (I'd curl-tested against a running server but hadn't run the
make target locally):

1. The default BACKEND_TEST_VOICE_AUDIO_* URLs pointed at
   huggingface.co/datasets/CSTR-Edinburgh/vctk paths that return 404
   (the dataset is gated). Swap them for the speechbrain test samples
   served from github.com/speechbrain/speechbrain/raw/develop/ —
   public, no auth, correct 16kHz mono format.

2. The VoiceVerify spec required d(file1,file2) < 0.4, assuming
   file1/file2 were same-speaker. The speechbrain samples are three
   different speakers (example1/2/5), and there is no easy un-gated
   source of true same-speaker audio pairs (VoxCeleb/VCTK/LibriSpeech
   are all license- or size-gated for CI use). Replace the ceiling
   check with a relative-ordering assertion, d(pair) > d(same-clip)
   for both file2 and file3 (sketched after this list): that is enough
   to prove the embeddings encode speaker info, and it works with any
   three non-identical clips. Actual speaker ordering d(1,2) vs d(1,3)
   is logged but not asserted.
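The assertion from item 2, sketched in Python (the real spec lives in the Go e2e suite; stub and cosine_distance come from the earlier sketches, and the file names are the speechbrain fixtures):

def embed(path: str) -> list[float]:
    # One VoiceEmbed round-trip against the running backend.
    return list(stub.VoiceEmbed(backend_pb2.VoiceEmbedRequest(audio=path)).embedding)

d_same  = cosine_distance(embed("example1.wav"), embed("example1.wav"))   # ~0
d_pair2 = cosine_distance(embed("example1.wav"), embed("example2.flac"))
d_pair3 = cosine_distance(embed("example1.wav"), embed("example5.wav"))

# Relative ordering is all the spec asserts: any three non-identical
# clips must sit strictly further apart than a clip compared to itself.
assert d_pair2 > d_same and d_pair3 > d_same
print(f"d(1,2)={d_pair2:.3f} d(1,3)={d_pair3:.3f}")  # logged, not asserted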

Local run: 4/4 voice specs pass (Health, LoadModel, VoiceEmbed,
VoiceVerify) on the built backend image. 12 non-voice specs skipped
as expected.

Assisted-by: Claude:claude-opus-4-7

* fix(ci): checkout with submodules in the reusable backend_build workflow

The kokoros Rust backend build fails with

    failed to read .../sources/Kokoros/kokoros/Cargo.toml: No such file

because the reusable backend_build.yml workflow's actions/checkout
step was missing `submodules: true`. Dockerfile.rust does `COPY .
/LocalAI`, and without the submodule files the subsequent `cargo
build` can't find the vendored Kokoros crate.

The bug pre-dates this PR — scripts/changed-backends.js only triggers
the kokoros image job when something under backend/rust/kokoros or
the shared proto changes, so master had been coasting past it. The
voice-recognition proto addition re-broke it.

Other checkouts in backend.yml (llama-cpp-darwin) and test-extra.yml
(insightface, kokoros, speaker-recognition) already pass
`submodules: true`; this brings the shared backend image builder in
line.

Assisted-by: Claude:claude-opus-4-7
Ettore Di Giacinto
2026-04-23 12:07:14 +02:00
committed by GitHub
parent 1c59165d63
commit 181ebb6df4
53 changed files with 3747 additions and 6 deletions

View File

@@ -724,6 +724,19 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-speaker-recognition'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "speaker-recognition"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -2653,6 +2666,20 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
# speaker-recognition (voice/speaker biometrics)
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-speaker-recognition'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "speaker-recognition"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'intel'
cuda-major-version: ""
cuda-minor-version: ""

View File

@@ -108,6 +108,8 @@ jobs:
- name: Checkout
uses: actions/checkout@v6
with:
submodules: true
- name: Release space from worker
if: inputs.runs-on == 'ubuntu-latest'

View File

@@ -39,6 +39,7 @@ jobs:
voxtral: ${{ steps.detect.outputs.voxtral }}
kokoros: ${{ steps.detect.outputs.kokoros }}
insightface: ${{ steps.detect.outputs.insightface }}
speaker-recognition: ${{ steps.detect.outputs.speaker-recognition }}
steps:
- name: Checkout repository
uses: actions/checkout@v6
@@ -778,3 +779,29 @@ jobs:
- name: Build insightface backend image and run both model configurations
run: |
make test-extra-backend-insightface-all
tests-speaker-recognition-grpc:
needs: detect-changes
if: needs.detect-changes.outputs.speaker-recognition == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
make build-essential curl ca-certificates git tar
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.26.0'
- name: Free disk space
run: |
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
df -h
- name: Build speaker-recognition backend image and run the ECAPA-TDNN configuration
run: |
make test-extra-backend-speaker-recognition-all

View File

@@ -1,5 +1,5 @@
# Disable parallel execution for backend builds
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad
GOCMD=go
GOTEST=$(GOCMD) test
@@ -435,6 +435,7 @@ prepare-test-extra: protogen-python
$(MAKE) -C backend/python/trl
$(MAKE) -C backend/python/tinygrad
$(MAKE) -C backend/python/insightface
$(MAKE) -C backend/python/speaker-recognition
$(MAKE) -C backend/rust/kokoros kokoros-grpc
test-extra: prepare-test-extra
@@ -459,6 +460,7 @@ test-extra: prepare-test-extra
$(MAKE) -C backend/python/trl test
$(MAKE) -C backend/python/tinygrad test
$(MAKE) -C backend/python/insightface test
$(MAKE) -C backend/python/speaker-recognition test
$(MAKE) -C backend/rust/kokoros test
##
@@ -713,6 +715,41 @@ test-extra-backend-insightface-all: \
test-extra-backend-insightface-buffalo-sc \
test-extra-backend-insightface-opencv
## speaker-recognition — voice (speaker) biometrics.
##
## Audio fixtures default to the speechbrain test samples served
## straight from their GitHub repo — public, no auth needed, and they
## ship as 16kHz mono WAV/FLAC which is exactly what the engine wants.
## example{1,2,5} are three different speakers; the suite treats
## example1 as the "same-clip twin" probe (verify(clip, clip) must
## return distance≈0) and the other two as cross-speaker ceilings.
## Override with BACKEND_TEST_VOICE_AUDIO_{1,2,3}_FILE for offline runs.
VOICE_AUDIO_1_URL ?= https://github.com/speechbrain/speechbrain/raw/develop/tests/samples/single-mic/example1.wav
VOICE_AUDIO_2_URL ?= https://github.com/speechbrain/speechbrain/raw/develop/tests/samples/single-mic/example2.flac
VOICE_AUDIO_3_URL ?= https://github.com/speechbrain/speechbrain/raw/develop/tests/samples/single-mic/example5.wav
## ECAPA-TDNN via SpeechBrain — default CI configuration. Auto-downloads
## the checkpoint from HuggingFace on first LoadModel (bundled in the
## backend image pip install). 192-d embeddings, cosine-distance based.
## The e2e suite drives LoadModel directly so we don't rely on LocalAI's
## gallery flow here.
test-extra-backend-speaker-recognition-ecapa: docker-build-speaker-recognition
BACKEND_IMAGE=local-ai-backend:speaker-recognition \
BACKEND_TEST_MODEL_NAME=speechbrain/spkrec-ecapa-voxceleb \
BACKEND_TEST_OPTIONS=engine:speechbrain,source:speechbrain/spkrec-ecapa-voxceleb \
BACKEND_TEST_CAPS=health,load,voice_embed,voice_verify \
BACKEND_TEST_VOICE_AUDIO_1_URL=$(VOICE_AUDIO_1_URL) \
BACKEND_TEST_VOICE_AUDIO_2_URL=$(VOICE_AUDIO_2_URL) \
BACKEND_TEST_VOICE_AUDIO_3_URL=$(VOICE_AUDIO_3_URL) \
BACKEND_TEST_VOICE_VERIFY_DISTANCE_CEILING=0.4 \
$(MAKE) test-extra-backend
## Aggregate — today there's only one voice config; the target exists
## so the CI workflow matches the insightface-all naming convention and
## can grow to include WeSpeaker / 3D-Speaker later.
test-extra-backend-speaker-recognition-all: \
test-extra-backend-speaker-recognition-ecapa
## sglang mirrors the vllm setup: HuggingFace model id, same tiny Qwen,
## tool-call extraction via sglang's native qwen parser. CPU builds use
## sglang's upstream pyproject_cpu.toml recipe (see backend/python/sglang/install.sh).
@@ -859,6 +896,7 @@ BACKEND_FASTER_WHISPER = faster-whisper|python|.|false|true
BACKEND_COQUI = coqui|python|.|false|true
BACKEND_RFDETR = rfdetr|python|.|false|true
BACKEND_INSIGHTFACE = insightface|python|.|false|true
BACKEND_SPEAKER_RECOGNITION = speaker-recognition|python|.|false|true
BACKEND_KITTEN_TTS = kitten-tts|python|.|false|true
BACKEND_NEUTTS = neutts|python|.|false|true
BACKEND_KOKORO = kokoro|python|.|false|true
@@ -931,6 +969,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_FASTER_WHISPER)))
$(eval $(call generate-docker-build-target,$(BACKEND_COQUI)))
$(eval $(call generate-docker-build-target,$(BACKEND_RFDETR)))
$(eval $(call generate-docker-build-target,$(BACKEND_INSIGHTFACE)))
$(eval $(call generate-docker-build-target,$(BACKEND_SPEAKER_RECOGNITION)))
$(eval $(call generate-docker-build-target,$(BACKEND_KITTEN_TTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_NEUTTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_KOKORO)))
@@ -965,7 +1004,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
docker-save-%: backend-images
docker save local-ai-backend:$* -o backend-images/$*.tar
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface docker-build-speaker-recognition
########################################################
### Mock Backend for E2E Tests

View File

@@ -26,6 +26,9 @@ service Backend {
rpc Detect(DetectOptions) returns (DetectResponse) {}
rpc FaceVerify(FaceVerifyRequest) returns (FaceVerifyResponse) {}
rpc FaceAnalyze(FaceAnalyzeRequest) returns (FaceAnalyzeResponse) {}
rpc VoiceVerify(VoiceVerifyRequest) returns (VoiceVerifyResponse) {}
rpc VoiceAnalyze(VoiceAnalyzeRequest) returns (VoiceAnalyzeResponse) {}
rpc VoiceEmbed(VoiceEmbedRequest) returns (VoiceEmbedResponse) {}
rpc StoresSet(StoresSetOptions) returns (Result) {}
rpc StoresDelete(StoresDeleteOptions) returns (Result) {}
@@ -528,6 +531,57 @@ message FaceAnalyzeResponse {
repeated FaceAnalysis faces = 1;
}
// --- Voice (speaker) recognition messages ---
//
// Analogous to the Face* messages above, but for speaker biometrics.
// Audio fields accept a filesystem path (same convention as
// TranscriptRequest.dst). The HTTP layer materialises base64 / URL /
// data-URI inputs to a temp file before calling the gRPC backend.
message VoiceVerifyRequest {
string audio1 = 1; // path to first audio clip
string audio2 = 2; // path to second audio clip
float threshold = 3; // cosine-distance threshold; 0 = use backend default
bool anti_spoofing = 4; // reserved for future AASIST bolt-on
}
message VoiceVerifyResponse {
bool verified = 1;
float distance = 2; // 1 - cosine_similarity
float threshold = 3;
float confidence = 4; // 0-100
string model = 5; // e.g. "speechbrain/spkrec-ecapa-voxceleb"
float processing_time_ms = 6;
}
message VoiceAnalyzeRequest {
string audio = 1; // path to audio clip
repeated string actions = 2; // subset of ["age","gender","emotion"]; empty = all-supported
}
message VoiceAnalysis {
float start = 1; // segment start time in seconds (0 if single-utterance)
float end = 2; // segment end time in seconds
float age = 3;
string dominant_gender = 4;
map<string, float> gender = 5;
string dominant_emotion = 6;
map<string, float> emotion = 7;
}
message VoiceAnalyzeResponse {
repeated VoiceAnalysis segments = 1;
}
message VoiceEmbedRequest {
string audio = 1; // path to audio clip
}
message VoiceEmbedResponse {
repeated float embedding = 1;
string model = 2;
}
message ToolFormatMarkers {
string format_type = 1; // "json_native", "tag_with_json", "tag_with_tagged"

View File

@@ -3773,3 +3773,64 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-insightface"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-insightface
# speaker-recognition (voice/speaker biometrics) — Apache-2.0 stack
- &speakerrecognition
name: "speaker-recognition"
alias: "speaker-recognition"
# SpeechBrain is Apache-2.0. WeSpeaker / 3D-Speaker ONNX exports are
# Apache-2.0. The backend itself ships only Python deps — all model
# weights flow through LocalAI's gallery download mechanism (or
# SpeechBrain's built-in HF auto-download at first LoadModel).
license: apache-2.0
description: |
Speaker (voice) recognition backend — the audio analog to
insightface. Wraps SpeechBrain ECAPA-TDNN (default engine, 192-d
embeddings, ~1.9% EER on VoxCeleb) plus an OnnxDirectEngine for
pre-exported WeSpeaker / 3D-Speaker ONNX models.
Exposes speaker verification (/v1/voice/verify), speaker embedding
(/v1/voice/embed), speaker analysis (/v1/voice/analyze), and 1:N
speaker identification (/v1/voice/{register,identify,forget}).
Registrations use LocalAI's built-in vector store — same in-memory
backing the face-recognition registry uses, separate instance.
urls:
- https://speechbrain.github.io/
- https://github.com/wenet-e2e/wespeaker
- https://github.com/modelscope/3D-Speaker
tags:
- voice-recognition
- speaker-verification
- speaker-embedding
- gpu
- cpu
capabilities:
default: "cpu-speaker-recognition"
nvidia: "cuda12-speaker-recognition"
nvidia-cuda-12: "cuda12-speaker-recognition"
- !!merge <<: *speakerrecognition
name: "speaker-recognition-development"
capabilities:
default: "cpu-speaker-recognition-development"
nvidia: "cuda12-speaker-recognition-development"
nvidia-cuda-12: "cuda12-speaker-recognition-development"
- !!merge <<: *speakerrecognition
name: "cpu-speaker-recognition"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-speaker-recognition"
mirrors:
- localai/localai-backends:latest-cpu-speaker-recognition
- !!merge <<: *speakerrecognition
name: "cuda12-speaker-recognition"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-speaker-recognition"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-speaker-recognition
- !!merge <<: *speakerrecognition
name: "cpu-speaker-recognition-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-speaker-recognition"
mirrors:
- localai/localai-backends:master-cpu-speaker-recognition
- !!merge <<: *speakerrecognition
name: "cuda12-speaker-recognition-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-speaker-recognition"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-speaker-recognition

View File

@@ -0,0 +1,13 @@
.DEFAULT_GOAL := install
.PHONY: install
install:
bash install.sh
.PHONY: protogen-clean
protogen-clean:
$(RM) backend_pb2_grpc.py backend_pb2.py
.PHONY: clean
clean: protogen-clean
rm -rf venv __pycache__

View File

@@ -0,0 +1,40 @@
# speaker-recognition
Speaker (voice) recognition backend for LocalAI. The audio analog to
`insightface` — produces speaker embeddings and supports 1:1 voice
verification and voice demographic analysis.
## Engines
- **SpeechBrainEngine** (default): ECAPA-TDNN trained on VoxCeleb.
192-d L2-normalised embeddings, cosine distance for verification.
Auto-downloads from HuggingFace on first LoadModel.
- **OnnxDirectEngine**: Any pre-exported ONNX speaker encoder
(WeSpeaker ResNet, 3D-Speaker ERes2Net, CAM++, …). Model path comes
from the gallery `files:` entry.
Engine selection is gallery-driven: if the model config provides
`model_path:` / `onnx:` the ONNX engine is used, otherwise the
SpeechBrain engine.
## Endpoints
- `POST /v1/voice/verify` — 1:1 same-speaker check.
- `POST /v1/voice/embed` — extract a speaker embedding vector.
- `POST /v1/voice/analyze` — voice demographics, loaded lazily on
the first analyze call:
- **Emotion** (default, opt-out): `superb/wav2vec2-base-superb-er`
(Apache-2.0), 4-way categorical (neutral / happy / angry / sad).
- **Age + gender** (opt-in): no default — wire a checkpoint with a
standard `Wav2Vec2ForSequenceClassification` head via
`age_gender_model:<repo>` in options. The Audeering
age-gender model is *not* usable as a drop-in because its
multi-task head isn't loadable via `AutoModelForAudioClassification`.
Both heads are optional. When nothing loads, the engine returns 501.
## Audio input
Audio is materialised by the HTTP layer to a temp wav before calling
the gRPC backend. Accepted input forms on the HTTP side: URL, data-URI,
or raw base64. The backend itself always receives a filesystem path.

View File

@@ -0,0 +1,205 @@
#!/usr/bin/env python3
"""gRPC server for the LocalAI speaker-recognition backend.
Implements Health / LoadModel / Status plus the voice-specific methods:
VoiceVerify, VoiceAnalyze, VoiceEmbed. The heavy lifting lives in
engines.py — this file is just the gRPC plumbing, mirroring the
insightface backend's two-engine split (SpeechBrain + OnnxDirect).
"""
from __future__ import annotations
import argparse
import os
import signal
import sys
import time
from concurrent import futures
import backend_pb2
import backend_pb2_grpc
import grpc
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "common"))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "common"))
from grpc_auth import get_auth_interceptors # noqa: E402
from engines import SpeakerEngine, build_engine # noqa: E402
_ONE_DAY = 60 * 60 * 24
MAX_WORKERS = int(os.environ.get("PYTHON_GRPC_MAX_WORKERS", "1"))
# ECAPA-TDNN on VoxCeleb is the reference. Threshold is tuned for
# cosine distance (1 - cosine_similarity). Clients may override.
DEFAULT_VERIFY_THRESHOLD = 0.25
def _parse_options(raw: list[str]) -> dict[str, str]:
out: dict[str, str] = {}
for entry in raw:
if ":" not in entry:
continue
k, v = entry.split(":", 1)
out[k.strip()] = v.strip()
return out
class BackendServicer(backend_pb2_grpc.BackendServicer):
def __init__(self) -> None:
self.engine: SpeakerEngine | None = None
self.engine_name: str = ""
self.model_name: str = ""
self.verify_threshold: float = DEFAULT_VERIFY_THRESHOLD
def Health(self, request, context):
return backend_pb2.Reply(message=bytes("OK", "utf-8"))
def LoadModel(self, request, context):
options = _parse_options(list(request.Options))
# Surface LocalAI's models directory (ModelPath) so engines can
# anchor relative paths and auto-download into a writable spot
# alongside every other gallery-managed asset.
options["_model_path"] = request.ModelPath or ""
try:
engine, engine_name = build_engine(request.Model, options)
except Exception as exc: # noqa: BLE001
return backend_pb2.Result(success=False, message=f"engine init failed: {exc}")
self.engine = engine
self.engine_name = engine_name
self.model_name = request.Model
threshold_opt = options.get("verify_threshold")
if threshold_opt:
try:
self.verify_threshold = float(threshold_opt)
except ValueError:
pass
return backend_pb2.Result(success=True, message=f"loaded {engine_name}")
def Status(self, request, context):
state = backend_pb2.StatusResponse.State.READY if self.engine else backend_pb2.StatusResponse.State.UNINITIALIZED
return backend_pb2.StatusResponse(state=state)
def _require_engine(self, context) -> SpeakerEngine | None:
if self.engine is None:
context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
context.set_details("no speaker-recognition model loaded")
return None
return self.engine
def VoiceVerify(self, request, context):
engine = self._require_engine(context)
if engine is None:
return backend_pb2.VoiceVerifyResponse()
if not request.audio1 or not request.audio2:
context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
context.set_details("audio1 and audio2 are required")
return backend_pb2.VoiceVerifyResponse()
threshold = request.threshold if request.threshold > 0 else self.verify_threshold
started = time.time()
try:
distance = engine.compare(request.audio1, request.audio2)
except Exception as exc: # noqa: BLE001
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(f"voice verify failed: {exc}")
return backend_pb2.VoiceVerifyResponse()
elapsed_ms = (time.time() - started) * 1000.0
# Confidence goes linearly from 100 at distance=0 to 0 at distance=threshold.
confidence = max(0.0, min(100.0, (1.0 - distance / threshold) * 100.0))
return backend_pb2.VoiceVerifyResponse(
verified=distance <= threshold,
distance=distance,
threshold=threshold,
confidence=confidence,
model=self.model_name,
processing_time_ms=elapsed_ms,
)
def VoiceEmbed(self, request, context):
engine = self._require_engine(context)
if engine is None:
return backend_pb2.VoiceEmbedResponse()
if not request.audio:
context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
context.set_details("audio is required")
return backend_pb2.VoiceEmbedResponse()
try:
vec = engine.embed(request.audio)
except Exception as exc: # noqa: BLE001
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(f"voice embed failed: {exc}")
return backend_pb2.VoiceEmbedResponse()
return backend_pb2.VoiceEmbedResponse(embedding=list(vec), model=self.model_name)
def VoiceAnalyze(self, request, context):
engine = self._require_engine(context)
if engine is None:
return backend_pb2.VoiceAnalyzeResponse()
if not request.audio:
context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
context.set_details("audio is required")
return backend_pb2.VoiceAnalyzeResponse()
actions = list(request.actions) or ["age", "gender", "emotion"]
try:
segments = engine.analyze(request.audio, actions)
except NotImplementedError:
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details(f"analyze not supported by {self.engine_name}")
return backend_pb2.VoiceAnalyzeResponse()
except Exception as exc: # noqa: BLE001
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(f"voice analyze failed: {exc}")
return backend_pb2.VoiceAnalyzeResponse()
proto_segments = []
for seg in segments:
proto_segments.append(
backend_pb2.VoiceAnalysis(
start=seg.get("start", 0.0),
end=seg.get("end", 0.0),
age=seg.get("age", 0.0),
dominant_gender=seg.get("dominant_gender", ""),
gender=seg.get("gender", {}),
dominant_emotion=seg.get("dominant_emotion", ""),
emotion=seg.get("emotion", {}),
)
)
return backend_pb2.VoiceAnalyzeResponse(segments=proto_segments)
def serve(address: str) -> None:
interceptors = get_auth_interceptors()
server = grpc.server(
futures.ThreadPoolExecutor(max_workers=MAX_WORKERS),
interceptors=interceptors,
options=[
("grpc.max_send_message_length", 128 * 1024 * 1024),
("grpc.max_receive_message_length", 128 * 1024 * 1024),
],
)
backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
server.add_insecure_port(address)
server.start()
print("speaker-recognition backend listening on", address, flush=True)
def _stop(*_):
server.stop(0)
sys.exit(0)
signal.signal(signal.SIGTERM, _stop)
signal.signal(signal.SIGINT, _stop)
try:
while True:
time.sleep(_ONE_DAY)
except KeyboardInterrupt:
server.stop(0)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--addr", default="localhost:50051")
args = parser.parse_args()
serve(args.addr)

View File

@@ -0,0 +1,387 @@
"""Speaker-recognition engines.
Two engines are offered, mirroring the insightface backend's split:
* SpeechBrainEngine: full PyTorch / SpeechBrain path. Uses the
ECAPA-TDNN recipe trained on VoxCeleb; 192-d L2-normalized
embeddings, cosine distance for verification. Auto-downloads the
checkpoint into LocalAI's models directory on first LoadModel.
* OnnxDirectEngine: CPU-friendly fallback that runs pre-exported
ONNX speaker encoders (WeSpeaker ResNet34, 3D-Speaker ERes2Net,
CAM++, etc.). Model paths come from the model config — the gallery
`files:` flow drops them into the models directory.
Engine selection follows the same gallery-driven convention face
recognition uses (insightface commits 9c6da0f7 / 405fec0b): the
Python backend reads `engine` / `model_path` / `checkpoint` from the
options dict and picks an engine accordingly.
"""
from __future__ import annotations
import os
from typing import Any, Iterable, Protocol
class SpeakerEngine(Protocol):
"""Interface both concrete engines satisfy."""
name: str
def embed(self, audio_path: str) -> list[float]: # pragma: no cover - interface
...
def compare(self, audio1: str, audio2: str) -> float: # pragma: no cover
...
def analyze(self, audio_path: str, actions: Iterable[str]) -> list[dict[str, Any]]: # pragma: no cover
...
def _cosine_distance(a, b) -> float:
import numpy as np
va = np.asarray(a, dtype=np.float32).reshape(-1)
vb = np.asarray(b, dtype=np.float32).reshape(-1)
na = float(np.linalg.norm(va))
nb = float(np.linalg.norm(vb))
if na == 0.0 or nb == 0.0:
return 1.0
return float(1.0 - np.dot(va, vb) / (na * nb))
class AnalysisHead:
"""Age / gender / emotion head, lazy-loaded on first analyze call.
Wraps two open-licence HuggingFace checkpoints:
* audeering/wav2vec2-large-robust-24-ft-age-gender — age
regression (0-100 years) + 3-way gender (female/male/child).
Apache 2.0.
* superb/wav2vec2-base-superb-er — 4-way emotion classification
(neutral / happy / angry / sad). Apache 2.0.
Either model is optional — the head degrades gracefully to only the
attributes it could load. Override the checkpoint with the
`age_gender_model` / `emotion_model` option if you want something
else. Set either to an empty string to disable that head.
"""
# Age + gender is OFF by default: the high-accuracy Apache-2.0
# checkpoint (Audeering wav2vec2-large-robust-24-ft-age-gender) uses a
# custom multi-task head that AutoModelForAudioClassification silently
# mangles — it drops the age weights as UNEXPECTED and re-initialises
# the classifier head with random values, so the output is noise. Users
# who have a cleanly loadable age/gender classifier can opt in with
# `age_gender_model:<repo>` in options. The emotion default below
# (superb/wav2vec2-base-superb-er) loads via the standard audio-
# classification pipeline with no such caveat.
DEFAULT_AGE_GENDER_MODEL = ""
DEFAULT_EMOTION_MODEL = "superb/wav2vec2-base-superb-er"
AGE_GENDER_LABELS = ("female", "male", "child")
def __init__(self, options: dict[str, str]):
self._options = options
self._age_gender = None
self._age_gender_processor = None
self._age_gender_loaded = False
self._age_gender_error: str | None = None
self._emotion = None
self._emotion_loaded = False
self._emotion_error: str | None = None
# --- age / gender -------------------------------------------------
def _ensure_age_gender(self):
if self._age_gender_loaded:
return
self._age_gender_loaded = True
model_id = self._options.get(
"age_gender_model", self.DEFAULT_AGE_GENDER_MODEL
)
if not model_id:
self._age_gender_error = "disabled"
return
try:
# Late imports — torch / transformers are heavy and only
# pulled in when the analyze head actually runs.
import torch # type: ignore
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification # type: ignore
self._torch = torch
self._age_gender_processor = AutoFeatureExtractor.from_pretrained(model_id)
self._age_gender = AutoModelForAudioClassification.from_pretrained(model_id)
self._age_gender.eval()
except Exception as exc: # noqa: BLE001
self._age_gender_error = f"{type(exc).__name__}: {exc}"
def _infer_age_gender(self, waveform_16k) -> dict[str, Any]:
self._ensure_age_gender()
if self._age_gender is None:
return {}
import numpy as np
try:
inputs = self._age_gender_processor(
waveform_16k, sampling_rate=16000, return_tensors="pt"
)
with self._torch.no_grad():
outputs = self._age_gender(**inputs)
# Audeering's checkpoint is published with a custom head: the
# official recipe exposes `(hidden_states, logits_age, logits_gender)`.
# AutoModelForAudioClassification flattens that into a single
# `logits` tensor of shape [batch, 4] — [age_regression, female, male, child].
# Fall back gracefully when the shape is different (e.g. a
# user-supplied age_gender_model checkpoint that returns a proper tuple).
hidden = getattr(outputs, "logits", outputs)
age_years = None
gender_logits = None
if isinstance(hidden, (tuple, list)) and len(hidden) >= 2:
age_years = float(hidden[0].squeeze().item()) * 100.0
gender_logits = hidden[1]
else:
flat = hidden.squeeze()
if flat.ndim == 1 and flat.numel() >= 4:
age_years = float(flat[0].item()) * 100.0
gender_logits = flat[1:4]
elif flat.ndim == 1 and flat.numel() == 1:
age_years = float(flat.item()) * 100.0
if age_years is None and gender_logits is None:
return {}
result: dict[str, Any] = {}
if age_years is not None:
result["age"] = age_years
if gender_logits is not None:
probs = self._torch.softmax(gender_logits, dim=-1).cpu().numpy()
probs = np.asarray(probs).reshape(-1)
gender_map = {
label: float(probs[i])
for i, label in enumerate(self.AGE_GENDER_LABELS[: len(probs)])
}
result["gender"] = gender_map
if gender_map:
dom = max(gender_map.items(), key=lambda kv: kv[1])[0]
result["dominant_gender"] = {
"female": "Female",
"male": "Male",
"child": "Child",
}.get(dom, dom.capitalize())
return result
except Exception as exc: # noqa: BLE001
# Analyze is a best-effort feature — never take down the
# whole analyze call because the age/gender head had a bad
# day. Mark the failure so the emotion branch still runs.
self._age_gender_error = f"runtime: {type(exc).__name__}: {exc}"
return {}
# --- emotion ------------------------------------------------------
def _ensure_emotion(self):
if self._emotion_loaded:
return
self._emotion_loaded = True
model_id = self._options.get("emotion_model", self.DEFAULT_EMOTION_MODEL)
if not model_id:
self._emotion_error = "disabled"
return
try:
from transformers import pipeline # type: ignore
self._emotion = pipeline("audio-classification", model=model_id)
except Exception as exc: # noqa: BLE001
self._emotion_error = f"{type(exc).__name__}: {exc}"
def _infer_emotion(self, audio_path: str) -> dict[str, Any]:
self._ensure_emotion()
if self._emotion is None:
return {}
try:
raw = self._emotion(audio_path, top_k=8)
except Exception as exc: # noqa: BLE001
# Second-line defense: don't fail the whole analyze call
# over a runtime inference hiccup.
self._emotion_error = f"runtime: {type(exc).__name__}: {exc}"
return {}
emotion_map = {row["label"].lower(): float(row["score"]) for row in raw}
if not emotion_map:
return {}
dom = max(emotion_map.items(), key=lambda kv: kv[1])[0]
return {"emotion": emotion_map, "dominant_emotion": dom}
# --- orchestrator -------------------------------------------------
def analyze(self, audio_path: str, waveform_16k, actions: Iterable[str]) -> dict[str, Any]:
wanted = {a.strip().lower() for a in actions} if actions else {"age", "gender", "emotion"}
result: dict[str, Any] = {}
if "age" in wanted or "gender" in wanted:
ag = self._infer_age_gender(waveform_16k)
if "age" in wanted and "age" in ag:
result["age"] = ag["age"]
if "gender" in wanted:
if "gender" in ag:
result["gender"] = ag["gender"]
if "dominant_gender" in ag:
result["dominant_gender"] = ag["dominant_gender"]
if "emotion" in wanted:
em = self._infer_emotion(audio_path)
result.update(em)
return result
class SpeechBrainEngine:
"""ECAPA-TDNN via SpeechBrain. Auto-downloads on first use."""
name = "speechbrain-ecapa-tdnn"
def __init__(self, model_name: str, options: dict[str, str]):
# Late imports so the module can be introspected / tested
# without torch / speechbrain being installed.
from speechbrain.inference.speaker import EncoderClassifier # type: ignore
source = options.get("source") or model_name or "speechbrain/spkrec-ecapa-voxceleb"
savedir = options.get("_model_path") or os.environ.get("HF_HOME") or "./pretrained_models"
self._model = EncoderClassifier.from_hparams(source=source, savedir=savedir)
self._analysis = AnalysisHead(options)
def _load_waveform(self, path: str):
# Use soundfile + torch directly — torchaudio.load in torchaudio
# 2.8+ requires the torchcodec package for decoding, which adds
# another heavy ffmpeg-linked dep. soundfile covers WAV/FLAC
# which is what we care about here.
import numpy as np
import soundfile as sf # type: ignore
import torch # type: ignore
audio, sr = sf.read(path, always_2d=False)
if audio.ndim > 1:
audio = audio.mean(axis=1)
audio = np.asarray(audio, dtype=np.float32)
if sr != 16000:
# Simple linear resample — good enough for 16kHz downsampling
# from 44.1/48kHz, and we expect 16kHz inputs in practice.
ratio = 16000 / float(sr)
n = int(round(len(audio) * ratio))
audio = np.interp(
np.linspace(0, len(audio), n, endpoint=False),
np.arange(len(audio)),
audio,
).astype(np.float32)
return torch.from_numpy(audio).unsqueeze(0) # [1, T]
def embed(self, audio_path: str) -> list[float]:
waveform = self._load_waveform(audio_path)
vec = self._model.encode_batch(waveform).squeeze().detach().cpu().numpy()
return [float(x) for x in vec]
def compare(self, audio1: str, audio2: str) -> float:
return _cosine_distance(self.embed(audio1), self.embed(audio2))
def analyze(self, audio_path: str, actions):
# Age / gender / emotion aren't produced by ECAPA-TDNN itself;
# delegate to AnalysisHead which wraps separate Apache-2.0
# checkpoints. Returns a single segment spanning the clip —
# segmentation / diarisation is a future enhancement.
waveform = self._load_waveform(audio_path)
mono = waveform.squeeze().detach().cpu().numpy()
attrs = self._analysis.analyze(audio_path, mono, actions)
if not attrs:
raise NotImplementedError(
"analyze head failed to load — install transformers + torch or pass age_gender_model/emotion_model options"
)
duration = float(mono.shape[-1]) / 16000.0 if mono.size else 0.0
return [dict(start=0.0, end=duration, **attrs)]
class OnnxDirectEngine:
"""Run a pre-exported ONNX speaker encoder (WeSpeaker / 3D-Speaker)."""
name = "onnx-direct"
def __init__(self, model_name: str, options: dict[str, str]):
import onnxruntime as ort # type: ignore
# The gallery is expected to have dropped the ONNX file under
# the models directory; accept either an absolute path or a
# filename relative to _model_path.
onnx_path = options.get("model_path") or options.get("onnx")
if not onnx_path:
raise ValueError("OnnxDirectEngine requires `model_path: <file.onnx>` in options")
if not os.path.isabs(onnx_path):
onnx_path = os.path.join(options.get("_model_path", ""), onnx_path)
if not os.path.isfile(onnx_path):
raise FileNotFoundError(f"ONNX model not found: {onnx_path}")
providers = options.get("providers")
if providers:
provider_list = [p.strip() for p in providers.split(",") if p.strip()]
else:
provider_list = ["CPUExecutionProvider"]
self._session = ort.InferenceSession(onnx_path, providers=provider_list)
self._input_name = self._session.get_inputs()[0].name
self._expected_sr = int(options.get("sample_rate", "16000"))
self._analysis = AnalysisHead(options)
def _load_waveform(self, path: str):
import numpy as np
import soundfile as sf # type: ignore
audio, sr = sf.read(path, always_2d=False)
if sr != self._expected_sr:
# Cheap linear resample — good enough for sanity; callers
# should pre-resample for production.
ratio = self._expected_sr / float(sr)
n = int(round(len(audio) * ratio))
audio = np.interp(
np.linspace(0, len(audio), n, endpoint=False),
np.arange(len(audio)),
audio,
)
if audio.ndim > 1:
audio = audio.mean(axis=1)
return audio.astype("float32")
def embed(self, audio_path: str) -> list[float]:
import numpy as np
audio = self._load_waveform(audio_path)
feed = audio.reshape(1, -1)
out = self._session.run(None, {self._input_name: feed})
vec = np.asarray(out[0]).reshape(-1)
return [float(x) for x in vec]
def compare(self, audio1: str, audio2: str) -> float:
return _cosine_distance(self.embed(audio1), self.embed(audio2))
def analyze(self, audio_path: str, actions):
# AnalysisHead expects 16kHz mono; _load_waveform already
# resamples to self._expected_sr. If the user configured a
# non-16k expected rate, resample one more time for analyze.
audio = self._load_waveform(audio_path)
if self._expected_sr != 16000:
import numpy as np
ratio = 16000 / float(self._expected_sr)
n = int(round(len(audio) * ratio))
audio = np.interp(
np.linspace(0, len(audio), n, endpoint=False),
np.arange(len(audio)),
audio,
).astype("float32")
attrs = self._analysis.analyze(audio_path, audio, actions)
if not attrs:
raise NotImplementedError(
"analyze head failed to load — install transformers + torch or pass age_gender_model/emotion_model options"
)
duration = float(len(audio)) / 16000.0 if len(audio) else 0.0
return [dict(start=0.0, end=duration, **attrs)]
def build_engine(model_name: str, options: dict[str, str]) -> tuple[SpeakerEngine, str]:
"""Pick an engine based on the options. ONNX path takes priority:
if the gallery has dropped a `model_path:` or `onnx:` option, run
the direct ONNX engine. Otherwise, fall back to SpeechBrain.
"""
engine_kind = (options.get("engine") or "").lower()
if engine_kind == "onnx" or options.get("model_path") or options.get("onnx"):
return OnnxDirectEngine(model_name, options), OnnxDirectEngine.name
return SpeechBrainEngine(model_name, options), SpeechBrainEngine.name

View File

@@ -0,0 +1,19 @@
#!/bin/bash
set -e
backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
source $backend_dir/common/libbackend.sh
else
source $backend_dir/../common/libbackend.sh
fi
installRequirements
# No pre-baked model weights. Weights flow through LocalAI's gallery
# `files:` mechanism — see gallery entries for speechbrain-ecapa-tdnn
# and WeSpeaker / 3D-Speaker ONNX packs. SpeechBrain's
# EncoderClassifier.from_hparams also knows how to auto-download from
# HuggingFace into the configured savedir (we point it at ModelPath),
# so the first LoadModel call bootstraps the checkpoint if the gallery
# flow wasn't used.

View File

@@ -0,0 +1,5 @@
torch
torchaudio
speechbrain
transformers
onnxruntime

View File

@@ -0,0 +1,5 @@
torch
torchaudio
speechbrain
transformers
onnxruntime-gpu

View File

@@ -0,0 +1,5 @@
grpcio==1.71.0
protobuf
grpcio-tools
numpy
soundfile

View File

@@ -0,0 +1,9 @@
#!/bin/bash
backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
source $backend_dir/common/libbackend.sh
else
source $backend_dir/../common/libbackend.sh
fi
startBackend $@

View File

@@ -0,0 +1,78 @@
"""Unit tests for the speaker-recognition gRPC backend.
The servicer is instantiated in-process (no gRPC channel) and driven
directly. The default path exercises SpeechBrain's ECAPA-TDNN — the
first run downloads the checkpoint into a temp savedir. Tests are
skipped gracefully when the heavy optional dependencies (torch /
speechbrain / onnxruntime) are not installed, so the gRPC plumbing
can still be verified on a bare image.
"""
from __future__ import annotations
import importlib
import os
import sys
import tempfile
import unittest
sys.path.insert(0, os.path.dirname(__file__))
import backend_pb2 # noqa: E402
from backend import BackendServicer # noqa: E402
def _have(*mods: str) -> bool:
for m in mods:
if importlib.util.find_spec(m) is None:
return False
return True
class _FakeCtx:
"""Minimal stand-in for a gRPC servicer context."""
def __init__(self) -> None:
self.code = None
self.details = ""
def set_code(self, c):
self.code = c
def set_details(self, d):
self.details = d
class ServicerPlumbingTest(unittest.TestCase):
"""Checks that LoadModel returns a clear error when no engine deps
are installed, and that Voice* calls on an uninitialised servicer
surface FAILED_PRECONDITION — both verifying the gRPC wiring
without requiring SpeechBrain or ONNX at test time."""
def test_pre_load_voice_calls_are_rejected(self):
svc = BackendServicer()
ctx = _FakeCtx()
svc.VoiceVerify(backend_pb2.VoiceVerifyRequest(audio1="/tmp/a.wav", audio2="/tmp/b.wav"), ctx)
self.assertEqual(str(ctx.code), "StatusCode.FAILED_PRECONDITION")
def test_load_without_deps_fails_cleanly(self):
svc = BackendServicer()
req = backend_pb2.ModelOptions(Model="speechbrain/spkrec-ecapa-voxceleb", ModelPath="")
result = svc.LoadModel(req, _FakeCtx())
# Either the deps are installed and it loaded, or they aren't
# and we got a structured error instead of a crash.
self.assertTrue(result.success or "engine init failed" in result.message)
@unittest.skipUnless(_have("speechbrain", "torch", "torchaudio"), "speechbrain / torch missing")
class SpeechBrainEngineSmokeTest(unittest.TestCase):
def test_load_and_embed(self):
svc = BackendServicer()
with tempfile.TemporaryDirectory() as td:
req = backend_pb2.ModelOptions(Model="speechbrain/spkrec-ecapa-voxceleb", ModelPath=td)
result = svc.LoadModel(req, _FakeCtx())
self.assertTrue(result.success, result.message)
if __name__ == "__main__":
unittest.main()

View File

@@ -0,0 +1,11 @@
#!/bin/bash
set -e
backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
source $backend_dir/common/libbackend.sh
else
source $backend_dir/../common/libbackend.sh
fi
runUnittests

View File

@@ -386,6 +386,27 @@ impl Backend for KokorosService {
Err(Status::unimplemented("Not supported"))
}
async fn voice_verify(
&self,
_: Request<backend::VoiceVerifyRequest>,
) -> Result<Response<backend::VoiceVerifyResponse>, Status> {
Err(Status::unimplemented("Not supported"))
}
async fn voice_analyze(
&self,
_: Request<backend::VoiceAnalyzeRequest>,
) -> Result<Response<backend::VoiceAnalyzeResponse>, Status> {
Err(Status::unimplemented("Not supported"))
}
async fn voice_embed(
&self,
_: Request<backend::VoiceEmbedRequest>,
) -> Result<Response<backend::VoiceEmbedResponse>, Status> {
Err(Status::unimplemented("Not supported"))
}
async fn stores_set(
&self,
_: Request<backend::StoresSetOptions>,

View File

@@ -14,6 +14,7 @@ import (
"github.com/mudler/LocalAI/core/services/facerecognition"
"github.com/mudler/LocalAI/core/services/galleryop"
"github.com/mudler/LocalAI/core/services/nodes"
"github.com/mudler/LocalAI/core/services/voicerecognition"
"github.com/mudler/LocalAI/core/templates"
pkggrpc "github.com/mudler/LocalAI/pkg/grpc"
"github.com/mudler/LocalAI/pkg/model"
@@ -29,6 +30,12 @@ import (
// family per deployment; we keep the door open instead.
const faceEmbeddingDim = 0
// voiceEmbeddingDim is the expected dimension for speaker embeddings.
// 0 so the Registry accepts whatever dim the loaded recognizer
// produces — ECAPA-TDNN is 192, WeSpeaker ResNet34 is 256, 3D-Speaker
// ERes2Net is 192, CAM++ is 512.
const voiceEmbeddingDim = 0
type Application struct {
backendLoader *config.ModelConfigLoader
modelLoader *model.ModelLoader
@@ -39,6 +46,7 @@ type Application struct {
agentJobService *agentpool.AgentJobService
agentPoolService atomic.Pointer[agentpool.AgentPoolService]
faceRegistry facerecognition.Registry
voiceRegistry voicerecognition.Registry
authDB *gorm.DB
watchdogMutex sync.Mutex
watchdogStop chan bool
@@ -78,6 +86,14 @@ func newApplication(appConfig *config.ApplicationConfig) *Application {
}
app.faceRegistry = facerecognition.NewStoreRegistry(faceStoreResolver, "", faceEmbeddingDim)
// Voice (speaker) recognition registry — same plumbing, separate
// registry so embedding spaces stay isolated (a face vector and a
// speaker vector are not comparable).
voiceStoreResolver := func(_ context.Context, storeName string) (pkggrpc.Backend, error) {
return corebackend.StoreBackend(ml, appConfig, storeName, "")
}
app.voiceRegistry = voicerecognition.NewStoreRegistry(voiceStoreResolver, "", voiceEmbeddingDim)
return app
}
@@ -130,6 +146,14 @@ func (a *Application) FaceRegistry() facerecognition.Registry {
return a.faceRegistry
}
// VoiceRegistry returns the voice (speaker) recognition registry used
// for 1:N identification. Same in-memory local-store backing as
// FaceRegistry but a separate instance — voice embeddings live in
// their own vector space.
func (a *Application) VoiceRegistry() voicerecognition.Registry {
return a.voiceRegistry
}
// AuthDB returns the auth database connection, or nil if auth is not enabled.
func (a *Application) AuthDB() *gorm.DB {
return a.authDB

View File

@@ -0,0 +1,58 @@
package backend
import (
"context"
"fmt"
"time"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/trace"
"github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/model"
)
func VoiceAnalyze(
audio string,
actions []string,
loader *model.ModelLoader,
appConfig *config.ApplicationConfig,
modelConfig config.ModelConfig,
) (*proto.VoiceAnalyzeResponse, error) {
opts := ModelOptions(modelConfig, appConfig)
voiceModel, err := loader.Load(opts...)
if err != nil {
recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
return nil, err
}
if voiceModel == nil {
return nil, fmt.Errorf("could not load voice recognition model")
}
var startTime time.Time
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
startTime = time.Now()
}
res, err := voiceModel.VoiceAnalyze(context.Background(), &proto.VoiceAnalyzeRequest{
Audio: audio,
Actions: actions,
})
if appConfig.EnableTracing {
errStr := ""
if err != nil {
errStr = err.Error()
}
trace.RecordBackendTrace(trace.BackendTrace{
Timestamp: startTime,
Duration: time.Since(startTime),
Type: trace.BackendTraceVoiceAnalyze,
ModelName: modelConfig.Name,
Backend: modelConfig.Backend,
Error: errStr,
})
}
return res, err
}

View File

@@ -0,0 +1,66 @@
package backend
import (
"context"
"fmt"
"time"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/trace"
"github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/model"
)
// VoiceEmbed returns a speaker embedding (typically 192-d for ECAPA-TDNN)
// for the audio file at audioPath. Unlike ModelEmbedding (which is
// OpenAI-compatible and text-only), this call takes an audio path and
// returns the backend's speaker-encoder output.
func VoiceEmbed(
audioPath string,
loader *model.ModelLoader,
appConfig *config.ApplicationConfig,
modelConfig config.ModelConfig,
) (*proto.VoiceEmbedResponse, error) {
opts := ModelOptions(modelConfig, appConfig)
voiceModel, err := loader.Load(opts...)
if err != nil {
recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
return nil, err
}
if voiceModel == nil {
return nil, fmt.Errorf("could not load voice recognition model")
}
var startTime time.Time
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
startTime = time.Now()
}
res, err := voiceModel.VoiceEmbed(context.Background(), &proto.VoiceEmbedRequest{
Audio: audioPath,
})
if appConfig.EnableTracing {
errStr := ""
if err != nil {
errStr = err.Error()
}
trace.RecordBackendTrace(trace.BackendTrace{
Timestamp: startTime,
Duration: time.Since(startTime),
Type: trace.BackendTraceVoiceEmbed,
ModelName: modelConfig.Name,
Backend: modelConfig.Backend,
Error: errStr,
})
}
if err != nil {
return nil, err
}
if res == nil || len(res.Embedding) == 0 {
return nil, fmt.Errorf("voice embedding returned empty vector (no speech detected?)")
}
return res, nil
}

View File

@@ -0,0 +1,61 @@
package backend
import (
"context"
"fmt"
"time"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/trace"
"github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/model"
)
func VoiceVerify(
audio1, audio2 string,
threshold float32,
antiSpoofing bool,
loader *model.ModelLoader,
appConfig *config.ApplicationConfig,
modelConfig config.ModelConfig,
) (*proto.VoiceVerifyResponse, error) {
opts := ModelOptions(modelConfig, appConfig)
voiceModel, err := loader.Load(opts...)
if err != nil {
recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
return nil, err
}
if voiceModel == nil {
return nil, fmt.Errorf("could not load voice recognition model")
}
var startTime time.Time
if appConfig.EnableTracing {
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
startTime = time.Now()
}
res, err := voiceModel.VoiceVerify(context.Background(), &proto.VoiceVerifyRequest{
Audio1: audio1,
Audio2: audio2,
Threshold: threshold,
AntiSpoofing: antiSpoofing,
})
if appConfig.EnableTracing {
errStr := ""
if err != nil {
errStr = err.Error()
}
trace.RecordBackendTrace(trace.BackendTrace{
Timestamp: startTime,
Duration: time.Since(startTime),
Type: trace.BackendTraceVoiceVerify,
ModelName: modelConfig.Name,
Backend: modelConfig.Backend,
Error: errStr,
})
}
return res, err
}

View File

@@ -588,7 +588,8 @@ const (
FLAG_VAD ModelConfigUsecase = 0b010000000000
FLAG_VIDEO ModelConfigUsecase = 0b100000000000
FLAG_DETECTION ModelConfigUsecase = 0b1000000000000
FLAG_FACE_RECOGNITION ModelConfigUsecase = 0b10000000000000
FLAG_SPEAKER_RECOGNITION ModelConfigUsecase = 0b100000000000000
// Common Subsets
FLAG_LLM ModelConfigUsecase = FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT
@@ -612,7 +613,8 @@ func GetAllModelConfigUsecases() map[string]ModelConfigUsecase {
"FLAG_LLM": FLAG_LLM,
"FLAG_VIDEO": FLAG_VIDEO,
"FLAG_DETECTION": FLAG_DETECTION,
"FLAG_FACE_RECOGNITION": FLAG_FACE_RECOGNITION,
"FLAG_FACE_RECOGNITION": FLAG_FACE_RECOGNITION,
"FLAG_SPEAKER_RECOGNITION": FLAG_SPEAKER_RECOGNITION,
}
}
@@ -653,7 +655,7 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
nonTextGenBackends := []string{
"whisper", "piper", "kokoro",
"diffusers", "stablediffusion", "stablediffusion-ggml",
"rerankers", "silero-vad", "rfdetr", "insightface",
"rerankers", "silero-vad", "rfdetr", "insightface", "speaker-recognition",
"transformers-musicgen", "ace-step", "acestep-cpp",
}
@@ -743,6 +745,13 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
}
}
if (u & FLAG_SPEAKER_RECOGNITION) == FLAG_SPEAKER_RECOGNITION {
speakerBackends := []string{"speaker-recognition"}
if !slices.Contains(speakerBackends, c.Backend) {
return false
}
}
if (u & FLAG_SOUND_GENERATION) == FLAG_SOUND_GENERATION {
soundGenBackends := []string{"transformers-musicgen", "ace-step", "acestep-cpp", "mock-backend"}
if !slices.Contains(soundGenBackends, c.Backend) {

View File

@@ -65,6 +65,14 @@ var RouteFeatureRegistry = []RouteFeature{
{"POST", "/v1/face/identify", FeatureFaceRecognition},
{"POST", "/v1/face/forget", FeatureFaceRecognition},
// Voice (speaker) recognition
{"POST", "/v1/voice/verify", FeatureVoiceRecognition},
{"POST", "/v1/voice/analyze", FeatureVoiceRecognition},
{"POST", "/v1/voice/embed", FeatureVoiceRecognition},
{"POST", "/v1/voice/register", FeatureVoiceRecognition},
{"POST", "/v1/voice/identify", FeatureVoiceRecognition},
{"POST", "/v1/voice/forget", FeatureVoiceRecognition},
// Video
{"POST", "/video", FeatureVideo},
@@ -160,5 +168,6 @@ func APIFeatureMetas() []FeatureMeta {
{FeatureMCP, "MCP", true},
{FeatureStores, "Stores", true},
{FeatureFaceRecognition, "Face Recognition", true},
{FeatureVoiceRecognition, "Voice Recognition", true},
}
}

View File

@@ -52,6 +52,7 @@ const (
FeatureMCP = "mcp"
FeatureStores = "stores"
FeatureFaceRecognition = "face_recognition"
FeatureVoiceRecognition = "voice_recognition"
)
// AgentFeatures lists agent-related features (default OFF).
@@ -65,7 +66,7 @@ var APIFeatures = []string{
FeatureChat, FeatureImages, FeatureAudioSpeech, FeatureAudioTranscription,
FeatureVAD, FeatureDetection, FeatureVideo, FeatureEmbeddings, FeatureSound,
FeatureRealtime, FeatureRerank, FeatureTokenize, FeatureMCP, FeatureStores,
FeatureFaceRecognition,
FeatureFaceRecognition, FeatureVoiceRecognition,
}
// AllFeatures lists all known features (used by UI and validation).

View File

@@ -79,6 +79,12 @@ var instructionDefs = []instructionDef{
Tags: []string{"face-recognition"},
Intro: "The /v1/face/register, /identify, and /forget endpoints build on a vector store — registrations are in-memory by default and lost on restart. Use /v1/face/embed for a raw embedding; /v1/embeddings is OpenAI-compatible and text-only.",
},
{
Name: "voice-recognition",
Description: "Speaker verification (1:1), embedding, and demographic analysis from voice",
Tags: []string{"voice-recognition"},
Intro: "Voice (speaker) recognition — the audio analog to /v1/face/*. Use /v1/voice/verify for 1:1 speaker comparison, /v1/voice/identify for 1:N match against the registered store, /v1/voice/{register,forget} to manage that store, /v1/voice/embed for a raw speaker-encoder vector, and /v1/voice/analyze for age / gender / emotion inferred from speech. Registrations are in-memory by default and lost on restart. Audio inputs accept URL, base64, or data-URI; /v1/embeddings remains text-only.",
},
}
// swaggerState holds parsed swagger spec data, initialised once.

View File

@@ -0,0 +1,82 @@
package localai
import (
"encoding/base64"
"fmt"
"io"
"net/http"
"os"
"regexp"
"strings"
"time"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/pkg/utils"
)
var audioDataURIPattern = regexp.MustCompile(`^data:([^;]+);base64,`)
var audioDownloadClient = http.Client{Timeout: 30 * time.Second}
// decodeAudioInput materialises a URL / data-URI / raw-base64 audio
// payload to a temporary file and returns its path plus a cleanup
// function. Voice backends expect a filesystem path (same convention
// as TranscriptRequest.dst) — callers must defer the returned cleanup
// so the temp file does not leak.
//
// Bad inputs (invalid URL, undecodable base64, non-audio payload) are
// surfaced as 400 Bad Request rather than 500 so API consumers can
// distinguish a client mistake from a server failure.
func decodeAudioInput(s string) (string, func(), error) {
if s == "" {
return "", func() {}, echo.NewHTTPError(http.StatusBadRequest, "audio input is empty")
}
var raw []byte
switch {
case strings.HasPrefix(s, "http://") || strings.HasPrefix(s, "https://"):
if err := utils.ValidateExternalURL(s); err != nil {
return "", func() {}, echo.NewHTTPError(http.StatusBadRequest, fmt.Sprintf("invalid audio URL: %v", err))
}
resp, err := audioDownloadClient.Get(s)
if err != nil {
return "", func() {}, echo.NewHTTPError(http.StatusBadRequest, fmt.Sprintf("audio download failed: %v", err))
}
defer resp.Body.Close()
raw, err = io.ReadAll(resp.Body)
if err != nil {
return "", func() {}, echo.NewHTTPError(http.StatusBadRequest, fmt.Sprintf("audio download read failed: %v", err))
}
default:
payload := s
if m := audioDataURIPattern.FindString(s); m != "" {
payload = strings.Replace(s, m, "", 1)
}
decoded, err := base64.StdEncoding.DecodeString(payload)
if err != nil {
return "", func() {}, echo.NewHTTPError(http.StatusBadRequest, fmt.Sprintf("invalid audio base64: %v", err))
}
raw = decoded
}
if len(raw) == 0 {
return "", func() {}, echo.NewHTTPError(http.StatusBadRequest, "audio input decoded to zero bytes")
}
f, err := os.CreateTemp("", "localai-voice-*.wav")
if err != nil {
return "", func() {}, err
}
path := f.Name()
cleanup := func() { _ = os.Remove(path) }
if _, err := f.Write(raw); err != nil {
f.Close()
cleanup()
return "", func() {}, err
}
if err := f.Close(); err != nil {
cleanup()
return "", func() {}, err
}
return path, cleanup, nil
}

View File

@@ -0,0 +1,60 @@
package localai
import (
"net/http"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/middleware"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/xlog"
)
// VoiceAnalyzeEndpoint returns demographic attributes inferred from speech.
// @Summary Analyze demographic attributes (age, gender, emotion) from a voice clip.
// @Tags voice-recognition
// @Param request body schema.VoiceAnalyzeRequest true "query params"
// @Success 200 {object} schema.VoiceAnalyzeResponse "Response"
// @Router /v1/voice/analyze [post]
func VoiceAnalyzeEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) echo.HandlerFunc {
return func(c echo.Context) error {
input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.VoiceAnalyzeRequest)
if !ok || input.Model == "" {
return echo.ErrBadRequest
}
cfg, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
if !ok || cfg == nil {
return echo.ErrBadRequest
}
audio, cleanup, err := decodeAudioInput(input.Audio)
if err != nil {
return err
}
defer cleanup()
xlog.Debug("VoiceAnalyze", "model", cfg.Name, "backend", cfg.Backend, "actions", input.Actions)
res, err := backend.VoiceAnalyze(audio, input.Actions, ml, appConfig, *cfg)
if err != nil {
return mapBackendError(err)
}
response := schema.VoiceAnalyzeResponse{
Segments: make([]schema.VoiceAnalysis, len(res.GetSegments())),
}
for i, s := range res.GetSegments() {
response.Segments[i] = schema.VoiceAnalysis{
Start: s.GetStart(),
End: s.GetEnd(),
Age: s.GetAge(),
DominantGender: s.GetDominantGender(),
Gender: s.GetGender(),
DominantEmotion: s.GetDominantEmotion(),
Emotion: s.GetEmotion(),
}
}
return c.JSON(http.StatusOK, response)
}
}

View File

@@ -0,0 +1,54 @@
package localai
import (
"net/http"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/middleware"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/xlog"
)
// VoiceEmbedEndpoint extracts a speaker embedding vector from an audio clip.
//
// Distinct from /v1/embeddings, which is OpenAI-compatible and text-only
// by contract. Use this endpoint when you need a speaker-encoder output
// (typically 192-d for ECAPA-TDNN, 256-d for ResNet/WeSpeaker).
//
// @Summary Extract a speaker embedding from an audio clip.
// @Tags voice-recognition
// @Param request body schema.VoiceEmbedRequest true "query params"
// @Success 200 {object} schema.VoiceEmbedResponse "Response"
// @Router /v1/voice/embed [post]
func VoiceEmbedEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) echo.HandlerFunc {
return func(c echo.Context) error {
input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.VoiceEmbedRequest)
if !ok || input.Model == "" {
return echo.ErrBadRequest
}
cfg, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
if !ok || cfg == nil {
return echo.ErrBadRequest
}
audio, cleanup, err := decodeAudioInput(input.Audio)
if err != nil {
return err
}
defer cleanup()
xlog.Debug("VoiceEmbed", "model", cfg.Name, "backend", cfg.Backend)
res, err := backend.VoiceEmbed(audio, ml, appConfig, *cfg)
if err != nil {
return mapBackendError(err)
}
return c.JSON(http.StatusOK, schema.VoiceEmbedResponse{
Embedding: res.GetEmbedding(),
Dim: len(res.GetEmbedding()),
Model: res.GetModel(),
})
}
}

View File

@@ -0,0 +1,45 @@
package localai
import (
"errors"
"net/http"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/http/middleware"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/core/services/voicerecognition"
"github.com/mudler/xlog"
)
// VoiceForgetEndpoint removes a previously-registered speaker by ID.
// @Summary Remove a previously-registered speaker by ID.
// @Tags voice-recognition
// @Param request body schema.VoiceForgetRequest true "query params"
// @Success 204 "No Content"
// @Router /v1/voice/forget [post]
func VoiceForgetEndpoint(registry voicerecognition.Registry) echo.HandlerFunc {
return func(c echo.Context) error {
input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.VoiceForgetRequest)
if !ok {
// Forget doesn't load a model — fall back to a raw bind when
// the request extractor hasn't run (route registered without
// SetModelAndConfig).
input = new(schema.VoiceForgetRequest)
if err := c.Bind(input); err != nil {
return echo.ErrBadRequest
}
}
if input.ID == "" {
return echo.NewHTTPError(http.StatusBadRequest, "id is required")
}
xlog.Debug("VoiceForget", "id", input.ID)
if err := registry.Forget(c.Request().Context(), input.ID); err != nil {
if errors.Is(err, voicerecognition.ErrNotFound) {
return echo.NewHTTPError(http.StatusNotFound, err.Error())
}
return err
}
return c.NoContent(http.StatusNoContent)
}
}

View File

@@ -0,0 +1,82 @@
package localai
import (
"cmp"
"net/http"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/middleware"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/core/services/voicerecognition"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/xlog"
)
// defaultVoiceIdentifyThreshold is the cosine-distance cutoff applied
// when the client does not specify one. Tuned for ECAPA-TDNN on
// VoxCeleb (EER ~1.9%). Other recognizers (WeSpeaker, ERes2Net) may
// need overrides.
const defaultVoiceIdentifyThreshold = float32(0.25)
// VoiceIdentifyEndpoint runs 1:N identification against the registered store.
// @Summary Identify a speaker against the registered database (1:N recognition).
// @Tags voice-recognition
// @Param request body schema.VoiceIdentifyRequest true "query params"
// @Success 200 {object} schema.VoiceIdentifyResponse "Response"
// @Router /v1/voice/identify [post]
func VoiceIdentifyEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig, registry voicerecognition.Registry) echo.HandlerFunc {
return func(c echo.Context) error {
input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.VoiceIdentifyRequest)
if !ok || input.Model == "" {
return echo.ErrBadRequest
}
cfg, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
if !ok || cfg == nil {
return echo.ErrBadRequest
}
audio, cleanup, err := decodeAudioInput(input.Audio)
if err != nil {
return err
}
defer cleanup()
topK := cmp.Or(input.TopK, 5)
threshold := cmp.Or(input.Threshold, defaultVoiceIdentifyThreshold)
xlog.Debug("VoiceIdentify", "model", cfg.Name, "topK", topK, "threshold", threshold)
embed, err := backend.VoiceEmbed(audio, ml, appConfig, *cfg)
if err != nil {
return mapBackendError(err)
}
matches, err := registry.Identify(c.Request().Context(), embed.GetEmbedding(), topK)
if err != nil {
return err
}
response := schema.VoiceIdentifyResponse{
Matches: make([]schema.VoiceIdentifyMatch, len(matches)),
}
for i, m := range matches {
confidence := (1 - m.Distance/threshold) * 100
if confidence < 0 {
confidence = 0
}
if confidence > 100 {
confidence = 100
}
response.Matches[i] = schema.VoiceIdentifyMatch{
ID: m.ID,
Name: m.Metadata.Name,
Labels: m.Metadata.Labels,
Distance: m.Distance,
Confidence: confidence,
Match: m.Distance <= threshold,
}
}
return c.JSON(http.StatusOK, response)
}
}

View File

@@ -0,0 +1,61 @@
package localai
import (
"net/http"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/middleware"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/core/services/voicerecognition"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/xlog"
)
// VoiceRegisterEndpoint enrolls a speaker into the 1:N identification store.
// @Summary Register a speaker for 1:N identification.
// @Tags voice-recognition
// @Param request body schema.VoiceRegisterRequest true "query params"
// @Success 200 {object} schema.VoiceRegisterResponse "Response"
// @Router /v1/voice/register [post]
func VoiceRegisterEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig, registry voicerecognition.Registry) echo.HandlerFunc {
return func(c echo.Context) error {
input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.VoiceRegisterRequest)
if !ok || input.Model == "" {
return echo.ErrBadRequest
}
cfg, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
if !ok || cfg == nil {
return echo.ErrBadRequest
}
if input.Name == "" {
return echo.NewHTTPError(http.StatusBadRequest, "name is required")
}
audio, cleanup, err := decodeAudioInput(input.Audio)
if err != nil {
return err
}
defer cleanup()
xlog.Debug("VoiceRegister", "model", cfg.Name, "name", input.Name)
res, err := backend.VoiceEmbed(audio, ml, appConfig, *cfg)
if err != nil {
return mapBackendError(err)
}
stored, err := registry.Register(c.Request().Context(), res.GetEmbedding(), voicerecognition.Metadata{
Name: input.Name,
Labels: input.Labels,
})
if err != nil {
return err
}
return c.JSON(http.StatusOK, schema.VoiceRegisterResponse{
ID: stored.ID,
Name: stored.Name,
RegisteredAt: stored.RegisteredAt,
})
}
}

View File

@@ -0,0 +1,59 @@
package localai
import (
"net/http"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/middleware"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/xlog"
)
// VoiceVerifyEndpoint compares two audio clips and reports whether they were
// spoken by the same person.
// @Summary Verify that two audio clips were spoken by the same person.
// @Tags voice-recognition
// @Param request body schema.VoiceVerifyRequest true "query params"
// @Success 200 {object} schema.VoiceVerifyResponse "Response"
// @Router /v1/voice/verify [post]
func VoiceVerifyEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) echo.HandlerFunc {
return func(c echo.Context) error {
input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.VoiceVerifyRequest)
if !ok || input.Model == "" {
return echo.ErrBadRequest
}
cfg, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
if !ok || cfg == nil {
return echo.ErrBadRequest
}
audio1, cleanup1, err := decodeAudioInput(input.Audio1)
if err != nil {
return err
}
defer cleanup1()
audio2, cleanup2, err := decodeAudioInput(input.Audio2)
if err != nil {
return err
}
defer cleanup2()
xlog.Debug("VoiceVerify", "model", cfg.Name, "backend", cfg.Backend)
res, err := backend.VoiceVerify(audio1, audio2, input.Threshold, input.AntiSpoofing, ml, appConfig, *cfg)
if err != nil {
return mapBackendError(err)
}
return c.JSON(http.StatusOK, schema.VoiceVerifyResponse{
Verified: res.GetVerified(),
Distance: res.GetDistance(),
Threshold: res.GetThreshold(),
Confidence: res.GetConfidence(),
Model: res.GetModel(),
ProcessingTimeMs: res.GetProcessingTimeMs(),
})
}
}

View File

@@ -13,3 +13,4 @@ export const CAP_VAD = 'FLAG_VAD'
export const CAP_VIDEO = 'FLAG_VIDEO'
export const CAP_DETECTION = 'FLAG_DETECTION'
export const CAP_FACE_RECOGNITION = 'FLAG_FACE_RECOGNITION'
export const CAP_SPEAKER_RECOGNITION = 'FLAG_SPEAKER_RECOGNITION'

View File

@@ -120,6 +120,28 @@ func RegisterLocalAIRoutes(router *echo.Echo,
// Forget does not load a face model — it only needs the registry.
router.POST("/v1/face/forget", localai.FaceForgetEndpoint(app.FaceRegistry()))
// Voice (speaker) recognition endpoints
voiceMw := []echo.MiddlewareFunc{
requestExtractor.BuildFilteredFirstAvailableDefaultModel(config.BuildUsecaseFilterFn(config.FLAG_SPEAKER_RECOGNITION)),
}
router.POST("/v1/voice/verify",
localai.VoiceVerifyEndpoint(cl, ml, appConfig),
append(voiceMw, requestExtractor.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.VoiceVerifyRequest) }))...)
router.POST("/v1/voice/analyze",
localai.VoiceAnalyzeEndpoint(cl, ml, appConfig),
append(voiceMw, requestExtractor.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.VoiceAnalyzeRequest) }))...)
router.POST("/v1/voice/embed",
localai.VoiceEmbedEndpoint(cl, ml, appConfig),
append(voiceMw, requestExtractor.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.VoiceEmbedRequest) }))...)
router.POST("/v1/voice/register",
localai.VoiceRegisterEndpoint(cl, ml, appConfig, app.VoiceRegistry()),
append(voiceMw, requestExtractor.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.VoiceRegisterRequest) }))...)
router.POST("/v1/voice/identify",
localai.VoiceIdentifyEndpoint(cl, ml, appConfig, app.VoiceRegistry()),
append(voiceMw, requestExtractor.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.VoiceIdentifyRequest) }))...)
// Forget does not load a voice model — it only needs the registry.
router.POST("/v1/voice/forget", localai.VoiceForgetEndpoint(app.VoiceRegistry()))
ttsHandler := localai.TTSEndpoint(cl, ml, appConfig)
router.POST("/tts",
ttsHandler,

View File

@@ -290,6 +290,110 @@ type FaceForgetRequest struct {
Store string `json:"store,omitempty"`
}
// ─── Voice (speaker) recognition ───────────────────────────────────
//
// VoiceVerifyRequest compares two audio clips and reports whether they
// were spoken by the same speaker. Audio1/Audio2 accept URL, base64,
// or data-URI (the HTTP layer materialises the bytes to a temp file
// before calling the gRPC backend).
type VoiceVerifyRequest struct {
BasicModelRequest
Audio1 string `json:"audio1"`
Audio2 string `json:"audio2"`
Threshold float32 `json:"threshold,omitempty"`
AntiSpoofing bool `json:"anti_spoofing,omitempty"`
}
type VoiceVerifyResponse struct {
Verified bool `json:"verified"`
Distance float32 `json:"distance"`
Threshold float32 `json:"threshold"`
Confidence float32 `json:"confidence"`
Model string `json:"model"`
ProcessingTimeMs float32 `json:"processing_time_ms,omitempty"`
}
// VoiceAnalyzeRequest asks the backend for demographic attributes
// (age, gender, emotion) inferred from the audio clip.
type VoiceAnalyzeRequest struct {
BasicModelRequest
Audio string `json:"audio"`
Actions []string `json:"actions,omitempty"` // subset of {"age","gender","emotion"}
}
type VoiceAnalyzeResponse struct {
Segments []VoiceAnalysis `json:"segments"`
}
type VoiceAnalysis struct {
Start float32 `json:"start"`
End float32 `json:"end"`
Age float32 `json:"age,omitempty"`
DominantGender string `json:"dominant_gender,omitempty"`
Gender map[string]float32 `json:"gender,omitempty"`
DominantEmotion string `json:"dominant_emotion,omitempty"`
Emotion map[string]float32 `json:"emotion,omitempty"`
}
// VoiceEmbedRequest extracts a speaker embedding from an audio clip.
// Distinct from /v1/embeddings (OpenAI-compatible, text-only) — this
// endpoint accepts URL / base64 / data-URI audio inputs.
type VoiceEmbedRequest struct {
BasicModelRequest
Audio string `json:"audio"`
}
type VoiceEmbedResponse struct {
Embedding []float32 `json:"embedding"`
Dim int `json:"dim"`
Model string `json:"model,omitempty"`
}
// VoiceRegisterRequest enrolls a speaker into the 1:N identification store.
type VoiceRegisterRequest struct {
BasicModelRequest
Audio string `json:"audio"`
Name string `json:"name"`
Labels map[string]string `json:"labels,omitempty"`
Store string `json:"store,omitempty"`
}
type VoiceRegisterResponse struct {
ID string `json:"id"`
Name string `json:"name"`
RegisteredAt time.Time `json:"registered_at"`
}
// VoiceIdentifyRequest runs 1:N recognition: embed the probe and
// return the top-K nearest registered speakers.
type VoiceIdentifyRequest struct {
BasicModelRequest
Audio string `json:"audio"`
TopK int `json:"top_k,omitempty"`
Threshold float32 `json:"threshold,omitempty"`
Store string `json:"store,omitempty"`
}
type VoiceIdentifyResponse struct {
Matches []VoiceIdentifyMatch `json:"matches"`
}
type VoiceIdentifyMatch struct {
ID string `json:"id"`
Name string `json:"name"`
Labels map[string]string `json:"labels,omitempty"`
Distance float32 `json:"distance"`
Confidence float32 `json:"confidence"`
Match bool `json:"match"`
}
// VoiceForgetRequest removes a previously-registered speaker by ID.
type VoiceForgetRequest struct {
BasicModelRequest
ID string `json:"id"`
Store string `json:"store,omitempty"`
}
type ImportModelRequest struct {
URI string `json:"uri"`
Preferences json.RawMessage `json:"preferences,omitempty"`

View File

@@ -174,6 +174,15 @@ func (c *fakeBackendClient) FaceVerify(_ context.Context, _ *pb.FaceVerifyReques
func (c *fakeBackendClient) FaceAnalyze(_ context.Context, _ *pb.FaceAnalyzeRequest, _ ...ggrpc.CallOption) (*pb.FaceAnalyzeResponse, error) {
return nil, nil
}
func (c *fakeBackendClient) VoiceVerify(_ context.Context, _ *pb.VoiceVerifyRequest, _ ...ggrpc.CallOption) (*pb.VoiceVerifyResponse, error) {
return nil, nil
}
func (c *fakeBackendClient) VoiceAnalyze(_ context.Context, _ *pb.VoiceAnalyzeRequest, _ ...ggrpc.CallOption) (*pb.VoiceAnalyzeResponse, error) {
return nil, nil
}
func (c *fakeBackendClient) VoiceEmbed(_ context.Context, _ *pb.VoiceEmbedRequest, _ ...ggrpc.CallOption) (*pb.VoiceEmbedResponse, error) {
return nil, nil
}
func (c *fakeBackendClient) AudioTranscription(_ context.Context, _ *pb.TranscriptRequest, _ ...ggrpc.CallOption) (*pb.TranscriptResult, error) {
return nil, nil
}

View File

@@ -99,6 +99,18 @@ func (f *fakeGRPCBackend) FaceAnalyze(_ context.Context, _ *pb.FaceAnalyzeReques
return &pb.FaceAnalyzeResponse{}, nil
}
func (f *fakeGRPCBackend) VoiceVerify(_ context.Context, _ *pb.VoiceVerifyRequest, _ ...ggrpc.CallOption) (*pb.VoiceVerifyResponse, error) {
return &pb.VoiceVerifyResponse{}, nil
}
func (f *fakeGRPCBackend) VoiceAnalyze(_ context.Context, _ *pb.VoiceAnalyzeRequest, _ ...ggrpc.CallOption) (*pb.VoiceAnalyzeResponse, error) {
return &pb.VoiceAnalyzeResponse{}, nil
}
func (f *fakeGRPCBackend) VoiceEmbed(_ context.Context, _ *pb.VoiceEmbedRequest, _ ...ggrpc.CallOption) (*pb.VoiceEmbedResponse, error) {
return &pb.VoiceEmbedResponse{}, nil
}
func (f *fakeGRPCBackend) AudioTranscription(_ context.Context, _ *pb.TranscriptRequest, _ ...ggrpc.CallOption) (*pb.TranscriptResult, error) {
return &pb.TranscriptResult{}, nil
}

View File

@@ -0,0 +1,58 @@
// Package voicerecognition provides a swappable backing store for
// speaker embeddings and the 1:N identification pipeline on top of it.
//
// Mirrors the facerecognition package — the audio analog. The current
// implementation (NewStoreRegistry) is backed by LocalAI's in-memory
// local-store gRPC backend, so all registrations are lost on restart.
//
// TODO: share a persistent pgvector-backed implementation with
// facerecognition once the first one lands. The Registry interface
// here is intentionally identical in shape, so a shared generic
// biometric registry can replace both without HTTP-handler churn.
package voicerecognition
import (
"context"
"errors"
"time"
)
// Registry stores speaker embeddings keyed by an opaque ID and
// supports approximate similarity search. Implementations are expected
// to be safe for concurrent use.
type Registry interface {
// Register stores a speaker embedding alongside its metadata.
// Returns the stored metadata with ID and RegisteredAt populated.
Register(ctx context.Context, embedding []float32, meta Metadata) (Metadata, error)
// Identify returns up to topK matches for the probe embedding,
// sorted by ascending distance (closest first).
Identify(ctx context.Context, probe []float32, topK int) ([]Match, error)
// Forget removes a previously-registered embedding by ID.
// Returns ErrNotFound if the ID is unknown.
Forget(ctx context.Context, id string) error
}
// Metadata is the user-supplied payload stored alongside a speaker embedding.
type Metadata struct {
// ID is populated by the registry at Register time; callers must not set it.
ID string `json:"id"`
Name string `json:"name"`
Labels map[string]string `json:"labels,omitempty"`
RegisteredAt time.Time `json:"registered_at"`
}
// Match is a single result from Identify, ranked by similarity.
type Match struct {
ID string
Metadata Metadata
Distance float32 // 1 - cosine_similarity; lower = closer
}
// Sentinel errors; callers should compare with errors.Is.
var (
ErrNotFound = errors.New("voicerecognition: id not found")
ErrEmptyEmbedding = errors.New("voicerecognition: embedding is empty")
ErrDimensionMismatch = errors.New("voicerecognition: embedding dimension mismatch")
)

View File

@@ -0,0 +1,138 @@
package voicerecognition
import (
"context"
"encoding/json"
"fmt"
"sort"
"sync"
"time"
"github.com/google/uuid"
"github.com/mudler/LocalAI/pkg/grpc"
"github.com/mudler/LocalAI/pkg/store"
)
// StoreResolver resolves a named vector store to a gRPC backend. The
// HTTP handler layer wires this to backend.StoreBackend so the
// registry stays decoupled from ModelLoader plumbing.
type StoreResolver func(ctx context.Context, storeName string) (grpc.Backend, error)
// NewStoreRegistry returns a Registry backed by LocalAI's generic
// StoresSet / StoresFind / StoresDelete gRPC surface.
//
// storeName selects which vector-store model to use (defaults to the
// local-store Go backend). `dim` is the expected embedding dimension;
// pass 0 to accept whatever dimension arrives (useful when the voice
// backend exposes recognizers of different sizes, e.g. ECAPA-TDNN at
// 192 vs ResNet at 256).
func NewStoreRegistry(resolve StoreResolver, storeName string, dim int) Registry {
return &storeRegistry{
resolve: resolve,
storeName: storeName,
dim: dim,
}
}
type storeRegistry struct {
resolve StoreResolver
storeName string
dim int
// TODO(postgres): the local-store gRPC surface keys by embedding
// vector and exposes no "list all" method, so we cannot delete by
// ID without remembering the embedding. This in-memory index is
// rebuilt on every Register and lost on restart — acceptable while
// the only implementation is itself in-memory.
idIndex sync.Map // map[string][]float32
}
func (r *storeRegistry) Register(ctx context.Context, embedding []float32, meta Metadata) (Metadata, error) {
if len(embedding) == 0 {
return Metadata{}, ErrEmptyEmbedding
}
if r.dim != 0 && len(embedding) != r.dim {
return Metadata{}, fmt.Errorf("%w: expected %d, got %d", ErrDimensionMismatch, r.dim, len(embedding))
}
backend, err := r.resolve(ctx, r.storeName)
if err != nil {
return Metadata{}, fmt.Errorf("voicerecognition: resolve store: %w", err)
}
meta.ID = uuid.NewString()
if meta.RegisteredAt.IsZero() {
meta.RegisteredAt = time.Now().UTC()
}
payload, err := json.Marshal(meta)
if err != nil {
return Metadata{}, fmt.Errorf("voicerecognition: marshal metadata: %w", err)
}
if err := store.SetSingle(ctx, backend, embedding, payload); err != nil {
return Metadata{}, fmt.Errorf("voicerecognition: set: %w", err)
}
embCopy := append([]float32(nil), embedding...)
r.idIndex.Store(meta.ID, embCopy)
return meta, nil
}
func (r *storeRegistry) Identify(ctx context.Context, probe []float32, topK int) ([]Match, error) {
if len(probe) == 0 {
return nil, ErrEmptyEmbedding
}
if r.dim != 0 && len(probe) != r.dim {
return nil, fmt.Errorf("%w: expected %d, got %d", ErrDimensionMismatch, r.dim, len(probe))
}
if topK <= 0 {
topK = 5
}
backend, err := r.resolve(ctx, r.storeName)
if err != nil {
return nil, fmt.Errorf("voicerecognition: resolve store: %w", err)
}
_, values, similarities, err := store.Find(ctx, backend, probe, topK)
if err != nil {
return nil, fmt.Errorf("voicerecognition: find: %w", err)
}
matches := make([]Match, 0, len(values))
for i, raw := range values {
var meta Metadata
if err := json.Unmarshal(raw, &meta); err != nil {
// Shared stores may contain non-voice records; skip them.
continue
}
matches = append(matches, Match{
ID: meta.ID,
Metadata: meta,
Distance: 1 - similarities[i],
})
}
sort.SliceStable(matches, func(i, j int) bool { return matches[i].Distance < matches[j].Distance })
return matches, nil
}
func (r *storeRegistry) Forget(ctx context.Context, id string) error {
raw, ok := r.idIndex.Load(id)
if !ok {
return ErrNotFound
}
embedding := raw.([]float32)
backend, err := r.resolve(ctx, r.storeName)
if err != nil {
return fmt.Errorf("voicerecognition: resolve store: %w", err)
}
if err := store.DeleteSingle(ctx, backend, embedding); err != nil {
return fmt.Errorf("voicerecognition: delete: %w", err)
}
r.idIndex.Delete(id)
return nil
}

View File

@@ -26,6 +26,9 @@ const (
BackendTraceDetection BackendTraceType = "detection"
BackendTraceFaceVerify BackendTraceType = "face_verify"
BackendTraceFaceAnalyze BackendTraceType = "face_analyze"
BackendTraceVoiceVerify BackendTraceType = "voice_verify"
BackendTraceVoiceAnalyze BackendTraceType = "voice_analyze"
BackendTraceVoiceEmbed BackendTraceType = "voice_embed"
BackendTraceModelLoad BackendTraceType = "model_load"
)

View File

@@ -0,0 +1,247 @@
+++
disableToc = false
title = "Voice Recognition"
weight = 15
url = "/features/voice-recognition/"
+++
LocalAI supports voice (speaker) recognition through the
`speaker-recognition` backend: speaker verification (1:1), speaker
identification (1:N) against a built-in vector store, speaker
embedding, and demographic analysis (age / gender / emotion from
voice).
It is the audio analog to [Face Recognition](/features/face-recognition/)
and follows the same two-engine pattern under one image.
## Engines
| Gallery entry | Model | Size | License |
|---|---|---|---|
| `speechbrain-ecapa-tdnn` | ECAPA-TDNN on VoxCeleb (SpeechBrain) | ~17 MB | **Apache 2.0 — commercial-safe** |
| `wespeaker-resnet34` | WeSpeaker ResNet34 ONNX | ~26 MB | **Apache 2.0 — commercial-safe** |
Both entries are commercial-safe Apache 2.0. SpeechBrain is the
default — it's a lightweight pure-PyTorch checkpoint that
auto-downloads on first use. The `wespeaker-resnet34` entry wires
the direct-ONNX path for CPU-only deployments that don't want the
torch runtime.
## Quickstart
Install the default backend and model:
```bash
local-ai models install speechbrain-ecapa-tdnn
```
Verify that two audio clips were spoken by the same person:
```bash
curl -sX POST http://localhost:8080/v1/voice/verify \
-H "Content-Type: application/json" \
-d '{
"model": "speechbrain-ecapa-tdnn",
"audio1": "https://example.com/alice_1.wav",
"audio2": "https://example.com/alice_2.wav"
}'
```
Response:
```json
{
"verified": true,
"distance": 0.18,
"threshold": 0.25,
"confidence": 28.0,
"model": "speechbrain-ecapa-tdnn",
"processing_time_ms": 340.0
}
```
## 1:N identification workflow (register → identify → forget)
Same flow as face recognition, same in-memory vector store under the
hood.
1. Register known speakers:
```bash
curl -sX POST http://localhost:8080/v1/voice/register \
-H "Content-Type: application/json" \
-d '{
"model": "speechbrain-ecapa-tdnn",
"name": "Alice",
"audio": "https://example.com/alice.wav"
}'
# → {"id": "b2f...", "name": "Alice", "registered_at": "2026-04-22T..."}
```
2. Identify an unknown probe:
```bash
curl -sX POST http://localhost:8080/v1/voice/identify \
-H "Content-Type: application/json" \
-d '{
"model": "speechbrain-ecapa-tdnn",
"audio": "https://example.com/unknown.wav",
"top_k": 5
}'
# → {"matches": [{"id":"b2f...","name":"Alice","distance":0.19,"match":true,...}]}
```
3. Remove a speaker by ID:
```bash
curl -sX POST http://localhost:8080/v1/voice/forget \
-H "Content-Type: application/json" \
-d '{"id": "b2f..."}'
# → 204 No Content
```
{{% alert icon="⚠️" color="warning" %}}
**Storage caveat.** The default vector store is in-memory. All
registered speakers are lost when LocalAI restarts. Persistent storage
(pgvector) is a tracked future enhancement shared with face
recognition — the voice-recognition HTTP API is designed to swap the
backing store without changing the wire format.
{{% /alert %}}
## API reference
### `POST /v1/voice/verify` (1:1)
| field | type | description |
|---|---|---|
| `model` | string | gallery entry name (e.g. `speechbrain-ecapa-tdnn`) |
| `audio1`, `audio2` | string | URL, base64, or data-URI of an audio file |
| `threshold` | float, optional | cosine-distance cutoff; default 0.25 for ECAPA-TDNN |
| `anti_spoofing` | bool, optional | reserved — unused in the current release |
Returns `verified`, `distance`, `threshold`, `confidence`, `model`,
and `processing_time_ms`.
### `POST /v1/voice/analyze`
Returns demographic attributes (age, gender, emotion) inferred from
speech:
| field | type | description |
|---|---|---|
| `model` | string | gallery entry |
| `audio` | string | URL / base64 / data-URI |
| `actions` | string[] | subset of `["age","gender","emotion"]`; empty = all supported |
Emotion is inferred from the SUPERB emotion-recognition checkpoint
(`superb/wav2vec2-base-superb-er`, Apache 2.0) — 4-way categorical
neutral / happy / angry / sad. The model auto-downloads on the first
analyze call.
Age and gender are **opt-in**: no standard-transformers checkpoint
with a clean classifier head is shipped as the default. The
high-accuracy Audeering age/gender model uses a custom multi-task
head that `AutoModelForAudioClassification` doesn't load safely
(the age weights are silently dropped and the classifier is
re-initialised with random values). To enable age/gender, set
`age_gender_model:<repo>` in the model YAML's `options:` pointing at
a checkpoint with a vanilla `Wav2Vec2ForSequenceClassification`
head. Override the emotion default similarly via `emotion_model:`.
Set either to an empty string to disable that head.
If a head fails to load (offline, disk full, `transformers`
missing), the engine degrades gracefully: it still returns the
attributes it could compute. When nothing can be computed, the
backend returns `501 Not Implemented`.
Analyze is supported by both `speechbrain-ecapa-tdnn` and
`wespeaker-resnet34` — the speaker recognizer and the analysis head
are independent.
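A minimal analyze call might look like this (the audio URL is a
placeholder; with the stock configuration only the emotion head is
active, so `actions` is restricted accordingly):

```bash
curl -sX POST http://localhost:8080/v1/voice/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "audio": "https://example.com/alice.wav",
    "actions": ["emotion"]
  }'
# → {"segments": [{"start": 0, "end": ..., "dominant_emotion": "...", "emotion": {...}}]}
```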
### `POST /v1/voice/register` (1:N enrollment)
| field | type | description |
|---|---|---|
| `model` | string | voice recognition model |
| `audio` | string | speaker audio to enroll |
| `name` | string | human-readable label |
| `labels` | map[string]string, optional | arbitrary metadata |
| `store` | string, optional | vector store model; defaults to local-store |
Returns `{id, name, registered_at}`. The `id` is an opaque UUID used
by `/v1/voice/identify` and `/v1/voice/forget`.
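Registration accepts arbitrary `labels`, which are stored alongside
the embedding and echoed back by `/v1/voice/identify`. A sketch
(placeholder URL and label values):

```bash
curl -sX POST http://localhost:8080/v1/voice/register \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "name": "Alice",
    "labels": {"team": "support", "locale": "en-GB"},
    "audio": "https://example.com/alice.wav"
  }'
```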
### `POST /v1/voice/identify` (1:N recognition)
| field | type | description |
|---|---|---|
| `model` | string | voice recognition model |
| `audio` | string | probe audio |
| `top_k` | int, optional | max matches to return; default 5 |
| `threshold` | float, optional | cosine-distance cutoff; default 0.25 |
| `store` | string, optional | vector store model |
Returns a list of matches sorted by ascending distance, each with
`id`, `name`, `labels`, `distance`, `confidence`, and `match`
(`distance ≤ threshold`).
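Confidence is derived from the match distance relative to the active
threshold and clamped to the 0 to 100 range, so overriding
`threshold` also rescales the reported confidences. A call setting
both knobs (placeholder URL):

```bash
curl -sX POST http://localhost:8080/v1/voice/identify \
  -H "Content-Type: application/json" \
  -d '{
    "model": "wespeaker-resnet34",
    "audio": "https://example.com/unknown.wav",
    "top_k": 3,
    "threshold": 0.30
  }'
```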
### `POST /v1/voice/forget`
| field | type | description |
|---|---|---|
| `id` | string | ID returned by `/v1/voice/register` |
Returns `204 No Content` on success, `404 Not Found` if the ID is
unknown.
### `POST /v1/voice/embed`
Returns the L2-normalized speaker embedding vector.
| field | type | description |
|---|---|---|
| `model` | string | voice model |
| `audio` | string | URL / base64 / data-URI |
Returns `{embedding: float[], dim: int, model: string}`. Dimension
depends on the recognizer: 192 for ECAPA-TDNN, 256 for WeSpeaker
ResNet34.
> **Note:** the OpenAI-compatible `/v1/embeddings` endpoint is
> intentionally text-only — it does nothing useful with audio input.
> Use `/v1/voice/embed` for audio.
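For example (placeholder URL; the embedding values depend on the
clip):

```bash
curl -sX POST http://localhost:8080/v1/voice/embed \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speechbrain-ecapa-tdnn",
    "audio": "https://example.com/alice.wav"
  }'
# → {"embedding": [ ... ], "dim": 192, "model": "..."}
```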
## Audio input
Audio is materialised by the HTTP layer to a temporary WAV file
before the gRPC call. All audio fields accept:
- `http://` / `https://` URLs (downloaded server-side, subject to
`ValidateExternalURL` safety checks).
- Raw base64 (no prefix).
- Data URIs (`data:audio/wav;base64,...`).
The backend itself always receives a filesystem path — the same
convention the Whisper / Voxtral transcription backends use.
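A sketch of sending a local file as raw base64 or as a data URI (the
filename is only an example; `tr -d '\n'` keeps the payload on one
line):

```bash
AUDIO_B64=$(base64 < sample.wav | tr -d '\n')

# Raw base64 (no prefix)
curl -sX POST http://localhost:8080/v1/voice/embed \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"speechbrain-ecapa-tdnn\", \"audio\": \"${AUDIO_B64}\"}"

# Data URI
curl -sX POST http://localhost:8080/v1/voice/embed \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"speechbrain-ecapa-tdnn\", \"audio\": \"data:audio/wav;base64,${AUDIO_B64}\"}"
```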
## Threshold reference
| Recognizer | Cosine-distance threshold |
|---|---|
| ECAPA-TDNN (SpeechBrain, VoxCeleb) | ~0.25 |
| WeSpeaker ResNet34 | ~0.30 |
| 3D-Speaker ERes2Net | ~0.28 |
Pass `threshold` explicitly when switching recognizers: the
server-side default (0.25, tuned for ECAPA-TDNN) only applies when
the field is omitted, as in the example below.
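For instance, a verify call against the ONNX recognizer with the
suggested cutoff (URLs are placeholders):

```bash
curl -sX POST http://localhost:8080/v1/voice/verify \
  -H "Content-Type: application/json" \
  -d '{
    "model": "wespeaker-resnet34",
    "audio1": "https://example.com/a.wav",
    "audio2": "https://example.com/b.wav",
    "threshold": 0.30
  }'
```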
## Related features
- [Face Recognition](/features/face-recognition/) — the image analog;
the two share a registry design.
- [Audio to Text](/features/audio-to-text/) — transcription (Whisper,
Voxtral, faster-whisper). Runs in addition to, not instead of,
voice recognition.
- [Stores](/features/stores/) — the generic vector store powering
both the face and voice 1:N recognition pipelines.
- [Embeddings](/features/embeddings/) — text-only OpenAI-compatible
embedding endpoint; for audio embeddings use `/v1/voice/embed`.

View File

@@ -3993,6 +3993,57 @@
- filename: face_recognition_sface_2021dec_int8.onnx
sha256: 2b0e941e6f16cc048c20aee0c8e31f569118f65d702914540f7bfdc14048d78a
uri: https://github.com/opencv/opencv_zoo/raw/main/models/face_recognition_sface/face_recognition_sface_2021dec_int8.onnx
- &speechbrain_ecapa_tdnn
name: "speechbrain-ecapa-tdnn"
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
license: apache-2.0
description: |
Speaker (voice) recognition with SpeechBrain's ECAPA-TDNN trained
on VoxCeleb. 192-d L2-normalised embeddings, ~1.9% Equal Error
Rate on VoxCeleb1-O. Apache 2.0 — commercial-safe.
The checkpoint is auto-downloaded from HuggingFace on first
LoadModel (no separate weight file in gallery `files:`). Points at
the upstream SpeechBrain HF repo directly — same bytes every
deployment.
tags: [voice-recognition, speaker-verification, speaker-embedding, commercial-ok, cpu, gpu]
urls:
- https://speechbrain.github.io/
- https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb
overrides:
backend: speaker-recognition
parameters: {model: speechbrain/spkrec-ecapa-voxceleb}
options:
- "engine:speechbrain"
- "source:speechbrain/spkrec-ecapa-voxceleb"
known_usecases: [speaker_recognition]
- &wespeaker_resnet34
name: "wespeaker-resnet34"
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
license: apache-2.0
description: |
Speaker recognition with WeSpeaker's ResNet34 trained on VoxCeleb,
exported to ONNX. 256-d embeddings, CPU-friendly — avoids the
PyTorch runtime entirely (onnxruntime only). Apache 2.0.
Pair with the `speaker-recognition` backend's OnnxDirectEngine.
Use when ECAPA-TDNN's torch dependency is undesirable (small
images, edge deployments).
tags: [voice-recognition, speaker-verification, speaker-embedding, commercial-ok, edge, cpu]
urls:
- https://github.com/wenet-e2e/wespeaker
overrides:
backend: speaker-recognition
parameters: {model: wespeaker_voxceleb_resnet34.onnx}
options:
- "engine:onnx"
- "model_path:wespeaker_voxceleb_resnet34.onnx"
- "sample_rate:16000"
known_usecases: [speaker_recognition]
files:
- filename: wespeaker_voxceleb_resnet34.onnx
sha256: 7bb2f06e9df17cdf1ef14ee8a15ab08ed28e8d0ef5054ee135741560df2ec068
uri: https://huggingface.co/Wespeaker/wespeaker-voxceleb-resnet34-LM/resolve/main/voxceleb_resnet34_LM.onnx
- &rfdetr
name: "rfdetr-base"
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"

View File

@@ -56,6 +56,9 @@ type Backend interface {
Detect(ctx context.Context, in *pb.DetectOptions, opts ...grpc.CallOption) (*pb.DetectResponse, error)
FaceVerify(ctx context.Context, in *pb.FaceVerifyRequest, opts ...grpc.CallOption) (*pb.FaceVerifyResponse, error)
FaceAnalyze(ctx context.Context, in *pb.FaceAnalyzeRequest, opts ...grpc.CallOption) (*pb.FaceAnalyzeResponse, error)
VoiceVerify(ctx context.Context, in *pb.VoiceVerifyRequest, opts ...grpc.CallOption) (*pb.VoiceVerifyResponse, error)
VoiceAnalyze(ctx context.Context, in *pb.VoiceAnalyzeRequest, opts ...grpc.CallOption) (*pb.VoiceAnalyzeResponse, error)
VoiceEmbed(ctx context.Context, in *pb.VoiceEmbedRequest, opts ...grpc.CallOption) (*pb.VoiceEmbedResponse, error)
AudioTranscription(ctx context.Context, in *pb.TranscriptRequest, opts ...grpc.CallOption) (*pb.TranscriptResult, error)
AudioTranscriptionStream(ctx context.Context, in *pb.TranscriptRequest, f func(chunk *pb.TranscriptStreamResponse), opts ...grpc.CallOption) error
TokenizeString(ctx context.Context, in *pb.PredictOptions, opts ...grpc.CallOption) (*pb.TokenizationResponse, error)

View File

@@ -89,6 +89,18 @@ func (llm *Base) FaceAnalyze(*pb.FaceAnalyzeRequest) (pb.FaceAnalyzeResponse, er
return pb.FaceAnalyzeResponse{}, fmt.Errorf("unimplemented")
}
func (llm *Base) VoiceVerify(*pb.VoiceVerifyRequest) (pb.VoiceVerifyResponse, error) {
return pb.VoiceVerifyResponse{}, fmt.Errorf("unimplemented")
}
func (llm *Base) VoiceAnalyze(*pb.VoiceAnalyzeRequest) (pb.VoiceAnalyzeResponse, error) {
return pb.VoiceAnalyzeResponse{}, fmt.Errorf("unimplemented")
}
func (llm *Base) VoiceEmbed(*pb.VoiceEmbedRequest) (pb.VoiceEmbedResponse, error) {
return pb.VoiceEmbedResponse{}, fmt.Errorf("unimplemented")
}
func (llm *Base) TokenizeString(opts *pb.PredictOptions) (pb.TokenizationResponse, error) {
return pb.TokenizationResponse{}, fmt.Errorf("unimplemented")
}

View File

@@ -616,6 +616,60 @@ func (c *Client) FaceAnalyze(ctx context.Context, in *pb.FaceAnalyzeRequest, opt
return client.FaceAnalyze(ctx, in, opts...)
}
func (c *Client) VoiceVerify(ctx context.Context, in *pb.VoiceVerifyRequest, opts ...grpc.CallOption) (*pb.VoiceVerifyResponse, error) {
if !c.parallel {
c.opMutex.Lock()
defer c.opMutex.Unlock()
}
c.setBusy(true)
defer c.setBusy(false)
c.wdMark()
defer c.wdUnMark()
conn, err := c.dial()
if err != nil {
return nil, err
}
defer conn.Close()
client := pb.NewBackendClient(conn)
return client.VoiceVerify(ctx, in, opts...)
}
func (c *Client) VoiceAnalyze(ctx context.Context, in *pb.VoiceAnalyzeRequest, opts ...grpc.CallOption) (*pb.VoiceAnalyzeResponse, error) {
if !c.parallel {
c.opMutex.Lock()
defer c.opMutex.Unlock()
}
c.setBusy(true)
defer c.setBusy(false)
c.wdMark()
defer c.wdUnMark()
conn, err := c.dial()
if err != nil {
return nil, err
}
defer conn.Close()
client := pb.NewBackendClient(conn)
return client.VoiceAnalyze(ctx, in, opts...)
}
func (c *Client) VoiceEmbed(ctx context.Context, in *pb.VoiceEmbedRequest, opts ...grpc.CallOption) (*pb.VoiceEmbedResponse, error) {
if !c.parallel {
c.opMutex.Lock()
defer c.opMutex.Unlock()
}
c.setBusy(true)
defer c.setBusy(false)
c.wdMark()
defer c.wdUnMark()
conn, err := c.dial()
if err != nil {
return nil, err
}
defer conn.Close()
client := pb.NewBackendClient(conn)
return client.VoiceEmbed(ctx, in, opts...)
}
func (c *Client) AudioEncode(ctx context.Context, in *pb.AudioEncodeRequest, opts ...grpc.CallOption) (*pb.AudioEncodeResult, error) {
if !c.parallel {
c.opMutex.Lock()

View File

@@ -79,6 +79,18 @@ func (e *embedBackend) FaceAnalyze(ctx context.Context, in *pb.FaceAnalyzeReques
return e.s.FaceAnalyze(ctx, in)
}
func (e *embedBackend) VoiceVerify(ctx context.Context, in *pb.VoiceVerifyRequest, opts ...grpc.CallOption) (*pb.VoiceVerifyResponse, error) {
return e.s.VoiceVerify(ctx, in)
}
func (e *embedBackend) VoiceAnalyze(ctx context.Context, in *pb.VoiceAnalyzeRequest, opts ...grpc.CallOption) (*pb.VoiceAnalyzeResponse, error) {
return e.s.VoiceAnalyze(ctx, in)
}
func (e *embedBackend) VoiceEmbed(ctx context.Context, in *pb.VoiceEmbedRequest, opts ...grpc.CallOption) (*pb.VoiceEmbedResponse, error) {
return e.s.VoiceEmbed(ctx, in)
}
func (e *embedBackend) AudioTranscription(ctx context.Context, in *pb.TranscriptRequest, opts ...grpc.CallOption) (*pb.TranscriptResult, error) {
return e.s.AudioTranscription(ctx, in)
}

View File

@@ -19,6 +19,9 @@ type AIModel interface {
Detect(*pb.DetectOptions) (pb.DetectResponse, error)
FaceVerify(*pb.FaceVerifyRequest) (pb.FaceVerifyResponse, error)
FaceAnalyze(*pb.FaceAnalyzeRequest) (pb.FaceAnalyzeResponse, error)
VoiceVerify(*pb.VoiceVerifyRequest) (pb.VoiceVerifyResponse, error)
VoiceAnalyze(*pb.VoiceAnalyzeRequest) (pb.VoiceAnalyzeResponse, error)
VoiceEmbed(*pb.VoiceEmbedRequest) (pb.VoiceEmbedResponse, error)
AudioTranscription(*pb.TranscriptRequest) (pb.TranscriptResult, error)
AudioTranscriptionStream(*pb.TranscriptRequest, chan *pb.TranscriptStreamResponse) error
TTS(*pb.TTSRequest) error

View File

@@ -175,6 +175,42 @@ func (s *server) FaceAnalyze(ctx context.Context, in *pb.FaceAnalyzeRequest) (*p
return &res, nil
}
func (s *server) VoiceVerify(ctx context.Context, in *pb.VoiceVerifyRequest) (*pb.VoiceVerifyResponse, error) {
if s.llm.Locking() {
s.llm.Lock()
defer s.llm.Unlock()
}
res, err := s.llm.VoiceVerify(in)
if err != nil {
return nil, err
}
return &res, nil
}
func (s *server) VoiceAnalyze(ctx context.Context, in *pb.VoiceAnalyzeRequest) (*pb.VoiceAnalyzeResponse, error) {
if s.llm.Locking() {
s.llm.Lock()
defer s.llm.Unlock()
}
res, err := s.llm.VoiceAnalyze(in)
if err != nil {
return nil, err
}
return &res, nil
}
func (s *server) VoiceEmbed(ctx context.Context, in *pb.VoiceEmbedRequest) (*pb.VoiceEmbedResponse, error) {
if s.llm.Locking() {
s.llm.Lock()
defer s.llm.Unlock()
}
res, err := s.llm.VoiceEmbed(in)
if err != nil {
return nil, err
}
return &res, nil
}
func (s *server) AudioTranscription(ctx context.Context, in *pb.TranscriptRequest) (*pb.TranscriptResult, error) {
if s.llm.Locking() {
s.llm.Lock()

View File

@@ -1166,6 +1166,25 @@ const docTemplate = `{
}
}
},
"/backends/known": {
"get": {
"tags": [
"backends"
],
"summary": "List all known Backends (importer registry + curated pref-only + installed-on-disk)",
"responses": {
"200": {
"description": "Response",
"schema": {
"type": "array",
"items": {
"$ref": "#/definitions/schema.KnownBackend"
}
}
}
}
}
},
"/backends/upgrade/{name}": {
"post": {
"tags": [
@@ -2261,6 +2280,165 @@ const docTemplate = `{
}
}
},
"/v1/voice/analyze": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Analyze demographic attributes (age, gender, emotion) from a voice clip.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceAnalyzeRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceAnalyzeResponse"
}
}
}
}
},
"/v1/voice/embed": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Extract a speaker embedding from an audio clip.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceEmbedRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceEmbedResponse"
}
}
}
}
},
"/v1/voice/forget": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Remove a previously-registered speaker by ID.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceForgetRequest"
}
}
],
"responses": {
"204": {
"description": "No Content"
}
}
}
},
"/v1/voice/identify": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Identify a speaker against the registered database (1:N recognition).",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceIdentifyRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceIdentifyResponse"
}
}
}
}
},
"/v1/voice/register": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Register a speaker for 1:N identification.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceRegisterRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceRegisterResponse"
}
}
}
}
},
"/v1/voice/verify": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Verify that two audio clips were spoken by the same person.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceVerifyRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceVerifyResponse"
}
}
}
}
},
"/vad": {
"post": {
"consumes": [
@@ -3850,6 +4028,27 @@ const docTemplate = `{
}
}
},
"schema.KnownBackend": {
"type": "object",
"properties": {
"auto_detect": {
"type": "boolean"
},
"description": {
"type": "string"
},
"installed": {
"description": "Installed is true when the backend is currently present on disk — i.e. it\nappears in gallery.ListSystemBackends(systemState). Importer-registered or\ncurated pref-only backends default to false unless they also show up on\ndisk. The import form uses this to warn users that submitting an import\nmay trigger an automatic backend download.",
"type": "boolean"
},
"modality": {
"type": "string"
},
"name": {
"type": "string"
}
}
},
"schema.LogprobContent": {
"type": "object",
"properties": {
@@ -5098,6 +5297,248 @@ const docTemplate = `{
}
}
},
"schema.VoiceAnalysis": {
"type": "object",
"properties": {
"age": {
"type": "number"
},
"dominant_emotion": {
"type": "string"
},
"dominant_gender": {
"type": "string"
},
"emotion": {
"type": "object",
"additionalProperties": {
"type": "number",
"format": "float32"
}
},
"end": {
"type": "number"
},
"gender": {
"type": "object",
"additionalProperties": {
"type": "number",
"format": "float32"
}
},
"start": {
"type": "number"
}
}
},
"schema.VoiceAnalyzeRequest": {
"type": "object",
"properties": {
"actions": {
"description": "subset of {\"age\",\"gender\",\"emotion\"}",
"type": "array",
"items": {
"type": "string"
}
},
"audio": {
"type": "string"
},
"model": {
"type": "string"
}
}
},
"schema.VoiceAnalyzeResponse": {
"type": "object",
"properties": {
"segments": {
"type": "array",
"items": {
"$ref": "#/definitions/schema.VoiceAnalysis"
}
}
}
},
"schema.VoiceEmbedRequest": {
"type": "object",
"properties": {
"audio": {
"type": "string"
},
"model": {
"type": "string"
}
}
},
"schema.VoiceEmbedResponse": {
"type": "object",
"properties": {
"dim": {
"type": "integer"
},
"embedding": {
"type": "array",
"items": {
"type": "number"
}
},
"model": {
"type": "string"
}
}
},
"schema.VoiceForgetRequest": {
"type": "object",
"properties": {
"id": {
"type": "string"
},
"model": {
"type": "string"
},
"store": {
"type": "string"
}
}
},
"schema.VoiceIdentifyMatch": {
"type": "object",
"properties": {
"confidence": {
"type": "number"
},
"distance": {
"type": "number"
},
"id": {
"type": "string"
},
"labels": {
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"match": {
"type": "boolean"
},
"name": {
"type": "string"
}
}
},
"schema.VoiceIdentifyRequest": {
"type": "object",
"properties": {
"audio": {
"type": "string"
},
"model": {
"type": "string"
},
"store": {
"type": "string"
},
"threshold": {
"type": "number"
},
"top_k": {
"type": "integer"
}
}
},
"schema.VoiceIdentifyResponse": {
"type": "object",
"properties": {
"matches": {
"type": "array",
"items": {
"$ref": "#/definitions/schema.VoiceIdentifyMatch"
}
}
}
},
"schema.VoiceRegisterRequest": {
"type": "object",
"properties": {
"audio": {
"type": "string"
},
"labels": {
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"model": {
"type": "string"
},
"name": {
"type": "string"
},
"store": {
"type": "string"
}
}
},
"schema.VoiceRegisterResponse": {
"type": "object",
"properties": {
"id": {
"type": "string"
},
"name": {
"type": "string"
},
"registered_at": {
"type": "string"
}
}
},
"schema.VoiceVerifyRequest": {
"type": "object",
"properties": {
"anti_spoofing": {
"type": "boolean"
},
"audio1": {
"type": "string"
},
"audio2": {
"type": "string"
},
"model": {
"type": "string"
},
"threshold": {
"type": "number"
}
}
},
"schema.VoiceVerifyResponse": {
"type": "object",
"properties": {
"confidence": {
"type": "number"
},
"distance": {
"type": "number"
},
"model": {
"type": "string"
},
"processing_time_ms": {
"type": "number"
},
"threshold": {
"type": "number"
},
"verified": {
"type": "boolean"
}
}
},
"schema.WebhookConfig": {
"type": "object",
"properties": {

View File

@@ -1163,6 +1163,25 @@
}
}
},
"/backends/known": {
"get": {
"tags": [
"backends"
],
"summary": "List all known Backends (importer registry + curated pref-only + installed-on-disk)",
"responses": {
"200": {
"description": "Response",
"schema": {
"type": "array",
"items": {
"$ref": "#/definitions/schema.KnownBackend"
}
}
}
}
}
},
"/backends/upgrade/{name}": {
"post": {
"tags": [
@@ -2258,6 +2277,165 @@
}
}
},
"/v1/voice/analyze": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Analyze demographic attributes (age, gender, emotion) from a voice clip.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceAnalyzeRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceAnalyzeResponse"
}
}
}
}
},
"/v1/voice/embed": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Extract a speaker embedding from an audio clip.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceEmbedRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceEmbedResponse"
}
}
}
}
},
"/v1/voice/forget": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Remove a previously-registered speaker by ID.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceForgetRequest"
}
}
],
"responses": {
"204": {
"description": "No Content"
}
}
}
},
"/v1/voice/identify": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Identify a speaker against the registered database (1:N recognition).",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceIdentifyRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceIdentifyResponse"
}
}
}
}
},
"/v1/voice/register": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Register a speaker for 1:N identification.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceRegisterRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceRegisterResponse"
}
}
}
}
},
"/v1/voice/verify": {
"post": {
"tags": [
"voice-recognition"
],
"summary": "Verify that two audio clips were spoken by the same person.",
"parameters": [
{
"description": "query params",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.VoiceVerifyRequest"
}
}
],
"responses": {
"200": {
"description": "Response",
"schema": {
"$ref": "#/definitions/schema.VoiceVerifyResponse"
}
}
}
}
},
"/vad": {
"post": {
"consumes": [
@@ -3847,6 +4025,27 @@
}
}
},
"schema.KnownBackend": {
"type": "object",
"properties": {
"auto_detect": {
"type": "boolean"
},
"description": {
"type": "string"
},
"installed": {
"description": "Installed is true when the backend is currently present on disk — i.e. it\nappears in gallery.ListSystemBackends(systemState). Importer-registered or\ncurated pref-only backends default to false unless they also show up on\ndisk. The import form uses this to warn users that submitting an import\nmay trigger an automatic backend download.",
"type": "boolean"
},
"modality": {
"type": "string"
},
"name": {
"type": "string"
}
}
},
"schema.LogprobContent": {
"type": "object",
"properties": {
@@ -5095,6 +5294,248 @@
}
}
},
"schema.VoiceAnalysis": {
"type": "object",
"properties": {
"age": {
"type": "number"
},
"dominant_emotion": {
"type": "string"
},
"dominant_gender": {
"type": "string"
},
"emotion": {
"type": "object",
"additionalProperties": {
"type": "number",
"format": "float32"
}
},
"end": {
"type": "number"
},
"gender": {
"type": "object",
"additionalProperties": {
"type": "number",
"format": "float32"
}
},
"start": {
"type": "number"
}
}
},
"schema.VoiceAnalyzeRequest": {
"type": "object",
"properties": {
"actions": {
"description": "subset of {\"age\",\"gender\",\"emotion\"}",
"type": "array",
"items": {
"type": "string"
}
},
"audio": {
"type": "string"
},
"model": {
"type": "string"
}
}
},
"schema.VoiceAnalyzeResponse": {
"type": "object",
"properties": {
"segments": {
"type": "array",
"items": {
"$ref": "#/definitions/schema.VoiceAnalysis"
}
}
}
},
"schema.VoiceEmbedRequest": {
"type": "object",
"properties": {
"audio": {
"type": "string"
},
"model": {
"type": "string"
}
}
},
"schema.VoiceEmbedResponse": {
"type": "object",
"properties": {
"dim": {
"type": "integer"
},
"embedding": {
"type": "array",
"items": {
"type": "number"
}
},
"model": {
"type": "string"
}
}
},
"schema.VoiceForgetRequest": {
"type": "object",
"properties": {
"id": {
"type": "string"
},
"model": {
"type": "string"
},
"store": {
"type": "string"
}
}
},
"schema.VoiceIdentifyMatch": {
"type": "object",
"properties": {
"confidence": {
"type": "number"
},
"distance": {
"type": "number"
},
"id": {
"type": "string"
},
"labels": {
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"match": {
"type": "boolean"
},
"name": {
"type": "string"
}
}
},
"schema.VoiceIdentifyRequest": {
"type": "object",
"properties": {
"audio": {
"type": "string"
},
"model": {
"type": "string"
},
"store": {
"type": "string"
},
"threshold": {
"type": "number"
},
"top_k": {
"type": "integer"
}
}
},
"schema.VoiceIdentifyResponse": {
"type": "object",
"properties": {
"matches": {
"type": "array",
"items": {
"$ref": "#/definitions/schema.VoiceIdentifyMatch"
}
}
}
},
"schema.VoiceRegisterRequest": {
"type": "object",
"properties": {
"audio": {
"type": "string"
},
"labels": {
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"model": {
"type": "string"
},
"name": {
"type": "string"
},
"store": {
"type": "string"
}
}
},
"schema.VoiceRegisterResponse": {
"type": "object",
"properties": {
"id": {
"type": "string"
},
"name": {
"type": "string"
},
"registered_at": {
"type": "string"
}
}
},
"schema.VoiceVerifyRequest": {
"type": "object",
"properties": {
"anti_spoofing": {
"type": "boolean"
},
"audio1": {
"type": "string"
},
"audio2": {
"type": "string"
},
"model": {
"type": "string"
},
"threshold": {
"type": "number"
}
}
},
"schema.VoiceVerifyResponse": {
"type": "object",
"properties": {
"confidence": {
"type": "number"
},
"distance": {
"type": "number"
},
"model": {
"type": "string"
},
"processing_time_ms": {
"type": "number"
},
"threshold": {
"type": "number"
},
"verified": {
"type": "boolean"
}
}
},
"schema.WebhookConfig": {
"type": "object",
"properties": {

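The register/identify pair defined above is the 1:N counterpart to verify. A sketch of that round trip under the same assumptions (localhost base URL, path/URL audio strings, hypothetical file and speaker names); the field names come from schema.VoiceRegisterRequest, schema.VoiceRegisterResponse and schema.VoiceIdentifyRequest/Response, while the postJSON helper is purely illustrative.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// postJSON marshals in, POSTs it as JSON and decodes the reply into out.
func postJSON(url string, in, out any) error {
	body, err := json.Marshal(in)
	if err != nil {
		return err
	}
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return json.NewDecoder(resp.Body).Decode(out)
}

func main() {
	base := "http://localhost:8080" // assumed LocalAI address

	// 1. Enroll a speaker (schema.VoiceRegisterRequest -> schema.VoiceRegisterResponse).
	var reg struct {
		ID           string `json:"id"`
		Name         string `json:"name"`
		RegisteredAt string `json:"registered_at"`
	}
	if err := postJSON(base+"/v1/voice/register", map[string]any{
		"audio":  "enroll.wav", // hypothetical path/URL, as the e2e fixtures use
		"name":   "alice",
		"labels": map[string]string{"team": "demo"},
	}, &reg); err != nil {
		panic(err)
	}
	fmt.Printf("registered %s as id=%s\n", reg.Name, reg.ID)

	// 2. Probe an unknown clip against the registry (schema.VoiceIdentifyRequest).
	var ident struct {
		Matches []struct {
			ID       string  `json:"id"`
			Name     string  `json:"name"`
			Distance float64 `json:"distance"`
			Match    bool    `json:"match"`
		} `json:"matches"`
	}
	if err := postJSON(base+"/v1/voice/identify", map[string]any{
		"audio": "probe.wav", // hypothetical clip to identify
		"top_k": 3,
	}, &ident); err != nil {
		panic(err)
	}
	for _, m := range ident.Matches {
		fmt.Printf("%s (%s): distance=%.3f match=%v\n", m.Name, m.ID, m.Distance, m.Match)
	}
}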

@@ -1038,6 +1038,25 @@ definitions:
description: '"reasoning", "tool_call", "tool_result", "status"'
type: string
type: object
schema.KnownBackend:
properties:
auto_detect:
type: boolean
description:
type: string
installed:
description: |-
Installed is true when the backend is currently present on disk — i.e. it
appears in gallery.ListSystemBackends(systemState). Importer-registered or
curated pref-only backends default to false unless they also show up on
disk. The import form uses this to warn users that submitting an import
may trigger an automatic backend download.
type: boolean
modality:
type: string
name:
type: string
type: object
schema.LogprobContent:
properties:
bytes:
@@ -1901,6 +1920,164 @@ definitions:
description: output width in pixels
type: integer
type: object
schema.VoiceAnalysis:
properties:
age:
type: number
dominant_emotion:
type: string
dominant_gender:
type: string
emotion:
additionalProperties:
format: float32
type: number
type: object
end:
type: number
gender:
additionalProperties:
format: float32
type: number
type: object
start:
type: number
type: object
schema.VoiceAnalyzeRequest:
properties:
actions:
description: subset of {"age","gender","emotion"}
items:
type: string
type: array
audio:
type: string
model:
type: string
type: object
schema.VoiceAnalyzeResponse:
properties:
segments:
items:
$ref: '#/definitions/schema.VoiceAnalysis'
type: array
type: object
schema.VoiceEmbedRequest:
properties:
audio:
type: string
model:
type: string
type: object
schema.VoiceEmbedResponse:
properties:
dim:
type: integer
embedding:
items:
type: number
type: array
model:
type: string
type: object
schema.VoiceForgetRequest:
properties:
id:
type: string
model:
type: string
store:
type: string
type: object
schema.VoiceIdentifyMatch:
properties:
confidence:
type: number
distance:
type: number
id:
type: string
labels:
additionalProperties:
type: string
type: object
match:
type: boolean
name:
type: string
type: object
schema.VoiceIdentifyRequest:
properties:
audio:
type: string
model:
type: string
store:
type: string
threshold:
type: number
top_k:
type: integer
type: object
schema.VoiceIdentifyResponse:
properties:
matches:
items:
$ref: '#/definitions/schema.VoiceIdentifyMatch'
type: array
type: object
schema.VoiceRegisterRequest:
properties:
audio:
type: string
labels:
additionalProperties:
type: string
type: object
model:
type: string
name:
type: string
store:
type: string
type: object
schema.VoiceRegisterResponse:
properties:
id:
type: string
name:
type: string
registered_at:
type: string
type: object
schema.VoiceVerifyRequest:
properties:
anti_spoofing:
type: boolean
audio1:
type: string
audio2:
type: string
model:
type: string
threshold:
type: number
type: object
schema.VoiceVerifyResponse:
properties:
confidence:
type: number
distance:
type: number
model:
type: string
processing_time_ms:
type: number
threshold:
type: number
verified:
type: boolean
type: object
schema.WebhookConfig:
properties:
headers:
@@ -2688,6 +2865,18 @@ paths:
summary: Returns the job status
tags:
- backends
/backends/known:
get:
responses:
"200":
description: Response
schema:
items:
$ref: '#/definitions/schema.KnownBackend'
type: array
summary: List all known Backends (importer registry + curated pref-only + installed-on-disk)
tags:
- backends
/backends/upgrade/{name}:
post:
parameters:
@@ -3392,6 +3581,107 @@ paths:
summary: Tokenize the input.
tags:
- tokenize
/v1/voice/analyze:
post:
parameters:
- description: query params
in: body
name: request
required: true
schema:
$ref: '#/definitions/schema.VoiceAnalyzeRequest'
responses:
"200":
description: Response
schema:
$ref: '#/definitions/schema.VoiceAnalyzeResponse'
summary: Analyze demographic attributes (age, gender, emotion) from a voice
clip.
tags:
- voice-recognition
/v1/voice/embed:
post:
parameters:
- description: query params
in: body
name: request
required: true
schema:
$ref: '#/definitions/schema.VoiceEmbedRequest'
responses:
"200":
description: Response
schema:
$ref: '#/definitions/schema.VoiceEmbedResponse'
summary: Extract a speaker embedding from an audio clip.
tags:
- voice-recognition
/v1/voice/forget:
post:
parameters:
- description: query params
in: body
name: request
required: true
schema:
$ref: '#/definitions/schema.VoiceForgetRequest'
responses:
"204":
description: No Content
summary: Remove a previously-registered speaker by ID.
tags:
- voice-recognition
/v1/voice/identify:
post:
parameters:
- description: query params
in: body
name: request
required: true
schema:
$ref: '#/definitions/schema.VoiceIdentifyRequest'
responses:
"200":
description: Response
schema:
$ref: '#/definitions/schema.VoiceIdentifyResponse'
summary: Identify a speaker against the registered database (1:N recognition).
tags:
- voice-recognition
/v1/voice/register:
post:
parameters:
- description: query params
in: body
name: request
required: true
schema:
$ref: '#/definitions/schema.VoiceRegisterRequest'
responses:
"200":
description: Response
schema:
$ref: '#/definitions/schema.VoiceRegisterResponse'
summary: Register a speaker for 1:N identification.
tags:
- voice-recognition
/v1/voice/verify:
post:
parameters:
- description: query params
in: body
name: request
required: true
schema:
$ref: '#/definitions/schema.VoiceVerifyRequest'
responses:
"200":
description: Response
schema:
$ref: '#/definitions/schema.VoiceVerifyResponse'
summary: Verify that two audio clips were spoken by the same person.
tags:
- voice-recognition
/vad:
post:
consumes:

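Before the e2e additions below, a short sketch of the quantity voiceVerifyCeiling is compared against: the cosine distance between two speaker embeddings such as /v1/voice/embed returns. The 1 - similarity form is the conventional definition; whether the backend normalises its embeddings or computes the distance in exactly this way is an assumption of the sketch, which only means to show why same-clip pairs land near zero and cross-speaker pairs higher.

package sketch

import "math"

// cosineDistance returns 1 - cosine similarity of two speaker embeddings
// (e.g. the "embedding" arrays from two /v1/voice/embed responses).
// Assumes both vectors share the same dimension ("dim" in the response).
func cosineDistance(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 1 // degenerate zero vector; treat as maximally distant
	}
	return 1 - dot/(math.Sqrt(na)*math.Sqrt(nb))
}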

@@ -88,6 +88,9 @@ const (
capFaceEmbed = "face_embed"
capFaceVerify = "face_verify"
capFaceAnalyze = "face_analyze"
capVoiceEmbed = "voice_embed"
capVoiceVerify = "voice_verify"
capVoiceAnalyze = "voice_analyze"
defaultPrompt = "The capital of France is"
streamPrompt = "Once upon a time"
@@ -137,6 +140,14 @@ var _ = Describe("Backend container", Ordered, func() {
faceFile1 string
faceFile2 string
faceFile3 string
// Voice fixtures: two clips of the same speaker + one different speaker.
voiceFile1 string
voiceFile2 string
voiceFile3 string
// voiceVerifyCeiling is the upper-bound cosine distance for a
// same-speaker pair; varies with the recognizer (ECAPA-TDNN
// runs close to 0.2, WeSpeaker around 0.3).
voiceVerifyCeiling float32
// verifyCeiling is the upper-bound cosine distance for a
// same-person pair; each model configuration can override it via
// BACKEND_TEST_VERIFY_DISTANCE_CEILING because SFace's distance
@@ -218,6 +229,13 @@ var _ = Describe("Backend container", Ordered, func() {
faceFile3 = resolveFaceFixture(workDir, "BACKEND_TEST_FACE_IMAGE_3", "face_b.jpg")
verifyCeiling = envFloat32("BACKEND_TEST_VERIFY_DISTANCE_CEILING", defaultVerifyDistanceCeil)
// Voice fixtures for the voice-recognition specs. Same resolver
// as faces — the helper is content-agnostic.
voiceFile1 = resolveFaceFixture(workDir, "BACKEND_TEST_VOICE_AUDIO_1", "voice_a_1.wav")
voiceFile2 = resolveFaceFixture(workDir, "BACKEND_TEST_VOICE_AUDIO_2", "voice_a_2.wav")
voiceFile3 = resolveFaceFixture(workDir, "BACKEND_TEST_VOICE_AUDIO_3", "voice_b.wav")
voiceVerifyCeiling = envFloat32("BACKEND_TEST_VOICE_VERIFY_DISTANCE_CEILING", 0.4)
// Pick a free port and launch the backend.
port, err := freeport.GetFreePort()
Expect(err).NotTo(HaveOccurred())
@@ -668,6 +686,107 @@ var _ = Describe("Backend container", Ordered, func() {
}
GinkgoWriter.Printf("face_analyze: %d faces\n", len(res.GetFaces()))
})
// ─── voice (speaker) recognition specs ──────────────────────────────
It("produces speaker embeddings via VoiceEmbed", func() {
if !caps[capVoiceEmbed] {
Skip("voice_embed capability not enabled")
}
Expect(voiceFile1).NotTo(BeEmpty(), "BACKEND_TEST_VOICE_AUDIO_1_FILE or _URL must be set")
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
defer cancel()
res, err := client.VoiceEmbed(ctx, &pb.VoiceEmbedRequest{Audio: voiceFile1})
Expect(err).NotTo(HaveOccurred())
vec := res.GetEmbedding()
Expect(vec).NotTo(BeEmpty(), "VoiceEmbed returned empty vector")
GinkgoWriter.Printf("voice_embed: dim=%d\n", len(vec))
})
It("verifies speakers via VoiceVerify", func() {
if !caps[capVoiceVerify] {
Skip("voice_verify capability not enabled")
}
Expect(voiceFile1).NotTo(BeEmpty(), "BACKEND_TEST_VOICE_AUDIO_1_FILE or _URL must be set")
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
defer cancel()
// Same clip twice — expected verified=true with very small distance.
same, err := client.VoiceVerify(ctx, &pb.VoiceVerifyRequest{
Audio1: voiceFile1, Audio2: voiceFile1, Threshold: voiceVerifyCeiling,
})
Expect(err).NotTo(HaveOccurred())
Expect(same.GetVerified()).To(BeTrue(), "same clip should verify: dist=%.3f", same.GetDistance())
Expect(same.GetDistance()).To(BeNumerically("<", 0.05),
"identical-clip distance should be near zero, got %.3f", same.GetDistance())
GinkgoWriter.Printf("voice_verify(same): dist=%.3f confidence=%.1f\n", same.GetDistance(), same.GetConfidence())
// Cross-pair distance — assert relative ordering: d(file1,file3) > d(same).
// We don't require the fixtures to contain true same-speaker pairs —
// good same-speaker audio is hard to source un-gated. The RPC
// correctness is pinned by the same-clip check above; the pair
// distances here are about asserting the embedding actually encodes
// speaker info (ordering changes with speaker identity).
var d12, d13 float32
if voiceFile3 != "" {
res, err := client.VoiceVerify(ctx, &pb.VoiceVerifyRequest{
Audio1: voiceFile1, Audio2: voiceFile3, Threshold: voiceVerifyCeiling,
})
if err != nil {
GinkgoWriter.Printf("voice_verify(1vs3): skipped — %v\n", err)
} else {
d13 = res.GetDistance()
Expect(d13).To(BeNumerically(">", same.GetDistance()),
"cross-clip distance %.3f should exceed same-clip distance %.3f", d13, same.GetDistance())
GinkgoWriter.Printf("voice_verify(1vs3): dist=%.3f verified=%v\n", d13, res.GetVerified())
}
}
if voiceFile2 != "" {
res, err := client.VoiceVerify(ctx, &pb.VoiceVerifyRequest{
Audio1: voiceFile1, Audio2: voiceFile2, Threshold: voiceVerifyCeiling,
})
if err != nil {
GinkgoWriter.Printf("voice_verify(1vs2): skipped — %v\n", err)
} else {
d12 = res.GetDistance()
Expect(d12).To(BeNumerically(">", same.GetDistance()),
"cross-clip distance %.3f should exceed same-clip distance %.3f", d12, same.GetDistance())
GinkgoWriter.Printf("voice_verify(1vs2): dist=%.3f verified=%v\n", d12, res.GetVerified())
}
}
// If both pair distances were computed, record their ordering.
// We log rather than assert: the ordering depends on the specific
// fixtures used, and the CI defaults are not guaranteed to contain a
// true same-speaker pair.
if d12 > 0 && d13 > 0 {
GinkgoWriter.Printf("voice_verify ordering: d(1,2)=%.3f d(1,3)=%.3f\n", d12, d13)
}
})
It("analyzes voice via VoiceAnalyze", func() {
if !caps[capVoiceAnalyze] {
Skip("voice_analyze capability not enabled")
}
Expect(voiceFile1).NotTo(BeEmpty(), "BACKEND_TEST_VOICE_AUDIO_1_FILE or _URL must be set")
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
defer cancel()
res, err := client.VoiceAnalyze(ctx, &pb.VoiceAnalyzeRequest{Audio: voiceFile1})
Expect(err).NotTo(HaveOccurred())
Expect(res.GetSegments()).NotTo(BeEmpty(), "VoiceAnalyze returned no segments")
for _, s := range res.GetSegments() {
Expect(s.GetAge()).To(BeNumerically(">", 0), "age should be populated by analyze-capable engines")
// Audeering's age-gender head outputs female / male / child;
// LocalAI capitalises those to Female / Male / Child. Custom
// checkpoints wired via the age_gender_model option may use
// different labels, so accept anything non-empty.
Expect(s.GetDominantGender()).NotTo(BeEmpty())
}
GinkgoWriter.Printf("voice_analyze: %d segments\n", len(res.GetSegments()))
})
})
// extractImage runs `docker create` + `docker export` to materialise the image