Neither the sherpa-onnx nor the speaker-recognition backend had a darwin/arm64 image, so `local-ai backends install` failed with "no child with platform darwin/arm64" on macOS. This left /v1/audio/diarization (the sherpa-onnx path) and /v1/voice/embed without any usable backend on Apple Silicon.

Both backends build on darwin/arm64:

- sherpa-onnx (Go) already fetches the onnxruntime osx-arm64 runtime in its Makefile; it only needed a darwin matrix entry (build-type metal, lang go, like whisper and silero-vad).
- speaker-recognition (Python) needed a requirements-mps.txt so the mps build installs plain onnxruntime (which ships a macOS arm64 wheel) instead of the onnxruntime-gpu pulled by its base requirements (which does not).

Add both to the includeDarwin build matrix, wire the metal capability and metal image aliases into the gallery, and add the speaker-recognition requirements-mps.txt.

Fixes #10268

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
speaker-recognition
Speaker (voice) recognition backend for LocalAI. The audio analog to
insightface — produces speaker embeddings and supports 1:1 voice
verification and voice demographic analysis.
Engines
- SpeechBrainEngine (default): ECAPA-TDNN trained on VoxCeleb. 192-d L2-normalised embeddings, cosine distance for verification. Auto-downloads from HuggingFace on first LoadModel.
- OnnxDirectEngine: Any pre-exported ONNX speaker encoder (WeSpeaker ResNet, 3D-Speaker ERes2Net, CAM++, …). Model path comes from the gallery files: entry.
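To make the "L2-normalised embeddings, cosine distance for verification" step concrete, here is a minimal sketch in plain Python. The function names and the 0.25 threshold are illustrative assumptions and are not taken from the backend; a real deployment would tune the threshold per model.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0.0 means identical direction."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    a, b = l2_normalise(a), l2_normalise(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))


def same_speaker(a, b, threshold=0.25):
    """Hypothetical decision rule: accept when the distance is small.

    The 0.25 default is an assumption for illustration only.
    """
    return cosine_distance(a, b) <= threshold
```

Because the embeddings are normalised first, the distance depends only on the direction of the vectors, not on their magnitude.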
Engine selection is gallery-driven: if the model config provides
model_path: / onnx: the ONNX engine is used, otherwise the
SpeechBrain engine.
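The selection rule described above can be sketched as follows. This is an assumption-laden illustration: the function name is hypothetical, and the model config is treated as a flat dict with optional model_path and onnx keys, which may not match how the backend actually receives its options.

```python
def select_engine(model_config):
    """Pick an engine name from a model config dict.

    Hypothetical helper: returns the ONNX engine when either
    `model_path` or `onnx` is set to a non-empty value, and falls
    back to the SpeechBrain engine otherwise.
    """
    if model_config.get("model_path") or model_config.get("onnx"):
        return "OnnxDirectEngine"
    return "SpeechBrainEngine"
```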
Endpoints
- POST /v1/voice/verify: 1:1 same-speaker check.
- POST /v1/voice/embed: extract a speaker embedding vector.
- POST /v1/voice/analyze: voice demographics, loaded lazily on the first analyze call:
  - Emotion (default, opt-out): superb/wav2vec2-base-superb-er (Apache-2.0), 4-way categorical (neutral / happy / angry / sad).
  - Age + gender (opt-in): no default; wire a checkpoint with a standard Wav2Vec2ForSequenceClassification head via age_gender_model:<repo> in options. The Audeering age-gender model is not usable as a drop-in because its multi-task head isn't loadable via AutoModelForAudioClassification.

  Both heads are optional. When nothing loads, the engine returns 501.
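The lazy-loading behaviour of the analyze heads, including the 501 when nothing loads, might look roughly like the sketch below. Everything here is hypothetical: the class name, the loader callables, and the (status, body) return shape are stand-ins for illustration, and the real backend speaks gRPC rather than returning HTTP codes directly.

```python
class LazyAnalyzer:
    """Hypothetical sketch: load analysis heads on the first call only."""

    def __init__(self, loaders):
        # loaders maps a head name (e.g. "emotion") to a zero-argument
        # callable that returns a function taking an audio path.
        self._loaders = loaders
        self._heads = None  # None means "not attempted yet"

    @property
    def loaded(self):
        return self._heads is not None

    def _ensure_loaded(self):
        if self._heads is not None:
            return
        heads = {}
        for name, loader in self._loaders.items():
            try:
                heads[name] = loader()
            except Exception:
                # Each head is optional, so a failed load is skipped.
                continue
        self._heads = heads

    def analyze(self, audio_path):
        self._ensure_loaded()
        if not self._heads:
            # No head could be loaded: report "not implemented".
            return 501, {"error": "no analysis head available"}
        return 200, {name: head(audio_path) for name, head in self._heads.items()}
```

Keeping the load inside the first analyze call means verify and embed requests never pay for models they do not use.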
Audio input
Audio is materialised by the HTTP layer to a temp wav before calling the gRPC backend. Accepted input forms on the HTTP side: URL, data-URI, or raw base64. The backend itself always receives a filesystem path.
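As an illustration of the three accepted input forms, the following sketch resolves a URL, data-URI, or raw base64 string to a temporary file and returns its path. It is written in Python for consistency with the backend and is not the actual HTTP layer; the function name is hypothetical, and it writes the decoded bytes as-is, assuming the payload is already WAV data (any format conversion the real layer performs is not shown).

```python
import base64
import tempfile
import urllib.request


def materialise_audio(source):
    """Write the audio referenced by `source` to a temp .wav file.

    Hypothetical helper. Accepts an http(s) URL, a base64 data URI,
    or a raw base64 string, and returns a filesystem path.
    """
    if source.startswith(("http://", "https://")):
        with urllib.request.urlopen(source) as resp:
            data = resp.read()
    elif source.startswith("data:"):
        header, sep, payload = source.partition(",")
        if not sep or not header.endswith(";base64"):
            raise ValueError("expected a base64-encoded data URI")
        data = base64.b64decode(payload, validate=True)
    else:
        # Anything else is treated as raw base64; invalid input raises
        # binascii.Error, which is a subclass of ValueError.
        data = base64.b64decode(source, validate=True)

    # delete=False keeps the file on disk so the path can be handed on;
    # the caller is responsible for removing it afterwards.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(data)
        return tmp.name
```

The caller only ever passes the returned path onward, which mirrors the statement that the backend always receives a filesystem path.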