# speaker-recognition

Speaker (voice) recognition backend for LocalAI. It is the audio analog to `insightface`: it produces speaker embeddings and supports 1:1 voice verification and voice demographic analysis.

## Engines

- **SpeechBrainEngine** (default): ECAPA-TDNN trained on VoxCeleb. 192-d L2-normalised embeddings, cosine distance for verification. Auto-downloads from HuggingFace on first LoadModel.
- **OnnxDirectEngine**: Any pre-exported ONNX speaker encoder (WeSpeaker ResNet, 3D-Speaker ERes2Net, CAM++, …). Model path comes from the gallery `files:` entry.

Engine selection is gallery-driven: if the model config provides `model_path:` / `onnx:`, the ONNX engine is used; otherwise the SpeechBrain engine is used.

## Endpoints

- `POST /v1/voice/verify`: 1:1 same-speaker check.
- `POST /v1/voice/embed`: extract a speaker embedding vector.
- `POST /v1/voice/analyze`: voice demographics, loaded lazily on the first analyze call:
  - **Emotion** (default, opt-out): `superb/wav2vec2-base-superb-er` (Apache-2.0), 4-way categorical (neutral / happy / angry / sad).
  - **Age + gender** (opt-in): no default; wire a checkpoint with a standard `Wav2Vec2ForSequenceClassification` head via `age_gender_model:` in options. The Audeering age-gender model is *not* usable as a drop-in because its multi-task head isn't loadable via `AutoModelForAudioClassification`.

Both heads are optional. When nothing loads, the engine returns 501.

## Audio input

Audio is materialised by the HTTP layer to a temp wav before calling the gRPC backend. Accepted input forms on the HTTP side: URL, data-URI, or raw base64. The backend itself always receives a filesystem path.
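
To illustrate the audio input handling (URL, data-URI, or raw base64 turned into a temp wav path), here is a minimal Python sketch. The helper name `materialise_audio` and its error handling are assumptions made for illustration. This is not LocalAI's actual HTTP-layer code, and it does not check that the decoded bytes are a valid wav file.

```python
import base64
import binascii
import tempfile
import urllib.request


def materialise_audio(source: str) -> str:
    """Write the audio referenced by `source` to a temp .wav file and return its path.

    Hypothetical helper: it mirrors the three accepted input forms
    (URL, data-URI, raw base64), not LocalAI's real implementation.
    """
    if source.startswith(("http://", "https://")):
        # URL: download the raw bytes.
        with urllib.request.urlopen(source, timeout=30) as resp:
            data = resp.read()
    elif source.startswith("data:"):
        # data-URI: "data:<mediatype>;base64,<payload>"
        header, sep, payload = source.partition(",")
        if not sep or not header.endswith(";base64"):
            raise ValueError("expected a base64-encoded data-URI")
        data = base64.b64decode(payload, validate=True)
    else:
        # Anything else is treated as raw base64.
        try:
            data = base64.b64decode(source, validate=True)
        except binascii.Error as exc:
            raise ValueError("input is not a URL, data-URI, or valid base64") from exc

    # delete=False keeps the file on disk so another process can open it by path.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(data)
        return tmp.name
```

The returned path is what a backend would receive in this sketch. The caller is responsible for removing the temp file once the request completes.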