LocalAI [bot] 60facc7252 fix(darwin): publish sherpa-onnx and speaker-recognition images for darwin/arm64 (#10275)
Neither the sherpa-onnx nor the speaker-recognition backend had a
darwin/arm64 image, so `local-ai backends install` failed with "no child
with platform darwin/arm64" on macOS. This left /v1/audio/diarization (the
sherpa-onnx path) and /v1/voice/embed without any usable backend on Apple
Silicon.

Both backends build on darwin/arm64:
- sherpa-onnx (Go) already fetches the onnxruntime osx-arm64 runtime in its
  Makefile; it only needed a darwin matrix entry (build-type metal, lang go,
  like whisper and silero-vad).
- speaker-recognition (Python) needed a requirements-mps.txt so the mps build
  installs plain onnxruntime (which ships a macOS arm64 wheel) instead of the
  onnxruntime-gpu pulled by its base requirements (which does not).

Add both to the includeDarwin build matrix, wire the metal capability and
metal image aliases into the gallery, and add the speaker-recognition
requirements-mps.txt.

Fixes #10268


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-12 22:32:42 +02:00

speaker-recognition

Speaker (voice) recognition backend for LocalAI. It is the audio analog to insightface: it produces speaker embeddings and supports 1:1 voice verification and voice demographic analysis.

Engines

  • SpeechBrainEngine (default): ECAPA-TDNN trained on VoxCeleb. 192-d L2-normalised embeddings, cosine distance for verification. Auto-downloads from HuggingFace on first LoadModel.
  • OnnxDirectEngine: Any pre-exported ONNX speaker encoder (WeSpeaker ResNet, 3D-Speaker ERes2Net, CAM++, …). Model path comes from the gallery files: entry.
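
As an illustration of how cosine distance on speaker embeddings can drive a 1:1 verification decision, here is a minimal sketch in plain Python. The function names and the threshold value are assumptions made for this example; they are not taken from the backend, and a real threshold would need to be tuned per model.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit length (a no-op for already normalised embeddings)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def cosine_distance(a, b):
    """Cosine distance, defined as 1 - cosine similarity."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    a, b = l2_normalise(a), l2_normalise(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))


def same_speaker(a, b, threshold=0.25):
    """Hypothetical decision rule: the threshold is a placeholder, not a backend default."""
    return cosine_distance(a, b) <= threshold
```

For embeddings that are already L2-normalised, the normalisation step changes nothing and the distance reduces to one minus the dot product.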

Engine selection is gallery-driven: if the model config provides a model_path: or onnx: option, OnnxDirectEngine is used; otherwise SpeechBrainEngine is used.
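
The selection rule can be sketched as a small function. This assumes the model options have already been parsed into a dict keyed by option name; that representation and the `select_engine` helper are hypothetical and may not match how the backend actually reads its options.

```python
def select_engine(options):
    """Pick an engine name from parsed model options (illustrative only).

    `options` is assumed to be a dict such as {"onnx": "encoder.onnx"}.
    """
    # Either option being present (and non-empty) routes to the ONNX engine.
    if options.get("model_path") or options.get("onnx"):
        return "OnnxDirectEngine"
    # No ONNX model supplied: fall back to the default engine.
    return "SpeechBrainEngine"
```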

Endpoints

  • POST /v1/voice/verify — 1:1 same-speaker check.

  • POST /v1/voice/embed — extract a speaker embedding vector.

  • POST /v1/voice/analyze — voice demographics, loaded lazily on the first analyze call:

    • Emotion (default, opt-out): superb/wav2vec2-base-superb-er (Apache-2.0), 4-way categorical (neutral / happy / angry / sad).
    • Age + gender (opt-in): no default — wire a checkpoint with a standard Wav2Vec2ForSequenceClassification head via age_gender_model:<repo> in options. The Audeering age-gender model is not usable as a drop-in because its multi-task head isn't loadable via AutoModelForAudioClassification.

    Both heads are optional. When nothing loads, the engine returns 501.
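
The lazy-loading behaviour described for /v1/voice/analyze could look roughly like the sketch below. The `AnalyzeHeads` class, the loader callables and the `(status, body)` return shape are all assumptions made for illustration; only the "load on first call, 501 when nothing loads" behaviour comes from the description above.

```python
class AnalyzeHeads:
    """Load optional analysis heads on first use (illustrative sketch)."""

    def __init__(self, loaders):
        # loaders: dict mapping a head name to a zero-argument callable that
        # returns a head; a head is itself a callable taking a wav path.
        self._loaders = loaders
        self._heads = None  # None means "not attempted yet"

    def _ensure_loaded(self):
        if self._heads is not None:
            return
        self._heads = {}
        for name, load in self._loaders.items():
            try:
                self._heads[name] = load()
            except Exception:
                # A head that fails to load is skipped; heads are optional.
                continue

    def analyze(self, wav_path):
        self._ensure_loaded()
        if not self._heads:
            # Nothing loaded: report "not implemented".
            return 501, {"error": "no analysis head available"}
        return 200, {name: head(wav_path) for name, head in self._heads.items()}
```

Loading is attempted only once, so a head that failed on the first analyze call is not retried in this sketch.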

Audio input

Audio is materialised by the HTTP layer to a temp wav before calling the gRPC backend. Accepted input forms on the HTTP side: URL, data-URI, or raw base64. The backend itself always receives a filesystem path.
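
To make the three accepted input forms concrete, here is a hedged sketch of how such a string might be turned into a file on disk. It is written in Python for consistency with the backend, although the actual materialisation happens in LocalAI's HTTP layer; the helper names are hypothetical, and this sketch writes the bytes as-is without converting to wav.

```python
import base64
import binascii
import tempfile
import urllib.request


def decode_audio_input(value):
    """Return raw bytes for a URL, a base64 data-URI, or a raw base64 string."""
    if value.startswith(("http://", "https://")):
        with urllib.request.urlopen(value, timeout=30) as resp:
            return resp.read()
    if value.startswith("data:"):
        header, sep, payload = value.partition(",")
        if not sep:
            raise ValueError("malformed data URI")
        if not header.endswith(";base64"):
            raise ValueError("only base64 data URIs are handled in this sketch")
        value = payload
    try:
        return base64.b64decode(value, validate=True)
    except binascii.Error as exc:
        raise ValueError("input is not valid base64") from exc


def materialise_to_temp_file(value):
    """Write the decoded audio to a temp file and return its filesystem path."""
    data = decode_audio_input(value)
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(data)
        return f.name
```

The caller is responsible for deleting the temp file once the backend has finished with the path.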