Files
LocalAI/backend/python
Ettore Di Giacinto 181ebb6df4 feat: voice recognition (#9500)
* feat(voice-recognition): add /v1/voice/{verify,analyze,embed} + speaker-recognition backend

Audio analog to face recognition. Adds three gRPC RPCs
(VoiceVerify / VoiceAnalyze / VoiceEmbed), their Go service and HTTP
layers, a new FLAG_SPEAKER_RECOGNITION capability flag, and a Python
backend scaffold under backend/python/speaker-recognition/ wrapping
SpeechBrain ECAPA-TDNN with a parallel OnnxDirectEngine for
WeSpeaker / 3D-Speaker ONNX exports.

The kokoros Rust backend gets matching unimplemented trait stubs —
tonic's async_trait has no defaults, so adding an RPC without Rust
stubs breaks the build (same regression fixed by eb01c772 for face).

Swagger, /api/instructions, and the auth RouteFeatureRegistry /
APIFeatures list are updated so the endpoints surface everywhere a
client or admin UI looks.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): add 1:N identify + register/forget endpoints

Mirrors the face-recognition register/identify/forget surface. New
package core/services/voicerecognition/ carries a Registry interface
and a local-store-backed implementation (same in-memory vector-store
plumbing facerecognition uses, separate instance so the embedding
spaces stay isolated).

Handlers under /v1/voice/{register,identify,forget} reuse
backend.VoiceEmbed to compute the probe vector, then delegate the
nearest-neighbour search to the registry. Default cosine-distance
threshold is tuned for ECAPA-TDNN on VoxCeleb (0.25, EER ~1.9%).

As with the face registry, the current backing is in-memory only — a
pgvector implementation is a future constructor-level swap.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): gallery, docs, CI and e2e coverage

- backend/index.yaml: speaker-recognition backend entry + CPU and
  CUDA-12 image variants (plus matching development variants).
- gallery/index.yaml: speechbrain-ecapa-tdnn (default) and
  wespeaker-resnet34 model entries. The WeSpeaker SHA-256 is a
  deliberate placeholder — the HF URI must be curl'd and its hash
  filled in before the entry installs.
- docs/content/features/voice-recognition.md: API reference + quickstart,
  mirrors the face-recognition docs.
- React UI: CAP_SPEAKER_RECOGNITION flag export (consumers follow face's
  precedent — no dedicated tab yet).
- tests/e2e-backends: voice_embed / voice_verify / voice_analyze specs.
  Helper resolveFaceFixture is reused as-is — the only thing face/voice
  share is "download a file into workDir", so no need for a new helper.
- Makefile: docker-build-speaker-recognition + test-extra-backend-
  speaker-recognition-{ecapa,all} targets. Audio fixtures default to
  VCTK p225/p226 samples from HuggingFace.
- CI: test-extra.yml grows a tests-speaker-recognition-grpc job
  mirroring insightface. backend.yml matrix gains CPU + CUDA-12 image
  build entries — scripts/changed-backends.js auto-picks these up.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): wire a working /v1/voice/analyze head

Adds AnalysisHead: a lazy-loading age / gender / emotion inference
wrapper that plugs into both SpeechBrainEngine and OnnxDirectEngine.

Defaults to two open-licence HuggingFace checkpoints:
  - audeering/wav2vec2-large-robust-24-ft-age-gender (Apache 2.0) —
    age regression + 3-way gender (female / male / child).
  - superb/wav2vec2-base-superb-er (Apache 2.0) — 4-way emotion.

Both are optional and degrade gracefully when transformers or the
model can't be loaded — the engine raises NotImplementedError so the
gRPC layer returns 501 instead of a generic 500.

Emotion classes pass through from the model (neutral/happy/angry/sad
on the default checkpoint); the e2e test now accepts any non-empty
dominant gender so custom age_gender_model overrides don't fail it.

Adds transformers to the backend's CPU and CUDA-12 requirements.

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): pin real WeSpeaker ResNet34 ONNX SHA-256

Replaces the placeholder hash in gallery/index.yaml with the actual
SHA-256 (7bb2f06e…) of the upstream
Wespeaker/wespeaker-voxceleb-resnet34-LM ONNX at ~25MB. `local-ai
models install wespeaker-resnet34` now succeeds.

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): soundfile loader + honest analyze default

Two issues surfaced on first end-to-end smoke with the actual backend
image:

1. torchaudio.load in torchaudio 2.8+ requires the torchcodec package
   for audio decoding. Switch SpeechBrainEngine._load_waveform to the
   already-present soundfile (listed in requirements.txt) plus a numpy
   linear resample to 16kHz. Drops a heavy ffmpeg-linked dep and the
   codepath we never exercise (torchaudio's ffmpeg backend).

2. The AnalysisHead was defaulting to audeering/wav2vec2-large-robust-
   24-ft-age-gender, but AutoModelForAudioClassification silently
   mangles that checkpoint — it reports the age head weights as
   UNEXPECTED and re-initialises the classifier head with random
   values, so the "gender" output is noise and there is no age output
   at all. Make age/gender opt-in instead (empty default; users wire
   a cleanly-loadable Wav2Vec2ForSequenceClassification checkpoint via
   age_gender_model: option). Emotion keeps its working Superb default.
   Also broaden _infer_age_gender's tensor-shape handling and catch
   runtime exceptions so a dodgy age/gender head never takes down the
   whole analyze call.

Docs and README updated to match the new policy.

Verified with the branch-scoped gallery on localhost:
- voice/embed    → 192-d ECAPA-TDNN vector
- voice/verify   → same-clip dist≈6e-08 verified=true; cross-speaker
                   dist 0.76–0.99 verified=false (as expected)
- voice/register/identify/forget → round-trip works, 404 on unknown id
- voice/analyze  → emotion populated, age/gender omitted (opt-in)

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): real CI audio fixtures + fixture-agnostic verify spec

Two issues surfaced after CI actually ran the speaker-recognition e2e
target (I'd curl-tested against a running server but hadn't run the
make target locally):

1. The default BACKEND_TEST_VOICE_AUDIO_* URLs pointed at
   huggingface.co/datasets/CSTR-Edinburgh/vctk paths that return 404
   (the dataset is gated). Swap them for the speechbrain test samples
   served from github.com/speechbrain/speechbrain/raw/develop/ —
   public, no auth, correct 16kHz mono format.

2. The VoiceVerify spec required d(file1,file2) < 0.4, assuming
   file1/file2 were same-speaker. The speechbrain samples are three
   different speakers (example1/2/5), and there is no easy un-gated
   source of true same-speaker audio pairs (VoxCeleb/VCTK/LibriSpeech
   are all license- or size-gated for CI use). Replace the ceiling
   check with a relative-ordering assertion: d(pair) > d(same-clip)
   for both file2 and file3 — that's enough to prove the embeddings
   encode speaker info, and it works with any three non-identical
   clips. Actual speaker ordering d(1,2) vs d(1,3) is logged but not
   asserted.

Local run: 4/4 voice specs pass (Health, LoadModel, VoiceEmbed,
VoiceVerify) on the built backend image. 12 non-voice specs skipped
as expected.

Assisted-by: Claude:claude-opus-4-7

* fix(ci): checkout with submodules in the reusable backend_build workflow

The kokoros Rust backend build fails with

    failed to read .../sources/Kokoros/kokoros/Cargo.toml: No such file

because the reusable backend_build.yml workflow's actions/checkout
step was missing `submodules: true`. Dockerfile.rust does `COPY .
/LocalAI`, and without the submodule files the subsequent `cargo
build` can't find the vendored Kokoros crate.

The bug pre-dates this PR — scripts/changed-backends.js only triggers
the kokoros image job when something under backend/rust/kokoros or
the shared proto changes, so master had been coasting past it. The
voice-recognition proto addition re-broke it.

Other checkouts in backend.yml (llama-cpp-darwin) and test-extra.yml
(insightface, kokoros, speaker-recognition) already pass
`submodules: true`; this brings the shared backend image builder in
line.

Assisted-by: Claude:claude-opus-4-7
2026-04-23 12:07:14 +02:00
..
2026-04-12 08:51:30 +02:00
2026-04-12 08:51:30 +02:00
2026-04-12 08:51:30 +02:00
2026-04-12 08:51:30 +02:00
2026-04-12 08:51:30 +02:00
2026-04-12 08:51:30 +02:00
2026-04-12 08:51:30 +02:00
2026-04-12 08:51:30 +02:00
2026-04-12 08:51:30 +02:00
2026-04-12 08:51:30 +02:00
2026-04-12 08:51:30 +02:00
2026-04-16 22:40:56 +02:00
2026-03-30 00:47:27 +02:00
2026-04-12 08:51:30 +02:00
2026-04-12 08:51:30 +02:00

Python Backends for LocalAI

This directory contains Python-based AI backends for LocalAI, providing support for various AI models and hardware acceleration targets.

Overview

The Python backends use a unified build system based on libbackend.sh that provides:

  • Automatic virtual environment management with support for both uv and pip
  • Hardware-specific dependency installation (CPU, CUDA, Intel, MLX, etc.)
  • Portable Python support for standalone deployments
  • Consistent backend execution across different environments

Available Backends

Core AI Models

  • transformers - Hugging Face Transformers framework (PyTorch-based)
  • vllm - High-performance LLM inference engine
  • mlx - Apple Silicon optimized ML framework

Audio & Speech

  • coqui - Coqui TTS models
  • faster-whisper - Fast Whisper speech recognition
  • kitten-tts - Lightweight TTS
  • mlx-audio - Apple Silicon audio processing
  • chatterbox - TTS model
  • kokoro - TTS models

Computer Vision

  • diffusers - Stable Diffusion and image generation
  • mlx-vlm - Vision-language models for Apple Silicon
  • rfdetr - Object detection models

Specialized

  • rerankers - Text reranking models

Quick Start

Prerequisites

  • Python 3.10+ (default: 3.10.18)
  • uv package manager (recommended) or pip
  • Appropriate hardware drivers for your target (CUDA, Intel, etc.)

Installation

Each backend can be installed individually:

# Navigate to a specific backend
cd backend/python/transformers

# Install dependencies
make transformers
# or
bash install.sh

# Run the backend
make run
# or
bash run.sh

Using the Unified Build System

The libbackend.sh script provides consistent commands across all backends:

# Source the library in your backend script
source $(dirname $0)/../common/libbackend.sh

# Install requirements (automatically handles hardware detection)
installRequirements

# Start the backend server
startBackend $@

# Run tests
runUnittests

Hardware Targets

The build system automatically detects and configures for different hardware:

  • CPU - Standard CPU-only builds
  • CUDA - NVIDIA GPU acceleration (supports CUDA 12/13)
  • Intel - Intel XPU/GPU optimization
  • MLX - Apple Silicon (M1/M2/M3) optimization
  • HIP - AMD GPU acceleration

Target-Specific Requirements

Backends can specify hardware-specific dependencies:

  • requirements.txt - Base requirements
  • requirements-cpu.txt - CPU-specific packages
  • requirements-cublas12.txt - CUDA 12 packages
  • requirements-cublas13.txt - CUDA 13 packages
  • requirements-intel.txt - Intel-optimized packages
  • requirements-mps.txt - Apple Silicon packages

Configuration Options

Environment Variables

  • PYTHON_VERSION - Python version (default: 3.10)
  • PYTHON_PATCH - Python patch version (default: 18)
  • BUILD_TYPE - Force specific build target
  • USE_PIP - Use pip instead of uv (default: false)
  • PORTABLE_PYTHON - Enable portable Python builds
  • LIMIT_TARGETS - Restrict backend to specific targets

Example: CUDA 12 Only Backend

# In your backend script
LIMIT_TARGETS="cublas12"
source $(dirname $0)/../common/libbackend.sh

Example: Intel-Optimized Backend

# In your backend script
LIMIT_TARGETS="intel"
source $(dirname $0)/../common/libbackend.sh

Development

Adding a New Backend

  1. Create a new directory in backend/python/
  2. Copy the template structure from common/template/
  3. Implement your backend.py with the required gRPC interface
  4. Add appropriate requirements files for your target hardware
  5. Use libbackend.sh for consistent build and execution

Testing

# Run backend tests
make test
# or
bash test.sh

Building

# Install dependencies
make <backend-name>

# Clean build artifacts
make clean

Architecture

Each backend follows a consistent structure:

backend-name/
├── backend.py          # Main backend implementation
├── requirements.txt    # Base dependencies
├── requirements-*.txt  # Hardware-specific dependencies
├── install.sh         # Installation script
├── run.sh            # Execution script
├── test.sh           # Test script
├── Makefile          # Build targets
└── test.py           # Unit tests

Troubleshooting

Common Issues

  1. Missing dependencies: Ensure all requirements files are properly configured
  2. Hardware detection: Check that BUILD_TYPE matches your system
  3. Python version: Verify Python 3.10+ is available
  4. Virtual environment: Use ensureVenv to create/activate environments

Contributing

When adding new backends or modifying existing ones:

  1. Follow the established directory structure
  2. Use libbackend.sh for consistent behavior
  3. Include appropriate requirements files for all target hardware
  4. Add comprehensive tests
  5. Update this README if adding new backend types