LocalAI Backend Architecture
This directory contains the core backend infrastructure for LocalAI, including the gRPC protocol definition, multi-language Dockerfiles, and language-specific backend implementations.
Overview
LocalAI uses a unified gRPC-based architecture that allows different programming languages to implement AI backends while maintaining consistent interfaces and capabilities. The backend system supports multiple hardware acceleration targets and provides a standardized way to integrate various AI models and frameworks.
Architecture Components
1. Protocol Definition (backend.proto)
The backend.proto file defines the gRPC service interface that all backends must implement. This ensures consistency across different language implementations and provides a contract for communication between LocalAI core and backend services.
Core Services
- Text Generation: Predict, PredictStream for LLM inference
- Embeddings: Embedding for text vectorization
- Image Generation: GenerateImage for stable diffusion and image models
- Audio Processing: AudioTranscription, TTS, SoundGeneration
- Video Generation: GenerateVideo for video synthesis
- Object Detection: Detect for computer vision tasks
- Vector Storage: StoresSet, StoresGet, StoresFind for RAG operations
- Reranking: Rerank for document relevance scoring
- Voice Activity Detection: VAD for audio segmentation
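As a quick reference, the service groups above can be captured in a lookup table. This is only an illustrative sketch: the RPC names come from the list above, while the `CORE_SERVICES` table and the `capability_of` helper are hypothetical names, not part of LocalAI.

```python
# Hypothetical lookup table; only the RPC names come from the service list above.
CORE_SERVICES = {
    "Text Generation": ["Predict", "PredictStream"],
    "Embeddings": ["Embedding"],
    "Image Generation": ["GenerateImage"],
    "Audio Processing": ["AudioTranscription", "TTS", "SoundGeneration"],
    "Video Generation": ["GenerateVideo"],
    "Object Detection": ["Detect"],
    "Vector Storage": ["StoresSet", "StoresGet", "StoresFind"],
    "Reranking": ["Rerank"],
    "Voice Activity Detection": ["VAD"],
}


def capability_of(rpc_name):
    """Return the capability group an RPC belongs to, or None if it is not listed."""
    for capability, rpcs in CORE_SERVICES.items():
        if rpc_name in rpcs:
            return capability
    return None
```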
Key Message Types
- PredictOptions: Comprehensive configuration for text generation
- ModelOptions: Model loading and configuration parameters
- Result: Standardized response format
- StatusResponse: Backend health and memory usage information
2. Multi-Language Dockerfiles
The backend system provides language-specific Dockerfiles that handle the build environment and dependencies for different programming languages:
- Dockerfile.python
- Dockerfile.golang
- Dockerfile.llama-cpp
3. Language-Specific Implementations
Python Backends (python/)
- transformers: Hugging Face Transformers framework
- vllm: High-performance LLM inference
- mlx: Apple Silicon optimization
- diffusers: Stable Diffusion models
- Audio: coqui, faster-whisper, kitten-tts
- Vision: mlx-vlm, rfdetr
- Specialized: rerankers, chatterbox, kokoro
Go Backends (go/)
- whisper: OpenAI Whisper speech recognition in Go with a GGML C++ backend (whisper.cpp)
- stablediffusion-ggml: Stable Diffusion in Go with a GGML C++ backend
- piper: Text-to-speech synthesis in Go with C bindings, using rhasspy/piper
- local-store: Vector storage backend
C++ Backends (cpp/)
- llama-cpp: Llama.cpp integration
- grpc: gRPC utilities and helpers
Hardware Acceleration Support
CUDA (NVIDIA)
- Versions: CUDA 12.x, 13.x
- Features: cuBLAS, cuDNN, TensorRT optimization
- Targets: x86_64, ARM64 (Jetson)
ROCm (AMD)
- Features: HIP, rocBLAS, MIOpen
- Targets: AMD GPUs with ROCm support
Intel
- Features: oneAPI, Intel Extension for PyTorch
- Targets: Intel GPUs, XPUs, CPUs
Vulkan
- Features: Cross-platform GPU acceleration
- Targets: Windows, Linux, Android, macOS
Apple Silicon
- Features: MLX framework, Metal Performance Shaders
- Targets: M1/M2/M3 Macs
Backend Registry (index.yaml)
The index.yaml file serves as a central registry for all available backends, providing:
- Metadata: Name, description, license, icons
- Capabilities: Hardware targets and optimization profiles
- Tags: Categorization for discovery
- URLs: Source code and documentation links
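To make the four categories above concrete, here is a minimal sketch of a completeness check for one registry entry, written against a plain dictionary. The field names (`name`, `description`, `license`, `capabilities`, `tags`, `urls`) and the example values are assumptions that mirror the list above; the actual index.yaml schema may differ.

```python
# Field names are assumptions mirroring the categories listed above;
# the real index.yaml schema may use different keys.
REQUIRED_FIELDS = ("name", "description", "license", "capabilities", "tags", "urls")


def missing_fields(entry):
    """Return the required fields that are absent or empty in a registry entry."""
    return [field for field in REQUIRED_FIELDS if not entry.get(field)]


# Hypothetical entry used only for illustration.
example_entry = {
    "name": "example-backend",
    "description": "Illustrative entry",
    "license": "MIT",
    "capabilities": {"default": "cpu"},  # assumed shape: target -> profile
    "tags": ["text-to-text"],
    "urls": ["https://example.com/source"],
}
```

A check like this could run before an entry is added, so that incomplete metadata is caught early.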
Building Backends
Prerequisites
- Docker with multi-architecture support
- Appropriate hardware drivers (CUDA, ROCm, etc.)
- Build tools (make, cmake, compilers)
Build Commands
Example build commands using Docker:
# Build Python backend
docker build -f backend/Dockerfile.python \
  --build-arg BACKEND=transformers \
  --build-arg BUILD_TYPE=cublas12 \
  --build-arg CUDA_MAJOR_VERSION=12 \
  --build-arg CUDA_MINOR_VERSION=0 \
  -t localai-backend-transformers .

# Build Go backend
docker build -f backend/Dockerfile.golang \
  --build-arg BACKEND=whisper \
  --build-arg BUILD_TYPE=cpu \
  -t localai-backend-whisper .

# Build C++ backend
docker build -f backend/Dockerfile.llama-cpp \
  --build-arg BACKEND=llama-cpp \
  --build-arg BUILD_TYPE=cublas12 \
  -t localai-backend-llama-cpp .
For ARM64/Mac builds, Docker cannot be used; use the Makefile in the respective backend directory instead.
Build Types
- cpu: CPU-only optimization
- cublas12, cublas13: CUDA 12.x, 13.x with cuBLAS
- hipblas: ROCm with rocBLAS
- intel: Intel oneAPI optimization
- vulkan: Vulkan-based acceleration
- metal: Apple Metal optimization
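The build commands and build types above follow a regular pattern, which a small helper can assemble and validate. The sketch below is an illustration only: `docker_build_command` is a hypothetical helper, and the Dockerfile paths, build-arg names, and image tag format are copied from the example commands in this section.

```python
# Dockerfile per backend language, as used in the example build commands.
DOCKERFILES = {
    "python": "backend/Dockerfile.python",
    "golang": "backend/Dockerfile.golang",
    "llama-cpp": "backend/Dockerfile.llama-cpp",
}
# Build types listed in this section.
BUILD_TYPES = {"cpu", "cublas12", "cublas13", "hipblas", "intel", "vulkan", "metal"}


def docker_build_command(language, backend, build_type, extra_args=None):
    """Return the argv list for `docker build`; raise ValueError on unknown inputs."""
    if language not in DOCKERFILES:
        raise ValueError(f"unknown language: {language}")
    if build_type not in BUILD_TYPES:
        raise ValueError(f"unknown build type: {build_type}")
    # extra_args carries optional build args such as CUDA_MAJOR_VERSION.
    args = {"BACKEND": backend, "BUILD_TYPE": build_type, **(extra_args or {})}
    cmd = ["docker", "build", "-f", DOCKERFILES[language]]
    for key, value in args.items():
        cmd += ["--build-arg", f"{key}={value}"]
    cmd += ["-t", f"localai-backend-{backend}", "."]
    return cmd
```

For instance, `docker_build_command("golang", "whisper", "cpu")` yields the same arguments as the Go example above; the list can be passed to `subprocess.run` or joined with `shlex.join` for display.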
Backend Development
Creating a New Backend
- Choose Language: Select Python, Go, or C++ based on requirements
- Implement Interface: Implement the gRPC service defined in backend.proto
- Add Dependencies: Create appropriate requirements files
- Configure Build: Set up Dockerfile and build scripts
- Register Backend: Add entry to index.yaml
- Test Integration: Verify gRPC communication and functionality
Backend Structure
backend-name/
├── backend.py/go/cpp # Main implementation
├── requirements.txt # Dependencies
├── Dockerfile # Build configuration
├── install.sh # Installation script
├── run.sh # Execution script
├── test.sh # Test script
└── README.md # Backend documentation
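A layout like the one above can be checked mechanically before submitting a new backend. The following is a minimal sketch, assuming every listed file is expected; that is probably too strict for non-Python backends, which may not ship a requirements.txt. The `missing_files` helper is hypothetical.

```python
from pathlib import Path

# File names taken from the layout above. The main implementation may be
# backend.py, backend.go, or backend.cpp depending on the language.
EXPECTED_FILES = ("requirements.txt", "Dockerfile", "install.sh", "run.sh", "test.sh", "README.md")
MAIN_CANDIDATES = ("backend.py", "backend.go", "backend.cpp")


def missing_files(backend_dir):
    """List expected files that are absent from a backend directory."""
    root = Path(backend_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    # Accept any one of the language-specific main files.
    if not any((root / name).is_file() for name in MAIN_CANDIDATES):
        missing.insert(0, "backend.py/go/cpp")
    return missing
```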
Required gRPC Methods
At minimum, backends must implement:
- Health() - Service health check
- LoadModel() - Model loading and initialization
- Predict() - Main inference endpoint
- Status() - Backend status and metrics
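The shape of these four methods can be sketched without any gRPC plumbing. The toy class below is not the generated servicer: the `Result` dataclass, its fields, the state strings, and the argument names are all assumptions for illustration, and the real message definitions live in backend.proto.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Stand-in for the Result message; the real fields are defined in backend.proto."""
    message: str
    success: bool = True


class EchoBackend:
    """Toy backend exposing the four required methods, with no gRPC transport."""

    def __init__(self):
        self.model = None

    def Health(self):
        return Result("OK")

    def LoadModel(self, model_path):
        # A real backend would load weights here; this sketch only records the path.
        self.model = model_path
        return Result(f"loaded {model_path}")

    def Predict(self, prompt):
        if self.model is None:
            return Result("no model loaded", success=False)
        # Placeholder inference: echo the prompt in upper case.
        return Result(prompt.upper())

    def Status(self):
        # Hypothetical state names; the real StatusResponse also reports memory usage.
        return {"state": "READY" if self.model else "UNINITIALIZED"}
```

In a real backend, each method would take and return the protobuf messages generated from backend.proto, and the class would be registered with a gRPC server.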
Integration with LocalAI Core
Backends communicate with LocalAI core through gRPC:
- Service Discovery: Core discovers available backends
- Model Loading: Core requests model loading via LoadModel
- Inference: Core sends requests via Predict or specialized endpoints
- Streaming: Core handles streaming responses for real-time generation
- Monitoring: Core tracks backend health and performance
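The order of operations described above (health check, model load, then streamed inference) can be illustrated with an in-process stand-in. Everything in this sketch is hypothetical: `StreamingBackend` replaces a real backend reached over gRPC, and `serve_request` is not LocalAI's actual core logic.

```python
class StreamingBackend:
    """Hypothetical in-process stand-in for a backend reached over gRPC."""

    def __init__(self):
        self.loaded = None

    def health(self):
        return True

    def load_model(self, name):
        self.loaded = name

    def predict_stream(self, prompt):
        # Yield one token at a time, as a streaming RPC would send partial results.
        for token in prompt.split():
            yield token


def serve_request(backend, model, prompt):
    """Health check, load the model if needed, then collect the streamed tokens."""
    if not backend.health():
        raise RuntimeError("backend is not healthy")
    if backend.loaded != model:
        backend.load_model(model)
    return list(backend.predict_stream(prompt))
```

A generator is used for `predict_stream` because it lets the caller consume partial output as it is produced, which is the property a streaming RPC provides.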
Performance Optimization
Memory Management
- Model Caching: Efficient model loading and caching
- Batch Processing: Optimize for multiple concurrent requests
- Memory Pinning: GPU memory optimization for CUDA/ROCm
Hardware Utilization
- Multi-GPU: Support for tensor parallelism
- Mixed Precision: FP16/BF16 for memory efficiency
- Kernel Fusion: Optimized CUDA/ROCm kernels
Troubleshooting
Common Issues
- gRPC Connection: Verify backend service is running and accessible
- Model Loading: Check model paths and dependencies
- Hardware Detection: Ensure appropriate drivers and libraries
- Memory Issues: Monitor GPU memory usage and model sizes
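For the connection issue in particular, a first diagnostic step is to check whether the backend's port accepts TCP connections at all. The helper below is a minimal sketch using only the standard library; `backend_reachable` is a hypothetical name, and the host and port depend on how the backend was started.

```python
import socket


def backend_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

A successful connection only shows that something is listening on the port; it does not confirm that the gRPC service is healthy, which still requires a Health call.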
Contributing
When contributing to the backend system:
- Follow Protocol: Implement the exact gRPC interface
- Add Tests: Include comprehensive test coverage
- Document: Provide clear usage examples
- Optimize: Consider performance and resource usage
- Validate: Test across different hardware targets