* feat(crispasr): backend source files (Go gRPC server, C-ABI shim, build files) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * polish(crispasr): brand error strings + fix stale shim comment Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * build(crispasr): register backend in root Makefile Mirror the whisper Go backend registration for the new crispasr backend: NOTPARALLEL entry, prepare-test-extra/test-extra hooks, BACKEND_CRISPASR definition, docker-build target generation, and the docker-build-backends aggregate target. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(crispasr): add backend build matrix entries Mirror the 11 whisper golang Dockerfile matrix entries (CPU amd64/arm64, CUDA 12/13, L4T CUDA 13, Intel SYCL f32/f16, Vulkan amd64/arm64, L4T arm64, ROCm hipblas) with backend and tag-suffix substituted to crispasr. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(gallery): add crispasr backend gallery entries Add the crispasr meta anchor and its full set of image gallery entries (cpu, metal, cuda12/13, rocm, intel-sycl f32/f16, vulkan, L4T arm64, L4T cuda13 arm64, plus -development variants), mirroring the whisper backend gallery block. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(crispasr): bump CRISPASR_VERSION via bump_deps workflow Track CrispStrobe/CrispASR main branch and bump CRISPASR_VERSION in backend/go/crispasr/Makefile. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * build(crispasr): don't wire fixture-gated test into test-extra Mirror the whisper Go backend: its AudioTranscription test is gated on model/audio fixtures and skips in CI, so building crispasr (the heaviest ggml compile in the tree) inside the unit-test lane adds a long compile for zero coverage. The backend image build in backend-matrix.yml remains the authoritative compile check. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(crispasr): add darwin metal build entry (mirror whisper) The metal-crispasr gallery entries and capabilities.metal mapping reference -metal-darwin-arm64-crispasr, which is only produced by an includeDarwin entry. Mirror whisper's darwin metal entry so the tag actually gets built. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(crispasr): place hipblas matrix entry next to whisper twin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(crispasr): register crispasr as pref-only ASR backend + test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(crispasr): port whisper behavioral suite (cancellation + streaming) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(crispasr): fix skip message env var names to CRISPASR_* Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(crispasr): switch shim to crispasr_session_* multi-architecture API The shim used whisper_full(), which in CrispASR is the whisper-only path: libcrispasr only transcribes Whisper GGUFs through it. Multi-architecture transcription (Parakeet, Voxtral, Qwen3-ASR, Canary, Granite, FunASR, Paraformer, SenseVoice, ...) goes through the crispasr_session_* C-ABI, which auto-detects the architecture from the GGUF and dispatches to the matching backend. Rewrite the C shim around crispasr_session_open / _transcribe_lang / _result_* and add get_backend() so the selected backend is logged. load_model now takes a threads param (session_open binds n_threads at open). The session result is segment+word based with no token IDs and no per-decode callback, so drop n_tokens / get_token_id / get_segment_speaker_turn_next / set_new_segment_callback. set_abort is kept for API parity but is best-effort: the session transcribe is blocking with no abort hook. Update the purego bindings and gocrispasr.go to match: tokens are left empty, speaker-turn handling is removed, and AudioTranscriptionStream emits one delta per non-empty segment after the blocking decode returns (no progressive streaming via the session API), preserving the concat(deltas) == final.Text invariant. crispasr_session_set_translate is exported by libcrispasr but not declared in crispasr.h, so it is forward-declared in the shim alongside the open/transcribe/result functions. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * build(crispasr): link full CrispASR backend set for multi-arch support The shim's crispasr_session_* dispatch calls into the per-architecture backend libs (parakeet, voxtral, qwen3_asr, canary, funasr, paraformer, sensevoice, ...), which CrispASR builds as static archives. Linking only crispasr + ggml dead-stripped every backend object from the final module (nm backend-symbol count: 0), leaving a whisper-only .so. Link the same backend set as crispasr-cli so the static archives are pulled in. After this the module carries the backend symbols (nm count 407, .so grows from ~2.1MB to ~6.7MB) and the session API can dispatch to every compiled-in architecture. Also rewrite ${CMAKE_SOURCE_DIR}/examples/talk-llama to ${PROJECT_SOURCE_DIR}/... in the vendored src/CMakeLists.txt: CrispASR locates its vendored llama.cpp via ${CMAKE_SOURCE_DIR}, which is wrong when CrispASR is add_subdirectory'd (CMAKE_SOURCE_DIR points at this backend dir, not the CrispASR root). PROJECT_SOURCE_DIR is correct both standalone and as a subproject; the sed is idempotent. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(crispasr): adapt suite to session API (blocking, no decode callback) Register the new symbol set (drop the removed token/speaker/callback funcs, add get_backend; load_model now takes 2 args). The session transcribe is blocking with no abort hook, so a mid-decode cancel can't interrupt it: change the cancellation spec to cancel the context before the call and assert codes.Canceled from the pre-call ctx.Err() check, dropping the <5s mid-decode timing assertion. The streaming spec still holds with per-segment post-decode emission (>=2 deltas, concat(deltas) == final.Text). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(gallery): add CrispASR ASR model entries (-crispasr) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(gallery): keep only session-auto-detectable CrispASR ASR models The crispasr backend loads models via crispasr_session_open, which auto-detects the backend from the GGUF general.architecture using crispasr_detect_backend_from_gguf. Architectures not in that detect map cannot be opened, so those gallery entries fail to load. Removed entries whose architecture is not wired into CrispASR v0.6.11's session auto-detect router (they can be re-added when upstream maps them): - Not in the detect map: data2vec, firered-asr, funasr, fun-asr-mlt-nano, glm-asr, hubert, kyutai-stt, mega-asr, mimo-asr, moonshine{,-de,-streaming,-tiny-de}, omniasr{,-llm,-llm-1b}, paraformer, sensevoice. - Pending verification (filename-heuristic routed, not arch-detected): parakeet-ctc-0.6b, parakeet-ctc-1.1b. Their GGUFs are routed to the fastconformer-ctc backend by a filename heuristic in the model registry, which implies general.architecture is not a mapped string. Kept the parakeet rnnt/tdt_ctc variants: convert-parakeet-to-gguf.py writes general.architecture="parakeet" unconditionally and encodes the rnnt/ctc distinction in metadata fields, so they session-auto-detect. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(crispasr): TTS synthesis via crispasr_session_synthesize (24kHz) Add tts_synthesize/tts_free/tts_set_voice to the C-ABI shim. They reuse the already-open g_session (crispasr_session_open auto-detects a TTS model) and dispatch to the upstream synthesis call, which returns malloc'd 24 kHz mono float PCM. Orpheus needs a SNAC codec path that we do not set, so it returns NULL here and surfaces as an error Go-side. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(crispasr): implement TTS/TTSStream gRPC methods Bind the new shim functions via purego and implement TTS, TTSStream and a writeWAV24k helper. synthesize copies the C-owned PCM out before freeing it; TTS writes a 24 kHz mono 16-bit WAV to req.Dst via go-audio/wav. CrispASR has no progressive synth, so TTSStream synthesizes fully, encodes to WAV, and emits the bytes as a single chunk; it owns the results-channel close (the gRPC server wrapper ranges until close), mirroring vibevoice-cpp's TTSStream. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(crispasr): log when a TTS voice override is not honored Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(gallery): add CrispASR vibevoice-tts model entry Only vibevoice-tts works through the current shim: qwen3-tts, chatterbox, and orpheus require companion codec/s3gen/SNAC paths (set_codec_path / set_s3gen_path) that the shim doesn't wire yet, and kokoro/indextts/voxcpm2 aren't in the session auto-detect map. Those are follow-ups. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(crispasr): gated TTS synthesis spec Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(crispasr): satisfy golangci-lint (errcheck defers + unsafeptr nolint) The crispasr Go file is entirely new, so new-from-merge-base lints every line (unlike the grandfathered whisper backend it was forked from): - handle os.RemoveAll / fh.Close return values in AudioTranscription - annotate the two intentional C-pointer unsafe.Slice sites with //nolint:govet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(crispasr): backend: and codec: model options (explicit arch + companion files) Add two model-config options to the CrispASR backend via opts.Options: - backend:<name> selects an explicit CrispASR backend (bypassing auto-detect) by routing load_model through crispasr_session_open_explicit, unlocking architectures the detector won't pick on its own (qwen3, cohere, granite, voxtral, moonshine, mimo-asr, orpheus, kokoro, chatterbox, etc.). - codec:<path> loads a companion file (qwen3-tts codec, orpheus SNAC, chatterbox s3gen, or mimo-asr tokenizer) via the universal crispasr_session_set_codec_path setter after the session opens. A relative path resolves against the model directory. rc==0 means success or not-applicable; only a negative rc is fatal. The C shim load_model gains a backend_name argument and a new set_codec_path entry point; the Go bridge parses the prefix:value options and registers the new symbol. The vad_only path is unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(gallery): expand CrispASR models via backend:/codec: options (explicit arch + companions) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(gallery): use virtual.yaml base for crispasr models The crispasr entries are just backend + model + a couple options, fully expressed inline via overrides:/files: in gallery/index.yaml. Point each url: at the shared gallery/virtual.yaml (the established 'virtual' model trick) and drop the 36 redundant per-model gallery/*-crispasr.yaml files. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(gallery): drop voice-requiring TTS entries (keep vibevoice-tts) Real e2e showed qwen3-tts/orpheus/chatterbox don't synthesize through the current shim: the codec: companion loads fine, but these engines additionally need a voice pack / voice prompt / reference clip (qwen3-tts base errors 'no voice'; chatterbox is zero-shot cloning; orpheus uses named voices) that the backend doesn't wire. (qwen3-tts also can't auto-detect: its GGUF arch is 'qwen3tts', unmapped by the detector — would need backend:qwen3-tts.) Removed to avoid shipping non-working gallery entries; vibevoice-tts (built-in voice, e2e-verified) remains the working TTS. Voice-pack wiring is a follow-up. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(crispasr): speaker: and voice: TTS options (baked speakers + voice packs/prompts) speaker:<name> -> crispasr_session_set_speaker_name (baked speakers: qwen3-tts CustomVoice, orpheus). voice:<path>(+voice_text:<ref>) -> crispasr_session_set_voice (voice-pack GGUF, or WAV zero-shot clone with ref text). Applied at Load as the default voice; req.Voice still overrides the speaker per request. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(gallery): re-add e2e-verified TTS engines (chatterbox, qwen3-tts-customvoice, orpheus) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
LocalAI Backend Architecture
This directory contains the core backend infrastructure for LocalAI, including the gRPC protocol definition, multi-language Dockerfiles, and language-specific backend implementations.
Overview
LocalAI uses a unified gRPC-based architecture that allows different programming languages to implement AI backends while maintaining consistent interfaces and capabilities. The backend system supports multiple hardware acceleration targets and provides a standardized way to integrate various AI models and frameworks.
Architecture Components
1. Protocol Definition (backend.proto)
The backend.proto file defines the gRPC service interface that all backends must implement. This ensures consistency across different language implementations and provides a contract for communication between LocalAI core and backend services.
Core Services
- Text Generation:
Predict,PredictStreamfor LLM inference - Embeddings:
Embeddingfor text vectorization - Image Generation:
GenerateImagefor stable diffusion and image models - Audio Processing:
AudioTranscription,TTS,SoundGeneration - Video Generation:
GenerateVideofor video synthesis - Object Detection:
Detectfor computer vision tasks - Vector Storage:
StoresSet,StoresGet,StoresFindfor RAG operations - Reranking:
Rerankfor document relevance scoring - Voice Activity Detection:
VADfor audio segmentation
Key Message Types
PredictOptions: Comprehensive configuration for text generationModelOptions: Model loading and configuration parametersResult: Standardized response formatStatusResponse: Backend health and memory usage information
2. Multi-Language Dockerfiles
The backend system provides language-specific Dockerfiles that handle the build environment and dependencies for different programming languages:
Dockerfile.pythonDockerfile.golangDockerfile.llama-cpp
3. Language-Specific Implementations
Python Backends (python/)
- transformers: Hugging Face Transformers framework
- vllm: High-performance LLM inference
- mlx: Apple Silicon optimization
- diffusers: Stable Diffusion models
- Audio: coqui, faster-whisper, kitten-tts
- Vision: mlx-vlm, rfdetr
- Specialized: rerankers, chatterbox, kokoro
Go Backends (go/)
- whisper: OpenAI Whisper speech recognition in Go with GGML cpp backend (whisper.cpp)
- stablediffusion-ggml: Stable Diffusion in Go with GGML Cpp backend
- piper: Text-to-speech synthesis Golang with C bindings using rhaspy/piper
- local-store: Vector storage backend
C++ Backends (cpp/)
- llama-cpp: Llama.cpp integration
- grpc: GRPC utilities and helpers
Hardware Acceleration Support
CUDA (NVIDIA)
- Versions: CUDA 12.x, 13.x
- Features: cuBLAS, cuDNN, TensorRT optimization
- Targets: x86_64, ARM64 (Jetson)
ROCm (AMD)
- Features: HIP, rocBLAS, MIOpen
- Targets: AMD GPUs with ROCm support
Intel
- Features: oneAPI, Intel Extension for PyTorch
- Targets: Intel GPUs, XPUs, CPUs
Vulkan
- Features: Cross-platform GPU acceleration
- Targets: Windows, Linux, Android, macOS
Apple Silicon
- Features: MLX framework, Metal Performance Shaders
- Targets: M1/M2/M3 Macs
Backend Registry (index.yaml)
The index.yaml file serves as a central registry for all available backends, providing:
- Metadata: Name, description, license, icons
- Capabilities: Hardware targets and optimization profiles
- Tags: Categorization for discovery
- URLs: Source code and documentation links
Building Backends
Prerequisites
- Docker with multi-architecture support
- Appropriate hardware drivers (CUDA, ROCm, etc.)
- Build tools (make, cmake, compilers)
Build Commands
Example of build commands with Docker
# Build Python backend
docker build -f backend/Dockerfile.python \
--build-arg BACKEND=transformers \
--build-arg BUILD_TYPE=cublas12 \
--build-arg CUDA_MAJOR_VERSION=12 \
--build-arg CUDA_MINOR_VERSION=0 \
-t localai-backend-transformers .
# Build Go backend
docker build -f backend/Dockerfile.golang \
--build-arg BACKEND=whisper \
--build-arg BUILD_TYPE=cpu \
-t localai-backend-whisper .
# Build C++ backend
docker build -f backend/Dockerfile.llama-cpp \
--build-arg BACKEND=llama-cpp \
--build-arg BUILD_TYPE=cublas12 \
-t localai-backend-llama-cpp .
For ARM64/Mac builds, docker can't be used, and the makefile in the respective backend has to be used.
Build Types
cpu: CPU-only optimizationcublas12,cublas13: CUDA 12.x, 13.x with cuBLAShipblas: ROCm with rocBLASintel: Intel oneAPI optimizationvulkan: Vulkan-based accelerationmetal: Apple Metal optimization
Backend Development
Creating a New Backend
- Choose Language: Select Python, Go, or C++ based on requirements
- Implement Interface: Implement the gRPC service defined in
backend.proto - Add Dependencies: Create appropriate requirements files
- Configure Build: Set up Dockerfile and build scripts
- Register Backend: Add entry to
index.yaml - Test Integration: Verify gRPC communication and functionality
Backend Structure
backend-name/
├── backend.py/go/cpp # Main implementation
├── requirements.txt # Dependencies
├── Dockerfile # Build configuration
├── install.sh # Installation script
├── run.sh # Execution script
├── test.sh # Test script
└── README.md # Backend documentation
Required gRPC Methods
At minimum, backends must implement:
Health()- Service health checkLoadModel()- Model loading and initializationPredict()- Main inference endpointStatus()- Backend status and metrics
Integration with LocalAI Core
Backends communicate with LocalAI core through gRPC:
- Service Discovery: Core discovers available backends
- Model Loading: Core requests model loading via
LoadModel - Inference: Core sends requests via
Predictor specialized endpoints - Streaming: Core handles streaming responses for real-time generation
- Monitoring: Core tracks backend health and performance
Performance Optimization
Memory Management
- Model Caching: Efficient model loading and caching
- Batch Processing: Optimize for multiple concurrent requests
- Memory Pinning: GPU memory optimization for CUDA/ROCm
Hardware Utilization
- Multi-GPU: Support for tensor parallelism
- Mixed Precision: FP16/BF16 for memory efficiency
- Kernel Fusion: Optimized CUDA/ROCm kernels
Troubleshooting
Common Issues
- GRPC Connection: Verify backend service is running and accessible
- Model Loading: Check model paths and dependencies
- Hardware Detection: Ensure appropriate drivers and libraries
- Memory Issues: Monitor GPU memory usage and model sizes
Contributing
When contributing to the backend system:
- Follow Protocol: Implement the exact gRPC interface
- Add Tests: Include comprehensive test coverage
- Document: Provide clear usage examples
- Optimize: Consider performance and resource usage
- Validate: Test across different hardware targets