Files
LocalAI/backend
LocalAI [bot] 0258f8af55 fix(backends): repair release CI build/test breaks (kokoros, fish-speech, llama-cpp-quantization, sglang) (#10547)
* fix(kokoros): implement new Backend RPCs to fix the build

The backend.proto grew six RPCs (SoundDetection, Depth, TokenClassify,
Score and the bidi-streaming Forward) that the kokoros gRPC service never
implemented, so the trait impl no longer satisfies `Backend`:

    error[E0046]: not all trait items implemented, missing:
      `sound_detection`, `depth`, `token_classify`, `score`,
      `ForwardStream`, `forward`

kokoros is a TTS backend with no use for these, so add `unimplemented`
stubs (plus the `ForwardStream` associated type) matching the existing
pattern for every other unsupported RPC in this file.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(fish-speech): add setuptools-rust for the editable source install

install.sh installs the fish-speech source tree editable with
`--no-build-isolation`, which means the build backends of its transitive
dependencies must already be present in the venv. One of them builds a
Rust extension and its metadata step fails with:

    ModuleNotFoundError: No module named 'setuptools_rust'

Add setuptools-rust to requirements.txt so installRequirements provisions
it before the editable install runs.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(llama-cpp-quantization): vendor convert_hf_to_gguf.py with conversion/

Upstream llama.cpp split the model-specific logic out of the single
convert_hf_to_gguf.py file into a sibling `conversion/` package, so the
script now starts with `from conversion import ...`. Downloading just the
one file therefore fails at runtime with:

    ModuleNotFoundError: No module named 'conversion'

Clone the repo (reusing the clone already needed to build llama-quantize)
and copy both the script and the `conversion/` package into the backend
dir. Python puts the script's own directory on sys.path[0], so the package
resolves when it sits beside the script.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(sglang): pin the CPU source build to sglang v0.5.11

The CPU profile builds sgl-kernel from a `git clone` of sglang with no
ref, so it always tracks master. Recent master added CPU kernels (e.g.
mamba/fla.cpp) that fail to compile in our builder:

    constexpr variable 'scale' must be initialized by a constant
    static library kineto_LIBRARY-NOTFOUND not found

Pin the clone to v0.5.11, the same release the GPU path already floors on
(requirements-cublas12-after.txt). Overridable via SGLANG_VERSION so the
pin can be bumped deliberately.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 09:42:22 +02:00
..

LocalAI Backend Architecture

This directory contains the core backend infrastructure for LocalAI, including the gRPC protocol definition, multi-language Dockerfiles, and language-specific backend implementations.

Overview

LocalAI uses a unified gRPC-based architecture that allows different programming languages to implement AI backends while maintaining consistent interfaces and capabilities. The backend system supports multiple hardware acceleration targets and provides a standardized way to integrate various AI models and frameworks.

Architecture Components

1. Protocol Definition (backend.proto)

The backend.proto file defines the gRPC service interface that all backends must implement. This ensures consistency across different language implementations and provides a contract for communication between LocalAI core and backend services.

Core Services

  • Text Generation: Predict, PredictStream for LLM inference
  • Embeddings: Embedding for text vectorization
  • Image Generation: GenerateImage for stable diffusion and image models
  • Audio Processing: AudioTranscription, TTS, SoundGeneration
  • Video Generation: GenerateVideo for video synthesis
  • Object Detection: Detect for computer vision tasks
  • Vector Storage: StoresSet, StoresGet, StoresFind for RAG operations
  • Reranking: Rerank for document relevance scoring
  • Voice Activity Detection: VAD for audio segmentation

Key Message Types

  • PredictOptions: Comprehensive configuration for text generation
  • ModelOptions: Model loading and configuration parameters
  • Result: Standardized response format
  • StatusResponse: Backend health and memory usage information

2. Multi-Language Dockerfiles

The backend system provides language-specific Dockerfiles that handle the build environment and dependencies for different programming languages:

  • Dockerfile.python
  • Dockerfile.golang
  • Dockerfile.llama-cpp

3. Language-Specific Implementations

Python Backends (python/)

  • transformers: Hugging Face Transformers framework
  • vllm: High-performance LLM inference
  • mlx: Apple Silicon optimization
  • diffusers: Stable Diffusion models
  • Audio: coqui, faster-whisper, kitten-tts
  • Vision: mlx-vlm, rfdetr
  • Specialized: rerankers, chatterbox, kokoro

Go Backends (go/)

  • whisper: OpenAI Whisper speech recognition in Go with GGML cpp backend (whisper.cpp)
  • stablediffusion-ggml: Stable Diffusion in Go with GGML Cpp backend
  • piper: Text-to-speech synthesis Golang with C bindings using rhaspy/piper
  • local-store: Vector storage backend

C++ Backends (cpp/)

  • llama-cpp: Llama.cpp integration
  • grpc: GRPC utilities and helpers

Hardware Acceleration Support

CUDA (NVIDIA)

  • Versions: CUDA 12.x, 13.x
  • Features: cuBLAS, cuDNN, TensorRT optimization
  • Targets: x86_64, ARM64 (Jetson)

ROCm (AMD)

  • Features: HIP, rocBLAS, MIOpen
  • Targets: AMD GPUs with ROCm support

Intel

  • Features: oneAPI, Intel Extension for PyTorch
  • Targets: Intel GPUs, XPUs, CPUs

Vulkan

  • Features: Cross-platform GPU acceleration
  • Targets: Windows, Linux, Android, macOS

Apple Silicon

  • Features: MLX framework, Metal Performance Shaders
  • Targets: M1/M2/M3 Macs

Backend Registry (index.yaml)

The index.yaml file serves as a central registry for all available backends, providing:

  • Metadata: Name, description, license, icons
  • Capabilities: Hardware targets and optimization profiles
  • Tags: Categorization for discovery
  • URLs: Source code and documentation links

Building Backends

Prerequisites

  • Docker with multi-architecture support
  • Appropriate hardware drivers (CUDA, ROCm, etc.)
  • Build tools (make, cmake, compilers)

Build Commands

Example of build commands with Docker

# Build Python backend
docker build -f backend/Dockerfile.python \
  --build-arg BACKEND=transformers \
  --build-arg BUILD_TYPE=cublas12 \
  --build-arg CUDA_MAJOR_VERSION=12 \
  --build-arg CUDA_MINOR_VERSION=0 \
  -t localai-backend-transformers .

# Build Go backend
docker build -f backend/Dockerfile.golang \
  --build-arg BACKEND=whisper \
  --build-arg BUILD_TYPE=cpu \
  -t localai-backend-whisper .

# Build C++ backend
docker build -f backend/Dockerfile.llama-cpp \
  --build-arg BACKEND=llama-cpp \
  --build-arg BUILD_TYPE=cublas12 \
  -t localai-backend-llama-cpp .

For ARM64/Mac builds, docker can't be used, and the makefile in the respective backend has to be used.

Build Types

  • cpu: CPU-only optimization
  • cublas12, cublas13: CUDA 12.x, 13.x with cuBLAS
  • hipblas: ROCm with rocBLAS
  • intel: Intel oneAPI optimization
  • vulkan: Vulkan-based acceleration
  • metal: Apple Metal optimization

Backend Development

Creating a New Backend

  1. Choose Language: Select Python, Go, or C++ based on requirements
  2. Implement Interface: Implement the gRPC service defined in backend.proto
  3. Add Dependencies: Create appropriate requirements files
  4. Configure Build: Set up Dockerfile and build scripts
  5. Register Backend: Add entry to index.yaml
  6. Test Integration: Verify gRPC communication and functionality

Backend Structure

backend-name/
├── backend.py/go/cpp    # Main implementation
├── requirements.txt      # Dependencies
├── Dockerfile           # Build configuration
├── install.sh           # Installation script
├── run.sh              # Execution script
├── test.sh             # Test script
└── README.md           # Backend documentation

Required gRPC Methods

At minimum, backends must implement:

  • Health() - Service health check
  • LoadModel() - Model loading and initialization
  • Predict() - Main inference endpoint
  • Status() - Backend status and metrics

Integration with LocalAI Core

Backends communicate with LocalAI core through gRPC:

  1. Service Discovery: Core discovers available backends
  2. Model Loading: Core requests model loading via LoadModel
  3. Inference: Core sends requests via Predict or specialized endpoints
  4. Streaming: Core handles streaming responses for real-time generation
  5. Monitoring: Core tracks backend health and performance

Performance Optimization

Memory Management

  • Model Caching: Efficient model loading and caching
  • Batch Processing: Optimize for multiple concurrent requests
  • Memory Pinning: GPU memory optimization for CUDA/ROCm

Hardware Utilization

  • Multi-GPU: Support for tensor parallelism
  • Mixed Precision: FP16/BF16 for memory efficiency
  • Kernel Fusion: Optimized CUDA/ROCm kernels

Troubleshooting

Common Issues

  1. GRPC Connection: Verify backend service is running and accessible
  2. Model Loading: Check model paths and dependencies
  3. Hardware Detection: Ensure appropriate drivers and libraries
  4. Memory Issues: Monitor GPU memory usage and model sizes

Contributing

When contributing to the backend system:

  1. Follow Protocol: Implement the exact gRPC interface
  2. Add Tests: Include comprehensive test coverage
  3. Document: Provide clear usage examples
  4. Optimize: Consider performance and resource usage
  5. Validate: Test across different hardware targets