Files
LocalAI/backend
LocalAI [bot] 4ac67d255d feat: single-build ggml CPU_ALL_VARIANTS for llama-cpp + turboquant (x86/arm64/apple) (#10497)
* feat(llama-cpp): single x86 CPU build via ggml CPU_ALL_VARIANTS

Replace the per-microarch avx/avx2/avx512/fallback multi-binary build on
x86 with a single grpc-server plus the dlopen-able libggml-cpu-*.so set
that ggml's backend registry selects at runtime by probing host CPU
features. One build instead of four, broader microarch coverage (adds
alderlake AVX-VNNI, zen4 AVX512-BF16, sapphirerapids AMX), and the
shell-side /proc/cpuinfo probing in run.sh goes away.

Build/link notes:
- CPU_ALL_VARIANTS requires GGML_BACKEND_DL + BUILD_SHARED_LIBS=ON, so
  ggml/llama become shared objects. SHARED_LIBS is now a make variable
  (default OFF) so the override survives the recursive sub-make into the
  VARIANT build dir instead of being re-clobbered by the base flags.
- The cpu-all target also builds "--target ggml": the per-microarch
  backends are runtime-dlopened, not link deps, so they only compile via
  ggml's add_dependencies().
- hw_grpc_proto is pinned STATIC. Under BUILD_SHARED_LIBS=ON it would
  otherwise become a DSO referencing hidden-visibility symbols in the
  static libprotobuf.a, which fails to link ("hidden symbol ... is
  referenced by DSO"). Keeping it static links gRPC/protobuf into the
  executable while only ggml/llama stay shared, so no PIC or base-image
  change is required.
- package.sh bundles the libggml-*.so set into package/lib; ggml finds
  them by scanning the bundled ld.so directory (/proc/self/exe), which
  run.sh launches from.

Scope: x86 only. arm64/darwin keep the single fallback build. The
ik-llama-cpp / turboquant forks and the other ggml C++ backends are
unchanged; the same recipe applies but is out of scope here.

Validated with a full docker build plus a live inference smoke test:
the model loads, ggml selects the AVX512_BF16 variant on a Zen-class
host, and tokens generate correctly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(llama-cpp,turboquant): extend CPU_ALL_VARIANTS to arm64 + turboquant

- llama-cpp: x86 AND arm64 now use the single llama-cpp-cpu-all build
  (only hipblas keeps the fallback build). ggml's arm64 variant table
  (armv8.x / armv9.x, plus apple_m* on darwin) is selected at runtime.
- turboquant: same recipe via a turboquant-cpu-all target. turboquant
  copies backend/cpp/llama-cpp's CMakeLists.txt + Makefile per flavor, so
  the hw_grpc_proto STATIC fix and the SHARED_LIBS / EXTRA_CMAKE_ARGS
  make-vars are inherited; the target just passes SHARED_LIBS=ON, the DL
  flags and --target ggml through, then collects the .so set. run.sh and
  package.sh updated to ship/select turboquant-cpu-all.
- Makefile lib-collection find now also matches *.dylib (for the darwin
  build, which emits dylibs rather than .so).

ik-llama-cpp is intentionally left unchanged: its pinned ggml has no
CPU_ALL_VARIANTS support and its IQK kernels require AVX2, so the
per-microarch dynamic backend set does not apply.

Scope still excludes the darwin packaging wiring (separate change).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(llama-cpp,turboquant): arm64 gcc-14 for SME variants + darwin cpu-all packaging

- arm64: ggml CPU_ALL_VARIANTS builds armv9.2 SME variants whose -march=...+sme
  is rejected by the Ubuntu 24.04 default gcc-13. Build the arm64 variants with
  gcc-14 (installed in the compile step). The host only selects a variant it
  actually supports at runtime, but every variant must still compile.
- darwin: scripts/build/llama-cpp-darwin.sh builds llama-cpp-cpu-all instead of
  the fallback binary, keeps Metal (GGML_METAL stays ON; --target ggml also builds
  ggml-metal). The per-microarch libggml-cpu-*.dylib are placed in the package
  root next to the binary (darwin has no bundled ld.so, so ggml's executable-dir
  scan looks there), while the other shared dylibs go in lib/ for DYLD_LIBRARY_PATH.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(llama-cpp-darwin): distribute ggml backends by suffix (.so root, .dylib lib)

ggml emits its loadable backends (per-microarch CPU variants, metal, blas) with a
.so suffix even on darwin, while the core libraries (ggml-base/ggml/llama/
llama-common/mtmd) use .dylib. Split the distribution by suffix: .so DL backends
go in the package root for ggml's executable-directory scan, .dylib core libs go
in lib/ for DYLD_LIBRARY_PATH. The previous .dylib name-pattern matched none of the
variants.

Verified on an M4: ggml loads the apple_m4 CPU variant (SME=1) and Metal, model
loads and generates correct tokens.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(llama-cpp,turboquant): only CPU_ALL_VARIANTS for pure-CPU builds, GPU uses fallback

The previous gate sent every non-hipblas build through llama-cpp-cpu-all, so the
GPU image builds (cublas, sycl_f16/f32, vulkan, nvidia l4t) compiled the whole CPU
microarch variant matrix on top of their already-huge GPU backend - blowing the
build time (the sycl job was only 59% done after 2h11m) - and the arm64 l4t build
failed at `apt-get install gcc-14` (exit 100) on the Jetson base.

Gate on an empty BUILD_TYPE instead: only the pure CPU image (build-type: '' in
.github/backend-matrix.yml) builds the CPU_ALL_VARIANTS set; every GPU build gets a
single fallback CPU grpc-server, since the accelerator does the compute. This also
confines the arm64 gcc-14 step (needed for the armv9.2 SME variants) to the CPU
build, away from the GPU base images.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* docs(llama-cpp): correct run.sh comment for arm64/darwin cpu-all

arm64 and darwin CPU images now also ship llama-cpp-cpu-all (not fallback-only);
only GPU images ship fallback-only. Fix the stale comment to match.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-25 15:47:03 +02:00
..

LocalAI Backend Architecture

This directory contains the core backend infrastructure for LocalAI, including the gRPC protocol definition, multi-language Dockerfiles, and language-specific backend implementations.

Overview

LocalAI uses a unified gRPC-based architecture that allows different programming languages to implement AI backends while maintaining consistent interfaces and capabilities. The backend system supports multiple hardware acceleration targets and provides a standardized way to integrate various AI models and frameworks.

Architecture Components

1. Protocol Definition (backend.proto)

The backend.proto file defines the gRPC service interface that all backends must implement. This ensures consistency across different language implementations and provides a contract for communication between LocalAI core and backend services.

Core Services

  • Text Generation: Predict, PredictStream for LLM inference
  • Embeddings: Embedding for text vectorization
  • Image Generation: GenerateImage for stable diffusion and image models
  • Audio Processing: AudioTranscription, TTS, SoundGeneration
  • Video Generation: GenerateVideo for video synthesis
  • Object Detection: Detect for computer vision tasks
  • Vector Storage: StoresSet, StoresGet, StoresFind for RAG operations
  • Reranking: Rerank for document relevance scoring
  • Voice Activity Detection: VAD for audio segmentation

Key Message Types

  • PredictOptions: Comprehensive configuration for text generation
  • ModelOptions: Model loading and configuration parameters
  • Result: Standardized response format
  • StatusResponse: Backend health and memory usage information

2. Multi-Language Dockerfiles

The backend system provides language-specific Dockerfiles that handle the build environment and dependencies for different programming languages:

  • Dockerfile.python
  • Dockerfile.golang
  • Dockerfile.llama-cpp

3. Language-Specific Implementations

Python Backends (python/)

  • transformers: Hugging Face Transformers framework
  • vllm: High-performance LLM inference
  • mlx: Apple Silicon optimization
  • diffusers: Stable Diffusion models
  • Audio: coqui, faster-whisper, kitten-tts
  • Vision: mlx-vlm, rfdetr
  • Specialized: rerankers, chatterbox, kokoro

Go Backends (go/)

  • whisper: OpenAI Whisper speech recognition in Go with GGML cpp backend (whisper.cpp)
  • stablediffusion-ggml: Stable Diffusion in Go with GGML Cpp backend
  • piper: Text-to-speech synthesis Golang with C bindings using rhaspy/piper
  • local-store: Vector storage backend

C++ Backends (cpp/)

  • llama-cpp: Llama.cpp integration
  • grpc: GRPC utilities and helpers

Hardware Acceleration Support

CUDA (NVIDIA)

  • Versions: CUDA 12.x, 13.x
  • Features: cuBLAS, cuDNN, TensorRT optimization
  • Targets: x86_64, ARM64 (Jetson)

ROCm (AMD)

  • Features: HIP, rocBLAS, MIOpen
  • Targets: AMD GPUs with ROCm support

Intel

  • Features: oneAPI, Intel Extension for PyTorch
  • Targets: Intel GPUs, XPUs, CPUs

Vulkan

  • Features: Cross-platform GPU acceleration
  • Targets: Windows, Linux, Android, macOS

Apple Silicon

  • Features: MLX framework, Metal Performance Shaders
  • Targets: M1/M2/M3 Macs

Backend Registry (index.yaml)

The index.yaml file serves as a central registry for all available backends, providing:

  • Metadata: Name, description, license, icons
  • Capabilities: Hardware targets and optimization profiles
  • Tags: Categorization for discovery
  • URLs: Source code and documentation links

Building Backends

Prerequisites

  • Docker with multi-architecture support
  • Appropriate hardware drivers (CUDA, ROCm, etc.)
  • Build tools (make, cmake, compilers)

Build Commands

Example of build commands with Docker

# Build Python backend
docker build -f backend/Dockerfile.python \
  --build-arg BACKEND=transformers \
  --build-arg BUILD_TYPE=cublas12 \
  --build-arg CUDA_MAJOR_VERSION=12 \
  --build-arg CUDA_MINOR_VERSION=0 \
  -t localai-backend-transformers .

# Build Go backend
docker build -f backend/Dockerfile.golang \
  --build-arg BACKEND=whisper \
  --build-arg BUILD_TYPE=cpu \
  -t localai-backend-whisper .

# Build C++ backend
docker build -f backend/Dockerfile.llama-cpp \
  --build-arg BACKEND=llama-cpp \
  --build-arg BUILD_TYPE=cublas12 \
  -t localai-backend-llama-cpp .

For ARM64/Mac builds, docker can't be used, and the makefile in the respective backend has to be used.

Build Types

  • cpu: CPU-only optimization
  • cublas12, cublas13: CUDA 12.x, 13.x with cuBLAS
  • hipblas: ROCm with rocBLAS
  • intel: Intel oneAPI optimization
  • vulkan: Vulkan-based acceleration
  • metal: Apple Metal optimization

Backend Development

Creating a New Backend

  1. Choose Language: Select Python, Go, or C++ based on requirements
  2. Implement Interface: Implement the gRPC service defined in backend.proto
  3. Add Dependencies: Create appropriate requirements files
  4. Configure Build: Set up Dockerfile and build scripts
  5. Register Backend: Add entry to index.yaml
  6. Test Integration: Verify gRPC communication and functionality

Backend Structure

backend-name/
├── backend.py/go/cpp    # Main implementation
├── requirements.txt      # Dependencies
├── Dockerfile           # Build configuration
├── install.sh           # Installation script
├── run.sh              # Execution script
├── test.sh             # Test script
└── README.md           # Backend documentation

Required gRPC Methods

At minimum, backends must implement:

  • Health() - Service health check
  • LoadModel() - Model loading and initialization
  • Predict() - Main inference endpoint
  • Status() - Backend status and metrics

Integration with LocalAI Core

Backends communicate with LocalAI core through gRPC:

  1. Service Discovery: Core discovers available backends
  2. Model Loading: Core requests model loading via LoadModel
  3. Inference: Core sends requests via Predict or specialized endpoints
  4. Streaming: Core handles streaming responses for real-time generation
  5. Monitoring: Core tracks backend health and performance

Performance Optimization

Memory Management

  • Model Caching: Efficient model loading and caching
  • Batch Processing: Optimize for multiple concurrent requests
  • Memory Pinning: GPU memory optimization for CUDA/ROCm

Hardware Utilization

  • Multi-GPU: Support for tensor parallelism
  • Mixed Precision: FP16/BF16 for memory efficiency
  • Kernel Fusion: Optimized CUDA/ROCm kernels

Troubleshooting

Common Issues

  1. GRPC Connection: Verify backend service is running and accessible
  2. Model Loading: Check model paths and dependencies
  3. Hardware Detection: Ensure appropriate drivers and libraries
  4. Memory Issues: Monitor GPU memory usage and model sizes

Contributing

When contributing to the backend system:

  1. Follow Protocol: Implement the exact gRPC interface
  2. Add Tests: Include comprehensive test coverage
  3. Document: Provide clear usage examples
  4. Optimize: Consider performance and resource usage
  5. Validate: Test across different hardware targets