LocalAI Backend Architecture

This directory contains the core backend infrastructure for LocalAI, including the gRPC protocol definition, multi-language Dockerfiles, and language-specific backend implementations.

Overview

LocalAI uses a unified gRPC-based architecture that allows different programming languages to implement AI backends while maintaining consistent interfaces and capabilities. The backend system supports multiple hardware acceleration targets and provides a standardized way to integrate various AI models and frameworks.

Architecture Components

1. Protocol Definition (backend.proto)

The backend.proto file defines the gRPC service interface that all backends must implement. This ensures consistency across different language implementations and provides a contract for communication between LocalAI core and backend services.

Core Services

  • Text Generation: Predict, PredictStream for LLM inference
  • Embeddings: Embedding for text vectorization
  • Image Generation: GenerateImage for Stable Diffusion and other image models
  • Audio Processing: AudioTranscription, TTS, SoundGeneration
  • Video Generation: GenerateVideo for video synthesis
  • Object Detection: Detect for computer vision tasks
  • Vector Storage: StoresSet, StoresGet, StoresFind for RAG operations
  • Reranking: Rerank for document relevance scoring
  • Voice Activity Detection: VAD for audio segmentation

Key Message Types

  • PredictOptions: Comprehensive configuration for text generation
  • ModelOptions: Model loading and configuration parameters
  • Result: Standardized response format
  • StatusResponse: Backend health and memory usage information

2. Multi-Language Dockerfiles

The backend system provides language-specific Dockerfiles that handle the build environment and dependencies for different programming languages:

  • Dockerfile.python
  • Dockerfile.golang
  • Dockerfile.llama-cpp

3. Language-Specific Implementations

Python Backends (python/)

  • transformers: Hugging Face Transformers framework
  • vllm: High-performance LLM inference
  • mlx: Apple Silicon optimization
  • diffusers: Stable Diffusion models
  • Audio: coqui, faster-whisper, kitten-tts
  • Vision: mlx-vlm, rfdetr
  • Specialized: rerankers, chatterbox, kokoro

Go Backends (go/)

  • whisper: OpenAI Whisper speech recognition in Go with a GGML C++ backend (whisper.cpp)
  • stablediffusion-ggml: Stable Diffusion in Go with a GGML C++ backend
  • piper: Text-to-speech synthesis in Go with C bindings, using rhasspy/piper
  • local-store: Vector storage backend

C++ Backends (cpp/)

  • llama-cpp: Llama.cpp integration
  • grpc: gRPC utilities and helpers

Hardware Acceleration Support

CUDA (NVIDIA)

  • Versions: CUDA 12.x, 13.x
  • Features: cuBLAS, cuDNN, TensorRT optimization
  • Targets: x86_64, ARM64 (Jetson)

ROCm (AMD)

  • Features: HIP, rocBLAS, MIOpen
  • Targets: AMD GPUs with ROCm support

Intel

  • Features: oneAPI, Intel Extension for PyTorch
  • Targets: Intel GPUs, XPUs, CPUs

Vulkan

  • Features: Cross-platform GPU acceleration
  • Targets: Windows, Linux, Android, macOS

Apple Silicon

  • Features: MLX framework, Metal Performance Shaders
  • Targets: M1/M2/M3 Macs

Backend Registry (index.yaml)

The index.yaml file serves as a central registry for all available backends, providing:

  • Metadata: Name, description, license, icons
  • Capabilities: Hardware targets and optimization profiles
  • Tags: Categorization for discovery
  • URLs: Source code and documentation links
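
As a rough illustration of how registry metadata of this kind could be queried, the sketch below filters a list of entries by hardware capability and tag. The entries and their key names (name, capabilities, tags, urls) are hypothetical and only mirror the categories listed above; they are not the actual index.yaml schema.

```python
from typing import Dict, List, Optional

# Hypothetical entries: key names follow the categories listed above,
# not the real index.yaml schema.
REGISTRY: List[Dict] = [
    {
        "name": "example-llm",
        "description": "Example text generation backend",
        "license": "MIT",
        "capabilities": ["cpu", "cublas12"],
        "tags": ["text-generation"],
        "urls": ["https://example.com/example-llm"],
    },
    {
        "name": "example-tts",
        "description": "Example speech synthesis backend",
        "license": "Apache-2.0",
        "capabilities": ["cpu"],
        "tags": ["audio"],
        "urls": ["https://example.com/example-tts"],
    },
]


def find_backends(registry: List[Dict],
                  capability: Optional[str] = None,
                  tag: Optional[str] = None) -> List[str]:
    """Return the names of entries matching every filter that was given."""
    names = []
    for entry in registry:
        if capability is not None and capability not in entry["capabilities"]:
            continue
        if tag is not None and tag not in entry["tags"]:
            continue
        names.append(entry["name"])
    return names
```

For example, find_backends(REGISTRY, capability="cublas12") selects only the entries that declare that hardware target.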

Building Backends

Prerequisites

  • Docker with multi-architecture support
  • Appropriate hardware drivers (CUDA, ROCm, etc.)
  • Build tools (make, cmake, compilers)

Build Commands

Example build commands with Docker:

# Build Python backend
docker build -f backend/Dockerfile.python \
  --build-arg BACKEND=transformers \
  --build-arg BUILD_TYPE=cublas12 \
  --build-arg CUDA_MAJOR_VERSION=12 \
  --build-arg CUDA_MINOR_VERSION=0 \
  -t localai-backend-transformers .

# Build Go backend
docker build -f backend/Dockerfile.golang \
  --build-arg BACKEND=whisper \
  --build-arg BUILD_TYPE=cpu \
  -t localai-backend-whisper .

# Build C++ backend
docker build -f backend/Dockerfile.llama-cpp \
  --build-arg BACKEND=llama-cpp \
  --build-arg BUILD_TYPE=cublas12 \
  -t localai-backend-llama-cpp .

For ARM64/Mac builds, Docker cannot be used; use the Makefile in the respective backend directory instead.

Build Types

  • cpu: CPU-only optimization
  • cublas12, cublas13: CUDA 12.x, 13.x with cuBLAS
  • hipblas: ROCm with rocBLAS
  • intel: Intel oneAPI optimization
  • vulkan: Vulkan-based acceleration
  • metal: Apple Metal optimization
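
The relationship between build types and the docker build arguments shown earlier can be sketched as a small helper. This is an illustrative script, not part of the repository: the function name is hypothetical, and deriving CUDA_MAJOR_VERSION from the numeric suffix of the build type is an assumption based on the Python example above. It does not account for platforms where Docker cannot be used.

```python
from typing import List, Optional

# Dockerfile per language, taken from the build commands shown earlier.
DOCKERFILES = {
    "python": "backend/Dockerfile.python",
    "golang": "backend/Dockerfile.golang",
    "llama-cpp": "backend/Dockerfile.llama-cpp",
}

# Build types listed in this section.
BUILD_TYPES = {"cpu", "cublas12", "cublas13", "hipblas", "intel", "vulkan", "metal"}


def docker_build_command(language: str, backend: str, build_type: str,
                         cuda_minor: Optional[int] = None) -> List[str]:
    """Assemble a docker build argument list for one backend (hypothetical helper)."""
    if language not in DOCKERFILES:
        raise ValueError(f"unknown language: {language}")
    if build_type not in BUILD_TYPES:
        raise ValueError(f"unknown build type: {build_type}")
    cmd = [
        "docker", "build", "-f", DOCKERFILES[language],
        "--build-arg", f"BACKEND={backend}",
        "--build-arg", f"BUILD_TYPE={build_type}",
    ]
    if build_type.startswith("cublas"):
        # Assumption: the CUDA major version is the numeric suffix of the build type.
        major = build_type[len("cublas"):]
        cmd += ["--build-arg", f"CUDA_MAJOR_VERSION={major}"]
        if cuda_minor is not None:
            cmd += ["--build-arg", f"CUDA_MINOR_VERSION={cuda_minor}"]
    # Image tag follows the localai-backend-<name> pattern used in the examples.
    cmd += ["-t", f"localai-backend-{backend}", "."]
    return cmd
```

The returned list can be passed to subprocess.run, or joined with spaces to print the equivalent shell command.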

Backend Development

Creating a New Backend

  1. Choose Language: Select Python, Go, or C++ based on requirements
  2. Implement Interface: Implement the gRPC service defined in backend.proto
  3. Add Dependencies: Create appropriate requirements files
  4. Configure Build: Set up Dockerfile and build scripts
  5. Register Backend: Add entry to index.yaml
  6. Test Integration: Verify gRPC communication and functionality

Backend Structure

backend-name/
├── backend.py/go/cpp    # Main implementation
├── requirements.txt      # Dependencies
├── Dockerfile           # Build configuration
├── install.sh           # Installation script
├── run.sh              # Execution script
├── test.sh             # Test script
└── README.md           # Backend documentation
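
A quick way to check a new backend directory against this layout is sketched below. It assumes that "backend.py/go/cpp" means exactly one of backend.py, backend.go, or backend.cpp, and that every other file in the tree is expected as named; backends in some languages may legitimately differ (for example, requirements.txt is Python-specific), so treat the result as a hint rather than a rule.

```python
from pathlib import Path
from typing import List

# Files from the layout above that do not depend on the implementation language.
EXPECTED_FILES = [
    "requirements.txt", "Dockerfile", "install.sh",
    "run.sh", "test.sh", "README.md",
]
# Assumed spellings of the main implementation file.
MAIN_FILES = ["backend.py", "backend.go", "backend.cpp"]


def missing_files(backend_dir: Path) -> List[str]:
    """List the expected files that are absent from backend_dir."""
    missing = [name for name in EXPECTED_FILES
               if not (backend_dir / name).is_file()]
    # At least one main implementation file must be present.
    if not any((backend_dir / name).is_file() for name in MAIN_FILES):
        missing.append("backend.py/go/cpp")
    return missing
```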

Required gRPC Methods

At minimum, backends must implement:

  • Health() - Service health check
  • LoadModel() - Model loading and initialization
  • Predict() - Main inference endpoint
  • Status() - Backend status and metrics
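
The shape of these four methods can be sketched in plain Python without any gRPC machinery. This is a toy stand-in, not the generated API: a real backend would subclass the servicer generated from backend.proto, and the Result class, its fields, and the status dictionary below are simplified assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Result:
    """Simplified stand-in for the Result message: a flag plus a message."""
    success: bool
    message: str = ""


class EchoBackend:
    """Toy backend exposing the four required methods as plain Python."""

    def __init__(self) -> None:
        self.model_path: Optional[str] = None

    def health(self) -> Result:
        # Health(): report that the process is up, regardless of model state.
        return Result(success=True, message="OK")

    def load_model(self, model_path: str) -> Result:
        # LoadModel(): a real backend would load weights here.
        if not model_path:
            return Result(success=False, message="model path is empty")
        self.model_path = model_path
        return Result(success=True, message="model loaded")

    def predict(self, prompt: str) -> str:
        # Predict(): refuse to run before a model has been loaded.
        if self.model_path is None:
            raise RuntimeError("model not loaded")
        return prompt.upper()

    def status(self) -> Dict[str, str]:
        # Status(): expose a coarse state that the core could poll.
        state = "READY" if self.model_path else "UNINITIALIZED"
        return {"state": state}
```

Keeping Health() independent of model state lets the caller distinguish "process is down" from "process is up but no model is loaded".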

Integration with LocalAI Core

Backends communicate with LocalAI core through gRPC:

  1. Service Discovery: Core discovers available backends
  2. Model Loading: Core requests model loading via LoadModel
  3. Inference: Core sends requests via Predict or specialized endpoints
  4. Streaming: Core handles streaming responses for real-time generation
  5. Monitoring: Core tracks backend health and performance
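
The order of these steps can be illustrated with an in-process stand-in. Nothing here talks gRPC: FakeBackend and run_request are hypothetical names, and the generator only mimics how a streaming call would deliver partial output chunk by chunk.

```python
from typing import Iterator, List


class FakeBackend:
    """Hypothetical in-process stand-in for a backend reached over gRPC."""

    def __init__(self) -> None:
        self.loaded = False

    def health(self) -> bool:
        return True

    def load_model(self, path: str) -> bool:
        # Loading succeeds for any non-empty path in this toy version.
        self.loaded = bool(path)
        return self.loaded

    def predict_stream(self, prompt: str) -> Iterator[str]:
        # Yield one chunk per whitespace-separated word to mimic streaming.
        if not self.loaded:
            raise RuntimeError("model not loaded")
        for word in prompt.split():
            yield word


def run_request(backend: FakeBackend, model_path: str, prompt: str) -> List[str]:
    """Walk the steps in order: health check, model load, streamed inference."""
    if not backend.health():
        raise ConnectionError("backend is not healthy")
    if not backend.load_model(model_path):
        raise ValueError("model failed to load")
    # Collect chunks as they arrive; a real caller would forward them instead.
    return list(backend.predict_stream(prompt))
```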

Performance Optimization

Memory Management

  • Model Caching: Efficient model loading and caching
  • Batch Processing: Optimize for multiple concurrent requests
  • Memory Pinning: GPU memory optimization for CUDA/ROCm

Hardware Utilization

  • Multi-GPU: Support for tensor parallelism
  • Mixed Precision: FP16/BF16 for memory efficiency
  • Kernel Fusion: Optimized CUDA/ROCm kernels

Troubleshooting

Common Issues

  1. gRPC Connection: Verify backend service is running and accessible
  2. Model Loading: Check model paths and dependencies
  3. Hardware Detection: Ensure appropriate drivers and libraries
  4. Memory Issues: Monitor GPU memory usage and model sizes
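
For the first item, a plain TCP probe is often enough to tell whether anything is listening on the backend's address. The sketch below uses only the Python standard library; the function name is hypothetical, and a successful connection only shows that the port is open, not that the gRPC service behind it is healthy.

```python
import socket


def backend_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

If the probe succeeds but requests still fail, the next step would be calling the backend's Health() method through a gRPC client.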

Contributing

When contributing to the backend system:

  1. Follow Protocol: Implement the exact gRPC interface
  2. Add Tests: Include comprehensive test coverage
  3. Document: Provide clear usage examples
  4. Optimize: Consider performance and resource usage
  5. Validate: Test across different hardware targets