mirror of https://github.com/mudler/LocalAI.git synced 2026-06-03 05:51:53 -04:00

Files

Ettore Di Giacinto 9eb21e9a20 fix(turboquant): patch ggml-hip CMakeLists to compile new f16-turbo fattn-vec instances

Fork commit fa4e8be0a0ce ("fix(cuda): add F16-K + TURBO-V dispatch cases
in fattn.cu") added three new template instance files under
ggml-cuda/template-instances/ (fattn-vec-instance-f16-turbo{2,3,4}_0.cu)
and wired matching FATTN_VEC_CASES_ALL_D(GGML_TYPE_F16, GGML_TYPE_TURBO*)
dispatch cases into fattn.cu.

fattn.cu is shared with the HIP build via hipify, but the fork forgot
to mirror the new source files into ggml/src/ggml-hip/CMakeLists.txt.
CMake's ROCm branch carries a hand-curated template-instance list (used
when GGML_CUDA_FA_ALL_QUANTS is OFF, the default), so the HIP build
ended up with the extern template declarations but no matching
instantiations — the -gpu-rocm-hipblas-turboquant job failed partway
through the 3h+ build.

Add patches/0001-ggml-hip-add-f16-turbo-vec-instances.patch, which the
existing apply-patches.sh machinery applies to the cloned fork sources
after fetch. The patch appends the three new f16-turbo instance files
to ggml-hip's source list in the same interleaved order used by
ggml-cuda's CMakeLists.txt. Drop this patch once the fork syncs the
ROCm list — the build will fail fast if the anchor context goes stale,
which is the signal to retire it.

CUDA builds were unaffected (ggml-cuda's CMakeLists.txt was updated
upstream) — the link failure was isolated to HIP.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

2026-04-22 07:17:33 +00:00

| Name | Last commit | Date |
| --- | --- | --- |
| cpp | fix(turboquant): patch ggml-hip CMakeLists to compile new f16-turbo fattn-vec instances | 2026-04-22 07:17:33 +00:00 |
| go | chore: ⬆️ Update ggml-org/whisper.cpp to fc674574ca27cac59a15e5b22a09b9d9ad62aafe (#9450) | 2026-04-21 11:09:05 +02:00 |
| python | chore(whisperx): drop ROCm/hipblas build target (#9474) | 2026-04-21 21:50:18 +02:00 |
| rust/kokoros | fix(kokoros): implement audio_transcription_stream trait stub (#9422) | 2026-04-19 13:29:58 +02:00 |
| backend.proto | fix(vision): propagate mtmd media marker from backend via ModelMetadata (#9412) | 2026-04-18 20:30:13 +02:00 |
| Dockerfile.golang | feat(realtime): WebRTC support (#8790) | 2026-03-13 21:37:15 +01:00 |
| Dockerfile.ik-llama-cpp | feat(backends): add ik-llama-cpp (#9326) | 2026-04-12 13:51:28 +02:00 |
| Dockerfile.llama-cpp | fix(rocm): add gfx1151 support and expose AMDGPU_TARGETS build-arg (#9410) | 2026-04-18 20:39:40 +02:00 |
| Dockerfile.python | feat(vllm): parity with llama.cpp backend (#9328) | 2026-04-13 11:00:29 +02:00 |
| Dockerfile.rust | feat: Add Kokoros backend (#9212) | 2026-04-08 19:23:16 +02:00 |
| Dockerfile.turboquant | feat(backend): add turboquant llama.cpp-fork backend (#9355) | 2026-04-15 01:25:04 +02:00 |
| index.yaml | chore(whisperx): drop ROCm/hipblas build target (#9474) | 2026-04-21 21:50:18 +02:00 |
| README.md | Remove HuggingFace backend support (#8971) | 2026-03-13 01:09:30 +01:00 |

README.md

LocalAI Backend Architecture

This directory contains the core backend infrastructure for LocalAI, including the gRPC protocol definition, multi-language Dockerfiles, and language-specific backend implementations.

Overview

LocalAI uses a unified gRPC-based architecture that allows different programming languages to implement AI backends while maintaining consistent interfaces and capabilities. The backend system supports multiple hardware acceleration targets and provides a standardized way to integrate various AI models and frameworks.

Architecture Components

1. Protocol Definition (`backend.proto`)

The backend.proto file defines the gRPC service interface that all backends must implement. This ensures consistency across different language implementations and provides a contract for communication between LocalAI core and backend services.

Core Services

Text Generation: Predict, PredictStream for LLM inference
Embeddings: Embedding for text vectorization
Image Generation: GenerateImage for stable diffusion and image models
Audio Processing: AudioTranscription, TTS, SoundGeneration
Video Generation: GenerateVideo for video synthesis
Object Detection: Detect for computer vision tasks
Vector Storage: StoresSet, StoresGet, StoresFind for RAG operations
Reranking: Rerank for document relevance scoring
Voice Activity Detection: VAD for audio segmentation
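
As a rough illustration of how the service groups above map onto RPC names, the sketch below keeps them in a lookup table and reports which groups a backend fully covers. The RPC names are copied from the list; the group keys and the `capabilities_for` helper are hypothetical and are not part of backend.proto or LocalAI.

```python
# RPC names are copied from the Core Services list; the grouping keys and
# the helper below are illustrative assumptions, not part of backend.proto.
CORE_SERVICES = {
    "text_generation": ["Predict", "PredictStream"],
    "embeddings": ["Embedding"],
    "image_generation": ["GenerateImage"],
    "audio_processing": ["AudioTranscription", "TTS", "SoundGeneration"],
    "video_generation": ["GenerateVideo"],
    "object_detection": ["Detect"],
    "vector_storage": ["StoresSet", "StoresGet", "StoresFind"],
    "reranking": ["Rerank"],
    "voice_activity_detection": ["VAD"],
}


def capabilities_for(implemented_rpcs):
    """Return the service groups whose RPCs are all in implemented_rpcs."""
    implemented = set(implemented_rpcs)
    return sorted(
        group
        for group, rpcs in CORE_SERVICES.items()
        if set(rpcs) <= implemented
    )


if __name__ == "__main__":
    # A hypothetical backend that serves text generation and embeddings.
    print(capabilities_for(["Predict", "PredictStream", "Embedding"]))
```

A group is reported only when every RPC in it is present, so a backend exposing `Predict` without `PredictStream` would not be listed under text generation in this sketch.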

Key Message Types

PredictOptions: Comprehensive configuration for text generation
ModelOptions: Model loading and configuration parameters
Result: Standardized response format
StatusResponse: Backend health and memory usage information

2. Multi-Language Dockerfiles

The backend system provides language-specific Dockerfiles that handle the build environment and dependencies for different programming languages:

Dockerfile.python
Dockerfile.golang
Dockerfile.llama-cpp

3. Language-Specific Implementations

Python Backends (`python/`)

transformers: Hugging Face Transformers framework
vllm: High-performance LLM inference
mlx: Apple Silicon optimization
diffusers: Stable Diffusion models
Audio: coqui, faster-whisper, kitten-tts
Vision: mlx-vlm, rfdetr
Specialized: rerankers, chatterbox, kokoro

Go Backends (`go/`)

whisper: OpenAI Whisper speech recognition in Go with a GGML C++ backend (whisper.cpp)
stablediffusion-ggml: Stable Diffusion in Go with a GGML C++ backend
piper: Text-to-speech synthesis in Go with C bindings, using rhasspy/piper
local-store: Vector storage backend

C++ Backends (`cpp/`)

llama-cpp: Llama.cpp integration
grpc: gRPC utilities and helpers

Hardware Acceleration Support

CUDA (NVIDIA)

Versions: CUDA 12.x, 13.x
Features: cuBLAS, cuDNN, TensorRT optimization
Targets: x86_64, ARM64 (Jetson)

ROCm (AMD)

Features: HIP, rocBLAS, MIOpen
Targets: AMD GPUs with ROCm support

Intel

Features: oneAPI, Intel Extension for PyTorch
Targets: Intel GPUs, XPUs, CPUs

Vulkan

Features: Cross-platform GPU acceleration
Targets: Windows, Linux, Android, macOS

Apple Silicon

Features: MLX framework, Metal Performance Shaders
Targets: M1/M2/M3 Macs

Backend Registry (`index.yaml`)

The index.yaml file serves as a central registry for all available backends, providing:

Metadata: Name, description, license, icons
Capabilities: Hardware targets and optimization profiles
Tags: Categorization for discovery
URLs: Source code and documentation links

Building Backends

Prerequisites

Docker with multi-architecture support
Appropriate hardware drivers (CUDA, ROCm, etc.)
Build tools (make, cmake, compilers)

Build Commands

Example build commands using Docker:

# Build Python backend
docker build -f backend/Dockerfile.python \
  --build-arg BACKEND=transformers \
  --build-arg BUILD_TYPE=cublas12 \
  --build-arg CUDA_MAJOR_VERSION=12 \
  --build-arg CUDA_MINOR_VERSION=0 \
  -t localai-backend-transformers .

# Build Go backend
docker build -f backend/Dockerfile.golang \
  --build-arg BACKEND=whisper \
  --build-arg BUILD_TYPE=cpu \
  -t localai-backend-whisper .

# Build C++ backend
docker build -f backend/Dockerfile.llama-cpp \
  --build-arg BACKEND=llama-cpp \
  --build-arg BUILD_TYPE=cublas12 \
  -t localai-backend-llama-cpp .

For ARM64/Mac builds, Docker cannot be used; use the Makefile in the respective backend directory instead.

Build Types

cpu: CPU-only optimization
cublas12, cublas13: CUDA 12.x, 13.x with cuBLAS
hipblas: ROCm with rocBLAS
intel: Intel oneAPI optimization
vulkan: Vulkan-based acceleration
metal: Apple Metal optimization
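
The build commands and build types above can be combined programmatically. The following is a minimal sketch that assembles a `docker build` argument list and rejects build types that are not in the list above. The `docker_build_argv` helper and the `flavor` keys are hypothetical names; the Dockerfile paths, build-arg names, and image tag pattern are taken from the examples in this section.

```python
import shlex

# Dockerfile paths and build types are taken from this README; the
# dictionary keys ("python", "golang", "llama-cpp") are illustrative.
DOCKERFILES = {
    "python": "backend/Dockerfile.python",
    "golang": "backend/Dockerfile.golang",
    "llama-cpp": "backend/Dockerfile.llama-cpp",
}
BUILD_TYPES = {"cpu", "cublas12", "cublas13", "hipblas", "intel", "vulkan", "metal"}


def docker_build_argv(flavor, backend, build_type, extra_args=None):
    """Build the argv for `docker build`, mirroring the examples above."""
    if flavor not in DOCKERFILES:
        raise ValueError(f"unknown Dockerfile flavor: {flavor}")
    if build_type not in BUILD_TYPES:
        raise ValueError(f"unknown build type: {build_type}")

    build_args = {"BACKEND": backend, "BUILD_TYPE": build_type}
    build_args.update(extra_args or {})

    argv = ["docker", "build", "-f", DOCKERFILES[flavor]]
    for key, value in build_args.items():
        argv += ["--build-arg", f"{key}={value}"]
    argv += ["-t", f"localai-backend-{backend}", "."]
    return argv


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(docker_build_argv("golang", "whisper", "cpu")))
```

Returning a list rather than a shell string keeps the result safe to pass to `subprocess.run` without quoting concerns; `shlex.join` is used only for display.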

Backend Development

Creating a New Backend

Choose Language: Select Python, Go, or C++ based on requirements
Implement Interface: Implement the gRPC service defined in backend.proto
Add Dependencies: Create appropriate requirements files
Configure Build: Set up Dockerfile and build scripts
Register Backend: Add entry to index.yaml
Test Integration: Verify gRPC communication and functionality

Backend Structure

backend-name/
├── backend.py/go/cpp    # Main implementation
├── requirements.txt     # Dependencies
├── Dockerfile           # Build configuration
├── install.sh           # Installation script
├── run.sh               # Execution script
├── test.sh              # Test script
└── README.md            # Backend documentation
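
A quick way to check a new backend directory against the layout above is sketched below. It assumes that `backend.py/go/cpp` means one of `backend.py`, `backend.go`, or `backend.cpp`, and it treats every listed file as expected even though `requirements.txt` is likely only relevant to Python backends. The `missing_files` helper is hypothetical.

```python
import tempfile
from pathlib import Path

# File names come from the tree above. Reading "backend.py/go/cpp" as three
# alternative entry points is an assumption made for this sketch.
ENTRY_POINTS = ["backend.py", "backend.go", "backend.cpp"]
COMMON_FILES = [
    "requirements.txt",
    "Dockerfile",
    "install.sh",
    "run.sh",
    "test.sh",
    "README.md",
]


def missing_files(backend_dir):
    """List expected files that are absent from backend_dir."""
    root = Path(backend_dir)
    missing = [name for name in COMMON_FILES if not (root / name).is_file()]
    if not any((root / name).is_file() for name in ENTRY_POINTS):
        missing.insert(0, "backend.py|backend.go|backend.cpp")
    return missing


if __name__ == "__main__":
    # Demonstrate on a throwaway directory containing only two files.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("backend.py", "run.sh"):
            (Path(tmp) / name).write_text("")
        print(missing_files(tmp))
```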

Required gRPC Methods

At minimum, backends must implement:

Health() - Service health check
LoadModel() - Model loading and initialization
Predict() - Main inference endpoint
Status() - Backend status and metrics
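
The sketch below shows one way to verify that a backend class defines the four methods listed above before it is registered. It does not use gRPC or the generated stubs: `EchoBackend`, the `(request, context)` handler shape, and the dictionaries standing in for request and response messages are all hypothetical, and the real field names are defined in backend.proto.

```python
REQUIRED_METHODS = ("Health", "LoadModel", "Predict", "Status")


def missing_methods(servicer_cls):
    """Return the required RPC method names that servicer_cls lacks."""
    return [
        name
        for name in REQUIRED_METHODS
        if not callable(getattr(servicer_cls, name, None))
    ]


class EchoBackend:
    """Hypothetical backend; plain dicts stand in for protobuf messages."""

    def __init__(self):
        self.model = None

    def Health(self, request, context):
        return {"message": "OK"}

    def LoadModel(self, request, context):
        self.model = request["model"]
        return {"success": True}

    def Predict(self, request, context):
        if self.model is None:
            raise RuntimeError("LoadModel must be called before Predict")
        # Echo the prompt back as the "generated" text.
        return {"message": request["prompt"]}

    def Status(self, request, context):
        return {"state": "READY" if self.model else "UNINITIALIZED"}


if __name__ == "__main__":
    print(missing_methods(EchoBackend))
```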

Integration with LocalAI Core

Backends communicate with LocalAI core through gRPC:

Service Discovery: Core discovers available backends
Model Loading: Core requests model loading via LoadModel
Inference: Core sends requests via Predict or specialized endpoints
Streaming: Core handles streaming responses for real-time generation
Monitoring: Core tracks backend health and performance
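
The request flow above (health check, model loading, then streaming inference) can be simulated in-process to make the ordering concrete. This is only a sketch: `FakeBackend`, `serve_request`, and their signatures are hypothetical, no gRPC transport is involved, and the real core may order or cache these calls differently.

```python
class FakeBackend:
    """Hypothetical in-process stand-in for a gRPC backend."""

    def __init__(self):
        self.loaded = None

    def Health(self):
        return True

    def LoadModel(self, model):
        self.loaded = model
        return True

    def PredictStream(self, prompt):
        if self.loaded is None:
            raise RuntimeError("no model loaded")
        # Yield one chunk per word to imitate token streaming.
        for word in prompt.split():
            yield word.upper()


def serve_request(backend, model, prompt, on_chunk):
    """Health check, load the model if needed, then stream the reply."""
    if not backend.Health():
        raise RuntimeError("backend is not healthy")
    if backend.loaded != model and not backend.LoadModel(model):
        raise RuntimeError(f"could not load model {model!r}")

    chunks = []
    for chunk in backend.PredictStream(prompt):
        on_chunk(chunk)  # forward each chunk as soon as it arrives
        chunks.append(chunk)
    return " ".join(chunks)


if __name__ == "__main__":
    serve_request(FakeBackend(), "demo-model", "hello from core", print)
```

The callback receives each chunk before the full response is assembled, which is the property that matters for real-time generation.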

Performance Optimization

Memory Management

Model Caching: Efficient model loading and caching
Batch Processing: Optimize for multiple concurrent requests
Memory Pinning: GPU memory optimization for CUDA/ROCm

Hardware Utilization

Multi-GPU: Support for tensor parallelism
Mixed Precision: FP16/BF16 for memory efficiency
Kernel Fusion: Optimized CUDA/ROCm kernels

Troubleshooting

Common Issues

gRPC Connection: Verify backend service is running and accessible
Model Loading: Check model paths and dependencies
Hardware Detection: Ensure appropriate drivers and libraries
Memory Issues: Monitor GPU memory usage and model sizes
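
For the gRPC connection case, a first diagnostic step is to confirm that something is listening on the backend's address at all. The sketch below uses only the Python standard library; the `is_reachable` helper and the host and port in the usage example are assumptions, so substitute the address your backend actually listens on. A successful TCP connection does not prove that the gRPC service is healthy, only that the port is open.

```python
import socket


def is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    # Hypothetical address; replace with the backend's real host and port.
    host, port = "127.0.0.1", 50051
    state = "open" if is_reachable(host, port) else "closed or unreachable"
    print(f"{host}:{port} is {state}")
```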

Contributing

When contributing to the backend system:

Follow Protocol: Implement the exact gRPC interface
Add Tests: Include comprehensive test coverage
Document: Provide clear usage examples
Optimize: Consider performance and resource usage
Validate: Test across different hardware targets
