mirror of https://github.com/mudler/LocalAI.git synced 2026-07-03 04:46:54 -04:00

Files

LocalAI [bot] a4e6e01e4d fix(process): give backend workers a parent-death safety net (#10639)

* fix(grpc): self-terminate backend workers when LocalAI dies non-gracefully

Symptom: a backend model-worker subprocess (the per-model gRPC server LocalAI
spawns) can be orphaned and linger — holding VRAM and its listen port — if the
LocalAI process is killed non-gracefully (e.g. a supervisor's graceful-shutdown
grace period elapses and LocalAI is SIGKILLed) before its own teardown runs.

Root cause: LocalAI's graceful teardown (pkg/signals/handler.go installs the
SIGINT/SIGTERM handler; core/cli/run.go registers app.Shutdown ->
ModelLoader.StopAllGRPC -> process.Stop in pkg/model/process.go) only runs when
LocalAI receives a catchable signal and survives long enough to run its
handlers. Backends are spawned via github.com/mudler/go-processmanager v0.1.1,
whose getSysProcAttr() sets Setpgid:true (own process group, so the group can be
signalled) but never PR_SET_PDEATHSIG/Pdeathsig, and exposes no Config field or
option for a caller to inject/extend SysProcAttr. LocalAI fully delegates
spawning to that library (it never builds the exec.Cmd itself), so it cannot set
a kernel parent-death signal at the spawn site. If LocalAI is SIGKILLed, nothing
tells the backend to exit and it is reparented to init.

Fix: add a best-effort, backend-side safety net at the one shared choke point
every out-of-process Go backend routes through — grpc.StartServer / RunServer in
pkg/grpc. On startup it captures getppid() and polls; when the process is
reparented (getppid() changes or becomes 1, the standard POSIX indication that the
original parent died) it logs and self-terminates. getppid() reparent detection is
portable (Linux + macOS), unlike Linux-only PR_SET_PDEATHSIG. Toggle via
LOCALAI_BACKEND_PARENT_WATCH (default on; off on Windows) and
LOCALAI_BACKEND_PARENT_WATCH_INTERVAL. This is strictly a backstop alongside the
existing graceful SIGTERM->grace->SIGKILL teardown, which is unchanged.
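
A minimal sketch of the getppid() polling idea described above, written in Python for illustration. It is not the code in pkg/grpc: the names `is_reparented` and `watch_parent`, the injectable `getppid`/`sleep` parameters, and the hard exit via `os._exit` are assumptions made for this example.

```python
import os
import sys
import threading
import time


def is_reparented(orig_ppid: int, current_ppid: int) -> bool:
    """True when the parent PID changed or the process now belongs to PID 1."""
    return current_ppid != orig_ppid or current_ppid == 1


def _log_and_exit() -> None:
    # Assumption: log, then exit hard without running interpreter cleanup.
    sys.stderr.write("parent process died, terminating backend\n")
    os._exit(1)


def watch_parent(interval=2.0, on_death=_log_and_exit,
                 getppid=os.getppid, sleep=time.sleep):
    """Start a daemon thread that polls getppid().

    Returns the thread, or None if the process is already orphaned at startup.
    """
    orig = getppid()
    if orig <= 1:
        return None  # already orphaned: there is no parent worth watching

    def loop():
        while True:
            sleep(interval)
            if is_reparented(orig, getppid()):
                on_death()
                return

    thread = threading.Thread(target=loop, name="parent-watch", daemon=True)
    thread.start()
    return thread
```

Keeping the comparison in a pure function makes the detection rule checkable without killing any process; the polling thread only adds the timing.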

Scope/limitations: covers Go-based backends (everything using pkg/grpc). The
C++ backends (e.g. llama-cpp) and Python backends do not route through
pkg/grpc and are not covered by this mechanism — they would each need an
equivalent parent-death check (follow-up). The fully general fix is for
go-processmanager to expose SysProcAttr injection so LocalAI can set Pdeathsig
at spawn for every backend regardless of language (suggested upstream follow-up;
out of scope for this LocalAI-only PR).

Test: pkg/grpc/parentwatch_test.go builds a real test -> middle -> grandchild
process tree, lets the middle process exit to orphan the grandchild running the
real watchParentDeath, and asserts it detects the reparent and self-terminates.
Unix-only (build-tagged), runs in CI (Linux).
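
The shape of that test (test -> middle -> grandchild) can be sketched in Python as follows. This is an illustration only, not pkg/grpc/parentwatch_test.go: the grandchild here runs a small inline polling loop rather than the real watchParentDeath, and the helper name `orphan_and_observe` is hypothetical. Unix-only.

```python
import subprocess
import sys
import textwrap

# Grandchild: polls getppid() until it differs from the middle process's PID.
GRANDCHILD = textwrap.dedent("""
    import os, sys, time
    orig = int(sys.argv[1])  # PID of the middle process, passed explicitly
    deadline = time.monotonic() + 10
    while time.monotonic() < deadline:
        if os.getppid() != orig:
            print("reparented", flush=True)
            sys.exit(0)
        time.sleep(0.05)
    print("timeout", flush=True)
    sys.exit(1)
""")

# Middle: spawns the grandchild (which inherits stdout), then exits at once,
# which orphans the grandchild.
MIDDLE = textwrap.dedent("""
    import os, subprocess, sys
    subprocess.Popen([sys.executable, "-c", sys.argv[1], str(os.getpid())])
""")


def orphan_and_observe() -> str:
    """Run the three-level tree and return what the grandchild reported."""
    result = subprocess.run(
        [sys.executable, "-c", MIDDLE, GRANDCHILD],
        stdout=subprocess.PIPE,
        timeout=30,
    )
    # The pipe reaches EOF only after the grandchild exits, so the read
    # above also waits for the orphaned process to finish.
    return result.stdout.decode().strip()
```

Passing the middle process's PID on the command line avoids a race in which the middle process exits before the grandchild has read its own getppid().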

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(process): extend parent-death backstop to C++ and Python backends

The Go parent-death watcher (pkg/grpc/parentwatch.go, commit 772b435d5)
only protects backends that route through pkg/grpc. C++ and Python
backends don't, so the originally-reported case — the llama.cpp gRPC
worker surviving a non-graceful LocalAI death — was still uncovered.

Extend the same best-effort backstop to both languages, reusing the
exact mechanism and semantics:

- capture getppid() at startup, skip if already orphaned (<=1)
- a background thread polls getppid() and self-exits on reparenting
  (getppid() != orig || == 1), portable across Linux/macOS, no-op on
  Windows
- same env vars: LOCALAI_BACKEND_PARENT_WATCH (default on; the falsy values
  false/0/no/off disable it) and LOCALAI_BACKEND_PARENT_WATCH_INTERVAL
  (default 2s; accepts Go-style durations like 500ms/2s/1m)
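
A hedged sketch of how the two environment variables listed above could be interpreted, in Python. The function names `watch_enabled` and `parse_duration` are hypothetical, and so are three behaviours the commit message does not specify: case-insensitive matching of the falsy values, falling back to the 2s default on an unparsable interval, and supporting only a subset of Go duration syntax.

```python
import os
import re
import sys

_FALSY = {"false", "0", "no", "off"}
_UNITS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ns|us|ms|s|m|h)")


def watch_enabled(env=os.environ, platform=sys.platform) -> bool:
    """Default on; disabled on Windows or by a falsy value."""
    if platform.startswith("win"):
        return False
    value = env.get("LOCALAI_BACKEND_PARENT_WATCH", "").strip().lower()
    return value not in _FALSY


def parse_duration(text, default=2.0) -> float:
    """Parse Go-style durations such as 500ms, 2s, 1m or 1m30s into seconds.

    Returns `default` for empty, malformed, or non-positive input
    (an assumption of this sketch).
    """
    text = (text or "").strip()
    if not text:
        return default
    pos, total = 0, 0.0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            return default
        total += float(match.group(1)) * _UNITS[match.group(2)]
        pos = match.end()
    return total if total > 0 else default
```

For example, `parse_duration(os.environ.get("LOCALAI_BACKEND_PARENT_WATCH_INTERVAL"))` yields the polling interval in seconds, and 2.0 when the variable is unset.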

C++: implemented in backend/cpp/llama-cpp (the reported, most-used C++
backend) as a dependency-free header parent_watch.h, wired into
grpc-server.cpp's main() and copied at build time via prepare.sh. C++
backends have no shared server scaffolding, so other C++ backends
(ds4, ik-llama-cpp, privacy-filter, ...) are not yet covered and would
each need the same one-line include+call as follow-ups.

Python: implemented once in the shared common/parent_watch.py and armed
from common/grpc_auth.py's get_auth_interceptors() — the single helper
every one of the 35 Python backends invokes while building its gRPC
server — so all Python backends (and future ones) are covered with no
per-backend edits and no duplicated implementation.

Tests (real process-tree reparent detection, mirroring the Go test):
- backend/cpp/llama-cpp/parent_watch_test.cpp (via run-unit-tests.sh)
- backend/python/common/parent_watch_test.py (python -m unittest)

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Sonnet 5 <noreply@anthropic.com>

2026-07-02 19:16:48 +02:00

cpp

fix(process): give backend workers a parent-death safety net (#10639)

2026-07-02 19:16:48 +02:00

fix(cloud-proxy): parameter compatibility with newest reasoning models (#10640)

2026-07-02 19:15:43 +02:00

python

fix(process): give backend workers a parent-death safety net (#10639)

2026-07-02 19:16:48 +02:00

rust/kokoros

fix(kokoros): implement AudioTranscriptionLive trait stub (#10612)

2026-06-30 19:38:41 +02:00

backend.proto

feat(realtime): Semantic VAD EOU token (#10444)

2026-06-30 09:01:22 +02:00

Dockerfile.base-grpc-builder

ci: refactor llama-cpp variant Dockerfiles to consume prebuilt base-grpc images (PR 2/2) (#9738)

2026-05-10 00:03:52 +02:00

Dockerfile.ds4

feat: add ds4 backend (DeepSeek V4 Flash) with tool calls, thinking, KV cache (#9758)

2026-05-11 22:15:47 +02:00

Dockerfile.golang

feat(backends): add voice-detect + face-detect ggml backends (replace Python insightface/speaker-recognition) (#10441)

2026-06-28 09:29:08 +02:00

Dockerfile.ik-llama-cpp

ci: refactor llama-cpp variant Dockerfiles to consume prebuilt base-grpc images (PR 2/2) (#9738)

2026-05-10 00:03:52 +02:00

Dockerfile.llama-cpp

ci: refactor llama-cpp variant Dockerfiles to consume prebuilt base-grpc images (PR 2/2) (#9738)

2026-05-10 00:03:52 +02:00

Dockerfile.privacy-filter

feat(pii): NER tier engine — privacy-filter.cpp backend + NER-centric PII filter (#10360)

2026-06-18 11:45:22 +01:00

Dockerfile.python

feat(vulkan): make Vulkan backends self-contained on the GPU (#10404)

2026-06-19 17:16:33 +02:00

Dockerfile.rust

feat(ci): allow routing apt traffic through an alternate Ubuntu mirror (#9650)

2026-05-03 23:50:13 +02:00

Dockerfile.turboquant

feat(llama-cpp): bump to 1ec7ba0c, adapt grpc-server, expose new spec-decoding options (#9765)

2026-05-12 17:22:37 +02:00

index.yaml

feat(backends): add voice-detect + face-detect ggml backends (replace Python insightface/speaker-recognition) (#10441)

2026-06-28 09:29:08 +02:00

README.md

Remove HuggingFace backend support (#8971)

2026-03-13 01:09:30 +01:00

README.md

LocalAI Backend Architecture

This directory contains the core backend infrastructure for LocalAI, including the gRPC protocol definition, multi-language Dockerfiles, and language-specific backend implementations.

Overview

LocalAI uses a unified gRPC-based architecture that allows different programming languages to implement AI backends while maintaining consistent interfaces and capabilities. The backend system supports multiple hardware acceleration targets and provides a standardized way to integrate various AI models and frameworks.

Architecture Components

1. Protocol Definition (`backend.proto`)

The backend.proto file defines the gRPC service interface that all backends must implement. This ensures consistency across different language implementations and provides a contract for communication between LocalAI core and backend services.

Core Services

Text Generation: Predict, PredictStream for LLM inference
Embeddings: Embedding for text vectorization
Image Generation: GenerateImage for Stable Diffusion and other image models
Audio Processing: AudioTranscription, TTS, SoundGeneration
Video Generation: GenerateVideo for video synthesis
Object Detection: Detect for computer vision tasks
Vector Storage: StoresSet, StoresGet, StoresFind for RAG operations
Reranking: Rerank for document relevance scoring
Voice Activity Detection: VAD for audio segmentation

Key Message Types

PredictOptions: Comprehensive configuration for text generation
ModelOptions: Model loading and configuration parameters
Result: Standardized response format
StatusResponse: Backend health and memory usage information

2. Multi-Language Dockerfiles

The backend system provides language-specific Dockerfiles that handle the build environment and dependencies for different programming languages:

Dockerfile.python
Dockerfile.golang
Dockerfile.llama-cpp

3. Language-Specific Implementations

Python Backends (`python/`)

transformers: Hugging Face Transformers framework
vllm: High-performance LLM inference
mlx: Apple Silicon optimization
diffusers: Stable Diffusion models
Audio: coqui, faster-whisper, kitten-tts
Vision: mlx-vlm, rfdetr
Specialized: rerankers, chatterbox, kokoro

Go Backends (`go/`)

whisper: OpenAI Whisper speech recognition in Go, using the GGML C++ backend (whisper.cpp)
stablediffusion-ggml: Stable Diffusion in Go, using the GGML C++ backend
piper: Text-to-speech synthesis in Go with C bindings, using rhasspy/piper
local-store: Vector storage backend

C++ Backends (`cpp/`)

llama-cpp: Llama.cpp integration
grpc: gRPC utilities and helpers

Hardware Acceleration Support

CUDA (NVIDIA)

Versions: CUDA 12.x, 13.x
Features: cuBLAS, cuDNN, TensorRT optimization
Targets: x86_64, ARM64 (Jetson)

ROCm (AMD)

Features: HIP, rocBLAS, MIOpen
Targets: AMD GPUs with ROCm support

Intel

Features: oneAPI, Intel Extension for PyTorch
Targets: Intel GPUs, XPUs, CPUs

Vulkan

Features: Cross-platform GPU acceleration
Targets: Windows, Linux, Android, macOS

Apple Silicon

Features: MLX framework, Metal Performance Shaders
Targets: M1/M2/M3 Macs

Backend Registry (`index.yaml`)

The index.yaml file serves as a central registry for all available backends, providing:

Metadata: Name, description, license, icons
Capabilities: Hardware targets and optimization profiles
Tags: Categorization for discovery
URLs: Source code and documentation links

Building Backends

Prerequisites

Docker with multi-architecture support
Appropriate hardware drivers (CUDA, ROCm, etc.)
Build tools (make, cmake, compilers)

Build Commands

Example build commands using Docker:

# Build Python backend
docker build -f backend/Dockerfile.python \
  --build-arg BACKEND=transformers \
  --build-arg BUILD_TYPE=cublas12 \
  --build-arg CUDA_MAJOR_VERSION=12 \
  --build-arg CUDA_MINOR_VERSION=0 \
  -t localai-backend-transformers .

# Build Go backend
docker build -f backend/Dockerfile.golang \
  --build-arg BACKEND=whisper \
  --build-arg BUILD_TYPE=cpu \
  -t localai-backend-whisper .

# Build C++ backend
docker build -f backend/Dockerfile.llama-cpp \
  --build-arg BACKEND=llama-cpp \
  --build-arg BUILD_TYPE=cublas12 \
  -t localai-backend-llama-cpp .

For ARM64/Mac builds, Docker cannot be used; build with the Makefile in the respective backend directory instead.

Build Types

cpu: CPU-only optimization
cublas12, cublas13: CUDA 12.x, 13.x with cuBLAS
hipblas: ROCm with rocBLAS
intel: Intel oneAPI optimization
vulkan: Vulkan-based acceleration
metal: Apple Metal optimization
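
To make the relationship between Dockerfile, backend name, and build type explicit, here is a small Python sketch that assembles the same `docker build` invocations as the examples above. The helper name `docker_build_cmd` and its parameters are hypothetical and not part of LocalAI; the Dockerfile naming, build arguments, and image tag pattern are taken from the example commands, and the accepted build types from the list above.

```python
BUILD_TYPES = {"cpu", "cublas12", "cublas13", "hipblas", "intel", "vulkan", "metal"}


def docker_build_cmd(language, backend, build_type="cpu", cuda_version=None):
    """Return the argv list for building one backend image.

    language:     Dockerfile suffix, e.g. "python", "golang", "llama-cpp"
    backend:      value for the BACKEND build argument
    build_type:   one of BUILD_TYPES
    cuda_version: optional (major, minor) tuple for CUDA builds
    """
    if build_type not in BUILD_TYPES:
        raise ValueError(f"unknown build type: {build_type!r}")

    build_args = {"BACKEND": backend, "BUILD_TYPE": build_type}
    if cuda_version is not None:
        major, minor = cuda_version
        build_args["CUDA_MAJOR_VERSION"] = str(major)
        build_args["CUDA_MINOR_VERSION"] = str(minor)

    cmd = ["docker", "build", "-f", f"backend/Dockerfile.{language}"]
    for key, value in build_args.items():
        cmd += ["--build-arg", f"{key}={value}"]
    cmd += ["-t", f"localai-backend-{backend}", "."]
    return cmd
```

Calling `docker_build_cmd("python", "transformers", "cublas12", cuda_version=(12, 0))` reproduces the first example command as an argument list that can be handed to `subprocess.run`.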

Backend Development

Creating a New Backend

1. Choose Language: Select Python, Go, or C++ based on requirements
2. Implement Interface: Implement the gRPC service defined in backend.proto
3. Add Dependencies: Create appropriate requirements files
4. Configure Build: Set up Dockerfile and build scripts
5. Register Backend: Add entry to index.yaml
6. Test Integration: Verify gRPC communication and functionality

Backend Structure

backend-name/
├── backend.py/go/cpp    # Main implementation
├── requirements.txt     # Dependencies
├── Dockerfile           # Build configuration
├── install.sh           # Installation script
├── run.sh               # Execution script
├── test.sh              # Test script
└── README.md            # Backend documentation

Required gRPC Methods

At minimum, backends must implement:

Health() - Service health check
LoadModel() - Model loading and initialization
Predict() - Main inference endpoint
Status() - Backend status and metrics

Integration with LocalAI Core

Backends communicate with LocalAI core through gRPC:

Service Discovery: Core discovers available backends
Model Loading: Core requests model loading via LoadModel
Inference: Core sends requests via Predict or specialized endpoints
Streaming: Core handles streaming responses for real-time generation
Monitoring: Core tracks backend health and performance

Performance Optimization

Memory Management

Model Caching: Efficient model loading and caching
Batch Processing: Optimize for multiple concurrent requests
Memory Pinning: GPU memory optimization for CUDA/ROCm

Hardware Utilization

Multi-GPU: Support for tensor parallelism
Mixed Precision: FP16/BF16 for memory efficiency
Kernel Fusion: Optimized CUDA/ROCm kernels

Troubleshooting

Common Issues

gRPC Connection: Verify the backend service is running and accessible
Model Loading: Check model paths and dependencies
Hardware Detection: Ensure appropriate drivers and libraries
Memory Issues: Monitor GPU memory usage and model sizes

Contributing

When contributing to the backend system:

Follow Protocol: Implement the exact gRPC interface
Add Tests: Include comprehensive test coverage
Document: Provide clear usage examples
Optimize: Consider performance and resource usage
Validate: Test across different hardware targets
