* fix(vllm): switch L4T13 backend to PyPI aarch64+cu130 wheels The L4T13 vllm backend pulled torch / torchvision / torchaudio / vllm from pypi.jetson-ai-lab.io's sbsa/cu130 mirror via [tool.uv.sources] with no version pins. That mirror started shipping torch 2.11.0 next to a vllm-0.20.0+cu130 wheel that was still compiled against torch 2.10's c10 ABI, so uv landed on the mismatched pair and vllm crashed at import: ImportError: vllm/_C.abi3.so: undefined symbol: _ZN3c1013MessageLoggerC1EPKciib (c10::MessageLogger's constructor signature changed between torch 2.10 and 2.11; the vllm wheel referenced the 2.10 form, the installed libc10.so exported only the 2.11 form.) Since torch 2.11 (April 2026) PyPI publishes its own aarch64 + cu130 manylinux wheels, and vllm 0.20.0 ships an aarch64 wheel whose Requires- Dist locks torch==2.11.0 / torchvision==0.26.0 / torchaudio==2.11.0. That makes uv's resolver produce an ABI-consistent set automatically, so the mirror and the [tool.uv.sources] pinning are no longer needed. flash-attn is dropped from the dep list: PyPI has no aarch64 wheel, but vLLM 0.20+ already bundles its own vllm_flash_attn (fa2 + fa3) inside the main wheel, so the Dao-AILab package isn't required at runtime. Reference: https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/ Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] [WebFetch] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(vllm): retire l4t13 pyproject.toml in favor of requirements-*.txt pyproject.toml only existed because uv pip install -r requirements.txt doesn't honor [tool.uv.sources]. The previous commit dropped [tool.uv. sources] (PyPI now serves the aarch64 + cu130 wheels directly), so the file no longer carries any logic the requirements-*.txt path can't. Replace with the same two-file pattern every other build profile uses: - requirements-l4t13.txt (accelerate / torch / transformers / bitsandbytes - matches cublas13's split) - requirements-l4t13-after.txt (vllm; runs after the base resolve so the cu130 torch wheel lands first) install.sh's whole l4t13 elif branch goes away; libbackend.sh's installRequirements already handles the requirements-install.txt build- deps pass, the C_INCLUDE_PATH export for PORTABLE_PYTHON, and the runProtogen call, so falling through to the standard else: branch produces identical install behavior with less surface area. No functional change at install time - same wheels, same order. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(sglang,vllm-omni): switch L4T13 backends to PyPI aarch64+cu130 wheels Same root cause and same fix as the vllm backend in the previous commits: the L4T13 sglang and vllm-omni backends both pulled their accelerator stack from pypi.jetson-ai-lab.io's sbsa/cu130 mirror with no version pins, so they would silently land on the same torch 2.11 vs cu130-built wheel ABI mismatch the moment the mirror published an out-of-sync pair. sglang ------ - Drop pyproject.toml + [tool.uv.sources]. The historical comment said the [all] extra was unsafe on aarch64 because of decord, but sglang 0.5.x now uses `decord2` on aarch64/arm/armv7l (which ships cp312 aarch64 wheels), so we can match cublas13's sglang[all]>=0.5.11 pin and stop being capped at the 0.5.1.post2 the L4T mirror shipped. That unblocks Gemma 4 / MTP recipes on Jetson Thor. - New requirements-l4t13.txt mirrors the cublas13 split (accelerate / torch / torchvision / torchaudio / transformers), requirements-l4t13- after.txt carries sglang[all]>=0.5.11. - install.sh's l4t13 elif branch goes away; falls through to the standard installRequirements path. vllm-omni --------- - requirements-l4t13.txt drops --extra-index-url to jetson-ai-lab and drops flash-attn (PyPI has no aarch64 wheel, vLLM 0.20+ bundles its own vllm_flash_attn fa2 + fa3 internally). - install.sh's l4t13 vllm-install branch collapses into the cublas13 branch since both now just run `pip install vllm --torch-backend=auto` against PyPI. - --index-strategy=unsafe-best-match is dropped from the top-level l4t13 guard; without the L4T mirror in the picture it had no purpose. The from-source vllm-omni install on top still keeps its existing `sed -i '/^fa3-fwd[[:space:]]*==/d' requirements/cuda.txt` workaround - fa3-fwd has no aarch64 wheel and no sdist, unrelated to flash-attn. Reference: https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/ Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] [WebFetch] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(sglang): drop [all] extra on l4t13 - xatlas has no aarch64 wheel CI revealed that sglang[all]==0.5.12 transitively pulls xatlas via the [diffusion] sub-extra, and xatlas ships no aarch64 wheel. Its sdist depends on scikit_build_core without declaring it in build-system. requires, so under --no-build-isolation uv can't build it from source: × Failed to build `xatlas==0.0.11` ├─▶ The build backend returned an error ╰─▶ Call to `scikit_build_core.build.build_wheel` failed (exit status: 1) ModuleNotFoundError: No module named 'scikit_build_core' help: `xatlas` (v0.0.11) was included because `sglang[all]` (v0.5.12) depends on `xatlas` Upstream sglang explicitly gates st_attn and vsa on `platform_machine != aarch64` inside the same [diffusion] extra but forgot xatlas - same class of bug that bit the old decord pin. Use plain `sglang>=0.5.11` on l4t13. backend.py imports only base sglang.srt symbols (Engine, ServerArgs, FunctionCallParser, ReasoningParser); the [all] extras are optional accelerators not required at import time. cublas13 (x86_64) keeps [all] because xatlas has x86_64 wheels there. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
LocalAI Backend Architecture
This directory contains the core backend infrastructure for LocalAI, including the gRPC protocol definition, multi-language Dockerfiles, and language-specific backend implementations.
Overview
LocalAI uses a unified gRPC-based architecture that allows different programming languages to implement AI backends while maintaining consistent interfaces and capabilities. The backend system supports multiple hardware acceleration targets and provides a standardized way to integrate various AI models and frameworks.
Architecture Components
1. Protocol Definition (backend.proto)
The backend.proto file defines the gRPC service interface that all backends must implement. This ensures consistency across different language implementations and provides a contract for communication between LocalAI core and backend services.
Core Services
- Text Generation:
Predict,PredictStreamfor LLM inference - Embeddings:
Embeddingfor text vectorization - Image Generation:
GenerateImagefor stable diffusion and image models - Audio Processing:
AudioTranscription,TTS,SoundGeneration - Video Generation:
GenerateVideofor video synthesis - Object Detection:
Detectfor computer vision tasks - Vector Storage:
StoresSet,StoresGet,StoresFindfor RAG operations - Reranking:
Rerankfor document relevance scoring - Voice Activity Detection:
VADfor audio segmentation
Key Message Types
PredictOptions: Comprehensive configuration for text generationModelOptions: Model loading and configuration parametersResult: Standardized response formatStatusResponse: Backend health and memory usage information
2. Multi-Language Dockerfiles
The backend system provides language-specific Dockerfiles that handle the build environment and dependencies for different programming languages:
Dockerfile.pythonDockerfile.golangDockerfile.llama-cpp
3. Language-Specific Implementations
Python Backends (python/)
- transformers: Hugging Face Transformers framework
- vllm: High-performance LLM inference
- mlx: Apple Silicon optimization
- diffusers: Stable Diffusion models
- Audio: coqui, faster-whisper, kitten-tts
- Vision: mlx-vlm, rfdetr
- Specialized: rerankers, chatterbox, kokoro
Go Backends (go/)
- whisper: OpenAI Whisper speech recognition in Go with GGML cpp backend (whisper.cpp)
- stablediffusion-ggml: Stable Diffusion in Go with GGML Cpp backend
- piper: Text-to-speech synthesis Golang with C bindings using rhaspy/piper
- local-store: Vector storage backend
C++ Backends (cpp/)
- llama-cpp: Llama.cpp integration
- grpc: GRPC utilities and helpers
Hardware Acceleration Support
CUDA (NVIDIA)
- Versions: CUDA 12.x, 13.x
- Features: cuBLAS, cuDNN, TensorRT optimization
- Targets: x86_64, ARM64 (Jetson)
ROCm (AMD)
- Features: HIP, rocBLAS, MIOpen
- Targets: AMD GPUs with ROCm support
Intel
- Features: oneAPI, Intel Extension for PyTorch
- Targets: Intel GPUs, XPUs, CPUs
Vulkan
- Features: Cross-platform GPU acceleration
- Targets: Windows, Linux, Android, macOS
Apple Silicon
- Features: MLX framework, Metal Performance Shaders
- Targets: M1/M2/M3 Macs
Backend Registry (index.yaml)
The index.yaml file serves as a central registry for all available backends, providing:
- Metadata: Name, description, license, icons
- Capabilities: Hardware targets and optimization profiles
- Tags: Categorization for discovery
- URLs: Source code and documentation links
Building Backends
Prerequisites
- Docker with multi-architecture support
- Appropriate hardware drivers (CUDA, ROCm, etc.)
- Build tools (make, cmake, compilers)
Build Commands
Example of build commands with Docker
# Build Python backend
docker build -f backend/Dockerfile.python \
--build-arg BACKEND=transformers \
--build-arg BUILD_TYPE=cublas12 \
--build-arg CUDA_MAJOR_VERSION=12 \
--build-arg CUDA_MINOR_VERSION=0 \
-t localai-backend-transformers .
# Build Go backend
docker build -f backend/Dockerfile.golang \
--build-arg BACKEND=whisper \
--build-arg BUILD_TYPE=cpu \
-t localai-backend-whisper .
# Build C++ backend
docker build -f backend/Dockerfile.llama-cpp \
--build-arg BACKEND=llama-cpp \
--build-arg BUILD_TYPE=cublas12 \
-t localai-backend-llama-cpp .
For ARM64/Mac builds, docker can't be used, and the makefile in the respective backend has to be used.
Build Types
cpu: CPU-only optimizationcublas12,cublas13: CUDA 12.x, 13.x with cuBLAShipblas: ROCm with rocBLASintel: Intel oneAPI optimizationvulkan: Vulkan-based accelerationmetal: Apple Metal optimization
Backend Development
Creating a New Backend
- Choose Language: Select Python, Go, or C++ based on requirements
- Implement Interface: Implement the gRPC service defined in
backend.proto - Add Dependencies: Create appropriate requirements files
- Configure Build: Set up Dockerfile and build scripts
- Register Backend: Add entry to
index.yaml - Test Integration: Verify gRPC communication and functionality
Backend Structure
backend-name/
├── backend.py/go/cpp # Main implementation
├── requirements.txt # Dependencies
├── Dockerfile # Build configuration
├── install.sh # Installation script
├── run.sh # Execution script
├── test.sh # Test script
└── README.md # Backend documentation
Required gRPC Methods
At minimum, backends must implement:
Health()- Service health checkLoadModel()- Model loading and initializationPredict()- Main inference endpointStatus()- Backend status and metrics
Integration with LocalAI Core
Backends communicate with LocalAI core through gRPC:
- Service Discovery: Core discovers available backends
- Model Loading: Core requests model loading via
LoadModel - Inference: Core sends requests via
Predictor specialized endpoints - Streaming: Core handles streaming responses for real-time generation
- Monitoring: Core tracks backend health and performance
Performance Optimization
Memory Management
- Model Caching: Efficient model loading and caching
- Batch Processing: Optimize for multiple concurrent requests
- Memory Pinning: GPU memory optimization for CUDA/ROCm
Hardware Utilization
- Multi-GPU: Support for tensor parallelism
- Mixed Precision: FP16/BF16 for memory efficiency
- Kernel Fusion: Optimized CUDA/ROCm kernels
Troubleshooting
Common Issues
- GRPC Connection: Verify backend service is running and accessible
- Model Loading: Check model paths and dependencies
- Hardware Detection: Ensure appropriate drivers and libraries
- Memory Issues: Monitor GPU memory usage and model sizes
Contributing
When contributing to the backend system:
- Follow Protocol: Implement the exact gRPC interface
- Add Tests: Include comprehensive test coverage
- Document: Provide clear usage examples
- Optimize: Consider performance and resource usage
- Validate: Test across different hardware targets