Mirror of https://github.com/mudler/LocalAI.git (synced 2026-05-17 04:56:52 -04:00)
Bring the sglang Python backend up to feature parity with vllm by adding
the same engine_args: map plumbing the vLLM backend already has. Any
ServerArgs field (~380 in sglang 0.5.11) becomes settable from a model
YAML, including the speculative-decoding flags needed for Multi-Token
Prediction. Validation matches the vllm backend's: keys are checked
against dataclasses.fields(ServerArgs), unknown keys raise ValueError
with a difflib close-match suggestion at LoadModel time, and the typed
ModelOptions fields keep their existing meaning with engine_args
overriding them.
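For example, a model YAML can now set any ServerArgs field under
engine_args. A sketch (model and entry names are placeholders; the
speculative field names follow sglang 0.5.x's ServerArgs and should be
checked against the version you run):

```yaml
name: my-sglang-model              # illustrative
backend: sglang
parameters:
  model: some-org/some-model       # illustrative
engine_args:
  mem_fraction_static: 0.8         # overrides the typed default
  speculative_algorithm: EAGLE     # speculative decoding / MTP
  speculative_num_draft_tokens: 4
```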
Backend code:
* backend/python/sglang/backend.py: add _apply_engine_args, import
dataclasses/difflib/ServerArgs, call from LoadModel; rename Seed ->
sampling_seed (sglang 0.5.11 renamed the SamplingParams field).
* backend/python/sglang/test.py + test.sh + Makefile: six unit tests
exercising the helper directly (no engine load required).
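A minimal sketch of that validation logic (ServerArgs here is a tiny
stand-in dataclass; the real helper, _apply_engine_args, operates on
sglang's ServerArgs with its ~380 fields):

```python
import dataclasses
import difflib

# Stand-in for sglang.srt.server_args.ServerArgs; only a few fields are
# reproduced here for illustration.
@dataclasses.dataclass
class ServerArgs:
    model_path: str = ""
    mem_fraction_static: float = 0.85
    speculative_algorithm: str = ""

def apply_engine_args(server_args: ServerArgs, engine_args: dict) -> ServerArgs:
    """Overlay user-supplied engine_args onto the typed ServerArgs fields,
    rejecting unknown keys with a difflib close-match suggestion."""
    valid = {f.name for f in dataclasses.fields(server_args)}
    for key, value in engine_args.items():
        if key not in valid:
            hint = difflib.get_close_matches(key, valid, n=1)
            suggestion = f"; did you mean '{hint[0]}'?" if hint else ""
            raise ValueError(f"unknown engine_args key '{key}'{suggestion}")
        # engine_args take precedence over the typed ModelOptions fields
        setattr(server_args, key, value)
    return server_args
```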
Build / CI / backend gallery (cuda13 + l4t13 paths are now first-class):
* backend/python/sglang/install.sh: add --prerelease=allow because
sglang 0.5.11 hard-pins flash-attn-4 which only ships beta wheels;
add --index-strategy=unsafe-best-match for cublas12 so the cu128
torch index wins over default-PyPI's cu130; new pyproject.toml-driven
l4t13 install path so [tool.uv.sources] can pin torch/torchvision/
torchaudio/sglang to the jetson-ai-lab index without forcing every
transitive PyPI dep through the L4T mirror's flaky proxy (mirrors the
equivalent fix in backend/python/vllm/install.sh).
* backend/python/sglang/pyproject.toml (new): L4T project spec with
explicit-source jetson-ai-lab index. Replaces requirements-l4t13.txt
for the l4t13 BUILD_PROFILE; other profiles still go through the
requirements-*.txt pipeline via libbackend.sh's installRequirements.
* backend/python/sglang/requirements-l4t13.txt: removed; superseded
by pyproject.toml.
* backend/python/sglang/requirements-cublas{12,13}{,-after}.txt: pin
sglang>=0.5.11 (Gemma 4 floor); add cu130 torch index for cublas13
(new files) and cu128 torch index for cublas12 (default PyPI now
ships cu130 torch wheels by default and breaks cu12 hosts).
* backend/index.yaml: add cuda13-sglang and cuda13-sglang-development
capability mappings + image entries pointing at
quay.io/.../-gpu-nvidia-cuda-13-sglang.
* .github/workflows/backend.yml: new cublas13 sglang matrix entry,
mirroring vllm's cuda13 build.
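As a sketch, the cublas12 pin described above might look like the
fragment below (the cu128 index URL follows PyTorch's usual wheel-index
layout and is an assumption here, not a quote of the actual file):

```text
# requirements-cublas12.txt (illustrative sketch, not the actual file)
--extra-index-url https://download.pytorch.org/whl/cu128
sglang>=0.5.11
torch
```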
Model gallery + docs:
* gallery/sglang.yaml: base sglang config template, mirrors vllm.yaml.
* gallery/sglang-gemma-4-{e2b,e4b}-mtp.yaml: Gemma 4 MTP demos
transcribed verbatim from the SGLang Gemma 4 cookbook MTP commands.
* gallery/sglang-mimo-7b-mtp.yaml: MiMo-7B-RL with built-in MTP heads
+ online fp8 weight quantization, verified end-to-end on a 16 GB
RTX 5070 Ti at ~88 tok/s. Uses mem_fraction_static: 0.7 because the
MTP draft worker's vocab embedding is loaded unquantised and OOMs
the static reservation at sglang's 0.85 default.
* gallery/index.yaml: three new entries (gemma-4-e2b-it:sglang-mtp,
gemma-4-e4b-it:sglang-mtp, mimo-7b-mtp:sglang).
* docs/content/features/text-generation.md: new SGLang section with
setup, engine_args reference, MTP demos, version requirements.
* .agents/sglang-backend.md (new): agent one-pager covering the flat
ServerArgs structure, the typed-vs-engine_args precedence, the
speculative-decoding cheatsheet, and the mem_fraction_static gotcha
documented above.
* AGENTS.md: index entry for the new agent doc.
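Pulling these pieces together, a gallery entry of the shape described
above might look like this sketch (only the backend, model name, and
mem_fraction_static value come from the description; the org prefix and
quantization field name are assumptions to be checked against
gallery/sglang-mimo-7b-mtp.yaml):

```yaml
name: mimo-7b-mtp
backend: sglang
parameters:
  model: XiaomiMiMo/MiMo-7B-RL    # assumed HF repo id
engine_args:
  quantization: fp8               # online fp8 weight quantization (field name assumed)
  mem_fraction_static: 0.7        # below sglang's 0.85 default: the MTP draft
                                  # worker's unquantised vocab embedding OOMs otherwise
```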
Known limitation: the two Gemma 4 MTP gallery entries ship a recipe
that doesn't yet run on stock libraries. The drafter checkpoints
(google/gemma-4-{E2B,E4B}-it-assistant) declare
model_type: gemma4_assistant / Gemma4AssistantForCausalLM, which
neither transformers (<=5.6.0, including the SGLang cookbook's pinned
commit 91b1ab1f... and main HEAD) nor sglang's own model registry
(<=0.5.11) registers as of 2026-05-06. They will start working when
HF or sglang upstream registers the architecture -- no LocalAI
changes needed. The MiMo MTP demo and the non-MTP Gemma 4 paths work
today on this build (verified on RTX 5070 Ti, 16 GB).
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] [WebFetch] [WebSearch]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
# Python Backends for LocalAI

This directory contains Python-based AI backends for LocalAI, providing support for various AI models and hardware acceleration targets.

## Overview

The Python backends use a unified build system based on `libbackend.sh` that provides:

- Automatic virtual environment management with support for both `uv` and `pip`
- Hardware-specific dependency installation (CPU, CUDA, Intel, MLX, etc.)
- Portable Python support for standalone deployments
- Consistent backend execution across different environments
## Available Backends

### Core AI Models

- **transformers** - Hugging Face Transformers framework (PyTorch-based)
- **vllm** - High-performance LLM inference engine
- **mlx** - Apple Silicon optimized ML framework

### Audio & Speech

- **coqui** - Coqui TTS models
- **faster-whisper** - Fast Whisper speech recognition
- **kitten-tts** - Lightweight TTS
- **mlx-audio** - Apple Silicon audio processing
- **chatterbox** - TTS model
- **kokoro** - TTS models

### Computer Vision

- **diffusers** - Stable Diffusion and image generation
- **mlx-vlm** - Vision-language models for Apple Silicon
- **rfdetr** - Object detection models

### Specialized

- **rerankers** - Text reranking models
## Quick Start

### Prerequisites

- Python 3.10+ (default: 3.10.18)
- `uv` package manager (recommended) or `pip`
- Appropriate hardware drivers for your target (CUDA, Intel, etc.)

### Installation

Each backend can be installed individually:

```bash
# Navigate to a specific backend
cd backend/python/transformers

# Install dependencies
make transformers
# or
bash install.sh

# Run the backend
make run
# or
bash run.sh
```
## Using the Unified Build System

The `libbackend.sh` script provides consistent commands across all backends:

```bash
# Source the library in your backend script
source $(dirname $0)/../common/libbackend.sh

# Install requirements (automatically handles hardware detection)
installRequirements

# Start the backend server
startBackend $@

# Run tests
runUnittests
```
## Hardware Targets

The build system automatically detects and configures for different hardware:

- **CPU** - Standard CPU-only builds
- **CUDA** - NVIDIA GPU acceleration (supports CUDA 12/13)
- **Intel** - Intel XPU/GPU optimization
- **MLX** - Apple Silicon (M1/M2/M3) optimization
- **HIP** - AMD GPU acceleration

### Target-Specific Requirements

Backends can specify hardware-specific dependencies:

- `requirements.txt` - Base requirements
- `requirements-cpu.txt` - CPU-specific packages
- `requirements-cublas12.txt` - CUDA 12 packages
- `requirements-cublas13.txt` - CUDA 13 packages
- `requirements-intel.txt` - Intel-optimized packages
- `requirements-mps.txt` - Apple Silicon packages
## Configuration Options

### Environment Variables

- `PYTHON_VERSION` - Python version (default: 3.10)
- `PYTHON_PATCH` - Python patch version (default: 18)
- `BUILD_TYPE` - Force specific build target
- `USE_PIP` - Use pip instead of uv (default: false)
- `PORTABLE_PYTHON` - Enable portable Python builds
- `LIMIT_TARGETS` - Restrict backend to specific targets

### Example: CUDA 12 Only Backend

```bash
# In your backend script
LIMIT_TARGETS="cublas12"
source $(dirname $0)/../common/libbackend.sh
```

### Example: Intel-Optimized Backend

```bash
# In your backend script
LIMIT_TARGETS="intel"
source $(dirname $0)/../common/libbackend.sh
```
## Development

### Adding a New Backend

1. Create a new directory in `backend/python/`
2. Copy the template structure from `common/template/`
3. Implement your `backend.py` with the required gRPC interface
4. Add appropriate requirements files for your target hardware
5. Use `libbackend.sh` for consistent build and execution

### Testing

```bash
# Run backend tests
make test
# or
bash test.sh
```

### Building

```bash
# Install dependencies
make <backend-name>

# Clean build artifacts
make clean
```
## Architecture

Each backend follows a consistent structure:

```
backend-name/
├── backend.py           # Main backend implementation
├── requirements.txt     # Base dependencies
├── requirements-*.txt   # Hardware-specific dependencies
├── install.sh           # Installation script
├── run.sh               # Execution script
├── test.sh              # Test script
├── Makefile             # Build targets
└── test.py              # Unit tests
```
## Troubleshooting

### Common Issues

- **Missing dependencies**: Ensure all requirements files are properly configured
- **Hardware detection**: Check that `BUILD_TYPE` matches your system
- **Python version**: Verify Python 3.10+ is available
- **Virtual environment**: Use `ensureVenv` to create/activate environments
## Contributing

When adding new backends or modifying existing ones:

1. Follow the established directory structure
2. Use `libbackend.sh` for consistent behavior
3. Include appropriate requirements files for all target hardware
4. Add comprehensive tests
5. Update this README if adding new backend types