Files
LocalAI/backend/python
LocalAI [bot] 3a87d9e48f feat(vllm): macOS/Metal support via vllm-metal (MLX) (#10489)
* feat(vllm): macOS/Metal support via vllm-metal (MLX)

Add an additive Apple-Silicon path to the existing vllm Python backend so
vLLM runs on macOS via vllm-metal (github.com/vllm-project/vllm-metal).

Spike outcome (proven on a real M4 / macOS 26.5, Qwen3-0.6B):
- vllm-metal registers through vLLM's platform-plugin entry point
  (metal -> vllm_metal:register); MetalPlatform activates and runs on the
  GPU through MLX.
- LocalAI's backend.py is UNCHANGED: AsyncEngineArgs(...) ->
  AsyncLLMEngine.from_engine_args transparently resolves to vLLM 0.23's v1
  AsyncLLM MLX engine, and async generate produced correct output.
- backend.py is NOT touched: its only empty_cache() call is CUDA-only
  (guarded by torch.cuda.is_available()), so the benign shutdown-only
  "Allocator for mps is not a DeviceAllocator" noise comes from vLLM's
  internal EngineCore teardown, not from our code.

Changes (all gated behind a darwin condition; Linux/CUDA/ROCm/Intel paths
are byte-for-byte unchanged):
- install.sh: darwin branch forces PYTHON_VERSION=3.12 (vllm-metal
  requirement), creates/activates LocalAI's managed venv via ensureVenv,
  then reproduces vllm-metal's installer INTO that venv (build vLLM 0.23.0
  from the release source tarball against requirements/cpu.txt, then install
  the prebuilt vllm-metal wheel from its latest GitHub release), and runs
  runProtogen. installRequirements is skipped on darwin.
- backend-matrix.yml: add a vllm includeDarwin entry (mps, python).
- index.yaml: add metal capability + concrete metal-vllm /
  metal-vllm-development child entries mirroring the metal-kitten-tts
  template.

Version coupling: vllm-metal pins vLLM 0.23.0, equal to LocalAI's current
vllm pin. Bumping vllm must be coordinated with a supporting vllm-metal
release; documented in install.sh and requirements-cublas13-after.txt.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* chore(vllm): track the darwin vllm-metal pin via the autobumper

The Apple Silicon build pinned vLLM 0.23.0 as a hidden string in install.sh
while floating the vllm-metal wheel on releases/latest - the two could drift
apart silently. Make both a tracked, reproducible pair (VLLM_METAL_VERSION +
VLLM_VERSION), fetch the wheel by tag, and add .github/bump_vllm_metal.sh wired
into bump_deps.yaml. It tracks vllm-project/vllm-metal (not vllm/vllm latest),
reading the coupled vLLM source version from vllm-metal's own installer, and
opens a bump PR - mirroring the existing bump_vllm_wheel.sh for the cu130 wheel.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* chore(vllm): derive the darwin vLLM version, drop the second pin

Follow-up: VLLM_VERSION was still a hardcoded string duplicating what
VLLM_METAL_VERSION already determines. Derive it at install time from
vllm-metal's own installer (vllm_v=) at the pinned tag - one source of truth,
no second value to drift. The bumper now touches only VLLM_METAL_VERSION;
the derivation is immutable per tag, so builds stay reproducible.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* fix(vllm): fetch the vllm-metal wheel without the GitHub API

The darwin build resolved the wheel URL via api.github.com, whose
unauthenticated rate limit (60/hr per IP) 403s on shared macOS runners
(observed after the 9-min vLLM source build). Construct the release-asset
download URL deterministically from the pinned tag and the cp312/arm64 wheel
name instead - no API call, no rate limit. Verified the URL resolves (200).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* fix(vllm): fail Score cleanly when the engine returns no prompt_logprobs

Audit of the Score path against vllm-metal (MLX on macOS): the engine accepts
SamplingParams(prompt_logprobs=1) but returns an all-None prompt_logprobs list
rather than computing it, so scoring is not supported there. The old guard
treated the truthy [None] list as valid and silently scored every candidate as
0. Detect the all-None case and return UNIMPLEMENTED instead. No-op on
Linux/CUDA, which populate real entries.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-25 15:46:19 +02:00
..
2026-04-12 08:51:30 +02:00
2026-04-12 08:51:30 +02:00
2026-04-12 08:51:30 +02:00
2026-04-12 08:51:30 +02:00
2026-04-12 08:51:30 +02:00
2026-04-12 08:51:30 +02:00

Python Backends for LocalAI

This directory contains Python-based AI backends for LocalAI, providing support for various AI models and hardware acceleration targets.

Overview

The Python backends use a unified build system based on libbackend.sh that provides:

  • Automatic virtual environment management with support for both uv and pip
  • Hardware-specific dependency installation (CPU, CUDA, Intel, MLX, etc.)
  • Portable Python support for standalone deployments
  • Consistent backend execution across different environments

Available Backends

Core AI Models

  • transformers - Hugging Face Transformers framework (PyTorch-based)
  • vllm - High-performance LLM inference engine
  • mlx - Apple Silicon optimized ML framework

Audio & Speech

  • coqui - Coqui TTS models
  • faster-whisper - Fast Whisper speech recognition
  • kitten-tts - Lightweight TTS
  • mlx-audio - Apple Silicon audio processing
  • chatterbox - TTS model
  • kokoro - TTS models

Computer Vision

  • diffusers - Stable Diffusion and image generation
  • mlx-vlm - Vision-language models for Apple Silicon
  • rfdetr - Object detection models

Specialized

  • rerankers - Text reranking models

Quick Start

Prerequisites

  • Python 3.10+ (default: 3.10.18)
  • uv package manager (recommended) or pip
  • Appropriate hardware drivers for your target (CUDA, Intel, etc.)

Installation

Each backend can be installed individually:

# Navigate to a specific backend
cd backend/python/transformers

# Install dependencies
make transformers
# or
bash install.sh

# Run the backend
make run
# or
bash run.sh

Using the Unified Build System

The libbackend.sh script provides consistent commands across all backends:

# Source the library in your backend script
source $(dirname $0)/../common/libbackend.sh

# Install requirements (automatically handles hardware detection)
installRequirements

# Start the backend server
startBackend $@

# Run tests
runUnittests

Hardware Targets

The build system automatically detects and configures for different hardware:

  • CPU - Standard CPU-only builds
  • CUDA - NVIDIA GPU acceleration (supports CUDA 12/13)
  • Intel - Intel XPU/GPU optimization
  • MLX - Apple Silicon (M1/M2/M3) optimization
  • HIP - AMD GPU acceleration

Target-Specific Requirements

Backends can specify hardware-specific dependencies:

  • requirements.txt - Base requirements
  • requirements-cpu.txt - CPU-specific packages
  • requirements-cublas12.txt - CUDA 12 packages
  • requirements-cublas13.txt - CUDA 13 packages
  • requirements-intel.txt - Intel-optimized packages
  • requirements-mps.txt - Apple Silicon packages

Configuration Options

Environment Variables

  • PYTHON_VERSION - Python version (default: 3.10)
  • PYTHON_PATCH - Python patch version (default: 18)
  • BUILD_TYPE - Force specific build target
  • USE_PIP - Use pip instead of uv (default: false)
  • PORTABLE_PYTHON - Enable portable Python builds
  • LIMIT_TARGETS - Restrict backend to specific targets

Example: CUDA 12 Only Backend

# In your backend script
LIMIT_TARGETS="cublas12"
source $(dirname $0)/../common/libbackend.sh

Example: Intel-Optimized Backend

# In your backend script
LIMIT_TARGETS="intel"
source $(dirname $0)/../common/libbackend.sh

Development

Adding a New Backend

  1. Create a new directory in backend/python/
  2. Copy the template structure from common/template/
  3. Implement your backend.py with the required gRPC interface
  4. Add appropriate requirements files for your target hardware
  5. Use libbackend.sh for consistent build and execution

Testing

# Run backend tests
make test
# or
bash test.sh

Building

# Install dependencies
make <backend-name>

# Clean build artifacts
make clean

Architecture

Each backend follows a consistent structure:

backend-name/
├── backend.py          # Main backend implementation
├── requirements.txt    # Base dependencies
├── requirements-*.txt  # Hardware-specific dependencies
├── install.sh         # Installation script
├── run.sh            # Execution script
├── test.sh           # Test script
├── Makefile          # Build targets
└── test.py           # Unit tests

Troubleshooting

Common Issues

  1. Missing dependencies: Ensure all requirements files are properly configured
  2. Hardware detection: Check that BUILD_TYPE matches your system
  3. Python version: Verify Python 3.10+ is available
  4. Virtual environment: Use ensureVenv to create/activate environments

Contributing

When adding new backends or modifying existing ones:

  1. Follow the established directory structure
  2. Use libbackend.sh for consistent behavior
  3. Include appropriate requirements files for all target hardware
  4. Add comprehensive tests
  5. Update this README if adding new backend types