LocalAI/docs/content/reference/compatibility-table.md
Ettore Di Giacinto 0e7c0adee4 docs: document tool calling on vLLM and MLX backends
openai-functions.md used to claim LocalAI tool calling worked only on
llama.cpp-compatible models. That was true when it was written; it's
not true now — vLLM (since PR #9328) and MLX/MLX-VLM both extract
structured tool calls from model output.

- openai-functions.md: new 'Supported backends' matrix covering
  llama.cpp, vllm, vllm-omni, mlx, mlx-vlm, with the key distinction
  that vllm needs an explicit tool_parser: option while mlx auto-
  detects from the chat template. Reasoning content (think tags) is
  extracted on the same set of backends. Added setup snippets for
  both the vllm and mlx paths, and noted the gallery importer
  pre-fills tool_parser:/reasoning_parser: for known families.
- compatibility-table.md: fix the stale 'Streaming: no' for vllm,
  vllm-omni, mlx, mlx-vlm (all four support streaming now). Add
  'Functions' to their capabilities. Also widen the MLX Acceleration
  column to reflect the CPU/CUDA/Jetson L4T backends that already
  exist in backend/index.yaml — 'Metal' on its own was misleading.
2026-04-13 16:58:55 +00:00


+++
disableToc = false
title = "Model compatibility table"
weight = 24
url = "/model-compatibility/"
+++

Besides llama-based models, LocalAI is also compatible with other architectures. The table below lists all backends, the model families they support, and the associated repositories.

{{% notice note %}}

LocalAI will attempt to automatically load models which are not explicitly configured for a specific backend. You can specify the backend to use by configuring a model with a YAML file. See [the advanced section]({{%relref "advanced" %}}) for more details.

{{% /notice %}}
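As a sketch, a minimal model YAML that pins a model to a specific backend could look like the following (the model name and file are illustrative placeholders, not a real model):

```yaml
# Illustrative model configuration: pins this model to the llama.cpp backend.
# Replace name and the model file with your own.
name: my-model
backend: llama.cpp
parameters:
  model: my-model.Q4_K_M.gguf
```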

## Text Generation & Language Models

| Backend | Description | Capability | Embeddings | Streaming | Acceleration |
|---|---|---|---|---|---|
| llama.cpp | LLM inference in C/C++. Supports LLaMA, Mamba, RWKV, Falcon, Starcoder, GPT-2, and many others | GPT, Functions | yes | yes | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
| ik_llama.cpp | Hard fork of llama.cpp optimized for CPU/hybrid CPU+GPU with IQK quants, custom quant mixes, and MLA for DeepSeek | GPT | yes | yes | CPU (AVX2+) |
| vLLM | Fast LLM serving with PagedAttention | GPT, Functions | no | yes | CPU, CUDA 12, ROCm, Intel |
| vLLM Omni | Unified multimodal generation (text, image, video, audio) | Multimodal GPT, Functions | no | yes | CUDA 12, ROCm |
| transformers | HuggingFace Transformers framework | GPT, Embeddings, Multimodal | yes | yes* | CPU, CUDA 12/13, ROCm, Intel, Metal |
| MLX | Apple Silicon LLM inference | GPT, Functions | no | yes | Metal, CPU, CUDA 12/13, Jetson L4T |
| MLX-VLM | Vision-Language Models on Apple Silicon | Multimodal GPT, Functions | no | yes | Metal, CPU, CUDA 12/13, Jetson L4T |
| MLX Distributed | Distributed LLM inference across multiple Apple Silicon Macs | GPT | no | no | Metal |
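As the changelog above notes, tool calling on the vLLM backends requires an explicit `tool_parser:` option, while the MLX backends auto-detect the parser from the model's chat template. A hedged sketch of a vLLM model config with tool calling enabled (the model name and parser values are illustrative; the gallery importer pre-fills `tool_parser:`/`reasoning_parser:` for known families):

```yaml
# Illustrative tool-calling configuration for the vLLM backend.
# tool_parser: must be set explicitly for vllm; MLX backends instead
# detect the parser automatically from the chat template.
# Model and parser values below are placeholders for a hypothetical setup.
name: my-vllm-model
backend: vllm
tool_parser: hermes
reasoning_parser: deepseek_r1
parameters:
  model: NousResearch/Hermes-3-Llama-3.1-8B
```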

## Speech-to-Text

| Backend | Description | Acceleration |
|---|---|---|
| whisper.cpp | OpenAI Whisper in C/C++ | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
| faster-whisper | Fast Whisper with CTranslate2 | CUDA 12/13, ROCm, Intel, Metal |
| WhisperX | Word-level timestamps and speaker diarization | CPU, CUDA 12/13, ROCm, Metal |
| moonshine | Ultra-fast transcription for low-end devices | CPU, CUDA 12/13, Metal |
| voxtral | Voxtral Realtime 4B speech-to-text in C | CPU, Metal |
| Qwen3-ASR | Qwen3 automatic speech recognition | CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T |
| NeMo | NVIDIA NeMo ASR toolkit | CPU, CUDA 12/13, ROCm, Intel, Metal |
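Any of these backend names can likewise be selected via the `backend` field. For instance, a transcription model pinned to whisper.cpp might be configured as follows (the model file name is an illustrative placeholder):

```yaml
# Illustrative: pin a transcription model to the whisper.cpp backend.
name: whisper-base
backend: whisper.cpp
parameters:
  model: ggml-base.bin
```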

## Text-to-Speech

| Backend | Description | Acceleration |
|---|---|---|
| piper | Fast neural TTS | CPU |
| Coqui TTS | TTS with 1100+ languages and voice cloning | CPU, CUDA 12/13, ROCm, Intel, Metal |
| Kokoro | Lightweight TTS (82M params) | CUDA 12/13, ROCm, Intel, Metal, Jetson L4T |
| Chatterbox | Production-grade TTS with emotion control | CPU, CUDA 12/13, Metal, Jetson L4T |
| VibeVoice | Real-time TTS with voice cloning | CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T |
| Qwen3-TTS | TTS with custom voice, voice design, and voice cloning | CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T |
| fish-speech | High-quality TTS with voice cloning | CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T |
| Pocket TTS | Lightweight CPU-efficient TTS with voice cloning | CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T |
| OuteTTS | TTS with custom speaker voices | CPU, CUDA 12 |
| faster-qwen3-tts | Real-time Qwen3-TTS with CUDA graph capture | CUDA 12/13, Jetson L4T |
| NeuTTS Air | Instant voice cloning TTS | CPU, CUDA 12, ROCm |
| VoxCPM | Expressive end-to-end TTS | CPU, CUDA 12/13, ROCm, Intel, Metal |
| Kitten TTS | Kitten TTS model | CPU, Metal |
| MLX-Audio | Audio models on Apple Silicon | Metal, CPU, CUDA 12/13, Jetson L4T |

## Music Generation

| Backend | Description | Acceleration |
|---|---|---|
| ACE-Step | Music generation from text descriptions, lyrics, or audio | CPU, CUDA 12/13, ROCm, Intel, Metal |
| acestep.cpp | ACE-Step 1.5 C++ backend using GGML | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |

## Image & Video Generation

| Backend | Description | Acceleration |
|---|---|---|
| stable-diffusion.cpp | Stable Diffusion, Flux, PhotoMaker in C/C++ | CPU, CUDA 12/13, Intel SYCL, Vulkan, Metal, Jetson L4T |
| diffusers | HuggingFace diffusion models (image and video generation) | CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T |

## Specialized Tasks

| Backend | Description | Acceleration |
|---|---|---|
| RF-DETR | Real-time transformer-based object detection | CPU, CUDA 12/13, Intel, Metal, Jetson L4T |
| rerankers | Document reranking for RAG | CUDA 12/13, ROCm, Intel, Metal |
| local-store | Local vector database for embeddings | CPU, Metal |
| Silero VAD | Voice Activity Detection | CPU |
| TRL | Fine-tuning (SFT, DPO, GRPO, RLOO, KTO, ORPO) | CPU, CUDA 12/13 |
| llama.cpp quantization | HuggingFace → GGUF model conversion and quantization | CPU, Metal |
| Opus | Audio codec for WebRTC / Realtime API | CPU, Metal |

## Acceleration Support Summary

### GPU Acceleration

- **NVIDIA CUDA**: CUDA 12.0 and CUDA 13.0 support across most backends
- **AMD ROCm**: HIP-based acceleration for AMD GPUs
- **Intel oneAPI**: SYCL-based acceleration for Intel GPUs (F16/F32 precision)
- **Vulkan**: Cross-platform GPU acceleration
- **Metal**: Apple Silicon GPU acceleration (M1/M2/M3+)

### Specialized Hardware

- **NVIDIA Jetson (L4T, CUDA 12)**: ARM64 support for embedded AI (AGX Orin, Jetson Nano, Jetson Xavier NX, Jetson AGX Xavier)
- **NVIDIA Jetson (L4T, CUDA 13)**: ARM64 support for embedded AI (DGX Spark)
- **Apple Silicon**: Native Metal acceleration for Mac M1/M2/M3+
- **Darwin x86**: Intel Mac support

### CPU Optimization

- **AVX/AVX2/AVX512**: Advanced vector extensions for x86
- **Quantization**: 4-bit, 5-bit, and 8-bit integer quantization support
- **Mixed Precision**: F16/F32 mixed precision support

Note: any backend name listed above can be used in the `backend` field of the model configuration file (see [the advanced section]({{%relref "advanced" %}})).

\* Streaming with the transformers backend is only available with CUDA and OpenVINO CPU/XPU acceleration.