💡 Get help: ❓ FAQ · 💭 Discussions · 💬 Discord · 📖 Documentation website
💻 Quickstart · 🖼️ Models · 🚀 Roadmap · 🛫 Examples
LocalAI is the free, Open Source OpenAI alternative. LocalAI acts as a drop-in replacement REST API compatible with the OpenAI (as well as Elevenlabs, Anthropic, ...) API specifications for local AI inferencing. It allows you to run LLMs and generate images, audio, and more, locally or on-prem with consumer-grade hardware, supporting multiple model families. It does not require a GPU. It is created and maintained by Ettore Di Giacinto.
📚🆕 Local Stack Family
🆕 LocalAI is now part of a comprehensive suite of AI tools designed to work together:
- LocalAGI: a powerful local AI agent management platform that serves as a drop-in replacement for OpenAI's Responses API, enhanced with advanced agentic capabilities.
- LocalRecall: a RESTful API and knowledge base management system that provides persistent memory and storage capabilities for AI agents.
Screenshots / Video
YouTube video

Screenshots of the WebUI: Talk Interface · Generate Audio · Models Overview · Generate Images · Chat Interface · Home · Login · Swarm
💻 Quickstart
Run the installer script:
# Basic installation
curl https://localai.io/install.sh | sh
For more installation options, see Installer Options.
macOS Download:
Note: the DMGs are not signed by Apple, so macOS will quarantine them on first launch. See https://github.com/mudler/LocalAI/issues/6268 for a workaround; the fix is tracked in https://github.com/mudler/LocalAI/issues/6244.
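A commonly used workaround is to clear the quarantine attribute from the app bundle (a sketch, assuming the app was installed to /Applications/LocalAI.app; review the linked issue before applying):

```bash
# Remove the macOS quarantine flag from the unsigned app bundle (path is an assumption)
xattr -dr com.apple.quarantine /Applications/LocalAI.app
```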
Or run with docker:
💡 Docker Run vs Docker Start
`docker run` creates and starts a new container. If a container with the same name already exists, this command will fail. `docker start` starts an existing container that was previously created with `docker run`. If you've already run LocalAI before and want to start it again, use:
docker start -i local-ai
CPU only image:
docker run -ti --name local-ai -p 8080:8080 localai/localai:latest
NVIDIA GPU Images:
# CUDA 12.0
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-12
# CUDA 11.7
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-11
# NVIDIA Jetson (L4T) ARM64
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-nvidia-l4t-arm64
AMD GPU Images (ROCm):
docker run -ti --name local-ai -p 8080:8080 --device=/dev/kfd --device=/dev/dri --group-add=video localai/localai:latest-gpu-hipblas
Intel GPU Images (oneAPI):
docker run -ti --name local-ai -p 8080:8080 --device=/dev/dri/card1 --device=/dev/dri/renderD128 localai/localai:latest-gpu-intel
Vulkan GPU Images:
docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-gpu-vulkan
AIO Images (pre-downloaded models):
# CPU version
docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-aio-cpu
# NVIDIA CUDA 12 version
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-aio-gpu-nvidia-cuda-12
# NVIDIA CUDA 11 version
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-aio-gpu-nvidia-cuda-11
# Intel GPU version
docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-aio-gpu-intel
# AMD GPU version
docker run -ti --name local-ai -p 8080:8080 --device=/dev/kfd --device=/dev/dri --group-add=video localai/localai:latest-aio-gpu-hipblas
For more information about the AIO images and pre-downloaded models, see Container Documentation.
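To keep downloaded models across container recreations, you can mount a host directory over the container's models path. A minimal sketch — the in-container path `/models` is an assumption here; check the Container Documentation for the image you use:

```bash
# Persist models in ./models on the host (assumes the image stores models in /models)
docker run -ti --name local-ai -p 8080:8080 -v $PWD/models:/models localai/localai:latest
```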
To load models:
# From the model gallery (see available models with `local-ai models list`, in the WebUI from the model tab, or visiting https://models.localai.io)
local-ai run llama-3.2-1b-instruct:q4_k_m
# Start LocalAI with the phi-2 model directly from huggingface
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
# Install and run a model from the Ollama OCI registry
local-ai run ollama://gemma:2b
# Run a model from a configuration file
local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
# Install and run a model from a standard OCI registry (e.g., Docker Hub)
local-ai run oci://localai/phi-2:latest
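Once a model is loaded, you can exercise it through the OpenAI-compatible API. A minimal sketch, assuming LocalAI is listening on the default port 8080 and using the gallery model from the first example above:

```bash
# Send a chat completion request to the local OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2-1b-instruct:q4_k_m",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```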
⚡ Automatic Backend Detection: When you install models from the gallery or YAML files, LocalAI automatically detects your system's GPU capabilities (NVIDIA, AMD, Intel) and downloads the appropriate backend. For advanced configuration options, see GPU Acceleration.
For more information, see 💻 Getting started. If you are interested in our roadmap items and future enhancements, see the issues labeled as Roadmap here.
📰 Latest project news
- December 2025: Dynamic memory resource reclaimer, automatic fitting of models to multiple GPUs (llama.cpp), added the Vibevoice backend
- November 2025: Major improvements to the UX. Among these: Import models via URL and Multiple chats and history
- October 2025: 🔌 Model Context Protocol (MCP) support added for agentic capabilities with external tools
- September 2025: New Launcher application for macOS and Linux, extended support to many backends for Mac and Nvidia L4T devices. Models: added MLX-Audio, WAN 2.2. WebUI improvements; Python-based backends now ship portable Python environments.
- August 2025: MLX, MLX-VLM, Diffusers and llama.cpp are now supported on Mac M1/M2/M3+ chips (with `development` suffix in the gallery): https://github.com/mudler/LocalAI/pull/6049 https://github.com/mudler/LocalAI/pull/6119 https://github.com/mudler/LocalAI/pull/6121 https://github.com/mudler/LocalAI/pull/6060
- July/August 2025: 🔍 Object Detection added to the API featuring rf-detr
- July 2025: All backends migrated outside of the main binary. LocalAI is now more lightweight, small, and automatically downloads the required backend to run the model. Read the release notes
- June 2025: Backend management has been added. Attention: extras images are going to be deprecated from the next release! Read the backend management PR.
- May 2025: Audio input and Reranking in llama.cpp backend, Realtime API, Support to Gemma, SmollVLM, and more multimodal models (available in the gallery).
- May 2025: Important: image name changes. See release.
- Apr 2025: Rebrand, WebUI enhancements
- Apr 2025: LocalAGI and LocalRecall join the LocalAI family stack.
- Apr 2025: WebUI overhaul, AIO images updates
- Feb 2025: Backend cleanup, breaking changes, new backends (kokoro, OuteTTS, faster-whisper), Nvidia L4T images
- Jan 2025: LocalAI model release: https://huggingface.co/mudler/LocalAI-functioncall-phi-4-v0.3, SANA support in diffusers: https://github.com/mudler/LocalAI/pull/4603
- Dec 2024: stablediffusion.cpp backend (ggml) added ( https://github.com/mudler/LocalAI/pull/4289 )
- Nov 2024: Bark.cpp backend added ( https://github.com/mudler/LocalAI/pull/4287 )
- Nov 2024: Voice activity detection models (VAD) added to the API: https://github.com/mudler/LocalAI/pull/4204
- Oct 2024: examples moved to LocalAI-examples
- Aug 2024: 🆕 FLUX-1, P2P Explorer
- July 2024: 🔥🔥 🆕 P2P Dashboard, LocalAI Federated mode and AI Swarms: https://github.com/mudler/LocalAI/pull/2723. P2P Global community pools: https://github.com/mudler/LocalAI/issues/3113
- May 2024: 🔥🔥 Decentralized P2P llama.cpp: https://github.com/mudler/LocalAI/pull/2343 (peer2peer llama.cpp!) 👉 Docs https://localai.io/features/distribute/
- May 2024: 🔥🔥 Distributed inferencing: https://github.com/mudler/LocalAI/pull/2324
- April 2024: Reranker API: https://github.com/mudler/LocalAI/pull/2121
Roadmap items: List of issues
🚀 Features
- 🧩 Backend Gallery: Install/remove backends on the fly, powered by OCI images — fully customizable and API-driven.
- 📖 Text generation with GPTs (`llama.cpp`, `transformers`, `vllm` ... and more)
- 🗣 Text to Audio
- 🔈 Audio to Text (audio transcription with `whisper.cpp`)
- 🎨 Image generation
- 🔥 OpenAI-like tools API
- 🧠 Embeddings generation for vector databases
- ✍️ Constrained grammars
- 🖼️ Download Models directly from Huggingface
- 🥽 Vision API
- 🔍 Object Detection
- 📈 Reranker API
- 🆕🖧 P2P Inferencing
- 🆕🔌 Model Context Protocol (MCP): agentic capabilities with external tools, building on LocalAGI's agentic capabilities
- 🔊 Voice activity detection (Silero-VAD support)
- 🌍 Integrated WebUI!
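Most of the features above are exposed through OpenAI-compatible endpoints. As an illustrative sketch, generating embeddings for a vector database — the model name below is an assumption; substitute any embedding-capable model you have installed:

```bash
# Request embeddings from the OpenAI-compatible endpoint (model name is a placeholder)
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "text-embedding-ada-002", "input": "LocalAI rocks"}'
```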
🧩 Supported Backends & Acceleration
LocalAI supports a comprehensive range of AI backends with multiple acceleration options:
Text Generation & Language Models
| Backend | Description | Acceleration Support |
|---|---|---|
| llama.cpp | LLM inference in C/C++ | CUDA 11/12, ROCm, Intel SYCL, Vulkan, Metal, CPU |
| vLLM | Fast LLM inference with PagedAttention | CUDA 12, ROCm, Intel |
| transformers | HuggingFace transformers framework | CUDA 11/12, ROCm, Intel, CPU |
| exllama2 | GPTQ inference library | CUDA 12 |
| MLX | Apple Silicon LLM inference | Metal (M1/M2/M3+) |
| MLX-VLM | Apple Silicon Vision-Language Models | Metal (M1/M2/M3+) |
Audio & Speech Processing
| Backend | Description | Acceleration Support |
|---|---|---|
| whisper.cpp | OpenAI Whisper in C/C++ | CUDA 12, ROCm, Intel SYCL, Vulkan, CPU |
| faster-whisper | Fast Whisper with CTranslate2 | CUDA 12, ROCm, Intel, CPU |
| bark | Text-to-audio generation | CUDA 12, ROCm, Intel |
| bark-cpp | C++ implementation of Bark | CUDA, Metal, CPU |
| coqui | Advanced TTS with 1100+ languages | CUDA 12, ROCm, Intel, CPU |
| kokoro | Lightweight TTS model | CUDA 12, ROCm, Intel, CPU |
| chatterbox | Production-grade TTS | CUDA 11/12, CPU |
| piper | Fast neural TTS system | CPU |
| kitten-tts | Kitten TTS models | CPU |
| silero-vad | Voice Activity Detection | CPU |
| neutts | Text-to-speech with voice cloning | CUDA 12, ROCm, CPU |
Image & Video Generation
| Backend | Description | Acceleration Support |
|---|---|---|
| stablediffusion.cpp | Stable Diffusion in C/C++ | CUDA 12, Intel SYCL, Vulkan, CPU |
| diffusers | HuggingFace diffusion models | CUDA 11/12, ROCm, Intel, Metal, CPU |
Specialized AI Tasks
| Backend | Description | Acceleration Support |
|---|---|---|
| rfdetr | Real-time object detection | CUDA 12, Intel, CPU |
| rerankers | Document reranking API | CUDA 11/12, ROCm, Intel, CPU |
| local-store | Vector database | CPU |
| huggingface | HuggingFace API integration | API-based |
Hardware Acceleration Matrix
| Acceleration Type | Supported Backends | Hardware Support |
|---|---|---|
| NVIDIA CUDA 11 | llama.cpp, whisper, stablediffusion, diffusers, rerankers, bark, chatterbox | Nvidia hardware |
| NVIDIA CUDA 12 | All CUDA-compatible backends | Nvidia hardware |
| AMD ROCm | llama.cpp, whisper, vllm, transformers, diffusers, rerankers, coqui, kokoro, bark, neutts | AMD Graphics |
| Intel oneAPI | llama.cpp, whisper, stablediffusion, vllm, transformers, diffusers, rfdetr, rerankers, exllama2, coqui, kokoro, bark | Intel Arc, Intel iGPUs |
| Apple Metal | llama.cpp, whisper, diffusers, MLX, MLX-VLM, bark-cpp | Apple M1/M2/M3+ |
| Vulkan | llama.cpp, whisper, stablediffusion | Cross-platform GPUs |
| NVIDIA Jetson | llama.cpp, whisper, stablediffusion, diffusers, rfdetr | ARM64 embedded AI |
| CPU Optimized | All backends | AVX/AVX2/AVX512, quantization support |
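Before pulling a GPU image, it can help to verify that your container runtime actually exposes the accelerator. For NVIDIA, a standard sanity check (a sketch assuming the NVIDIA Container Toolkit is installed; the CUDA image tag is illustrative):

```bash
# Confirm the GPU is visible inside containers before running LocalAI GPU images
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```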
🔗 Community and integrations
Build and deploy custom containers:
WebUIs:
- https://github.com/Jirubizu/localai-admin
- https://github.com/go-skynet/LocalAI-frontend
- QA-Pilot (an interactive chat project that leverages LocalAI LLMs for rapid understanding and navigation of GitHub code repositories): https://github.com/reid41/QA-Pilot
Agentic Libraries:
MCPs:
Model galleries
Voice:
Other:
- Helm chart https://github.com/go-skynet/helm-charts
- VSCode extension https://github.com/badgooooor/localai-vscode-plugin
- Langchain: https://python.langchain.com/docs/integrations/providers/localai/
- Terminal utility https://github.com/djcopley/ShellOracle
- Local Smart assistant https://github.com/mudler/LocalAGI
- Home Assistant https://github.com/sammcj/homeassistant-localai / https://github.com/drndos/hass-openai-custom-conversation / https://github.com/valentinfrlch/ha-gpt4vision
- Discord bot https://github.com/mudler/LocalAGI/tree/main/examples/discord
- Slack bot https://github.com/mudler/LocalAGI/tree/main/examples/slack
- Shell-Pilot (interact with LLMs using LocalAI models via pure shell scripts on your Linux or macOS system): https://github.com/reid41/shell-pilot
- Telegram bot https://github.com/mudler/LocalAI/tree/master/examples/telegram-bot
- Another Telegram Bot https://github.com/JackBekket/Hellper
- Auto-documentation https://github.com/JackBekket/Reflexia
- GitHub bot that answers issues, with code and documentation as context: https://github.com/JackBekket/GitHelper
- Github Actions: https://github.com/marketplace/actions/start-localai
- Examples: https://github.com/mudler/LocalAI/tree/master/examples/
🔗 Resources
- LLM finetuning guide
- How to build locally
- How to install in Kubernetes
- Projects integrating LocalAI
- How tos section (curated by our community)
📖 🎥 Media, Blogs, Social
- Run Visual studio code with LocalAI (SUSE)
- 🆕 Run LocalAI on Jetson Nano Devkit
- Run LocalAI on AWS EKS with Pulumi
- Run LocalAI on AWS
- Create a slackbot for teams and OSS projects that answers documentation questions
- LocalAI meets k8sgpt
- Question Answering on Documents locally with LangChain, LocalAI, Chroma, and GPT4All
- Tutorial to use k8sgpt with LocalAI
Citation
If you use this repository or its data in a downstream project, please consider citing it with:
@misc{localai,
  author = {Ettore Di Giacinto},
  title = {LocalAI: The free, Open source OpenAI alternative},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/go-skynet/LocalAI}},
}
❤️ Sponsors
Do you find LocalAI useful?
Support the project by becoming a backer or sponsor. Your logo will show up here with a link to your website.
A huge thank you to our generous sponsors, who support this project by covering CI expenses, and to everyone on our Sponsor list:
🌟 Star history
📖 License
LocalAI is a community-driven project created by Ettore Di Giacinto.
MIT - Author Ettore Di Giacinto mudler@localai.io
🙇 Acknowledgements
LocalAI couldn't have been built without the help of great software already available from the community. Thank you!
- llama.cpp
- https://github.com/tatsu-lab/stanford_alpaca
- https://github.com/cornelk/llama-go for the initial ideas
- https://github.com/antimatter15/alpaca.cpp
- https://github.com/EdVince/Stable-Diffusion-NCNN
- https://github.com/ggerganov/whisper.cpp
- https://github.com/rhasspy/piper
🤗 Contributors
This is a community project, a special thanks to our contributors! 🤗







