blightbow 67baf66555 feat(mlx): add thread-safe LRU prompt cache and min_p/top_k sampling (#7556)
* feat(mlx): add thread-safe LRU prompt cache

Port mlx-lm's LRUPromptCache to fix race condition where concurrent
requests corrupt shared KV cache state. The previous implementation
used a single prompt_cache instance shared across all requests.

Changes:
- Add backend/python/common/mlx_cache.py with ThreadSafeLRUPromptCache
- Modify backend.py to use per-request cache isolation via fetch/insert
- Add prefix matching for cache reuse across similar prompts
- Add LRU eviction (default 10 entries, configurable)
- Add concurrency and cache unit tests

The cache uses a trie-based structure for efficient prefix matching,
allowing prompts that share common prefixes to reuse cached KV states.
Thread safety is provided via threading.Lock.

New configuration options:
- max_cache_entries: Maximum LRU cache entries (default: 10)
- max_kv_size: Maximum KV cache size per entry (default: None)
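The fetch/insert flow can be sketched as follows. This is an illustrative, simplified model, not the actual backend code: the real cache uses a trie rather than this linear scan, and stores MLX KV state rather than plain values.

```python
import threading
from collections import OrderedDict


def common_prefix_len(a, b):
    """Length of the shared token prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ThreadSafeLRUPromptCache:
    """Simplified model: per-request ownership via fetch/insert, LRU eviction."""

    def __init__(self, max_entries=10):
        self._lock = threading.Lock()
        self._entries = OrderedDict()  # (model, token_tuple) -> kv_state
        self._max_entries = max_entries

    def fetch(self, model, tokens):
        """Pop the entry sharing the longest prefix with `tokens`.

        Popping transfers exclusive ownership of the KV state to the
        caller, so concurrent requests can never mutate the same entry.
        """
        tokens = tuple(tokens)
        with self._lock:
            best_key, best_len = None, 0
            for key in self._entries:  # the real cache walks a trie instead
                if key[0] != model:
                    continue
                n = common_prefix_len(key[1], tokens)
                if n > best_len:
                    best_key, best_len = key, n
            if best_key is None:
                return None, 0
            return self._entries.pop(best_key), best_len

    def insert(self, model, tokens, kv_state):
        """Return a finished request's KV state; evict LRU entries when full."""
        key = (model, tuple(tokens))
        with self._lock:
            self._entries[key] = kv_state
            self._entries.move_to_end(key)  # mark as most recently used
            while len(self._entries) > self._max_entries:
                self._entries.popitem(last=False)  # drop least recently used
```

With this shape, a request fetches (and thereby removes) the best-matching entry, extends it while generating, and inserts the updated state back under the full token sequence when done.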

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>

* feat(mlx): add min_p and top_k sampler support

Add MinP field to proto (field 52) following the precedent set by
other non-OpenAI sampling parameters like TopK, TailFreeSamplingZ,
TypicalP, and Mirostat.

Changes:
- backend.proto: Add float MinP field for min-p sampling
- backend.py: Extract and pass min_p and top_k to mlx_lm sampler
  (top_k was in proto but not being passed)
- test.py: Fix test_sampling_params to use valid proto fields and
  switch to MLX-compatible model (mlx-community/Llama-3.2-1B-Instruct)
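For reference, min-p filtering keeps only tokens whose probability is at least min_p times the top token's probability. The sketch below is a self-contained illustration of the idea, not the mlx_lm sampler itself, which implements this internally:

```python
import math


def min_p_filter(logits, min_p):
    """Renormalize a distribution after dropping tokens below min_p * p_max."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)             # cutoff scales with the top token
    kept = [p if p >= threshold else 0.0 for p in probs]
    s = sum(kept)
    return [p / s for p in kept]
```

Unlike a fixed top_k cutoff, the kept set shrinks when the model is confident (large p_max) and widens when the distribution is flat.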

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>

* refactor(mlx): move mlx_cache.py from common to mlx backend

The ThreadSafeLRUPromptCache is only used by the mlx backend. After
evaluating mlx-vlm, it was determined that the cache cannot be shared
because mlx-vlm's generate/stream_generate functions don't support
the prompt_cache parameter that mlx_lm provides.

- Move mlx_cache.py from backend/python/common/ to backend/python/mlx/
- Remove sys.path manipulation from backend.py and test.py
- Fix test assertion to expect "MLX model loaded successfully"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>

* test(mlx): add comprehensive cache tests and document upstream behavior

Added comprehensive unit tests (test_mlx_cache.py) covering all cache
operation modes:
- Exact match
- Shorter prefix match
- Longer prefix match with trimming
- No match scenarios
- LRU eviction and access order
- Reference counting and deep copy behavior
- Multi-model namespacing
- Thread safety with data integrity verification

Documents upstream mlx_lm/server.py behavior: single-token prefixes are
deliberately not matched (uses > 0, not >= 0) to allow longer cached
sequences to be preferred for trimming. This is acceptable because real
prompts with chat templates are always many tokens.

Removed weak unit tests from test.py that only verified "no exception
thrown" rather than correctness.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>

* chore(mlx): remove unused MinP proto field

The MinP field was added to PredictOptions but is not populated by the
Go frontend/API. The MLX backend uses getattr with a default value,
so it works without the proto field.
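The defensive-read pattern in question looks roughly like this; the request stand-in below is hypothetical, not the real gRPC message:

```python
from types import SimpleNamespace

# Hypothetical stand-in for a decoded PredictOptions message
# that carries TopK but has no MinP field.
request = SimpleNamespace(TopK=40)

# getattr with a default keeps the backend working whether or not the
# proto defines the field: a missing MinP simply disables min-p sampling.
min_p = getattr(request, "MinP", 0.0)
top_k = getattr(request, "TopK", 0)
```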

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>

---------

Signed-off-by: Blightbow <blightbow@users.noreply.github.com>
Co-authored-by: Blightbow <blightbow@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-16 11:27:46 +01:00




💡 Get help - FAQ 💭Discussions 💬 Discord 📖 Documentation website

💻 Quickstart 🖼️ Models 🚀 Roadmap 🛫 Examples Try on Telegram


LocalAI is the free, Open Source OpenAI alternative. It acts as a drop-in replacement REST API compatible with the OpenAI (as well as Elevenlabs, Anthropic, ...) API specifications for local AI inferencing. It allows you to run LLMs, generate images and audio (and more) locally or on-prem with consumer-grade hardware, supporting multiple model families. It does not require a GPU. It is created and maintained by Ettore Di Giacinto.

📚🆕 Local Stack Family

🆕 LocalAI is now part of a comprehensive suite of AI tools designed to work together:

LocalAGI Logo

LocalAGI

A powerful Local AI agent management platform that serves as a drop-in replacement for OpenAI's Responses API, enhanced with advanced agentic capabilities.

LocalRecall Logo

LocalRecall

A REST-ful API and knowledge base management system that provides persistent memory and storage capabilities for AI agents.

Screenshots / Video

Youtube video




Screenshots

(Screenshots: Talk interface, audio generation, models overview, image generation with flux.1-dev, chat interface, home page, login, and P2P swarm dashboard.)

💻 Quickstart

Run the installer script:

# Basic installation
curl https://localai.io/install.sh | sh

For more installation options, see Installer Options.

macOS Download:

Download LocalAI for macOS

Note: the DMGs are not signed by Apple, so macOS may quarantine them on first launch. See https://github.com/mudler/LocalAI/issues/6268 for a workaround; the fix is tracked at https://github.com/mudler/LocalAI/issues/6244

Or run with docker:

💡 Docker Run vs Docker Start

  • docker run creates and starts a new container. If a container with the same name already exists, this command will fail.
  • docker start starts an existing container that was previously created with docker run.

If you've already run LocalAI before and want to start it again, use: docker start -i local-ai

CPU only image:

docker run -ti --name local-ai -p 8080:8080 localai/localai:latest

NVIDIA GPU Images:

# CUDA 12.0
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-12

# CUDA 11.7
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-11

# NVIDIA Jetson (L4T) ARM64
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-nvidia-l4t-arm64

AMD GPU Images (ROCm):

docker run -ti --name local-ai -p 8080:8080 --device=/dev/kfd --device=/dev/dri --group-add=video localai/localai:latest-gpu-hipblas

Intel GPU Images (oneAPI):

docker run -ti --name local-ai -p 8080:8080 --device=/dev/dri/card1 --device=/dev/dri/renderD128 localai/localai:latest-gpu-intel

Vulkan GPU Images:

docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-gpu-vulkan

AIO Images (pre-downloaded models):

# CPU version
docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-aio-cpu

# NVIDIA CUDA 12 version
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-aio-gpu-nvidia-cuda-12

# NVIDIA CUDA 11 version
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-aio-gpu-nvidia-cuda-11

# Intel GPU version
docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-aio-gpu-intel

# AMD GPU version
docker run -ti --name local-ai -p 8080:8080 --device=/dev/kfd --device=/dev/dri --group-add=video localai/localai:latest-aio-gpu-hipblas

For more information about the AIO images and pre-downloaded models, see Container Documentation.

To load models:

# From the model gallery (see available models with `local-ai models list`, in the WebUI from the model tab, or visiting https://models.localai.io)
local-ai run llama-3.2-1b-instruct:q4_k_m
# Start LocalAI with the phi-2 model directly from huggingface
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
# Install and run a model from the Ollama OCI registry
local-ai run ollama://gemma:2b
# Run a model from a configuration file
local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
# Install and run a model from a standard OCI registry (e.g., Docker Hub)
local-ai run oci://localai/phi-2:latest

Automatic Backend Detection: When you install models from the gallery or YAML files, LocalAI automatically detects your system's GPU capabilities (NVIDIA, AMD, Intel) and downloads the appropriate backend. For advanced configuration options, see GPU Acceleration.

For more information, see 💻 Getting started. If you are interested in our roadmap items and future enhancements, see the Issues labeled as Roadmap.

📰 Latest project news

Roadmap items: List of issues

🚀 Features

🧩 Supported Backends & Acceleration

LocalAI supports a comprehensive range of AI backends with multiple acceleration options:

Text Generation & Language Models

| Backend | Description | Acceleration Support |
|---|---|---|
| llama.cpp | LLM inference in C/C++ | CUDA 11/12, ROCm, Intel SYCL, Vulkan, Metal, CPU |
| vLLM | Fast LLM inference with PagedAttention | CUDA 12, ROCm, Intel |
| transformers | HuggingFace transformers framework | CUDA 11/12, ROCm, Intel, CPU |
| exllama2 | GPTQ inference library | CUDA 12 |
| MLX | Apple Silicon LLM inference | Metal (M1/M2/M3+) |
| MLX-VLM | Apple Silicon Vision-Language Models | Metal (M1/M2/M3+) |

Audio & Speech Processing

| Backend | Description | Acceleration Support |
|---|---|---|
| whisper.cpp | OpenAI Whisper in C/C++ | CUDA 12, ROCm, Intel SYCL, Vulkan, CPU |
| faster-whisper | Fast Whisper with CTranslate2 | CUDA 12, ROCm, Intel, CPU |
| bark | Text-to-audio generation | CUDA 12, ROCm, Intel |
| bark-cpp | C++ implementation of Bark | CUDA, Metal, CPU |
| coqui | Advanced TTS with 1100+ languages | CUDA 12, ROCm, Intel, CPU |
| kokoro | Lightweight TTS model | CUDA 12, ROCm, Intel, CPU |
| chatterbox | Production-grade TTS | CUDA 11/12, CPU |
| piper | Fast neural TTS system | CPU |
| kitten-tts | Kitten TTS models | CPU |
| silero-vad | Voice Activity Detection | CPU |
| neutts | Text-to-speech with voice cloning | CUDA 12, ROCm, CPU |

Image & Video Generation

| Backend | Description | Acceleration Support |
|---|---|---|
| stablediffusion.cpp | Stable Diffusion in C/C++ | CUDA 12, Intel SYCL, Vulkan, CPU |
| diffusers | HuggingFace diffusion models | CUDA 11/12, ROCm, Intel, Metal, CPU |

Specialized AI Tasks

| Backend | Description | Acceleration Support |
|---|---|---|
| rfdetr | Real-time object detection | CUDA 12, Intel, CPU |
| rerankers | Document reranking API | CUDA 11/12, ROCm, Intel, CPU |
| local-store | Vector database | CPU |
| huggingface | HuggingFace API integration | API-based |

Hardware Acceleration Matrix

| Acceleration Type | Supported Backends | Hardware Support |
|---|---|---|
| NVIDIA CUDA 11 | llama.cpp, whisper, stablediffusion, diffusers, rerankers, bark, chatterbox | Nvidia hardware |
| NVIDIA CUDA 12 | All CUDA-compatible backends | Nvidia hardware |
| AMD ROCm | llama.cpp, whisper, vllm, transformers, diffusers, rerankers, coqui, kokoro, bark, neutts | AMD Graphics |
| Intel oneAPI | llama.cpp, whisper, stablediffusion, vllm, transformers, diffusers, rfdetr, rerankers, exllama2, coqui, kokoro, bark | Intel Arc, Intel iGPUs |
| Apple Metal | llama.cpp, whisper, diffusers, MLX, MLX-VLM, bark-cpp | Apple M1/M2/M3+ |
| Vulkan | llama.cpp, whisper, stablediffusion | Cross-platform GPUs |
| NVIDIA Jetson | llama.cpp, whisper, stablediffusion, diffusers, rfdetr | ARM64 embedded AI |
| CPU Optimized | All backends | AVX/AVX2/AVX512, quantization support |

🔗 Community and integrations

Build and deploy custom containers:

WebUIs:

Agentic Libraries:

MCPs:

Model galleries

Voice:

Other:

🔗 Resources

📖 🎥 Media, Blogs, Social

Citation

If you utilize this repository or its data in a downstream project, please consider citing it with:

@misc{localai,
  author = {Ettore Di Giacinto},
  title = {LocalAI: The free, Open source OpenAI alternative},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/go-skynet/LocalAI}},
}

❤️ Sponsors

Do you find LocalAI useful?

Support the project by becoming a backer or sponsor. Your logo will show up here with a link to your website.

A huge thank you to our generous sponsors who support this project by covering CI expenses, and to everyone on our Sponsor list:


🌟 Star history

LocalAI Star history Chart

📖 License

LocalAI is a community-driven project created by Ettore Di Giacinto.

MIT - Author Ettore Di Giacinto mudler@localai.io

🙇 Acknowledgements

LocalAI couldn't have been built without the help of great software already available from the community. Thank you!

🤗 Contributors

This is a community project, a special thanks to our contributors! 🤗
