blightbow 67baf66555 feat(mlx): add thread-safe LRU prompt cache and min_p/top_k sampling (#7556)
* feat(mlx): add thread-safe LRU prompt cache

Port mlx-lm's LRUPromptCache to fix race condition where concurrent
requests corrupt shared KV cache state. The previous implementation
used a single prompt_cache instance shared across all requests.

Changes:
- Add backend/python/common/mlx_cache.py with ThreadSafeLRUPromptCache
- Modify backend.py to use per-request cache isolation via fetch/insert
- Add prefix matching for cache reuse across similar prompts
- Add LRU eviction (default 10 entries, configurable)
- Add concurrency and cache unit tests

The cache uses a trie-based structure for efficient prefix matching,
allowing prompts that share common prefixes to reuse cached KV states.
Thread safety is provided via threading.Lock.

New configuration options:
- max_cache_entries: Maximum LRU cache entries (default: 10)
- max_kv_size: Maximum KV cache size per entry (default: None)
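The fetch/insert flow can be sketched as follows. This is an illustrative, simplified model, not the actual backend code: the real cache uses a trie rather than this linear scan, and stores MLX KV state rather than plain values.

```python
import threading
from collections import OrderedDict


def common_prefix_len(a, b):
    """Length of the shared token prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ThreadSafeLRUPromptCache:
    """Simplified model: per-request ownership via fetch/insert, LRU eviction."""

    def __init__(self, max_entries=10):
        self._lock = threading.Lock()
        self._entries = OrderedDict()  # (model, token_tuple) -> kv_state
        self._max_entries = max_entries

    def fetch(self, model, tokens):
        """Pop the entry sharing the longest prefix with `tokens`.

        Popping transfers exclusive ownership of the KV state to the
        caller, so concurrent requests can never mutate the same entry.
        """
        tokens = tuple(tokens)
        with self._lock:
            best_key, best_len = None, 0
            for key in self._entries:  # the real cache walks a trie instead
                if key[0] != model:
                    continue
                n = common_prefix_len(key[1], tokens)
                if n > best_len:
                    best_key, best_len = key, n
            if best_key is None:
                return None, 0
            return self._entries.pop(best_key), best_len

    def insert(self, model, tokens, kv_state):
        """Return a finished request's KV state; evict LRU entries when full."""
        key = (model, tuple(tokens))
        with self._lock:
            self._entries[key] = kv_state
            self._entries.move_to_end(key)  # mark as most recently used
            while len(self._entries) > self._max_entries:
                self._entries.popitem(last=False)  # drop least recently used
```

With this shape, a request fetches (and thereby removes) the best-matching entry, extends it while generating, and inserts the updated state back under the full token sequence when done.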

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>

* feat(mlx): add min_p and top_k sampler support

Add MinP field to proto (field 52) following the precedent set by
other non-OpenAI sampling parameters like TopK, TailFreeSamplingZ,
TypicalP, and Mirostat.

Changes:
- backend.proto: Add float MinP field for min-p sampling
- backend.py: Extract and pass min_p and top_k to mlx_lm sampler
  (top_k was in proto but not being passed)
- test.py: Fix test_sampling_params to use valid proto fields and
  switch to MLX-compatible model (mlx-community/Llama-3.2-1B-Instruct)
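For reference, min-p filtering keeps only tokens whose probability is at least min_p times the top token's probability. The sketch below is a self-contained illustration of the idea, not the mlx_lm sampler itself, which implements this internally:

```python
import math


def min_p_filter(logits, min_p):
    """Renormalize a distribution after dropping tokens below min_p * p_max."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)             # cutoff scales with the top token
    kept = [p if p >= threshold else 0.0 for p in probs]
    s = sum(kept)
    return [p / s for p in kept]
```

Unlike a fixed top_k cutoff, the kept set shrinks when the model is confident (large p_max) and widens when the distribution is flat.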

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>

* refactor(mlx): move mlx_cache.py from common to mlx backend

The ThreadSafeLRUPromptCache is only used by the mlx backend. After
evaluating mlx-vlm, it was determined that the cache cannot be shared
because mlx-vlm's generate/stream_generate functions don't support
the prompt_cache parameter that mlx_lm provides.

- Move mlx_cache.py from backend/python/common/ to backend/python/mlx/
- Remove sys.path manipulation from backend.py and test.py
- Fix test assertion to expect "MLX model loaded successfully"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>

* test(mlx): add comprehensive cache tests and document upstream behavior

Added comprehensive unit tests (test_mlx_cache.py) covering all cache
operation modes:
- Exact match
- Shorter prefix match
- Longer prefix match with trimming
- No match scenarios
- LRU eviction and access order
- Reference counting and deep copy behavior
- Multi-model namespacing
- Thread safety with data integrity verification

Documents upstream mlx_lm/server.py behavior: single-token prefixes are
deliberately not matched (uses > 0, not >= 0) to allow longer cached
sequences to be preferred for trimming. This is acceptable because real
prompts with chat templates are always many tokens.

Removed weak unit tests from test.py that only verified "no exception
thrown" rather than correctness.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>

* chore(mlx): remove unused MinP proto field

The MinP field was added to PredictOptions but is not populated by the
Go frontend/API. The MLX backend uses getattr with a default value,
so it works without the proto field.
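The defensive-read pattern in question looks roughly like this; the request stand-in below is hypothetical, not the real gRPC message:

```python
from types import SimpleNamespace

# Hypothetical stand-in for a decoded PredictOptions message
# that carries TopK but has no MinP field.
request = SimpleNamespace(TopK=40)

# getattr with a default keeps the backend working whether or not the
# proto defines the field: a missing MinP simply disables min-p sampling.
min_p = getattr(request, "MinP", 0.0)
top_k = getattr(request, "TopK", 0)
```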

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>

---------

Signed-off-by: Blightbow <blightbow@users.noreply.github.com>
Co-authored-by: Blightbow <blightbow@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-16 11:27:46 +01:00




💡 Get help - FAQ 💭Discussions 💬 Discord 📖 Documentation website

💻 Quickstart 🖼️ Models 🚀 Roadmap 🛫 Examples Try on Telegram


LocalAI is the free, Open Source OpenAI alternative. It acts as a drop-in replacement REST API compatible with the OpenAI (as well as Elevenlabs, Anthropic, ...) API specifications for local AI inferencing. It allows you to run LLMs, generate images and audio (and more) locally or on-prem with consumer-grade hardware, supporting multiple model families. It does not require a GPU. It is created and maintained by Ettore Di Giacinto.

📚🆕 Local Stack Family

🆕 LocalAI is now part of a comprehensive suite of AI tools designed to work together:

LocalAGI Logo

LocalAGI

A powerful Local AI agent management platform that serves as a drop-in replacement for OpenAI's Responses API, enhanced with advanced agentic capabilities.

LocalRecall Logo

LocalRecall

A REST-ful API and knowledge base management system that provides persistent memory and storage capabilities for AI agents.

Screenshots / Video

Youtube video




Screenshots

(Screenshots: Talk interface, audio generation, models overview, image generation with flux.1-dev, chat interface, home page, login, and P2P swarm dashboard.)

💻 Quickstart

Run the installer script:

# Basic installation
curl https://localai.io/install.sh | sh

For more installation options, see Installer Options.

macOS Download:

Download LocalAI for macOS

Note: the DMGs are not signed by Apple, so macOS may quarantine them on first launch. See https://github.com/mudler/LocalAI/issues/6268 for a workaround; the fix is tracked at https://github.com/mudler/LocalAI/issues/6244

Or run with docker:

💡 Docker Run vs Docker Start

  • docker run creates and starts a new container. If a container with the same name already exists, this command will fail.
  • docker start starts an existing container that was previously created with docker run.

If you've already run LocalAI before and want to start it again, use: docker start -i local-ai

CPU only image:

docker run -ti --name local-ai -p 8080:8080 localai/localai:latest

NVIDIA GPU Images:

# CUDA 12.0
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-12

# CUDA 11.7
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-11

# NVIDIA Jetson (L4T) ARM64
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-nvidia-l4t-arm64

AMD GPU Images (ROCm):

docker run -ti --name local-ai -p 8080:8080 --device=/dev/kfd --device=/dev/dri --group-add=video localai/localai:latest-gpu-hipblas

Intel GPU Images (oneAPI):

docker run -ti --name local-ai -p 8080:8080 --device=/dev/dri/card1 --device=/dev/dri/renderD128 localai/localai:latest-gpu-intel

Vulkan GPU Images:

docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-gpu-vulkan

AIO Images (pre-downloaded models):

# CPU version
docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-aio-cpu

# NVIDIA CUDA 12 version
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-aio-gpu-nvidia-cuda-12

# NVIDIA CUDA 11 version
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-aio-gpu-nvidia-cuda-11

# Intel GPU version
docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-aio-gpu-intel

# AMD GPU version
docker run -ti --name local-ai -p 8080:8080 --device=/dev/kfd --device=/dev/dri --group-add=video localai/localai:latest-aio-gpu-hipblas

For more information about the AIO images and pre-downloaded models, see Container Documentation.

To load models:

# From the model gallery (see available models with `local-ai models list`, in the WebUI from the model tab, or visiting https://models.localai.io)
local-ai run llama-3.2-1b-instruct:q4_k_m
# Start LocalAI with the phi-2 model directly from huggingface
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
# Install and run a model from the Ollama OCI registry
local-ai run ollama://gemma:2b
# Run a model from a configuration file
local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
# Install and run a model from a standard OCI registry (e.g., Docker Hub)
local-ai run oci://localai/phi-2:latest

Automatic Backend Detection: When you install models from the gallery or YAML files, LocalAI automatically detects your system's GPU capabilities (NVIDIA, AMD, Intel) and downloads the appropriate backend. For advanced configuration options, see GPU Acceleration.

For more information, see 💻 Getting started. If you are interested in our roadmap items and future enhancements, see the Issues labeled as Roadmap.

📰 Latest project news

Roadmap items: List of issues

🚀 Features

🧩 Supported Backends & Acceleration

LocalAI supports a comprehensive range of AI backends with multiple acceleration options:

Text Generation & Language Models

| Backend | Description | Acceleration Support |
|---|---|---|
| llama.cpp | LLM inference in C/C++ | CUDA 11/12, ROCm, Intel SYCL, Vulkan, Metal, CPU |
| vLLM | Fast LLM inference with PagedAttention | CUDA 12, ROCm, Intel |
| transformers | HuggingFace transformers framework | CUDA 11/12, ROCm, Intel, CPU |
| exllama2 | GPTQ inference library | CUDA 12 |
| MLX | Apple Silicon LLM inference | Metal (M1/M2/M3+) |
| MLX-VLM | Apple Silicon Vision-Language Models | Metal (M1/M2/M3+) |

Audio & Speech Processing

| Backend | Description | Acceleration Support |
|---|---|---|
| whisper.cpp | OpenAI Whisper in C/C++ | CUDA 12, ROCm, Intel SYCL, Vulkan, CPU |
| faster-whisper | Fast Whisper with CTranslate2 | CUDA 12, ROCm, Intel, CPU |
| bark | Text-to-audio generation | CUDA 12, ROCm, Intel |
| bark-cpp | C++ implementation of Bark | CUDA, Metal, CPU |
| coqui | Advanced TTS with 1100+ languages | CUDA 12, ROCm, Intel, CPU |
| kokoro | Lightweight TTS model | CUDA 12, ROCm, Intel, CPU |
| chatterbox | Production-grade TTS | CUDA 11/12, CPU |
| piper | Fast neural TTS system | CPU |
| kitten-tts | Kitten TTS models | CPU |
| silero-vad | Voice Activity Detection | CPU |
| neutts | Text-to-speech with voice cloning | CUDA 12, ROCm, CPU |

Image & Video Generation

| Backend | Description | Acceleration Support |
|---|---|---|
| stablediffusion.cpp | Stable Diffusion in C/C++ | CUDA 12, Intel SYCL, Vulkan, CPU |
| diffusers | HuggingFace diffusion models | CUDA 11/12, ROCm, Intel, Metal, CPU |

Specialized AI Tasks

| Backend | Description | Acceleration Support |
|---|---|---|
| rfdetr | Real-time object detection | CUDA 12, Intel, CPU |
| rerankers | Document reranking API | CUDA 11/12, ROCm, Intel, CPU |
| local-store | Vector database | CPU |
| huggingface | HuggingFace API integration | API-based |

Hardware Acceleration Matrix

| Acceleration Type | Supported Backends | Hardware Support |
|---|---|---|
| NVIDIA CUDA 11 | llama.cpp, whisper, stablediffusion, diffusers, rerankers, bark, chatterbox | Nvidia hardware |
| NVIDIA CUDA 12 | All CUDA-compatible backends | Nvidia hardware |
| AMD ROCm | llama.cpp, whisper, vllm, transformers, diffusers, rerankers, coqui, kokoro, bark, neutts | AMD Graphics |
| Intel oneAPI | llama.cpp, whisper, stablediffusion, vllm, transformers, diffusers, rfdetr, rerankers, exllama2, coqui, kokoro, bark | Intel Arc, Intel iGPUs |
| Apple Metal | llama.cpp, whisper, diffusers, MLX, MLX-VLM, bark-cpp | Apple M1/M2/M3+ |
| Vulkan | llama.cpp, whisper, stablediffusion | Cross-platform GPUs |
| NVIDIA Jetson | llama.cpp, whisper, stablediffusion, diffusers, rfdetr | ARM64 embedded AI |
| CPU Optimized | All backends | AVX/AVX2/AVX512, quantization support |

🔗 Community and integrations

Build and deploy custom containers:

WebUIs:

Agentic Libraries:

MCPs:

Model galleries

Voice:

Other:

🔗 Resources

📖 🎥 Media, Blogs, Social

Citation

If you utilize this repository or its data in a downstream project, please consider citing it with:

@misc{localai,
  author = {Ettore Di Giacinto},
  title = {LocalAI: The free, Open source OpenAI alternative},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/go-skynet/LocalAI}},
}

❤️ Sponsors

Do you find LocalAI useful?

Support the project by becoming a backer or sponsor. Your logo will show up here with a link to your website.

A huge thank you to our generous sponsors who support this project by covering CI expenses, and to everyone on our Sponsor list:


🌟 Star history

LocalAI Star history Chart

📖 License

LocalAI is a community-driven project created by Ettore Di Giacinto.

MIT - Author Ettore Di Giacinto mudler@localai.io

🙇 Acknowledgements

LocalAI couldn't have been built without the help of great software already available from the community. Thank you!

🤗 Contributors

This is a community project, a special thanks to our contributors! 🤗
