Python Backends for LocalAI

This directory contains Python-based AI backends for LocalAI, providing support for various AI models and hardware acceleration targets.

Overview

The Python backends use a unified build system based on libbackend.sh that provides:

  • Automatic virtual environment management with support for both uv and pip
  • Hardware-specific dependency installation (CPU, CUDA, Intel, MLX, etc.)
  • Portable Python support for standalone deployments
  • Consistent backend execution across different environments

Available Backends

Core AI Models

  • transformers - Hugging Face Transformers framework (PyTorch-based)
  • vllm - High-performance LLM inference engine
  • mlx - Apple Silicon optimized ML framework
  • exllama2 - ExLlama2 quantized models

Audio & Speech

  • bark - Text-to-speech synthesis
  • coqui - Coqui TTS models
  • faster-whisper - Fast Whisper speech recognition
  • kitten-tts - Lightweight TTS
  • mlx-audio - Apple Silicon audio processing
  • chatterbox - TTS model
  • kokoro - TTS models

Computer Vision

  • diffusers - Stable Diffusion and image generation
  • mlx-vlm - Vision-language models for Apple Silicon
  • rfdetr - Object detection models

Specialized

  • rerankers - Text reranking models

Quick Start

Prerequisites

  • Python 3.10+ (default: 3.10.18)
  • uv package manager (recommended) or pip
  • Appropriate hardware drivers for your target (CUDA, Intel, etc.)

Installation

Each backend can be installed individually:

# Navigate to a specific backend
cd backend/python/transformers

# Install dependencies
make transformers
# or
bash install.sh

# Run the backend
make run
# or
bash run.sh

Using the Unified Build System

The libbackend.sh script provides consistent commands across all backends:

# Source the library in your backend script
source "$(dirname "$0")/../common/libbackend.sh"

# Install requirements (automatically handles hardware detection)
installRequirements

# Start the backend server
startBackend "$@"

# Run tests
runUnittests

Hardware Targets

The build system automatically detects and configures for different hardware:

  • CPU - Standard CPU-only builds
  • CUDA - NVIDIA GPU acceleration (supports CUDA 11/12)
  • Intel - Intel XPU/GPU optimization
  • MLX - Apple Silicon (M-series) optimization
  • HIP - AMD GPU acceleration

Target-Specific Requirements

Backends can specify hardware-specific dependencies:

  • requirements.txt - Base requirements
  • requirements-cpu.txt - CPU-specific packages
  • requirements-cublas11.txt - CUDA 11 packages
  • requirements-cublas12.txt - CUDA 12 packages
  • requirements-intel.txt - Intel-optimized packages
  • requirements-mps.txt - Apple Silicon packages
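
As a rough illustration of how these files relate to a build target, the sketch below lists the base file plus the target-specific file when each exists. The helper name requirement_files_for is hypothetical and is not part of libbackend.sh, whose actual selection logic may differ.

```shell
# Hypothetical helper, not part of libbackend.sh: print the requirements
# files in a backend directory that apply to a given build target.
# Arguments: <backend dir> <target suffix, e.g. cpu, cublas12, intel, mps>
requirement_files_for() {
    dir="$1"
    target="$2"
    for f in "$dir/requirements.txt" "$dir/requirements-$target.txt"; do
        # Skip files the backend does not provide.
        if [ -f "$f" ]; then
            echo "$f"
        fi
    done
}

# Example: list what would apply to a CUDA 12 build of the current directory.
requirement_files_for . cublas12
```

The base file is listed first so that target-specific packages are considered after the shared ones; that ordering is an assumption of this sketch.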

Configuration Options

Environment Variables

  • PYTHON_VERSION - Python version (default: 3.10)
  • PYTHON_PATCH - Python patch version (default: 18)
  • BUILD_TYPE - Force specific build target
  • USE_PIP - Use pip instead of uv (default: false)
  • PORTABLE_PYTHON - Enable portable Python builds
  • LIMIT_TARGETS - Restrict backend to specific targets
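
A minimal sketch of how a script might apply the documented defaults when these variables are unset. The helpers full_python_version and installer_name are hypothetical names used only for illustration; libbackend.sh may resolve the values differently.

```shell
# Illustrative only: apply the defaults listed above when a variable is unset.
PYTHON_VERSION="${PYTHON_VERSION:-3.10}"
PYTHON_PATCH="${PYTHON_PATCH:-18}"
USE_PIP="${USE_PIP:-false}"

# Hypothetical helper: join the version and the patch level.
full_python_version() {
    printf '%s.%s\n' "$PYTHON_VERSION" "$PYTHON_PATCH"
}

# Hypothetical helper: report which installer the settings select.
installer_name() {
    if [ "$USE_PIP" = "true" ]; then
        echo "pip"
    else
        echo "uv"
    fi
}

echo "Python $(full_python_version), installer: $(installer_name)"
```

To override a default for a single run, set the variable on the command line, for example USE_PIP=true bash install.sh.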

Example: CUDA 12 Only Backend

# In your backend script
LIMIT_TARGETS="cublas12"
source "$(dirname "$0")/../common/libbackend.sh"

Example: Intel-Optimized Backend

# In your backend script
LIMIT_TARGETS="intel"
source "$(dirname "$0")/../common/libbackend.sh"
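
To make the effect of LIMIT_TARGETS concrete, here is a sketch of an allow-list check. It assumes a comma-separated list of targets and uses a hypothetical helper, target_allowed; the real check lives in libbackend.sh and may use a different format or semantics.

```shell
# Hypothetical helper: succeed if a build target appears in a
# comma-separated allow-list. An empty list allows every target.
target_allowed() {
    target="$1"
    allowed="$2"
    if [ -z "$allowed" ]; then
        return 0
    fi
    # Wrap both sides in commas so "cublas1" does not match "cublas12".
    case ",$allowed," in
        *",$target,"*) return 0 ;;
        *) return 1 ;;
    esac
}

LIMIT_TARGETS="cublas12"
if target_allowed "cublas12" "$LIMIT_TARGETS"; then
    echo "cublas12: allowed"
fi
if ! target_allowed "intel" "$LIMIT_TARGETS"; then
    echo "intel: skipped"
fi
```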

Development

Adding a New Backend

  1. Create a new directory in backend/python/
  2. Copy the template structure from common/template/
  3. Implement your backend.py with the required gRPC interface
  4. Add appropriate requirements files for your target hardware
  5. Use libbackend.sh for consistent build and execution
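
Steps 1 and 2 can be sketched as a small helper that copies the template into a new directory. The function name scaffold_backend and the backend name my-backend are placeholders, not part of the repository.

```shell
# Hypothetical helper: copy a template directory to create a new backend.
# Arguments: <template dir> <backends root> <new backend name>
scaffold_backend() {
    template="$1"
    root="$2"
    name="$3"
    if [ ! -d "$template" ]; then
        echo "template not found: $template" >&2
        return 1
    fi
    # Never overwrite an existing backend directory.
    if [ -e "$root/$name" ]; then
        echo "refusing to overwrite: $root/$name" >&2
        return 1
    fi
    cp -R "$template" "$root/$name"
}

# Example, run from backend/python/ ("my-backend" is a placeholder name):
# scaffold_backend common/template . my-backend
```

After copying, edit backend.py and the requirements files as described in steps 3 and 4.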

Testing

# Run backend tests
make test
# or
bash test.sh

Building

# Install dependencies
make <backend-name>

# Clean build artifacts
make clean

Architecture

Each backend follows a consistent structure:

backend-name/
├── backend.py          # Main backend implementation
├── requirements.txt    # Base dependencies
├── requirements-*.txt  # Hardware-specific dependencies
├── install.sh          # Installation script
├── run.sh              # Execution script
├── test.sh             # Test script
├── Makefile            # Build targets
└── test.py             # Unit tests
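
A quick way to compare a backend directory against this layout is sketched below. The helper check_backend_layout is hypothetical, and it treats every file in the tree as required, which is an assumption; individual backends may legitimately omit some of them.

```shell
# Hypothetical helper: report files from the layout above that a backend
# directory lacks. Returns non-zero if anything is missing.
check_backend_layout() {
    dir="$1"
    missing=0
    for f in backend.py requirements.txt install.sh run.sh test.sh Makefile test.py; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Example, run from backend/python/:
# check_backend_layout transformers
```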

Troubleshooting

Common Issues

  1. Missing dependencies: Ensure all requirements files are properly configured
  2. Hardware detection: Check that BUILD_TYPE matches your system
  3. Python version: Verify Python 3.10+ is available
  4. Virtual environment: Use ensureVenv to create/activate environments
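
For item 3, the following sketch checks whether a version string satisfies the Python 3.10+ requirement. The helper python_version_ok is hypothetical, and the sketch assumes python3 is the interpreter name on your PATH.

```shell
# Hypothetical helper: succeed if a version string such as "3.10.18"
# is at least 3.10.
python_version_ok() {
    major="${1%%.*}"
    rest="${1#*.}"
    minor="${rest%%.*}"
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 10 ]; }
}

# Check the interpreter on PATH, if there is one.
if command -v python3 >/dev/null 2>&1; then
    found="$(python3 -c 'import platform; print(platform.python_version())')"
    if python_version_ok "$found"; then
        echo "Python $found is new enough"
    else
        echo "Python $found is older than 3.10"
    fi
fi
```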

Contributing

When adding new backends or modifying existing ones:

  1. Follow the established directory structure
  2. Use libbackend.sh for consistent behavior
  3. Include appropriate requirements files for all target hardware
  4. Add comprehensive tests
  5. Update this README if adding new backend types