* feat(mlx): add thread-safe LRU prompt cache
Port mlx-lm's LRUPromptCache to fix a race condition where concurrent
requests corrupt shared KV cache state. The previous implementation
used a single prompt_cache instance shared across all requests.
Changes:
- Add backend/python/common/mlx_cache.py with ThreadSafeLRUPromptCache
- Modify backend.py to use per-request cache isolation via fetch/insert
- Add prefix matching for cache reuse across similar prompts
- Add LRU eviction (default 10 entries, configurable)
- Add concurrency and cache unit tests
The cache uses a trie-based structure for efficient prefix matching,
allowing prompts that share common prefixes to reuse cached KV states.
Thread safety is provided via threading.Lock.
New configuration options:
- max_cache_entries: Maximum LRU cache entries (default: 10)
- max_kv_size: Maximum KV cache size per entry (default: None)
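For illustration, a minimal sketch of the fetch/insert contract under these defaults. This is not the code in mlx_cache.py: the real ThreadSafeLRUPromptCache is trie-based, uses reference counting, and also trims cached sequences that are longer than the incoming prompt, so the method shapes below are an approximation.

```python
import copy
import threading
from collections import OrderedDict


class LRUPromptCacheSketch:
    """Simplified stand-in for the ThreadSafeLRUPromptCache in mlx_cache.py.

    Only exact and shorter-prefix matches are handled here, which is enough
    to show the locking, the LRU eviction, and the fetch/insert contract.
    """

    def __init__(self, max_entries: int = 10):            # max_cache_entries
        self.max_entries = max_entries
        self._entries = OrderedDict()                      # (model, tokens) -> KV state
        self._lock = threading.Lock()                      # serializes all cache access

    def fetch(self, model_key, tokens):
        """Return (kv_state or None, remaining_tokens) for one request.

        The state is handed out as a deep copy so concurrent requests never
        mutate shared KV buffers (the real class uses reference counting to
        avoid needless copies).
        """
        tokens = tuple(tokens)
        with self._lock:
            best_key, best_len = None, 1                   # > 1: single-token prefixes are skipped
            for (mkey, cached) in self._entries:
                if mkey != model_key:                      # per-model namespacing
                    continue
                if len(cached) > best_len and tokens[: len(cached)] == cached:
                    best_key, best_len = (mkey, cached), len(cached)
            if best_key is None:
                return None, list(tokens)
            self._entries.move_to_end(best_key)            # refresh LRU order
            return copy.deepcopy(self._entries[best_key]), list(tokens[best_len:])

    def insert(self, model_key, tokens, kv_state):
        """Store the request's KV state back under its full token sequence."""
        key = (model_key, tuple(tokens))
        with self._lock:
            self._entries[key] = kv_state
            self._entries.move_to_end(key)
            while len(self._entries) > self.max_entries:   # LRU eviction
                self._entries.popitem(last=False)
```

backend.py wraps each generation in a fetch/insert pair, so every request works on its own copy of the KV state instead of the old shared prompt_cache instance.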
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>
* feat(mlx): add min_p and top_k sampler support
Add MinP field to proto (field 52) following the precedent set by
other non-OpenAI sampling parameters like TopK, TailFreeSamplingZ,
TypicalP, and Mirostat.
Changes:
- backend.proto: Add float MinP field for min-p sampling
- backend.py: Extract and pass min_p and top_k to the mlx_lm sampler
  (top_k was already in the proto but was not being passed); see the sketch below
- test.py: Fix test_sampling_params to use valid proto fields and
switch to MLX-compatible model (mlx-community/Llama-3.2-1B-Instruct)
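A rough sketch of the new wiring, not the literal backend.py code; it assumes mlx_lm's make_sampler keywords (temp, top_p, min_p, top_k) and the proto field names used in this change:

```python
from mlx_lm.sample_utils import make_sampler


def sampler_from_request(request):
    # min_p comes from the new MinP proto field; top_k was already in the
    # proto but is only forwarded to the sampler as of this change.
    return make_sampler(
        temp=request.Temperature,
        top_p=request.TopP,
        min_p=request.MinP,
        top_k=request.TopK,
    )
```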
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>
* refactor(mlx): move mlx_cache.py from common to mlx backend
The ThreadSafeLRUPromptCache is only used by the mlx backend. After
evaluating mlx-vlm, it was determined that the cache cannot be shared
because mlx-vlm's generate/stream_generate functions don't support
the prompt_cache parameter that mlx_lm provides.
- Move mlx_cache.py from backend/python/common/ to backend/python/mlx/
- Remove sys.path manipulation from backend.py and test.py
- Fix test assertion to expect "MLX model loaded successfully"
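To make the constraint concrete, a hedged sketch of the mlx_lm call the cache hooks into; the model name and call shape are illustrative, and mlx-vlm's generate/stream_generate expose no equivalent keyword:

```python
from mlx_lm import load, stream_generate
from mlx_lm.models.cache import make_prompt_cache

model, tokenizer = load("mlx-community/Llama-3.2-1B-Instruct")
cache = make_prompt_cache(model)

# mlx_lm threads the KV state through via prompt_cache; this is the hook
# ThreadSafeLRUPromptCache relies on and the one mlx-vlm does not provide.
for _ in stream_generate(model, tokenizer, "Hello", prompt_cache=cache):
    pass
```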
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>
* test(mlx): add comprehensive cache tests and document upstream behavior
Added comprehensive unit tests (test_mlx_cache.py) covering all cache
operation modes:
- Exact match
- Shorter prefix match
- Longer prefix match with trimming
- No match scenarios
- LRU eviction and access order
- Reference counting and deep copy behavior
- Multi-model namespacing
- Thread safety with data integrity verification
Documented upstream mlx_lm/server.py behavior: single-token prefixes are
deliberately not matched (uses > 0, not >= 0) so that longer cached
sequences are preferred for trimming. This is acceptable because real
prompts with chat templates are always many tokens long.
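For illustration, roughly what the edge-case test asserts; cache and kv_state stand in for pytest fixtures, and the fetch() return shape is an assumption rather than the actual test_mlx_cache.py code:

```python
def test_single_token_prefix_is_not_reused(cache, kv_state):
    # Only one token is shared with the cached entry, so the lookup is
    # expected to report "no match" instead of reusing (and trimming) it.
    cache.insert("some-model", [42], kv_state)

    state, remaining = cache.fetch("some-model", [42, 7, 9])
    assert state is None
    assert remaining == [42, 7, 9]
```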
Removed weak unit tests from test.py that only verified "no exception
thrown" rather than correctness.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>
* chore(mlx): remove unused MinP proto field
The MinP field was added to PredictOptions but is not populated by the
Go frontend/API. The MLX backend uses getattr with a default value,
so it works without the proto field.
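Concretely, the fallback looks like this, with SimpleNamespace standing in for a PredictOptions message generated from a proto without the MinP field:

```python
from types import SimpleNamespace

# A PredictOptions-like message built from a proto that has no MinP field.
request = SimpleNamespace(Temperature=0.7, TopP=0.9, TopK=40)

min_p = getattr(request, "MinP", 0.0)   # -> 0.0, so min-p sampling is simply off
```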
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>
---------
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>
Co-authored-by: Blightbow <blightbow@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* fix(7355): Update llama-cpp grpc for v3 interface
Signed-off-by: Simon Redman <simon@ergotech.com>
* feat(llama-grpc): Trim whitespace from servers list
Signed-off-by: Simon Redman <simon@ergotech.com>
* Trim trailing spaces in grpc-server.cpp
Signed-off-by: Simon Redman <simon@ergotech.com>
---------
Signed-off-by: Simon Redman <simon@ergotech.com>
* chore(deps): bump stable-diffusion.cpp to '8823dc48bcc1598eb9671da7b69e45338d0cc5a5'
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(Dockerfile.golang): Make curl noisy to see when a download fails
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Richard Palethorpe <io@richiejp.com>
* feat(vibevoice): add backend
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore: add workflow and backend index
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(gallery): add vibevoice
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Use self-hosted runners for Intel builds
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Pin python version for l4t
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(stablediffusion): Pass through more parameters to support z-image and flux2
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* chore(z-image): Add Z-Image-Turbo GGML to library
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(stablediffusion-ggml): flush stderr and check errors when writing PNG
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(stablediffusion-ggml): Re-allocate Go strings in C++
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(stablediffusion-ggml): Try to avoid segfaults
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(stablediffusion-ggml): Init sample and easycache params
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
* fix: use ubuntu 24.04 for cuda13 l4t images
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Drop openblas from containers
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ci): add cuda13 jobs
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add cuda13 to pipelines and capabilities. Start working on the gallery
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* gallery
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* capabilities: try to detect by looking at /usr/local
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* neutts
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* backends.yaml
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* add cuda13 l4t requirements.txt
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* add cuda13 requirements.txt
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Pin vllm
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Not all backends are compatible with cuda13
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* add vllm to requirements
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* vllm is not pre-compiled for cuda 13
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>