mirror of
https://github.com/mudler/LocalAI.git
synced 2026-05-29 19:19:19 -04:00
Aligns LocalAI's llama-cpp gRPC backend with upstream's auto-on prompt cache path so repeated system prompts (agents, OpenAI/Anthropic-compatible CLIs, coding assistants) skip prefill on subsequent calls without any YAML changes. Reported in #9921.

Upstream's server enables `kv_unified=true` (and bumps `n_parallel` to 4) when slot count is auto, which unlocks `cache_idle_slots`. LocalAI hardcodes `n_parallel=1` and so far also hardcoded `kv_unified=false`, which silently force-disables idle-slot saving at server init. The host prompt cache was allocated but never written across requests.

Changes in backend/cpp/llama-cpp/grpc-server.cpp:
- params.kv_unified: false -> true (single-slot path now benefits from the prompt cache; users can opt out with `kv_unified:false`)
- params.n_ctx_checkpoints: 8 -> 32 (match upstream default)
- params.cache_idle_slots = true initialized explicitly (upstream default)
- params.checkpoint_every_nt = 8192 initialized explicitly (upstream default)
- New option parsers: cache_idle_slots / idle_slots_cache, checkpoint_every_nt / checkpoint_every_n_tokens

Docs:
- features/text-generation.md: fix misleading `cache_ram` description (it's the host-side prompt cache, not the KV cache), document the kv_unified + cache_ram + cache_idle_slots interaction, add rows for the two newly-exposed options, and add a worked example for the agent/CLI workload from the issue.
- advanced/model-configuration.md: mark the legacy `prompt_cache_path` / `prompt_cache_all` / `prompt_cache_ro` YAML fields as unused by the llama-cpp gRPC backend (they target upstream's CLI completion tool and are not consumed by grpc-server.cpp) and point readers at the new prompt-cache explainer.

Closes #9921

Assisted-by: claude:opus-4.7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
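
Illustrative sketch (not the code in grpc-server.cpp): the snippet below shows one way the new defaults and the option aliases named in this message could fit together. `CacheParams`, `parse_bool`, and `apply_option` are hypothetical names introduced for illustration only; the `name:value` option format is assumed from the `kv_unified:false` opt-out mentioned above, and the accepted boolean spellings are likewise an assumption.

```cpp
#include <exception>
#include <string>

// Hypothetical stand-in for the cache-related fields listed in this commit.
// Defaults mirror the values the message says are now set.
struct CacheParams {
    bool kv_unified          = true;  // was false before this change
    int  n_ctx_checkpoints   = 32;    // was 8 before this change
    bool cache_idle_slots    = true;  // initialized explicitly
    int  checkpoint_every_nt = 8192;  // initialized explicitly
};

// Assumed boolean spellings; the real backend may accept a different set.
static bool parse_bool(const std::string& v) {
    return v == "true" || v == "1" || v == "yes" || v == "on";
}

// Applies a single "name:value" option string to the params.
// Returns false when the option is malformed or not recognized.
bool apply_option(CacheParams& p, const std::string& opt) {
    const auto sep = opt.find(':');
    if (sep == std::string::npos) return false;
    const std::string name  = opt.substr(0, sep);
    const std::string value = opt.substr(sep + 1);

    if (name == "kv_unified") {
        p.kv_unified = parse_bool(value);
        return true;
    }
    // Both spellings map to the same field.
    if (name == "cache_idle_slots" || name == "idle_slots_cache") {
        p.cache_idle_slots = parse_bool(value);
        return true;
    }
    if (name == "checkpoint_every_nt" || name == "checkpoint_every_n_tokens") {
        try {
            p.checkpoint_every_nt = std::stoi(value);
        } catch (const std::exception&) {
            return false;  // non-numeric value: leave the default untouched
        }
        return true;
    }
    return false;
}
```

In this sketch, calling `apply_option(p, "kv_unified:false")` on a default-constructed `CacheParams` flips only `kv_unified`, which corresponds to the opt-out path described in the change list; all other fields keep the new defaults.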