mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 16:19:07 -04:00
Wire the continuous-batching serving path (update_slots) to the on-demand paged KV-cache engine (patches 0001-0004).

update_slots already drives the engine transparently through the existing kv-cache seams: each slot's sequence allocates paged blocks on arrival (find_slot placement) and returns them on slot release (the seq_rm free seam). No serving-loop change is needed for correctness.

This patch only exposes the enable cleanly: instead of forcing operators to export the process-wide LLAMA_KV_PAGED env, add `kv_paged` (aliases `paged_kv` / `paged_attention`) and `kv_paged_debug` model options that set the env before the model/context is created. Default off; when the option is absent nothing is touched, so an externally exported env still works and stock behaviour is unchanged.

Verified on a dynamic continuous-batching harness (NP physical slots reused across M>NP queued prompts, single mixed llama_decode per step, greedy): 12 dynamically-arriving sequences over 4 slots are token-identical to the stock single-slot serial baseline under both the unified and per-sequence caches. The debug trace confirms per-slot [paged-alloc] grow on arrival and per-stream release on seq_rm.

The per-slot allocate/free capacity benefit only materialises under a per-sequence cache (kv_unified:false), since paged block ownership is keyed by stream; the unified cache collapses every slot onto one stream and the run stays correct but degenerates to a single bounded, stock-recycled pool. We do not flip kv_unified here, to keep the default serving behaviour and idle-slot prompt cache unchanged.

No core llama.cpp patch: no engine bug was found under dynamic slot churn.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
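The option-to-env mapping described above can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: the function name `apply_kv_paged_options`, the "key:value" option string format, the value "1" written to the env, the debug env name `LLAMA_KV_PAGED_DEBUG`, and the choice to leave the env untouched when an option is present but false are all assumptions. Only `LLAMA_KV_PAGED` and the option names come from the message above. It uses POSIX `setenv`, so a Windows build would need a different call.

```cpp
#include <cassert>
#include <cstdlib>   // setenv (POSIX), getenv
#include <string>
#include <vector>

// LLAMA_KV_PAGED is named in the commit message; the debug env name is an assumption.
static const char *kPagedEnv      = "LLAMA_KV_PAGED";
static const char *kPagedDebugEnv = "LLAMA_KV_PAGED_DEBUG"; // hypothetical name

// Assumed truthy spellings for an option value.
static bool is_truthy(const std::string &v) {
    return v == "true" || v == "1" || v == "on" || v == "yes";
}

// Export an env var with value "1" (the value is an assumption), overwriting any prior value.
static void export_flag(const char *name) {
    assert(name != nullptr);
    setenv(name, "1", /*overwrite=*/1);
}

// Hypothetical helper: scan model options of the assumed form "key:value" (or bare "key")
// and export the paged-KV env vars. Intended to run before the model/context is created.
// Returns how many env vars were set. Absent options leave the environment untouched,
// so an externally exported LLAMA_KV_PAGED keeps working.
int apply_kv_paged_options(const std::vector<std::string> &options) {
    int exported = 0;
    for (const std::string &opt : options) {
        const std::size_t pos = opt.find(':');
        const std::string key = opt.substr(0, pos);
        // A bare key with no ":value" part is treated as enabled (assumption).
        const std::string val = (pos == std::string::npos) ? "true" : opt.substr(pos + 1);
        if (!is_truthy(val)) {
            continue; // present but false: do nothing in this sketch
        }
        if (key == "kv_paged" || key == "paged_kv" || key == "paged_attention") {
            export_flag(kPagedEnv);
            ++exported;
        } else if (key == "kv_paged_debug") {
            export_flag(kPagedDebugEnv);
            ++exported;
        }
    }
    return exported;
}
```

Under these assumptions, a call such as `apply_kv_paged_options({"paged_attention:true", "kv_paged_debug:true"})` would export both variables, while an option list that mentions neither name changes nothing. Setting the env rather than threading a new parameter through the context keeps the switch compatible with operators who already export the variable process-wide.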
204 KiB