mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 16:19:07 -04:00
feat(llama-cpp): expose paged KV cache as a per-server option (patch 0005)
Wire the continuous-batching serving path (update_slots) to the
on-demand paged KV-cache engine (patches 0001-0004). update_slots
already drives the engine transparently through the existing kv-cache
seams: each slot's sequence allocates paged blocks on arrival (find_slot
placement) and returns them on slot release (the seq_rm free seam). No
serving-loop change is needed for correctness.

This patch only exposes the enable cleanly: instead of forcing operators
to export the process-wide LLAMA_KV_PAGED env, add `kv_paged` (aliases
`paged_kv` / `paged_attention`) and `kv_paged_debug` model options that
set the env before the model/context is created. Default off; when the
option is absent nothing is touched, so an externally exported env still
works and stock behaviour is unchanged.

Verified on a dynamic continuous-batching harness (NP physical slots
reused across M>NP queued prompts, single mixed llama_decode per step,
greedy): 12 dynamically-arriving sequences over 4 slots are
token-identical to the stock single-slot serial baseline under both the
unified and per-sequence caches. The debug trace confirms per-slot
[paged-alloc] grow on arrival and per-stream release on seq_rm.

The per-slot allocate/free capacity benefit only materialises under a
per-sequence cache (kv_unified:false), since paged block ownership is
keyed by stream; the unified cache collapses every slot onto one stream
and the run stays correct but degenerates to a single bounded,
stock-recycled pool. We do not flip kv_unified here, to keep the default
serving behaviour and idle-slot prompt cache unchanged.

No core llama.cpp patch: no engine bug was found under dynamic slot
churn.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
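
To make the stream-keyed ownership argument concrete, here is a minimal toy model in C++. It is a sketch under stated assumptions, not the vendored PagedKVManager: `ToyPagedPool` and `stream_for_slot` are hypothetical names, and the only behaviour modelled is "blocks are owned by a stream, and a slot maps to its own stream only when the cache is per-sequence".

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical toy model of a block pool whose ownership is keyed by stream.
// It does not mirror the real engine's data structures or API.
class ToyPagedPool {
public:
    explicit ToyPagedPool(std::size_t n_blocks) {
        for (std::size_t i = 0; i < n_blocks; ++i) free_.push_back(i);
    }

    // Give one free block to `stream`; returns false when the pool is exhausted.
    bool grow(int stream) {
        if (free_.empty()) return false;
        owned_[stream].push_back(free_.back());
        free_.pop_back();
        return true;
    }

    // Return every block owned by `stream` to the free list.
    void release(int stream) {
        auto it = owned_.find(stream);
        if (it == owned_.end()) return;
        free_.insert(free_.end(), it->second.begin(), it->second.end());
        owned_.erase(it);
    }

    std::size_t free_blocks() const { return free_.size(); }

    // Number of blocks currently attributed to `stream`.
    std::size_t owned_by(int stream) const {
        auto it = owned_.find(stream);
        return it == owned_.end() ? 0 : it->second.size();
    }

private:
    std::vector<std::size_t> free_;
    std::unordered_map<int, std::vector<std::size_t>> owned_;
};

// Assumption for illustration: a unified cache puts every slot on stream 0,
// while a per-sequence cache gives each slot its own stream.
inline int stream_for_slot(int slot, bool kv_unified) {
    assert(slot >= 0);
    return kv_unified ? 0 : slot;
}
```

In this toy, with `kv_unified` false, slots 0 and 1 map to different streams, so releasing slot 0 returns only slot 0's blocks. With `kv_unified` true both slots map to stream 0, so the pool cannot attribute blocks to an individual slot, which is the collapse onto a single pool that the message describes.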
@@ -732,6 +732,40 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
             } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
                 params.kv_unified = false;
             }
+        // --- paged KV cache (experimental, off by default) ---
+        // Enables the on-demand paged KV-cache engine (vendored PagedKVManager
+        // + paged placement/gather/alloc seams). The engine is gated inside
+        // llama.cpp by the LLAMA_KV_PAGED env var, evaluated once at first use;
+        // here we expose it as a per-server model option instead of forcing the
+        // operator to export a process-wide env. When enabled we set the env
+        // BEFORE the model/context is created (later in this handler), so the
+        // engine latches on. When the option is absent we touch nothing, so an
+        // externally exported LLAMA_KV_PAGED still works as an escape hatch.
+        // Note: the engine's env check is process-wide and latches on first
+        // use, so enabling it for one model enables it for the worker process;
+        // LocalAI runs one model per llama.cpp worker, so this maps cleanly to
+        // per-server configuration. `kv_paged_debug` turns on the per-slot
+        // [paged-alloc]/free trace (LLAMA_KV_PAGED_DEBUG).
+        //
+        // The continuous-batching serving loop (update_slots) drives paged KV
+        // transparently through the existing kv-cache seams: each slot's
+        // sequence allocates paged blocks on arrival (find_slot placement) and
+        // returns them on slot release (the seq_rm free seam). This is
+        // token-identical to stock under both the unified and per-sequence
+        // caches. The per-slot allocate/free capacity benefit, however, only
+        // materialises with a per-sequence cache, since paged block ownership
+        // is keyed by stream and the unified cache collapses every slot onto a
+        // single stream. Operators who want that benefit should pair this with
+        // `kv_unified:false`; we do NOT flip kv_unified here, to keep the
+        // default serving behaviour (and the idle-slot prompt cache) unchanged.
+        } else if (!strcmp(optname, "kv_paged") || !strcmp(optname, "paged_kv") || !strcmp(optname, "paged_attention")) {
+            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
+                setenv("LLAMA_KV_PAGED", "1", 1);
+            }
+        } else if (!strcmp(optname, "kv_paged_debug") || !strcmp(optname, "paged_kv_debug")) {
+            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
+                setenv("LLAMA_KV_PAGED_DEBUG", "1", 1);
+            }
         } else if (!strcmp(optname, "n_ctx_checkpoints") || !strcmp(optname, "ctx_checkpoints")) {
             if (optval != NULL) {
                 try {
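
The ordering constraint in the comment (set the env before the model/context is created, because the engine's check latches on first use) can be illustrated with a small standalone sketch. Everything here is hypothetical: `is_truthy`, `enable_env_if_truthy`, and `toy_flag_latched` are illustrative names, the latch is a stand-in for whatever check llama.cpp actually performs, and treating only the value "1" as enabled is an assumption. The sketch relies on POSIX `setenv`, as the patch does.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: accepts the same truthy spellings as the hunk.
inline bool is_truthy(const std::string& v) {
    return v == "true" || v == "1" || v == "yes" || v == "on" || v == "enabled";
}

// Sets `env_name` to "1" only for a truthy option value, mirroring the
// patch's "touch nothing otherwise" behaviour. Returns true if the
// variable was set by this call. Uses POSIX setenv.
inline bool enable_env_if_truthy(const char* env_name, const std::string& value) {
    assert(env_name != nullptr);
    if (!is_truthy(value)) return false;
    return setenv(env_name, "1", /*overwrite=*/1) == 0;
}

// Hypothetical stand-in for a process-wide check that is evaluated once.
// The function-local static is initialised on the first call only, so later
// changes to the environment are not observed. Treating only "1" as
// enabled is an assumption made for this sketch.
inline bool toy_flag_latched(const char* env_name) {
    static const bool latched = [env_name] {
        const char* v = std::getenv(env_name);
        return v != nullptr && std::string(v) == "1";
    }();
    return latched;
}
```

Calling `enable_env_if_truthy("TOY_KV_PAGED", "on")` before the first `toy_flag_latched("TOY_KV_PAGED")` makes the latch report true for the rest of the process, even if the variable is unset afterwards. Reversing the order would latch false, which is why the option has to be handled before context creation. Note also that a falsy value leaves the environment untouched, so in this sketch an explicit "false" does not override an externally exported variable.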