LocalAI/backend/cpp/llama-cpp/patches/paged
Ettore Di Giacinto 67c6208b3a feat(llama-cpp/paged): cross-request prefix caching patch 0006
Mirror patch 0006 of the paged-attention series into the vendored llama.cpp
patch set. Extends the vendored PagedKVManager (src/paged-kv-manager) with
host-side cross-request prefix sharing: place_with_prefix reuses cached
physical blocks for a new sequence's shared prefix (ref_cnt++) and allocates
only the divergent suffix; cow_block copy-on-writes a still-shared (ref>1)
block before a divergent write so co-owners stay byte-correct; ref-counted
free releases a shared block only at ref 0. Core kv-cache files untouched;
gated behind LLAMA_KV_PAGED, default off.
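
The ref-counting scheme described above can be sketched as a toy model.
This is an illustration only, not the vendored PagedKVManager: the struct
ToyPagedKV, its fields, and all signatures here (including those given to
place_with_prefix and cow_block) are assumptions made for the example, and
the actual KV data copy is omitted.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical toy model of ref-counted block sharing (not the real API).
struct ToyPagedKV {
    std::vector<int> ref_cnt;                          // owners per physical block
    std::vector<int> free_list;                        // unused physical block ids
    std::unordered_map<int, std::vector<int>> tables;  // seq id -> block table

    explicit ToyPagedKV(int n_blocks) : ref_cnt(n_blocks, 0) {
        for (int b = n_blocks - 1; b >= 0; --b) free_list.push_back(b);
    }

    // Take one block from the free list with a single owner.
    int alloc_block() {
        assert(!free_list.empty());
        int b = free_list.back();
        free_list.pop_back();
        ref_cnt[b] = 1;
        return b;
    }

    // Place a sequence with no sharing: every block is freshly allocated.
    void place(int seq, int n_total) {
        std::vector<int> table;
        for (int i = 0; i < n_total; ++i) table.push_back(alloc_block());
        tables[seq] = table;
    }

    // Reuse the first n_shared blocks of `src` (ref_cnt++) and allocate
    // only the divergent suffix.
    void place_with_prefix(int seq, int src, int n_shared, int n_total) {
        std::vector<int> table;
        const std::vector<int>& src_table = tables.at(src);
        for (int i = 0; i < n_shared; ++i) {
            table.push_back(src_table[i]);
            ++ref_cnt[src_table[i]];
        }
        for (int i = n_shared; i < n_total; ++i) table.push_back(alloc_block());
        tables[seq] = table;
    }

    // Before a divergent write: if the block is still shared (ref > 1),
    // give `seq` a private block so co-owners keep the original.
    // A real implementation would also copy the block's KV contents.
    int cow_block(int seq, int idx) {
        int& b = tables.at(seq)[idx];
        if (ref_cnt[b] > 1) {
            --ref_cnt[b];
            b = alloc_block();
        }
        return b;
    }

    // Ref-counted free: a block returns to the free list only at ref 0.
    void free_seq(int seq) {
        for (int b : tables.at(seq)) {
            if (--ref_cnt[b] == 0) free_list.push_back(b);
        }
        tables.erase(seq);
    }
};
```

In this sketch, placing a 3-block sequence and then a second sequence that
shares its first 2 blocks consumes 4 physical blocks rather than 6, and
freeing the first sequence leaves the shared blocks alive until the second
is freed as well.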

Gate 0 verified on the dev tree (CPU, Qwen3-0.6B-Q8_0): shared-prefix
greedy tokens byte-identical to the unshared baseline at both a block boundary
and mid-block, measured 2-block reuse (ref_cnt==2, only the suffix allocated),
and copy-on-write + seq_rm ref-count safety with no use-after-free.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 10:14:27 +00:00