LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Files

Ettore Di Giacinto 67c6208b3a feat(llama-cpp/paged): cross-request prefix caching patch 0006

Mirror patch 0006 of the paged-attention series into the vendored llama.cpp
patch set. Extends the vendored PagedKVManager (src/paged-kv-manager) with
host-side cross-request prefix sharing: place_with_prefix reuses cached
physical blocks for a new sequence's shared prefix (ref_cnt++) and allocates
only the divergent suffix; cow_block copy-on-writes a still-shared (ref>1)
block before a divergent write so co-owners stay byte-correct; ref-counted
free releases a shared block only at ref 0. Core kv-cache files untouched;
gated behind LLAMA_KV_PAGED, default off.
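
The three operations described above can be illustrated with a toy,
host-side block pool. This is only a sketch under stated assumptions: the
struct ToyPagedPool, its members, and the signatures of place,
place_with_prefix, cow_block and free_seq are hypothetical and are not
taken from the vendored PagedKVManager. A small integer vector stands in
for each block's KV data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy model only: names and layout are hypothetical, not the PagedKVManager API.
struct ToyPagedPool {
    std::vector<int> ref_cnt;                         // owners per physical block
    std::vector<std::vector<int32_t>> data;           // stand-in for per-block KV data
    std::vector<int> free_list;                       // physical blocks with ref_cnt == 0
    std::unordered_map<int, std::vector<int>> table;  // seq id -> logical-to-physical map

    explicit ToyPagedPool(int n_blocks) : ref_cnt(n_blocks, 0), data(n_blocks) {
        for (int b = n_blocks - 1; b >= 0; --b) free_list.push_back(b);
    }

    // Take one free block and make the caller its only owner.
    int alloc() {
        assert(!free_list.empty());
        int b = free_list.back();
        free_list.pop_back();
        ref_cnt[b] = 1;
        data[b].clear();
        return b;
    }

    // Place a sequence with no sharing: every block is freshly allocated.
    void place(int seq, std::size_t n_total) {
        std::vector<int> blocks;
        while (blocks.size() < n_total) blocks.push_back(alloc());
        table[seq] = blocks;
    }

    // Reuse the first n_shared blocks of `src` (ref_cnt++), allocate only the suffix.
    void place_with_prefix(int seq, int src, std::size_t n_shared, std::size_t n_total) {
        const std::vector<int> & src_blocks = table.at(src);
        assert(n_shared <= src_blocks.size() && n_shared <= n_total);
        std::vector<int> blocks(src_blocks.begin(), src_blocks.begin() + n_shared);
        for (int b : blocks) ++ref_cnt[b];
        while (blocks.size() < n_total) blocks.push_back(alloc());
        table[seq] = blocks;
    }

    // Before a divergent write: if the block is still shared (ref > 1), copy it
    // into a private block so the other owners keep their bytes unchanged.
    int cow_block(int seq, std::size_t idx) {
        int & slot = table.at(seq).at(idx);
        if (ref_cnt[slot] > 1) {
            const int old   = slot;
            const int fresh = alloc();
            data[fresh] = data[old];
            --ref_cnt[old];
            slot = fresh;
        }
        return slot;  // physical block that seq may now write to
    }

    // Drop one reference per block; a block returns to the pool only at ref 0.
    void free_seq(int seq) {
        for (int b : table.at(seq)) {
            assert(ref_cnt[b] > 0);
            if (--ref_cnt[b] == 0) free_list.push_back(b);
        }
        table.erase(seq);
    }
};
```

In this toy model, placing a second sequence that shares two of three
blocks with the first leaves both shared blocks at ref_cnt == 2 and
allocates a single new block for the suffix. Freeing the first sequence
afterwards returns only its unshared block to the pool.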

Gate 0 verified on the dev tree (CPU, Qwen3-0.6B-Q8_0): shared-prefix
greedy tokens byte-identical to the unshared baseline at both a block boundary
and mid-block, measured 2-block reuse (ref_cnt==2, only the suffix allocated),
and copy-on-write + seq_rm ref-count safety with no use-after-free.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-22 10:14:27 +00:00

0001-vendor-paged-kv-manager.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0003-gather-read-plan.md

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0003-paged-gather-read-env-LLAMA_KV_PAGED.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)