feat(llama-cpp/paged): engine-level prefix recompute-skip (patch 0007)

Mirror patch 0007 of the paged-attention series into the vendored llama.cpp
patch set. It wires the host-side cross-request prefix cache (0006) into the
engine so a new sequence physically shares the cached prefix blocks (ref-counted)
and decodes only the divergent suffix - the shared prefix KV is never recomputed.

paged-alloc becomes one persistent caching PagedKVManager per (kv-cache, stream)
whose requests are keyed by the real seq_id, so free is per-sequence and
ref-counted. Two gated llama_kv_cache methods are added: paged_prefix_share
marks the shared physical cells' seq-membership so the engine attention mask
covers the already-computed prefix, and paged_prefix_commit publishes a
sequence's full blocks for later reuse. find_slot anchors placement on each
sequence's ubatch.pos. Existing-file core touch is llama-kv-cache.{cpp,h}
(+71 -3); everything else is additive vendored units. Gated behind
LLAMA_KV_PAGED, default off, leaving stock behavior byte-identical.
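
The ref-counted free described above can be illustrated with a toy model. This
is a minimal sketch, not the vendored PagedKVManager: ToyBlockPool and all of
its methods are hypothetical names, and sharing is driven by an explicit source
sequence rather than by a content cache.

```cpp
// Toy model only: NOT the vendored PagedKVManager. All names are hypothetical.
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

class ToyBlockPool {
public:
    explicit ToyBlockPool(int n_blocks) : ref_(n_blocks, 0) {
        // Push in reverse so block 0 is handed out first.
        for (int b = n_blocks - 1; b >= 0; --b) free_.push_back(b);
    }

    // Give `seq` one fresh block; returns false on pool exhaustion.
    bool grow(int seq) {
        if (free_.empty()) return false;
        const int b = free_.back();
        free_.pop_back();
        ref_[b] = 1;
        table_[seq].push_back(b);
        return true;
    }

    // Let `dst` share the first `n_blocks` blocks of `src` (ref count + 1 each).
    // Assumes `dst` holds no blocks yet. Returns the number of shared blocks.
    std::size_t share(int dst, int src, std::size_t n_blocks) {
        const auto it = table_.find(src);
        if (it == table_.end() || dst == src) return 0;
        const std::vector<int> src_blocks = it->second; // copy before inserting dst
        std::vector<int> & dst_blocks = table_[dst];
        assert(dst_blocks.empty());
        const std::size_t n = n_blocks < src_blocks.size() ? n_blocks : src_blocks.size();
        for (std::size_t i = 0; i < n; ++i) {
            ++ref_[src_blocks[i]];
            dst_blocks.push_back(src_blocks[i]);
        }
        return n;
    }

    // Release one sequence; a block returns to the pool only at ref count 0.
    void release(int seq) {
        const auto it = table_.find(seq);
        if (it == table_.end()) return;
        for (int b : it->second) {
            if (--ref_[b] == 0) free_.push_back(b);
        }
        table_.erase(it);
    }

    int         ref_count(int block) const { return ref_[block]; }
    std::size_t num_free() const { return free_.size(); }
    int         block_of(int seq, std::size_t i) const { return table_.at(seq).at(i); }

private:
    std::vector<int>                ref_;   // per-block reference count
    std::vector<int>                free_;  // stack of free block ids
    std::map<int, std::vector<int>> table_; // seq -> ordered block ids
};
```

In this model, releasing one of two sharers drops a shared block's count from 2
to 1 and leaves it usable by the survivor; the block rejoins the free list only
when the last holder is released.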

Verified on Qwen3-0.6B-Q8_0 (CPU, unified cache): greedy byte-identity vs decode
from scratch at a block boundary and mid-block, prefill computing only the suffix
(32 prefix tokens skipped), and ref-counted free safety (2->1 on one sharer's
removal, survivor intact and re-shareable, pool restored when all freed). The
0004 serving gate stays byte-identical stock vs paged in unified and non-unified
mode.
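
Because sharing happens in whole blocks, the number of skipped prefix tokens is
a multiple of the block size. The sketch below shows that rounding under two
assumptions: the 16-token block size hard-coded in the patch, and a direct
token comparison standing in for the manager's content-hash lookup.
shared_prefix_tokens is a hypothetical helper, not part of the patch.

```cpp
// Hypothetical helper: direct token comparison stands in for the content-hash
// lookup; the real matching logic lives in the vendored PagedKVManager.
#include <cstddef>
#include <vector>

// Number of leading tokens two prompts can share when sharing is done in
// whole blocks of `block_size` tokens.
std::size_t shared_prefix_tokens(const std::vector<int> & cached,
                                 const std::vector<int> & incoming,
                                 std::size_t block_size) {
    if (block_size == 0) {
        return 0;
    }
    std::size_t common = 0;
    while (common < cached.size() && common < incoming.size() &&
           cached[common] == incoming[common]) {
        ++common;
    }
    // Round down to a block boundary: a partially matching block is not shared.
    return (common / block_size) * block_size;
}
```

With a block size of 16, a divergence at token 32 (a block boundary) and a
divergence at token 40 (mid-block) both give 32 shared tokens in this model,
and a match shorter than one block gives 0.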

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-22 10:47:10 +00:00
parent 67c6208b3a
commit ecffd4b097

@@ -0,0 +1,531 @@
From da20c1c0571e84bc76202d915d4bb82892a3392b Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Mon, 22 Jun 2026 12:46:28 +0200
Subject: [PATCH] paged engine prefix recompute-skip (env LLAMA_KV_PAGED) -
patch 0007
Wire the host-side cross-request prefix cache (patch 0006) into the engine so a
new sequence physically SHARES the cached prefix blocks and skips recomputing the
shared prefix - the actual compute win that 0006 (which only proved the host-side
machinery + realised reuse via the stock seq_cp) did not yet deliver from the
paged path itself.
Mechanism (all gated behind LLAMA_KV_PAGED; default off, stock byte-identical):
* paged-alloc reworked from a per-stream, request-0, destroyed-on-free manager
into ONE persistent caching PagedKVManager per (kv-cache, stream) whose
requests are keyed by the real llama_seq_id. free(seq) now releases exactly
one sequence, so ref-counted shared blocks survive while another sharer holds
them. New seams: share_prefix (place_with_prefix -> shared prefix tokens),
slot, commit (publish a sequence into the content cache), ref-counted release,
plus ref/num-free introspection.
* Two gated llama_kv_cache methods (the core seq-membership handling 0007 needs):
paged_prefix_share() reuses the longest cached content prefix for a sequence
and marks the shared physical cells as belonging to it (cells.seq_add) so the
engine's attention mask includes the already-computed prefix KV; the caller
then decodes ONLY the divergent suffix. paged_prefix_commit() publishes a
sequence's full blocks for later reuse.
* find_slot's paged branch anchors placement on each sequence's own logical base
(ubatch.pos) and keys the manager request by seq_id, so an independently-freed
sequence and a shared prefix coexist in one unified pool. seq_rm/clear free
per-sequence (ref-counted) instead of nuking the whole stream.
* paged-prefix-api: a thin gated shim so a caller holding only the public
llama.h can reach the seam and the introspection without the internal headers.
Core existing-file touch: src/llama-kv-cache.{cpp,h}, +71 -3. Everything else is
additive vendored units. Verified on Qwen3-0.6B-Q8_0 (CPU, unified cache): a
sequence B sharing A's prefix decodes greedy tokens byte-identical to B from
scratch with the prefill computing ONLY the suffix (32 prefix tokens skipped) at
a block boundary AND mid-block; the shared block carries ref_cnt 2 while both
hold it, drops to 1 when one sharer is removed (survivor intact, re-shareable, no
use-after-free) and returns to the pool only when all sharers are freed. The
0004 serving gate (unified and non-unified) stays byte-identical stock vs paged.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
src/CMakeLists.txt | 1 +
src/llama-kv-cache.cpp | 66 +++++++++++++++++++++++--
src/llama-kv-cache.h | 8 +++
src/paged-alloc.cpp | 104 ++++++++++++++++++++++++++++++---------
src/paged-alloc.h | 69 +++++++++++++++++++-------
src/paged-prefix-api.cpp | 48 ++++++++++++++++++
src/paged-prefix-api.h | 27 ++++++++++
7 files changed, 280 insertions(+), 43 deletions(-)
create mode 100644 src/paged-prefix-api.cpp
create mode 100644 src/paged-prefix-api.h
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index 4d9d7d1..432f42d 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -27,6 +27,7 @@ add_library(llama
paged-kv-manager.cpp
paged-attn.cpp
paged-alloc.cpp
+ paged-prefix-api.cpp
llama-kv-cache-dsa.cpp
llama-memory.cpp
llama-memory-hybrid.cpp
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
index 1125d9a..7510ff9 100644
--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
@@ -419,7 +419,7 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
// removed (sequence end), so they return to the pool for reuse.
if (paged_alloc::active() && p0 == 0 && p1 == std::numeric_limits<llama_pos>::max()) {
if (seq_id >= 0) {
- paged_alloc::release(this, (int) seq_to_stream[seq_id]);
+ paged_alloc::release(this, (int) seq_to_stream[seq_id], (int) seq_id);
} else {
paged_alloc::release_all(this);
}
@@ -1056,10 +1056,15 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
const uint32_t bs = 16; // block size (tokens/block)
const uint32_t nblk = cells.size() / bs; // this stream's block budget
if (nblk >= 2) {
- const uint32_t base = cells.get_used();
+ // [paged 0007] Anchor placement on this sequence's own logical
+ // base position (ubatch.pos), not the shared used-count, and key
+ // the manager request by the real seq_id. slot(seq,pos) is then
+ // stable per sequence, so an independently-freed (ref-counted)
+ // sequence and a shared prefix can coexist in one unified pool.
+ const uint32_t base = (uint32_t) ubatch.pos[s*n_tokens];
const int strm = (int) seq_to_stream[seq_id];
std::vector<uint32_t> placed;
- if (paged_alloc::place(this, strm, base, n_tokens, bs, nblk, placed)) {
+ if (paged_alloc::place(this, strm, (int) seq_id, base, n_tokens, bs, nblk, placed)) {
bool ok = (placed.size() == n_tokens);
for (uint32_t i = 0; ok && i < n_tokens; ++i) {
if (placed[i] >= cells.size() || !cells.is_empty(placed[i])) {
@@ -1165,6 +1170,61 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
return res;
}
+// [paged 0007] Cross-request prefix recompute-skip.
+//
+// Reuse a cached content prefix for seq_id: share_prefix() splices the longest
+// matching cached physical blocks into seq_id (ref_cnt++) and reserves fresh
+// blocks for the divergent suffix. We then mark the shared physical cells as
+// belonging to seq_id - those cells already hold the owner's computed KV at the
+// matching logical positions, so the caller decodes ONLY the suffix and the
+// prefix is never recomputed. Returns the number of shared prefix tokens.
+// Gated behind LLAMA_KV_PAGED; a no-op (returns 0) otherwise.
+int32_t llama_kv_cache::paged_prefix_share(llama_seq_id seq_id, const std::vector<llama_token> & tokens) {
+ if (!paged_alloc::active() || tokens.empty()) {
+ return 0;
+ }
+ const uint32_t bs = 16;
+ const uint32_t strm = (uint32_t) seq_to_stream[seq_id];
+ auto & cells = v_cells[strm];
+ const uint32_t nblk = cells.size() / bs;
+ if (nblk < 2) {
+ return 0;
+ }
+
+ std::vector<int> toks(tokens.begin(), tokens.end());
+ const size_t kshare = paged_alloc::share_prefix(this, (int) strm, (int) seq_id, toks, bs, nblk);
+
+ for (size_t p = 0; p < kshare; ++p) {
+ const int64_t cell = paged_alloc::slot(this, (int) strm, (int) seq_id, (int) p);
+ if (cell < 0 || (uint32_t) cell >= cells.size() ||
+ cells.is_empty((uint32_t) cell) ||
+ cells.pos_get((uint32_t) cell) != (llama_pos) p) {
+ // Owner cell missing / repurposed: cannot safely share. Roll the
+ // sequence back so the caller recomputes the whole prompt.
+ paged_alloc::release(this, (int) strm, (int) seq_id);
+ return 0;
+ }
+ if (!cells.seq_has((uint32_t) cell, seq_id)) {
+ cells.seq_add((uint32_t) cell, seq_id);
+ }
+ }
+ return (int32_t) kshare;
+}
+
+// [paged 0007] Publish a sequence's full blocks into the content cache so a
+// later paged_prefix_share() can reuse them. Call after the sequence KV is
+// computed (its prefill decode has run).
+void llama_kv_cache::paged_prefix_commit(llama_seq_id seq_id, const std::vector<llama_token> & tokens) {
+ if (!paged_alloc::active() || tokens.empty()) {
+ return;
+ }
+ const uint32_t bs = 16;
+ const uint32_t strm = (uint32_t) seq_to_stream[seq_id];
+ const uint32_t nblk = v_cells[strm].size() / bs;
+ std::vector<int> toks(tokens.begin(), tokens.end());
+ paged_alloc::commit(this, (int) strm, (int) seq_id, toks, bs, nblk);
+}
+
void llama_kv_cache::apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch) {
// TODO: refactor [TAG_KV_CACHE_SHARE_CELLS]
if (other) {
diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
index 494c0fb..f374ac6 100644
--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
@@ -199,6 +199,14 @@ public:
// emplace the ubatch context into slot: [sinfo.idxs[0...ubatch.n_tokens - 1]]
void apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch);
+ // [paged 0007] Cross-request prefix recompute-skip (experimental, gated by
+ // env LLAMA_KV_PAGED). paged_prefix_share() reuses a cached content prefix
+ // for seq_id and returns the number of shared prefix tokens (the caller
+ // decodes only the suffix); paged_prefix_commit() publishes a sequence into
+ // the content cache for later reuse. No-ops when LLAMA_KV_PAGED is unset.
+ int32_t paged_prefix_share (llama_seq_id seq_id, const std::vector<llama_token> & tokens);
+ void paged_prefix_commit(llama_seq_id seq_id, const std::vector<llama_token> & tokens);
+
//
// input API
//
diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp
index 1d13f9c..c1027fb 100644
--- a/src/paged-alloc.cpp
+++ b/src/paged-alloc.cpp
@@ -23,9 +23,13 @@ namespace {
using key_t = std::pair<const void *, int>;
-// One PagedKVManager per (kv-cache, stream): each stream owns a separate
-// physical pool of cells.size() cells, so a manager's block ids map directly to
-// cell ranges within that stream's pool. The internal request id is always 0.
+// One persistent PagedKVManager per (kv-cache, stream): each stream owns a
+// separate physical pool of cells.size() cells, so a manager's block ids map
+// directly to cell ranges within that stream's pool. Requests inside a manager
+// are keyed by the real llama_seq_id (NOT a fixed 0), so free(seq) releases one
+// sequence and shared blocks survive at ref>0 - this is what makes ref-counted
+// cross-request prefix sharing (0007) possible. Caching is enabled so commit()
+// can publish blocks and share_prefix() can hit them.
std::map<key_t, std::unique_ptr<paged::PagedKVManager>> g_managers;
paged::PagedKVManager * get_mgr(const void * cache, int stream,
@@ -33,18 +37,21 @@ paged::PagedKVManager * get_mgr(const void * cache, int stream,
const key_t k{cache, stream};
auto it = g_managers.find(k);
if (it == g_managers.end()) {
- // enable_caching=false: prefix caching is a later patch; 0004 exercises
- // only on-demand allocate / free.
auto mgr = std::make_unique<paged::PagedKVManager>(
- (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/false);
+ (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/true);
it = g_managers.emplace(k, std::move(mgr)).first;
}
return it->second.get();
}
+paged::PagedKVManager * find_mgr(const void * cache, int stream) {
+ auto it = g_managers.find({cache, stream});
+ return it == g_managers.end() ? nullptr : it->second.get();
+}
+
} // namespace
-bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+bool place(const void * cache, int stream, int seq, uint32_t base, uint32_t n_tokens,
uint32_t block_size, uint32_t pool_blocks,
std::vector<uint32_t> & out) {
if (n_tokens == 0) {
@@ -53,43 +60,79 @@ bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
- const size_t before = mgr->block_table(0).size();
+ const size_t before = mgr->block_table(seq).size();
- // Grow the request to cover the highest logical position. The manager pops
- // free blocks only for the boundaries actually crossed - that is the on-
- // demand behavior; an already-covered range adds nothing.
- if (!mgr->allocate(0, (size_t) base + n_tokens)) {
+ // Grow this sequence's request to cover its highest logical position. The
+ // manager pops free blocks only for boundaries actually crossed; if
+ // share_prefix() already reserved these blocks, this is a no-op.
+ if (!mgr->allocate(seq, (size_t) base + n_tokens)) {
return false; // pool exhausted -> caller falls back to the stock path
}
out.reserve(out.size() + n_tokens);
for (uint32_t i = 0; i < n_tokens; ++i) {
- const int64_t s = mgr->slot(0, (int) (base + i));
+ const int64_t s = mgr->slot(seq, (int) (base + i));
out.push_back((uint32_t) s);
}
if (debug()) {
- const size_t after = mgr->block_table(0).size();
+ const size_t after = mgr->block_table(seq).size();
if (after != before) {
fprintf(stderr,
- "[paged-alloc] cache=%p stream=%d grew %zu->%zu blocks "
+ "[paged-alloc] cache=%p stream=%d seq=%d grew %zu->%zu blocks "
"(budget=%u; base=%u +%u tok)\n",
- cache, stream, before, after, pool_blocks, base, n_tokens);
+ cache, stream, seq, before, after, pool_blocks, base, n_tokens);
}
}
return true;
}
-void release(const void * cache, int stream) {
- auto it = g_managers.find({cache, stream});
- if (it == g_managers.end()) {
+size_t share_prefix(const void * cache, int stream, int seq,
+ const std::vector<int> & tokens,
+ uint32_t block_size, uint32_t pool_blocks) {
+ paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
+ const size_t shared_blocks = mgr->place_with_prefix(seq, tokens);
+ const size_t shared_tokens = shared_blocks * (size_t) block_size;
+ if (debug() && shared_blocks > 0) {
+ fprintf(stderr,
+ "[paged-alloc] cache=%p stream=%d seq=%d shares %zu prefix blocks "
+ "(%zu tokens) - prefix NOT recomputed\n",
+ cache, stream, seq, shared_blocks, shared_tokens);
+ }
+ return shared_tokens;
+}
+
+int64_t slot(const void * cache, int stream, int seq, int pos) {
+ paged::PagedKVManager * mgr = find_mgr(cache, stream);
+ if (!mgr) {
+ return -1;
+ }
+ if ((size_t) (pos / mgr->block_size()) >= mgr->num_blocks(seq)) {
+ return -1;
+ }
+ return mgr->slot(seq, pos);
+}
+
+void commit(const void * cache, int stream, int seq,
+ const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks) {
+ paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
+ mgr->cache_blocks(seq, mgr->compute_block_hashes(tokens), tokens.size());
+ if (debug()) {
+ fprintf(stderr, "[paged-alloc] cache=%p stream=%d seq=%d committed %zu tokens\n",
+ cache, stream, seq, tokens.size());
+ }
+}
+
+void release(const void * cache, int stream, int seq) {
+ paged::PagedKVManager * mgr = find_mgr(cache, stream);
+ if (!mgr) {
return;
}
- it->second->free(0);
- g_managers.erase(it);
+ mgr->free(seq); // ref-counted: shared blocks survive while another seq holds them
if (debug()) {
- fprintf(stderr, "[paged-alloc] released cache=%p stream=%d\n", cache, stream);
+ fprintf(stderr, "[paged-alloc] released cache=%p stream=%d seq=%d (free=%zu)\n",
+ cache, stream, seq, mgr->num_free_blocks());
}
}
@@ -103,4 +146,21 @@ void release_all(const void * cache) {
}
}
+int ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size) {
+ paged::PagedKVManager * mgr = find_mgr(cache, stream);
+ if (!mgr) {
+ return -1;
+ }
+ const size_t bi = (size_t) pos / block_size;
+ if (bi >= mgr->num_blocks(seq)) {
+ return -1;
+ }
+ return mgr->block_ref_cnt_at(seq, bi);
+}
+
+size_t num_free(const void * cache, int stream) {
+ paged::PagedKVManager * mgr = find_mgr(cache, stream);
+ return mgr ? mgr->num_free_blocks() : 0;
+}
+
} // namespace paged_alloc
diff --git a/src/paged-alloc.h b/src/paged-alloc.h
index bf66665..88dedef 100644
--- a/src/paged-alloc.h
+++ b/src/paged-alloc.h
@@ -1,17 +1,27 @@
#pragma once
-// On-demand paged KV block allocation (patch 0004, experimental).
+// On-demand paged KV block allocation + cross-request prefix reuse
+// (patches 0004 + 0007, experimental).
//
-// Backs the paged placement in llama_kv_cache::find_slot (patch 0002) with the
-// vendored host-side PagedKVManager (patch 0001). Instead of mapping a
-// sequence's logical positions onto a fixed full-pool permutation, blocks are
-// popped from a free pool ON DEMAND as the sequence crosses block boundaries,
-// and returned to the pool on sequence end. This is where the paged memory-
-// capacity benefit begins: a short sequence holds only a few blocks, not the
-// whole reserved window.
+// Backs the paged placement in llama_kv_cache::find_slot with the vendored
+// host-side PagedKVManager (patch 0001). Two responsibilities:
//
-// Gated behind env LLAMA_KV_PAGED; a no-op when unset. All state lives in this
-// unit (a static registry keyed by kv-cache + stream), so the core kv-cache
-// struct stays untouched - find_slot only gains a gated call.
+// * On-demand allocation (0004): a sequence's logical positions are mapped to
+// physical cells block-by-block, popped from a free pool only as the
+// sequence grows and returned on sequence end.
+//
+// * Cross-request prefix reuse (0007): before a new sequence's suffix is
+// decoded, share_prefix() reuses the cached physical blocks of a matching
+// content prefix (ref_cnt++), so the engine shares the already-computed KV
+// cells and the caller decodes ONLY the divergent suffix - the prefix is not
+// recomputed. commit() publishes a sequence's full blocks into the content
+// cache so later sequences can hit them. Freeing is ref-counted: a shared
+// block returns to the pool only when every sharer has been released.
+//
+// One persistent PagedKVManager per (kv-cache, stream); requests inside it are
+// keyed by the real llama_seq_id, so free(seq) releases exactly one sequence and
+// shared blocks survive at ref>0. All state lives in this unit (a static
+// registry), so the core kv-cache struct stays untouched - find_slot gains only
+// gated calls. Gated behind env LLAMA_KV_PAGED; a no-op when unset.
#include <cstdint>
#include <vector>
@@ -21,19 +31,42 @@ namespace paged_alloc {
// true iff env LLAMA_KV_PAGED is set (evaluated once).
bool active();
-// Place n_tokens logical positions [base, base+n_tokens) of one stream on
-// demand, appending their physical cell indices to `out`. pool_blocks =
-// cells.size()/block_size is this stream's block budget. Returns false (leaving
+// Place n_tokens logical positions [base, base+n_tokens) of (cache,stream,seq)
+// on demand, appending their physical cell indices to `out`. pool_blocks =
+// cells.size()/block_size is the stream's block budget. Returns false (leaving
// `out` unchanged) on pool exhaustion, so the caller falls back to the stock
// allocator. The caller still validates each returned cell is empty.
-bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+bool place(const void * cache, int stream, int seq, uint32_t base, uint32_t n_tokens,
uint32_t block_size, uint32_t pool_blocks,
std::vector<uint32_t> & out);
-// Return a stream's blocks to the pool (sequence end).
-void release(const void * cache, int stream);
+// [0007] Reuse the longest cached content prefix of `tokens` for (cache,stream,
+// seq): splice the shared physical blocks into seq (ref_cnt++) and reserve fresh
+// blocks for the divergent suffix. Returns the number of shared PREFIX TOKENS
+// (block-aligned); the caller marks those cells for seq and decodes only the
+// suffix. 0 if nothing matched or on pool exhaustion (sequence rolled back).
+size_t share_prefix(const void * cache, int stream, int seq,
+ const std::vector<int> & tokens,
+ uint32_t block_size, uint32_t pool_blocks);
+
+// [0007] Physical cell backing logical position `pos` of (cache,stream,seq), or
+// -1 if seq is unknown. Used to map a shared prefix position to its cell.
+int64_t slot(const void * cache, int stream, int seq, int pos);
-// Return every stream's blocks for a kv-cache (clear() / teardown).
+// [0007] Publish seq's full (block-aligned) blocks into the content cache so a
+// later share_prefix() can reuse them. Call after the sequence's KV is computed.
+void commit(const void * cache, int stream, int seq,
+ const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks);
+
+// Return one sequence's blocks to the pool (ref-counted; sequence end).
+void release(const void * cache, int stream, int seq);
+
+// Drop every manager for a kv-cache (clear() / teardown).
void release_all(const void * cache);
+// Introspection for the prefix-share gate (debug/tests). ref_cnt_at returns the
+// ref count of the block backing logical position `pos`, or -1 if unknown.
+int ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size);
+size_t num_free(const void * cache, int stream);
+
} // namespace paged_alloc
diff --git a/src/paged-prefix-api.cpp b/src/paged-prefix-api.cpp
new file mode 100644
index 0000000..8573cd2
--- /dev/null
+++ b/src/paged-prefix-api.cpp
@@ -0,0 +1,48 @@
+#include "paged-prefix-api.h"
+#include "paged-alloc.h"
+#include "llama-kv-cache.h"
+
+#include <vector>
+
+namespace paged_prefix_api {
+
+static llama_kv_cache * kv_of(llama_context * ctx) {
+ // The driver targets a plain unified KV-cache model; dynamic_cast yields null
+ // for wrapped caches (iSWA / hybrid), where cross-request cell sharing does
+ // not apply, so the shim degrades to a safe no-op.
+ return dynamic_cast<llama_kv_cache *>(llama_get_memory(ctx));
+}
+
+int32_t share(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n) {
+ llama_kv_cache * kv = kv_of(ctx);
+ if (!kv || n <= 0) {
+ return 0;
+ }
+ return kv->paged_prefix_share(seq, std::vector<llama_token>(tokens, tokens + n));
+}
+
+void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n) {
+ llama_kv_cache * kv = kv_of(ctx);
+ if (!kv || n <= 0) {
+ return;
+ }
+ kv->paged_prefix_commit(seq, std::vector<llama_token>(tokens, tokens + n));
+}
+
+int ref_at(llama_context * ctx, llama_seq_id seq, int pos) {
+ llama_kv_cache * kv = kv_of(ctx);
+ if (!kv) {
+ return -1;
+ }
+ return paged_alloc::ref_cnt_at((const void *) kv, /*stream=*/0, (int) seq, pos, /*block_size=*/16);
+}
+
+long num_free(llama_context * ctx) {
+ llama_kv_cache * kv = kv_of(ctx);
+ if (!kv) {
+ return 0;
+ }
+ return (long) paged_alloc::num_free((const void *) kv, /*stream=*/0);
+}
+
+} // namespace paged_prefix_api
diff --git a/src/paged-prefix-api.h b/src/paged-prefix-api.h
new file mode 100644
index 0000000..78a3864
--- /dev/null
+++ b/src/paged-prefix-api.h
@@ -0,0 +1,27 @@
+#pragma once
+// Thin test/diagnostic shim over the paged cross-request prefix engine seam
+// (patch 0007). Lets a driver that only includes the public llama.h reach the
+// gated llama_kv_cache::paged_prefix_* methods and the paged-alloc introspection
+// without pulling in the internal kv-cache headers. All entry points are no-ops
+// (return 0) unless env LLAMA_KV_PAGED is set. Experimental; not a stable API.
+
+#include "llama.h"
+
+namespace paged_prefix_api {
+
+// Reuse the longest cached content prefix of [tokens, tokens+n) for `seq` and
+// return the number of shared prefix tokens (the caller decodes only the
+// suffix). 0 if nothing was shared.
+int32_t share(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+
+// Publish `seq`'s full blocks into the content cache (call after its KV is computed).
+void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+
+// Ref count of the paged block backing logical position `pos` of `seq` (unified
+// stream 0), or -1 if unknown.
+int ref_at(llama_context * ctx, llama_seq_id seq, int pos);
+
+// Number of free blocks in the unified stream-0 pool, or 0 if no manager.
+long num_free(llama_context * ctx);
+
+} // namespace paged_prefix_api
--
2.43.0