mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 08:08:52 -04:00
feat(llama-cpp/paged): engine-level prefix recompute-skip (patch 0007)
Mirror patch 0007 of the paged-attention series into the vendored llama.cpp
patch set. It wires the host-side cross-request prefix cache (0006) into the
engine so a new sequence physically shares the cached prefix blocks (ref-counted)
and decodes only the divergent suffix - the shared prefix KV is never recomputed.
paged-alloc becomes one persistent caching PagedKVManager per (kv-cache, stream)
keyed by the real seq_id (per-sequence ref-counted free); two gated
llama_kv_cache methods (paged_prefix_share / paged_prefix_commit) mark the shared
physical cells' seq-membership so the engine attention mask covers the
already-computed prefix; find_slot anchors placement on each sequence's ubatch.pos.
Existing-file core touch is llama-kv-cache.{cpp,h} (+71 -3); everything else is
additive vendored units. Gated behind LLAMA_KV_PAGED, default off, stock
byte-identical.
Verified on Qwen3-0.6B-Q8_0 (CPU, unified cache): greedy byte-identity vs decode
from scratch at a block boundary and mid-block, prefill computing only the suffix
(32 prefix tokens skipped), and ref-counted free safety (2->1 on one sharer's
removal, survivor intact and re-shareable, pool restored when all freed). The
0004 serving gate stays byte-identical stock vs paged in unified and non-unified
mode.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
@@ -0,0 +1,531 @@
|
||||
From da20c1c0571e84bc76202d915d4bb82892a3392b Mon Sep 17 00:00:00 2001
|
||||
From: Ettore Di Giacinto <mudler@localai.io>
|
||||
Date: Mon, 22 Jun 2026 12:46:28 +0200
|
||||
Subject: [PATCH] paged engine prefix recompute-skip (env LLAMA_KV_PAGED) -
|
||||
patch 0007
|
||||
|
||||
Wire the host-side cross-request prefix cache (patch 0006) into the engine so a
|
||||
new sequence physically SHARES the cached prefix blocks and skips recomputing the
|
||||
shared prefix - the actual compute win that 0006 (which only proved the host-side
|
||||
machinery + realised reuse via the stock seq_cp) did not yet deliver from the
|
||||
paged path itself.
|
||||
|
||||
Mechanism (all gated behind LLAMA_KV_PAGED; default off, stock byte-identical):
|
||||
|
||||
* paged-alloc reworked from a per-stream, request-0, destroyed-on-free manager
|
||||
into ONE persistent caching PagedKVManager per (kv-cache, stream) whose
|
||||
requests are keyed by the real llama_seq_id. free(seq) now releases exactly
|
||||
one sequence, so ref-counted shared blocks survive while another sharer holds
|
||||
them. New seams: share_prefix (place_with_prefix -> shared prefix tokens),
|
||||
slot, commit (publish a sequence into the content cache), ref-counted release,
|
||||
plus ref/num-free introspection.
|
||||
|
||||
* Two gated llama_kv_cache methods (the core seq-membership handling 0007 needs):
|
||||
paged_prefix_share() reuses the longest cached content prefix for a sequence
|
||||
and marks the shared physical cells as belonging to it (cells.seq_add) so the
|
||||
engine's attention mask includes the already-computed prefix KV; the caller
|
||||
then decodes ONLY the divergent suffix. paged_prefix_commit() publishes a
|
||||
sequence's full blocks for later reuse.
|
||||
|
||||
* find_slot's paged branch anchors placement on each sequence's own logical base
|
||||
(ubatch.pos) and keys the manager request by seq_id, so an independently-freed
|
||||
sequence and a shared prefix coexist in one unified pool. seq_rm/clear free
|
||||
per-sequence (ref-counted) instead of nuking the whole stream.
|
||||
|
||||
* paged-prefix-api: a thin gated shim so a caller holding only the public
|
||||
llama.h can reach the seam and the introspection without the internal headers.
|
||||
|
||||
Core existing-file touch: src/llama-kv-cache.{cpp,h}, +71 -3. Everything else is
|
||||
additive vendored units. Verified on Qwen3-0.6B-Q8_0 (CPU, unified cache): a
|
||||
sequence B sharing A's prefix decodes greedy tokens byte-identical to B from
|
||||
scratch with the prefill computing ONLY the suffix (32 prefix tokens skipped) at
|
||||
a block boundary AND mid-block; the shared block carries ref_cnt 2 while both
|
||||
hold it, drops to 1 when one sharer is removed (survivor intact, re-shareable, no
|
||||
use-after-free) and returns to the pool only when all sharers are freed. The
|
||||
0004 serving gate (unified and non-unified) stays byte-identical stock vs paged.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||||
---
|
||||
src/CMakeLists.txt | 1 +
|
||||
src/llama-kv-cache.cpp | 66 +++++++++++++++++++++++--
|
||||
src/llama-kv-cache.h | 8 +++
|
||||
src/paged-alloc.cpp | 104 ++++++++++++++++++++++++++++++---------
|
||||
src/paged-alloc.h | 69 +++++++++++++++++++-------
|
||||
src/paged-prefix-api.cpp | 48 ++++++++++++++++++
|
||||
src/paged-prefix-api.h | 27 ++++++++++
|
||||
7 files changed, 280 insertions(+), 43 deletions(-)
|
||||
create mode 100644 src/paged-prefix-api.cpp
|
||||
create mode 100644 src/paged-prefix-api.h
|
||||
|
||||
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
|
||||
index 4d9d7d1..432f42d 100644
|
||||
--- a/src/CMakeLists.txt
|
||||
+++ b/src/CMakeLists.txt
|
||||
@@ -27,6 +27,7 @@ add_library(llama
|
||||
paged-kv-manager.cpp
|
||||
paged-attn.cpp
|
||||
paged-alloc.cpp
|
||||
+ paged-prefix-api.cpp
|
||||
llama-kv-cache-dsa.cpp
|
||||
llama-memory.cpp
|
||||
llama-memory-hybrid.cpp
|
||||
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
|
||||
index 1125d9a..7510ff9 100644
|
||||
--- a/src/llama-kv-cache.cpp
|
||||
+++ b/src/llama-kv-cache.cpp
|
||||
@@ -419,7 +419,7 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
|
||||
// removed (sequence end), so they return to the pool for reuse.
|
||||
if (paged_alloc::active() && p0 == 0 && p1 == std::numeric_limits<llama_pos>::max()) {
|
||||
if (seq_id >= 0) {
|
||||
- paged_alloc::release(this, (int) seq_to_stream[seq_id]);
|
||||
+ paged_alloc::release(this, (int) seq_to_stream[seq_id], (int) seq_id);
|
||||
} else {
|
||||
paged_alloc::release_all(this);
|
||||
}
|
||||
@@ -1056,10 +1056,15 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
|
||||
const uint32_t bs = 16; // block size (tokens/block)
|
||||
const uint32_t nblk = cells.size() / bs; // this stream's block budget
|
||||
if (nblk >= 2) {
|
||||
- const uint32_t base = cells.get_used();
|
||||
+ // [paged 0007] Anchor placement on this sequence's own logical
|
||||
+ // base position (ubatch.pos), not the shared used-count, and key
|
||||
+ // the manager request by the real seq_id. slot(seq,pos) is then
|
||||
+ // stable per sequence, so an independently-freed (ref-counted)
|
||||
+ // sequence and a shared prefix can coexist in one unified pool.
|
||||
+ const uint32_t base = (uint32_t) ubatch.pos[s*n_tokens];
|
||||
const int strm = (int) seq_to_stream[seq_id];
|
||||
std::vector<uint32_t> placed;
|
||||
- if (paged_alloc::place(this, strm, base, n_tokens, bs, nblk, placed)) {
|
||||
+ if (paged_alloc::place(this, strm, (int) seq_id, base, n_tokens, bs, nblk, placed)) {
|
||||
bool ok = (placed.size() == n_tokens);
|
||||
for (uint32_t i = 0; ok && i < n_tokens; ++i) {
|
||||
if (placed[i] >= cells.size() || !cells.is_empty(placed[i])) {
|
||||
@@ -1165,6 +1170,61 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
|
||||
return res;
|
||||
}
|
||||
|
||||
+// [paged 0007] Cross-request prefix recompute-skip.
|
||||
+//
|
||||
+// Reuse a cached content prefix for seq_id: share_prefix() splices the longest
|
||||
+// matching cached physical blocks into seq_id (ref_cnt++) and reserves fresh
|
||||
+// blocks for the divergent suffix. We then mark the shared physical cells as
|
||||
+// belonging to seq_id - those cells already hold the owner's computed KV at the
|
||||
+// matching logical positions, so the caller decodes ONLY the suffix and the
|
||||
+// prefix is never recomputed. Returns the number of shared prefix tokens.
|
||||
+// Gated behind LLAMA_KV_PAGED; a no-op (returns 0) otherwise.
|
||||
+int32_t llama_kv_cache::paged_prefix_share(llama_seq_id seq_id, const std::vector<llama_token> & tokens) {
|
||||
+ if (!paged_alloc::active() || tokens.empty()) {
|
||||
+ return 0;
|
||||
+ }
|
||||
+ const uint32_t bs = 16;
|
||||
+ const uint32_t strm = (uint32_t) seq_to_stream[seq_id];
|
||||
+ auto & cells = v_cells[strm];
|
||||
+ const uint32_t nblk = cells.size() / bs;
|
||||
+ if (nblk < 2) {
|
||||
+ return 0;
|
||||
+ }
|
||||
+
|
||||
+ std::vector<int> toks(tokens.begin(), tokens.end());
|
||||
+ const size_t kshare = paged_alloc::share_prefix(this, (int) strm, (int) seq_id, toks, bs, nblk);
|
||||
+
|
||||
+ for (size_t p = 0; p < kshare; ++p) {
|
||||
+ const int64_t cell = paged_alloc::slot(this, (int) strm, (int) seq_id, (int) p);
|
||||
+ if (cell < 0 || (uint32_t) cell >= cells.size() ||
|
||||
+ cells.is_empty((uint32_t) cell) ||
|
||||
+ cells.pos_get((uint32_t) cell) != (llama_pos) p) {
|
||||
+ // Owner cell missing / repurposed: cannot safely share. Roll the
|
||||
+ // sequence back so the caller recomputes the whole prompt.
|
||||
+ paged_alloc::release(this, (int) strm, (int) seq_id);
|
||||
+ return 0;
|
||||
+ }
|
||||
+ if (!cells.seq_has((uint32_t) cell, seq_id)) {
|
||||
+ cells.seq_add((uint32_t) cell, seq_id);
|
||||
+ }
|
||||
+ }
|
||||
+ return (int32_t) kshare;
|
||||
+}
|
||||
+
|
||||
+// [paged 0007] Publish a sequence's full blocks into the content cache so a
|
||||
+// later paged_prefix_share() can reuse them. Call after the sequence KV is
|
||||
+// computed (its prefill decode has run).
|
||||
+void llama_kv_cache::paged_prefix_commit(llama_seq_id seq_id, const std::vector<llama_token> & tokens) {
|
||||
+ if (!paged_alloc::active() || tokens.empty()) {
|
||||
+ return;
|
||||
+ }
|
||||
+ const uint32_t bs = 16;
|
||||
+ const uint32_t strm = (uint32_t) seq_to_stream[seq_id];
|
||||
+ const uint32_t nblk = v_cells[strm].size() / bs;
|
||||
+ std::vector<int> toks(tokens.begin(), tokens.end());
|
||||
+ paged_alloc::commit(this, (int) strm, (int) seq_id, toks, bs, nblk);
|
||||
+}
|
||||
+
|
||||
void llama_kv_cache::apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch) {
|
||||
// TODO: refactor [TAG_KV_CACHE_SHARE_CELLS]
|
||||
if (other) {
|
||||
diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
|
||||
index 494c0fb..f374ac6 100644
|
||||
--- a/src/llama-kv-cache.h
|
||||
+++ b/src/llama-kv-cache.h
|
||||
@@ -199,6 +199,14 @@ public:
|
||||
// emplace the ubatch context into slot: [sinfo.idxs[0...ubatch.n_tokens - 1]]
|
||||
void apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch);
|
||||
|
||||
+ // [paged 0007] Cross-request prefix recompute-skip (experimental, gated by
|
||||
+ // env LLAMA_KV_PAGED). paged_prefix_share() reuses a cached content prefix
|
||||
+ // for seq_id and returns the number of shared prefix tokens (the caller
|
||||
+ // decodes only the suffix); paged_prefix_commit() publishes a sequence into
|
||||
+ // the content cache for later reuse. No-ops when LLAMA_KV_PAGED is unset.
|
||||
+ int32_t paged_prefix_share (llama_seq_id seq_id, const std::vector<llama_token> & tokens);
|
||||
+ void paged_prefix_commit(llama_seq_id seq_id, const std::vector<llama_token> & tokens);
|
||||
+
|
||||
//
|
||||
// input API
|
||||
//
|
||||
diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp
|
||||
index 1d13f9c..c1027fb 100644
|
||||
--- a/src/paged-alloc.cpp
|
||||
+++ b/src/paged-alloc.cpp
|
||||
@@ -23,9 +23,13 @@ namespace {
|
||||
|
||||
using key_t = std::pair<const void *, int>;
|
||||
|
||||
-// One PagedKVManager per (kv-cache, stream): each stream owns a separate
|
||||
-// physical pool of cells.size() cells, so a manager's block ids map directly to
|
||||
-// cell ranges within that stream's pool. The internal request id is always 0.
|
||||
+// One persistent PagedKVManager per (kv-cache, stream): each stream owns a
|
||||
+// separate physical pool of cells.size() cells, so a manager's block ids map
|
||||
+// directly to cell ranges within that stream's pool. Requests inside a manager
|
||||
+// are keyed by the real llama_seq_id (NOT a fixed 0), so free(seq) releases one
|
||||
+// sequence and shared blocks survive at ref>0 - this is what makes ref-counted
|
||||
+// cross-request prefix sharing (0007) possible. Caching is enabled so commit()
|
||||
+// can publish blocks and share_prefix() can hit them.
|
||||
std::map<key_t, std::unique_ptr<paged::PagedKVManager>> g_managers;
|
||||
|
||||
paged::PagedKVManager * get_mgr(const void * cache, int stream,
|
||||
@@ -33,18 +37,21 @@ paged::PagedKVManager * get_mgr(const void * cache, int stream,
|
||||
const key_t k{cache, stream};
|
||||
auto it = g_managers.find(k);
|
||||
if (it == g_managers.end()) {
|
||||
- // enable_caching=false: prefix caching is a later patch; 0004 exercises
|
||||
- // only on-demand allocate / free.
|
||||
auto mgr = std::make_unique<paged::PagedKVManager>(
|
||||
- (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/false);
|
||||
+ (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/true);
|
||||
it = g_managers.emplace(k, std::move(mgr)).first;
|
||||
}
|
||||
return it->second.get();
|
||||
}
|
||||
|
||||
+paged::PagedKVManager * find_mgr(const void * cache, int stream) {
|
||||
+ auto it = g_managers.find({cache, stream});
|
||||
+ return it == g_managers.end() ? nullptr : it->second.get();
|
||||
+}
|
||||
+
|
||||
} // namespace
|
||||
|
||||
-bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
|
||||
+bool place(const void * cache, int stream, int seq, uint32_t base, uint32_t n_tokens,
|
||||
uint32_t block_size, uint32_t pool_blocks,
|
||||
std::vector<uint32_t> & out) {
|
||||
if (n_tokens == 0) {
|
||||
@@ -53,43 +60,79 @@ bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
|
||||
|
||||
paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
|
||||
|
||||
- const size_t before = mgr->block_table(0).size();
|
||||
+ const size_t before = mgr->block_table(seq).size();
|
||||
|
||||
- // Grow the request to cover the highest logical position. The manager pops
|
||||
- // free blocks only for the boundaries actually crossed - that is the on-
|
||||
- // demand behavior; an already-covered range adds nothing.
|
||||
- if (!mgr->allocate(0, (size_t) base + n_tokens)) {
|
||||
+ // Grow this sequence's request to cover its highest logical position. The
|
||||
+ // manager pops free blocks only for boundaries actually crossed; if
|
||||
+ // share_prefix() already reserved these blocks, this is a no-op.
|
||||
+ if (!mgr->allocate(seq, (size_t) base + n_tokens)) {
|
||||
return false; // pool exhausted -> caller falls back to the stock path
|
||||
}
|
||||
|
||||
out.reserve(out.size() + n_tokens);
|
||||
for (uint32_t i = 0; i < n_tokens; ++i) {
|
||||
- const int64_t s = mgr->slot(0, (int) (base + i));
|
||||
+ const int64_t s = mgr->slot(seq, (int) (base + i));
|
||||
out.push_back((uint32_t) s);
|
||||
}
|
||||
|
||||
if (debug()) {
|
||||
- const size_t after = mgr->block_table(0).size();
|
||||
+ const size_t after = mgr->block_table(seq).size();
|
||||
if (after != before) {
|
||||
fprintf(stderr,
|
||||
- "[paged-alloc] cache=%p stream=%d grew %zu->%zu blocks "
|
||||
+ "[paged-alloc] cache=%p stream=%d seq=%d grew %zu->%zu blocks "
|
||||
"(budget=%u; base=%u +%u tok)\n",
|
||||
- cache, stream, before, after, pool_blocks, base, n_tokens);
|
||||
+ cache, stream, seq, before, after, pool_blocks, base, n_tokens);
|
||||
}
|
||||
}
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
-void release(const void * cache, int stream) {
|
||||
- auto it = g_managers.find({cache, stream});
|
||||
- if (it == g_managers.end()) {
|
||||
+size_t share_prefix(const void * cache, int stream, int seq,
|
||||
+ const std::vector<int> & tokens,
|
||||
+ uint32_t block_size, uint32_t pool_blocks) {
|
||||
+ paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
|
||||
+ const size_t shared_blocks = mgr->place_with_prefix(seq, tokens);
|
||||
+ const size_t shared_tokens = shared_blocks * (size_t) block_size;
|
||||
+ if (debug() && shared_blocks > 0) {
|
||||
+ fprintf(stderr,
|
||||
+ "[paged-alloc] cache=%p stream=%d seq=%d shares %zu prefix blocks "
|
||||
+ "(%zu tokens) - prefix NOT recomputed\n",
|
||||
+ cache, stream, seq, shared_blocks, shared_tokens);
|
||||
+ }
|
||||
+ return shared_tokens;
|
||||
+}
|
||||
+
|
||||
+int64_t slot(const void * cache, int stream, int seq, int pos) {
|
||||
+ paged::PagedKVManager * mgr = find_mgr(cache, stream);
|
||||
+ if (!mgr) {
|
||||
+ return -1;
|
||||
+ }
|
||||
+ if ((size_t) (pos / mgr->block_size()) >= mgr->num_blocks(seq)) {
|
||||
+ return -1;
|
||||
+ }
|
||||
+ return mgr->slot(seq, pos);
|
||||
+}
|
||||
+
|
||||
+void commit(const void * cache, int stream, int seq,
|
||||
+ const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks) {
|
||||
+ paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
|
||||
+ mgr->cache_blocks(seq, mgr->compute_block_hashes(tokens), tokens.size());
|
||||
+ if (debug()) {
|
||||
+ fprintf(stderr, "[paged-alloc] cache=%p stream=%d seq=%d committed %zu tokens\n",
|
||||
+ cache, stream, seq, tokens.size());
|
||||
+ }
|
||||
+}
|
||||
+
|
||||
+void release(const void * cache, int stream, int seq) {
|
||||
+ paged::PagedKVManager * mgr = find_mgr(cache, stream);
|
||||
+ if (!mgr) {
|
||||
return;
|
||||
}
|
||||
- it->second->free(0);
|
||||
- g_managers.erase(it);
|
||||
+ mgr->free(seq); // ref-counted: shared blocks survive while another seq holds them
|
||||
if (debug()) {
|
||||
- fprintf(stderr, "[paged-alloc] released cache=%p stream=%d\n", cache, stream);
|
||||
+ fprintf(stderr, "[paged-alloc] released cache=%p stream=%d seq=%d (free=%zu)\n",
|
||||
+ cache, stream, seq, mgr->num_free_blocks());
|
||||
}
|
||||
}
|
||||
|
||||
@@ -103,4 +146,21 @@ void release_all(const void * cache) {
|
||||
}
|
||||
}
|
||||
|
||||
+int ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size) {
|
||||
+ paged::PagedKVManager * mgr = find_mgr(cache, stream);
|
||||
+ if (!mgr) {
|
||||
+ return -1;
|
||||
+ }
|
||||
+ const size_t bi = (size_t) pos / block_size;
|
||||
+ if (bi >= mgr->num_blocks(seq)) {
|
||||
+ return -1;
|
||||
+ }
|
||||
+ return mgr->block_ref_cnt_at(seq, bi);
|
||||
+}
|
||||
+
|
||||
+size_t num_free(const void * cache, int stream) {
|
||||
+ paged::PagedKVManager * mgr = find_mgr(cache, stream);
|
||||
+ return mgr ? mgr->num_free_blocks() : 0;
|
||||
+}
|
||||
+
|
||||
} // namespace paged_alloc
|
||||
diff --git a/src/paged-alloc.h b/src/paged-alloc.h
|
||||
index bf66665..88dedef 100644
|
||||
--- a/src/paged-alloc.h
|
||||
+++ b/src/paged-alloc.h
|
||||
@@ -1,17 +1,27 @@
|
||||
#pragma once
|
||||
-// On-demand paged KV block allocation (patch 0004, experimental).
|
||||
+// On-demand paged KV block allocation + cross-request prefix reuse
|
||||
+// (patches 0004 + 0007, experimental).
|
||||
//
|
||||
-// Backs the paged placement in llama_kv_cache::find_slot (patch 0002) with the
|
||||
-// vendored host-side PagedKVManager (patch 0001). Instead of mapping a
|
||||
-// sequence's logical positions onto a fixed full-pool permutation, blocks are
|
||||
-// popped from a free pool ON DEMAND as the sequence crosses block boundaries,
|
||||
-// and returned to the pool on sequence end. This is where the paged memory-
|
||||
-// capacity benefit begins: a short sequence holds only a few blocks, not the
|
||||
-// whole reserved window.
|
||||
+// Backs the paged placement in llama_kv_cache::find_slot with the vendored
|
||||
+// host-side PagedKVManager (patch 0001). Two responsibilities:
|
||||
//
|
||||
-// Gated behind env LLAMA_KV_PAGED; a no-op when unset. All state lives in this
|
||||
-// unit (a static registry keyed by kv-cache + stream), so the core kv-cache
|
||||
-// struct stays untouched - find_slot only gains a gated call.
|
||||
+// * On-demand allocation (0004): a sequence's logical positions are mapped to
|
||||
+// physical cells block-by-block, popped from a free pool only as the
|
||||
+// sequence grows and returned on sequence end.
|
||||
+//
|
||||
+// * Cross-request prefix reuse (0007): before a new sequence's suffix is
|
||||
+// decoded, share_prefix() reuses the cached physical blocks of a matching
|
||||
+// content prefix (ref_cnt++), so the engine shares the already-computed KV
|
||||
+// cells and the caller decodes ONLY the divergent suffix - the prefix is not
|
||||
+// recomputed. commit() publishes a sequence's full blocks into the content
|
||||
+// cache so later sequences can hit them. Freeing is ref-counted: a shared
|
||||
+// block returns to the pool only when every sharer has been released.
|
||||
+//
|
||||
+// One persistent PagedKVManager per (kv-cache, stream); requests inside it are
|
||||
+// keyed by the real llama_seq_id, so free(seq) releases exactly one sequence and
|
||||
+// shared blocks survive at ref>0. All state lives in this unit (a static
|
||||
+// registry), so the core kv-cache struct stays untouched - find_slot gains only
|
||||
+// gated calls. Gated behind env LLAMA_KV_PAGED; a no-op when unset.
|
||||
|
||||
#include <cstdint>
|
||||
#include <vector>
|
||||
@@ -21,19 +31,42 @@ namespace paged_alloc {
|
||||
// true iff env LLAMA_KV_PAGED is set (evaluated once).
|
||||
bool active();
|
||||
|
||||
-// Place n_tokens logical positions [base, base+n_tokens) of one stream on
|
||||
-// demand, appending their physical cell indices to `out`. pool_blocks =
|
||||
-// cells.size()/block_size is this stream's block budget. Returns false (leaving
|
||||
+// Place n_tokens logical positions [base, base+n_tokens) of (cache,stream,seq)
|
||||
+// on demand, appending their physical cell indices to `out`. pool_blocks =
|
||||
+// cells.size()/block_size is the stream's block budget. Returns false (leaving
|
||||
// `out` unchanged) on pool exhaustion, so the caller falls back to the stock
|
||||
// allocator. The caller still validates each returned cell is empty.
|
||||
-bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
|
||||
+bool place(const void * cache, int stream, int seq, uint32_t base, uint32_t n_tokens,
|
||||
uint32_t block_size, uint32_t pool_blocks,
|
||||
std::vector<uint32_t> & out);
|
||||
|
||||
-// Return a stream's blocks to the pool (sequence end).
|
||||
-void release(const void * cache, int stream);
|
||||
+// [0007] Reuse the longest cached content prefix of `tokens` for (cache,stream,
|
||||
+// seq): splice the shared physical blocks into seq (ref_cnt++) and reserve fresh
|
||||
+// blocks for the divergent suffix. Returns the number of shared PREFIX TOKENS
|
||||
+// (block-aligned); the caller marks those cells for seq and decodes only the
|
||||
+// suffix. 0 if nothing matched or on pool exhaustion (sequence rolled back).
|
||||
+size_t share_prefix(const void * cache, int stream, int seq,
|
||||
+ const std::vector<int> & tokens,
|
||||
+ uint32_t block_size, uint32_t pool_blocks);
|
||||
+
|
||||
+// [0007] Physical cell backing logical position `pos` of (cache,stream,seq), or
|
||||
+// -1 if seq is unknown. Used to map a shared prefix position to its cell.
|
||||
+int64_t slot(const void * cache, int stream, int seq, int pos);
|
||||
|
||||
-// Return every stream's blocks for a kv-cache (clear() / teardown).
|
||||
+// [0007] Publish seq's full (block-aligned) blocks into the content cache so a
|
||||
+// later share_prefix() can reuse them. Call after the sequence's KV is computed.
|
||||
+void commit(const void * cache, int stream, int seq,
|
||||
+ const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks);
|
||||
+
|
||||
+// Return one sequence's blocks to the pool (ref-counted; sequence end).
|
||||
+void release(const void * cache, int stream, int seq);
|
||||
+
|
||||
+// Drop every manager for a kv-cache (clear() / teardown).
|
||||
void release_all(const void * cache);
|
||||
|
||||
+// Introspection for the prefix-share gate (debug/tests). ref_cnt_at returns the
|
||||
+// ref count of the block backing logical position `pos`, or -1 if unknown.
|
||||
+int ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size);
|
||||
+size_t num_free(const void * cache, int stream);
|
||||
+
|
||||
} // namespace paged_alloc
|
||||
diff --git a/src/paged-prefix-api.cpp b/src/paged-prefix-api.cpp
|
||||
new file mode 100644
|
||||
index 0000000..8573cd2
|
||||
--- /dev/null
|
||||
+++ b/src/paged-prefix-api.cpp
|
||||
@@ -0,0 +1,48 @@
|
||||
+#include "paged-prefix-api.h"
|
||||
+#include "paged-alloc.h"
|
||||
+#include "llama-kv-cache.h"
|
||||
+
|
||||
+#include <vector>
|
||||
+
|
||||
+namespace paged_prefix_api {
|
||||
+
|
||||
+static llama_kv_cache * kv_of(llama_context * ctx) {
|
||||
+ // The driver targets a plain unified KV-cache model; dynamic_cast yields null
|
||||
+ // for wrapped caches (iSWA / hybrid), where cross-request cell sharing does
|
||||
+ // not apply, so the shim degrades to a safe no-op.
|
||||
+ return dynamic_cast<llama_kv_cache *>(llama_get_memory(ctx));
|
||||
+}
|
||||
+
|
||||
+int32_t share(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n) {
|
||||
+ llama_kv_cache * kv = kv_of(ctx);
|
||||
+ if (!kv || n <= 0) {
|
||||
+ return 0;
|
||||
+ }
|
||||
+ return kv->paged_prefix_share(seq, std::vector<llama_token>(tokens, tokens + n));
|
||||
+}
|
||||
+
|
||||
+void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n) {
|
||||
+ llama_kv_cache * kv = kv_of(ctx);
|
||||
+ if (!kv || n <= 0) {
|
||||
+ return;
|
||||
+ }
|
||||
+ kv->paged_prefix_commit(seq, std::vector<llama_token>(tokens, tokens + n));
|
||||
+}
|
||||
+
|
||||
+int ref_at(llama_context * ctx, llama_seq_id seq, int pos) {
|
||||
+ llama_kv_cache * kv = kv_of(ctx);
|
||||
+ if (!kv) {
|
||||
+ return -1;
|
||||
+ }
|
||||
+ return paged_alloc::ref_cnt_at((const void *) kv, /*stream=*/0, (int) seq, pos, /*block_size=*/16);
|
||||
+}
|
||||
+
|
||||
+long num_free(llama_context * ctx) {
|
||||
+ llama_kv_cache * kv = kv_of(ctx);
|
||||
+ if (!kv) {
|
||||
+ return 0;
|
||||
+ }
|
||||
+ return (long) paged_alloc::num_free((const void *) kv, /*stream=*/0);
|
||||
+}
|
||||
+
|
||||
+} // namespace paged_prefix_api
|
||||
diff --git a/src/paged-prefix-api.h b/src/paged-prefix-api.h
|
||||
new file mode 100644
|
||||
index 0000000..78a3864
|
||||
--- /dev/null
|
||||
+++ b/src/paged-prefix-api.h
|
||||
@@ -0,0 +1,27 @@
|
||||
+#pragma once
|
||||
+// Thin test/diagnostic shim over the paged cross-request prefix engine seam
|
||||
+// (patch 0007). Lets a driver that only includes the public llama.h reach the
|
||||
+// gated llama_kv_cache::paged_prefix_* methods and the paged-alloc introspection
|
||||
+// without pulling in the internal kv-cache headers. All entry points are no-ops
|
||||
+// (return 0) unless env LLAMA_KV_PAGED is set. Experimental; not a stable API.
|
||||
+
|
||||
+#include "llama.h"
|
||||
+
|
||||
+namespace paged_prefix_api {
|
||||
+
|
||||
+// Reuse the longest cached content prefix of [tokens, tokens+n) for `seq` and
|
||||
+// return the number of shared prefix tokens (the caller decodes only the
|
||||
+// suffix). 0 if nothing was shared.
|
||||
+int32_t share(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
|
||||
+
|
||||
+// Publish `seq`'s full blocks into the content cache (call after its KV is computed).
|
||||
+void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
|
||||
+
|
||||
+// Ref count of the paged block backing logical position `pos` of `seq` (unified
|
||||
+// stream 0), or -1 if unknown.
|
||||
+int ref_at(llama_context * ctx, llama_seq_id seq, int pos);
|
||||
+
|
||||
+// Number of free blocks in the unified stream-0 pool, or 0 if no manager.
|
||||
+long num_free(llama_context * ctx);
|
||||
+
|
||||
+} // namespace paged_prefix_api
|
||||
--
|
||||
2.43.0
|
||||
|
||||
Reference in New Issue
Block a user