feat(paged): wire cross-request prefix share into llama-server (patch 0008)

Ship patch 0008 of the paged-attention series: wire the paged cross-request prefix recompute-skip (patch 0007's paged_prefix_api::share/commit engine seam) into the llama-server continuous-batching loop so CONCURRENT requests sharing a long prefix reuse one committed copy of the prefix blocks and prefill ONLY their divergent suffix.

The server's native prompt cache only reuses a slot's own prior prompt; it does not share across distinct concurrent slots. 0008 adds that cross-slot share, fully gated behind LLAMA_KV_PAGED (stock byte-identical).

The hook lives in tools/server/server-context.cpp update_slots (the only place with the slot prompt-processing loop; grpc-server.cpp includes it), ~50 gated lines: a fresh-slot share() that advances n_past past the committed prefix, and a commit() at the prefill->generation transition. The n_past<block gate guarantees every positive share is adopted so the engine reservation matches the suffix-only batch (no stale paged blocks).

Verified in-server (32B NVFP4, CUDA, --kv-unified) with a live prefix holder: K=16/32 concurrent shared-prefix requests prefill only their ~27-token suffix instead of the ~1003-token prefix (36x fewer prefill tokens; K=16 23.9s->1.5s, K=32 57.9s->2.3s), engine logs 'shares ... prefix blocks - NOT recomputed' (ref_cnt>1), greedy output within the documented CUDA batch-shape non-determinism band.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-23 08:08:52 -04:00 · 2026-06-22 15:03:16 +00:00
parent 52f0f7b8cf
commit 80e0c1ac6b
1 changed files with 130 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch
@@ -0,0 +1,130 @@
+From 088d58f3a0160cbc706226ac2e77ecfeae4c164a Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Mon, 22 Jun 2026 17:02:22 +0200
+Subject: [PATCH] paged server cross-request prefix share (env LLAMA_KV_PAGED)
+ - patch 0008
+
+Wire the paged cross-request prefix recompute-skip (patch 0007's engine seam,
+paged_prefix_api::share/commit) into the llama-server continuous-batching loop
+(update_slots) so CONCURRENT requests that share a long prefix physically reuse
+one committed copy of the prefix blocks and prefill only their divergent suffix.
+Patch 0007 proved the engine seam correct via a standalone driver, but the server
+never called it: two concurrent shared-prefix requests each recomputed the full
+prefix. The server's native prompt cache only reuses a slot's OWN prior prompt
+(longest-common-prefix vs slot.prompt.tokens) - it does not share across distinct
+concurrent slots. 0008 adds that cross-slot share.
+
+Mechanism (all gated behind LLAMA_KV_PAGED; default off, stock byte-identical):
+
+  * In update_slots prompt-processing, after the native n_past is computed and
+    only for a FRESH slot (n_past < one block, i.e. the native cache did not
+    already cover the prefix), call paged_prefix_api::share() to splice the
+    longest committed cross-request prefix into this sequence (ref_cnt++ on the
+    shared physical blocks) and advance n_past past it, so the batch fill computes
+    ONLY the suffix. The slot's own divergent tail cells are removed first so the
+    shared cells own [n_past, kshare) without colliding (the native path removes
+    these later anyway). The n_past < block gate guarantees any block-aligned
+    share the engine returns is strictly larger than n_past and therefore always
+    adopted, so the engine's reservation always matches the suffix-only batch and
+    never leaves stale blocks (which otherwise fragment the paged pool).
+
+  * When a slot finishes prefill (SLOT_STATE_DONE_PROMPT -> GENERATING, the prefix
+    KV just computed), call paged_prefix_api::commit() to publish its prefix so
+    concurrent/later sharers can reuse it.
+
+The share() / commit() entry points are forward-declared (defined in libllama,
+src/paged-prefix-api.cpp) to avoid pulling internal kv-cache headers into the
+server translation unit.
+
+Verified in the server (32B NVFP4, CUDA, --kv-unified): with a live sequence
+holding the prefix, K=16/32 concurrent shared-prefix requests prefill only their
+~27-token suffix instead of the ~1003-token prefix (36x fewer prefill tokens;
+K=16 23.9s -> 1.5s, K=32 57.9s -> 2.3s), the engine logs "shares ... prefix
+blocks - NOT recomputed" with ref_cnt>1, and greedy output stays within the
+documented CUDA batch-shape non-determinism band (stock native prompt-caching
+shows the same magnitude). Cross-request sharing requires the unified KV cache.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ tools/server/server-context.cpp | 50 +++++++++++++++++++++++++++++++++
+ 1 file changed, 50 insertions(+)
+
+diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
+index da6a475..04c6361 100644
+--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
+@@ -15,6 +15,16 @@
+ #include "mtmd.h"
+ #include "mtmd-helper.h"
+ 
++// [paged 0008] Cross-request prefix recompute-skip shim. share()/commit() are
++// defined in libllama (src/paged-prefix-api.cpp, patch 0007) and are no-ops
++// unless env LLAMA_KV_PAGED is set. Declared here so the paged cross-slot prefix
++// cache wires into update_slots() without pulling in internal kv-cache headers.
++// Fully gated; stock (paged off) is byte-identical.
++namespace paged_prefix_api {
++    int32_t share (llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
++    void    commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
++}
++
+ #include <algorithm>
+ #include <cstddef>
+ #include <cinttypes>
+@@ -3007,6 +3017,37 @@ private:
+                             }
+                         }
+ 
++                        // [paged 0008] Cross-request prefix recompute-skip. The native prompt cache
++                        // above only reuses THIS slot's own prior prompt; when the paged KV
++                        // engine is active, also reuse a committed CROSS-slot prefix so
++                        // concurrent requests sharing a long prefix skip recompute. Gated on
++                        // LLAMA_KV_PAGED (paged_kv_share static); stock stays byte-identical.
++                        static const bool paged_kv_share = getenv("LLAMA_KV_PAGED") != nullptr;
++                        // Only attempt the cross-request share on a FRESH slot (the native
++                        // cache above did not already cover the prefix). With n_past < a
++                        // block, any block-aligned share the engine returns is strictly
++                        // larger than n_past and is therefore always adopted below - so the
++                        // engine's full-prompt reservation always matches the suffix-only
++                        // submission and never leaves stale blocks (which fragmented the
++                        // paged pool and crashed the server under high fan-out otherwise).
++                        if (paged_kv_share && n_past < 16 && slot.task->params.cache_prompt && !input_tokens.has_mtmd) {
++                            const llama_tokens ptoks = input_tokens.get_text_tokens();
++                            // Drop this slot's own cells beyond the natively-cached prefix before
++                            // splicing the shared physical prefix in, so the shared cells can own
++                            // [n_past, kshare) without colliding (the native path removes exactly
++                            // these later; a no-op for a fresh slot).
++                            common_context_seq_rm(ctx_tgt, slot.id, n_past, -1);
++                            const int32_t kshare = paged_prefix_api::share(ctx_tgt, slot.id, ptoks.data(), (int) ptoks.size());
++                            if (kshare > n_past) {
++                                slot.prompt.tokens.keep_first(n_past);
++                                for (int i = n_past; i < kshare; ++i) {
++                                    slot.prompt.tokens.push_back(ptoks[i]);
++                                }
++                                n_past = kshare;
++                                SLT_INF(slot, "paged: reusing %d cross-request shared prefix tokens - not recomputed\n", n_past);
++                            }
++                        }
++
+                         // [TAG_PROMPT_LOGITS]
+                         if (n_past == slot.task->n_tokens() && n_past > 0) {
+                             SLT_WRN(slot, "need to evaluate at least 1 token for each active slot (n_past = %d, task.n_tokens() = %d)\n", n_past, slot.task->n_tokens());
+@@ -3427,6 +3468,15 @@ private:
+                     // prompt evaluated for next-token prediction
+                     slot.state = SLOT_STATE_GENERATING;
+ 
++                    // [paged 0008] Publish this slot's computed prefix so concurrent/later
++                    // slots can share it (no-op unless LLAMA_KV_PAGED). The prefill decode
++                    // for [0, n_tokens) has just run, so the prefix KV is computed.
++                    static const bool paged_kv_commit = getenv("LLAMA_KV_PAGED") != nullptr;
++                    if (paged_kv_commit && slot.task->params.cache_prompt && !slot.prompt.tokens.has_mtmd) {
++                        const llama_tokens ctoks = slot.prompt.tokens.get_text_tokens();
++                        paged_prefix_api::commit(ctx_tgt, slot.id, ctoks.data(), (int) ctoks.size());
++                    }
++
+                     if (slot.can_speculate()) {
+                         common_speculative_begin(spec.get(), slot.id, slot.prompt.tokens.get_text_tokens());
+                     }
+-- 
+2.43.0
+
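
Illustrative sketch (not part of the patch): the share/commit contract and the n_past < block gate described in the patch message can be modelled with a toy in-memory cache. Everything here is an assumption made for illustration: toy_prefix_cache, adopt_shared_prefix and make_prompt are hypothetical names, the map-of-prefixes storage is a stand-in for the engine's physical blocks, and the token counts are made up. It only shows why a block-aligned share is either 0 or at least one block, and therefore always larger than an n_past that is below one block.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using token = int32_t;

// Hypothetical toy model of a cross-request prefix cache. It is NOT the
// paged_prefix_api implementation; it only mimics the contract described above.
struct toy_prefix_cache {
    int32_t block;                                    // tokens per block (the patch's gate uses 16)
    std::map<std::vector<token>, int> committed;      // block-aligned prefix -> reference count

    explicit toy_prefix_cache(int32_t block_size) : block(block_size) {
        assert(block_size > 0);
    }

    // Publish every whole-block prefix of `toks`; existing entries keep their count.
    void commit(const std::vector<token> & toks) {
        const int32_t size = static_cast<int32_t>(toks.size());
        for (int32_t n = block; n <= size; n += block) {
            committed.emplace(std::vector<token>(toks.begin(), toks.begin() + n), 1);
        }
    }

    // Return the length of the longest committed block-aligned prefix of `toks`
    // (0 if none) and bump the reference count of the matched entry.
    int32_t share(const std::vector<token> & toks) {
        const int32_t whole = static_cast<int32_t>(toks.size()) / block * block;
        for (int32_t n = whole; n >= block; n -= block) {
            auto it = committed.find(std::vector<token>(toks.begin(), toks.begin() + n));
            if (it != committed.end()) {
                ++it->second;
                return n;
            }
        }
        return 0;
    }
};

// Mirrors the gate from the patch: only ask for a share when the slot's own
// cache covers less than one block. The result of share() is then either 0 or
// >= block, so any positive share exceeds n_past and is adopted.
int32_t adopt_shared_prefix(toy_prefix_cache & cache, const std::vector<token> & prompt, int32_t n_past) {
    if (n_past >= cache.block) {
        return n_past;                                // not a fresh slot: skip the share entirely
    }
    const int32_t kshare = cache.share(prompt);
    return kshare > n_past ? kshare : n_past;
}

// Hypothetical helper: `shared_len` common tokens followed by a per-request suffix.
std::vector<token> make_prompt(int32_t shared_len, int32_t suffix_len, token seed) {
    std::vector<token> out;
    for (int32_t i = 0; i < shared_len; ++i) out.push_back(i);
    for (int32_t i = 0; i < suffix_len; ++i) out.push_back(100000 + seed * 1000 + i);
    return out;
}
```

In this toy, a holder that commits a 75-token prompt (70 shared tokens plus a 5-token suffix) with a block size of 16 publishes prefixes of 16, 32, 48 and 64 tokens. A second request with the same 70 shared tokens and a different suffix then gets n_past = 64 from adopt_shared_prefix and only has 11 tokens left to prefill.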