mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 08:08:52 -04:00
feat(paged): wire cross-request prefix share into llama-server (patch 0008)
Ship patch 0008 of the paged-attention series: wire the paged cross-request prefix recompute-skip (patch 0007's paged_prefix_api::share/commit engine seam) into the llama-server continuous-batching loop so CONCURRENT requests sharing a long prefix reuse one committed copy of the prefix blocks and prefill ONLY their divergent suffix.

The server's native prompt cache only reuses a slot's own prior prompt; it does not share across distinct concurrent slots. 0008 adds that cross-slot share, fully gated behind LLAMA_KV_PAGED (stock byte-identical).

The hook lives in tools/server/server-context.cpp update_slots (the only place with the slot prompt-processing loop; grpc-server.cpp includes it), ~50 gated lines: a fresh-slot share() that advances n_past past the committed prefix, and a commit() at the prefill->generation transition. The n_past<block gate guarantees every positive share is adopted so the engine reservation matches the suffix-only batch (no stale paged blocks).

Verified in-server (32B NVFP4, CUDA, --kv-unified) with a live prefix holder: K=16/32 concurrent shared-prefix requests prefill only their ~27-token suffix instead of the ~1003-token prefix (36x fewer prefill tokens; K=16 23.9s->1.5s, K=32 57.9s->2.3s), engine logs 'shares ... prefix blocks - NOT recomputed' (ref_cnt>1), greedy output within the documented CUDA batch-shape non-determinism band.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
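The share()/commit() flow described above can be pictured with a small toy model. This is only a sketch under stated assumptions, not the engine code from patch 0007: ToyPrefixPool and suffix_to_prefill are hypothetical names, the pool is keyed by token prefixes instead of physical KV blocks, the block size of 16 is taken from the n_past < 16 gate in the diff, and eviction, sequence ids, and the server's "evaluate at least one token" rule are not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using Token = int32_t;

// Hypothetical toy pool: one entry per committed block-aligned prefix,
// mapped to how many sequences currently reference it.
struct ToyPrefixPool {
    int block_size;                              // tokens per block (assumed 16 in the patch)
    std::map<std::vector<Token>, int> ref_cnt;   // block-aligned prefix -> reference count

    explicit ToyPrefixPool(int bs) : block_size(bs) { assert(bs > 0); }

    // Largest multiple of block_size that fits in the prompt.
    int aligned(const std::vector<Token> & t) const {
        return static_cast<int>(t.size()) / block_size * block_size;
    }

    // Publish every whole-block prefix of `tokens`; the committer holds one reference.
    // Prefixes that are already committed are left untouched.
    void commit(const std::vector<Token> & tokens) {
        for (int n = block_size; n <= aligned(tokens); n += block_size) {
            ref_cnt.emplace(std::vector<Token>(tokens.begin(), tokens.begin() + n), 1);
        }
    }

    // Return the length of the longest committed block-aligned prefix of `tokens`
    // and bump the reference count of every block-aligned prefix that is reused.
    int share(const std::vector<Token> & tokens) {
        int shared = 0;
        for (int n = block_size; n <= aligned(tokens); n += block_size) {
            auto it = ref_cnt.find(std::vector<Token>(tokens.begin(), tokens.begin() + n));
            if (it == ref_cnt.end()) break;
            ++it->second;
            shared = n;
        }
        return shared;
    }
};

// Tokens a slot still has to prefill. Mirrors the gate in the diff: only a
// fresh slot (n_past below one block) attempts a share, so any positive,
// block-aligned share is strictly larger than n_past and is always adopted.
int suffix_to_prefill(ToyPrefixPool & pool, const std::vector<Token> & prompt, int n_past) {
    if (n_past < pool.block_size) {
        const int kshare = pool.share(prompt);
        if (kshare > n_past) n_past = kshare;
    }
    return static_cast<int>(prompt.size()) - n_past;
}
```

In this model, a first request with a 40-token prompt prefills all 40 tokens and then commits; a second fresh request whose first 32 tokens match adopts two 16-token blocks and prefills only the remaining 8. A slot that already has n_past >= 16 from its own cache skips the share entirely.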
@@ -0,0 +1,130 @@
From 088d58f3a0160cbc706226ac2e77ecfeae4c164a Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Mon, 22 Jun 2026 17:02:22 +0200
Subject: [PATCH] paged server cross-request prefix share (env LLAMA_KV_PAGED)
 - patch 0008

Wire the paged cross-request prefix recompute-skip (patch 0007's engine seam,
paged_prefix_api::share/commit) into the llama-server continuous-batching loop
(update_slots) so CONCURRENT requests that share a long prefix physically reuse
one committed copy of the prefix blocks and prefill only their divergent suffix.
Patch 0007 proved the engine seam correct via a standalone driver, but the server
never called it: two concurrent shared-prefix requests each recomputed the full
prefix. The server's native prompt cache only reuses a slot's OWN prior prompt
(longest-common-prefix vs slot.prompt.tokens) - it does not share across distinct
concurrent slots. 0008 adds that cross-slot share.

Mechanism (all gated behind LLAMA_KV_PAGED; default off, stock byte-identical):

* In update_slots prompt-processing, after the native n_past is computed and
  only for a FRESH slot (n_past < one block, i.e. the native cache did not
  already cover the prefix), call paged_prefix_api::share() to splice the
  longest committed cross-request prefix into this sequence (ref_cnt++ on the
  shared physical blocks) and advance n_past past it, so the batch fill computes
  ONLY the suffix. The slot's own divergent tail cells are removed first so the
  shared cells own [n_past, kshare) without colliding (the native path removes
  these later anyway). The n_past < block gate guarantees any block-aligned
  share the engine returns is strictly larger than n_past and therefore always
  adopted, so the engine's reservation always matches the suffix-only batch and
  never leaves stale blocks (which otherwise fragment the paged pool).

* When a slot finishes prefill (SLOT_STATE_DONE_PROMPT -> GENERATING, the prefix
  KV just computed), call paged_prefix_api::commit() to publish its prefix so
  concurrent/later sharers can reuse it.

The share() / commit() entry points are forward-declared (defined in libllama,
src/paged-prefix-api.cpp) to avoid pulling internal kv-cache headers into the
server translation unit.

Verified in the server (32B NVFP4, CUDA, --kv-unified): with a live sequence
holding the prefix, K=16/32 concurrent shared-prefix requests prefill only their
~27-token suffix instead of the ~1003-token prefix (36x fewer prefill tokens;
K=16 23.9s -> 1.5s, K=32 57.9s -> 2.3s), the engine logs "shares ... prefix
blocks - NOT recomputed" with ref_cnt>1, and greedy output stays within the
documented CUDA batch-shape non-determinism band (stock native prompt-caching
shows the same magnitude). Cross-request sharing requires the unified KV cache.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 tools/server/server-context.cpp | 50 +++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index da6a475..04c6361 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -15,6 +15,16 @@
 #include "mtmd.h"
 #include "mtmd-helper.h"
 
+// [paged 0008] Cross-request prefix recompute-skip shim. share()/commit() are
+// defined in libllama (src/paged-prefix-api.cpp, patch 0007) and are no-ops
+// unless env LLAMA_KV_PAGED is set. Declared here so the paged cross-slot prefix
+// cache wires into update_slots() without pulling in internal kv-cache headers.
+// Fully gated; stock (paged off) is byte-identical.
+namespace paged_prefix_api {
+    int32_t share (llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+    void    commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+}
+
 #include <algorithm>
 #include <cstddef>
 #include <cinttypes>
@@ -3007,6 +3017,37 @@ private:
     }
 }
 
+// [paged 0008] Cross-request prefix recompute-skip. The native prompt cache
+// above only reuses THIS slot's own prior prompt; when the paged KV
+// engine is active, also reuse a committed CROSS-slot prefix so
+// concurrent requests sharing a long prefix skip recompute. Gated on
+// LLAMA_KV_PAGED (paged_kv_share static); stock stays byte-identical.
+static const bool paged_kv_share = getenv("LLAMA_KV_PAGED") != nullptr;
+// Only attempt the cross-request share on a FRESH slot (the native
+// cache above did not already cover the prefix). With n_past < a
+// block, any block-aligned share the engine returns is strictly
+// larger than n_past and is therefore always adopted below - so the
+// engine's full-prompt reservation always matches the suffix-only
+// submission and never leaves stale blocks (which fragmented the
+// paged pool and crashed the server under high fan-out otherwise).
+if (paged_kv_share && n_past < 16 && slot.task->params.cache_prompt && !input_tokens.has_mtmd) {
+    const llama_tokens ptoks = input_tokens.get_text_tokens();
+    // Drop this slot's own cells beyond the natively-cached prefix before
+    // splicing the shared physical prefix in, so the shared cells can own
+    // [n_past, kshare) without colliding (the native path removes exactly
+    // these later; a no-op for a fresh slot).
+    common_context_seq_rm(ctx_tgt, slot.id, n_past, -1);
+    const int32_t kshare = paged_prefix_api::share(ctx_tgt, slot.id, ptoks.data(), (int) ptoks.size());
+    if (kshare > n_past) {
+        slot.prompt.tokens.keep_first(n_past);
+        for (int i = n_past; i < kshare; ++i) {
+            slot.prompt.tokens.push_back(ptoks[i]);
+        }
+        n_past = kshare;
+        SLT_INF(slot, "paged: reusing %d cross-request shared prefix tokens - not recomputed\n", n_past);
+    }
+}
+
 // [TAG_PROMPT_LOGITS]
 if (n_past == slot.task->n_tokens() && n_past > 0) {
     SLT_WRN(slot, "need to evaluate at least 1 token for each active slot (n_past = %d, task.n_tokens() = %d)\n", n_past, slot.task->n_tokens());
@@ -3427,6 +3468,15 @@ private:
 // prompt evaluated for next-token prediction
 slot.state = SLOT_STATE_GENERATING;
 
+// [paged 0008] Publish this slot's computed prefix so concurrent/later
+// slots can share it (no-op unless LLAMA_KV_PAGED). The prefill decode
+// for [0, n_tokens) has just run, so the prefix KV is computed.
+static const bool paged_kv_commit = getenv("LLAMA_KV_PAGED") != nullptr;
+if (paged_kv_commit && slot.task->params.cache_prompt && !slot.prompt.tokens.has_mtmd) {
+    const llama_tokens ctoks = slot.prompt.tokens.get_text_tokens();
+    paged_prefix_api::commit(ctx_tgt, slot.id, ctoks.data(), (int) ctoks.size());
+}
+
 if (slot.can_speculate()) {
     common_speculative_begin(spec.get(), slot.id, slot.prompt.tokens.get_text_tokens());
 }
--
2.43.0