Files
LocalAI/backend/cpp/llama-cpp/patches/paged
Ettore Di Giacinto 80e0c1ac6b feat(paged): wire cross-request prefix share into llama-server (patch 0008)
Ship patch 0008 of the paged-attention series: wire the paged cross-request
prefix recompute-skip (patch 0007's paged_prefix_api::share/commit engine seam)
into the llama-server continuous-batching loop so CONCURRENT requests sharing a
long prefix reuse one committed copy of the prefix blocks and prefill ONLY their
divergent suffix. The server's native prompt cache only reuses a slot's own prior
prompt; it does not share across distinct concurrent slots. 0008 adds that
cross-slot share, fully gated behind LLAMA_KV_PAGED (stock byte-identical).

The hook lives in tools/server/server-context.cpp update_slots (the only place
with the slot prompt-processing loop; grpc-server.cpp includes it), ~50 gated
lines: a fresh-slot share() that advances n_past past the committed prefix, and a
commit() at the prefill->generation transition. The n_past<block gate guarantees
every positive share is adopted so the engine reservation matches the suffix-only
batch (no stale paged blocks).

Verified in-server (32B NVFP4, CUDA, --kv-unified) with a live prefix holder:
K=16/32 concurrent shared-prefix requests prefill only their ~27-token suffix
instead of the ~1003-token prefix (36x fewer prefill tokens; K=16 23.9s->1.5s,
K=32 57.9s->2.3s), engine logs 'shares ... prefix blocks - NOT recomputed'
(ref_cnt>1), greedy output within the documented CUDA batch-shape
non-determinism band.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 15:03:16 +00:00
..