LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 08:08:52 -04:00

Files

Ettore Di Giacinto ecffd4b097 feat(llama-cpp/paged): engine-level prefix recompute-skip (patch 0007)

Mirror patch 0007 of the paged-attention series into the vendored llama.cpp
patch set. It wires the host-side cross-request prefix cache (0006) into the
engine so that a new sequence physically shares the cached prefix blocks
(ref-counted) and decodes only the divergent suffix; the shared prefix KV is
never recomputed.
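
To illustrate the idea of sharing a cached prefix and decoding only the suffix, here is a minimal sketch in Python. It is not the patch's code: the `PrefixCache` class, its methods, and the block size of 16 are assumptions made for illustration, and the sketch reuses whole blocks only, which may differ from how the patch handles a mid-block match.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 16  # assumed block size, not taken from the patch


class PrefixCache:
    """Hypothetical host-side cache: full token prefix -> physical block id."""

    def __init__(self) -> None:
        self._blocks: Dict[Tuple[int, ...], int] = {}
        self._next_id = 0

    def commit(self, tokens: List[int]) -> List[int]:
        """Register every full block of `tokens` and return their block ids."""
        ids = []
        full = len(tokens) - len(tokens) % BLOCK_SIZE
        for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE):
            # The key is the whole prefix up to this block, so a block is
            # only reused when everything before it is identical as well.
            key = tuple(tokens[:end])
            if key not in self._blocks:
                self._blocks[key] = self._next_id
                self._next_id += 1
            ids.append(self._blocks[key])
        return ids

    def match(self, tokens: List[int]) -> Tuple[List[int], int]:
        """Return (shared block ids, number of prefix tokens to skip)."""
        ids = []
        end = BLOCK_SIZE
        while end <= len(tokens) and tuple(tokens[:end]) in self._blocks:
            ids.append(self._blocks[tuple(tokens[:end])])
            end += BLOCK_SIZE
        return ids, len(ids) * BLOCK_SIZE


cache = PrefixCache()
cache.commit(list(range(40)))             # first request: 2 full blocks cached

prompt = list(range(32)) + [999, 1000, 1001]
shared_ids, skipped = cache.match(prompt)
suffix = prompt[skipped:]                 # only these tokens need decoding
```

In this sketch the second request reuses both cached blocks and would decode just the three-token suffix.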

paged-alloc becomes one persistent, caching PagedKVManager per (kv-cache,
stream), keyed by the real seq_id, with per-sequence ref-counted free. Two gated
llama_kv_cache methods (paged_prefix_share / paged_prefix_commit) mark the
seq-membership of the shared physical cells so that the engine attention mask
covers the already-computed prefix, and find_slot anchors placement on each
sequence's ubatch.pos. The only existing core files touched are
llama-kv-cache.{cpp,h} (+71 -3); everything else is additive vendored units.
Gated behind LLAMA_KV_PAGED, default off; the stock path stays byte-identical.
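
The role of seq-membership in the attention mask can be sketched as follows. This is a conceptual Python model, not the signatures of paged_prefix_share or paged_prefix_commit: `Cell`, `share_prefix`, and `visible` are hypothetical names, and the model assumes each KV cell carries a position and a set of sequence ids, with a token attending only to cells of its own sequence at positions up to its own.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Cell:
    """Hypothetical KV cell: token position plus the sequences that own it."""
    pos: int = -1                       # -1 marks an empty cell
    seq_ids: Set[int] = field(default_factory=set)


def share_prefix(cells: List[Cell], src: int, dst: int, n_tokens: int) -> int:
    """Add `dst` to the cells of `src` holding positions [0, n_tokens)."""
    count = 0
    for c in cells:
        if src in c.seq_ids and 0 <= c.pos < n_tokens:
            c.seq_ids.add(dst)
            count += 1
    return count


def visible(cells: List[Cell], seq_id: int, pos: int) -> List[int]:
    """Cell indices a token of `seq_id` at `pos` may attend to (causal)."""
    return [i for i, c in enumerate(cells)
            if seq_id in c.seq_ids and c.pos <= pos]


# Sequence 0 has four computed tokens; two cells are still empty.
cells = [Cell(p, {0}) for p in range(4)] + [Cell(), Cell()]

n_shared = share_prefix(cells, src=0, dst=1, n_tokens=3)

# Sequence 1 diverges at position 3, so its suffix token gets its own cell.
cells[4] = Cell(3, {1})
```

After the share, a token of sequence 1 at position 3 sees the three shared prefix cells plus its own suffix cell, without those prefix cells having been recomputed for it.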

Verified on Qwen3-0.6B-Q8_0 (CPU, unified cache): greedy output is
byte-identical to decoding from scratch, both at a block boundary and mid-block;
prefill computes only the suffix (32 prefix tokens skipped); and ref-counted
free is safe (ref count 2->1 when one sharer is removed, survivor intact and
re-shareable, pool restored once all are freed). The 0004 serving gate stays
byte-identical, stock vs paged, in unified and non-unified mode.
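
The ref-counted free scenario described above can be reproduced with a small model. This is a sketch under assumptions, not the PagedKVManager implementation: `BlockPool` and its methods are hypothetical names, and the pool size of 8 is arbitrary.

```python
from typing import Dict, List


class BlockPool:
    """Hypothetical pool of physical blocks with per-block ref counts."""

    def __init__(self, n_blocks: int) -> None:
        self.free: List[int] = list(range(n_blocks))
        self.refs: Dict[int, int] = {}          # block id -> ref count
        self.seqs: Dict[int, List[int]] = {}    # seq_id -> owned block ids

    def alloc(self, seq_id: int, n: int) -> List[int]:
        """Give `seq_id` n fresh blocks, each starting at ref count 1."""
        if n > len(self.free):
            raise MemoryError("block pool exhausted")
        blocks = [self.free.pop() for _ in range(n)]
        for b in blocks:
            self.refs[b] = 1
        self.seqs.setdefault(seq_id, []).extend(blocks)
        return blocks

    def share(self, src: int, dst: int, n: int) -> List[int]:
        """Let `dst` reference the first n blocks of `src` (no copy)."""
        shared = self.seqs[src][:n]
        for b in shared:
            self.refs[b] += 1
        self.seqs.setdefault(dst, []).extend(shared)
        return shared

    def remove(self, seq_id: int) -> None:
        """Drop a sequence; a block returns to the pool only at ref count 0."""
        for b in self.seqs.pop(seq_id, []):
            self.refs[b] -= 1
            if self.refs[b] == 0:
                del self.refs[b]
                self.free.append(b)


pool = BlockPool(8)
prefix = pool.alloc(0, 2)         # seq 0 computes a 2-block prefix
pool.share(0, 1, 2)               # seq 1 shares it: ref count 1 -> 2
pool.alloc(1, 1)                  # seq 1's private suffix block
pool.remove(0)                    # one sharer leaves: ref count 2 -> 1
refs_after_removal = [pool.refs[b] for b in prefix]
pool.share(1, 2, 2)               # the survivor's prefix is still shareable
pool.remove(1)
pool.remove(2)                    # all sharers gone: pool fully restored
```

Removing one sharer only decrements the counts, so the survivor's blocks stay valid; the blocks return to the free list only when the last reference is dropped.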

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-22 10:47:10 +00:00

ds4

chore: ⬆️ Update antirez/ds4 to 80ebbc396aee40eedc1d829222f3362d10fa4c6c (#10378)

2026-06-18 00:32:13 +02:00

grpc

fix: speedup git submodule update with --single-branch (#2847)

2024-07-13 22:32:25 +02:00

ik-llama-cpp

chore: ⬆️ Update ikawrakow/ik_llama.cpp to b3dfb7858cfcb9166e92f366e5af87f19ebc94be (#10395)

2026-06-19 00:03:37 +02:00

llama-cpp

feat(llama-cpp/paged): engine-level prefix recompute-skip (patch 0007)