LocalAI/backend/cpp
Ettore Di Giacinto 4968cd8a94 paged-attn 0004: on-demand KV block allocation
Wire the paged placement in find_slot through the vendored PagedKVManager
(0001) instead of a fixed full-pool permutation. Blocks are popped from a free
pool on demand as a sequence crosses block boundaries, and returned on sequence
end (full seq_rm / clear). One manager per (kv-cache, stream); all state lives
in a new src/paged-alloc unit keyed by a static registry, so the core kv-cache
struct is untouched (find_slot/clear/seq_rm gain only a gated call). Default
off; stock path byte-identical.
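
The on-demand scheme described above can be sketched as follows. This is a minimal illustration, not the vendored PagedKVManager: the class name BlockPoolSketch, its methods, and the per-sequence block table are hypothetical and chosen only to show the pop-on-boundary / return-on-end behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

// Hypothetical sketch; names do not mirror the vendored PagedKVManager.
class BlockPoolSketch {
public:
    BlockPoolSketch(uint32_t n_blocks, uint32_t block_size)
        : block_size_(block_size) {
        assert(block_size > 0);
        // Fill the free pool in reverse so that block 0 is popped first.
        for (uint32_t i = n_blocks; i > 0; --i) free_.push_back(i - 1);
    }

    // Make sure `seq` owns enough blocks to hold logical position `pos`,
    // popping blocks from the free pool only when a boundary is crossed.
    // Returns the physical cell index, or nullopt if the pool is exhausted.
    std::optional<uint32_t> place(int seq, uint32_t pos) {
        auto & blocks = table_[seq];
        const uint32_t need = pos / block_size_ + 1;
        while (blocks.size() < need) {
            if (free_.empty()) return std::nullopt;
            blocks.push_back(free_.back());
            free_.pop_back();
        }
        return blocks[pos / block_size_] * block_size_ + pos % block_size_;
    }

    // Return every block owned by `seq` to the free pool
    // (the analogue of a full seq_rm / clear).
    void release(int seq) {
        auto it = table_.find(seq);
        if (it == table_.end()) return;
        for (uint32_t b : it->second) free_.push_back(b);
        table_.erase(it);
    }

    std::size_t n_blocks(int seq) const {
        auto it = table_.find(seq);
        return it == table_.end() ? 0 : it->second.size();
    }
    std::size_t n_free() const { return free_.size(); }

private:
    uint32_t block_size_;
    std::vector<uint32_t> free_;                  // free physical blocks
    std::map<int, std::vector<uint32_t>> table_;  // seq -> owned blocks
};
```

In this sketch a single pool serves several sequences; the commit instead keeps one manager per (kv-cache, stream), which would correspond to one such object per stream looked up through the registry.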

Gate 0 (CPU, Qwen3-0.6B-Q8_0), LLAMA_KV_PAGED=1 token-identical vs stock:
- single-stream llama-simple, 48 tok: identical
- multi-stream driver, 3 seqs x 40 tok: identical
Demand-driven allocation confirmed via debug log: block count grows
0->1->2->3->4, with boundary crossings at logical positions 16/32/48 (peak 4
blocks vs 16-block budget), independently per stream.
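
The growth pattern in the log is consistent with simple block arithmetic. The sketch below assumes a block size of 16 positions (inferred from the 16/32/48 boundaries, not stated in the commit) and that the first block is taken at position 0; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Blocks required to hold logical positions 0..pos inclusive.
constexpr uint32_t blocks_needed(uint32_t pos, uint32_t block_size) {
    return pos / block_size + 1;
}

// Logical positions at which the block count increases while a sequence
// grows from position 0 up to (but excluding) n_pos.
std::vector<uint32_t> growth_positions(uint32_t n_pos, uint32_t block_size) {
    assert(block_size > 0);
    std::vector<uint32_t> out;
    uint32_t held = 0;
    for (uint32_t pos = 0; pos < n_pos; ++pos) {
        const uint32_t need = blocks_needed(pos, block_size);
        if (need > held) {
            out.push_back(pos);  // a new block is popped at this position
            held = need;
        }
    }
    return out;
}
```

Under these assumptions, a sequence that reaches any position in 48..63 holds 4 blocks, matching the reported peak against the 16-block budget.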

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 08:50:57 +00:00