Files
LocalAI/backend/cpp/llama-cpp
Ettore Di Giacinto bbc84a9889 feat(paged): Gate 0 in-model - token-identical generation with paged KV placement
Wire paged, non-contiguous fixed-size BLOCK placement into the real
llama.cpp KV cache (find_slot), behind env LLAMA_KV_PAGED, and validate
Gate 0 on a real GGUF: Qwen3-0.6B greedy generation is TOKEN-IDENTICAL to
the contiguous cache while its KV is physically scattered across permuted
blocks (cells 0-15, 144-159, 32-47, ...). Proven non-contiguous via
LLAMA_KV_PAGED_DEBUG, not a silent fallback.

This retires the correctness premise of paged attention IN THE MODEL (not
just at the ggml-op level): attention is invariant to physical KV placement,
because reads use per-cell pos/seq metadata for masking. The patch lives at
patches/0001-paged-kv-block-placement.patch (against llama.cpp 0253fb21f).

Scope: storage/placement layer, single sequence. Remaining (P4): the
gather-read compute path (attend only a seq's own blocks) for the throughput
win, and the multi-sequence driver. README updated with repro + status.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 08:51:42 +00:00
..
2026-04-12 08:51:30 +02:00