Deliverables for pushing paged KV toward the real target (2xH200). GB10 is only
the test box, and its "no win" result is a low-bandwidth artifact:
1. Correctness verified. test-paged-kv-e2e is greedy-equivalent to the contiguous
reference (top-5 argmax ref=paged=3743, overlap 5/5). Found and fixed the
blocking bug: common_fit_paged_kv_blocks over-reports free VRAM on GB10's
unified-memory device, so it tried to allocate 245GB of KV on a 119GB box and
context creation aborted with OOM. Patch in patches/0002; the durable fix
(clamp to free_vram, honor --fit off) is noted.
2. paged-loadgen.cpp: a dynamic-load benchmark that exercises the conditions
where paging wins (variable prompt/gen lengths, continuous arrival, shared
prefix) and reports the capacity ratio (contiguous reserve / paged peak KV).
The stock tools run fixed-length, all-at-once load, which is why they never
show a paged win.
3. Projection to 2xH200, grounded in measured GB10 plateaus. Decode is
bandwidth-bound, so reaching the ceiling (~16k t/s for 32B) needs ~3,800
concurrent seqs, but contiguous KV fits only ~490 in HBM at 2k ctx. KV memory
IS therefore the binding constraint on the target (unlike GB10), and paged KV's
~5-10x capacity (no over-reservation + prefix sharing) is what reaches the
ceiling; 3,800/490 is ~7.8x, inside that range. The thesis holds on the target;
remaining work is hardening/finishing the paged op (PR22569 was 12-13% slower
and lacks prefix sharing).
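
The capacity ratio and the projection arithmetic above can be sketched as
follows. This is a minimal illustration, not the code in paged-loadgen.cpp: the
function names are hypothetical, the 16-cell block size is assumed from the cell
ranges printed by the debug output, the sequence lengths are made up, and the
ratio is taken over a single snapshot without prefix sharing (the benchmark
reports it against the paged peak).

```python
import math

BLOCK = 16  # cells per block; assumed from the 16-cell ranges in the debug output


def capacity_ratio(seq_lens, max_ctx, block=BLOCK):
    """Contiguous reserve / paged KV for one snapshot of live sequences.

    Contiguous KV reserves max_ctx cells per sequence up front; paged KV
    holds only the blocks each sequence has actually filled.
    """
    contiguous_reserve = len(seq_lens) * max_ctx
    paged_cells = sum(math.ceil(n / block) * block for n in seq_lens)
    return contiguous_reserve / paged_cells


def required_gain(target_seqs, contiguous_seqs):
    """Capacity multiple needed to go from contiguous_seqs to target_seqs."""
    return target_seqs / contiguous_seqs


# Figures quoted in the projection: ~3,800 seqs needed, ~490 fit contiguously.
gain = required_gain(3800, 490)            # about 7.8
reach_low, reach_high = 490 * 5, 490 * 10  # 2450 and 4900 seqs at 5x and 10x

# Hypothetical snapshot: four live sequences of mixed length at 2k ctx.
ratio = capacity_ratio([100, 300, 500, 1148], max_ctx=2048)

print(f"required gain: {gain:.2f}x, reach at 5x-10x: {reach_low}-{reach_high}")
print(f"snapshot capacity ratio: {ratio:.2f}x")
```

On these figures the ceiling is reached in the upper part of the 5-10x range:
at 5x the same arithmetic gives about 2,450 concurrent seqs, at 10x about 4,900.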
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Wire paged, non-contiguous, fixed-size BLOCK placement into the real
llama.cpp KV cache (find_slot), behind the LLAMA_KV_PAGED environment variable,
and validate Gate 0 on a real GGUF: Qwen3-0.6B greedy generation is
TOKEN-IDENTICAL to the contiguous cache while its KV is physically scattered
across permuted blocks (cells 0-15, 144-159, 32-47, ...). LLAMA_KV_PAGED_DEBUG
output confirms the non-contiguous layout, so this is not a silent fallback.
This settles the correctness premise of paged attention IN THE MODEL (not
just at the ggml-op level): attention is invariant to physical KV placement
because reads mask on per-cell pos/seq metadata, not on the cell index. The
patch lives at patches/0001-paged-kv-block-placement.patch (against llama.cpp
0253fb21f).
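
The invariance claim can be illustrated with a toy single-head attention in
plain Python. This is a sketch under stated assumptions, not the llama.cpp
code path: attend and place_paged are hypothetical helpers, the block size of
16 and the block order 0, 9, 2 are read off the cell ranges quoted above, and
the keys and values are random. It checks numerical agreement only, whereas the
real test compares generated tokens.

```python
import math
import random

BLOCK = 16  # assumed from the cell ranges 0-15, 144-159, 32-47


def attend(query, cells, q_pos):
    """Causal softmax attention over (pos, key, value) cells.

    Cells are visited in physical order, but masking uses only the stored
    pos, so the physical index never enters the result.
    """
    visible = [(k, v) for pos, k, v in cells if pos is not None and pos <= q_pos]
    scores = [sum(q * x for q, x in zip(query, k)) for k, _ in visible]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(visible[0][1])
    return [sum(w * v[d] for w, (_, v) in zip(weights, visible)) / total
            for d in range(dim)]


def place_paged(tokens, block_order, n_cells):
    """Scatter logical tokens into physical cells, one block at a time."""
    cells = [(None, None, None)] * n_cells  # empty cells carry no position
    for i, (k, v) in enumerate(tokens):
        phys = block_order[i // BLOCK] * BLOCK + i % BLOCK
        cells[phys] = (i, k, v)
    return cells


rng = random.Random(0)
dim = 4
tokens = [([rng.gauss(0, 1) for _ in range(dim)],
           [rng.gauss(0, 1) for _ in range(dim)]) for _ in range(40)]
query = [rng.gauss(0, 1) for _ in range(dim)]

contiguous = [(i, k, v) for i, (k, v) in enumerate(tokens)]
paged = place_paged(tokens, block_order=[0, 9, 2], n_cells=160)

out_ref = attend(query, contiguous, q_pos=39)
out_paged = attend(query, paged, q_pos=39)
same = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
           for a, b in zip(out_ref, out_paged))
print("placement-invariant:", same)
```

The two outputs agree up to floating-point summation order, because softmax
attention sums over the visible cells and does not depend on where they sit.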
Scope: storage/placement layer, single sequence. Remaining (P4): the
gather-read compute path (attend only to a seq's own blocks), which delivers
the throughput win, and the multi-sequence driver. README updated with repro +
status.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>