Deliverables for pushing paged KV toward the real target (2xH200), since GB10 is
only the test box and its "no win" result is a low-bandwidth artifact:
1. Correctness verified. test-paged-kv-e2e is greedy-equivalent to the contiguous
reference (top-5 argmax ref=paged=3743, overlap 5/5). Found + fixed the blocking
bug: common_fit_paged_kv_blocks over-reports free VRAM on GB10's unified device
and tried 245GB of KV on a 119GB box, OOM-aborting context creation. Patch in
patches/0002; durable fix (clamp to free_vram, honor --fit off) noted.
2. paged-loadgen.cpp: a dynamic-load benchmark that actually exercises where paging
wins - variable prompt/gen lengths, continuous arrival, shared prefix - and
reports the capacity ratio (contiguous reserve / paged peak KV). The stock tools
run fixed-length all-at-once load, which is why they never show a paged win.
3. Projection to 2xH200, grounded in measured GB10 plateaus. Decode is bandwidth-
bound, so the ceiling (~16k t/s for 32B) needs ~3,800 concurrent seqs, but
contiguous KV fits only ~490 in HBM at 2k ctx - so KV memory IS the binding
constraint on the target (unlike GB10), and paged KV's ~5-10x capacity (no
over-reservation + prefix sharing) is what reaches the ceiling. The thesis holds
on the target; remaining work is hardening/finishing the paged op (PR22569 was
12-13% slower and lacks prefix sharing).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>