Deliverables for pushing paged KV toward the real target (2xH200), since GB10 is only the test box and its "no win" result is a low-bandwidth artifact: 1. Correctness verified. test-paged-kv-e2e is greedy-equivalent to the contiguous reference (top-5 argmax ref=paged=3743, overlap 5/5). Found + fixed the blocking bug: common_fit_paged_kv_blocks over-reports free VRAM on GB10's unified device and tried 245GB of KV on a 119GB box, OOM-aborting context creation. Patch in patches/0002; durable fix (clamp to free_vram, honor --fit off) noted. 2. paged-loadgen.cpp: a dynamic-load benchmark that actually exercises where paging wins - variable prompt/gen lengths, continuous arrival, shared prefix - and reports the capacity ratio (contiguous reserve / paged peak KV). The stock tools run fixed-length all-at-once load, which is why they never show a paged win. 3. Projection to 2xH200, grounded in measured GB10 plateaus. Decode is bandwidth- bound, so the ceiling (~16k t/s for 32B) needs ~3,800 concurrent seqs, but contiguous KV fits only ~490 in HBM at 2k ctx - so KV memory IS the binding constraint on the target (unlike GB10), and paged KV's ~5-10x capacity (no over-reservation + prefix sharing) is what reaches the ceiling. The thesis holds on the target; remaining work is hardening/finishing the paged op (PR22569 was 12-13% slower and lacks prefix sharing). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
8.5 KiB
Paged KV: target-readiness (correctness, dynamic benchmark, 2xH200 projection)
Target hardware: ~2x H200 (281 GB HBM3e total, ~4.8 TB/s per GPU). The GB10 box is the test rig, not the target - and several earlier "no win" findings are GB10-specific artifacts (low bandwidth caps throughput before KV memory ever binds). This document delivers the three things needed to push paged KV toward the real target:
- Correctness of the paged path - verified (and a blocking bug found + fixed).
- A dynamic-load benchmark that actually exercises where paging wins (
paged-loadgen.cpp). - A projection of the paged-KV payoff on 2x H200, grounded in measured GB10 numbers.
1. Correctness: PASS (after fixing the auto-fit OOM)
test-paged-kv-e2e checks the paged decode path against the contiguous reference
(greedy argmax + top-5 set overlap >= 4). On the box it was previously unverified -
it aborted at context creation. Root cause found:
common_fit_paged_kv_blocks(common/common.cpp:1144) unconditionally overridesn_gpu_blocksfromggml_backend_dev_memory, which over-reports free VRAM on the GB10 integrated/unified device (it sized ~245 GB of KV on a 119 GB box ->cudaMallocOOM ->GGML_ASSERTabort inllama-kv-cache-paged.cpp:74). The test's explicitn_gpu_blocks=64was being clobbered becauseparams.fit_paramsdefaults on.
Fix (item-1 patch, applied on the box):
--- a/tests/test-paged-kv-e2e.cpp
+++ b/tests/test-paged-kv-e2e.cpp
@@ run_paged()
params.kv_paged = true;
+ params.fit_params = false; // honor explicit n_gpu_blocks; GB10 dev_memory over-reports free VRAM
params.n_gpu_blocks = 64;
Result (Qwen3-0.6B-Q8_0, GB10):
test-paged-kv-e2e: top-5 argmax match: ref=3743 paged=3743
test-paged-kv-e2e: top-5 set overlap: 5/5 (require >= 4)
test-paged-kv-e2e: PASSED
The paged op is numerically greedy-equivalent to the contiguous path. The reshape
bug from PR22569_EVAL.md (decoupled head_dim) is already applied in the checkout.
Target-readiness caveat (the durable fix, not just the test): the auto-fit itself is
brittle and must be hardened before it runs on a real serving box - even though
ggml_backend_dev_memory reports correctly on a discrete H200, the function should still
(a) early-return when !params.fit_params, (b) clamp the computed n_gpu_blocks so
n_gpu_blocks * block_bytes <= free_vram - margin using the actual KV element size, and
(c) not override an explicitly-set value. One-screen change in common_fit_paged_kv_blocks.
2. Dynamic-load benchmark - paged-loadgen.cpp
Why the existing tools show no paged win: llama-batched-bench and the stock
examples/paged/paged.cpp both run fixed-length, all-arrive-at-once, single-prompt
load. That has no over-reservation and no fragmentation, so contiguous KV is already
memory-optimal and paging has nothing to reclaim (PAGED_KV_HIGH_CONCURRENCY.md). The
paged win only exists under variable lengths + continuous arrival + shared prefixes -
the real serving regime. No tool in the tree creates it.
paged-loadgen.cpp (committed here) does, via the confirmed llama_paged_scheduler_*
API:
- shared system prefix (
LG_PREFIXtokens) prepended to every request -> exercises cross-request prefix sharing, - variable prompt length (
LG_SUFMIN..LG_SUFMAXunique suffix), - bimodal generation length (
LG_GENLONGforLG_LONGPCT% of requests, elseLG_GENSHORT) - the over-reservation driver, - continuous arrival: keeps
LG_INFLIGHTrequests live, admitting a new one each time one finishes.
It reports the load-bearing number for the buy decision - the capacity ratio:
paged peak KV = sum over live seqs of ceil(used/block)*block * kv_bytes_per_token
contiguous reserve = peak_inflight * max_ctx * kv_bytes_per_token (worst-case per slot)
CAPACITY RATIO = contiguous_reserve / paged_peak (+ prefix sharing on top)
kv_bytes_per_token = 2 * n_layer * n_head_kv * head_dim * sizeof(f16) - confirmed against
llama-kv-cache-paged.cpp (e.g. Qwen3-32B: 26481282 = 256 KiB/token).
How to run (on the target): drop into PR #22569's examples/paged/, add to its
CMakeLists next to llama-paged, build, then e.g.
LG_INFLIGHT=2048 LG_LONGPCT=15 paged-loadgen -m <model> -kvp --fit off -ngpub <N> -ncpub <M> -ngl 99.
Sweep LG_INFLIGHT to the throughput plateau and read the capacity ratio at that point.
It is written to run on the target (2x H200) where the regime exists; on GB10 it runs but
the ratio is uninteresting because throughput plateaus before memory binds (see below).
3. Projection to 2x H200 (grounded in measured GB10 numbers)
Measured on GB10 (this work)
| model | decode plateau (aggregate) | plateau concurrency | bound by |
|---|---|---|---|
| Qwen3-32B-Q4_K_M (dense) | ~540 t/s | npl ~128 | compute |
| Qwen3-1.7B-Q8_0 | ~3,200 t/s | npl ~512 | bandwidth |
Hardware ratios (per GPU, then 2x TP at ~85% scaling)
| GB10 | H200 | per-GPU x | 2x H200 (TP) x | |
|---|---|---|---|---|
| mem bandwidth | 273 GB/s | ~4.8 TB/s | 17.6 | ~30 |
| BF16 compute | ~213 TFLOP | ~989 TFLOP | 4.6 | ~8 |
| HBM | 119 GB | 141 GB | 1.18 | 2.4 (281 GB) |
Decode is bandwidth-bound, so both the aggregate ceiling and the concurrency at which it is reached scale with bandwidth (~30x on 2x H200):
- 32B-dense aggregate decode ceiling: 540 x 30 ~= 16,000 t/s, reached at ~128 x 30 ~= 3,800 concurrent sequences.
Why paged KV becomes the binding lever on 2x H200 (and didn't on GB10)
To reach that ~16k t/s ceiling you must hold ~3,800 sequences of KV. The memory math:
- 32B weights (FP8) ~= 32 GB, sharded over 2 GPUs -> ~250 GB HBM free for KV.
- 32B KV = 256 KiB/token. At an avg held context of 2,000 tokens, per seq = 512 MiB.
- Contiguous unified KV (reserve for the live set) fits ~250 GB / 512 MiB ~= ~490 sequences - 8x short of the 3,800 needed to reach the throughput ceiling.
So on 2x H200 KV memory is the binding constraint at the throughput-optimal concurrency, and contiguous KV strands most of the bandwidth (you'd run at a fraction of 16k t/s). This is the gap paged KV closes. On GB10 it never appeared because GB10's 30x-lower bandwidth caps decode at npl ~128, whose KV fits in memory trivially - the constraint order is inverted on the real target.
Magnitude of the paged win
Paging recovers concurrency two ways, both multiplicative on achievable throughput:
- No over-reservation. Contiguous must back
max_ctxper slot; paging usesceil(actual/block). For a realistic bimodal workload (most generations short, ~15% long, prompts ~512) the average held context is several-fold belowmax_ctx->paged-loadgencapacity ratio typically ~4-10x (it measures the exact number for your workload's length distribution). - Cross-request prefix sharing of shared system prompts / RAG preambles - additional,
workload-dependent (chained-hash block cache; vLLM's
block_pool.py).
Net: on 2x H200, paged KV is plausibly the difference between serving ~500 and ~3,800 concurrent 32B sequences in HBM, i.e. between a fraction of and ~all of the ~16k t/s decode ceiling. That is the datacenter payoff, and it is real on the target even though GB10 cannot exhibit it.
Honest caveats for the buy case
- These are projections from GB10 + spec ratios; the capacity multiplier depends on the
workload's context-length distribution (more variable -> bigger paged win) and TP
efficiency.
paged-loadgenmeasures it directly once you have target-GPU time. - The paged op itself still needs work: PR #22569's
ggml_paged_attnwas 12-13% slower than the mature contiguous flash-attention path at equal concurrency (PR22569_EVAL.md), lacks prefix sharing (deferred to a non-existent Phase 2), and has the fit-robustness bug above. Adopting paged KV for the target means either hardening #22569 or finishing the from-scratch P4 - the capacity win above assumes a correct, competitive op, which is the remaining engineering. - Prefill on either KV layout is compute-capped, not a paged concern.
Bottom line for the decision: paged KV is the right lever for the 2x H200 target - the GB10 "no win" result is a bandwidth artifact, not a verdict. The paged path is now correctness-verified, the benchmark to size the win exists, and the projection says the payoff is ~5-10x concurrent-tenant capacity -> several-fold higher aggregate decode on the target. The remaining work is hardening/finishing the paged op, not proving the thesis.