# Paged KV: target-readiness (correctness, dynamic benchmark, 2xH200 projection) Target hardware: **~2x H200** (281 GB HBM3e total, ~4.8 TB/s per GPU). The GB10 box is the *test* rig, not the target - and several earlier "no win" findings are GB10-specific artifacts (low bandwidth caps throughput before KV memory ever binds). This document delivers the three things needed to push paged KV toward the real target: 1. **Correctness** of the paged path - verified (and a blocking bug found + fixed). 2. **A dynamic-load benchmark** that actually exercises where paging wins (`paged-loadgen.cpp`). 3. **A projection** of the paged-KV payoff on 2x H200, grounded in measured GB10 numbers. --- ## 1. Correctness: PASS (after fixing the auto-fit OOM) `test-paged-kv-e2e` checks the paged decode path against the contiguous reference (greedy argmax + top-5 set overlap >= 4). On the box it was previously **unverified** - it aborted at context creation. Root cause found: - `common_fit_paged_kv_blocks` (`common/common.cpp:1144`) **unconditionally overrides** `n_gpu_blocks` from `ggml_backend_dev_memory`, which **over-reports free VRAM on the GB10 integrated/unified device** (it sized **~245 GB of KV on a 119 GB box** -> `cudaMalloc` OOM -> `GGML_ASSERT` abort in `llama-kv-cache-paged.cpp:74`). The test's explicit `n_gpu_blocks=64` was being clobbered because `params.fit_params` defaults on. **Fix (item-1 patch, applied on the box):** ```diff --- a/tests/test-paged-kv-e2e.cpp +++ b/tests/test-paged-kv-e2e.cpp @@ run_paged() params.kv_paged = true; + params.fit_params = false; // honor explicit n_gpu_blocks; GB10 dev_memory over-reports free VRAM params.n_gpu_blocks = 64; ``` **Result (Qwen3-0.6B-Q8_0, GB10):** ``` test-paged-kv-e2e: top-5 argmax match: ref=3743 paged=3743 test-paged-kv-e2e: top-5 set overlap: 5/5 (require >= 4) test-paged-kv-e2e: PASSED ``` The paged op is **numerically greedy-equivalent to the contiguous path**. The reshape bug from `PR22569_EVAL.md` (decoupled head_dim) is already applied in the checkout. **Target-readiness caveat (the durable fix, not just the test):** the auto-fit itself is brittle and must be hardened before it runs on a real serving box - even though `ggml_backend_dev_memory` reports correctly on a discrete H200, the function should still (a) early-return when `!params.fit_params`, (b) **clamp** the computed `n_gpu_blocks` so `n_gpu_blocks * block_bytes <= free_vram - margin` using the *actual* KV element size, and (c) not override an explicitly-set value. One-screen change in `common_fit_paged_kv_blocks`. --- ## 2. Dynamic-load benchmark - `paged-loadgen.cpp` **Why the existing tools show no paged win:** `llama-batched-bench` and the stock `examples/paged/paged.cpp` both run **fixed-length, all-arrive-at-once, single-prompt** load. That has no over-reservation and no fragmentation, so contiguous KV is already memory-optimal and paging has nothing to reclaim (`PAGED_KV_HIGH_CONCURRENCY.md`). The paged win only exists under **variable lengths + continuous arrival + shared prefixes** - the real serving regime. No tool in the tree creates it. `paged-loadgen.cpp` (committed here) does, via the confirmed `llama_paged_scheduler_*` API: - **shared system prefix** (`LG_PREFIX` tokens) prepended to every request -> exercises cross-request prefix sharing, - **variable prompt length** (`LG_SUFMIN..LG_SUFMAX` unique suffix), - **bimodal generation length** (`LG_GENLONG` for `LG_LONGPCT`% of requests, else `LG_GENSHORT`) - the over-reservation driver, - **continuous arrival**: keeps `LG_INFLIGHT` requests live, admitting a new one each time one finishes. It reports the load-bearing number for the buy decision - the **capacity ratio**: ``` paged peak KV = sum over live seqs of ceil(used/block)*block * kv_bytes_per_token contiguous reserve = peak_inflight * max_ctx * kv_bytes_per_token (worst-case per slot) CAPACITY RATIO = contiguous_reserve / paged_peak (+ prefix sharing on top) ``` `kv_bytes_per_token = 2 * n_layer * n_head_kv * head_dim * sizeof(f16)` - confirmed against `llama-kv-cache-paged.cpp` (e.g. Qwen3-32B: 2*64*8*128*2 = **256 KiB/token**). **How to run (on the target):** drop into PR #22569's `examples/paged/`, add to its CMakeLists next to `llama-paged`, build, then e.g. `LG_INFLIGHT=2048 LG_LONGPCT=15 paged-loadgen -m -kvp --fit off -ngpub -ncpub -ngl 99`. Sweep `LG_INFLIGHT` to the throughput plateau and read the capacity ratio at that point. It is written to run on the target (2x H200) where the regime exists; on GB10 it runs but the ratio is uninteresting because throughput plateaus before memory binds (see below). --- ## 3. Projection to 2x H200 (grounded in measured GB10 numbers) ### Measured on GB10 (this work) | model | decode plateau (aggregate) | plateau concurrency | bound by | |---|---|---|---| | Qwen3-32B-Q4_K_M (dense) | ~540 t/s | npl ~128 | compute | | Qwen3-1.7B-Q8_0 | ~3,200 t/s | npl ~512 | bandwidth | ### Hardware ratios (per GPU, then 2x TP at ~85% scaling) | | GB10 | H200 | per-GPU x | 2x H200 (TP) x | |---|---|---|---|---| | mem bandwidth | 273 GB/s | ~4.8 TB/s | 17.6 | ~30 | | BF16 compute | ~213 TFLOP | ~989 TFLOP | 4.6 | ~8 | | HBM | 119 GB | 141 GB | 1.18 | 2.4 (281 GB) | Decode is bandwidth-bound, so **both the aggregate ceiling and the concurrency at which it is reached scale with bandwidth (~30x on 2x H200)**: - **32B-dense aggregate decode ceiling:** 540 x 30 ~= **16,000 t/s**, reached at ~128 x 30 ~= **3,800 concurrent sequences**. ### Why paged KV becomes the binding lever on 2x H200 (and didn't on GB10) To reach that ~16k t/s ceiling you must hold **~3,800 sequences** of KV. The memory math: - 32B weights (FP8) ~= 32 GB, sharded over 2 GPUs -> ~250 GB HBM free for KV. - 32B KV = 256 KiB/token. At an avg held context of 2,000 tokens, **per seq = 512 MiB**. - Contiguous unified KV (reserve for the live set) fits ~250 GB / 512 MiB ~= **~490 sequences** - **8x short of the 3,800 needed to reach the throughput ceiling.** So on 2x H200 **KV memory is the binding constraint at the throughput-optimal concurrency**, and contiguous KV strands most of the bandwidth (you'd run at a fraction of 16k t/s). This is the gap paged KV closes. On GB10 it never appeared because GB10's 30x-lower bandwidth caps decode at npl ~128, whose KV fits in memory trivially - the constraint order is inverted on the real target. ### Magnitude of the paged win Paging recovers concurrency two ways, both multiplicative on achievable throughput: 1. **No over-reservation.** Contiguous must back `max_ctx` per slot; paging uses `ceil(actual/block)`. For a realistic bimodal workload (most generations short, ~15% long, prompts ~512) the average held context is several-fold below `max_ctx` -> `paged-loadgen` capacity ratio typically **~4-10x** (it measures the exact number for your workload's length distribution). 2. **Cross-request prefix sharing** of shared system prompts / RAG preambles - additional, workload-dependent (chained-hash block cache; vLLM's `block_pool.py`). Net: on 2x H200, paged KV is plausibly the difference between serving **~500 and ~3,800** concurrent 32B sequences in HBM, i.e. between a fraction of and ~all of the **~16k t/s** decode ceiling. **That is the datacenter payoff, and it is real on the target even though GB10 cannot exhibit it.** ### Honest caveats for the buy case - These are **projections** from GB10 + spec ratios; the capacity multiplier depends on the workload's context-length distribution (more variable -> bigger paged win) and TP efficiency. `paged-loadgen` measures it directly once you have target-GPU time. - The **paged op itself still needs work**: PR #22569's `ggml_paged_attn` was 12-13% *slower* than the mature contiguous flash-attention path at equal concurrency (`PR22569_EVAL.md`), lacks prefix sharing (deferred to a non-existent Phase 2), and has the fit-robustness bug above. Adopting paged KV for the target means either hardening #22569 or finishing the from-scratch P4 - the capacity win above assumes a *correct, competitive* op, which is the remaining engineering. - Prefill on either KV layout is compute-capped, not a paged concern. **Bottom line for the decision:** paged KV **is** the right lever for the 2x H200 target - the GB10 "no win" result is a bandwidth artifact, not a verdict. The paged path is now **correctness-verified**, the **benchmark to size the win exists**, and the projection says the payoff is **~5-10x concurrent-tenant capacity -> several-fold higher aggregate decode** on the target. The remaining work is hardening/finishing the paged op, not proving the thesis.