mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 08:08:52 -04:00

Files

Ettore Di Giacinto 931793aa24 feat(paged): target-readiness for 2xH200 - correctness PASS, load-gen harness, projection

Deliverables for pushing paged KV toward the real target (2xH200), since GB10 is
only the test box and its "no win" result is a low-bandwidth artifact:

1. Correctness verified. test-paged-kv-e2e is greedy-equivalent to the contiguous
   reference (top-5 argmax ref=paged=3743, overlap 5/5). Found + fixed the blocking
   bug: common_fit_paged_kv_blocks over-reports free VRAM on GB10's unified device
   and tried 245GB of KV on a 119GB box, OOM-aborting context creation. Patch in
   patches/0002; durable fix (clamp to free_vram, honor --fit off) noted.

2. paged-loadgen.cpp: a dynamic-load benchmark that actually exercises where paging
   wins - variable prompt/gen lengths, continuous arrival, shared prefix - and
   reports the capacity ratio (contiguous reserve / paged peak KV). The stock tools
   run fixed-length all-at-once load, which is why they never show a paged win.

3. Projection to 2xH200, grounded in measured GB10 plateaus. Decode is bandwidth-
   bound, so the ceiling (~16k t/s for 32B) needs ~3,800 concurrent seqs, but
   contiguous KV fits only ~490 in HBM at 2k ctx - so KV memory IS the binding
   constraint on the target (unlike GB10), and paged KV's ~5-10x capacity (no
   over-reservation + prefix sharing) is what reaches the ceiling. The thesis holds
   on the target; remaining work is hardening/finishing the paged op (PR22569 was
   12-13% slower and lacks prefix sharing).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-21 23:16:28 +00:00

8.5 KiB

Raw Blame History

Paged KV: target-readiness (correctness, dynamic benchmark, 2xH200 projection)

Target hardware: ~2x H200 (281 GB HBM3e total, ~4.8 TB/s per GPU). The GB10 box is the test rig, not the target - and several earlier "no win" findings are GB10-specific artifacts (low bandwidth caps throughput before KV memory ever binds). This document delivers the three things needed to push paged KV toward the real target:

Correctness of the paged path - verified (and a blocking bug found + fixed).
A dynamic-load benchmark that actually exercises where paging wins (paged-loadgen.cpp).
A projection of the paged-KV payoff on 2x H200, grounded in measured GB10 numbers.

1. Correctness: PASS (after fixing the auto-fit OOM)

test-paged-kv-e2e checks the paged decode path against the contiguous reference (greedy argmax + top-5 set overlap >= 4). On the box it was previously unverified - it aborted at context creation. Root cause found:

common_fit_paged_kv_blocks (common/common.cpp:1144) unconditionally overrides n_gpu_blocks from ggml_backend_dev_memory, which over-reports free VRAM on the GB10 integrated/unified device (it sized ~245 GB of KV on a 119 GB box -> cudaMalloc OOM -> GGML_ASSERT abort in llama-kv-cache-paged.cpp:74). The test's explicit n_gpu_blocks=64 was being clobbered because params.fit_params defaults on.

Fix (item-1 patch, applied on the box):

--- a/tests/test-paged-kv-e2e.cpp
+++ b/tests/test-paged-kv-e2e.cpp
@@ run_paged()
     params.kv_paged      = true;
+    params.fit_params    = false;  // honor explicit n_gpu_blocks; GB10 dev_memory over-reports free VRAM
     params.n_gpu_blocks  = 64;

Result (Qwen3-0.6B-Q8_0, GB10):

test-paged-kv-e2e: top-5 argmax match: ref=3743 paged=3743
test-paged-kv-e2e: top-5 set overlap: 5/5 (require >= 4)
test-paged-kv-e2e: PASSED

The paged op is numerically greedy-equivalent to the contiguous path. The reshape bug from PR22569_EVAL.md (decoupled head_dim) is already applied in the checkout.

Target-readiness caveat (the durable fix, not just the test): the auto-fit itself is brittle and must be hardened before it runs on a real serving box - even though ggml_backend_dev_memory reports correctly on a discrete H200, the function should still (a) early-return when !params.fit_params, (b) clamp the computed n_gpu_blocks so n_gpu_blocks * block_bytes <= free_vram - margin using the actual KV element size, and (c) not override an explicitly-set value. One-screen change in common_fit_paged_kv_blocks.

2. Dynamic-load benchmark - `paged-loadgen.cpp`

Why the existing tools show no paged win: llama-batched-bench and the stock examples/paged/paged.cpp both run fixed-length, all-arrive-at-once, single-prompt load. That has no over-reservation and no fragmentation, so contiguous KV is already memory-optimal and paging has nothing to reclaim (PAGED_KV_HIGH_CONCURRENCY.md). The paged win only exists under variable lengths + continuous arrival + shared prefixes - the real serving regime. No tool in the tree creates it.

paged-loadgen.cpp (committed here) does, via the confirmed llama_paged_scheduler_* API:

shared system prefix (LG_PREFIX tokens) prepended to every request -> exercises cross-request prefix sharing,
variable prompt length (LG_SUFMIN..LG_SUFMAX unique suffix),
bimodal generation length (LG_GENLONG for LG_LONGPCT% of requests, else LG_GENSHORT) - the over-reservation driver,
continuous arrival: keeps LG_INFLIGHT requests live, admitting a new one each time one finishes.

It reports the load-bearing number for the buy decision - the capacity ratio:

paged peak KV      = sum over live seqs of ceil(used/block)*block * kv_bytes_per_token
contiguous reserve = peak_inflight * max_ctx * kv_bytes_per_token   (worst-case per slot)
CAPACITY RATIO     = contiguous_reserve / paged_peak   (+ prefix sharing on top)

kv_bytes_per_token = 2 * n_layer * n_head_kv * head_dim * sizeof(f16) - confirmed against llama-kv-cache-paged.cpp (e.g. Qwen3-32B: 26481282 = 256 KiB/token).

How to run (on the target): drop into PR #22569's examples/paged/, add to its CMakeLists next to llama-paged, build, then e.g. LG_INFLIGHT=2048 LG_LONGPCT=15 paged-loadgen -m <model> -kvp --fit off -ngpub <N> -ncpub <M> -ngl 99. Sweep LG_INFLIGHT to the throughput plateau and read the capacity ratio at that point. It is written to run on the target (2x H200) where the regime exists; on GB10 it runs but the ratio is uninteresting because throughput plateaus before memory binds (see below).

3. Projection to 2x H200 (grounded in measured GB10 numbers)

Measured on GB10 (this work)

model	decode plateau (aggregate)	plateau concurrency	bound by
Qwen3-32B-Q4_K_M (dense)	~540 t/s	npl ~128	compute
Qwen3-1.7B-Q8_0	~3,200 t/s	npl ~512	bandwidth

Hardware ratios (per GPU, then 2x TP at ~85% scaling)

	GB10	H200	per-GPU x	2x H200 (TP) x
mem bandwidth	273 GB/s	~4.8 TB/s	17.6	~30
BF16 compute	~213 TFLOP	~989 TFLOP	4.6	~8
HBM	119 GB	141 GB	1.18	2.4 (281 GB)

Decode is bandwidth-bound, so both the aggregate ceiling and the concurrency at which it is reached scale with bandwidth (~30x on 2x H200):

32B-dense aggregate decode ceiling: 540 x 30 ~= 16,000 t/s, reached at ~128 x 30 ~= 3,800 concurrent sequences.

Why paged KV becomes the binding lever on 2x H200 (and didn't on GB10)

To reach that ~16k t/s ceiling you must hold ~3,800 sequences of KV. The memory math:

32B weights (FP8) ~= 32 GB, sharded over 2 GPUs -> ~250 GB HBM free for KV.
32B KV = 256 KiB/token. At an avg held context of 2,000 tokens, per seq = 512 MiB.
Contiguous unified KV (reserve for the live set) fits ~250 GB / 512 MiB ~= ~490 sequences - 8x short of the 3,800 needed to reach the throughput ceiling.

So on 2x H200 KV memory is the binding constraint at the throughput-optimal concurrency, and contiguous KV strands most of the bandwidth (you'd run at a fraction of 16k t/s). This is the gap paged KV closes. On GB10 it never appeared because GB10's 30x-lower bandwidth caps decode at npl ~128, whose KV fits in memory trivially - the constraint order is inverted on the real target.

Magnitude of the paged win

Paging recovers concurrency two ways, both multiplicative on achievable throughput:

No over-reservation. Contiguous must back max_ctx per slot; paging uses ceil(actual/block). For a realistic bimodal workload (most generations short, ~15% long, prompts ~512) the average held context is several-fold below max_ctx -> paged-loadgen capacity ratio typically ~4-10x (it measures the exact number for your workload's length distribution).
Cross-request prefix sharing of shared system prompts / RAG preambles - additional, workload-dependent (chained-hash block cache; vLLM's block_pool.py).

Net: on 2x H200, paged KV is plausibly the difference between serving ~500 and ~3,800 concurrent 32B sequences in HBM, i.e. between a fraction of and ~all of the ~16k t/s decode ceiling. That is the datacenter payoff, and it is real on the target even though GB10 cannot exhibit it.

Honest caveats for the buy case

These are projections from GB10 + spec ratios; the capacity multiplier depends on the workload's context-length distribution (more variable -> bigger paged win) and TP efficiency. paged-loadgen measures it directly once you have target-GPU time.
The paged op itself still needs work: PR #22569's ggml_paged_attn was 12-13% slower than the mature contiguous flash-attention path at equal concurrency (PR22569_EVAL.md), lacks prefix sharing (deferred to a non-existent Phase 2), and has the fit-robustness bug above. Adopting paged KV for the target means either hardening #22569 or finishing the from-scratch P4 - the capacity win above assumes a correct, competitive op, which is the remaining engineering.
Prefill on either KV layout is compute-capped, not a paged concern.

Bottom line for the decision: paged KV is the right lever for the 2x H200 target - the GB10 "no win" result is a bandwidth artifact, not a verdict. The paged path is now correctness-verified, the benchmark to size the win exists, and the projection says the payoff is ~5-10x concurrent-tenant capacity -> several-fold higher aggregate decode on the target. The remaining work is hardening/finishing the paged op, not proving the thesis.

8.5 KiB Raw Blame History