Record the belt-and-suspenders GPU run of the 0007 prefix-engine driver and a shared-prefix throughput benchmark. The committed CPU driver passes ALL PASS; the CUDA build fails only the strict greedy-token-equality assertions (the same binary fails them at ngl=0 too), which is CUDA float-kernel non-determinism, not a paged-logic defect - every structural KV-reuse invariant passes on GPU. The shared-prefix benchmark shows a real, K-scaling win: prefill wall time drops 7.2x (32B K=16) to 10.3x (32B K=32) when the shared prefix is computed once and reused via the paged cross-request prefix cache. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
6.2 KiB
Paged-KV: GPU 0007 re-run + shared-prefix throughput benchmark
DGX Spark (NVIDIA GB10, sm_121 / cc 12.1), CUDA 13, dev tree ~/llama-paged-dev
branch paged, base pin f3e182816421c648188b5eab269853bf1531d950, full paged
engine (0001-0004, 0006, 0007). All paged behaviour stays gated by
LLAMA_KV_PAGED; default-off is byte-identical to stock. Models:
Qwen3-0.6B-Q8_0.gguf and Qwen3-32B-Q4_K_M.gguf.
Deliverable 1 - GPU run of the 0007 prefix-engine correctness driver
The committed driver examples/simple/paged-prefix-engine.cpp hardcodes
n_gpu_layers = 0. For this GPU run it was given a dev-only
PAGED_NGL env override (mp.n_gpu_layers = getenv("PAGED_NGL") ? atoi(...) : 0),
rebuilt in build-cuda, run, then the edit was reverted so the committed
driver stays byte-clean (it is dev scaffolding, never shipped in a patch).
Three runs of the same Gate-0 driver, Qwen3-0.6B, LLAMA_KV_PAGED=1:
| binary / offload | result |
|---|---|
committed build-cpu driver |
ALL PASS (failures=0) |
build-cuda, PAGED_NGL=99 (all layers) |
GATE FAILED (failures=3) |
build-cuda, PAGED_NGL=0 (same binary) |
GATE FAILED (failures=2) |
The GPU run did NOT print ALL PASS - reported honestly. But the failures are narrow and are not a paged-engine bug:
- Every structural / mechanical paged invariant PASSES on GPU, in both
scenarios (boundary and mid-block): prefill computed ONLY the suffix (32 prefix
tokens skipped), shared prefix block-aligned, shared-block
ref_cnt == 2while both sequences hold it, ref drops2 -> 1on freeing one sharer, only the private (suffix) blocks are returned, and the prefix block returns to the pool once all sharers free. The cross-request KV reuse mechanism itself is GPU-clean. - The only failures are the exact greedy-token byte-identical assertions
(e.g. boundary
B-sharedvsB-from-scratch). They diverge at a single near-tie token (boundary: 2nd generated token17971vs5671) and then cascade autoregressively.
Root cause is CUDA float-kernel non-determinism, not the paged logic: the
same CUDA binary fails the exact-token assertions even with PAGED_NGL=0 (zero
layers offloaded), whereas the genuine build-cpu binary passes all 16/16. The
CUDA backend (loaded via ggml_backend_load_all) uses non-associative reductions
whose result differs between the full-prefill batch shape and the
incremental-suffix batch shape; under greedy decode a single logit near-tie flips
and the sequences cascade apart. This refines the earlier note in
PAGED_GPU_VERIFY.md (which framed it as "not GPU-specific" and had no CPU pass
to compare against): the CPU build now passes clean, so the divergence is a strict
test-assertion artefact of CUDA float ordering, not a defect in 0006/0007.
Deliverable 2 - shared-prefix throughput benchmark (the real-win test)
Dev-only driver examples/simple/paged-prefix-bench.cpp (registered in
examples/simple/CMakeLists.txt, dev tree only - not in any shipped patch).
Workload: K sequences that all share a P-token common prefix (a system /
RAG preamble), each with a unique S-token suffix; prefill only (G=0,
generation is identical compute in both modes so it is excluded from the
headline). GPU, -ngl 99, kv_unified = true.
- NO-SHARE (stock):
LLAMA_KV_PAGEDunset; every sequence prefills the fullP+Stokens. Total prefill work= K*(P+S). - PAGED-SHARE:
LLAMA_KV_PAGED=1; the prefix is computed ONCE on seq 0, committed viapaged_prefix_api::commit, then every other seq callspaged_prefix_api::shareto physically reuse the ref-counted prefix blocks and prefills ONLY its suffix. Total prefill work= P + K*S.
kv_unified note: this engine's cross-request share is built around the
unified stream-0 pool (ref-counted shared cells), so kv_unified = true is what
makes the share engage - the same setting the committed 0007 driver uses. With
kv_unified = true the share engaged in every run (evidence below).
Reuse actually engaged (share mode)
In every share run: kshare(seq 1) = 1024 (the full block-aligned prefix is
reused, not recomputed), the shared prefix block's ref_cnt == K (all sharers
point at one physical copy), and prefill_tokens_submitted collapses from
K*(P+S) to P + K*S.
Results (P=1024, S=32, prefill-only)
| model | K | mode | prefill tokens | prefill time | raw tok/s | shared ref_cnt |
|---|---|---|---|---|---|---|
| Qwen3-0.6B | 32 | no-share | 33792 | 4.659 s | 7253 | - |
| Qwen3-0.6B | 32 | share | 2048 | 0.554 s | 3695 | 32 |
| Qwen3-32B | 16 | no-share | 16896 | 26.14 s | 647 | - |
| Qwen3-32B | 16 | share | 1536 | 3.64 s | 422 | 16 |
| Qwen3-32B | 32 | no-share | 33792 | 61.91 s | 546 | - |
| Qwen3-32B | 32 | share | 2048 | 6.02 s | 340 | 32 |
Verdict: YES, a real and substantial win, and it grows with K
- Prefill wall-time speedup: 0.6B K=32 -> 8.4x, 32B K=16 -> 7.2x,
32B K=32 -> 10.3x. The win grows with the number of sharers because
no-share prefix recompute is
O(K)while the shared prefix isO(1)plusKtiny suffixes. - Note the honest caveat in the raw-throughput column: share mode submits small 32-token suffix batches that are less GPU-efficient (340-422 tok/s) than the large no-share batches (546-7253 tok/s). The win is not higher tok/s - it is computing ~11-16x fewer tokens. On a fast GB10 prefill that still nets a 7-10x wall-time reduction because prefill is compute-bound and the shared prefix dominates the token count.
- This is exactly the many-users-one-system-prompt / RAG-preamble fan-out scenario, and the paged cross-request prefix cache delivers there.
Scaffolding (paged-prefix-bench.cpp, the PAGED_NGL driver tweak) stays
dev-tree-only and is not part of any shipped patch.
Assisted-by: Claude:opus-4.8 [Claude Code]