LocalAI/backend/cpp/llama-cpp/paged/PR22569_EVAL.md

# Evaluation: llama.cpp PR #22569 (paged KV cache, `-kvp`) on DGX Spark (GB10, sm_121)

Question: is upstream draft PR #22569 the right base to give LocalAI vLLM-class
high-concurrency GPU throughput, or should we finish our own from-scratch P4
(`backend/cpp/llama-cpp/paged/`)?

Date: 2026-06-21. Hardware: NVIDIA GB10 (compute 12.1 / sm_121), 122502 MiB unified
memory, CUDA 13.0, gcc 13.3. Models: `Qwen3-32B-Q4_K_M.gguf` (18.4 GB, 64 layers,
n_head 64 / n_head_kv 8 / head_dim 128 / n_embd 5120) and `Qwen3-0.6B-Q8_0.gguf` for
the correctness gate.

## TL;DR verdict: DO NOT adopt #22569. Finish our own P4.

On GB10 with a 32B dense model, PR #22569 delivers **no throughput win and no concurrency
win** - it is ~12% *slower* than the existing contiguous path and hits the *same*
256-sequence ceiling. The "scale to thousands of sequences like vLLM" premise does not
hold for this PR or this hardware/model. On top of that it is broken out of the box,
wired to the wrong integration surface, and a contested draft.

## 1. Builds? Correct?

- **Builds: YES.** Cloned `matiaslin/llama.cpp@paged_attention` (PR #22569, single commit
  `0b0f7bd...`, base = current master). Clean CUDA build for sm_121
  (`-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 -DCMAKE_BUILD_TYPE=Release`).
  `llama-paged`, `llama-batched-bench`, `test-paged-kv`, `test-paged-kv-e2e` all link.
  It is self-contained (ships its own CPU+CUDA `ggml_paged_attn` op) and does **not**
  depend on the competing CUDA PR #17579 (ericcurtin, `--pagedattention`).

- **Runs out of the box: NO.** `llama-paged -kvp` on Qwen3-32B *and* Qwen3-0.6B crashes
  at context creation:
  `build_attn(llm_graph_input_attn_kv_paged*) -> ggml_reshape_2d ->`
  `GGML_ASSERT(ggml_nelements(a) == ne0*ne1)` (src/llama-graph.cpp:2556). Same crash with
  `--fit off` (so it is the real graph, not just the memory probe).
  **Root cause:** the paged path hardcodes `ggml_reshape_2d(cur, hparams.n_embd, ...)`,
  wrong for any model where `n_head*head_dim != n_embd`. Qwen3 decouples head_dim:
  32B = 64*128 = **8192** vs n_embd 5120; 0.6B = 16*128 = **2048** vs 1024. The PR's
  "qwen3 verified" claim does **not** hold against current Qwen3 GGUFs. Fix is ~1 line
  (use the real attention width `cur->ne[0]*cur->ne[1]`); applied for the rest of the eval.

- **`fit_params` (`-ngpub` auto-sizing) also crashed on GB10** in the same reshape path
  during the device-memory probe (before the fix). After the reshape fix, paged
  auto-fit works (sized 96624 GPU blocks on the 0.6B from 85 GiB free).

- **Correctness after the reshape fix:** paged decode runs and produces **coherent**
  output on Qwen3-32B (sensible mercury / miso-soup / Starry-Night answers across 128 and
  256 concurrent sequences), indicating the `ggml_paged_attn` op is functionally roughly
  correct. PR's own greedy/top-K equivalence test (`test-paged-kv-e2e`, top-K argmax +
  top-5 overlap >= 4 + first-4-token greedy match vs non-paged) on Qwen3-0.6B did
  **not** reach a PASS/FAIL verdict on GB10: its paged auto-fit grabs ~88 GiB
  (96531 blocks) and the run then stalls at cache init (a third GB10 fit-robustness
  issue, distinct from the reshape bug). So the formal greedy-equivalence gate is
  **unverified on this box**, but the qualitative evidence (coherent multi-sequence 32B
  output with explicit small `-ngpub`) indicates the fixed op is roughly correct. This
  does not change the verdict, which is decided by throughput below.

## 2. Throughput: paged vs contiguous on GB10 (Qwen3-32B-Q4_K_M)

Contiguous = `llama-batched-bench` (unified KV, continuous batching), S_TG decode tok/s.
Paged = `llama-paged -kvp --fit off` (its scheduler-driven continuous-batching loop),
`aggregate tps`. Both `npp~16, ntg/n_predict=128, n_batch=n_ubatch=2048, -ngl 99`.

| npl  | contiguous (S_TG t/s) | paged `-kvp` (agg t/s) | outcome |
|------|----------------------|------------------------|---------|
| 128  | **537** (S 553)      | **477**                | both run; paged ~12% slower |
| 256  | **541** (S 550)      | **471**                | both run; paged ~13% slower; neither gains over 128 |
| 512  | FAIL                 | FAIL                   | **both** die: `n_seq_max must be <= 256` |
| 1024 | FAIL                 | FAIL                   | **both** die: `n_seq_max must be <= 256` |

### The decisive facts

1. **PR #22569 does NOT lift the 256-sequence ceiling.** Both contiguous and paged fail
   identically at npl 512/1024 with `n_seq_max must be <= 256` (llama.cpp's compile-time
   `LLAMA_MAX_SEQ`). It is **not** an OOM - GB10 has 119 GiB and at npl=256 contiguous KV
   is only 16 GiB. Paging gives **zero** concurrency headroom over contiguous here. The
   "paged unlocks thousands of seqs" premise is false for this PR.

2. **Paged is slower, not faster.** The fresh `ggml_paged_attn` op (477/471 t/s) loses to
   the mature CUDA flash-attention contiguous path (537/541 t/s) by ~12-13% at equal
   concurrency. The PR's A10G "2.5x" came entirely from contiguous OOMing at 26 seqs on a
   24 GiB card; that lever does not exist on GB10's 119 GiB.

3. **The 32B dense model is compute-bound and plateaus by npl=128 on GB10.** Aggregate is
   flat from 128->256 (contiguous 537->541; paged 477->471). Doubling concurrency buys
   nothing because the GPU is already saturated on the 32B weight matmuls. Even if we
   recompiled with a larger `LLAMA_MAX_SEQ`, aggregate would not climb - so vLLM-class
   ~24k aggregate is **unreachable for 32B-dense on a single GB10 regardless of KV
   layout**. The throughput gap to vLLM at this model/hardware is a compute/bandwidth
   problem, not a KV-fragmentation problem.

## 3. Verdict and reasoning: finish our own P4

**Do not adopt #22569 as the base.** Reasons:

- **No win on target hardware.** Even fully completed, on GB10 + 32B it is slower than
  what we already have and capped at the same 256 seqs. There is no throughput or
  concurrency dividend to harvest here.
- **Wrong integration surface.** Paged is driven only by a brand-new parallel C API
  (`llama_paged_scheduler_init/add_request/prepare_batch/get_batch_info/update/...`) and a
  bespoke `examples/paged` loop. `-kvp`/`--kv-paged` is gated to `LLAMA_EXAMPLE_PAGED`
  only - it is NOT wired into `llama-server`/`batched-bench`/`parallel`, i.e. NOT the path
  LocalAI's grpc-server derives from. Adopting it means rewriting LocalAI's serving loop
  around the new scheduler API.
- **Broken / restricted.** Crashes out of the box on all current Qwen3 (and any
  decoupled-head-dim model); fit_params crashed; Phase-1 restrictions enforced at context
  creation: single CUDA device, full offload only, `n_batch == n_ubatch`, no SWA
  (gemma3/llama4/etc. unsupported), no CoW / prefix-caching, no
  `seq_cp`/`seq_keep`/`seq_div`/`seq_add`, no state save/load.
- **Contested draft.** Unmerged; the author is openly asking maintainers whether the C
  API is even the right design; maintainers are skeptical of paged for single-node use.

**What P4 should actually target (re-scoped by this data).** The aggregate-throughput
gap to vLLM on a compute-bound dense model on one GB10 is not addressable by paged KV.
The durable, real LocalAI wins from paging are the ones our from-scratch P0 already
implements the machinery for and that #22569 explicitly omits:
- **on-demand KV sizing** (fit more *diverse* concurrent tenants without per-seq
  over-reservation), and
- **automatic cross-tenant prefix sharing** (chained-hash block cache - shared system
  prompts / RAG preambles), which #22569 defers to a non-existent Phase 2.

Finish our own P4 (CPU gather-read + a CUDA gather-read) against these capacity/
prefix-sharing objectives - measured as max concurrent *distinct* tenants and KV memory
saved, not single-model aggregate tok/s. To chase raw aggregate, the levers are lifting
`LLAMA_MAX_SEQ` and smaller/MoE models in memory-bandwidth-bound regimes - orthogonal to
paged attention. The ~1-line reshape fix found here (and the GB10 fit_params crash) are
worth upstreaming to #22569 regardless, but the PR is not our base.

### Reproduction (DGX, `~/llama.cpp-pr22569`)
```sh
export PATH=/usr/local/cuda/bin:$PATH
# contiguous
./build/bin/llama-batched-bench -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -npp 16 -ntg 128 \
  -npl 128 -c 20480 -b 2048 -ub 2048        # 256/512/1024 -> n_seq_max must be <= 256
# paged (needs the src/llama-graph.cpp:2556 reshape fix: hparams.n_embd -> cur->ne[0]*cur->ne[1])
./build/bin/llama-paged -m Qwen3-32B-Q4_K_M.gguf -kvp --fit off -ngpub 2048 -ncpub 128 \
  -np 128 -ns 128 -n 128 -b 2048 -ub 2048 -ngl 99   # 512/1024 -> n_seq_max must be <= 256
```