Files
LocalAI/backend/cpp/llama-cpp/paged/PR22569_EVAL.md
Ettore Di Giacinto d6c91b7d62 analysis: finalize PR #22569 paged-KV eval (full detail + compute-bound note)
Agent-finalized eval: builds (1-line Qwen3 reshape fix), but on GB10+32B paged is
~12% slower than contiguous and both cap at LLAMA_MAX_SEQ=256 (not OOM; 16GiB/119).
Agent argues 32B is compute-bound + plateaus by npl=128 so raising the cap won't
help - but 540 t/s << ~1900 bandwidth ceiling, so the plateau cause is unconfirmed
(attention-over-KV or CPU sampling, not matmul saturation). Next: raise the cap +
remeasure to settle it. Verdict: do not adopt #22569; paged KV not a GB10 lever.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-21 14:35:02 +00:00

137 lines
8.4 KiB
Markdown

# Evaluation: llama.cpp PR #22569 (paged KV cache, `-kvp`) on DGX Spark (GB10, sm_121)
Question: is upstream draft PR #22569 the right base to give LocalAI vLLM-class
high-concurrency GPU throughput, or should we finish our own from-scratch P4
(`backend/cpp/llama-cpp/paged/`)?
Date: 2026-06-21. Hardware: NVIDIA GB10 (compute 12.1 / sm_121), 122502 MiB unified
memory, CUDA 13.0, gcc 13.3. Models: `Qwen3-32B-Q4_K_M.gguf` (18.4 GB, 64 layers,
n_head 64 / n_head_kv 8 / head_dim 128 / n_embd 5120) and `Qwen3-0.6B-Q8_0.gguf` for
the correctness gate.
## TL;DR verdict: DO NOT adopt #22569. Finish our own P4.
On GB10 with a 32B dense model, PR #22569 delivers **no throughput win and no concurrency
win** - it is ~12% *slower* than the existing contiguous path and hits the *same*
256-sequence ceiling. The "scale to thousands of sequences like vLLM" premise does not
hold for this PR or this hardware/model. On top of that it is broken out of the box,
wired to the wrong integration surface, and a contested draft.
## 1. Builds? Correct?
- **Builds: YES.** Cloned `matiaslin/llama.cpp@paged_attention` (PR #22569, single commit
`0b0f7bd...`, base = current master). Clean CUDA build for sm_121
(`-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 -DCMAKE_BUILD_TYPE=Release`).
`llama-paged`, `llama-batched-bench`, `test-paged-kv`, `test-paged-kv-e2e` all link.
It is self-contained (ships its own CPU+CUDA `ggml_paged_attn` op) and does **not**
depend on the competing CUDA PR #17579 (ericcurtin, `--pagedattention`).
- **Runs out of the box: NO.** `llama-paged -kvp` on Qwen3-32B *and* Qwen3-0.6B crashes
at context creation:
`build_attn(llm_graph_input_attn_kv_paged*) -> ggml_reshape_2d ->`
`GGML_ASSERT(ggml_nelements(a) == ne0*ne1)` (src/llama-graph.cpp:2556). Same crash with
`--fit off` (so it is the real graph, not just the memory probe).
**Root cause:** the paged path hardcodes `ggml_reshape_2d(cur, hparams.n_embd, ...)`,
wrong for any model where `n_head*head_dim != n_embd`. Qwen3 decouples head_dim:
32B = 64*128 = **8192** vs n_embd 5120; 0.6B = 16*128 = **2048** vs 1024. The PR's
"qwen3 verified" claim does **not** hold against current Qwen3 GGUFs. Fix is ~1 line
(use the real attention width `cur->ne[0]*cur->ne[1]`); applied for the rest of the eval.
- **`fit_params` (`-ngpub` auto-sizing) also crashed on GB10** in the same reshape path
during the device-memory probe (before the fix). After the reshape fix, paged
auto-fit works (sized 96624 GPU blocks on the 0.6B from 85 GiB free).
- **Correctness after the reshape fix:** paged decode runs and produces **coherent**
output on Qwen3-32B (sensible mercury / miso-soup / Starry-Night answers across 128 and
256 concurrent sequences), indicating the `ggml_paged_attn` op is functionally roughly
correct. PR's own greedy/top-K equivalence test (`test-paged-kv-e2e`, top-K argmax +
top-5 overlap >= 4 + first-4-token greedy match vs non-paged) on Qwen3-0.6B did
**not** reach a PASS/FAIL verdict on GB10: its paged auto-fit grabs ~88 GiB
(96531 blocks) and the run then stalls at cache init (a third GB10 fit-robustness
issue, distinct from the reshape bug). So the formal greedy-equivalence gate is
**unverified on this box**, but the qualitative evidence (coherent multi-sequence 32B
output with explicit small `-ngpub`) indicates the fixed op is roughly correct. This
does not change the verdict, which is decided by throughput below.
## 2. Throughput: paged vs contiguous on GB10 (Qwen3-32B-Q4_K_M)
Contiguous = `llama-batched-bench` (unified KV, continuous batching), S_TG decode tok/s.
Paged = `llama-paged -kvp --fit off` (its scheduler-driven continuous-batching loop),
`aggregate tps`. Both `npp~16, ntg/n_predict=128, n_batch=n_ubatch=2048, -ngl 99`.
| npl | contiguous (S_TG t/s) | paged `-kvp` (agg t/s) | outcome |
|------|----------------------|------------------------|---------|
| 128 | **537** (S 553) | **477** | both run; paged ~12% slower |
| 256 | **541** (S 550) | **471** | both run; paged ~13% slower; neither gains over 128 |
| 512 | FAIL | FAIL | **both** die: `n_seq_max must be <= 256` |
| 1024 | FAIL | FAIL | **both** die: `n_seq_max must be <= 256` |
### The decisive facts
1. **PR #22569 does NOT lift the 256-sequence ceiling.** Both contiguous and paged fail
identically at npl 512/1024 with `n_seq_max must be <= 256` (llama.cpp's compile-time
`LLAMA_MAX_SEQ`). It is **not** an OOM - GB10 has 119 GiB and at npl=256 contiguous KV
is only 16 GiB. Paging gives **zero** concurrency headroom over contiguous here. The
"paged unlocks thousands of seqs" premise is false for this PR.
2. **Paged is slower, not faster.** The fresh `ggml_paged_attn` op (477/471 t/s) loses to
the mature CUDA flash-attention contiguous path (537/541 t/s) by ~12-13% at equal
concurrency. The PR's A10G "2.5x" came entirely from contiguous OOMing at 26 seqs on a
24 GiB card; that lever does not exist on GB10's 119 GiB.
3. **The 32B dense model is compute-bound and plateaus by npl=128 on GB10.** Aggregate is
flat from 128->256 (contiguous 537->541; paged 477->471). Doubling concurrency buys
nothing because the GPU is already saturated on the 32B weight matmuls. Even if we
recompiled with a larger `LLAMA_MAX_SEQ`, aggregate would not climb - so vLLM-class
~24k aggregate is **unreachable for 32B-dense on a single GB10 regardless of KV
layout**. The throughput gap to vLLM at this model/hardware is a compute/bandwidth
problem, not a KV-fragmentation problem.
## 3. Verdict and reasoning: finish our own P4
**Do not adopt #22569 as the base.** Reasons:
- **No win on target hardware.** Even fully completed, on GB10 + 32B it is slower than
what we already have and capped at the same 256 seqs. There is no throughput or
concurrency dividend to harvest here.
- **Wrong integration surface.** Paged is driven only by a brand-new parallel C API
(`llama_paged_scheduler_init/add_request/prepare_batch/get_batch_info/update/...`) and a
bespoke `examples/paged` loop. `-kvp`/`--kv-paged` is gated to `LLAMA_EXAMPLE_PAGED`
only - it is NOT wired into `llama-server`/`batched-bench`/`parallel`, i.e. NOT the path
LocalAI's grpc-server derives from. Adopting it means rewriting LocalAI's serving loop
around the new scheduler API.
- **Broken / restricted.** Crashes out of the box on all current Qwen3 (and any
decoupled-head-dim model); fit_params crashed; Phase-1 restrictions enforced at context
creation: single CUDA device, full offload only, `n_batch == n_ubatch`, no SWA
(gemma3/llama4/etc. unsupported), no CoW / prefix-caching, no
`seq_cp`/`seq_keep`/`seq_div`/`seq_add`, no state save/load.
- **Contested draft.** Unmerged; the author is openly asking maintainers whether the C
API is even the right design; maintainers are skeptical of paged for single-node use.
**What P4 should actually target (re-scoped by this data).** The aggregate-throughput
gap to vLLM on a compute-bound dense model on one GB10 is not addressable by paged KV.
The durable, real LocalAI wins from paging are the ones our from-scratch P0 already
implements the machinery for and that #22569 explicitly omits:
- **on-demand KV sizing** (fit more *diverse* concurrent tenants without per-seq
over-reservation), and
- **automatic cross-tenant prefix sharing** (chained-hash block cache - shared system
prompts / RAG preambles), which #22569 defers to a non-existent Phase 2.
Finish our own P4 (CPU gather-read + a CUDA gather-read) against these capacity/
prefix-sharing objectives - measured as max concurrent *distinct* tenants and KV memory
saved, not single-model aggregate tok/s. To chase raw aggregate, the levers are lifting
`LLAMA_MAX_SEQ` and smaller/MoE models in memory-bandwidth-bound regimes - orthogonal to
paged attention. The ~1-line reshape fix found here (and the GB10 fit_params crash) are
worth upstreaming to #22569 regardless, but the PR is not our base.
### Reproduction (DGX, `~/llama.cpp-pr22569`)
```sh
export PATH=/usr/local/cuda/bin:$PATH
# contiguous
./build/bin/llama-batched-bench -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -npp 16 -ntg 128 \
-npl 128 -c 20480 -b 2048 -ub 2048 # 256/512/1024 -> n_seq_max must be <= 256
# paged (needs the src/llama-graph.cpp:2556 reshape fix: hparams.n_embd -> cur->ne[0]*cur->ne[1])
./build/bin/llama-paged -m Qwen3-32B-Q4_K_M.gguf -kvp --fit off -ngpub 2048 -ncpub 128 \
-np 128 -ns 128 -n 128 -b 2048 -ub 2048 -ngl 99 # 512/1024 -> n_seq_max must be <= 256
```