mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 08:08:52 -04:00


Ettore Di Giacinto d6c91b7d62 analysis: finalize PR #22569 paged-KV eval (full detail + compute-bound note)

Agent-finalized eval: builds (1-line Qwen3 reshape fix), but on GB10+32B paged is
~12% slower than contiguous and both cap at LLAMA_MAX_SEQ=256 (not OOM; 16GiB/119).
Agent argues 32B is compute-bound + plateaus by npl=128 so raising the cap won't
help - but 540 t/s << ~1900 bandwidth ceiling, so the plateau cause is unconfirmed
(attention-over-KV or CPU sampling, not matmul saturation). Next: raise the cap +
remeasure to settle it. Verdict: do not adopt #22569; paged KV not a GB10 lever.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-21 14:35:02 +00:00


Evaluation: llama.cpp PR #22569 (paged KV cache, `-kvp`) on DGX Spark (GB10, sm_121)

Question: is upstream draft PR #22569 the right base to give LocalAI vLLM-class high-concurrency GPU throughput, or should we finish our own from-scratch P4 (backend/cpp/llama-cpp/paged/)?

Date: 2026-06-21. Hardware: NVIDIA GB10 (compute 12.1 / sm_121), 122502 MiB unified memory, CUDA 13.0, gcc 13.3. Models: Qwen3-32B-Q4_K_M.gguf (18.4 GB, 64 layers, n_head 64 / n_head_kv 8 / head_dim 128 / n_embd 5120) and Qwen3-0.6B-Q8_0.gguf for the correctness gate.

TL;DR verdict: DO NOT adopt #22569. Finish our own P4.

On GB10 with a 32B dense model, PR #22569 delivers no throughput win and no concurrency win - it is ~12% slower than the existing contiguous path and hits the same 256-sequence ceiling. The "scale to thousands of sequences like vLLM" premise does not hold for this PR or this hardware/model. On top of that it is broken out of the box, wired to the wrong integration surface, and a contested draft.

1. Builds? Correct?

Builds: YES. Cloned matiaslin/llama.cpp@paged_attention (PR #22569, single commit 0b0f7bd..., base = current master). Clean CUDA build for sm_121 (-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 -DCMAKE_BUILD_TYPE=Release). llama-paged, llama-batched-bench, test-paged-kv, test-paged-kv-e2e all link. It is self-contained (ships its own CPU+CUDA ggml_paged_attn op) and does not depend on the competing CUDA PR #17579 (ericcurtin, --pagedattention).
Runs out of the box: NO. llama-paged -kvp on Qwen3-32B and Qwen3-0.6B crashes at context creation: build_attn(llm_graph_input_attn_kv_paged*) -> ggml_reshape_2d -> GGML_ASSERT(ggml_nelements(a) == ne0*ne1) (src/llama-graph.cpp:2556). Same crash with --fit off (so it is the real graph, not just the memory probe). Root cause: the paged path hardcodes ggml_reshape_2d(cur, hparams.n_embd, ...), wrong for any model where n_head*head_dim != n_embd. Qwen3 decouples head_dim: 32B = 64 × 128 = 8192 vs n_embd 5120; 0.6B = 16 × 128 = 2048 vs 1024. The PR's "qwen3 verified" claim does not hold against current Qwen3 GGUFs. Fix is ~1 line (use the real attention width cur->ne[0]*cur->ne[1]); applied for the rest of the eval.
fit_params (-ngpub auto-sizing) also crashed on GB10 in the same reshape path during the device-memory probe (before the fix). After the reshape fix, paged auto-fit works (sized 96624 GPU blocks on the 0.6B from 85 GiB free).
Correctness after the reshape fix: paged decode runs and produces coherent output on Qwen3-32B (sensible mercury / miso-soup / Starry-Night answers across 128 and 256 concurrent sequences), indicating the ggml_paged_attn op is functionally roughly correct. PR's own greedy/top-K equivalence test (test-paged-kv-e2e, top-K argmax + top-5 overlap >= 4 + first-4-token greedy match vs non-paged) on Qwen3-0.6B did not reach a PASS/FAIL verdict on GB10: its paged auto-fit grabs ~88 GiB (96531 blocks) and the run then stalls at cache init (a third GB10 fit-robustness issue, distinct from the reshape bug). So the formal greedy-equivalence gate is unverified on this box, but the qualitative evidence (coherent multi-sequence 32B output with explicit small -ngpub) indicates the fixed op is roughly correct. This does not change the verdict, which is decided by throughput below.
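
The reshape failure described above comes down to an element-count mismatch, which the following minimal sketch reproduces in plain Python. It is not llama.cpp code: `reshape_2d_ok` and `check` are hypothetical helpers, and the `[head_dim, n_head, n_tokens]` layout of the attention output is an assumption inferred from the fix (`cur->ne[0]*cur->ne[1]`).

```python
# Sketch of the element-count check behind the crash; not llama.cpp code.
# Head counts and widths are the Qwen3 values quoted in this eval.

def reshape_2d_ok(nelements: int, ne0: int, ne1: int) -> bool:
    """Same condition as the failing GGML_ASSERT: nelements == ne0 * ne1."""
    return nelements == ne0 * ne1


def check(n_head: int, head_dim: int, n_embd: int, n_tokens: int = 4):
    # ASSUMPTION: attention output is laid out as [head_dim, n_head, n_tokens].
    nelements = head_dim * n_head * n_tokens
    # PR as submitted: first dimension hardcoded to n_embd.
    hardcoded = reshape_2d_ok(nelements, n_embd, n_tokens)
    # Fix used in this eval: first dimension is the real attention width.
    fixed = reshape_2d_ok(nelements, head_dim * n_head, n_tokens)
    return hardcoded, fixed


# (n_head, head_dim, n_embd) per model, taken from the text above.
models = {"Qwen3-32B": (64, 128, 5120), "Qwen3-0.6B": (16, 128, 1024)}
results = {name: check(*dims) for name, dims in models.items()}

for name, (hardcoded, fixed) in results.items():
    print(f"{name}: hardcoded n_embd ok={hardcoded}, real width ok={fixed}")
```

A model with n_head*head_dim == n_embd passes both checks, which would explain how the hardcoded width could go unnoticed on models with coupled head dimensions.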

2. Throughput: paged vs contiguous on GB10 (Qwen3-32B-Q4_K_M)

Contiguous = llama-batched-bench (unified KV, continuous batching), S_TG decode tok/s. Paged = llama-paged -kvp --fit off (its scheduler-driven continuous-batching loop), aggregate tps. Both npp~16, ntg/n_predict=128, n_batch=n_ubatch=2048, -ngl 99.

| npl | contiguous (S_TG t/s) | paged `-kvp` (agg t/s) | outcome |
| --- | --- | --- | --- |
| 128 | 537 (S 553) | 477 | both run; paged ~11% slower |
| 256 | 541 (S 550) | 471 | both run; paged ~13% slower; neither gains over 128 |
| 512 | FAIL | FAIL | both die: `n_seq_max must be <= 256` |
| 1024 | FAIL | FAIL | both die: `n_seq_max must be <= 256` |
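
The relative slowdown and the 128->256 scaling can be recomputed from the table with a few lines of Python. This is only a sketch: it takes the first contiguous figure in each cell (not the parenthesised "S" value), and `slowdown_pct` is a hypothetical helper name.

```python
# Decode throughput from the table above (tok/s), keyed by npl.
# Uses the first contiguous figure in each cell, not the "(S ...)" value.
contiguous = {128: 537, 256: 541}
paged = {128: 477, 256: 471}


def slowdown_pct(baseline: float, candidate: float) -> float:
    """How much slower candidate is than baseline, as a percentage of baseline."""
    return 100.0 * (baseline - candidate) / baseline


slowdowns = {npl: slowdown_pct(contiguous[npl], paged[npl]) for npl in contiguous}

# Scaling factor when concurrency doubles from 128 to 256 sequences.
scaling = {
    "contiguous": contiguous[256] / contiguous[128],
    "paged": paged[256] / paged[128],
}

for npl, pct in slowdowns.items():
    print(f"npl={npl}: paged is {pct:.1f}% slower than contiguous")
for name, factor in scaling.items():
    print(f"{name}: 128->256 scaling factor {factor:.3f}")
```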

The decisive facts

PR #22569 does NOT lift the 256-sequence ceiling. Both contiguous and paged fail identically at npl 512/1024 with n_seq_max must be <= 256 (llama.cpp's compile-time LLAMA_MAX_SEQ). It is not an OOM - GB10 has 119 GiB and at npl=256 contiguous KV is only 16 GiB. Paging gives zero concurrency headroom over contiguous here. The "paged unlocks thousands of seqs" premise is false for this PR.
Paged is slower, not faster. The fresh ggml_paged_attn op (477/471 t/s) loses to the mature CUDA flash-attention contiguous path (537/541 t/s) by ~11-13% at equal concurrency. The PR's A10G "2.5x" came entirely from contiguous OOMing at 26 seqs on a 24 GiB card; that lever does not exist on GB10's 119 GiB.
The 32B dense model plateaus by npl=128 on GB10 and appears compute-bound. Aggregate is flat from 128->256 (contiguous 537->541; paged 477->471), so doubling concurrency buys nothing. The working explanation is that the GPU is already saturated on the 32B weight matmuls, but this eval did not isolate the limiter: attention over the KV cache or CPU-side sampling would produce the same plateau. If the plateau holds, recompiling with a larger LLAMA_MAX_SEQ would not raise aggregate, and vLLM-class ~24k aggregate is unreachable for 32B-dense on a single GB10 regardless of KV layout. Either way, paged and contiguous plateau together, so the throughput gap to vLLM at this model/hardware is not a KV-fragmentation problem.
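
To put the plateau in context, here is a rough weight-read ceiling for batched decode, sketched under stated assumptions. The 273 GB/s bandwidth figure is an assumption (a nominal number for GB10, not measured in this eval); the model ignores KV-cache reads, activations, and compute entirely, so it is an upper bound only. `bandwidth_ceiling_tps` is a hypothetical helper.

```python
WEIGHT_GB = 18.4        # Qwen3-32B-Q4_K_M size quoted in this eval
BANDWIDTH_GBPS = 273.0  # ASSUMPTION: nominal GB10 memory bandwidth; use a measured value if available


def bandwidth_ceiling_tps(npl: int,
                          weight_gb: float = WEIGHT_GB,
                          bandwidth_gbps: float = BANDWIDTH_GBPS) -> float:
    """Upper bound on aggregate decode tok/s if streaming the weights were the only cost."""
    steps_per_s = bandwidth_gbps / weight_gb  # full passes over the weights per second
    return steps_per_s * npl                  # each pass yields one token per sequence


# Contiguous S_TG figures from the table (tok/s), keyed by npl.
measured = {128: 537, 256: 541}

for npl, tps in measured.items():
    ceiling = bandwidth_ceiling_tps(npl)
    print(f"npl={npl}: measured {tps} t/s, weight-read ceiling ~{ceiling:.0f} t/s "
          f"({tps / ceiling:.0%} of ceiling)")
```

Under these assumptions the measured aggregate sits well below the weight-read ceiling at both concurrency levels, which is consistent with something other than weight bandwidth limiting decode. The sketch cannot say which component that is.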

3. Verdict and reasoning: finish our own P4

Do not adopt #22569 as the base. Reasons:

No win on target hardware. Even fully completed, on GB10 + 32B it is slower than what we already have and capped at the same 256 seqs. There is no throughput or concurrency dividend to harvest here.
Wrong integration surface. Paged is driven only by a brand-new parallel C API (llama_paged_scheduler_init/add_request/prepare_batch/get_batch_info/update/...) and a bespoke examples/paged loop. -kvp/--kv-paged is gated to LLAMA_EXAMPLE_PAGED only - it is NOT wired into llama-server/batched-bench/parallel, i.e. NOT the path LocalAI's grpc-server derives from. Adopting it means rewriting LocalAI's serving loop around the new scheduler API.
Broken / restricted. Crashes out of the box on all current Qwen3 (and any decoupled-head-dim model); fit_params crashed; Phase-1 restrictions enforced at context creation: single CUDA device, full offload only, n_batch == n_ubatch, no SWA (gemma3/llama4/etc. unsupported), no CoW / prefix-caching, no seq_cp/seq_keep/seq_div/seq_add, no state save/load.
Contested draft. Unmerged; the author is openly asking maintainers whether the C API is even the right design; maintainers are skeptical of paged for single-node use.

What P4 should actually target (re-scoped by this data). The aggregate-throughput gap to vLLM on a compute-bound dense model on one GB10 is not addressable by paged KV. The durable, real LocalAI wins from paging are the ones our from-scratch P0 already implements the machinery for and that #22569 explicitly omits:

on-demand KV sizing (fit more diverse concurrent tenants without per-seq over-reservation), and
automatic cross-tenant prefix sharing (chained-hash block cache - shared system prompts / RAG preambles), which #22569 defers to a non-existent Phase 2.
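
As an illustration of the chained-hash idea named above, the following sketch shows how two tenants with the same system prompt end up sharing physical blocks. It is a toy model, not the P0 implementation: `PrefixBlockCache`, `chain_hashes`, the 16-token block size, and the token IDs are all hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_TOKENS = 16  # hypothetical block size, in tokens


def chain_hashes(tokens: List[int], block_tokens: int = BLOCK_TOKENS) -> List[str]:
    """Hash each full block together with its parent's hash, so a block's
    identity depends on the entire prefix before it. Partial tail blocks are skipped."""
    hashes, parent = [], ""
    for start in range(0, len(tokens) - block_tokens + 1, block_tokens):
        block = tokens[start:start + block_tokens]
        payload = parent + "|" + ",".join(str(t) for t in block)
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixBlockCache:
    """Maps chain hashes to physical block ids, with reference counts."""

    def __init__(self) -> None:
        self.blocks: Dict[str, int] = {}
        self.refcount: Dict[int, int] = {}
        self.next_id = 0

    def admit(self, tokens: List[int]) -> Tuple[List[int], int]:
        """Return (block ids for this sequence, number of blocks reused)."""
        ids, reused = [], 0
        for h in chain_hashes(tokens):
            if h in self.blocks:
                block_id = self.blocks[h]
                reused += 1
            else:
                block_id = self.next_id
                self.next_id += 1
                self.blocks[h] = block_id
            self.refcount[block_id] = self.refcount.get(block_id, 0) + 1
            ids.append(block_id)
        return ids, reused


cache = PrefixBlockCache()
system_prompt = list(range(40))        # 40 shared tokens -> 2 full shared blocks
tenant_a = system_prompt + [900] * 10  # made-up token ids
tenant_b = system_prompt + [901] * 10

ids_a, reused_a = cache.admit(tenant_a)
ids_b, reused_b = cache.admit(tenant_b)
print("tenant A blocks:", ids_a, "reused:", reused_a)
print("tenant B blocks:", ids_b, "reused:", reused_b)
```

Because each hash folds in its parent, the first diverging block changes every later hash, so a lookup can never match a block whose preceding context differs (barring hash collisions).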

Finish our own P4 (CPU gather-read + a CUDA gather-read) against these capacity/prefix-sharing objectives, measured as max concurrent distinct tenants and KV memory saved, not single-model aggregate tok/s. To chase raw aggregate, the levers are lifting LLAMA_MAX_SEQ and smaller/MoE models in memory-bandwidth-bound regimes, both orthogonal to paged attention. The ~1-line reshape fix found here is worth upstreaming to #22569 regardless (it also resolved the GB10 fit_params crash), but the PR is not our base.
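
The "KV memory saved" metric can be sketched as a comparison between worst-case per-sequence reservation and block-granular on-demand allocation. The layer and head figures are the Qwen3-32B values from this eval; the f16 cache type, the 16-token block size, the 8192-token reservation, and the tenant mix are all assumptions made up for illustration.

```python
import math

N_LAYER, N_HEAD_KV, HEAD_DIM = 64, 8, 128  # Qwen3-32B figures from this eval
BYTES_PER_ELEM = 2                         # ASSUMPTION: f16 K and V cache
# K and V, for every layer and KV head, per cached token.
KV_BYTES_PER_TOKEN = 2 * N_LAYER * N_HEAD_KV * HEAD_DIM * BYTES_PER_ELEM
GIB = 1024 ** 3


def reserved_bytes(n_seqs: int, max_ctx: int) -> int:
    """Worst-case reservation: every sequence holds max_ctx token slots."""
    return n_seqs * max_ctx * KV_BYTES_PER_TOKEN


def on_demand_bytes(seq_lens, block_tokens: int = 16) -> int:
    """Block-granular allocation: each sequence holds only the blocks it has touched."""
    blocks = sum(math.ceil(n / block_tokens) for n in seq_lens)
    return blocks * block_tokens * KV_BYTES_PER_TOKEN


# Hypothetical tenant mix: many short chats, some mid-length, a few long contexts.
seq_lens = [144] * 200 + [2000] * 50 + [8000] * 6
max_ctx = 8192  # hypothetical per-sequence reservation

reserved = reserved_bytes(len(seq_lens), max_ctx)
used = on_demand_bytes(seq_lens)

print(f"KV per token: {KV_BYTES_PER_TOKEN // 1024} KiB")
print(f"worst-case reservation: {reserved / GIB:.1f} GiB")
print(f"on-demand blocks:       {used / GIB:.1f} GiB")
print(f"KV memory saved:        {1 - used / reserved:.1%}")
```

The saving depends entirely on how skewed the sequence lengths are, so real tenant traces would be needed before quoting any figure.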

Reproduction (DGX, `~/llama.cpp-pr22569`)

export PATH=/usr/local/cuda/bin:$PATH
# contiguous
./build/bin/llama-batched-bench -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -npp 16 -ntg 128 \
  -npl 128 -c 20480 -b 2048 -ub 2048        # 256/512/1024 -> n_seq_max must be <= 256
# paged (needs the src/llama-graph.cpp:2556 reshape fix: hparams.n_embd -> cur->ne[0]*cur->ne[1])
./build/bin/llama-paged -m Qwen3-32B-Q4_K_M.gguf -kvp --fit off -ngpub 2048 -ncpub 128 \
  -np 128 -ns 128 -n 128 -b 2048 -ub 2048 -ngl 99   # 512/1024 -> n_seq_max must be <= 256
