Files
LocalAI/backend/cpp/llama-cpp/patches/paged/PAGED_VLLM_APPLES.md
Ettore Di Giacinto 52f0f7b8cf docs(paged): apples-to-apples paged llama.cpp vs vLLM (batched+NVFP4+prefix cache)
Matched comparison on DGX Spark (GB10, sm_121): batched llama-server with NVFP4
GGUF and the paged engine vs batched vLLM 0.23.0 NVFP4A16 with APC, both eager,
both prefix-cache on. Two findings: (1) the paged cross-request prefix
recompute-skip (patch 0007) does NOT engage in llama-server - it is only reachable
via paged_prefix_api::share/commit, which the server never calls; the server
engages only physical paged block placement plus its own native prompt cache. (2)
With every confounder removed, vLLM is ~6x faster end-to-end (K=16: 8.6s vs 50.7s;
K=32: 8.9s vs 58.3s), decode-bound not prefill-bound: llama ~828ms/decode-step at
batch 32 vs vLLM ~185ms; CUDA graphs are not the differentiator (both eager).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 14:16:52 +00:00

6.2 KiB

Paged llama.cpp vs vLLM - apples-to-apples (batched + NVFP4 + prefix cache)

Definitive matched comparison on a DGX Spark (GB10, sm_121). Both engines batched, both NVFP4-class weights, both with prefix caching on, both eager (no CUDA graphs). Workload: shared 1024-token system prefix + unique 32-token suffix, generate 64 tokens, K requests fired concurrently (cold fan-out), one client hitting both OpenAI-compatible servers with identical token-id prompts.

This run fixes the two confounders in the earlier comparison (a serial Q4_K dev driver vs a batched FP4 vLLM server). Here both sides are batched and NVFP4.

Setup

  • llama.cpp: llama-server built from the paged dev tree (~/llama-paged-dev, branch paged, patches 0001-0007), CUDA build-cuda/ (sm_121). LLAMA_KV_PAGED=1, -ngl 99 --parallel 32 -c 40960, model q3-32b-nvfp4-dense.gguf (NVFP4 weights, FP4-MMA kernel). OpenAI /completion.
  • vLLM 0.23.0: vllm serve q3-32b-nvfp4a16/ (compressed-tensors W4A16 / Marlin), --enforce-eager --max-model-len 4096 --gpu-memory-utilization 0.9 --max-num-seqs 64, APC on (default). OpenAI /v1/completions.

Finding 1 - the paged cross-request prefix cache does NOT engage in llama-server

This is itself a key result. The paged engine has two distinct mechanisms:

  1. Physical paged block placement (patches 0002/0004) - runs inside llama_kv_cache::find_slot, gated only by LLAMA_KV_PAGED. This DOES engage in the server: with LLAMA_KV_PAGED_DEBUG=1, 2 concurrent shared-prefix requests produced 14 [paged-alloc] ... grew lines, one stream per seq.

  2. Cross-request prefix recompute-skip (patch 0007) - the actual fan-out win (shares N prefix blocks ... prefix NOT recomputed, ref-counted block sharing). This is reachable ONLY through paged_prefix_api::share/commit (src/paged-prefix-api.cpp), which only the standalone driver calls.

Evidence it does not reach the server:

  • Static: grep -rn "paged_prefix\|share_prefix\|LLAMA_KV_PAGED" tools/server/ returns nothing; nm on the binary finds no paged_prefix symbol use from the server path. Nothing in llama_decode or the server calls share/commit.
  • Runtime: the 2-request verify run logged 0 shares prefix blocks / NOT recomputed lines. Both seq=0 and seq=1 independently grew to 65 blocks, each allocating and recomputing the full ~972-token prefix separately - no cross-slot KV block sharing, no ref_cnt>1.

So the 0007 recompute-skip, proven in the driver, does not yet reach the server. Closing it needs server-side wiring: when admitting a slot whose prompt shares a prefix with another live/committed slot, the server would have to call the paged_prefix_api::share / commit seam. That is a future patch.

Note: llama-server has its OWN native prefix reuse (the slot prompt cache / "context checkpoints"). In the K=32 wave the server reused the prefix cached by the earlier wave, so prefill was only the 32-token suffix (prompt eval ... / 32 tokens). But that is a separate mechanism, it only helps prefill, and prefill is not the bottleneck here (see below), so it does not change the verdict.

Finding 2 - the matched comparison

Both batched, both NVFP4, both prefix-cache on, both eager. Cold concurrent fan-out, identical token-id prompts via one client.

K engine wall (s) aggregate gen tok/s req/s vLLM speedup
16 llama.cpp 50.7 18.9 0.30 -
16 vLLM 8.57 119.5 1.87 ~5.9x
32 llama.cpp 58.3 34.0 0.53 -
32 vLLM 8.86 231.1 3.61 ~6.6x

vLLM APC confirmed engaged: prefix cache hit rate 90.9% (K=16), 95.5% (K=32), enforce_eager (CUDA graphs disabled), enable_prefix_caching=True.

Verdict: not competitive - vLLM ~6x faster, and prefix caching is not why

With every confounder removed (both batched, both NVFP4, both eager, both with prefix caching on), vLLM is still ~6x faster end-to-end. The gap is decode-bound, not prefill/cache-bound:

  • The G=64 workload is dominated by decode. In the llama K=32 run, decode was 52.98s of the 58.3s wall; prefill was ~3.5s (and only the 32-token suffix, since the server's native prompt cache already reused the prefix). So even perfect prefix sharing - paged or native - cannot move the total much.
  • llama.cpp batched decode: ~828 ms per decode step at batch 32 (1.21 tok/s per sequence).
  • vLLM batched decode: ~170 tok/s aggregate gen at 32 running reqs -> ~185 ms per step, roughly 4-5x faster per decode step.
  • CUDA graphs are NOT the differentiator: both sides are eager (llama graphs reused = 0, vLLM --enforce-eager). The win is vLLM's batched-decode efficiency: PagedAttention + fused W4A16 (Marlin) GEMMs + chunked-prefill scheduler, versus llama.cpp's per-step eager graph and NVFP4-GGUF decode path on this Blackwell-class part.

Because decode dominates, wiring the paged 0007 recompute-skip into the server (Finding 1) would mainly remove redundant prefill across slots - a real saving for short-generation / long-prefix RAG fan-out, but at G=64 it is a few seconds against a decode floor that is already ~6x slower than vLLM. The fan-out win does not, on its own, make llama.cpp competitive here; the decode kernel/batching gap is the load-bearing factor.

Caveats

  • NVFP4-GGUF is double-quant and is speed-representative (it routes onto the FP4-MMA kernel); output quality is not the subject of this run.
  • vLLM side is NVFP4A16 (W4A16 / Marlin) - 4-bit weights, 16-bit activations; llama side is NVFP4 weights on FP4-MMA. Both are NVFP4-weight class.
  • One llama request per run hit an intermittent HTTP 500 ("output does not match the expected Content-only format" - a Qwen3 thinking-output quirk on /completion), so llama counts were 15/16 and 31/32. The failed request returns early and reduces batch contention for the rest, so a clean 16/16 / 32/32 llama run would be marginally slower - i.e. the ~6x gap reported here is conservative (favorable to llama.cpp).
  • Both servers cold-started; numbers are end-to-end wall from the concurrent client. Disk healthy (~325 GB free), GPU otherwise idle.