Matched comparison on DGX Spark (GB10, sm_121): batched llama-server with NVFP4 GGUF and the paged engine vs batched vLLM 0.23.0 NVFP4A16 with APC, both eager, both prefix-cache on. Two findings: (1) the paged cross-request prefix recompute-skip (patch 0007) does NOT engage in llama-server - it is only reachable via paged_prefix_api::share/commit, which the server never calls; the server engages only physical paged block placement plus its own native prompt cache. (2) With every confounder removed, vLLM is ~6x faster end-to-end (K=16: 8.6s vs 50.7s; K=32: 8.9s vs 58.3s), decode-bound not prefill-bound: llama ~828ms/decode-step at batch 32 vs vLLM ~185ms; CUDA graphs are not the differentiator (both eager). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
6.2 KiB
Paged llama.cpp vs vLLM - apples-to-apples (batched + NVFP4 + prefix cache)
Definitive matched comparison on a DGX Spark (GB10, sm_121). Both engines batched, both NVFP4-class weights, both with prefix caching on, both eager (no CUDA graphs). Workload: shared 1024-token system prefix + unique 32-token suffix, generate 64 tokens, K requests fired concurrently (cold fan-out), one client hitting both OpenAI-compatible servers with identical token-id prompts.
This run fixes the two confounders in the earlier comparison (a serial Q4_K dev driver vs a batched FP4 vLLM server). Here both sides are batched and NVFP4.
Setup
- llama.cpp:
llama-serverbuilt from the paged dev tree (~/llama-paged-dev, branchpaged, patches 0001-0007), CUDAbuild-cuda/(sm_121).LLAMA_KV_PAGED=1,-ngl 99 --parallel 32 -c 40960, modelq3-32b-nvfp4-dense.gguf(NVFP4 weights, FP4-MMA kernel). OpenAI/completion. - vLLM 0.23.0:
vllm serve q3-32b-nvfp4a16/(compressed-tensors W4A16 / Marlin),--enforce-eager --max-model-len 4096 --gpu-memory-utilization 0.9 --max-num-seqs 64, APC on (default). OpenAI/v1/completions.
Finding 1 - the paged cross-request prefix cache does NOT engage in llama-server
This is itself a key result. The paged engine has two distinct mechanisms:
-
Physical paged block placement (patches 0002/0004) - runs inside
llama_kv_cache::find_slot, gated only byLLAMA_KV_PAGED. This DOES engage in the server: withLLAMA_KV_PAGED_DEBUG=1, 2 concurrent shared-prefix requests produced 14[paged-alloc] ... grewlines, one stream perseq. -
Cross-request prefix recompute-skip (patch 0007) - the actual fan-out win (
shares N prefix blocks ... prefix NOT recomputed, ref-counted block sharing). This is reachable ONLY throughpaged_prefix_api::share/commit(src/paged-prefix-api.cpp), which only the standalone driver calls.
Evidence it does not reach the server:
- Static:
grep -rn "paged_prefix\|share_prefix\|LLAMA_KV_PAGED" tools/server/returns nothing;nmon the binary finds nopaged_prefixsymbol use from the server path. Nothing inllama_decodeor the server callsshare/commit. - Runtime: the 2-request verify run logged 0
shares prefix blocks/NOT recomputedlines. Bothseq=0andseq=1independently grew to 65 blocks, each allocating and recomputing the full ~972-token prefix separately - no cross-slot KV block sharing, noref_cnt>1.
So the 0007 recompute-skip, proven in the driver, does not yet reach the
server. Closing it needs server-side wiring: when admitting a slot whose prompt
shares a prefix with another live/committed slot, the server would have to call
the paged_prefix_api::share / commit seam. That is a future patch.
Note: llama-server has its OWN native prefix reuse (the slot prompt cache /
"context checkpoints"). In the K=32 wave the server reused the prefix cached by the
earlier wave, so prefill was only the 32-token suffix (prompt eval ... / 32 tokens). But that is a separate mechanism, it only helps prefill, and prefill is
not the bottleneck here (see below), so it does not change the verdict.
Finding 2 - the matched comparison
Both batched, both NVFP4, both prefix-cache on, both eager. Cold concurrent fan-out, identical token-id prompts via one client.
| K | engine | wall (s) | aggregate gen tok/s | req/s | vLLM speedup |
|---|---|---|---|---|---|
| 16 | llama.cpp | 50.7 | 18.9 | 0.30 | - |
| 16 | vLLM | 8.57 | 119.5 | 1.87 | ~5.9x |
| 32 | llama.cpp | 58.3 | 34.0 | 0.53 | - |
| 32 | vLLM | 8.86 | 231.1 | 3.61 | ~6.6x |
vLLM APC confirmed engaged: prefix cache hit rate 90.9% (K=16), 95.5% (K=32),
enforce_eager (CUDA graphs disabled), enable_prefix_caching=True.
Verdict: not competitive - vLLM ~6x faster, and prefix caching is not why
With every confounder removed (both batched, both NVFP4, both eager, both with prefix caching on), vLLM is still ~6x faster end-to-end. The gap is decode-bound, not prefill/cache-bound:
- The G=64 workload is dominated by decode. In the llama K=32 run, decode was 52.98s of the 58.3s wall; prefill was ~3.5s (and only the 32-token suffix, since the server's native prompt cache already reused the prefix). So even perfect prefix sharing - paged or native - cannot move the total much.
- llama.cpp batched decode: ~828 ms per decode step at batch 32 (1.21 tok/s per sequence).
- vLLM batched decode: ~170 tok/s aggregate gen at 32 running reqs -> ~185 ms per step, roughly 4-5x faster per decode step.
- CUDA graphs are NOT the differentiator: both sides are eager (llama
graphs reused = 0, vLLM--enforce-eager). The win is vLLM's batched-decode efficiency: PagedAttention + fused W4A16 (Marlin) GEMMs + chunked-prefill scheduler, versus llama.cpp's per-step eager graph and NVFP4-GGUF decode path on this Blackwell-class part.
Because decode dominates, wiring the paged 0007 recompute-skip into the server (Finding 1) would mainly remove redundant prefill across slots - a real saving for short-generation / long-prefix RAG fan-out, but at G=64 it is a few seconds against a decode floor that is already ~6x slower than vLLM. The fan-out win does not, on its own, make llama.cpp competitive here; the decode kernel/batching gap is the load-bearing factor.
Caveats
- NVFP4-GGUF is double-quant and is speed-representative (it routes onto the FP4-MMA kernel); output quality is not the subject of this run.
- vLLM side is NVFP4A16 (W4A16 / Marlin) - 4-bit weights, 16-bit activations; llama side is NVFP4 weights on FP4-MMA. Both are NVFP4-weight class.
- One llama request per run hit an intermittent HTTP 500 ("output does not match
the expected Content-only format" - a Qwen3 thinking-output quirk on
/completion), so llama counts were 15/16 and 31/32. The failed request returns early and reduces batch contention for the rest, so a clean 16/16 / 32/32 llama run would be marginally slower - i.e. the ~6x gap reported here is conservative (favorable to llama.cpp). - Both servers cold-started; numbers are end-to-end wall from the concurrent client. Disk healthy (~325 GB free), GPU otherwise idle.