From 52f0f7b8cf0e9c7c144e207a631d43ef687c96c8 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Mon, 22 Jun 2026 14:16:52 +0000
Subject: [PATCH] docs(paged): apples-to-apples paged llama.cpp vs vLLM (batched+NVFP4+prefix cache)

Matched comparison on DGX Spark (GB10, sm_121): batched llama-server with
NVFP4 GGUF and the paged engine vs batched vLLM 0.23.0 NVFP4A16 with APC,
both eager, both prefix-cache on.

Two findings: (1) the paged cross-request prefix recompute-skip (patch 0007)
does NOT engage in llama-server - it is only reachable via
paged_prefix_api::share/commit, which the server never calls; the server
engages only physical paged block placement plus its own native prompt
cache. (2) With every confounder removed, vLLM is ~6x faster end-to-end
(K=16: 8.6s vs 50.7s; K=32: 8.9s vs 58.3s), decode-bound not prefill-bound:
llama ~828ms/decode-step at batch 32 vs vLLM ~185ms; CUDA graphs are not
the differentiator (both eager).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 .../patches/paged/PAGED_VLLM_APPLES.md        | 111 ++++++++++++++++++
 1 file changed, 111 insertions(+)
 create mode 100644 backend/cpp/llama-cpp/patches/paged/PAGED_VLLM_APPLES.md

diff --git a/backend/cpp/llama-cpp/patches/paged/PAGED_VLLM_APPLES.md b/backend/cpp/llama-cpp/patches/paged/PAGED_VLLM_APPLES.md
new file mode 100644
index 000000000..be85a82a5
--- /dev/null
+++ b/backend/cpp/llama-cpp/patches/paged/PAGED_VLLM_APPLES.md
@@ -0,0 +1,111 @@
+# Paged llama.cpp vs vLLM - apples-to-apples (batched + NVFP4 + prefix cache)
+
+Definitive matched comparison on a DGX Spark (GB10, sm_121). Both engines batched,
+both NVFP4-class weights, both with prefix caching on, both eager (no CUDA graphs).
+Workload: shared 1024-token system prefix + unique 32-token suffix, generate 64
+tokens, K requests fired concurrently (cold fan-out), one client hitting both
+OpenAI-compatible servers with identical token-id prompts.
+
+This run fixes the two confounders in the earlier comparison (a *serial* Q4_K dev
+driver vs a *batched* FP4 vLLM server). Here both sides are batched and NVFP4.
+
+## Setup
+
+- llama.cpp: `llama-server` built from the paged dev tree (`~/llama-paged-dev`,
+  branch `paged`, patches 0001-0007), CUDA `build-cuda/` (sm_121).
+  `LLAMA_KV_PAGED=1`, `-ngl 99 --parallel 32 -c 40960`, model
+  `q3-32b-nvfp4-dense.gguf` (NVFP4 weights, FP4-MMA kernel). OpenAI `/completion`.
+- vLLM 0.23.0: `vllm serve q3-32b-nvfp4a16/` (compressed-tensors W4A16 / Marlin),
+  `--enforce-eager --max-model-len 4096 --gpu-memory-utilization 0.9
+  --max-num-seqs 64`, APC on (default). OpenAI `/v1/completions`.
+
+## Finding 1 - the paged cross-request prefix cache does NOT engage in llama-server
+
+This is itself a key result. The paged engine has two distinct mechanisms:
+
+1. Physical paged block placement (patches 0002/0004) - runs inside
+   `llama_kv_cache::find_slot`, gated only by `LLAMA_KV_PAGED`. This DOES engage in
+   the server: with `LLAMA_KV_PAGED_DEBUG=1`, 2 concurrent shared-prefix requests
+   produced 14 `[paged-alloc] ... grew` lines, one stream per `seq`.
+
+2. Cross-request prefix recompute-skip (patch 0007) - the actual fan-out win
+   (`shares N prefix blocks ... prefix NOT recomputed`, ref-counted block sharing).
+   This is reachable ONLY through `paged_prefix_api::share/commit`
+   (`src/paged-prefix-api.cpp`), which only the standalone driver calls.
+
+Evidence it does not reach the server:
+- Static: `grep -rn "paged_prefix\|share_prefix\|LLAMA_KV_PAGED" tools/server/`
+  returns nothing; `nm` on the binary finds no `paged_prefix` symbol use from the
+  server path. Nothing in `llama_decode` or the server calls `share`/`commit`.
+- Runtime: the 2-request verify run logged **0** `shares prefix blocks` /
+  `NOT recomputed` lines. Both `seq=0` and `seq=1` independently grew to 65 blocks,
+  each allocating and recomputing the full ~972-token prefix separately - no
+  cross-slot KV block sharing, no `ref_cnt>1`.
+
+So the 0007 recompute-skip, proven in the driver, does **not** yet reach the
+server. Closing it needs server-side wiring: when admitting a slot whose prompt
+shares a prefix with another live/committed slot, the server would have to call
+the `paged_prefix_api::share` / `commit` seam. That is a future patch.
+
+Note: llama-server has its OWN native prefix reuse (the slot prompt cache /
+"context checkpoints"). In the K=32 wave the server reused the prefix cached by the
+earlier wave, so prefill was only the 32-token suffix (`prompt eval ... / 32
+tokens`). But that is a separate mechanism, it only helps prefill, and prefill is
+not the bottleneck here (see below), so it does not change the verdict.
+
+## Finding 2 - the matched comparison
+
+Both batched, both NVFP4, both prefix-cache on, both eager. Cold concurrent fan-out,
+identical token-id prompts via one client.
+
+| K  | engine    | wall (s) | aggregate gen tok/s | req/s | vLLM speedup |
+|----|-----------|----------|---------------------|-------|--------------|
+| 16 | llama.cpp | 50.7     | 18.9                | 0.30  | -            |
+| 16 | vLLM      | 8.57     | 119.5               | 1.87  | ~5.9x        |
+| 32 | llama.cpp | 58.3     | 34.0                | 0.53  | -            |
+| 32 | vLLM      | 8.86     | 231.1               | 3.61  | ~6.6x        |
+
+vLLM APC confirmed engaged: prefix cache hit rate 90.9% (K=16), 95.5% (K=32),
+enforce_eager (CUDA graphs disabled), `enable_prefix_caching=True`.
+
+### Verdict: not competitive - vLLM ~6x faster, and prefix caching is not why
+
+With every confounder removed (both batched, both NVFP4, both eager, both with
+prefix caching on), vLLM is still ~6x faster end-to-end. The gap is decode-bound,
+not prefill/cache-bound:
+
+- The G=64 workload is dominated by decode. In the llama K=32 run, decode was
+  52.98s of the 58.3s wall; prefill was ~3.5s (and only the 32-token suffix, since
+  the server's native prompt cache already reused the prefix). So even perfect
+  prefix sharing - paged or native - cannot move the total much.
+- llama.cpp batched decode: **~828 ms per decode step** at batch 32
+  (1.21 tok/s per sequence).
+- vLLM batched decode: ~170 tok/s aggregate gen at 32 running reqs ->
+  **~185 ms per step**, roughly **4-5x faster per decode step**.
+- CUDA graphs are NOT the differentiator: both sides are eager (llama
+  `graphs reused = 0`, vLLM `--enforce-eager`). The win is vLLM's batched-decode
+  efficiency: PagedAttention + fused W4A16 (Marlin) GEMMs + chunked-prefill
+  scheduler, versus llama.cpp's per-step eager graph and NVFP4-GGUF decode path on
+  this Blackwell-class part.
+
+Because decode dominates, wiring the paged 0007 recompute-skip into the server
+(Finding 1) would mainly remove redundant prefill across slots - a real saving for
+short-generation / long-prefix RAG fan-out, but at G=64 it is a few seconds against
+a decode floor that is already ~6x slower than vLLM. The fan-out win does not, on
+its own, make llama.cpp competitive here; the decode kernel/batching gap is the
+load-bearing factor.
+
+## Caveats
+
+- NVFP4-GGUF is double-quant and is speed-representative (it routes onto the
+  FP4-MMA kernel); output quality is not the subject of this run.
+- vLLM side is NVFP4A16 (W4A16 / Marlin) - 4-bit weights, 16-bit activations;
+  llama side is NVFP4 weights on FP4-MMA. Both are NVFP4-weight class.
+- One llama request per run hit an intermittent HTTP 500 ("output does not match
+  the expected Content-only format" - a Qwen3 thinking-output quirk on
+  `/completion`), so llama counts were 15/16 and 31/32. The failed request returns
+  early and reduces batch contention for the rest, so a clean 16/16 / 32/32 llama
+  run would be marginally slower - i.e. the ~6x gap reported here is conservative
+  (favorable to llama.cpp).
+- Both servers cold-started; numbers are end-to-end wall from the concurrent
+  client. Disk healthy (~325 GB free), GPU otherwise idle.
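
The throughput, speedup, and per-step figures in the report can be re-derived from its own numbers. The following is a minimal Python sketch of that arithmetic, not part of the patch and not a new measurement. It rests on two assumptions that the report does not state outright: that the llama.cpp throughput columns count only the completed requests (15/16 and 31/32), and that one decode step emits one token per running sequence. All names in the snippet are hypothetical.

```python
# Re-derive the headline figures from the numbers reported in this document.
# Nothing here is newly measured; every constant is copied from the report.

GEN_TOKENS = 64  # tokens generated per request (G=64)

# (K, engine) -> (wall seconds, completed requests)
# ASSUMPTION: llama.cpp throughput counts only completed requests (15/16, 31/32).
RUNS = {
    (16, "llama.cpp"): (50.7, 15),
    (16, "vLLM"): (8.57, 16),
    (32, "llama.cpp"): (58.3, 31),
    (32, "vLLM"): (8.86, 32),
}


def aggregate_tok_per_s(k, engine):
    """Generated tokens per second summed over all completed requests."""
    wall_s, completed = RUNS[(k, engine)]
    return completed * GEN_TOKENS / wall_s


def requests_per_s(k, engine):
    """Completed requests per second of end-to-end wall time."""
    wall_s, completed = RUNS[(k, engine)]
    return completed / wall_s


def vllm_speedup(k):
    """End-to-end wall-time ratio, llama.cpp over vLLM, at fan-out K."""
    return RUNS[(k, "llama.cpp")][0] / RUNS[(k, "vLLM")][0]


# ASSUMPTION: one decode step emits one token per running sequence.
LLAMA_DECODE_S = 52.98  # decode share of the 58.3 s llama.cpp K=32 wall
llama_ms_per_step = LLAMA_DECODE_S / GEN_TOKENS * 1000.0
llama_tok_per_s_per_seq = 1000.0 / llama_ms_per_step

VLLM_AGG_TOK_PER_S = 170.0  # reported aggregate gen rate at 32 running reqs
VLLM_RUNNING_REQS = 32
vllm_ms_per_step = VLLM_RUNNING_REQS / VLLM_AGG_TOK_PER_S * 1000.0

if __name__ == "__main__":
    for k, engine in RUNS:
        print(k, engine,
              round(aggregate_tok_per_s(k, engine), 1), "tok/s,",
              round(requests_per_s(k, engine), 2), "req/s")
    for k in (16, 32):
        print("K =", k, "vLLM speedup:", round(vllm_speedup(k), 1))
    print("llama.cpp ms/step:", round(llama_ms_per_step))
    print("vLLM ms/step:", round(vllm_ms_per_step))
    print("per-step ratio:", round(llama_ms_per_step / vllm_ms_per_step, 1))
```

Under these assumptions the table values are reproduced to within rounding, and the vLLM step time comes out at about 188 ms, consistent with the rounded ~185 ms quoted in the report and with the 4-5x per-step ratio.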