Files
LocalAI/backend/cpp
Ettore Di Giacinto 52f0f7b8cf docs(paged): apples-to-apples paged llama.cpp vs vLLM (batched+NVFP4+prefix cache)
Matched comparison on DGX Spark (GB10, sm_121): batched llama-server with NVFP4
GGUF and the paged engine vs batched vLLM 0.23.0 NVFP4A16 with APC, both eager,
both prefix-cache on. Two findings: (1) the paged cross-request prefix
recompute-skip (patch 0007) does NOT engage in llama-server - it is only reachable
via paged_prefix_api::share/commit, which the server never calls; the server
engages only physical paged block placement plus its own native prompt cache. (2)
With every confounder removed, vLLM is ~6x faster end-to-end (K=16: 8.6s vs 50.7s;
K=32: 8.9s vs 58.3s), decode-bound not prefill-bound: llama ~828ms/decode-step at
batch 32 vs vLLM ~185ms; CUDA graphs are not the differentiator (both eager).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 14:16:52 +00:00
..