LocalAI/backend/cpp/llama-cpp/patches/paged
Ettore Di Giacinto f347f7ca1d docs(paged): stock GPU batch-shape determinism + vLLM shared-prefix comparison
Two closing measurements on DGX Spark (GB10, sm_121):

1. Stock GPU determinism (no paging): with LLAMA_KV_PAGED unset, stock
   llama.cpp produces a different greedy token stream when the same prompt
   is decoded in a full-prefill batch vs a split (prefix-then-suffix) batch.
   At G=24 the generated stream diverges on 1 of 5 prompts on CPU and on
   2 of 5 on CUDA (and at an earlier token on CUDA). This confirms the
   patch-0007 GPU byte-identity failure is stock floating-point
   batch-shape non-determinism, not a paged bug. CPU exhibits it too,
   just less often, which is why 0007's short CPU scenarios passed 16/16
   while the CUDA run flipped.

2. vLLM vs llama.cpp+paged on a shared-prefix fan-out (K requests share
   a 1024-token prefix, each with a unique 32-token suffix, gen 64). The
   llama.cpp+paged prefix cache gives a 7.15x (K=16) / 10.3x (K=32)
   prefill reduction vs its no-share baseline, the same cross-request
   prefix skip that vLLM's APC provides (97% hit rate confirmed).
   Head-to-head on cached prefill, vLLM is ~5x faster (Q4_K_M vs
   nvfp4a16 quant, vLLM on FP4 emulation + eager), and the gap is wider
   end-to-end due to continuous batched decode. Competitive in kind,
   behind in absolute terms on this hardware.
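
Illustration for measurement 1 (not taken from the test harness): a
minimal Python sketch of why batch shape alone can flip a greedy token.
Float addition is not associative, so regrouping the same terms can
change a logit by one ulp, and greedy argmax on a near-tie then picks a
different token. The names first_divergence and greedy are hypothetical
helpers, and the logit values are contrived to force the tie.

```python
def first_divergence(a, b):
    """Index of the first differing token, or None if the streams match."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One stream is a strict prefix of the other: they diverge at its end.
    return None if len(a) == len(b) else min(len(a), len(b))


def greedy(logits):
    """Greedy decode step: index of the largest logit (first one on ties)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Same three terms, two groupings (stand-ins for two batch shapes).
full_batch = (0.1 + 0.2) + 0.3   # 0.6000000000000001
split_batch = 0.1 + (0.2 + 0.3)  # 0.6

# A competing logit of exactly 0.6 turns that one-ulp gap into a token flip.
token_full = greedy([0.6, full_batch])    # picks index 1
token_split = greedy([0.6, split_batch])  # tie, picks index 0
```

Comparing two generated streams with first_divergence gives the token
position at which a full-prefill run and a split run part ways.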

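Illustration for measurement 2 (not taken from the benchmark): a rough
Python sketch of the idealized prefill token counts for the fan-out
shape above. It assumes the shared prefix is computed exactly once and
every later request only prefills its own suffix; the function names
are hypothetical. This is a token-count bound only, and the measured
7.15x / 10.3x figures sit below it.

```python
def prefill_tokens(k, prefix_len, suffix_len, share_prefix):
    """Total prompt tokens prefilled across k requests."""
    if share_prefix:
        # Assumption: prefix is prefilled once, then reused by all requests.
        return prefix_len + k * suffix_len
    # No sharing: every request prefills prefix + suffix from scratch.
    return k * (prefix_len + suffix_len)


def ideal_reduction(k, prefix_len=1024, suffix_len=32):
    """Ratio of no-share to shared prefill tokens (an upper bound)."""
    no_share = prefill_tokens(k, prefix_len, suffix_len, share_prefix=False)
    shared = prefill_tokens(k, prefix_len, suffix_len, share_prefix=True)
    return no_share / shared


bound_k16 = ideal_reduction(16)  # 16896 / 1536 = 11.0
bound_k32 = ideal_reduction(32)  # 33792 / 2048 = 16.5
```
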
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 13:48:01 +00:00