Files
LocalAI/backend/cpp/llama-cpp/patches/paged
Ettore Di Giacinto fccbb4082d docs(paged): ground vLLM 0.23.0 eager-decode architecture vs llama.cpp
Decompose vLLM's enforce_eager decode step (attention / weight GEMM /
sampling / host loop) on GB10 (DGX Spark, sm_121) and attribute the
measured ~2.4x NVFP4 decode-throughput gap to its parts, from source
reading plus the existing nsys decode trace and H2H bench logs.

Key finding: the gap is dominantly a KERNEL-efficiency gap (~80-90%),
not a host-overhead gap. llama's GPU is already ~94.6% busy during
steady decode, so a CUDA-graphed decode is a minority lever (~10-20%
of the gap, bounded by the GPU-idle bubble), not the silver bullet.
vLLM's wins: in-kernel paged-decode read (no gather tax), faster
long-context attention, fused native-FP4 / grouped-Marlin GEMM, and
O(1)-in-ctx GDN linear-attention layers on these Qwen3.6 hybrids.
vLLM achieved 2.4x with synchronous scheduling and no CUDA graphs.

Evidence: vllm 0.23.0 source (gpu_model_runner, flash_attn/gdn
backends, modelopt/marlin GEMM, v1/sample), reproduced nsys kernel
categorization (cat2.py), and QWEN36_NVFP4_BENCH / DECODE_GAP_STUDY /
CONTINUOUS_BATCH_SCHEDULER_SCOPE.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-24 07:44:07 +00:00
..