From cb28deda6b41e71238f2ce534502ea099b2b7238 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Fri, 19 Jun 2026 23:27:35 +0000
Subject: [PATCH] bench(paged): decode profile overturns 'engine-addressable' -
 decode is 54.6% MoE GEMM too

Decode-dominated B=64 nsys: mul_mat_q<MXFP4> 54.6%, attention only 19.8%. Both
phases are FP4-MoE-kernel-bound (Lever 3). The paged series cannot close the vLLM
gap in either phase; its real value is capacity + prefix-sharing, not tok/s parity.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 backend/cpp/llama-cpp/patches/BENCHMARKS.md | 24 +++++++++++++++++++++
 1 file changed, 24 insertions(+)
diff --git a/backend/cpp/llama-cpp/patches/BENCHMARKS.md b/backend/cpp/llama-cpp/patches/BENCHMARKS.md
index 37c331902..3096aaeab 100644
--- a/backend/cpp/llama-cpp/patches/BENCHMARKS.md
+++ b/backend/cpp/llama-cpp/patches/BENCHMARKS.md
@@ -24,6 +24,30 @@ prior run, same box/model). S_PP / S_TG are aggregate prefill / decode tok/s acr
    engine's domain — vLLM's block-paged KV + continuous batching pack more concurrent decode work per step.
    **This is what patches 0003–0006 target.** The win here is realistic; the prefill win is not (kernel).
 
+## CORRECTION — decode-phase profile (B=64, decode-dominated nsys)
+
+The "decode gap is engine-addressable" read above was **wrong**. Profiling a decode-dominated B=64 run:
+
+| kernel | % GPU time |
+|---|---|
+| `mul_mat_q<MXFP4>` (MoE GEMM) | **54.6** |
+| `flash_attn_ext` (attention) | 19.8 |
+| `mul_mat_q<Q8>` (dense) | 10.9 |
+| KV writes / quant / norms / rest | ~15 |
+
+**Decode at concurrency is ALSO dominated by the FP4 MoE GEMM (54.6%)** — the same Lever-3 kernel as prefill.
+Attention (the only thing paging optimizes) is ~20%: at most a 1/(1 - 0.198) ≈ 1.25x gain even if it were free. The gather-read
+reclaims only the *masked-cell* fraction of that. So **the paged series (0003–0006) cannot close the vLLM gap in either
+phase**: both are MoE-kernel-bound. vLLM's concurrency advantage is its MoE/attention *kernels*, not (mainly) its KV management.
+
+### What the paged series IS still good for (just not throughput parity)
+
+- **Capacity**: block-granular + on-demand allocation → fit more/longer concurrent sequences in fixed VRAM.
+- **Prefix sharing**: cross-request block dedup → lower TTFT + memory on shared system prompts / RAG.
+
+These are real wins on *memory-pressured* and *shared-prefix* workloads, but they do not deliver tok/s parity, and
+a batched-bench run (fresh, non-fragmented, no shared prefix) won't show them.
+
 ## So, honestly, where parity stands
 
 - **Decode single-stream: already at/above parity** (B=1: 83 vs 48).