diff --git a/backend/cpp/llama-cpp/patches/BENCHMARKS.md b/backend/cpp/llama-cpp/patches/BENCHMARKS.md
index 37c331902..3096aaeab 100644
--- a/backend/cpp/llama-cpp/patches/BENCHMARKS.md
+++ b/backend/cpp/llama-cpp/patches/BENCHMARKS.md
@@ -24,6 +24,30 @@ prior run, same box/model). S_PP / S_TG are aggregate prefill / decode tok/s acr
 engine's domain — vLLM's block-paged KV + continuous batching pack more concurrent decode work per step.
 **This is what patches 0003–0006 target.** The win here is realistic; the prefill win is not (kernel).
 
+## CORRECTION — decode-phase profile (B=64, decode-dominated nsys)
+
+The "decode gap is engine-addressable" read above was **wrong**. Profiling a decode-dominated B=64 run:
+
+| kernel | % GPU time |
+|---|---|
+| `mul_mat_q` (MoE GEMM) | **54.6** |
+| `flash_attn_ext` (attention) | 19.8 |
+| `mul_mat_q` (dense) | 10.9 |
+| KV writes / quant / norms / rest | ~15 |
+
+**Decode at concurrency is ALSO dominated by the FP4 MoE GEMM (54.6%)** — the same Lever-3 kernel as prefill.
+Attention (the only thing paging optimizes) is ~20%, and the gather-read reclaims only the *masked-cell*
+fraction of that. So **the paged series (0003–0006) cannot close the vLLM gap in either phase** — both are
+MoE-kernel-bound. vLLM's concurrency advantage is its MoE/attention *kernels*, not (mainly) its KV management.
+
+### What the paged series IS still good for (just not throughput parity)
+
+- **Capacity**: block-granular + on-demand allocation → fit more/longer concurrent sequences in fixed VRAM.
+- **Prefix sharing**: cross-request block dedup → lower TTFT + memory on shared system prompts / RAG.
+
+These are real wins on *memory-pressured* and *shared-prefix* workloads — but they are not tok/s parity, and
+batched-bench (fresh, non-fragmented, no shared prefix) won't show them.
+
 ## So, honestly, where parity stands
 
 - **Decode single-stream: already at/above parity** (B=1: 83 vs 48).