From 37cbc089b05e2dc9e8adbfd5d1c8e4d1efac97b5 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Sat, 20 Jun 2026 03:55:58 +0000
Subject: [PATCH] bench(dense): Qwen3-32B dense parity - dense has the kernel gap too (PP 7.6-32x)

vLLM W4A16 vs llama Q4_K_M dense: prefill 7.6-32x behind (llama plateaus
~765, vLLM scales to 24.4k); decode ~parity at B=1
(weight-bandwidth-bound), 2.2x at B=64. Full NVFP4 (W4A4) hangs on this
vLLM/GB10 stack - W4A16 used. Decision: the Lever-3 kernel track must
ALSO deliver a non-grouped FP4 dense GEMM, not just the MoE grouped GEMM
(dense GEMM is the simpler first kernel to land).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 backend/cpp/llama-cpp/patches/BENCHMARKS.md | 28 +++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/backend/cpp/llama-cpp/patches/BENCHMARKS.md b/backend/cpp/llama-cpp/patches/BENCHMARKS.md
index 3096aaeab..e4cd79632 100644
--- a/backend/cpp/llama-cpp/patches/BENCHMARKS.md
+++ b/backend/cpp/llama-cpp/patches/BENCHMARKS.md
@@ -48,6 +48,34 @@ MoE-kernel-bound. vLLM's concurrency advantage is its MoE/attention *kernels*, n
 These are real wins on *memory-pressured* and *shared-prefix* workloads — but they are not tok/s parity, and
 batched-bench (fresh, non-fragmented, no shared prefix) won't show them.
 
+## DENSE model parity (Qwen3-32B) — does the kernel gap exist for dense too? YES.
+
+The MoE work above is about the grouped MoE GEMM. Dense models use a different (non-grouped) matmul path,
+so we benchmarked a dense 32B head-to-head. vLLM `RedHatAI/Qwen3-32B-NVFP4` (full NVFP4) **hangs on this
+GB10 / vLLM 0.23.0 stack** (deadlocks right after weight-load, 0–3% GPU, no error, both eager + CUDA-graph),
+so we used the **W4A16** variant (`Qwen3-32B-NVFP4A16`, 4-bit weights / FP16 activations, FlashInfer marlin
+kernel) vs llama.cpp `Qwen3-32B-Q4_K_M` (4-bit weights / int8-MMQ compute). Both 4-bit weights — a fair
+weight-quant comparison; the difference is the compute kernel.
+
+| B | llama Q4_K_M PP | vLLM W4A16 PP | PP gap | llama decode | vLLM decode | TG gap |
+|---|---|---|---|---|---|---|
+| 1 | 708 | 5367 | 7.6× | 10.2 | 11.7 | ~parity |
+| 8 | 761 | 14941 | 20× | 58 | 92 | 1.6× |
+| 32 | 763 | 21952 | 29× | 205 | 330 | 1.6× |
+| 64 | 765 | 24444 | 32× | 253 | 569 | 2.2× |
+
+**Findings:**
+1. **Dense prefill has the SAME (larger) kernel gap.** llama dense prefill plateaus at ~765 t/s regardless of
+   B; vLLM scales to 24.4k (32×). llama's dense matmul is int8-MMQ; vLLM uses an FP4 (marlin/cutlass) GEMM.
+   And this is a *lower bound* — full NVFP4 (W4A4) would be faster still (it hung, so we couldn't measure it).
+2. **Decode is ~parity at B=1** (10.2 vs 11.7 — both weight-bandwidth-bound reading 4-bit weights), and the
+   gap grows with batch (compute starts to matter → the kernel gap reappears: 2.2× at B=64).
+3. **Scope decision (the reason for this benchmark): the Lever-3 kernel track must also deliver a NON-grouped
+   block-scaled FP4 GEMM for dense**, not only the MoE grouped GEMM. The dense GEMM is the simpler of the two
+   (a plain CUTLASS dense GEMM), so it's a good first kernel to land — and it benefits every dense model.
+4. **Aside:** full NVFP4 (W4A4) is currently unusable for dense on this vLLM/GB10 build — worth revisiting
+   on a newer vLLM, and a point in llama.cpp's favor (its 4-bit dense path at least *runs*).
+
 ## So, honestly, where parity stands
 
 - **Decode single-stream: already at/above parity** (B=1: 83 vs 48).