From 9f16a907be726576142d6afb0f168b2e08ff9cca Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Fri, 19 Jun 2026 20:56:46 +0000
Subject: [PATCH] docs(paged): Lever 3 profiled + Q4/MXFP4 findings,
 auto-ubatch shipped

Prefill doesn't scale with bigger single prompts (attention O(N^2)); real gap
is batched MoE prefill (B=32: 27x vs vLLM, ~22 effective TFLOP/s). nsys pins
Lever 3 target: mul_mat_q<MXFP4> MoE GEMM 37% + un-fused act-quant 8%; native
FP4 MMA already engaged, inefficiency is the per-expert thin-tile scheduler.
Q4_K_M matches MXFP4 on decode (decode win is generic 4-bit); MXFP4's only edge
is prefill. Auto-ubatch=2048 on Blackwell shipped (PR #10411).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md b/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md
index c49c95bfa..72ca6e588 100644
--- a/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md
+++ b/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md
@@ -105,7 +105,20 @@ MXFP4's *only* real edge is **prefill (+41% over Q4_K_M)** via Blackwell FP4 ten
 **"4-bit quant + ubatch=2048" captures most of the win portably**; MXFP4 is a Blackwell-only prefill extra.
 
 ### Lever 3 — Fused FP4/FP8 MoE grouped GEMM (+ activation-quant fusion)
-Status: **DESIGNED, not built** (multi-week kernel R&D). This is the single biggest remaining prefill win.
+Status: **DESIGNED + PROFILED, not built** (multi-week kernel R&D). The single biggest remaining prefill win.
+
+**Decisive measurements:**
+- Prefill does NOT scale with bigger single prompts (attention O(N²) confounds): MXFP4 pp2048=3295, pp8192=1524,
+  pp16384=2051. So the plateau is not a batch-size fix.
+- Real gap is batched many-sequence prefill: B=32 llama 3651 vs vLLM 99398 = **27×**. llama.cpp MoE prefill runs
+  at only **~22 effective TFLOP/s** on the GB10 — far below the GPU. Large headroom.
+- **nsys (MXFP4 pp2048):** `mul_mat_q<type39>` (MoE FP4 GEMM) = **37.2%**, `quantize_mmq_mxfp4` (act-quant) = 8.0%,
+  `mul_mat_q<type8>` (dense/attn, still Q8) = 10.1%, flash_attn = 8.8%. The native FP4 MMA *is* engaged — the
+  inefficiency is the **per-expert thin-tile MMQ scheduler** + **un-fused activation quant**.
+
+**Target (precise):** the ~45% in `mmq.cu`'s grouped MoE path (`ggml_cuda_mul_mat_q` + `ids`, `mmid.cu`). Replace
+the per-expert thin-tile scheduler with a CUTLASS-style grouped GEMM (full tiles regardless of tokens/expert) and
+fuse `quantize_mmq_mxfp4` into the permute/gather. Dense Q8 matmuls (10%) are the separate Lever-4 (FP8) target.
 Problem (measured): the prefill ceiling is the MoE expert GEMM. Today `ggml_cuda_mul_mat_q` with `ids`
 (`mmq.cu:127`) launches one grouped MMQ over a 3D grid (z = expert), but each expert's tile is thin
 (~tokens/expert columns) so int8/FP4 tensor cores run underfilled; throughput is memory-bound on weight