From c0e0ed3865a559ae213621d171e19ac2a9ebc854 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 25 Jun 2026 09:06:50 +0000
Subject: [PATCH] docs(paged): synthesize decode-parity exploration - the
 o_proj MMVQ lever

Cross-check the adversarial validation against the profiler ground-truth and
finalize DECODE_PARITY_EXPLORE.md. The post-SSM 254->391 decode gap is one
llama-specific defect: the gated-DeltaNet output projection (ssm_out) runs as
an FP4 GEMV (mul_mat_vec_q, 132 ms/step = 26% of decode) at batch 128 instead
of a tensor-core MMQ GEMM. Mechanism confirmed at source: final_output is 3D
[6144,1,n_seqs] so src1->ne[1]=1 trips the MMVQ dispatch (<=8), with the 128
sequences in ne[2]. vLLM packs the same projection into a cutlass M=128 GEMM.

GDN recurrence is only +11%/call (not the lever); P2a optimized the wrong FP4
kernel (the 17% MMQ, not the 26% MMVQ); CUDA graphs, host loop, and DRAM bytes
are all ruled out. Decode parity is reachable in software (not a hardware
floor): identical bytes/floor, vLLM hits 62% util vs llama 40% on the same
GB10. Highest-value next step (~free, bit-exact): collapse final_output to 2D
before ssm_out so M=128 routes to MMQ. Ranked levers + cumulative ceilings
toward 391 documented.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 .../patches/paged/DECODE_PARITY_EXPLORE.md    | 178 ++++++++++++++++++
 1 file changed, 178 insertions(+)

diff --git a/backend/cpp/llama-cpp/patches/paged/DECODE_PARITY_EXPLORE.md b/backend/cpp/llama-cpp/patches/paged/DECODE_PARITY_EXPLORE.md
index 0fe8be3be..086f022e6 100644
--- a/backend/cpp/llama-cpp/patches/paged/DECODE_PARITY_EXPLORE.md
+++ b/backend/cpp/llama-cpp/patches/paged/DECODE_PARITY_EXPLORE.md
@@ -576,3 +576,181 @@ not the GDN kernel and not byte-cutting.
   (decode-only windowed, overlap-correct, MB-memcpy, per-step reconstruction).
 
 Assisted-by: Claude:opus-4.8 [Claude Code]
+
+---
+
+## Section: SYNTHESIS (cross-check + ground-truth + ranked levers + verdict) - FINALIZED
+
+Agent label: synthesize. Read-only (no GPU). Cross-checks all sections above against the
+fresh `profile-both-engines` ground-truth, then mechanism-confirms the dominant lever by
+reading the model graph + ggml-cuda dispatch source on the DGX (`~/llama-paged-dev`, HEAD
+46d7dd8 = patch 0019). All throughput is quoted against the vLLM 391 t/s eager, apples-to-apples reference.
+
+### 0. Headline
+
+Post-SSM dense decode = 256.6 t/s @npl128 = 65.6% of vLLM 391, bit-exact. The residual is
+NOT a hardware/architecture floor and NOT the GDN recurrence kernel, the host loop, CUDA
+graphs, or DRAM byte-volume. It is ONE concrete, llama-specific kernel-routing defect:
+**the gated-DeltaNet output projection (`ssm_out`) runs as an FP4 GEMV (`mul_mat_vec_q`)
+at decode batch 128 instead of a tensor-core FP4 GEMM (MMQ), costing 132 ms/step = 26% of
+decode = the single biggest overage vs vLLM (which packs the same projection into a cutlass
+M=128 GEMM).** The fix is a ~2-line reshape, bit-exact, and is the highest-value next step.
+
+### 1. Cross-check: which prior findings HELD, were REFUTED, or are SUPERSEDED
+
+HELD (confirmed by both the adversarial re-derivation and the fresh profile):
+- Pre-fix decomposition (gated_delta_net 23.4%, k_get_rows 21.9%, MEMCPY-DtoD 18.9% / 382 GB,
+  mul_mat_vec_q 15.5%, mul_mat_q 10.5%): reproduced to <=0.1pp (validate-findings).
+- SSM-fix D2D collapse: the 18.4 GB/step redundant recurrent-state copy is GONE. Confirmed
+  three ways: validate (18.9% -> 0.008% on the post-fix sqlite), weight-bandwidth (A/B kernel
+  sum lists no DtoD term), and IN-TRACE by the profiler (18 GB/step DtoD -> 131 MB/step). The
+  SSM fix (0018/0019) is the real breakthrough and is working.
+- P2a FP4-GEMM occupancy remap FLAT on decode (+0.6% noise) while the `mul_mat_q` kernel itself
+  shrank -24.7% and prefill rose +12.7%: confirmed. Decode is not GEMM-occupancy-bound.
+- 65% of vLLM (254/391 = 64.96%, 256.6/391 = 65.6%): confirmed.
+- Decode is NOT at the bandwidth floor: 55.5 GB/step moved at 2.48x the 273 GB/s floor (40% util)
+  vs vLLM 1.61x (62% util) on the SAME bytes. Confirmed + LOCALIZED below.
+- Host loop / 64-layer serialization is NOT the lever: both engines ~98% GPU-busy at npl128
+  (llama 98.7%, vLLM 97.9%); the entire exposed-idle budget is ~0.65%. Confirmed by the profiler.
+- CUDA graphs are NOT the lever: vLLM is 394 t/s EAGER, graphs add only +6% (420); llama already
+  runs with graphs. Confirmed by the profiler.
+
+REFUTED / CORRECTED:
+- "GDN recurrence kernel is the dominant residual lever" (the STATE brief's "gated_delta_net
+  1.46 ms/call, the largest single kernel" and the gdn-source-compare framing): REFUTED. The
+  profiler's fresh side-by-side per-call duration is llama 4.03 ms vs vLLM 3.62 ms = only +11% /
+  +19 ms/step = ~10% of the 184 ms gap. It IS the largest single kernel on both sides (38% llama,
+  53% vLLM) but the largest GAP is elsewhere. (The brief's "1.46 ms/call" is a stale/narrower
+  window; the authoritative post-SSM per-call is 4.03 ms.) gdn-source-compare's occupancy/shuffle/
+  fusion anatomy is correct but addresses a SECONDARY +19 ms target, not parity.
+- "+66% SSM-fix gain" label: REFUTED. 146 -> 254-257 is +74 to +76%; "66%" is the percent-of-vLLM,
+  not the speedup (validate-findings).
+
+SUPERSEDED (the gap validate-findings flagged, now filled by real data):
+- The "FP4-GEMM ~48% / get_rows 0.7% / GDN 22.5%" Step-2 split had NO surviving sqlite (the
+  producer script crashed; only a Step-1 build was on the box). The profiler's fresh Step-2 trace
+  replaces it with a FINER, load-bearing breakdown: the ~46% "FP4 matmul" bucket is NOT one GEMM
+  family - it splits into `mul_mat_vec_q` 26% (the o_proj GEMV, the real culprit), `mul_mat_q` 17%
+  (the tensor-core GEMM P2a already optimized), and `quantize_mmq_nvfp4` 3%. Lumping them as
+  "48% FP4-GEMM" hid that Track B P2a optimized the 17% MMQ while the 26% MMVQ was the bind. This
+  is why P2a was flat on decode: **it optimized the wrong FP4 kernel.**
+
+### 2. Ground-truth per-step decode decomposition + the single biggest overage
+
+From the profiler's fresh post-SSM eager nsys traces, both engines at batch 128, prefill-free, GPU-accurate:
+
+| component (per decode step) | llama ms | llama% | vLLM ms | vLLM% | gap ms (llama-vLLM) |
+|-----------------------------|----------|--------|---------|-------|------------------|
+| GDN recurrence kernel       | 193      | 38%    | 174     | 53%   | **+19**          |
+| FP4 matmul + act-quant      | 236      | 46%    | 117     | 36%   | **+119**         |
+|   - mul_mat_vec_q (o_proj GEMV) | **132** | **26%** | 0   | -     | **+132**         |
+|   - mul_mat_q (MMQ GEMM)    | 88       | 17%    | 61 (cutlass) | 19% | +27             |
+|   - quantize_mmq_nvfp4      | 16       | 3%     | 55 (nvjet+cvt)| 17% | -39             |
+| full attention (16 layers)  | 6.6      | 1.3%   | 6.2     | 1.9%  | +0.4             |
+| SSM conv + glue/elementwise | 45       | 9%     | 22      | 7%    | +23              |
+| MEMCPY                      | 2.5      | 0.5%   | 0.36    | 0.1%  | +2               |
+| **TOTAL**                   | **~510** | 100%   | **~326**| 100%  | **+184**         |
+
+The +119 ms FP4-matmul gap is dominated by the `mul_mat_vec_q` o_proj GEMV (+132); the MMQ adds
+another +27 and llama's lighter activation-quant (16 vs vLLM's heavier eager 55) gives back -39.
+So the one lever that matters is the +132 ms/step o_proj GEMV; the rest of the +184 nets to ~+52 ms.
+
+**MECHANISM (confirmed by source read, not inferred).** In the dense Qwen3.5-27B GDN block
+(`src/models/qwen3next.cpp` `build_recurrent`), the recurrent core keeps the SSM layout
+`[feat, n_seq_tokens, n_seqs]`. At decode `n_seq_tokens=1, n_seqs=128`. The output projection is:
+
+```cpp
+// current code (qwen3next.cpp, end of the GDN block)
+ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm,
+                                 head_v_dim * num_v_heads, n_seq_tokens, n_seqs); // [6144, 1, 128]
+cur = build_lora_mm(model.layers[il].ssm_out, final_output);                     // <-- the matmul
+cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);                 // collapse AFTER
+```
+
+`final_output` is 3D `[6144, n_seq_tokens=1, n_seqs=128]`, so `src1->ne[1] = 1`. The ggml-cuda
+dispatch (`ggml-cuda.cu:2553`) picks MMVQ when `src1->ne[1] <= MMVQ_MAX_BATCH_SIZE (8)`, with the
+128 sequences carried in `ne[2]`. Result: a per-sequence FP4 GEMV over 5120 output rows x 128 seqs, i.e.
+**`mul_mat_vec_q`, grid 5120x128, 48 calls/step (one per GDN layer)** - matching the profiler's
+trace exactly. MMVQ does NOT amortize the `ssm_out` weight read into shared memory across the 128
+sequences (it is built for batch <=8), so each of the 128 sequences re-streams the weight tiles -
+the "40% vs 62% utilization" the weight-bandwidth section measured lives HERE, in this kernel, not
+in the GDN state traffic. vLLM packs all 128 decode tokens into one cutlass M=128 GEMM (its GDN
+kernel is literally `..._PACKED_decode`), so it has NO GEMV-at-batch-128 at all.
+
+This also pins WHY it is decode-specific: at prefill the tokens are in `ne[1]` (n_seq_tokens=prompt
+len), so `ne[1] >> 8` -> MMQ already; only the decode layout (128 seqs x 1 token, batched in ne[2])
+trips the GEMV path. The in-projection (`wqkv`) is unaffected: its input is the 2D residual stream
+`[n_embd, 128]` (reshaped to 3D only AFTER the matmul), so `ne[1]=128` -> MMQ today. The o_proj is
+the unique 3D-input matmul, which is exactly why the profiler counted one MMVQ per GDN layer.
+
+### 3. Ranked remaining decode levers (impact x tractability, cumulative ceiling toward 391)
+
+Anchored on throughput: llama 256.6 t/s (499 ms/step) -> vLLM 391 (327 ms/step), gap 172 ms/step;
+the nsys totals in section 2 (~510 vs ~326, gap 184) differ slightly. Recover figures past Lever 1
+are ESTIMATES (costs were profiled, not the post-fix kernels); each needs a confirming re-profile. Ceilings are cumulative.
+
+| # | lever | targets (ms/step) | est. recover | cumulative decode_agg | % of vLLM | tractability |
+|---|-------|-------------------|--------------|-----------------------|-----------|--------------|
+| 1 | **o_proj MMVQ -> MMQ** (collapse final_output to 2D before `ssm_out`) | vec_q 132 | ~100-110 | ~320-330 | **~82-85%** | **VERY HIGH** (2-line reshape, bit-exact, MMQ already proven on NVFP4 at M=128 by the in_proj) |
+| 2 | act-quant + norm prologue fusion (explore L1 `LLAMA_FUSE_NVFP4_QUANT=1` re-bench + L2 M=128 gate) | quant 16 + 448 launches/step | ~15-25 | ~345-360 | ~88-92% | MED-HIGH (producer code exists, tasks 38-40; needs post-0019 re-bench, the pre-SSM regression is stale) |
+| 3 | GDN-area fusion + occupancy (gdn A-D: row-local reduction, raise launch_bounds occupancy, fold gate/l2norm/softplus into the recurrence) | GDN +19 + glue +23 | ~25-40 | ~375-388 | ~96-99% | MED-LOW (real kernel rewrite + numeric re-validation) |
+| 4 | conv-state in-place + conv fuse (explore L4, the proven 0018/0019 pattern on `ssm_conv`/concat) | part of glue, 48 launches/step | ~5-10 | ~388-395 | ~99-101% | HIGH (bit-exact, proven pattern) |
+| - | between-step host gap / cgraph reuse | ~2 ms/step | ~2 | +~0.4% | n/a | LOW value (cleanup, not a parity lever) |
+| x | CUDA graphs | - | 0 | already on | n/a | NOT a lever (+6% even for vLLM) |
+| x | TMA weight-feed / NVFP4-dense weight-quant | prefill / npl1 | 0 at npl128 | n/a | n/a | MIS-SCOPED for batch-128 decode (prefill / low-batch levers; prefill already +12.7%) |
+
+Note on Lever 1+2 coupling: routing the o_proj to MMQ ADDS one activation-quant (q8_1/NVFP4) per
+o_proj, so Lever 2 (fusing that quant into the preceding `build_norm_gated`) compounds Lever 1
+rather than overlapping it. Lever 3's "glue +23 ms" and Lever 1's quant are the same elementwise
+passes vLLM folds into its packed kernel, so 2+3 share surface - treat the estimates as a band,
+not a sum.
+
+### 4. Verdict: is true decode parity reachable?
+
+**Yes, parity is reachable in software, and the residual is NOT a hardware/architecture floor.**
+Proof of "not a floor": both engines read identical NVFP4 weights and read+write identical f32
+recurrent state = identical 55.5 GB/step DRAM floor (203 ms) on the identical GB10 LPDDR5x; vLLM
+achieves 62% bandwidth utilization (327 ms/step) where llama achieves 40% (499 ms/step). The 1.54x
+throughput gap matches the 1.55x utilization gap, and that utilization gap is now LOCALIZED to
+specific llama kernels - chiefly the o_proj MMVQ - every one of which is closable in software. The
+GDN recurrence (the supposed floor) is only +11%/call between the two engines.
+
+How far each tier reaches:
+- Up to ~84% of vLLM (256 -> ~325 t/s) is nearly FREE: Lever 1 is a 2-line reshape that moves
+  the GDN output projection from a per-sequence FP4 GEMV to a tensor-core M=128 FP4 GEMM, bit-exact,
+  no new kernel (MMQ already runs the in-projection at this exact shape and type).
+- ~84% -> ~92% (Levers 1+2) is low-effort: the fused act-quant producer already exists (tasks
+  38-40), it just needs a post-0019 re-bench because its pre-SSM regression was measured when the
+  GPU was 99% busy on the now-removed state-copy chain (no idle to reclaim then; real idle now).
+- ~92% -> ~100% (Levers 3+4) is the diminishing-returns tail and the only genuinely HARD work:
+  matching vLLM's fully-fused `packed_decode` GDN kernel (row-local reductions, higher occupancy,
+  folding the gate/l2norm/softplus elementwise passes into the recurrence). This last ~8% is "hard
+  but not floored" - it is kernel engineering, not a hardware wall.
+
+**Single highest-value next step (do this first):** apply Lever 1 - collapse `final_output` to 2D
+`[head_v_dim*num_v_heads, n_seq_tokens*n_seqs]` BEFORE the `ssm_out` matmul (drop the now-redundant
+post-matmul `reshape_2d`):
+
+```cpp
+// route the GDN output projection through tensor-core MMQ at decode:
+// M = n_seq_tokens*n_seqs (=128 at decode) instead of ne[1]=1 -> MMVQ GEMV. Free, bit-exact.
+ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm,
+                                 head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
+cur = build_lora_mm(model.layers[il].ssm_out, final_output); // now [n_embd, n_tokens], M=128 MMQ
+```
+
+Then the profiler re-measures the realized o_proj-as-MMQ cost on a clean post-0019 nsys (the one
+number this synthesis estimates rather than measures) and confirms the 256 -> ~320-330 lift. The
+same 3D-input-matmul pattern almost certainly affects the MoE checkpoint (q36-35b-a3b) decode and
+any other matmul that consumes a tensor still in the `[feat, 1, n_seqs]` SSM layout - grep those
+and apply the same collapse. Levers 2-4 follow in priority order; none requires a model or accuracy
+compromise, so bit-exactness is preserved throughout.
+
+### Evidence (this section)
+- Source read (DGX `~/llama-paged-dev`, read-only): `src/models/qwen3next.cpp` (GDN in/out proj
+  layout, lines ~286-305 and ~518-528), `ggml/src/ggml-cuda/ggml-cuda.cu:2553` (MMVQ dispatch on
+  `ne[1]<=8`), `ggml/src/ggml-cuda/mmvq.cuh:3` (`MMVQ_MAX_BATCH_SIZE 8`), `mmq.cu:267` (NVFP4 is
+  MMQ-supported).
+- All five prior sections of this doc + the profiler's `~/bench/postssm_decomp/` traces.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]