From c0e0ed3865a559ae213621d171e19ac2a9ebc854 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Thu, 25 Jun 2026 09:06:50 +0000
Subject: [PATCH] docs(paged): synthesize decode-parity exploration - the o_proj MMVQ lever

Cross-check the adversarial validation against the profiler ground-truth and
finalize DECODE_PARITY_EXPLORE.md.

The post-SSM 254->391 decode gap is one llama-specific defect: the
gated-DeltaNet output projection (ssm_out) runs as an FP4 GEMV
(mul_mat_vec_q, 132 ms/step = 26% of decode) at batch 128 instead of a
tensor-core MMQ GEMM. Mechanism confirmed at source: final_output is 3D
[6144,1,n_seqs] so src1->ne[1]=1 trips the MMVQ dispatch (<=8), with the
128 sequences in ne[2]. vLLM packs the same projection into a cutlass M=128
GEMM.

GDN recurrence is only +11%/call (not the lever); P2a optimized the wrong
FP4 kernel (the 17% MMQ, not the 26% MMVQ); CUDA graphs, host loop, and DRAM
bytes are all ruled out. Decode parity is reachable in software (not a
hardware floor): identical bytes/floor, vLLM hits 62% util vs llama 40% on
the same GB10.

Highest-value next step (~free, bit-exact): collapse final_output to 2D
before ssm_out so M=128 routes to MMQ. Ranked levers + cumulative ceilings
toward 391 documented.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 .../patches/paged/DECODE_PARITY_EXPLORE.md    | 178 ++++++++++++++++++
 1 file changed, 178 insertions(+)

diff --git a/backend/cpp/llama-cpp/patches/paged/DECODE_PARITY_EXPLORE.md b/backend/cpp/llama-cpp/patches/paged/DECODE_PARITY_EXPLORE.md
index 0fe8be3be..086f022e6 100644
--- a/backend/cpp/llama-cpp/patches/paged/DECODE_PARITY_EXPLORE.md
+++ b/backend/cpp/llama-cpp/patches/paged/DECODE_PARITY_EXPLORE.md
@@ -576,3 +576,181 @@ not the GDN kernel and not byte-cutting.
 (decode-only windowed, overlap-correct, MB-memcpy, per-step reconstruction).
 
 Assisted-by: Claude:opus-4.8 [Claude Code]
+
+---
+
+## Section: SYNTHESIS (cross-check + ground-truth + ranked levers + verdict) - FINALIZED
+
+Agent label: synthesize. Read-only (no GPU). Cross-checks all sections above against the
+fresh `profile-both-engines` ground-truth, then mechanism-confirms the dominant lever by
+reading the model graph + ggml-cuda dispatch source on the DGX (`~/llama-paged-dev`, HEAD
+46d7dd8 = patch 0019). All throughput vs the vLLM 391 t/s eager apples-to-apples reference.
+
+### 0. Headline
+
+Post-SSM dense decode = 256.6 t/s @npl128 = 65.6% of vLLM 391, bit-exact. The residual is
+NOT a hardware/architecture floor and NOT the GDN recurrence kernel, the host loop, CUDA
+graphs, or DRAM byte-volume. It is ONE concrete, llama-specific kernel-routing defect:
+**the gated-DeltaNet output projection (`ssm_out`) runs as an FP4 GEMV (`mul_mat_vec_q`)
+at decode batch 128 instead of a tensor-core FP4 GEMM (MMQ), costing 132 ms/step = 26% of
+decode = the single biggest overage vs vLLM (which packs the same projection into a cutlass
+M=128 GEMM).** The fix is a ~2-line reshape, bit-exact, and is the highest-value next step.
+
+### 1. Cross-check: which prior findings HELD, were REFUTED, or are SUPERSEDED
+
+HELD (confirmed by both the adversarial re-derivation and the fresh profile):
+- Pre-fix decomposition (gated_delta_net 23.4%, k_get_rows 21.9%, MEMCPY-DtoD 18.9% / 382 GB,
+  mul_mat_vec_q 15.5%, mul_mat_q 10.5%): reproduced to <=0.1pp (validate-findings).
+- SSM-fix D2D collapse: the 18.4 GB/step redundant recurrent-state copy is GONE. Confirmed
+  three ways: validate (18.9% -> 0.008% on the post-fix sqlite), weight-bandwidth (A/B kernel
+  sum lists no DtoD term), and IN-TRACE by the profiler (18 GB/step DtoD -> 131 MB/step). The
+  SSM fix (0018/0019) is the real breakthrough and is working.
+- P2a FP4-GEMM occupancy remap FLAT on decode (+0.6% noise) while the `mul_mat_q` kernel itself
+  shrank -24.7% and prefill rose +12.7%: confirmed.
+  Decode is not GEMM-occupancy-bound.
+- 65% of vLLM (254/391 = 64.96%, 256.6/391 = 65.6%): confirmed.
+- Decode is NOT at the bandwidth floor: 55.5 GB/step moved at 2.48x the 273 GB/s floor (40% util)
+  vs vLLM 1.61x (62% util) on the SAME bytes. Confirmed + LOCALIZED below.
+- Host loop / 64-layer serialization is NOT the lever: both engines ~98% GPU-busy at npl128
+  (llama 98.7%, vLLM 97.9%); the entire exposed-idle budget is ~0.65%. Confirmed by the profiler.
+- CUDA graphs are NOT the lever: vLLM is 394 t/s EAGER, graphs add only +6% (420); llama already
+  runs with graphs. Confirmed by the profiler.
+
+REFUTED / CORRECTED:
+- "GDN recurrence kernel is the dominant residual lever" (the STATE brief's "gated_delta_net
+  1.46 ms/call, the largest single kernel" and the gdn-source-compare framing): REFUTED. The
+  profiler's fresh side-by-side per-call duration is llama 4.03 ms vs vLLM 3.62 ms = only +11% /
+  +19 ms/step = ~10% of the 184 ms gap. It IS the largest single kernel on both sides (38% llama,
+  53% vLLM) but the largest GAP is elsewhere. (The brief's "1.46 ms/call" is a stale/narrower
+  window; the authoritative post-SSM per-call is 4.03 ms.) gdn-source-compare's occupancy/shuffle/
+  fusion anatomy is correct but addresses a SECONDARY +19 ms target, not parity.
+- "+66% SSM-fix gain" label: REFUTED. 146 -> 254-257 is +74 to +76%; "66%" is the percent-of-vLLM,
+  not the speedup (validate-findings).
+
+SUPERSEDED (the gap validate-findings flagged, now filled by real data):
+- The "FP4-GEMM ~48% / get_rows 0.7% / GDN 22.5%" Step-2 split had NO surviving sqlite (the
+  producer script crashed; only a Step-1 build was on the box). The profiler's fresh Step-2 trace
+  replaces it with a FINER, load-bearing breakdown: the ~46% "FP4 matmul" bucket is NOT one GEMM
+  family - it splits into `mul_mat_vec_q` 26% (the o_proj GEMV, the real culprit), `mul_mat_q` 17%
+  (the tensor-core GEMM P2a already optimized), and `quantize_mmq_nvfp4` 3%.
+  Lumping them as "48% FP4-GEMM" hid that Track B P2a optimized the 17% MMQ while the 26% MMVQ
+  was the bind. This is why P2a was flat on decode: **it optimized the wrong FP4 kernel.**
+
+### 2. Ground-truth per-step decode decomposition + the single biggest overage
+
+From the profiler's fresh post-SSM eager nsys, both at batch 128, prefill-free, GPU-accurate:
+
+| component (per decode step)   | llama ms | llama%  | vLLM ms        | vLLM% | gap (llama-vLLM) |
+|-------------------------------|----------|---------|----------------|-------|------------------|
+| GDN recurrence kernel         | 193      | 38%     | 174            | 53%   | **+19**          |
+| FP4 matmul + act-quant        | 236      | 46%     | 117            | 36%   | **+119**         |
+| - mul_mat_vec_q (o_proj GEMV) | **132**  | **26%** | 0              | -     | **+132**         |
+| - mul_mat_q (MMQ GEMM)        | 88       | 17%     | 61 (cutlass)   | 19%   | +27              |
+| - quantize_mmq_nvfp4          | 16       | 3%      | 55 (nvjet+cvt) | 17%   | -39              |
+| full attention (16 layers)    | 6.6      | 1.3%    | 6.2            | 1.9%  | +0.4             |
+| SSM conv + glue/elementwise   | 45       | 9%      | 22             | 7%    | +23              |
+| MEMCPY                        | 2.5      | 0.5%    | 0.36           | 0.1%  | +2               |
+| **TOTAL**                     | **~510** | 100%    | **~326**       | 100%  | **+184**         |
+
+The +119 ms FP4-matmul gap is ENTIRELY the `mul_mat_vec_q` o_proj GEMV (+132), partly offset
+by llama being -39 on activation-quant (16 vs vLLM's heavier eager 55) and +27 on the MMQ. So
+the one lever that matters is the +132 ms/step o_proj GEMV; everything else nets to ~+52 ms.
+
+**MECHANISM (confirmed by source read, not inferred).** In the dense Qwen3.5-27B GDN block
+(`src/models/qwen3next.cpp` `build_recurrent`), the recurrent core keeps the SSM layout
+`[feat, n_seq_tokens, n_seqs]`. At decode `n_seq_tokens=1, n_seqs=128`.
+The output projection is:
+
+```cpp
+// current code (qwen3next.cpp, end of the GDN block)
+ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm,
+        head_v_dim * num_v_heads, n_seq_tokens, n_seqs);          // [6144, 1, 128]
+cur = build_lora_mm(model.layers[il].ssm_out, final_output);      // <-- the matmul
+cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);  // collapse AFTER
+```
+
+`final_output` is 3D `[6144, n_seq_tokens=1, n_seqs=128]`, so `src1->ne[1] = 1`. The ggml-cuda
+dispatch (`ggml-cuda.cu:2553`) picks MMVQ when `src1->ne[1] <= MMVQ_MAX_BATCH_SIZE (8)`, with the
+128 sequences carried in `ne[2]`. Result: a per-sequence FP4 GEMV, output rows 5120 x 128 seqs =
+**`mul_mat_vec_q`, grid 5120x128, 48 calls/step (one per GDN layer)** - matching the profiler's
+trace exactly. MMVQ does NOT amortize the `ssm_out` weight read into shared memory across the 128
+sequences (it is built for batch <=8), so each of the 128 sequences re-streams the weight tiles -
+the "40% vs 62% utilization" the weight-bandwidth section measured lives HERE, in this kernel, not
+in the GDN state traffic. vLLM packs all 128 decode tokens into one cutlass M=128 GEMM (its GDN
+kernel is literally `..._PACKED_decode`), so it has NO GEMV-at-batch-128 at all.
+
+This also pins WHY it is decode-specific: at prefill the tokens are in `ne[1]` (n_seq_tokens=prompt
+len), so `ne[1] >> 8` -> MMQ already; only the decode layout (128 seqs x 1 token, batched in ne[2])
+trips the GEMV path. The in-projection (`wqkv`) is unaffected: its input is the 2D residual stream
+`[n_embd, 128]` (reshaped to 3D only AFTER the matmul), so `ne[1]=128` -> MMQ today. The o_proj is
+the unique 3D-input matmul, which is exactly why the profiler counted one MMVQ per GDN layer.
+
+### 3. Ranked remaining decode levers (impact x tractability, cumulative ceiling toward 391)
+
+Anchored on llama 256.6 t/s (499 ms/step) -> vLLM 391 (327 ms/step), gap 172 ms/step.
+The "est. recover" figures past Lever 1 are ESTIMATES (the profiler measured the costs, not the
+post-fix kernels); each needs a confirming re-profile. Ceilings are cumulative.
+
+| # | lever | targets (ms/step) | est. recover | cumulative decode_agg | % of vLLM | tractability |
+|---|-------|-------------------|--------------|-----------------------|-----------|--------------|
+| 1 | **o_proj MMVQ -> MMQ** (collapse final_output to 2D before `ssm_out`) | vec_q 132 | ~100-110 | ~320-330 | **~82-85%** | **VERY HIGH** (2-line reshape, bit-exact, MMQ already proven on NVFP4 at M=128 by the in_proj) |
+| 2 | act-quant + norm prologue fusion (explore L1 `LLAMA_FUSE_NVFP4_QUANT=1` re-bench + L2 M=128 gate) | quant 16 + 448 launches/step | ~15-25 | ~345-360 | ~88-92% | MED-HIGH (producer code exists, tasks 38-40; needs post-0019 re-bench, the pre-SSM regression is stale) |
+| 3 | GDN-area fusion + occupancy (gdn A-D: row-local reduction, raise launch_bounds occupancy, fold gate/l2norm/softplus into the recurrence) | GDN +19 + glue +23 | ~25-40 | ~375-388 | ~96-99% | MED-LOW (real kernel rewrite + numeric re-validation) |
+| 4 | conv-state in-place + conv fuse (explore L4, the proven 0018/0019 pattern on `ssm_conv`/concat) | part of glue, 48 launches/step | ~5-10 | ~388-395 | ~99-101% | HIGH (bit-exact, proven pattern) |
+| - | between-step host gap / cgraph reuse | ~2 ms/step | ~2 | +~0.4% | n/a | LOW value (cleanup, not a parity lever) |
+| x | CUDA graphs | - | 0 | already on | n/a | NOT a lever (+6% even for vLLM) |
+| x | TMA weight-feed / NVFP4-dense weight-quant | prefill / npl1 | 0 at npl128 | n/a | n/a | MIS-SCOPED for batch-128 decode (prefill / low-batch levers; prefill already +12.7%) |
+
+Note on Lever 1+2 coupling: routing the o_proj to MMQ ADDS one activation-quant (q8_1/NVFP4) per
+o_proj, so Lever 2 (fusing that quant into the preceding `build_norm_gated`) compounds Lever 1
+rather than overlapping it.
+Lever 3's "glue +23 ms" and Lever 1's quant are the same elementwise passes vLLM folds into its
+packed kernel, so 2+3 share surface - treat the estimates as a band, not a sum.
+
+### 4. Verdict: is true decode parity reachable?
+
+**Yes, parity is reachable in software, and the residual is NOT a hardware/architecture floor.**
+Proof of "not a floor": both engines read identical NVFP4 weights and read+write identical f32
+recurrent state = identical 55.5 GB/step DRAM floor (203 ms) on the identical GB10 LPDDR5x; vLLM
+achieves 62% bandwidth utilization (327 ms/step) where llama achieves 40% (499 ms/step). The 1.54x
+throughput gap equals the 1.55x utilization gap, and that utilization gap is now LOCALIZED to
+specific llama kernels - chiefly the o_proj MMVQ - every one of which is closable in software. The
+GDN recurrence (the supposed floor) is only +11%/call between the two engines.
+
+How far each tier reaches:
+- The first ~84% of parity (256 -> ~325) is nearly FREE: Lever 1 is a 2-line reshape that moves
+  the GDN output projection from a per-sequence FP4 GEMV to a tensor-core M=128 FP4 GEMM, bit-exact,
+  no new kernel (MMQ already runs the in-projection at this exact shape and type).
+- ~84% -> ~92% (Levers 1+2) is low-effort: the fused act-quant producer already exists (tasks
+  38-40), it just needs a post-0019 re-bench because its pre-SSM regression was measured when the
+  GPU was 99% busy on the now-removed state-copy chain (no idle to reclaim then; real idle now).
+- ~92% -> ~100% (Levers 3+4) is the diminishing-returns tail and the only genuinely HARD work:
+  matching vLLM's fully-fused `packed_decode` GDN kernel (row-local reductions, higher occupancy,
+  folding the gate/l2norm/softplus elementwise passes into the recurrence). This last ~8% is "hard
+  but not floored" - it is kernel engineering, not a hardware wall.
+
+**Single highest-value next step (do this first):** apply Lever 1 - collapse `final_output` to 2D
+`[head_v_dim*num_v_heads, n_seq_tokens*n_seqs]` BEFORE the `ssm_out` matmul (drop the now-redundant
+post-matmul `reshape_2d`):
+
+```cpp
+// route the GDN output projection through tensor-core MMQ at decode:
+// M = n_seq_tokens*n_seqs (=128 at decode) instead of ne[1]=1 -> MMVQ GEMV. Free, bit-exact.
+ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm,
+        head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
+cur = build_lora_mm(model.layers[il].ssm_out, final_output);  // now [n_embd, n_tokens], M=128 MMQ
+```
+
+Then the profiler re-measures the realized o_proj-as-MMQ cost on a clean post-0019 nsys (the one
+number this synthesis estimates rather than measures) and confirms the 256 -> ~320-330 lift. The
+same 3D-input-matmul pattern almost certainly affects the MoE checkpoint (q36-35b-a3b) decode and
+any other matmul that consumes a tensor still in the `[feat, 1, n_seqs]` SSM layout - grep those
+and apply the same collapse. Levers 2-4 follow in priority order; none requires a model or accuracy
+compromise, so bit-exactness is preserved throughout.
+
+### Evidence (this section)
+- Source read (DGX `~/llama-paged-dev`, read-only): `src/models/qwen3next.cpp` (GDN in/out proj
+  layout, lines ~286-305 and ~518-528), `ggml/src/ggml-cuda/ggml-cuda.cu:2553` (MMVQ dispatch on
+  `ne[1]<=8`), `ggml/src/ggml-cuda/mmvq.cuh:3` (`MMVQ_MAX_BATCH_SIZE 8`), `mmq.cu:267` (NVFP4 is
+  MMQ-supported).
+- All five prior sections of this doc + the profiler's `~/bench/postssm_decomp/` traces.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
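
A minimal sketch of the routing rule described in the MECHANISM paragraph of section 2, for readers who want to check the shape arithmetic without a GPU. It assumes the kernel choice depends only on `src1->ne[1]` and the threshold of 8 quoted in the Evidence list; the real dispatch in `ggml-cuda.cu` checks further conditions (types, hardware) that are left out. `Shape`, `Kernel`, `pick_kernel`, `final_output_3d`, and `final_output_2d` are hypothetical names for illustration, not ggml symbols.

```cpp
#include <cstdint>

// Hypothetical stand-in for a ggml tensor shape (ne0 is the innermost dimension).
struct Shape {
    std::int64_t ne0, ne1, ne2, ne3;
};

enum class Kernel { MMVQ, MMQ };

// Threshold quoted in this document (MMVQ_MAX_BATCH_SIZE 8).
constexpr std::int64_t kMmvqMaxBatchSize = 8;

// Simplified model (assumption): only ne1 of the activation is compared against
// the threshold, so sequences carried in ne2 do not count toward the batch.
constexpr Kernel pick_kernel(Shape src1) {
    return src1.ne1 <= kMmvqMaxBatchSize ? Kernel::MMVQ : Kernel::MMQ;
}

// ssm_out input in the current layout: [feat, n_seq_tokens, n_seqs].
constexpr Shape final_output_3d(std::int64_t feat, std::int64_t n_seq_tokens,
                                std::int64_t n_seqs) {
    return Shape{feat, n_seq_tokens, n_seqs, 1};
}

// ssm_out input after the Lever 1 collapse: [feat, n_seq_tokens * n_seqs].
constexpr Shape final_output_2d(std::int64_t feat, std::int64_t n_seq_tokens,
                                std::int64_t n_seqs) {
    return Shape{feat, n_seq_tokens * n_seqs, 1, 1};
}

// Decode at batch 128: the 3D layout leaves ne1 = 1, the 2D layout makes ne1 = 128.
static_assert(pick_kernel(final_output_3d(6144, 1, 128)) == Kernel::MMVQ,
              "3D decode layout falls under the GEMV threshold");
static_assert(pick_kernel(final_output_2d(6144, 1, 128)) == Kernel::MMQ,
              "collapsed 2D decode layout exceeds the GEMV threshold");
```

Under the same simplified rule, a prefill-shaped input such as `final_output_3d(6144, 512, 1)` maps to `Kernel::MMQ`, which is consistent with the gap being decode-specific. The checks are compile-time only, so the sketch says nothing about the realized MMQ cost; that still needs the re-profile described above.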