docs(paged): SYNTHESIS - validated decode-parity picture, ranked plan, verdict

Final synthesis of the critical-path gap analysis: the decode step is 99.94% GPU-busy single-stream (idle 0.225ms = 0.06%), so the 14% gap to vLLM is kernel GPU-time dominated by the bandwidth-bound gated_delta_net recurrence (196.37ms = 51.6%), not launch bubbles. Claims A/B/C all REFUTED as worded; the single residual is the unmeasured DRAM byte ratio of llama's recurrence vs vLLM's fused kernel. Ranked plan: single-pass fused GDN recurrence (gap-closer, gate on ncu byte-ratio test) + conv-state concat fusion (no-regret +2-3%, bit-exact); gate-fold alone tops out at ~89% of vLLM; bf16 state is the only floor-mover but breaks bit-exactness. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 09:26:55 -04:00 · 2026-06-25 15:03:18 +00:00
parent a72385257a
commit 5825b073a5
1 changed files with 164 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/CRITICALPATH_GAP_ANALYSIS.md
+++ b/backend/cpp/llama-cpp/patches/paged/CRITICALPATH_GAP_ANALYSIS.md
@@ -473,3 +473,167 @@ per-step GPU-busy = wall (no overlap), so that inference does not hold.
  pass or the GEMM/quantize stack.

 Assisted-by: Claude:opus-4.8 [Claude Code]
+
+---
+
+## SYNTHESIS (final) - the validated decode-parity picture, ranked plan, and verdict
+
+Reconciles all six investigation sections above plus the three adversarial verdicts
+(Verify A/B/C). One sentence: **the "~60% idle" never existed; the decode step is
+99.94% GPU-busy single-stream, so the 14% gap to vLLM is kernel GPU-time, dominated by
+the bandwidth-bound `gated_delta_net` recurrence (51.6%), and the only gap-closing levers
+are byte-reduction inside that kernel - NOT launch-bubble removal.**
+
+### 1. The proven critical-path decomposition of the decode step
+
+Decisive node-level trace (`nsys --cuda-graph-trace=node`, clean Lever-1 build df1cc97,
+q36-27b-nvfp4 dense, npl128, GB10/48SM/sm_121, commit a7238525, nsysgap.sqlite). One
+steady step = single replayed CUDA graph (graphId=11, 23 replays), all 2965 kernels on
+ONE stream (stream 14, strictly serial -> every inter-kernel gap is pure idle). Window
+383.48 ms.
+
+BUBBLE CLASSIFICATION (the "where is the ~60% idle" answer - it is NOT idle):
+
+| bucket | ms/step | % step | note |
+|---|---|---|---|
+| (a) inter-kernel launch bubbles | ~0 | ~0 | graph replay collapses host launch latency |
+| (b) serial-dependency stalls (GDN chain) | included in 0.225 | 0.06 | each kernel starts < 1 us after prev; zero gaps > 5 us, max 2.40 us |
+| (a)+(b) total exposed idle (LAG sum) | **0.225** | **0.06%** | 1700 kernels back-to-back |
+| (d) between-step HOST gap (cgraph rebuild, new uid) | ~0.2 | ~0.05 | the ONLY graph-non-covered idle; ~0.4% in older eager-tail traces |
+| (c) within-kernel GPU-busy | **380.4** | **99.94%** | this is the whole step |
+
+The nvidia-smi "40%" is within-kernel SM/bandwidth efficiency (~12-16% achieved
+occupancy on memory-latency-bound kernels), NOT wall-clock idle.
+
+KERNEL GPU-TIME DECOMPOSITION of the 380.4 ms busy step (this is where the gap lives):
+
+| kernel | ms | % step | regime |
+|---|---|---|---|
+| `gated_delta_net_cuda<128>` (48x, 4.08 ms/call) | **196.37** | **51.6** | bandwidth-bound f32 recurrent-state R+W (~384 MB R + 384 MB W/layer) |
+| `mul_mat_q` FP4 GEMM (496x) | 92.90 | 24.4 | memory-bound weight stream, 136-CTA tail-bound at decode |
+| `quantize_mmq_nvfp4` (496x) | 17.13 | 4.5 | mandatory act-quant (Lever-2 only relocated it) |
+| `nvjet` lm_head GEMM | 11.91 | 3.1 | |
+| `flash_attn_ext_f16` (16 attn layers) | 11.67 | 3.1 | |
+| `concat_cont` (conv-state splice) | 8.01 | 2.1 | Lever-1 target |
+| `cpy_scalar` (conv-state writeback + dup) | 7.62 | 2.0 | Lever-1 target (the conv-state share) |
+| `k_get_rows_float` | 7.08 | 1.9 | |
+| `k_bin_bcast` (gate mul + add) | 6.59 | 1.7 | Lever-3 gate-fold target (partial - rest is residual adds) |
+| `ssm_conv_f32` | 5.64 | 1.5 | folds into Lever-1 |
+| `unary_gated` (silu/sigmoid) | 5.36 | 1.4 | mostly FFN + output-gate (Lever 3 does NOT touch) |
+| `mul_mat_q_stream_k_fixup` | 3.94 | 1.0 | |
+| `rms_norm_f32` | 3.52 | 0.9 | |
+| `l2_norm_f32` | 0.64 | 0.2 | Lever-3 gate-fold target |
+| `gdn_gather_nonident` | 0.061 | 0.016 | negligible (early-returns on identity ids) |
+
+GDN region (recurrence + conv + concat + cpy + gather + l2norm) >= 210 ms = 55%+ of the step.
+The widely-cited "gated_delta_net 13%, 1.47 ms/call near-vLLM" from nsysab_new.kern.txt was
+PREFILL + the single eager capture step contaminating the average over 1248 calls (range
+0.046-4.42 ms); true steady decode is 4.08 ms/call, 2.8x higher, 51.6% of the step.
+
+### 2. Claims A / B / C: which HOLD, which are REFUTED, and the residual uncertainty
+
+**CLAIM A** ("the ~60% decode GPU-idle is inter-op launch bubbles ON the serial GDN
+chain"): **REFUTED.** Measured idle = 0.225 ms = 0.06%, not the ~53-57 ms the claim
+requires (two-plus orders of magnitude short). Zero gaps > 5 us; CUDA-graph replay
+already collapsed launch latency; serial data-dependency does NOT equal idle when the
+graph dispatches nodes back-to-back. The "40%" was a misread of within-kernel SM
+efficiency; the "555 ms busy-sum > 384 ms wall implies overlap" was a prefill-contaminated
+`--trace=cuda` artifact (each step recorded as one opaque ~380 ms block).
+
+**CLAIM B** ("Lever 3 - gate fusion - moves the wall, unlike P2a/Lever-2, by removing
+serial launch bubbles"): **REFUTED on mechanism.** (i) There are no bubbles to remove
+(0.06%). (ii) The contrast is fictional: the step is single-stream with ZERO overlap
+anywhere, so P2a/Lever-2 were NOT flat because they "optimized overlapped work" - P2a
+tuned the prefill large-M GEMM (decode GEMMs are a different 136-CTA tail regime) and
+Lever-2 merely relocated mandatory quantize work into the GEMM prologue (net zero).
+(iii) Where the claim is trivially true (any kernel removal cuts wall in a 99.94%-busy
+single-stream step), the slice Lever 3 actually fuses ceilings at **12.76 ms = 3.35%**
+(k_bin_bcast 6.59 + silu/sigmoid 5.36 + l2_norm 0.64 + softplus 0.13 - and even that
+over-counts, since silu is mostly untouched FFN/output-gate). So the wall DOES move, but
+only ~3% (380 -> ~367 ms, 86% -> ~89% of vLLM), and NOT for the claimed reason. Lever 3
+is a component, not the gap-closer.
+
+**CLAIM C** ("the residual gap is software-closable LATENCY, not a GB10 hardware floor"):
+**REFUTED as worded** (no latency, no idle to close - same data as A). The "not a hardware
+floor" half is **UNSETTLED, not proven.** vLLM hits 327 ms on the same silicon, so it is
+not an absolute hard floor - but whether the dominant 51.6% `gated_delta_net` term is
+software-closable in BIT-EXACT form turns on one unmeasured quantity (below).
+
+RESIDUAL UNCERTAINTY (the single open question that decides everything):
+- **The DRAM byte-traffic ratio of llama's recurrence vs vLLM's.** Every section above
+  ESTIMATED the GDN state bytes (~190 GB/s effective, ~70% of 273 GB/s peak); none MEASURED
+  it. If llama's `gated_delta_net_cuda<128>` moves ~2x the minimal (s0-read + s1-write)
+  bytes because the un-fused gate/l2norm/writeback/gather ops re-stream state through HBM,
+  then the 51.6% is software-closable by a single-pass fused recurrence (Claim C spirit
+  HOLDS). If llama already moves ~minimal bytes at > 85% of peak and vLLM moves the same,
+  the recurrence is at the GB10 LPDDR5x floor for this state size -> the gap is a
+  hardware/architecture floor and is NOT closable in bit-exact form (Claim C REFUTED on
+  both halves). This is the one measurement that converts the verdict from "refuted as
+  worded" to a definitive yes/no.
+- **The MoE model (qwen35moe) is untested.** At B=128 MUL_MAT_ID can trip
+  [TAG_MUL_MAT_ID_CUDA_GRAPHS] (`ne[2] > mmvq_mmid_max`) and disable the WHOLE MoE-decode
+  graph into eager, where the ~3100 per-step launches re-dispatch serially on the Grace
+  cores and inter-op bubbles WOULD reappear. For MoE only, Claim A could partially hold.
+  The dense 335 tok/s headline is fully settled.
+
+### 3. Ranked implementation plan for the remaining ~14% (57 ms/step, 384 -> 327)
+
+Every win must come from kernel GPU-time (bytes), because bubbles = 0 and both engines
+share identical bandwidth/compute floors. Ranked by expected recovery.
+
+| # | Lever | ms/step recovered | -> % of vLLM | bit-exact | tractability | gate |
+|---|---|---|---|---|---|---|
+| **1** | **Single-pass fused GDN recurrence** (fold l2norm+gate+decay+recurrence+state-writeback+gather into ONE pass over state, mirroring vLLM `fused_recurrent_gated_delta_rule_packed_decode`) - cuts state HBM round-trips | **0 to ~40** (= the byte-delta; UNKNOWN until ncu) | 86% -> up to ~98% | near (l2norm reduction; KL < ~1e-3) | HIGH (kernel rewrite) | **ncu byte-ratio test FIRST** |
+| 2 | **Conv-state concat -> ssm_conv fusion** (Lever 1): pass conv-state + new token as separate srcs, update conv state in place (vLLM `causal_conv1d_update`); removes concat_cont + the conv-state cpy | **~8-12** (concat 8.01 + cpy share of 7.62) | +2-3% | YES | MEDIUM | no-regret, build regardless |
+| 3 | **Gate-chain fold** (Lever 3 as designed): sigmoid-beta + softplus+dt+ssm_a gate + q/k l2norm into the recurrence kernel | **~12.76 ceiling** (3.35%) - but SUBSUMED by #1 | +3% | near (l2norm) | MEDIUM | build as a COMPONENT of #1, not standalone |
+| 4 | **bf16 recurrent + conv state** (Lever 5): halve the 196 ms recurrence + conv traffic; keep f32 in-register accumulation | **~70-90** (if floor-bound) | could reach/exceed parity | NO (parity-tolerance decision; must match vLLM stored dtype) | HIGH (rewrite + parity validation) | the ONLY lever that moves the floor kernel; separate precision track |
+| 5 | gdn_gather skip-launch at steady decode | ~0.06 | ~0 | YES | trivial | not worth it (micro) |
+| 6 | GDN occupancy split | 0 | 0 | - | - | NOT a lever: 196608 CTAs / 4096 waves, already saturated, bandwidth-bound |
+| 7 | quantize_mmq attack (Lever 2) | 0 | 0 | - | - | SPENT - relocated mandatory work, proven flat |
+| 8 | decode CUDA-graph capture | 0 | 0 | - | - | SPENT - ALREADY in effect (graphId=11), did not close gap |
+| 9 | persistent cgraph (uid fast-path) | ~0.2 (0.05-0.4%) | ~0 | YES | MEDIUM | second-order to the SSM floor |
+
+Levers 1, 3, and the gather of #5 are the SAME kernel rewrite: build them together as a
+single-pass recurrence. Levers 6/7/8 are dead (at-floor or already-shipped). Lever 4 is a
+distinct, bit-exactness-breaking precision track.
+
+### 4. The honest verdict and the single highest-value next step
+
+**Is true (bit-exact) decode parity reachable?** UNCERTAIN, and it hinges entirely on the
+unmeasured byte ratio:
+- If llama's recurrence re-streams state (~2x bytes from un-fused ops): YES - a single-pass
+  fused recurrence (Lever 1) plus conv fusion (Lever 2) plausibly recover ~20-40 ms, taking
+  llama to ~345-365 ms = ~90-95% of vLLM, near-bit-exact (gate on KL tolerance).
+- If llama is already at the GB10 bandwidth floor for f32 state: NO in bit-exact form - the
+  57 ms is a hardware floor, and only bf16 state (Lever 4, non-bit-exact) closes it.
+
+Either way, the gating-fold-alone path tops out at ~89% of vLLM, so the project should NOT
+ship the isolated gate fold as "the parity lever."
+
+**SINGLE highest-value next IMPLEMENTATION step:** build the **single-pass fused GDN
+recurrence kernel** (Lever 1 = fold gate + l2norm + state-writeback + gather into one pass
+over the recurrent state) - BUT gate the build on one cheap measurement first, because it
+is a HIGH-effort kernel rewrite that is worthless if the recurrence is already byte-minimal.
+
+**The measurement that confirms it before over-investing (one short GPU run, gap-analysis
+agent only):** `ncu` on `gated_delta_net_cuda<128>` at B=128 vs vLLM's
+`fused_recurrent_gated_delta_rule_packed_decode_kernel` for identical layer dims, two
+counters:
+- `dram__bytes.sum` (actual DRAM bytes/call)
+- `dram__throughput.avg.pct_of_peak_sustained_elapsed` (achieved % of 273 GB/s)
+
+Decision rule:
+- llama moves ~2x minimal bytes OR vLLM moves materially fewer for the same math -> redundant
+  un-fused state round-trips -> BUILD the single-pass fused recurrence; predicted recovery
+  scales with the byte delta (up to ~40 ms). This is the gap-closer.
+- llama already moves ~minimal bytes at > 85% of peak and vLLM moves the same -> the
+  recurrence is at the GB10 hardware floor -> do NOT build the fusion for throughput (only
+  the ~3% gate-fold ceiling remains); the sole remaining lever is bf16 state (Lever 4,
+  accept non-bit-exact), and bit-exact parity is NOT reachable.
+
+**No-regret parallel work** (build regardless of the ncu outcome, bit-exact, medium effort):
+the conv-state concat -> ssm_conv in-place fusion (Lever 2, ~8-12 ms = +2-3% toward parity),
+which removes concat_cont (8.01 ms) and the conv-state writeback cpy off a bandwidth-bound,
+single-stream step where their full GPU-time is wall-clock.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]