From 2b57997df061e050f154b9e089e31874bc5b959a Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Thu, 25 Jun 2026 14:45:51 +0000
Subject: [PATCH] docs(paged): cudagraph-coverage - GDN serial chain IS graph-covered at B=128

Determine whether the ggml CUDA graph covers the gated-DeltaNet serial
chain at batch=128. It does: nothing in the GDN region forces
graph-disable (check_compability lists only split-buffers and large-batch
MUL_MAT_ID), and the recurrent head is constant for a steady 128-seq batch
so the inplace_ids state_dst offset + rs_head op_param + SSM input shapes
are stable across steps. The fused op does no host-sync / capture-time
cudaMalloc. The only re-warm is the per-256-token full-attention
block-table cadence (not a GDN op). The ~40% util is bandwidth-roofline
(SSM state traffic 66% of step bytes), not launch-gap idle - so no GDN
graph-safe lever; the only non-covered idle is the ~0.4% between-step host
cgraph rebuild.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 .../paged/CRITICALPATH_GAP_ANALYSIS.md | 255 ++++++++++++++++++
 1 file changed, 255 insertions(+)

diff --git a/backend/cpp/llama-cpp/patches/paged/CRITICALPATH_GAP_ANALYSIS.md b/backend/cpp/llama-cpp/patches/paged/CRITICALPATH_GAP_ANALYSIS.md
index f7a145819..3a1baee1a 100644
--- a/backend/cpp/llama-cpp/patches/paged/CRITICALPATH_GAP_ANALYSIS.md
+++ b/backend/cpp/llama-cpp/patches/paged/CRITICALPATH_GAP_ANALYSIS.md
@@ -98,3 +98,258 @@ compute - is the 62%-vs-40% busy gap. A timeline gap analysis on the existing ns
 trace should show idle GPU between the GDN sub-ops (conv -> gate -> recurrence -> gather)
 per layer; if it does, Lever 3 (gating-into-recurrence fusion) and/or decode CUDA-graph
 capture are the levers that will move the wall, unlike P2a/Lever 2.
+
+---
+
+## roofline-decode (READ-ONLY, no GPU) - decode-step roofline + bubble budget + parity target
+
+Goal: bound the q36-27b-nvfp4 decode step by the bandwidth floor and the compute floor,
+compare to measured llama 384 ms/step vs vLLM 327 ms/step, size the unexplained "bubble
+budget", and state the step-time target for parity. Cross-checks vllm-gdn-compare above.
+
+### Inputs (measured / GGUF metadata, no new GPU work)
+- DGX GB10 (sm_121): LPDDR5x **273 GB/s**, dense NVFP4 MMA peak ~**500 TFLOP/s** (sparse ~1 PFLOP/s).
+  Both numbers are shared identically by llama and vLLM (same HW, same weights).
+- q36-27b-nvfp4 GGUF (arch qwen35): block_count **64** (full_attention_interval 4 ->
+  **16 full-attn + 48 GDN** layers), d_model 5120, FFN 17408, attn 24 heads / 4 kv-heads,
+  head_dim 256, ssm conv_kernel 4 / state_size 128 / group_count 16 / inner_size 6144.
+  Weight file = **18.804 GB** (NVFP4 + FP8 block-scales + f16 norms/embd), fully GPU-resident.
+- Measured llama decode (dense_base.out, -fa -npp128 -ntg128 -npl128, B=128, 128 TG steps):
+  T_TG 49.154 s / 128 = **384.0 ms/step** (S_TG 333.3 tok/s; matches STATE "~381 ms").
+- vLLM dense ref **391 tok/s @128** => 128/391 = **327.4 ms/step**.
+
+### 1. Bandwidth floor (bytes that MUST cross LPDDR5x per step / 273 GB/s)
+| term | bytes/step | basis |
+|------|-----------|-------|
+| Weights (batched, read ONCE/step, reused across all 128 seqs) | ~18.4 GB | file 18.804 minus ~0.4 GB sparse input-embd lookup; lm_head fully read |
+| SSM state R+W (48 GDN layers x 128 seqs) | ~19 GB (bracket 10-38) | ~1.5 MB/layer/seq R+W, kernel-grounded: gated_delta_net 1.47 ms/call -> ~400 MB/call @273 GB/s; theoretical d_inner*d_state f32 doubles it |
+| KV cache read (16 attn layers, avg 192 ctx, 128 seqs, f16) | ~1.6 GB | 64 KiB/tok over 16 layers; max-ctx 256 -> 2.1 GB |
+| Activation/quantize/gate intermediates R+W | ~3 GB | quantize_mmq_nvfp4 + k_bin_bcast + silu + rms tensors @ batch 128 |
+| **TOTAL** | **~42 GB/step** | bracket 32-61 GB |
+
+**Bandwidth floor = 42/273 = ~154 ms/step** (central; bracket ~117-224 ms).
+Weight-only HARD sub-floor (unavoidable, both engines pay it): 18.4/273 = **67 ms/step**.
+
+KEY: even at batch 128 the FP4 GEMM is STILL memory-bound, not MMA-bound. AI = 2*128/0.53 B
+= ~483 FLOP/byte << GB10 ridge 500e12/273e9 = 1832 FLOP/byte. The 42% `mul_mat_q`
+GPU time is weight-DRAM streaming, not tensor cores -> first-principles reason P2a (-26% MMA
+occupancy) and Lever-2 were FLAT on decode.
+
+### 2. Compute floor (FLOPs / ~500 TFLOP/s dense FP4)
+| term | FLOPs/step | floor |
+|------|-----------|-------|
+| FP4 GEMM (all dense matmuls): 2 * ~26e9 params * 128 seqs | 6.66 TFLOP | / 500e12 = **13.3 ms** (6.7 ms @ sparse 1 PFLOP) |
+| GDN recurrence (state update + read-out, 48 layers x 128 seqs) | ~0.04 TFLOP | < 0.1 ms (state-bound, not FLOP-bound) |
+| **TOTAL** | ~6.7 TFLOP | **~13 ms/step (~4% of the step)** |
+
+### 3. Verdict / bubble budget / parity target
+```
+                   compute floor   bandwidth floor     MEASURED step   x above bw-floor
+GB10 dense-FP4     ~13 ms          ~154 ms (117-224)
+vLLM dense @128                                        327 ms          ~2.1x (1.5-2.8x)
+llama dense @128                                       384 ms          ~2.5x (1.7-3.3x)
+```
+- **Binding floor = bandwidth (~130-155 ms), NOT compute (~13 ms).** Compute floor is ~25x
+  below the wall -> FP4-MMA throughput is irrelevant; matches P2a/Lever-2 flatness exactly.
+- **Both engines run ~2-2.8x ABOVE the bandwidth floor.** vLLM itself reaches only ~40-47%
+  LPDDR5x efficiency -> even the reference is LATENCY/occupancy bound, not byte-bound.
+  Confirms prior "decode is 2.5x above its bandwidth floor" work.
+- **Bubble budget** (wall - bandwidth floor, central 154): vLLM ~**173 ms**, llama ~**230 ms**.
+  = kernel-launch latency + occupancy gaps + serial data-dependency stalls.
+- **The llama-vs-vLLM gap = 384 - 327 = 57 ms/step (14.8% of the step) is 100% BUBBLE.**
+  Both engines share IDENTICAL bandwidth AND compute floors (same 18.8 GB NVFP4 weights, same
+  SSM state, same KV, same GB10 273 GB/s + 500 TFLOP). Bytes and FLOPs are byte-for-byte equal,
+  so the entire 57 ms differential lives in critical-path bubble - NOT bandwidth, NOT compute.
+
+**Parity target: 327 ms/step (391 tok/s @128). llama must shave 57 ms/step = 14.8% off 384 ms.**
+Neither floor can move (both already shared with vLLM), so the 57 ms can ONLY come from
+collapsing critical-path bubbles -> structurally-correct case for Lever 3 (fuse the serial GDN
+gating chain) and/or decode CUDA-graph capture, exactly the two wins vllm-gdn-compare found vLLM
+already has. P2a/Lever-2 were flat because they freed OVERLAPPED GPU-busy time BELOW the floor.
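
The floor arithmetic in this section can be reproduced in a few lines. The sketch below is illustrative only: `StepBytes` and the helper functions are hypothetical names, not part of llama.cpp or ggml, and the inputs are simply the central estimates from the tables above rather than new measurements.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper type for the roofline arithmetic; the numbers fed in
// are the central estimates from the tables above, not new measurements.
struct StepBytes {
    double weights_gb;      // weights streamed once per step
    double ssm_state_gb;    // SSM state read + write
    double kv_gb;           // KV cache read
    double activations_gb;  // activation / quantize intermediates
};

// Total bytes per step in GB.
double total_gb(const StepBytes& b) {
    return b.weights_gb + b.ssm_state_gb + b.kv_gb + b.activations_gb;
}

// Milliseconds needed to move `gb` gigabytes at `bw_gb_per_s`.
double bandwidth_floor_ms(double gb, double bw_gb_per_s) {
    assert(bw_gb_per_s > 0.0);
    return gb / bw_gb_per_s * 1e3;
}

// Milliseconds needed to execute `tflop` TFLOP at `peak_tflop_per_s`.
double compute_floor_ms(double tflop, double peak_tflop_per_s) {
    assert(peak_tflop_per_s > 0.0);
    return tflop / peak_tflop_per_s * 1e3;
}

// Ridge point in FLOP/byte: below it a kernel is memory-bound.
double ridge_flop_per_byte(double peak_tflop_per_s, double bw_gb_per_s) {
    assert(bw_gb_per_s > 0.0);
    return (peak_tflop_per_s * 1e12) / (bw_gb_per_s * 1e9);
}

// Bubble budget: measured wall time minus the binding (larger) floor.
double bubble_ms(double wall_ms, double bw_floor_ms, double fl_floor_ms) {
    return wall_ms - std::fmax(bw_floor_ms, fl_floor_ms);
}
```

With the central inputs (18.4 + 19 + 1.6 + 3 GB at 273 GB/s, 6.66 TFLOP at 500 TFLOP/s) these helpers give roughly 154 ms and 13 ms for the two floors, and `bubble_ms` gives roughly 230 ms for the 384 ms llama step and 173 ms for the 327 ms vLLM step, in line with the figures above.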
+
+### Cross-check / sizing for the gap-analysis (timeline) agent
+- GPU-busy sum from nsysab_new (ntg24 window, /24 steps): FP4 GEMM ~243 + gated_delta_net ~76 +
+  GDN glue (k_bin_bcast mul ~49, silu ~34, concat ~19, gdn_gather ~21, ssm_conv ~12, l2_norm ~6,
+  op_add ~10) ~152 + quantize ~62 = **~555 ms GPU-busy vs 384 ms wall** -> sum >> wall by ~1.45x,
+  so heavy overlap is real and GPU-busy% buckets ARE misleading. Do NOT sum kernel times; the
+  wall is the critical path.
+- Concrete budget: if the inter-kernel IDLE gaps + non-overlapped launch latency along the serial
+  GDN chain (ssm_conv -> gated_delta_net -> gating elementwise -> gdn_gather, x48 layers x N steps)
+  sum to **>= 57 ms/step**, Lever 3 is justified AND sized. If those critical-path gaps total
+  < 57 ms, parity is NOT reachable via GDN-gate fusion alone and the gap is elsewhere (GDN core
+  kernel slower than vLLM fused_recurrent, or scheduler/H2D).
+- Structural corroboration (agrees with vllm-gdn-compare): vLLM runs the GDN region as 2 fused
+  Triton kernels under a full CUDA graph; llama splits it into ssm_conv + gated_delta_net +
+  gdn_gather + ~6 serially data-dependent elementwise gate kernels. ~384 host-launched nodes/step
+  on a chain that cannot overlap is precisely the mechanism that produces llama's extra ~57 ms.
+
+Floors are engine-independent lower bounds; the timeline agent owns proving the 57 ms is
+recoverable on the critical path. Roofline says: target 327 ms, shave 57 ms, and it can ONLY
+come from bubble (not bytes, not FLOPs).
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+
+## lever3-design (READ-ONLY, no GPU) - concrete fusion of the serial GDN gate chain into the recurrence kernel
+
+### What actually feeds/consumes the recurrence kernel today (qwen35 decode, fused_gdn_ar)
+Traced in `src/models/qwen35.cpp::build_layer_attn_linear` ->
+`src/models/delta-net-base.cpp::build_recurrent_attn` (fused !keep branch) ->
+`ggml/src/ggml-cuda/gated_delta_net.cu`. The model is GDA (g->ne[0]==1, scalar
+gate per head; kda=false in the kernel). S_v = ssm_d_state = 128, so the kernel
+runs the `<128>` template: warp_size==S_v==128, num_warps=4, rows_per_lane==1,
+grid (H, n_seqs, S_v/4=32 z-tiles). Each warp owns one output column `col`; the
+128 lanes hold the full head-vector (one element per lane).
+
+Serial pre-GDN gate chain (each a standalone host-launched ggml node, all on the
+critical path between the in-proj GEMMs and the recurrence):
+1. `beta = ggml_sigmoid(ssm_beta @ cur)` -> kernel reads `beta_val = *beta_t`
+2. `alpha = ssm_alpha @ cur`
+3. `ggml_add(alpha, ssm_dt)` (k_bin_bcast op_add)
+4. `ggml_softplus(...)` (unary_op, 1248 inst)
+5. `ggml_mul(softplus, ssm_a)` (k_bin_bcast op_mul; ssm_a = -exp(A_log), baked) -> g; kernel does `expf(g_t)`
+6. `ssm_conv` then `ggml_silu` (conv path; may already hit the upstream SSM_CONV+SILU fuse) -> v_conv, and the q/k slices
+7. `ggml_l2_norm(q_conv)`, `ggml_l2_norm(k_conv)` (l2_norm_f32<32>, 2496 inst = 1248x2) -> kernel reads q_reg/k_reg
+
+Post-GDN gate (consumes kernel output):
+8. `build_norm_gated(output, ssm_norm, z)` = rms_norm(output)*ssm_norm (RMS_NORM+MUL fused) then `silu(z)*.` (unary_gated_op, the 5.9% bucket)
+
+### The fusion: fold steps 1,3,4,5,7 INTO gated_delta_net_cuda (a "fused-gate" mode)
+These five are exactly the per-(head) scalar gates (sigmoid beta; softplus+dt+ssm_a
+-> g) and the per-head-vector L2 norms of q/k - and the kernel ALREADY loads every
+operand it needs:
+- It reads `beta_val` (scalar) -> pass RAW beta, do `beta_val = 1.f/(1.f+expf(-raw))` in-kernel. Removes node 1.
+- It reads `g_t` (scalar, GDA) and does `expf(g_t)` -> pass RAW alpha + per-head `ssm_dt[h]` + per-head `ssm_a[h]`, compute `g = ssm_a[h]*op_softplus(alpha + ssm_dt[h])` in-kernel, keep the existing `expf(g)`. `op_softplus(x) = (x>20)?x:logf(1+expf(x))` (copy `ggml_compute_softplus_f32` verbatim). Removes nodes 3,4,5.
+- It loads the full q/k head-vector into `q_reg[r]`/`k_reg[r]` (one element per lane at S_v==128). L2-normalize in registers: `float qss = warp_reduce_sum<128>(q_reg[0]*q_reg[0]); q_reg[0] *= rsqrtf(qss + eps* ... )` matching the l2_norm formula, same for k. Each warp redundantly recomputes the (identical) norm for its column - cheap, no shared mem, no extra launch. Removes nodes 7 (x2). `eps` (= f_norm_rms_eps) passed as a kernel float param.
+
+That collapses the pre-GDN serial chain to just: in-proj GEMMs -> build_conv_state(concat) -> ssm_conv(+silu) -> [single fused gated_delta_net kernel]. 6 gate kernels (steps 1,3,4,5 plus the two l2_norm launches of step 7) removed per SSM layer per decode step.
+
+### Why the OUTPUT gate (step 8) is NOT folded into this kernel
+The output gated-rmsnorm reduces over the full head_v_dim (S_v=128) per (head,seq).
+In this kernel those 128 elements are produced by 128 DIFFERENT (warp x z-tile)
+blocks (4 warps x 32 z-tiles), so an in-kernel head-wide reduction would need a
+grid-global sync - not feasible without a grid redesign. Leave step 8 as the
+existing RMS_NORM+MUL + unary_gated fusion (already 2 launches, not in scope).
+The conv-silu (step 6) is a convolution, structurally separate; rely on the
+existing upstream SSM_CONV(+ADD)+SILU fuse rather than pulling it into the
+recurrence kernel.
+
+### Implementation scope
+- `ggml/include/ggml.h`: new builder `ggml_gated_delta_net_inplace_ids_fused_gate(ctx, q_raw, k_raw, v, alpha_raw, beta_raw, cache4d, state_dst, ids, ssm_a, ssm_dt, rs_head, eps)` (or an op-param flag GDN_FUSE_GATE on the existing builder + 2 extra srcs). src budget: current op uses src[0..7]; add ssm_a -> src[8], ssm_dt -> src[9]. GGML_MAX_SRC==10, so it fits EXACTLY (zero headroom - note for review).
+- `ggml/src/ggml.c`: builder + a new op-param i32 flag (e.g. params[2]=fuse_gate) + f32 param for eps; assert shapes (ssm_a/ssm_dt are [num_v_heads]).
+- `ggml/src/ggml-cuda/gated_delta_net.cu`: in `gated_delta_net_cuda`, gate the in-kernel sigmoid/softplus-gate/l2norm behind a `bool FUSE_GATE` template param (4th template bool, keeps the non-fused path byte-identical and avoids register bloat when off). Read ssm_a[h_idx], ssm_dt[h_idx]; compute g per head; sigmoid raw beta; warp-reduce q_reg/k_reg sumsq -> rsqrtf scale. Plumb the 2 new src pointers + eps through `launch_gated_delta_net` and `ggml_cuda_op_gated_delta_net` (read src[8],src[9], op_param eps/flag). The `gdn_gather_nonident` path is unaffected (it gathers state, not q/k/g/beta).
+- `ggml/src/ggml-cpu/ops.cpp`: mirror in `ggml_compute_forward_gated_delta_net_one_chunk` (host sigmoid/softplus/l2norm before the per-token math) for CPU parity / test-backend-ops.
+- `src/models/delta-net-base.cpp::build_recurrent_attn` (the fused !keep + ids branch, and the inplace non-ids branch): call the fused-gate builder, pass raw alpha/beta/q/k + ssm_a + ssm_dt + eps.
+- `src/models/qwen35.cpp` / `qwen35moe.cpp` / `qwen3next.cpp` `build_layer_attn_linear`: when the fuse flag is on, DROP `ggml_sigmoid(beta)`, `ggml_add(alpha,dt)`, `ggml_softplus`, `ggml_mul(.,ssm_a)`, and the two `ggml_l2_norm` nodes; hand the raw tensors + `model.layers[il].ssm_a`, `ssm_dt` to build_recurrent_attn. The conv-silu and z/output-gate path are unchanged.
+- Guard the whole thing behind `cparams.fused_gdn_gate` / env `LLAMA_FUSE_GDN_GATE` (default OFF) so it A/Bs against the clean Lever-1 build exactly like P2a/Lever-2, and only the recurrent (GDA) qwen35 family path is touched.
+
+### Numeric considerations / bit-exactness
+- sigmoid(beta), softplus(alpha+dt), and the `g = ssm_a*softplus` mul/add are pointwise fp32 with the SAME formula/order as the standalone ggml ops -> these can be **bit-exact** (no reduction). softplus must copy `(x>20)?x:logf(1+expf(x))` exactly.
+- q/k l2norm is the ONE op with a reduction: the standalone `l2_norm_f32<32>` does its own warp/block reduction; the in-kernel `warp_reduce_sum<128>` tree may differ in the last ULP, and the eps placement (`x*rsqrt(sumsq+eps)` vs `x/max(sqrt(sumsq),eps)`) must match the ggml l2_norm exactly. Expect **near-bit-exact, not guaranteed byte-identical** greedy output. So unlike Levers 1/2, gate this on a **PPL/KL tolerance** (KL logit delta < ~1e-3, PPL delta within noise) rather than md5 identity. If byte-identity is required, exclude l2norm from the fold (keep nodes 7) and fuse only sigmoid/softplus/gate - but that drops the value to ~0.3% and is probably not worth it.
+
+### Estimated kernels-removed-per-layer and the honest ceiling
+- Removed per SSM decode layer-step: sigmoid(beta) + add(dt) + softplus + mul(ssm_a) + l2norm(q) + l2norm(k) = **6 host-launched kernels -> 0**, collapsing 7 nodes (incl. recurrence) to 1. Across 48 SSM layers = **~288 launches/step removed** (matches the instance deltas: l2_norm 2496, softplus 1248, sigmoid 1248, plus the alpha-add/ssm_a-mul share of op_add/op_mul).
+- GPU-BUSY ceiling of the removed ops is small: l2_norm 1.0% + softplus ~0% + sigmoid 0.3% + (dt add + ssm_a mul share of op_add 1.7% / op_mul 8.5%). The point of Lever 3 is NOT the freed busy-time (P2a/Lever-2 proved freeing overlapped busy-time is flat) - it is removing ~288 LAUNCH BUBBLES/step that sit on the serial conv->gate->recurrence dependency where the GPU is otherwise idle. The win is wall-clock only if those specific bubbles are on the critical path.
+
+### RISK (must be settled before building)
+1. **Same trap as P2a/Lever-2 if the bubbles overlap.** If the scheduler already
+   overlaps these pre-GDN gate kernels with an adjacent layer's 42% mul_mat_q GEMM,
+   Lever 3 is FLAT. **Precondition: the timeline gap analysis must show idle GPU
+   between ssm_conv -> (sigmoid/softplus/l2norm) -> gated_delta_net per layer** at
+   batch=128 BEFORE building. If the trace shows the gate ops back-to-back with no
+   gap (overlapped), do NOT build op-fusion; go to lever (2) below.
+2. **The bigger bubbles may be elsewhere on the chain.** The large buckets are op_mul
+   8.5% and unary_gated 5.9% - much of which is the POST-GDN output gate and
+   FFN, which this fusion does NOT touch. If the gap analysis pins the dominant idle
+   to the post-GDN region or to inter-layer launch latency generally, the
+   higher-leverage Lever 3 is **decode CUDA-graph capture** (removes host launch
+   latency for ALL ~384 nodes/step at once, exactly what vLLM does), not per-op
+   fusion. CUDA-graph is the strictly larger hammer here; op-fusion only helps the
+   pre-GDN slice. Recommend measuring the per-sub-op gap first and preferring the
+   CUDA-graph lever if the bubbles are spread across the step rather than concentrated
+   in the pre-GDN gate slice.
+3. **src-slot exhaustion** (src[8],src[9] use the last 2 of GGML_MAX_SRC=10) - any
+   later op needing more srcs on this node has zero headroom; flag for review.
+
+## cudagraph-coverage (READ-ONLY, no GPU) - does the CUDA graph cover the GDN serial chain at B=128?
+
+### Verdict: YES, the graph covers GDN at batch=128 (dense model). No GDN op forces graph-disable or per-step re-instantiation.
+
+Source: `ggml/src/ggml-cuda/ggml-cuda.cu` (graph state machine), `gated_delta_net.cu`
+(fused op), `src/models/delta-net-base.cpp` (graph build), `src/llama-memory-recurrent.cpp`
+(recurrent head), all on dev tree `~/llama-paged-dev` (HEAD df1cc97, Lever-1). Cross-checked
+against the committed A2_CUDAGRAPH_DECODE.md + DECODE_PARITY_EXPLORE.md measurements.
+
+### How graph-disable / re-instantiation are decided (this fork's state machine)
+- `ggml_cuda_graph_check_compability` (ggml-cuda.cu:3251) disables the graph for ONLY two
+  reasons: (a) a split-buffer src, (b) `GGML_OP_MUL_MAT_ID` with non-quantized weights OR
+  `node->ne[2] > get_mmvq_mmid_max(...)` [TAG_MUL_MAT_ID_CUDA_GRAPHS]. GATED_DELTA_NET,
+  SSM_CONV, SSM_SCAN, GET_ROWS, CONCAT, the gating elementwise ops are NOT in the disable
+  list. So no GDN op forces graph-disable.
+- `ggml_cuda_graph_update_required` (3297) memcmps, per node, the full `ggml_tensor` struct
+  (incl. `op_params` and `data`) + each src's `data` ptr / `ne` / `nb`. ANY delta -> the
+  warmup state machine (ggml_backend_cuda_graph_compute:4464) resets `warmup_complete` and the
+  WHOLE graph (one key = `cgraph->nodes[0]`) runs eager that step until stable again. Buffer
+  CONTENTS are NOT compared - a contents-only change (e.g. ids values) is graph-safe.
+
+### Why the GDN region's properties are STABLE across steady decode steps
+The fused decode path is `ggml_gated_delta_net_inplace_ids` (delta-net-base.cpp:558-560):
+```
+state_dst = ggml_view_2d(ctx, ssm_states_all, n_embd_s, n_seqs, nb1,
+                         kv_head * n_embd_s * elsize);   // offset = kv_head
+ggml_gated_delta_net_inplace_ids(..., cache4d, state_dst, ids, /*rs_head=*/(int)kv_head);
+```
+Both the `state_dst` view byte-offset and the `rs_head` op_param (read back as
+`ggml_get_op_params_i32(dst,1)` in gated_delta_net.cu:330) derive from
+`kv_head = mctx_cur->get_head()`. In `llama_memory_recurrent::find_slot`
+(llama-memory-recurrent.cpp:610-689) the n_seqs used cells are SWAPPED into the contiguous
+range `[min .. min+n_seqs-1]` and `head = min`. The recurrent cache does NOT grow per token
+(one state cell per sequence, unlike the KV cache). For a steady 128-seq continuous batch the
+same sequences own the same tails every step, so `min`/`head` are constant (=0) -> state_dst
+offset constant, rs_head op_param constant. The GDN inputs (q,k,v,g,b, cache4d, ids) are
+fixed-shape (n_seqs=128, n_rs slots), so ne/nb are stable, and ggml-alloc hands out the same
+compute-buffer offsets each same-topology ubatch -> data ptrs stable. The `ids` (s_copy)
+tensor's CONTENTS change per step but its address/ne/nb do not -> graph-safe.
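
The stability argument above boils down to "addresses, shapes, strides and op_params are compared; contents are not". The sketch below models that rule in isolation. It is a deliberately simplified illustration: `NodeProps`, `gdn_node` and `update_required` are hypothetical names and do not reproduce the real `ggml_tensor` layout or the fork's `ggml_cuda_graph_update_required`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for the per-node properties the update
// check is described as comparing. This is NOT the real ggml_tensor layout.
struct NodeProps {
    const void*            data;       // tensor address (not its contents)
    std::array<int64_t, 4> ne;         // shape
    std::array<size_t, 4>  nb;         // strides in bytes
    std::array<int32_t, 2> op_params;  // op_params[1] models rs_head
};

// Byte offset of the state_dst view, following the expression quoted above.
size_t state_dst_offset(size_t kv_head, size_t n_embd_s, size_t elsize) {
    return kv_head * n_embd_s * elsize;
}

// Build the modeled properties of the fused GDN node for a given head.
NodeProps gdn_node(const char* states_base, size_t kv_head,
                   size_t n_embd_s, size_t n_seqs, size_t elsize) {
    assert(states_base != nullptr);
    NodeProps p;
    p.data = states_base + state_dst_offset(kv_head, n_embd_s, elsize);
    p.ne = {static_cast<int64_t>(n_embd_s), static_cast<int64_t>(n_seqs), 1, 1};
    p.nb = {elsize, n_embd_s * elsize, n_embd_s * elsize * n_seqs,
            n_embd_s * elsize * n_seqs};
    p.op_params = {0, static_cast<int32_t>(kv_head)};
    return p;
}

// True when any node's address, shape, stride or op_param changed since the
// previous step. Buffer contents are deliberately never inspected.
bool update_required(const std::vector<NodeProps>& prev,
                     const std::vector<NodeProps>& cur) {
    if (prev.size() != cur.size()) return true;
    for (size_t i = 0; i < cur.size(); ++i) {
        const NodeProps& a = prev[i];
        const NodeProps& b = cur[i];
        if (a.data != b.data || a.ne != b.ne || a.nb != b.nb ||
            a.op_params != b.op_params) {
            return true;
        }
    }
    return false;
}
```

In this model, rebuilding the node with the same `kv_head` after overwriting the state buffer leaves `update_required` false, while moving `kv_head` from 0 to 1 changes both the modeled `data` address and `op_params[1]` and so forces a re-warm. That is the same distinction the paragraph above draws between the per-step `ids` contents and a moving recurrent head.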
+
+### The fused GDN op is capture-safe (no host-sync, no capture-time cudaMalloc)
+`gated_delta_net.cu`: the op launches `gdn_gather_nonident_kernel` + `gated_delta_net_cuda`
+on `ctx.stream()` with NO `cudaStreamSynchronize` / host `cudaMemcpy` / `cudaMalloc`. The
+gather scratch is `ggml_cuda_pool_alloc` (VMM pool, served from reserved memory after warmup,
+no real cudaMalloc during capture). `gdn_gather_nonident` early-returns for identity sequences
+(`ids[s]==rs_head+s`), which is the steady-decode case, so its 3.7% is a
+launched-but-mostly-noop kernel - still captured into the graph like any other. Capture succeeds
+(the build runs, graphs engage), confirming none of these break stream capture.
+
+### The only re-instantiation is NOT GDN-driven
+A2 already measured the re-warm cadence: the graph re-instantiates every ~256 tokens because
+the FULL-ATTENTION block-table input `idx` has `ne[0]=GGML_PAD(n_kv,256)` (and kq_mask in
+lockstep) - those step at 256-token boundaries (paged-attn.cpp:199-213). ~97% of decode steps
+replay the captured graph. This is a full-attention-layer input, not a GDN op. (The unpadded
+`LLAMA_KV_PAGED_GATHER` fallback grows `ne[0]` every step and runs pure-eager, but that is not
+the default decode path and is not the GDN/SSM path.)
+
+### Reconciliation with the "~40% util / 60% idle bubbles" premise (it is refuted for GDN)
+The committed nsys sweeps (A2_CUDAGRAPH_DECODE.md, DECODE_PARITY_EXPLORE.md) show the steady
+decode is ~99.4-99.5% GPU-BUSY with graphs ON (measured with `--cuda-graph-trace=node`; a
+graphs-ON trace WITHOUT that flag under-counts GPU rows and falsely reports idle - Trap #2).
+Total exposed idle is ~0.65% of the step; the within-step launch fraction graphs remove is
+0.34% (0.37%->0.11%) and is ALREADY collapsed - the GDN sub-op launch gaps are inside the
+captured region. The "40% utilization" in the STATE is BANDWIDTH-roofline util, not idle SMs:
+decode moves ~55.5 GB/step at 2.48x the 273 GB/s floor, SSM state r+w = 66% of step bytes. The
+GDN recurrence is memory-bandwidth-bound at low occupancy (~12-16%), not launch-gap-bound. So
+"60% idle bubbles on the serial GDN chain" is not supported by the traces; the gap to vLLM is
+SSM-state memory traffic, consistent with P2a/Lever-2 being flat (freeing GPU-busy time, not
+wall-clock).
+
+### Graph-safe lever for GDN: none new
+- GDN is already graph-covered; there is no "make the GDN ops graph-safe" lever to build - they
+  are already safe and captured.
+- The only genuinely graph-NON-covered idle is the BETWEEN-step host gap (~2 ms/step, ~0.4%):
+  ggml rebuilds/reallocs the cgraph each step with a new `cgraph->uid`, so the uid fast-path in
+  ggml_cuda_graph_update_required never fires and the host re-dispatches ~3100 launches on the
+  Grace cores between graph launches (vLLM builds its graph once + persistent device metadata).
+  A persistent/reused cgraph across decode steps would let the uid fast-path fire and shrink the
+  host gap - but at 0.4% of the step it is second-order to the SSM bandwidth floor.
+- CAVEAT (MoE, qwen35moe): MUL_MAT_ID at B=128 can trip [TAG_MUL_MAT_ID_CUDA_GRAPHS]
+  (`ne[2] > mmvq_mmid_max`), disabling the WHOLE MoE-decode graph (GDN included) into eager.
+  That is a MUL_MAT_ID disable, not a GDN break, and does not touch the dense 335 tok/s headline;
+  worth a separate confirm for the MoE model.
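
The MoE caveat follows directly from the two disable reasons listed earlier in this section. The sketch below restates them as a predicate so the dense and MoE cases can be compared side by side. It is a simplified model under stated assumptions: `Op`, `GraphNode`, `node_disables_graph` and `graph_disabled` are hypothetical names, and `mmid_max` is a placeholder for whatever `get_mmvq_mmid_max(...)` returns, which this document does not state.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical op tags; only the distinction between MUL_MAT_ID and
// everything else matters for this model.
enum class Op { MulMat, MulMatId, GatedDeltaNet, SsmConv, GetRows, Concat };

struct GraphNode {
    Op      op;
    bool    has_split_buffer_src;  // some src lives in a split buffer
    bool    weights_quantized;     // only meaningful for MulMatId
    int64_t ne2;                   // models node->ne[2]
};

// Models the two disable reasons listed above. `mmid_max` stands in for the
// value of get_mmvq_mmid_max(...), which this document does not state.
bool node_disables_graph(const GraphNode& n, int64_t mmid_max) {
    assert(mmid_max >= 0);
    if (n.has_split_buffer_src) return true;
    if (n.op == Op::MulMatId) {
        return !n.weights_quantized || n.ne2 > mmid_max;
    }
    return false;
}

// One offending node sends the whole graph to eager execution.
bool graph_disabled(const std::vector<GraphNode>& nodes, int64_t mmid_max) {
    for (const GraphNode& n : nodes) {
        if (node_disables_graph(n, mmid_max)) return true;
    }
    return false;
}
```

Under this model a dense step made only of GEMM, SSM_CONV and GATED_DELTA_NET nodes never disables the graph, whereas appending a single MUL_MAT_ID node whose `ne2` exceeds the placeholder threshold disables it for every node, GDN included. Any concrete threshold used when trying this out is an assumption, not a measured value.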