docs(paged): cudagraph-coverage - GDN serial chain IS graph-covered at B=128

Determine whether the ggml CUDA graph covers the gated-DeltaNet serial chain at batch=128. It does:

- Nothing in the GDN region forces graph-disable (check_compability lists only split-buffers and large-batch MUL_MAT_ID).
- The recurrent head is constant for a steady 128-seq batch, so the inplace_ids state_dst offset, the rs_head op_param, and the SSM input shapes are stable across steps.
- The fused op does no host-sync and no capture-time cudaMalloc.
- The only re-warm is the per-256-token full-attention block-table cadence (not a GDN op).

The ~40% util is bandwidth-roofline utilization (SSM state traffic is 66% of step bytes), not launch-gap idle, so there is no GDN graph-safe lever; the only non-covered idle is the ~0.4% between-step host cgraph rebuild.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 09:26:55 -04:00 · 2026-06-25 14:45:51 +00:00
parent e597a8ac78
commit 2b57997df0
1 changed files with 255 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/CRITICALPATH_GAP_ANALYSIS.md
+++ b/backend/cpp/llama-cpp/patches/paged/CRITICALPATH_GAP_ANALYSIS.md
@@ -98,3 +98,258 @@ compute - is the 62%-vs-40% busy gap. A timeline gap analysis on the existing ns
 trace should show idle GPU between the GDN sub-ops (conv -> gate -> recurrence ->
 gather) per layer; if it does, Lever 3 (gating-into-recurrence fusion) and/or
 decode CUDA-graph capture are the levers that will move the wall, unlike P2a/Lever 2.
+
+---
+
+## roofline-decode (READ-ONLY, no GPU) - decode-step roofline + bubble budget + parity target
+
+Goal: bound the q36-27b-nvfp4 decode step by the bandwidth floor and the compute floor,
+compare to measured llama 384 ms/step vs vLLM 327 ms/step, size the unexplained "bubble
+budget", and state the step-time target for parity. Cross-checks vllm-gdn-compare above.
+
+### Inputs (measured / GGUF metadata, no new GPU work)
+- DGX GB10 (sm_121): LPDDR5x **273 GB/s**, dense NVFP4 MMA peak ~**500 TFLOP/s** (sparse ~1 PFLOP/s).
+  Both numbers are shared identically by llama and vLLM (same HW, same weights).
+- q36-27b-nvfp4 GGUF (arch qwen35): block_count **64** (full_attention_interval 4 ->
+  **16 full-attn + 48 GDN** layers), d_model 5120, FFN 17408, attn 24 heads / 4 kv-heads,
+  head_dim 256, ssm conv_kernel 4 / state_size 128 / group_count 16 / inner_size 6144.
+  Weight file = **18.804 GB** (NVFP4 + FP8 block-scales + f16 norms/embd), fully GPU-resident.
+- Measured llama decode (dense_base.out, -fa -npp128 -ntg128 -npl128, B=128, 128 TG steps):
+  T_TG 49.154 s / 128 = **384.0 ms/step** (S_TG 333.3 tok/s; matches STATE "~381 ms").
+- vLLM dense ref **391 tok/s @128** => 128/391 = **327.4 ms/step**.
+
+### 1. Bandwidth floor (bytes that MUST cross LPDDR5x per step / 273 GB/s)
+| term | bytes/step | basis |
+|------|-----------|-------|
+| Weights (batched, read ONCE/step, reused across all 128 seqs) | ~18.4 GB | file 18.804 minus ~0.4 GB sparse input-embd lookup; lm_head fully read |
+| SSM state R+W (48 GDN layers x 128 seqs) | ~19 GB (bracket 10-38) | ~1.5 MB/layer/seq R+W, kernel-grounded: gated_delta_net 1.47 ms/call -> ~400 MB/call @273 GB/s; theoretical d_inner*d_state f32 doubles it |
+| KV cache read (16 attn layers, avg 192 ctx, 128 seqs, f16) | ~1.6 GB | 64 KiB/tok over 16 layers; max-ctx 256 -> 2.1 GB |
+| Activation/quantize/gate intermediates R+W | ~3 GB | quantize_mmq_nvfp4 + k_bin_bcast + silu + rms tensors @ batch 128 |
+| **TOTAL** | **~42 GB/step** | bracket 32-61 GB |
+
+**Bandwidth floor = 42/273 = ~154 ms/step** (central; bracket ~117-224 ms).
+Weight-only HARD sub-floor (unavoidable, both engines pay it): 18.4/273 = **67 ms/step**.
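The floor arithmetic can be re-derived with a few lines of Python. This is a minimal sketch that takes the per-term byte estimates from the table as given (they are estimates, not new measurements); the variable and function names are illustrative only.

```python
# Re-derive the bandwidth floor from the per-step byte estimates in the table.
# Inputs are the central estimates quoted above; names are illustrative.
LPDDR5X_GBPS = 273.0  # GB/s, GB10 memory bandwidth

bytes_per_step_gb = {
    "weights": 18.4,
    "ssm_state_rw": 19.0,
    "kv_cache_read": 1.6,
    "activations": 3.0,
}


def floor_ms(total_gb: float, bandwidth_gbps: float = LPDDR5X_GBPS) -> float:
    """Time in milliseconds needed to stream total_gb at bandwidth_gbps."""
    return total_gb / bandwidth_gbps * 1e3


total_gb = sum(bytes_per_step_gb.values())
bandwidth_floor_ms = floor_ms(total_gb)
weight_floor_ms = floor_ms(bytes_per_step_gb["weights"])

print(f"total {total_gb:.1f} GB/step -> floor {bandwidth_floor_ms:.0f} ms/step")
print(f"weight-only sub-floor {weight_floor_ms:.0f} ms/step")
```

Swapping in the bracket values (32 or 61 GB/step) reproduces the ~117 and ~224 ms ends of the range.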
+
+KEY: even at batch 128 the FP4 GEMM is STILL memory-bound, not MMA-bound. Arithmetic intensity
+(AI) = 2*128 FLOP / 0.53 B per weight = ~483 FLOP/byte << GB10 ridge 500e12/273e9 = 1832 FLOP/byte.
+The 42% `mul_mat_q<NVFP4,m=128>` GPU time is weight-DRAM streaming, not tensor cores -> the
+first-principles reason P2a (-26% MMA occupancy) and Lever-2 were FLAT on decode.
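The memory-bound claim is a comparison of two ratios, sketched below. The sketch assumes the simplified model used in the text (each weight byte is read once per step and contributes one multiply-add per sequence) and ignores activation traffic; the constant names are illustrative.

```python
# Arithmetic intensity of the batched FP4 GEMM versus the GB10 ridge point.
BATCH = 128               # sequences sharing one weight read per step
BYTES_PER_WEIGHT = 0.53   # NVFP4 value plus its share of block scales, as quoted
PEAK_FLOPS = 500e12       # dense FP4 peak, FLOP/s
BANDWIDTH = 273e9         # LPDDR5x, bytes/s

# Each weight does one multiply-add (2 FLOPs) per sequence in the batch.
intensity = 2 * BATCH / BYTES_PER_WEIGHT   # FLOP per byte actually achieved
ridge = PEAK_FLOPS / BANDWIDTH             # FLOP per byte where compute starts to bind

memory_bound = intensity < ridge

# Batch size at which this simplified model would reach the ridge.
batch_at_ridge = ridge * BYTES_PER_WEIGHT / 2

print(f"intensity {intensity:.0f} FLOP/B, ridge {ridge:.0f} FLOP/B, "
      f"memory-bound: {memory_bound}")
```

Under these assumptions the GEMM would need a batch of several hundred sequences before tensor-core throughput became the limit, which is why freeing MMA occupancy at B=128 does not move the wall.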
+
+### 2. Compute floor (FLOPs / ~500 TFLOP/s dense FP4)
+| term | FLOPs/step | floor |
+|------|-----------|-------|
+| FP4 GEMM (all dense matmuls): 2 * ~26e9 params * 128 seqs | 6.66 TFLOP | / 500e12 = **13.3 ms** (6.7 ms @ sparse 1 PFLOP) |
+| GDN recurrence (state update + read-out, 48 layers x 128 seqs) | ~0.04 TFLOP | < 0.1 ms (state-bound, not FLOP-bound) |
+| **TOTAL** | ~6.7 TFLOP | **~13 ms/step (~4% of the step)** |
+
+### 3. Verdict / bubble budget / parity target
+```
+                    compute floor   bandwidth floor    MEASURED step   x above bw-floor
+GB10 dense-FP4      ~13 ms          ~154 ms (117-224)
+vLLM dense @128                                        327 ms          ~2.1x (1.5-2.8x)
+llama dense @128                                       384 ms          ~2.5x (1.7-3.3x)
+```
+- **Binding floor = bandwidth (~130-155 ms), NOT compute (~13 ms).** Compute floor is ~25x
+  below the wall -> FP4-MMA throughput is irrelevant; matches P2a/Lever-2 flatness exactly.
+- **Both engines run ~2-2.8x ABOVE the bandwidth floor.** vLLM itself reaches only ~40-47%
+  LPDDR5x efficiency -> even the reference is LATENCY/occupancy bound, not byte-bound.
+  Confirms prior "decode is 2.5x above its bandwidth floor" work.
+- **Bubble budget** (wall - bandwidth floor, central 154): vLLM ~**173 ms**, llama ~**230 ms**.
+  = kernel-launch latency + occupancy gaps + serial data-dependency stalls.
+- **The llama-vs-vLLM gap = 384 - 327 = 57 ms/step (14.8% of the step) is 100% BUBBLE.**
+  Both engines share IDENTICAL bandwidth AND compute floors (same 18.8 GB NVFP4 weights, same
+  SSM state, same KV, same GB10 273 GB/s + 500 TFLOP/s). The bytes moved and the FLOPs executed
+  are the same for both, so the entire 57 ms differential lives in critical-path bubble - NOT
+  bandwidth, NOT compute.
+
+**Parity target: 327 ms/step (391 tok/s @128). llama must shave 57 ms/step = 14.8% off 384 ms.**
+Neither floor can move (both already shared with vLLM), so the 57 ms can ONLY come from
+collapsing critical-path bubbles -> structurally-correct case for Lever 3 (fuse the serial GDN
+gating chain) and/or decode CUDA-graph capture, exactly the two wins vllm-gdn-compare found vLLM
+already has. P2a/Lever-2 were flat because they freed OVERLAPPED GPU-busy time BELOW the floor.
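The verdict numbers follow directly from the central floor and the two measured step times. A short sketch, using the rounded values from the text (154 ms floor, 327 and 384 ms walls); the helper name `summarize` is hypothetical.

```python
# Bubble budget and parity gap from the central bandwidth floor.
BW_FLOOR_MS = 154.0
WALL_MS = {"vllm": 327.0, "llama": 384.0}


def summarize(wall_ms: float, floor_ms: float = BW_FLOOR_MS) -> dict:
    """Relate a measured step time to the bandwidth floor."""
    return {
        "x_above_floor": wall_ms / floor_ms,   # how far above the floor
        "bubble_ms": wall_ms - floor_ms,       # time not explained by bytes
        "bw_efficiency": floor_ms / wall_ms,   # fraction of peak bandwidth used
    }


report = {name: summarize(ms) for name, ms in WALL_MS.items()}
gap_ms = WALL_MS["llama"] - WALL_MS["vllm"]
gap_share = gap_ms / WALL_MS["llama"]

for name, row in report.items():
    print(f"{name}: {row['x_above_floor']:.1f}x floor, "
          f"bubble {row['bubble_ms']:.0f} ms, "
          f"bw efficiency {row['bw_efficiency']:.0%}")
print(f"parity gap {gap_ms:.0f} ms/step = {gap_share:.1%} of the llama step")
```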
+
+### Cross-check / sizing for the gap-analysis (timeline) agent
+- GPU-busy sum from nsysab_new (ntg24 window, /24 steps): FP4 GEMM ~243 + gated_delta_net ~76 +
+  GDN glue (k_bin_bcast mul ~49, silu ~34, concat ~19, gdn_gather ~21, ssm_conv ~12, l2_norm ~6,
+  op_add ~10) ~152 + quantize ~62 = **~555 ms GPU-busy vs 384 ms wall** -> sum >> wall by ~1.45x,
+  so heavy overlap is real and GPU-busy% buckets ARE misleading. Do NOT sum kernel times; the
+  wall is the critical path.
+- Concrete budget: if the inter-kernel IDLE gaps + non-overlapped launch latency along the serial
+  GDN chain (ssm_conv -> gated_delta_net -> gating elementwise -> gdn_gather, x48 layers x N steps)
+  sum to **>= 57 ms/step**, Lever 3 is justified AND sized. If those critical-path gaps total
+  < 57 ms, parity is NOT reachable via GDN-gate fusion alone and the gap is elsewhere (GDN core
+  kernel slower than vLLM fused_recurrent, or scheduler/H2D).
+- Structural corroboration (agrees with vllm-gdn-compare): vLLM runs the GDN region as 2 fused
+  Triton kernels under a full CUDA graph; llama splits it into ssm_conv + gated_delta_net +
+  gdn_gather + ~6 serially data-dependent elementwise gate kernels. ~384 host-launched nodes/step
+  on a chain that cannot overlap is precisely the mechanism that produces llama's extra ~57 ms.
+
+Floors are engine-independent lower bounds; the timeline agent owns proving the 57 ms is
+recoverable on the critical path. Roofline says: target 327 ms, shave 57 ms, and it can ONLY
+come from bubble (not bytes, not FLOPs).
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+
+## lever3-design (READ-ONLY, no GPU) - concrete fusion of the serial GDN gate chain into the recurrence kernel
+
+### What actually feeds/consumes the recurrence kernel today (qwen35 decode, fused_gdn_ar)
+Traced in `src/models/qwen35.cpp::build_layer_attn_linear` ->
+`src/models/delta-net-base.cpp::build_recurrent_attn` (fused !keep branch) ->
+`ggml/src/ggml-cuda/gated_delta_net.cu`. The model is GDA (g->ne[0]==1, scalar
+gate per head; kda=false in the kernel). S_v = ssm_d_state = 128, so the kernel
+runs the `<128>` template: warp_size==S_v==128, num_warps=4, rows_per_lane==1,
+grid (H, n_seqs, S_v/4=32 z-tiles). Each warp owns one output column `col`; the
+128 lanes hold the full head-vector (one element per lane).
+
+Serial pre-GDN gate chain (each a standalone host-launched ggml node, all on the
+critical path between the in-proj GEMMs and the recurrence):
+1. `beta = ggml_sigmoid(ssm_beta @ cur)`            -> kernel reads `beta_val = *beta_t`
+2. `alpha = ssm_alpha @ cur`
+3. `ggml_add(alpha, ssm_dt)`  (k_bin_bcast op_add)
+4. `ggml_softplus(...)`        (unary_op<softplus>, 1248 inst)
+5. `ggml_mul(softplus, ssm_a)` (k_bin_bcast op_mul; ssm_a = -exp(A_log), baked)  -> g; kernel does `expf(g_t)`
+6. `ssm_conv` then `ggml_silu` (conv path; may already hit the upstream SSM_CONV+SILU fuse) -> v_conv, and the q/k slices
+7. `ggml_l2_norm(q_conv)`, `ggml_l2_norm(k_conv)` (l2_norm_f32<32>, 2496 inst = 1248x2) -> kernel reads q_reg/k_reg
+
+Post-GDN gate (consumes kernel output):
+8. `build_norm_gated(output, ssm_norm, z)` = rms_norm(output)*ssm_norm (RMS_NORM+MUL fused), then multiplied by `silu(z)` (unary_gated_op<silu>, the 5.9% bucket)
+
+### The fusion: fold steps 1,3,4,5,7 INTO gated_delta_net_cuda (a "fused-gate" mode)
+These five are exactly the per-(head) scalar gates (sigmoid beta; softplus+dt+ssm_a
+-> g) and the per-head-vector L2 norms of q/k - and the kernel ALREADY loads every
+operand it needs:
+- It reads `beta_val` (scalar) -> pass RAW beta, do `beta_val = 1.f/(1.f+expf(-raw))` in-kernel. Removes node 1.
+- It reads `g_t` (scalar, GDA) and does `expf(g_t)` -> pass RAW alpha + per-head `ssm_dt[h]` + per-head `ssm_a[h]`, compute `g = ssm_a[h]*op_softplus(alpha + ssm_dt[h])` in-kernel, keep the existing `expf(g)`. `op_softplus(x) = (x>20)?x:logf(1+expf(x))` (copy `ggml_compute_softplus_f32` verbatim). Removes nodes 3,4,5.
+- It loads the full q/k head-vector into `q_reg[r]`/`k_reg[r]` (one element per lane at S_v==128). L2-normalize in registers: `float qss = warp_reduce_sum<128>(q_reg[0]*q_reg[0]); q_reg[0] *= rsqrtf(qss + eps* ... )`, with the eps term matching the l2_norm formula; same for k. Each warp redundantly recomputes the (identical) norm for its column - cheap, no shared mem, no extra launch. Removes node 7 (both l2_norm launches). `eps` (= f_norm_rms_eps) is passed as a kernel float param.
+
+That collapses the pre-GDN serial chain to just: in-proj GEMMs -> build_conv_state(concat) -> ssm_conv(+silu) -> [single fused gated_delta_net kernel]. 6 gate kernels removed per SSM layer per decode step (steps 1, 3, 4, 5, plus the two l2_norm launches of step 7).
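The per-head math that the fused-gate mode would absorb can be written as a scalar reference model. This is a Python sketch of the arithmetic only, not the CUDA kernel: the function names are hypothetical, and the L2-norm epsilon placement is an assumption that has to be checked against the real ggml l2_norm before it is relied on.

```python
import math


def softplus(x: float) -> float:
    """Threshold form quoted in the text: pass x through when x > 20."""
    return x if x > 20.0 else math.log(1.0 + math.exp(x))


def l2_normalize(vec, eps):
    # ASSUMPTION: x * rsqrt(sum(x^2) + eps). The real ggml formula may
    # place eps differently and must be matched exactly.
    scale = 1.0 / math.sqrt(sum(x * x for x in vec) + eps)
    return [x * scale for x in vec]


def fused_gate(q_raw, k_raw, alpha_raw, beta_raw, ssm_a_h, ssm_dt_h, eps):
    """Per-head work folded in front of the recurrence (hypothetical helper)."""
    beta = 1.0 / (1.0 + math.exp(-beta_raw))       # replaces node 1 (sigmoid)
    g = ssm_a_h * softplus(alpha_raw + ssm_dt_h)   # replaces nodes 3, 4, 5
    decay = math.exp(g)                            # the existing expf(g) step
    q = l2_normalize(q_raw, eps)                   # replaces node 7 (q)
    k = l2_normalize(k_raw, eps)                   # replaces node 7 (k)
    return q, k, beta, decay


q, k, beta, decay = fused_gate(
    q_raw=[3.0, 4.0], k_raw=[1.0, 0.0],
    alpha_raw=0.0, beta_raw=0.0, ssm_a_h=-1.0, ssm_dt_h=0.0, eps=0.0,
)
print(q, k, beta, decay)
```

With `ssm_a_h` negative, `g` is non-positive and the decay stays in (0, 1], which is the behaviour the recurrence expects from the gate.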
+
+### Why the OUTPUT gate (step 8) is NOT folded into this kernel
+The output gated-rmsnorm reduces over the full head_v_dim (S_v=128) per (head,seq).
+In this kernel those 128 elements are produced by 128 DIFFERENT (warp x z-tile)
+blocks (4 warps x 32 z-tiles), so an in-kernel head-wide reduction would need a
+grid-global sync - not feasible without a grid redesign. Leave step 8 as the
+existing RMS_NORM+MUL + unary_gated<silu> fusion (already 2 launches, not in scope).
+The conv-silu (step 6) is a convolution, structurally separate; rely on the
+existing upstream SSM_CONV(+ADD)+SILU fuse rather than pulling it into the
+recurrence kernel.
+
+### Implementation scope
+- `ggml/include/ggml.h`: new builder `ggml_gated_delta_net_inplace_ids_fused_gate(ctx, q_raw, k_raw, v, alpha_raw, beta_raw, cache4d, state_dst, ids, ssm_a, ssm_dt, rs_head, eps)` (or an op-param flag GDN_FUSE_GATE on the existing builder + 2 extra srcs). src budget: current op uses src[0..7]; add ssm_a -> src[8], ssm_dt -> src[9]. GGML_MAX_SRC==10, so it fits EXACTLY (zero headroom - note for review).
+- `ggml/src/ggml.c`: builder + a new op-param i32 flag (e.g. params[2]=fuse_gate) + f32 param for eps; assert shapes (ssm_a/ssm_dt are [num_v_heads]).
+- `ggml/src/ggml-cuda/gated_delta_net.cu`: in `gated_delta_net_cuda`, gate the in-kernel sigmoid/softplus-gate/l2norm behind a `bool FUSE_GATE` template param (4th template bool, keeps the non-fused path byte-identical and avoids register bloat when off). Read ssm_a[h_idx], ssm_dt[h_idx]; compute g per head; sigmoid raw beta; warp-reduce q_reg/k_reg sumsq -> rsqrtf scale. Plumb the 2 new src pointers + eps through `launch_gated_delta_net` and `ggml_cuda_op_gated_delta_net` (read src[8],src[9], op_param eps/flag). The `gdn_gather_nonident` path is unaffected (it gathers state, not q/k/g/beta).
+- `ggml/src/ggml-cpu/ops.cpp`: mirror in `ggml_compute_forward_gated_delta_net_one_chunk` (host sigmoid/softplus/l2norm before the per-token math) for CPU parity / test-backend-ops.
+- `src/models/delta-net-base.cpp::build_recurrent_attn` (the fused !keep + ids branch, and the inplace non-ids branch): call the fused-gate builder, pass raw alpha/beta/q/k + ssm_a + ssm_dt + eps.
+- `src/models/qwen35.cpp` / `qwen35moe.cpp` / `qwen3next.cpp` `build_layer_attn_linear`: when the fuse flag is on, DROP `ggml_sigmoid(beta)`, `ggml_add(alpha,dt)`, `ggml_softplus`, `ggml_mul(.,ssm_a)`, and the two `ggml_l2_norm` nodes; hand the raw tensors + `model.layers[il].ssm_a`, `ssm_dt` to build_recurrent_attn. The conv-silu and z/output-gate path are unchanged.
+- Guard the whole thing behind `cparams.fused_gdn_gate` / env `LLAMA_FUSE_GDN_GATE` (default OFF) so it A/Bs against the clean Lever-1 build exactly like P2a/Lever-2, and only the recurrent (GDA) qwen35 family path is touched.
+
+### Numeric considerations / bit-exactness
+- sigmoid(beta), softplus(alpha+dt), and the `g = ssm_a*softplus` mul/add are pointwise fp32 with the SAME formula/order as the standalone ggml ops -> these can be **bit-exact** (no reduction). softplus must copy `(x>20)?x:logf(1+expf(x))` exactly.
+- q/k l2norm is the ONE op with a reduction: the standalone `l2_norm_f32<32>` does its own warp/block reduction; the in-kernel `warp_reduce_sum<128>` tree may differ in the last ULP, and the eps placement (`x*rsqrt(sumsq+eps)` vs `x/max(sqrt(sumsq),eps)`) must match the ggml l2_norm exactly. Expect **near-bit-exact, not guaranteed byte-identical** greedy output. So unlike Levers 1/2, gate this on a **PPL/KL tolerance** (KL logit delta < ~1e-3, PPL delta within noise) rather than md5 identity. If byte-identity is required, exclude l2norm from the fold (keep nodes 7) and fuse only sigmoid/softplus/gate - but that drops the value to ~0.3% and is probably not worth it.
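Both numeric hazards named above can be shown in isolation. The sketch below uses Python floats (float64, not the kernel's fp32) and deliberately exaggerated magnitudes so the effect is visible; in the real kernel the difference would be at the last-ULP level. All function names are illustrative.

```python
import math


def scale_eps_inside(sumsq: float, eps: float) -> float:
    """Scale factor for x * rsqrt(sumsq + eps)."""
    return 1.0 / math.sqrt(sumsq + eps)


def scale_eps_clamp(sumsq: float, eps: float) -> float:
    """Scale factor for x / max(sqrt(sumsq), eps)."""
    return 1.0 / max(math.sqrt(sumsq), eps)


def sum_sequential(values):
    """Left-to-right accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_pairwise(values):
    """Tree reduction, the shape a warp-shuffle reduction uses."""
    vals = list(values)
    while len(vals) > 1:
        if len(vals) % 2:
            vals.append(0.0)
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]


# The two eps placements give different scales for the same input.
inside = scale_eps_inside(1.0, 1e-6)
clamped = scale_eps_clamp(1.0, 1e-6)

# The same addends summed in a different order give a different result.
addends = [1e16, 1.0, -1e16, 1.0]
seq = sum_sequential(addends)
tree = sum_pairwise(addends)
print(inside, clamped, seq, tree)
```

This is why the fold is validated against a PPL/KL tolerance rather than an md5 match: a correct fused kernel can still differ from the standalone op in the last bits.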
+
+### Estimated kernels-removed-per-layer and the honest ceiling
+- Removed per SSM decode layer-step: sigmoid(beta) + add(dt) + softplus + mul(ssm_a) + l2norm(q) + l2norm(k) = **6 host-launched kernels -> 0**, collapsing 7 nodes (incl. recurrence) to 1. Across 48 SSM layers = **~288 launches/step removed** (matches the instance deltas: l2_norm 2496, softplus 1248, sigmoid 1248, plus the alpha-add/ssm_a-mul share of op_add/op_mul).
+- GPU-BUSY ceiling of the removed ops is small: l2_norm 1.0% + softplus ~0% + sigmoid 0.3% + (dt add + ssm_a mul share of op_add 1.7% / op_mul 8.5%). The point of Lever 3 is NOT the freed busy-time (P2a/Lever-2 proved freeing overlapped busy-time is flat) - it is removing ~288 LAUNCH BUBBLES/step that sit on the serial conv->gate->recurrence dependency where the GPU is otherwise idle. The win is wall-clock only if those specific bubbles are on the critical path.
+
+### RISK (must be settled before building)
+1. **Same trap as P2a/Lever-2 if the bubbles overlap.** If the scheduler already
+   overlaps these pre-GDN gate kernels with an adjacent layer's 42% mul_mat_q GEMM,
+   Lever 3 is FLAT. **Precondition: the timeline gap analysis must show idle GPU
+   between ssm_conv -> (sigmoid/softplus/l2norm) -> gated_delta_net per layer** at
+   batch=128 BEFORE building. If the trace shows the gate ops back-to-back with no
+   gap (overlapped), do NOT build op-fusion; go to the CUDA-graph lever in item 2 below.
+2. **The bigger bubbles may be elsewhere on the chain.** The large buckets are op_mul
+   8.5% and unary_gated<silu> 5.9% - much of which is the POST-GDN output gate and
+   FFN, which this fusion does NOT touch. If the gap analysis pins the dominant idle
+   to the post-GDN region or to inter-layer launch latency generally, the
+   higher-leverage Lever 3 is **decode CUDA-graph capture** (removes host launch
+   latency for ALL ~384 nodes/step at once, exactly what vLLM does), not per-op
+   fusion. CUDA-graph is the strictly larger hammer here; op-fusion only helps the
+   pre-GDN slice. Recommend measuring the per-sub-op gap first and preferring the
+   CUDA-graph lever if the bubbles are spread across the step rather than concentrated
+   in the pre-GDN gate slice.
+3. **src-slot exhaustion** (src[8],src[9] use the last 2 of GGML_MAX_SRC=10) - any
+   later op needing more srcs on this node has zero headroom; flag for review.
+
+## cudagraph-coverage (READ-ONLY, no GPU) - does the CUDA graph cover the GDN serial chain at B=128?
+
+### Verdict: YES, the graph covers GDN at batch=128 (dense model). No GDN op forces graph-disable or per-step re-instantiation.
+
+Source: `ggml/src/ggml-cuda/ggml-cuda.cu` (graph state machine), `gated_delta_net.cu`
+(fused op), `src/models/delta-net-base.cpp` (graph build), `src/llama-memory-recurrent.cpp`
+(recurrent head), all on dev tree `~/llama-paged-dev` (HEAD df1cc97, Lever-1). Cross-checked
+against the committed A2_CUDAGRAPH_DECODE.md + DECODE_PARITY_EXPLORE.md measurements.
+
+### How graph-disable / re-instantiation are decided (this fork's state machine)
+- `ggml_cuda_graph_check_compability` (ggml-cuda.cu:3251) disables the graph for ONLY two
+  reasons: (a) a split-buffer src, (b) `GGML_OP_MUL_MAT_ID` with non-quantized weights OR
+  `node->ne[2] > get_mmvq_mmid_max(...)` [TAG_MUL_MAT_ID_CUDA_GRAPHS]. GATED_DELTA_NET,
+  SSM_CONV, SSM_SCAN, GET_ROWS, CONCAT, the gating elementwise ops are NOT in the disable
+  list. So no GDN op forces graph-disable.
+- `ggml_cuda_graph_update_required` (3297) memcmps, per node, the full `ggml_tensor` struct
+  (incl. `op_params` and `data`) + each src's `data` ptr / `ne` / `nb`. ANY delta -> the
+  warmup state machine (ggml_backend_cuda_graph_compute:4464) resets `warmup_complete` and the
+  WHOLE graph (one key = `cgraph->nodes[0]`) runs eager that step until stable again. Buffer
+  CONTENTS are NOT compared - a contents-only change (e.g. ids values) is graph-safe.
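The replay-or-re-warm decision described above can be modelled as a comparison of per-node properties. This is a toy Python model, not the ggml code: `NodeProps`, `update_required` and `gdn_node` are hypothetical names, and the addresses and shapes are made-up values chosen only to illustrate which changes matter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeProps:
    """Hypothetical stand-in for the fields compared for one graph node."""
    op: str
    data_ptr: int
    ne: tuple
    nb: tuple
    op_params: tuple
    src_data_ptrs: tuple


def update_required(prev_nodes, cur_nodes) -> bool:
    """True if any compared property differs between two decode steps."""
    if len(prev_nodes) != len(cur_nodes):
        return True
    return any(p != c for p, c in zip(prev_nodes, cur_nodes))


def gdn_node(rs_head: int, state_ptr: int) -> NodeProps:
    # Addresses and shapes are invented for illustration.
    return NodeProps(
        op="GATED_DELTA_NET", data_ptr=state_ptr,
        ne=(128, 48, 128, 1), nb=(4, 512, 24576, 3145728),
        op_params=(0, rs_head), src_data_ptrs=(0x1000, 0x2000, 0x3000),
    )


step_a = [gdn_node(rs_head=0, state_ptr=0x9000)]
# Next step: the ids buffer holds new values, but no address, shape,
# stride or op_param changed, so nothing the check looks at differs.
step_b = [gdn_node(rs_head=0, state_ptr=0x9000)]
# A moved recurrent head changes both the op_param and the view offset.
step_c = [gdn_node(rs_head=1, state_ptr=0x9000 + 3145728)]

same_step_replays = not update_required(step_a, step_b)
moved_head_rewarms = update_required(step_a, step_c)
print(same_step_replays, moved_head_rewarms)
```

Buffer contents are deliberately absent from the model, mirroring the point that a contents-only change is graph-safe.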
+
+### Why the GDN region's properties are STABLE across steady decode steps
+The fused decode path is `ggml_gated_delta_net_inplace_ids` (delta-net-base.cpp:558-560):
+```
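+// state_dst: 2D view into ssm_states_all that starts at recurrent cell kv_head
+state_dst = ggml_view_2d(ctx, ssm_states_all, n_embd_s, n_seqs, nb1,
+                         kv_head * n_embd_s * elsize);   // byte offset scales with kv_head
+// rs_head is stored as an op_param, so a change in kv_head also changes the node
+ggml_gated_delta_net_inplace_ids(..., cache4d, state_dst, ids, /*rs_head=*/(int)kv_head);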
+```
+Both the `state_dst` view byte-offset and the `rs_head` op_param (read back as
+`ggml_get_op_params_i32(dst,1)` in gated_delta_net.cu:330) derive from
+`kv_head = mctx_cur->get_head()`. In `llama_memory_recurrent::find_slot`
+(llama-memory-recurrent.cpp:610-689) the n_seqs used cells are SWAPPED into the contiguous
+range `[min .. min+n_seqs-1]` and `head = min`. The recurrent cache does NOT grow per token
+(one state cell per sequence, unlike the KV cache). For a steady 128-seq continuous batch the
+same sequences own the same tails every step, so `min`/`head` are constant (=0) -> state_dst
+offset constant, rs_head op_param constant. The GDN inputs (q,k,v,g,b, cache4d, ids) are
+fixed-shape (n_seqs=128, n_rs slots), so ne/nb are stable, and ggml-alloc hands out the same
+compute-buffer offsets each same-topology ubatch -> data ptrs stable. The `ids` (s_copy)
+tensor's CONTENTS change per step but its address/ne/nb do not -> graph-safe.
+
+### The fused GDN op is capture-safe (no host-sync, no capture-time cudaMalloc)
+`gated_delta_net.cu`: the op launches `gdn_gather_nonident_kernel` + `gated_delta_net_cuda`
+on `ctx.stream()` with NO `cudaStreamSynchronize` / host `cudaMemcpy` / `cudaMalloc`. The
+gather scratch is `ggml_cuda_pool_alloc` (VMM pool, served from reserved memory after warmup,
+no real cudaMalloc during capture). `gdn_gather_nonident` early-returns for identity sequences
+(`ids[s]==rs_head+s`), which is the steady-decode case, so its 3.7% is a launched-but-mostly-
+noop kernel - still captured into the graph like any other. Capture succeeds (the build runs,
+graphs engage), confirming none of these break stream capture.
+
+### The only re-instantiation is NOT GDN-driven
+A2 already measured the re-warm cadence: the graph re-instantiates every ~256 tokens because
+the FULL-ATTENTION block-table input `idx` has `ne[0]=GGML_PAD(n_kv,256)` (and kq_mask in
+lockstep) - those step at 256-token boundaries (paged-attn.cpp:199-213). ~97% of decode steps
+replay the captured graph. This is a full-attention-layer input, not a GDN op. (The unpadded
+`LLAMA_KV_PAGED_GATHER` fallback grows `ne[0]` every step and runs pure-eager, but that is not
+the default decode path and is not the GDN/SSM path.)
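The 256-token cadence falls out of the padding alone. The sketch below mirrors the round-up-to-multiple behaviour of `GGML_PAD` in Python and lists the decode steps at which the padded width changes; `rewarm_steps` is a hypothetical helper, and it counts only the shape-change steps, not any eager warmup steps that follow each one.

```python
def ggml_pad(x: int, n: int) -> int:
    """Round x up to a multiple of n (n must be a power of two)."""
    return (x + n - 1) & ~(n - 1)


def rewarm_steps(n_kv_start: int, n_steps: int, pad: int = 256) -> list:
    """Decode steps at which the padded block-table width changes."""
    changes = []
    prev = ggml_pad(n_kv_start, pad)
    for step in range(1, n_steps + 1):
        cur = ggml_pad(n_kv_start + step, pad)
        if cur != prev:
            changes.append(step)
        prev = cur
    return changes


# Starting from one cached token and decoding 1024 more, the padded
# width steps once per 256 tokens.
changes = rewarm_steps(n_kv_start=1, n_steps=1024)
print(changes)
```

Every other step sees an unchanged `ne[0]` for the block table, which is what lets the captured graph replay between boundaries.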
+
+### Reconciliation with the "~40% util / 60% idle bubbles" premise (it is refuted for GDN)
+The committed nsys sweeps (A2_CUDAGRAPH_DECODE.md, DECODE_PARITY_EXPLORE.md) show the steady
+decode is ~99.4-99.5% GPU-BUSY with graphs ON (measured with `--cuda-graph-trace=node`; a
+graphs-ON trace WITHOUT that flag under-counts GPU rows and falsely reports idle - Trap #2).
+Total exposed idle is ~0.65% of the step; the within-step launch fraction graphs remove is
+0.34% (0.37%->0.11%) and is ALREADY collapsed - the GDN sub-op launch gaps are inside the
+captured region. The "40% utilization" in the STATE is BANDWIDTH-roofline util, not idle SMs:
+decode moves ~55.5 GB/step at 2.48x the 273 GB/s floor, SSM state r+w = 66% of step bytes. The
+GDN recurrence is memory-bandwidth-bound at low occupancy (~12-16%), not launch-gap-bound. So
+"60% idle bubbles on the serial GDN chain" is not supported by the traces; the gap to vLLM is
+SSM-state memory traffic, consistent with P2a/Lever-2 being flat (freeing GPU-busy time, not
+wall-clock).
+
+### Graph-safe lever for GDN: none new
+- GDN is already graph-covered; there is no "make the GDN ops graph-safe" lever to build - they
+  are already safe and captured.
+- The only genuinely graph-NON-covered idle is the BETWEEN-step host gap (~2 ms/step, ~0.4%):
+  ggml rebuilds/reallocs the cgraph each step with a new `cgraph->uid`, so the uid fast-path in
+  ggml_cuda_graph_update_required never fires and the host re-dispatches ~3100 launches on the
+  Grace cores between graph launches (vLLM builds its graph once + persistent device metadata).
+  A persistent/reused cgraph across decode steps would let the uid fast-path fire and shrink the
+  host gap - but at 0.4% of the step it is second-order to the SSM bandwidth floor.
+- CAVEAT (MoE, qwen35moe): MUL_MAT_ID at B=128 can trip [TAG_MUL_MAT_ID_CUDA_GRAPHS]
+  (`ne[2] > mmvq_mmid_max`), disabling the WHOLE MoE-decode graph (GDN included) into eager.
+  That is a MUL_MAT_ID disable, not a GDN break, and does not touch the dense 335 tok/s headline;
+  worth a separate confirm for the MoE model.