docs(paged): profile-both-engines post-SSM ground-truth decode decomposition

Fresh post-SSM nsys of llama (build-cuda-base, patch 0019) AND vLLM 0.23.0 at npl128 decode. Reproduces the 391 reference (vLLM 394 t/s eager / 420 graphs, graphs +6% only) and confirms llama 245 t/s. Both ~98% GPU-busy; the gap is GPU kernel-time, not idle/host/graphs. GDN compute comparable (llama 4.03 vs vLLM 3.62 ms/call, +11%). bytes/step: llama not higher (131 vs 85 MB memcpy; SSM-fix 18GB/step DtoD removal confirmed in-trace). Single biggest llama-specific overage = FP4 matmul path 236 vs 117 ms/step (+119 ms = 64% of the gap), dominated by mul_mat_vec_q (FP4 GEMV at batch 128, 132 ms/step, 26%, one per GDN layer). Track B optimized the wrong FP4 kernel (mul_mat_q, not the GEMV). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-25 17:12:10 -04:00 · 2026-06-25 08:56:37 +00:00
parent 6f0792c3be
commit ee13fd18ce
1 changed files with 578 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/DECODE_PARITY_EXPLORE.md
+++ b/backend/cpp/llama-cpp/patches/paged/DECODE_PARITY_EXPLORE.md
@@ -0,0 +1,578 @@
+# Decode parity exploration (post-SSM-fix) - per-agent findings
+
+Post-SSM-fix decode (patches 0018 in-place state write-back + 0019 fused gather):
+dense q36-27b-nvfp4 decode_agg = 256.6 t/s @npl128 = 65.6% of vLLM 391, bit-exact.
+The remaining +54% to parity is the question each section below probes. All numbers
+DGX GB10 (sm_121), fusion OFF baseline, `decode_agg` = `S_TG t/s`.
+
+---
+
+## Section: per-token-latency (critical path / host-loop) - READ-ONLY
+
+**Verdict: the per-step critical path and host loop are NOT the residual lever.
+Post-SSM the GPU is still ~99% busy at npl128; the entire exposed-idle budget is
+~0.65% of the step (~2.4 ms), of which graphs already remove the within-step half
+(0.34%) and the between-step host gap is ~2 ms/step (~0.4% post-SSM). The 64-layer
+sequential chain does NOT under-fill the GPU at batch 128 - every kernel's grid
+saturates the SMs on its own. The +54% to parity is GPU kernel work (FP4 GEMM
+efficiency + LPDDR5x weight bandwidth), not serialization or host overhead.**
+
+### 1. Measured exposed-idle structure (a2_nsys pre-SSM rep, read-only sqlite sweep)
+
+`paged_off_npl128.sqlite`, steady window 40-97% of trace (14.78 s, ~16.5 decode
+steps at the pre-SSM ~896 ms/step). Overlap-correct interval-union sweep:
+
+| activity set            | busy %  | exposed idle |
+|-------------------------|---------|--------------|
+| kernels only            | 80.25%  | 19.74%       |
+| kernels + memcpy (all)  | 99.35%  | **0.65%**    |
+
+- The 19.4% kernels-only gap = **841 big gaps (median 3.35 ms, ~51/step)** that are
+  filled by D2D memcpy. These ARE the per-layer gated-DeltaNet recurrent-state copies
+  (the `gated_delta_net -> ggml_cpy(state->cache) -> next layer reads state` chain).
+  They were a real critical-path serialization, and **patches 0018/0019 removed exactly
+  these** (D2D bucket 18.9% -> 0.23%; get_rows gather 18.8% -> 0.7%). Decode rose
+  +37.8% (186 -> 256 t/s), ~matching the work removed -> the kernels reflowed
+  back-to-back, so post-SSM these big gaps are CLOSED, not re-exposed (inferred from
+  the throughput scaling; the post-SSM nsys was not re-profiled by this read-only agent).
+- The TRUE exposed idle (kernel+memcpy union) is **0.65%**: 18 host gaps >=0.5 ms,
+  **median 2.06 ms, max 2.85 ms, ~1.1/step**. This is the single between-step host gap
+  (sample-128 + `update_slots` + next-batch build) that does NOT overlap GPU compute.
+- Within-step launch gaps: 24,190 micro-gaps, median 2.14 us, summing to 50.6 ms =
+  **0.34%** of the window - the pure launch overhead that CUDA graphs collapse
+  (measured 0.37% -> 0.11% in A2_CUDAGRAPH_DECODE; graphs already engage on the
+  default paged decode with a 256-token reset cadence).
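The overlap-correct interval-union sweep used for the busy/idle table can be sketched as follows. This is a minimal illustration, assuming each activity has already been pulled from the nsys sqlite export as a `(start, end)` pair; the function name and the toy intervals are hypothetical, not the agent's actual script.

```python
def busy_fraction(intervals, window_start, window_end):
    """Fraction of [window_start, window_end) covered by the union of intervals."""
    # Clip every interval to the window and drop those entirely outside it.
    clipped = sorted(
        (max(s, window_start), min(e, window_end))
        for s, e in intervals
        if e > window_start and s < window_end
    )
    covered = 0
    cur_start, cur_end = None, None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # A gap precedes this interval: close the previous merged run.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            # Overlapping or touching: extend the current merged run.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (window_end - window_start)


# Toy data (made up): two overlapping kernels, then a memcpy filling most of a gap.
kernels = [(0, 40), (30, 60), (90, 100)]
memcpys = [(60, 88)]

kernels_only = busy_fraction(kernels, 0, 100)
kernels_and_memcpy = busy_fraction(kernels + memcpys, 0, 100)
print(f"kernels only: {kernels_only:.2%}, kernels + memcpy: {kernels_and_memcpy:.2%}")
```

Merging before summing is what keeps overlapping kernels from being double-counted; summing raw durations instead would overstate busy time whenever streams overlap.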
+
+### 2. Post-SSM scaling of the FIXED host gap
+
+The ~2 ms/step between-step host gap is FIXED work (independent of GPU kernel time).
+As decode accelerated it grew only as a fraction of a shrinking step:
+
+| build         | step ms @npl128 | host gap | host gap % of step |
+|---------------|-----------------|----------|--------------------|
+| pre-SSM (146) | ~877            | ~2 ms    | 0.24%              |
+| post-SSM (256)| ~499            | ~2 ms    | **~0.40%**         |
+| vLLM (391)    | ~328            | (n/a)    | (would be ~0.6%)   |
+
+Even fully removing it (perfect overlap) buys ~0.4%. It is a second-order floor, not
+the lever - it only becomes material once the kernels are fast enough to drop GPU-busy
+below the host time, which is not the case at 65% of parity.
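The fixed-gap arithmetic in the table can be reproduced with a short sketch. It assumes a constant 2.0 ms host gap (the rounded section 1 value) and 128 tokens per decode step; the helper names are illustrative.

```python
BATCH = 128          # npl128: tokens produced per decode step
HOST_GAP_MS = 2.0    # the ~2 ms between-step host gap from section 1 (rounded)


def step_ms(tokens_per_s, batch=BATCH):
    """Decode step time in ms for a given aggregate tokens/s."""
    return batch / tokens_per_s * 1000.0


def host_gap_share(tokens_per_s):
    """Fixed host gap expressed as a fraction of the step."""
    return HOST_GAP_MS / step_ms(tokens_per_s)


for label, tps in [("pre-SSM", 146.0), ("post-SSM", 256.6), ("vLLM", 391.0)]:
    print(f"{label:9s} step {step_ms(tps):6.1f} ms  host gap {host_gap_share(tps):.2%}")
```

Because the gap is constant, its share scales inversely with step time, so even at vLLM's step length it stays well under 1%.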
+
+### 3. The 64-layer chain does NOT under-fill the GPU at batch 128
+
+The decode is an intrinsically sequential depth-64 chain (autoregressive: layer N
+needs layer N-1; cannot be parallelized across layers). The question is whether each
+individual kernel fills the SMs at batch 128. It does:
+
+- **GDN kernel** (`gated_delta_net.cu`): launch grid `dim3(H, n_seqs, ceil(S_v/4))`
+  = `48 x 128 x 32 = 196,608 blocks` (dense, H=48 value heads, S_v=128). Block
+  `(warp_size, 4, 1)`. Massively oversubscribes the GB10 SMs. Each warp loads its
+  state shard into registers once and runs a single `n_tokens==1` iteration - O(1) in
+  context (confirmed flat across 4x ctx in GDN_DECODE_VERIFY).
+- **FP4 GEMM** (`mul_mat_q`, mmq_x=128): M=128 token tile, well into the M-batched
+  regime, full SM occupancy (and Track B P2a already showed it goes 2 CTA/SM).
+- The 99.35% kernel+memcpy busy reading IS the direct proof there is no under-fill at
+  npl128: if the chain under-filled, busy% would be well below 99%.
+
+Under-fill only appears at LOW batch (npl32/npl4), where it manifests as the
+weight-bandwidth/GEMV regime (npl32 = 170 t/s vs npl128 = 256): fewer tokens amortize
+the same per-step weight read, NOT idle SMs. That is a bandwidth floor, not a
+host/scheduler problem.
+
+### 4. What the host actually does per step (eager rep runtime API)
+
+Steady-window `CUPTI_ACTIVITY_KIND_RUNTIME` totals (host-thread wall, overlaps GPU):
+
+| API                       |   n   | total   | avg     |
+|---------------------------|-------|---------|---------|
+| cudaStreamSynchronize     | 1723  | 7775 ms | 4513 us |
+| cudaLaunchKernelExC       | 30983 | 4045 ms | 131 us  |
+| cudaLaunchKernel          | 20385 | 2694 ms | 132 us  |
+| cudaMemcpyAsync           | 2085  |   96 ms |  46 us  |
+
+~104 stream-syncs/step and ~3100 kernel launches/step in eager mode (collapsed by
+graphs to ~900 launches/step). The 7.8 s of sync is the host BLOCKING on the busy
+GPU (it overlaps GPU compute, it is NOT exposed idle) - the GPU stays 99.4% busy. The
+sampled-token path is `cudaMemcpyAsync` (96 ms total, negligible, non-blocking). The
+only NON-overlapped residue is the ~2 ms/step between-step gap in section 1.
+
+### 5. vLLM host-loop comparison (per VLLM_DECODE_GROUNDING.md)
+
+vLLM's eager decode is host-cheap BY CONSTRUCTION and hides the host fully behind the
+async CUDA stream WITHOUT pipelined scheduling (`async_scheduling` was OFF; it won the
+2.4x with synchronous scheduling): persistent pre-allocated input buffers updated by
+vectorized numpy (no per-token Python), attention metadata `build()` once per step
+reused across all layers, no GPU->CPU sync in the hot path, sampled-token D2H
+non-blocking + event-gated, and a fixed small launch sequence (~2 ops/Linear). The
+next-step host prep overlaps the current-step GPU compute on the async stream. The key
+asymmetry vs llama: vLLM builds its graph ONCE and reuses persistent device
+KV/block metadata; ggml rebuilds/reallocates the cgraph each decode step (new
+`cgraph->uid`) and re-dispatches ~3100 launches from the loop on the weak Grace cores.
+
+But this asymmetry is hidden under GPU compute on BOTH sides at npl128: llama's host
+loop is a 0.4% exposed gap, not a 2x lever. vLLM's host cheapness is why ITS step is
+328 ms host-free, but llama's 499 ms is also ~99% GPU - the 171 ms difference is GPU
+kernel time (FP4 GEMM), not host.
+
+### 6. Is any host/serialization lever CUDA-graph or scheduler addressable?
+
+- **Within-step launch idle (0.34%)**: CUDA-graph addressable, ALREADY captured by
+  default (0.37 -> 0.11%). Worth ~0% of decode_agg (measured +0.1-0.8%, noise).
+  Nothing left to win here.
+- **Between-step host gap (~2 ms, ~0.4%)**: NOT removed by a graph (the graph replays
+  the forward; the host still samples + runs `update_slots` + rebuilds the batch
+  between replays). It is SCHEDULER addressable - overlap step N+1's host prep with
+  step N's GPU compute, mirroring vLLM's persistent-buffer + build-once-reuse +
+  non-blocking-D2H pattern (and ideally reuse the ggml cgraph across steps instead of
+  rebuilding it every ubatch). But the ceiling is ~0.4% of the step, so it is a
+  cleanup, not a parity lever.
+- **The +54% to parity is none of the above.** It is GPU kernel work: post-SSM the FP4
+  GEMM family is ~48% of decode (the dominant residual), GDN recurrence ~22.5%, and the
+  decode is weight-bandwidth/latency-bound on LPDDR5x (Track B P2a: a -24.7% FP4-GEMM
+  kernel left decode_agg FLAT, the freed compute became idle gaps -> decode is not
+  GEMM-compute-bound but bandwidth/latency-bound). The lever lives in cutting DRAM
+  traffic per step (fused act-quant to drop the separate `quantize_mmq` pass, native
+  FP4-MMA, and/or NVFP4-dense weight quant), NOT in the host loop or CUDA graphs.
+
+### Evidence
+- Read-only sqlite sweeps on `~/bench/a2_nsys/paged_off_npl128.sqlite` (this agent).
+- `gated_delta_net.cu` launch grid (DGX `~/llama-paged-dev`).
+- A2_CUDAGRAPH_DECODE.md, SSM_DECODE_FIX_RESULTS.md, GDN_DECODE_VERIFY.md,
+  VLLM_DECODE_GROUNDING.md, THROUGHPUT_B_P2a_POSTSSM_RESULTS.md.
+
+---
+
+## Section: gdn-source-compare (llama gated_delta_net.cu vs vLLM fused_recurrent_gated_delta_rule)
+
+### Model config (Qwen3.5-27B dense, from vLLM config.json)
+- linear_key_head_dim K = 128, linear_value_head_dim V = 128
+- linear_num_key_heads = 16, linear_num_value_heads = 48 (GVA 3:1), conv_kernel = 4
+- 64 layers, full_attention_interval 4 -> 48 linear (GDN) : 16 full-attn
+- Recurrent state per (seq, v-head) = V*K = 128*128 = 16384 f32 = 64 KiB.
+  Per layer per seq = 48 * 64 KiB = 3 MiB. Both engines store state in f32.
+
+### Which kernels run at decode
+- llama: ggml_gated_delta_net_inplace_ids -> gated_delta_net_cuda<S_v=128, KDA=false, keep_rs_t=false>.
+  Gate is SCALAR per head (graph reshapes gate/beta to ne[0]=1), so the cheaper !KDA branch runs (one expf per token, not per-channel).
+- vLLM: enable_packed_recurrent_decode -> fused_recurrent_gated_delta_rule_packed_decode_kernel
+  (the dedicated single-token decode kernel, NOT the generic varlen fwd kernel).
+
+### The state DRAM traffic is IDENTICAL - it is NOT the lever
+Per (seq, v-head) per decode token both engines read 64 KiB state + write 64 KiB state, f32, coalesced.
+The dominant memory term is equal. llama is NOT moving more state bytes than vLLM.
+=> The 1.46 ms/call is llama achieving LOWER effective bandwidth on the SAME bytes,
+   plus extra non-state work, NOT a fundamental DRAM-traffic deficit (GB10 is LPDDR5x, not HBM). Hence closable.
+
+### Algorithmic / parallelization delta (the real differences)
+
+1) Reduction strategy (biggest structural difference)
+   - llama: WARP-PER-OUTPUT-COLUMN. State stored transposed M[col][i]=S[i][col]. Each warp owns
+     one V-column; the contraction over the 128 K-rows is a cross-lane warp_reduce_sum.
+     TWO warp_reduce_sum per token (one for kv = S^T@k, one for attn = S^T@q) = ~10 shuffle
+     rounds (2 x log2(32) lanes) on the critical path; with n_tokens=1 they are NOT amortized.
+   - vLLM: THREAD-PER-OUTPUT-ROW. b_h is a [BV,BK]=[32,128] tile; each thread owns a FULL K-row
+     of state. sum(b_h*b_k, axis=K) and sum(b_h*b_q) are THREAD-LOCAL 128-wide reductions -
+     ZERO cross-thread shuffles. Outer-product update b_h += b_v*b_k is also thread-local.
+   Same FLOPs, but vLLM has no shuffle-reduction latency in the recurrence.
+
+2) Occupancy / launch geometry (likely the dominant bandwidth gap)
+   - llama: block = (32 lanes, 4 warps) = 128 threads; grid = (H=48, n_seqs, ceil(128/4)=32).
+     Per (head,seq) it launches 32 blocks * 128 threads = 4096 threads to touch a 16384-elem state
+     (only 4 state elems/thread). launch_bounds(128, 2) budgets registers for >=2 blocks/SM; with
+     s_shard[4]+k_reg[4]+q_reg[4]+addressing the register pressure caps it near ~2 blocks = 8 warps/SM
+     (~12-16% occupancy on GB10). A memory-bound kernel at ~8 warps/SM cannot generate enough in-flight
+     loads to saturate 273 GB/s -> low achieved bandwidth on the state read/write.
+   - vLLM: 1 warp/program (num_warps=1), grid (NV=4, B*HV), small register footprint, num_stages=3
+     software-pipelines (prefetches) the state load. Far higher memory-level parallelism per SM.
+
+3) Redundant non-state traffic in llama
+   - q,k re-loaded by EVERY column-warp: 128 column-warps/head each reload the same 128-float q and k
+     => ~128x amplified L2 loads of q/k per head/token (vLLM reloads ~4x, once per NV program).
+     Small (L2-resident) but adds load-issue + L2 pressure competing with the state stream.
+   - Output store: llama writes attn_data[col] from lane 0 only (31/32 lanes idle), scattered
+     single-float stores; vLLM stores a contiguous BV=32 vector (coalesced).
+
+4) Fusion delta (per-layer kernel-launch / global-memory round-trip count)
+   - vLLM packed_decode FUSES into ONE kernel: q/k l2norm + q*scale + softplus(a+dt_bias) +
+     (-exp(A_log)) gate + sigmoid(beta) + the recurrence + state write-back.
+   - llama computes these as SEPARATE ggml ops/kernels in the graph before the GDN op:
+     ggml_l2_norm(q), ggml_l2_norm(k), ggml_add(+dt), ggml_softplus, ggml_mul(gate),
+     ggml_sigmoid(beta) (+ conv/silu), each a launch + small global-memory round-trip. Plus a separate
+     gdn_gather_nonident_kernel launch per layer (a no-op at steady-state decode: every block
+     early-returns on the identity check, but still a grid launch of n_seqs blocks).
+   Across 48 linear layers this is ~6-10 extra small kernels/layer (~300-480 extra launches/token).
+   Whether this dominates depends on CUDA-graph capture (see A2_CUDAGRAPH_DECODE.md); if captured,
+   launch latency is hidden and the cost reverts to the per-op global-memory round-trips + dependency gaps.
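The geometry figures quoted in items 1-4 follow directly from the shapes given in this section. A small sketch, assuming a standard butterfly warp reduction (log2(32) = 5 shuffle rounds per reduction) and using only the dimensions stated above; the variable names are illustrative.

```python
import math

H, N_SEQS, S_V = 48, 128, 128       # value heads, sequences, state dim
WARP, WARPS_PER_BLOCK = 32, 4        # llama block = (32 lanes, 4 warps)
N_LINEAR_LAYERS = 48                 # GDN layers in the dense model

blocks_per_head_seq = math.ceil(S_V / WARPS_PER_BLOCK)                # 32
threads_per_head_seq = blocks_per_head_seq * WARP * WARPS_PER_BLOCK   # 4096
state_elems = S_V * S_V                                               # 16384
elems_per_thread = state_elems // threads_per_head_seq                # 4
total_blocks = H * N_SEQS * blocks_per_head_seq                       # 196608

# Two warp reductions per token (kv and attn), each log2(32) shuffle rounds.
shuffle_rounds = 2 * int(math.log2(WARP))

# 6 to 10 unfused elementwise kernels per linear layer, per decode token.
extra_launches = (N_LINEAR_LAYERS * 6, N_LINEAR_LAYERS * 10)

print(total_blocks, elems_per_thread, shuffle_rounds, extra_launches)
```

The low `elems_per_thread` is the point of item 2: each thread touches very little state, so the achieved bandwidth depends on how many such loads are in flight per SM rather than on per-thread work.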
+
+### What a faster llama GDN decode kernel would need (optimization scope)
+- A. Re-parallelize like vLLM: thread/lane owns a full K-row (or K-shard) so the kv and attn
+  contractions become register-local FMAs, eliminating the two warp_reduce_sum per token.
+- B. Raise occupancy for the memory-bound regime: drop/raise the launch_bounds minBlocks hint
+  (the `,2)` is too low), shrink the block, cut registers, and add a software-prefetch of the next
+  state shard so more state loads are in flight per SM. This directly lifts achieved bandwidth on
+  the equal state bytes - the single highest-leverage change.
+- C. Load q,k ONCE per (head,seq) into shared memory instead of 128x per-column reload; coalesce
+  the output store across the warp.
+- D. Fuse the gate/l2norm/scale (softplus, exp(A_log), sigmoid, l2norm) INTO the recurrence kernel,
+  reading raw a/b/A_log/dt_bias from registers, removing ~6 elementwise passes + their global-memory round-trips
+  per layer (matches vLLM's packed_decode). Drop the gather no-op kernel at steady-state decode
+  (or fold the identity check into the recurrence prologue, which it already partly does).
+- E. (Longer term) bf16 state would HALVE the dominant traffic, but vLLM keeps f32 too, so this is a
+  divergence-from-reference not a parity lever.
+
+### Bottom line
+llama's GDN decode kernel is NOT moving more state DRAM bytes than vLLM (the dominant term is equal),
+so the 1.46 ms/call is an EFFICIENCY gap, not a traffic floor: (1) cross-lane warp-shuffle reductions on
+the n_tokens=1 critical path, (2) low occupancy (~8 warps/SM from launch_bounds + register pressure)
+starving memory-level parallelism so the equal state bytes move at lower effective bandwidth, plus
+(3) 128x redundant q/k L2 loads and (4) ~6-10 unfused gate/norm elementwise kernels per layer that
+vLLM folds into one packed-decode kernel. Highest-leverage fixes: raise occupancy + prefetch (B) and
+row-local reductions (A); secondary: gate/norm fusion (D) and q/k shared-mem reuse (C).
+
+---
+
+## Section: validate-findings (adversarial re-derivation from raw DGX data) - READ-ONLY
+
+Re-queried `CUPTI_ACTIVITY_KIND_KERNEL` + `CUPTI_ACTIVITY_KIND_MEMCPY` directly (kernel and
+memcpy summed separately so D2D is never lumped into compute), not from summary text.
+
+### CLAIM 1 - decode decomposition
+PRE-FIX (`a2_nsys/paged_off_npl128.sqlite`, last 17s) vs `decode_decomp.txt`, match to ~0.1pp:
+gated_delta_net 23.40% (doc 23.43), k_get_rows 21.99% (21.88), MEMCPY-DtoD 18.89% / 382 GB /
+1583 ops (18.90 / 356 GB / 1584; 356 GiB = 382 GB, so the byte gap is consistent with a GiB-vs-GB
+unit mismatch), mul_mat_vec_q 15.53% (15.51), mul_mat_q 10.48% (10.37).
+=> CONFIRMED exactly. gated_delta_net = largest single non-GEMM kernel; FP4-GEMM group ~28%;
+full attention 0.37%.
+
+D2D collapse: the only on-box post-fix decomp is `ssm_decomp/after.sqlite`; MEMCPY-DtoD there =
+526 ops / 0.9 ms / 0.05 GB = 0.008% of busy (from 382 GB / 18.89%). => CONFIRMED, stronger than
+the doc's "0.23%" (the 382 GB state copy-back is gone; the exact "0.23%/2.93GB/734ops" is not
+reproducible: my DtoD total is 0.05 GB, and the 2.16 GB is DtoH).
+
+FLAG (refutes part of the Step-2 decomp): `after.sqlite` is a Step-1 build (patch 0018 only),
+NOT Step-2. It still shows k_get_rows_float 28.44% (gated_delta_net 28.96%, FP4-GEMM group ~33%),
+no `gdn_gather_nonident` kernel, profiled S_TG=164 (~Step-1 180, not Step-2 256); mtime 00:31
+predates the 08:48 rebuild that carried patch 0019. The Step-2 split in `SSM_DECODE_FIX_RESULTS`
+("get_rows 18.8%->0.7%, FP4-GEMM ->48%, GDN 22.5%") has NO surviving sqlite, and the script meant
+to produce it (`ssm_decomp.sh`) CRASHED (Python SyntaxError, see `ssm_decomp_after.out`). So
+"FP4-GEMM ~48%" is UNVERIFIED against raw Step-2 data: measured ~33% on Step-1; removing the 28%
+get_rows bucket lifts it to ~46% arithmetically, so ~48% is plausible but not directly measured.
+Section 1 above and SSM_DECODE_FIX_RESULTS both inherit this unverified Step-2 split.
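The "~46% arithmetically" step is a plain renormalization. A minimal sketch, assuming patch 0019 removes the whole get_rows bucket and leaves every other kernel's absolute time unchanged (an idealization, which is why the result is only a plausibility check); the helper name is illustrative.

```python
def renormalized_share(share_pct, removed_pct):
    """Share of a bucket after another bucket is removed and the rest rescaled to 100%."""
    return share_pct / (100.0 - removed_pct) * 100.0


# Step-1 measurements from after.sqlite: FP4-GEMM group ~33%, k_get_rows_float 28.44%.
fp4_after = renormalized_share(33.0, 28.44)
print(f"FP4-GEMM share after removing get_rows: {fp4_after:.1f}%")
```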
+
+### CLAIM 2 - 146 -> ~257 ("+66%")
+146.23 baseline CONFIRMED (`ssm_decode_baseline.out`); final 256.57 / 252.50 / 254.02 across
+SSM_DECODE_FIX_RESULTS + THROUGHPUT_B_P2a, within ~1.6%. Magnitude CONFIRMED. TRAP: 146->257 is
+76% (146->254 = +74%), NOT +66%. "66%" is the % of vLLM (257/391 = 65.7%), not the speedup.
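The trap is two different ratios sharing one number. A short sketch separating them, using only the throughput figures quoted in this claim; the function names are illustrative.

```python
def speedup_pct(before, after):
    """Relative gain of `after` over `before`, in percent."""
    return (after / before - 1.0) * 100.0


def share_of_reference_pct(value, reference):
    """`value` as a percentage of `reference`."""
    return value / reference * 100.0


print(f"146 -> 257 speedup: +{speedup_pct(146, 257):.0f}%")
print(f"146 -> 254 speedup: +{speedup_pct(146, 254):.0f}%")
print(f"257 as share of vLLM 391: {share_of_reference_pct(257, 391):.1f}%")
```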
+
+### CLAIM 3 - P2a GEMM-remap FLAT on decode
+THROUGHPUT_B_P2a: dense npl128 252.50->254.02 (+0.6% noise), npl32 -0.4%, MoE flat; FP4 GEMM
+kernel itself -24.7%, PREFILL +12.7%. Pre-SSM corroborated by THROUGHPUT_B_P1. => CONFIRMED.
+
+### CLAIM 4 - 65% of vLLM (254 vs 391)
+254/391 = 64.96%, 256.57/391 = 65.6%; vLLM 391 = enforce_eager apples ref. => CONFIRMED.
+
+### Traps checked
+GGML_CUDA_DISABLE_GRAPHS set `=1` explicitly (not the empty-value trap); graphs ON/OFF within
+noise. memcpy-in-compute lumping AVOIDED (separate table sums). Decomp reps are ntg24-under-nsys
+(S_TG 149/164) - valid for SHARES only; throughput correctly from unprofiled ntg128 logs.
+
+### Net verdict
+1 pre-fix decomp CONFIRMED exact; D2D collapse CONFIRMED (stronger); Step-2 0.7%/48% split
+UNVERIFIED (producer script crashed, only post-fix sqlite is Step-1). 2 magnitude CONFIRMED,
+"+66%" label REFUTED (true +76%; 66% = % of vLLM). 3 CONFIRMED. 4 CONFIRMED.
+
+---
+
+## Section: weight-bandwidth (whole-step DRAM budget, READ-ONLY math)
+
+Agent label: weight-bandwidth. Method: exact GGUF tensor accounting (q36-27b-nvfp4,
+arch qwen35, 64 layers) + activation-state math + existing nsys/decode_decomp; no GPU started.
+Config = the production decode number: llama-batched-bench -fa on -npp128 -ntg128 -npl 128
+(B = n_parallel = 128 sequences, S_TG = 254 t/s post-0019). GB10 LPDDR5x peak ~273 GB/s.
+
+### Exact per-step DRAM byte budget at B=128 (ctx avg ~192 over the ntg128 window)
+
+NVFP4 type-40 = 0.5625 B/weight (4-bit data + e4m3 per-16 micro-scale; verified: 5120*48*0.5625=138240).
+
+WEIGHTS (read ONCE per step, shared across all 128 seqs):
+  - NVFP4 layer weights (type40, 64 layers): 13,062.7 MB = 12.76 GB
+      (per SSM layer 205.6 MB x48 = 9867.7 MB ; per full-attn layer 199.7 MB x16 = 3195.0 MB)
+  - LM head output.weight: type 30 = **bf16, NOT quantized** = 2425 MB = 2.37 GB (read in full each step)
+  - per-layer norms/conv1d/ssm_a/dt_bias (type0 f32): 10.1 MB
+  - token_embd: EXCLUDED (get_rows gathers only 128 rows, negligible)
+  => WEIGHTS TOTAL = 15.14 GB / step
+
+PER-SEQUENCE STATE (x128 seqs, read + write every step):
+  - SSM recurrent state: inner_size(6144) x state_size(128) x 4B(f32) = 3.0 MB / layer / seq
+      x 48 SSM layers x 128 seq = 18.43 GB read + 18.43 GB write = **36.86 GB / step**
+  - conv state: conv_k(4) x conv_dim(10240) x 4B = 160 KB / layer / seq
+      x 48 x 128 = 0.96 GB read + 0.96 GB write = 1.92 GB / step
+  - KV cache (16 full-attn layers, GQA n_kv_head=4, k+v_len=512, f16):
+      4096 B/tok/layer x 16 x ~192 ctx x 128 seq = ~1.6 GB read / step
+
+  TOTAL ~= 15.14 (W) + 36.86 (SSM state) + 1.92 (conv) + 1.6 (KV) = **~55.5 GB / step**
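The per-sequence state terms above can be re-derived from the stated shapes. A sketch under two assumptions: the weight total (15.14 GB) is taken as given from the GGUF accounting rather than recomputed, and the state figures are reproduced with the same loose unit convention the text appears to use (MiB divided by 1000 reported as "GB").

```python
MIB = 1024 ** 2
N_SSM, N_ATTN, B, CTX = 48, 16, 128, 192   # layers, sequences, average context

# NVFP4: 4-bit value plus one 8-bit e4m3 scale shared by 16 weights.
nvfp4_bytes_per_weight = (4 + 8 / 16) / 8

ssm_state = 6144 * 128 * 4     # bytes per layer per seq (f32)
conv_state = 4 * 10240 * 4     # bytes per layer per seq (f32)
kv_per_tok = 4 * 512 * 2       # n_kv_head * (k+v length) * f16 bytes

ssm_rw_mib = 2 * ssm_state * N_SSM * B / MIB     # read + write
conv_rw_mib = 2 * conv_state * N_SSM * B / MIB   # read + write
kv_read_bytes = kv_per_tok * N_ATTN * CTX * B    # read only

WEIGHTS_GB = 15.14  # taken from the weight accounting above, not recomputed
total_gb = WEIGHTS_GB + ssm_rw_mib / 1000 + conv_rw_mib / 1000 + kv_read_bytes / 1e9
print(f"SSM r+w {ssm_rw_mib:.0f} MiB, conv r+w {conv_rw_mib:.0f} MiB, "
      f"KV {kv_read_bytes / 1e9:.2f} GB, total ~{total_gb:.1f} GB/step")
```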
+
+### Floor vs measured -- decode is NOT at the bandwidth floor
+
+  Bandwidth floor = 55.5 GB / 273 GB/s = **203 ms/step**
+  Measured llama  = 128 tok / 254 t/s   = **504 ms/step**  => **2.48x the floor** (eff BW 110 GB/s = 40% of peak)
+  vLLM 391 t/s    = 128 / 391           = **327 ms/step**  => 1.61x the floor (eff BW 170 GB/s = 62% of peak)
+
+  The SAME 55.5 GB/step floor applies to vLLM: identical NVFP4 weights, and its
+  fused_recurrent_gated_delta_rule reads+writes the identical f32 recurrent state. So both engines
+  face the same DRAM wall; vLLM simply moves those bytes at 62% of peak vs llama's 40%. The 62/40 =
+  1.55x utilization gap is EXACTLY the 254->391 (1.54x) throughput gap. => Decode-parity is a
+  bandwidth-UTILIZATION / launch-serialization problem, NOT a DRAM-traffic-volume problem. Bandwidth
+  is not the binding constraint (we sit 2.5x above the floor); confirms the GDN-kernel section above.
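The floor and utilization numbers can be checked with a few lines. This sketch assumes the 55.5 GB/step budget and the ~273 GB/s peak quoted above, and assigns both engines the same byte count, as the text does; the names are illustrative.

```python
PEAK_GBPS = 273.0     # GB10 LPDDR5x peak, as quoted
TRAFFIC_GB = 55.5     # per-step DRAM budget from the table above
BATCH = 128


def floor_ms(traffic_gb=TRAFFIC_GB):
    """Minimum step time if the bytes moved at peak bandwidth."""
    return traffic_gb / PEAK_GBPS * 1000.0


def analyse(tokens_per_s):
    """Return (step ms, multiple of floor, effective GB/s, utilization)."""
    step = BATCH / tokens_per_s * 1000.0
    eff_bw = TRAFFIC_GB / (step / 1000.0)
    return step, step / floor_ms(), eff_bw, eff_bw / PEAK_GBPS


llama = analyse(254.0)
vllm = analyse(391.0)
for name, (step, mult, bw, util) in [("llama", llama), ("vLLM", vllm)]:
    print(f"{name:5s} {step:6.1f} ms/step  {mult:.2f}x floor  {bw:5.1f} GB/s  {util:.0%}")
```

Since both engines are given the same bytes per step, the utilization ratio equals the throughput ratio by construction; the informative number is how far each sits above the floor.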
+
+### Traffic composition is STATE-dominated at B=128 (qualifies the "weight-quant" verdict)
+
+  SSM state r+w = 66% of step traffic; weights = 27%; conv = 3.5%; KV = 3%.
+  At B=128 weights are a minority of traffic, and we are 2.5x above the floor anyway -> NVFP4-dense
+  weight quant (the QWEN36_NVFP4 verdict's lever) cannot move batch-128 decode much. Weight-quant
+  helps PREFILL (compute/weight-bound, already +12.7% from the GEMM remap) and LOW-batch decode.
+  Cross-check at B=32: traffic ~25.2 GB/step (weights now 60%), floor 92 ms, measured 189 ms = 2.05x
+  floor. The sublinear scaling 32->128 (4x batch, only 1.5x throughput: 169->254) is fully explained
+  by per-seq state traffic growing with B while weights stay amortized -> at B=128 the step has become
+  state-traffic-heavy but is STILL 2.5x off the floor, i.e. latency/overlap-bound, not byte-bound.
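The B=32 cross-check follows from splitting the budget into a batch-independent weight term and a per-sequence term. A sketch assuming the B=128 component totals above and the same average context at both batch sizes (a simplification); the function name is illustrative.

```python
PEAK_GBPS = 273.0
WEIGHTS_GB = 15.14                      # read once per step, independent of batch
PER_SEQ_GB_AT_128 = 36.86 + 1.92 + 1.6  # SSM state + conv state + KV at B=128


def traffic_gb(batch):
    """Per-step traffic: fixed weights plus state that scales linearly with batch."""
    return WEIGHTS_GB + PER_SEQ_GB_AT_128 * batch / 128


t32 = traffic_gb(32)
floor32_ms = t32 / PEAK_GBPS * 1000.0
measured32_ms = 32 / 169.0 * 1000.0     # npl32 at 169 t/s
print(f"B=32: {t32:.1f} GB/step, weights {WEIGHTS_GB / t32:.0%}, "
      f"floor {floor32_ms:.0f} ms, measured {measured32_ms:.0f} ms "
      f"= {measured32_ms / floor32_ms:.2f}x floor")
```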
+
+### Redundant traffic llama reads that vLLM avoids (cut list, by impact)
+
+  1. (HISTORICAL, FIXED by 0018) Redundant DtoD recurrent-state copy = +18.4 GB/step EXTRA
+     (pre-fix decode_decomp: MEMCPY-DtoD 18.9%, 80 copies/step ~230 MB each = 18.4 GB; nsys window
+     356 GB/19.8 steps). This doubled state traffic and was the dominant pre-fix waste. Verified gone
+     post-fix: the THROUGHPUT_B_P2a A/B kernel sum (npp128 ntg24 npl128) lists gated_delta_net /
+     mul_mat_q / quantize but NO MEMCPY-DtoD term. (The committed ~/bench/a2_nsys sqlites are all
+     PRE-fix S_TG~149 traces; re-profiling deferred to the designated profiler.) This single removal
+     (18.4 GB/273 ~= 67 ms/step of bytes plus the killed overlap stalls) is the bulk of 146->254.
+  2. conv state as a SEPARATE ssm_conv kernel + separate buffer: 1.92 GB r+w/step AND 48 extra kernel
+     launches/step. vLLM folds the causal conv into its recurrence kernel. Cut ~= 7 ms bytes + 48
+     launches/step of serialization.
+  3. Residual get_rows gather post-0019 (~0.7%, decode_decomp pre-fix k_get_rows was 21.9% / ~96
+     ops/step = 2/SSM-layer): vLLM indexes the per-seq state in-kernel; llama still does a small
+     gather/scatter. ~0.13 GB. 0019 already folds most of it; fold the identity check fully into the
+     recurrence prologue.
+  4. quantize_mmq_nvfp4: 448 ops/step re-quantizing activations to NVFP4 before each FP4 matmul.
+     Activation BYTES are negligible, but it is 448 extra kernel launches/step that vLLM fuses into
+     the GEMM prologue -> pure launch latency, not traffic.
+  5. NOT redundant: weight bytes (identical NVFP4 to vLLM), SSM-state r+w (inherent, vLLM pays it),
+     NVFP4 scale scalars (8 B/tensor). Note the LM head is bf16 not quantized (2.37 GB/step, 16% of
+     weight traffic) -- fp8 LM head would save ~1.2 GB/step but only matters if vLLM also quantizes it.
+
+### Bottom line (weight-bandwidth)
+At B=128, decode moves ~55.5 GB/step and runs at 2.48x the 273 GB/s floor (40% util) vs vLLM's 1.61x
+(62% util). Same bytes, same floor for both engines -> decode is bandwidth-UTILIZATION-bound, not
+traffic-bound. There is NO large redundant-byte stream left to cut post-0018/0019 (the 18.4 GB/step
+DtoD redundancy is already gone); the remaining 254->391 is recovered by raising achieved bandwidth
+(occupancy + prefetch on the GDN state loads, conv fusion to drop 48 launches/step) so the EXISTING
+55.5 GB/step moves at vLLM's 62% instead of 40%. Weight-quant (NVFP4-dense) is a PREFILL / low-batch
+lever, largely orthogonal to the batch-128 decode-parity gap.
+
+---
+
+## Section: explore-other-levers (broad sweep for OTHER llama-specific decode inefficiencies) - READ-ONLY, no GPU
+
+Scope handoff: GDN-kernel internals -> `gdn-source-compare`; host loop / graphs / gaps ->
+`per-token-latency`; weight-byte / utilization -> `weight-bandwidth` section above (which already
+covers the BF16 lm_head and the "same bytes, 40% vs 62% util" framing - I concur, no need to repeat).
+This section covers the levers NONE of those own: the FP4 act-quant fusion, the M=128-vs-M=1 ggml
+fusion gate, TMA scoping, and the conv-state residual.
+
+**Terminology fix that matters for the whole doc:** in this repo's benches **"fusion OFF" means
+`LLAMA_FUSE_NVFP4_QUANT=0`** (Track A's NVFP4 act-quant producer), confirmed in
+`a2_nsys.sh`/`a2_4cell.sh`/`trackA_clean.sh`. It does NOT set `GGML_CUDA_DISABLE_FUSION`, so the
+**standard ggml-cuda elementwise/GLU/rope fusion is ON** in every result. The header's "fusion OFF
+baseline" is only about the act-quant producer.
+
+**Framing (consistent with the sections above, sharpened):** the binder is bandwidth-UTILIZATION /
+the kernel-dependency chain, not traffic or per-kernel compute (P2a -24.7% GEMM and graphs both
+flat). The thing that raises utilization AND shortens the chain is the same: **fewer, fused kernels
+per step** - removing whole passes vLLM doesn't run. So rank by "whole pass eliminated", not "us
+shaved".
+
+### L1. Re-test Track A act-quant fusion (`LLAMA_FUSE_NVFP4_QUANT=1`) POST-SSM. [impact ~8-11%, tractability HIGH - code exists, owned by tasks 38-41]
+`quantize_mmq_nvfp4` is a standalone full-activation requantize run once per NVFP4 GEMM at M=128
+(the weight-bandwidth section counts 448 such launches/step). vLLM has **zero** equivalent:
+`rms_quant_fusion.py:98` folds it into RMSNorm, `act_quant_fusion.py:40,128` into SiLU+mul - the
+activation never hits a temp buffer. Track A built exactly this fused producer (tasks 38-40 DONE),
+but `LLAMA_FUSE_NVFP4_QUANT=1` regressed, and EVERY post-SSM bench ran with it OFF. **The regression
+is likely stale:** pre-SSM the GPU was 99% busy on the state-copy chain, so folding act-quant into
+the norm only relocated busy work into a lower-occupancy kernel with no idle to reclaim. Post-SSM the
+chain has real idle and removing 448 launches/step both shortens the dependency chain and lifts
+utilization - exactly the post-0018/0019 bind. Highest-value CLEAN removal; needs only a re-bench
+(re-run `trackA_clean.sh` on the post-0019 build), not new code. Do not treat the prior regression
+as final.
+
+### L2. M=128 norm->matmul prologue fusion - the ggml fusion gate that does NOT fire at decode batch. [impact ~5-15% aggregate, tractability MEDIUM]
+ggml-cuda's built-in `rms_norm+mul+mul_mat_vec_q` fusion (`ggml_cuda_should_fuse_mul_mat_vec_q`,
+ggml-cuda.cu:2502) is gated to `dst->ne[1]==1` - it ONLY fires at **npl1** (M=1). At npl128
+(`mul_mat_q`, M=128) it does NOT fire, so the per-layer RMSNorm stays a separate kernel feeding the
+GEMM and the act path is unfused (L1). vLLM fuses both into the GEMM prologue at all M. This is the
+M=128 generalization of the existing M=1 fusion + L1; largest aggregate surface but real kernel work.
+Implies a regime split worth stating loudly: **npl1 single-stream latency already gets this fusion;
+the npl128 throughput number does not** - tune the two separately.
+
+### L3. TMA weight feed: a PREFILL / npl1-latency lever, NOT an npl128-decode lever.
+Answering the brief's question (GEMM idle = FEED problem TMA fixes, or off-critical-path TMA can't?):
+P2a cut GEMM compute and the freed time became IDLE, so at npl128 the GEMM finishes early and the
+stall is BETWEEN kernels, not inside the GEMM waiting on weight tiles. TMA accelerates a
+*feed-stalled* GEMM; at npl128 the GEMM is not the binder, so TMA won't move npl128 S_TG. It pays on
+(a) **prefill** (compute/feed-bound; the remap already gave +12.7%) and (b) **npl1 decode**, a pure
+weight-feed GEMV (full model / 273 GB/s ~ 19-20 tok/s ceiling). Scope TMA to prefill + low-batch
+latency; do not bank it for batch-128 decode parity. (Consistent with the weight-bandwidth section's
+"NVFP4-dense is a prefill/low-batch lever".)
+
+### L4. In-place / `ids` conv-state - apply the 0018/0019 pattern to `ssm_conv`. [impact ~1-3%, tractability HIGH, proven pattern, bit-exact-able]
+After the SSM fix the residual D2D is the conv-state copy (`build_conv_state`,
+delta-net-base.cpp:449-525: `build_rs` reads 3 prior samples, `ggml_concat` the new token, writes
+the last 3 back), plus `ssm_conv` (~0.8-1.5%) and a per-GDN-layer `concat_cont` (48/step). The exact
+in-place + `ids`-read treatment from 0018/0019 applies to the conv state, and `ssm_conv`+`concat`
+can fold into the GDN kernel prologue (it already has `ids` plumbing). Small ceiling but bit-exact,
+low-risk, and removes ~48 launches/step from the chain - this is the "conv fusion to drop 48
+launches/step" the weight-bandwidth section calls for, made concrete via the proven patch pattern.
+
+### Deferred (covered by other sections, I concur)
+- GDN occupancy / row-local reductions / gate-norm fusion -> `gdn-source-compare`. Add only: bf16
+  state halves the dominant traffic but vLLM keeps f32, so it is a divergence-from-reference, not a
+  parity lever - last priority, quality-risk.
+- BF16 lm_head / weight-byte / 40%-vs-62% utilization -> `weight-bandwidth` section. lm_head NVFP4 is
+  an absolute ~1-2% trim, not a vLLM-relative gap (vLLM likely keeps it bf16 too).
+- Full-attention KV path (16 attn layers, 0.4-1.8%, O(ctx) but tiny) -> CLOSED, not a lever.
+
+### Bottom line (this section's net-new)
+Ranked by "whole pass vLLM eliminated": **L1 (re-test act-quant fusion post-SSM - clean removable
+pass, code already written, just needs a post-0019 re-bench)** > **L2 (M=128 norm/act prologue
+fusion - biggest aggregate surface, real work)** > **L4 (conv-state in-place - cheap, proven 0018/0019
+pattern, -48 launches/step)**. **L3 (TMA) is mis-scoped if aimed at npl128 decode** - it is a prefill
+/ npl1-latency lever, same bucket as NVFP4-dense weight quant. Caveat inherited from
+`validate-findings`: the post-SSM act-quant absolute share (L1) is on an unverified Step-2 decomp
+(only clean post-fix sqlite is Step-1); re-measure on a clean Step-2 nsys when the profiler runs.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+
+---
+
+## Section: profile-both-engines (GROUND-TRUTH post-SSM nsys of llama AND vLLM at npl128) - THE GPU PROFILER
+
+Agent label: profile-both-engines (the only GPU agent). Fresh post-SSM nsys traces of
+BOTH engines at the same shape (128-seq decode, 128-token prompts), q36-27b-nvfp4 dense.
+llama = `build-cuda-base` (no FP4 flag, byte-identical to stock, HEAD 46d7dd8 = patch 0019
+SSM fix), `llama-batched-bench -npp128 -ntg32 -npl128 -fa on`, eager (DISABLE_GRAPHS=1) for
+a clean per-kernel trace. vLLM = 0.23.0 in-process offline (`VLLM_ENABLE_V1_MULTIPROCESSING=0`
+so cudaProfilerApi controls the worker), enforce_eager, max_num_seqs 256, 128 prompts.
+Decode-only windows (prefill excluded), overlap-correct interval-union busy, GPU-accurate
+per-call kernel durations. This is the post-SSM **Step-2** trace `validate-findings` flagged
+as having no surviving sqlite - it now exists: `~/bench/postssm_decomp/`.
+
+### 0. THROUGHPUT GROUND TRUTH (un-profiled, prefill-subtracted) - resolves the 391 reference
+
+The vLLM 391 reference is real and reproduced. Prefill-subtracted decode step (two-length
+w16/w64 timing, in-process, batch 128):
+
+| engine / mode            | ms/step | decode tok/s | notes                          |
+|--------------------------|---------|--------------|--------------------------------|
+| llama post-SSM (graphs)  | ~510-522| **245-251**  | S_TG @npl128 ntg32 (this run)  |
+| vLLM enforce_eager       | 324.9   | **394.0**    | == the ~391 ref (h2h log 371-384)|
+| vLLM cuda-graphs         | 304.9   | **419.8**    | graphs buy only +6%            |
+
+- **CUDA graphs are NOT the parity lever**: vLLM already reaches 394 t/s EAGER, and graphs add
+  only +6% (394->420). llama-batched-bench already runs WITH graphs and still reads 245. So the
+  gap is per-kernel GPU work, not launch overhead, confirming `per-token-latency` and
+  `A2_CUDAGRAPH_DECODE`.
+- TRAP I hit and corrected: the FIRST vLLM nsys window (0.35-0.99) read 468 ms/step / 273 t/s -
+  WRONG, contaminated by prefill chunked-GDN kernels AND eager-nsys host overhead. The tight
+  decode-only window (0.62-0.98) reads **326.5 ms/step**, within 0.5% of the un-profiled 324.9 ms
+  -> the tight window is faithful; per-kernel numbers below use it.
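
The ms/step and tok/s columns above are tied together by the batch size: each decode step emits one token per sequence. A minimal sketch of that conversion, assuming batch 128 as stated in the text (the function name `decode_tok_per_s` is made up for illustration and is not part of either engine):

```python
BATCH = 128  # sequences decoded in parallel (npl128), as stated in the text


def decode_tok_per_s(ms_per_step: float, batch: int = BATCH) -> float:
    """Aggregate decode throughput: one token per sequence per step."""
    return batch / (ms_per_step / 1000.0)


# ms/step values copied from the throughput table above.
rows = [
    ("vLLM enforce_eager", 324.9),
    ("vLLM cuda-graphs", 304.9),
    ("llama post-SSM (fast end)", 510.0),
    ("llama post-SSM (slow end)", 522.0),
]
for label, ms in rows:
    print(f"{label:28s} {ms:6.1f} ms/step -> {decode_tok_per_s(ms):6.1f} tok/s")
```

Running the same conversion on the contaminated first window (468 ms/step) gives roughly 273 tok/s, which is how a mis-windowed trace can look like a slower engine.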
+
+### 1. POST-SSM per-step decode decomposition, SIDE BY SIDE (GPU-accurate, prefill-free)
+
+Both at batch 128. llama 510 ms/step (98.7% GPU-busy), vLLM 326 ms/step (97.9% GPU-busy).
+ms/step = on-device kernel time per real decode step (nsys host overhead does not inflate GPU
+kernel duration; per-step = GPU-ms / real-step-count from the decode-only GDN call count).
+
+| component (per step)        | llama ms/step | llama % | vLLM ms/step | vLLM % |
+|-----------------------------|---------------|---------|--------------|--------|
+| GDN linear-attn recurrence  | 193 (48x4.03) | 38%     | 174 (48x3.62)| 53%    |
+| FP4 matmul + act-quant      | **236**       | **46%** | **117**      | **36%**|
+|   - mul_mat_vec_q (GEMV)     | 132 (48x2.75) | 26%     | -            | -      |
+|   - mul_mat_q (GEMM)         | 88 (448 calls)| 17%     | cutlass 61   | 19%    |
+|   - quantize_mmq_nvfp4       | 16 (448)      | 3%      | nvjet 53+cvt2| 17%    |
+| full attention (16 layers)  | 6.6 (16)      | 1.3%    | 6.2 (16)     | 1.9%   |
+| SSM conv + glue/elementwise | ~45           | 9%      | ~22          | 7%     |
+| MEMCPY (D2D+H2D)            | 2.5 (131 MB)  | 0.5%    | 0.36 (85 MB) | 0.1%   |
+| **TOTAL**                   | **~510**      | 100%    | **~326**     | 100%   |
+
+### 2. The three load-bearing comparisons (the brief)
+
+**(1) GDN compute: llama vs vLLM = NOT the gap.** Per-call GPU duration:
+llama `gated_delta_net_cuda<128>` = **4.03 ms/call**, vLLM
+`fused_recurrent_gated_delta_rule_packed_decode` = **3.62 ms/call**. llama is only **+11%**
+slower per call (+19 ms/step). GDN is comparable; it is the largest single kernel on BOTH sides
+(38% llama, 53% vLLM) but it explains only ~19 ms of the 185 ms gap (~10%). This REFUTES the
+framing that the GDN kernel is the dominant residual lever - it is a minor overage post-0018/0019.
+(The `gdn-source-compare` occupancy/shuffle deltas are real but worth ~19 ms/step, not 1.5x.)
+
+**(2) DRAM bytes/step: llama does NOT read more.** Explicit memcpy: llama **131 MB/step** vs
+vLLM **85 MB/step** - llama copies ~46 MB/step more, but both are <0.5% of the step. The big
+per-layer state copies are GONE (pre-SSM 18 GB/step DtoD -> post-SSM 131 MB/step) - **the SSM fix
+(0018/0019) is confirmed working in this trace.** Weight DRAM (read inside the GEMM/GEMV kernels,
+not memcpy) is the SAME ~15 GB NVFP4 for both engines; at 273 GB/s that is a ~55 ms floor, and
+BOTH engines sit far above it (326-510 ms), so BOTH are compute/kernel-bound, NOT
+weight-bandwidth-bound, and llama reads no extra weight bytes. The 245-vs-394 gap is NOT a
+byte-volume deficit - it is effective-bandwidth/compute-efficiency in the FP4 matmul kernels
+(see section 3).
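
The bandwidth-floor argument is a one-line division, sketched here under the rounded figures quoted in the text (about 15 GB of weights per step, 273 GB/s peak DRAM bandwidth). The exact weight byte count is not given in this section, so the result is only good to a few ms and lands in the mid-50s ms range; `bandwidth_floor_ms` is an illustrative helper, not a profiler API.

```python
WEIGHT_GB = 15.0    # approximate NVFP4 weight bytes read per step (rounded, from the text)
PEAK_GBPS = 273.0   # peak DRAM bandwidth used in the text


def bandwidth_floor_ms(gbytes: float, gbps: float) -> float:
    """Minimum step time if the step did nothing but stream the weights once."""
    return gbytes / gbps * 1000.0


floor_ms = bandwidth_floor_ms(WEIGHT_GB, PEAK_GBPS)
for engine, step_ms in {"llama": 510.0, "vLLM": 326.0}.items():
    print(f"{engine}: {step_ms:.0f} ms/step = {step_ms / floor_ms:.1f}x "
          f"the {floor_ms:.0f} ms weight-read floor")
```

Both engines come out several times above the floor, which is the basis for calling neither one weight-bandwidth-bound at this batch size.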
+
+**(3) GPU-busy% / idle structure: near-identical, both ~98% busy.** llama 98.7% busy (1.3% idle),
+vLLM 97.9% busy (2.1% idle). Neither engine is idle/gap/host-bound at npl128. The entire gap is
+the GPU doing MORE kernel-time per step on llama: llama's non-GDN GPU work = ~310 ms/step vs
+vLLM's ~146 ms/step. That 164 ms delta is concentrated in the FP4 matmul path.
+
+### 3. THE single biggest llama-specific overage: the FP4 matmul path (+119 ms/step = 64% of the gap)
+
+llama spends **236 ms/step** on FP4 matmul+quant; vLLM does ALL its matmul (cutlass FP4 GEMM +
+cublas nvjet + act cvt) in **117 ms/step** - even though vLLM ALSO carries ~18 ms/step of extra
+PyTorch eager elementwise glue that llama's fused ggml kernels avoid. llama is **2.0x slower on
+FP4 matmul**, and that +119 ms is **64% of the entire 185 ms/step gap**.
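
The gap attribution can be re-derived from the side-by-side table. This is a rough sketch using the rounded per-component values from that table and the ~185 ms/step gap quoted in the text; the dictionary keys are shorthand labels chosen here, not kernel names.

```python
# Per-step GPU time in ms, copied from the decomposition table (rounded values).
llama = {"gdn": 193.0, "fp4_matmul": 236.0, "full_attn": 6.6,
         "conv_glue": 45.0, "memcpy": 2.5}
vllm = {"gdn": 174.0, "fp4_matmul": 117.0, "full_attn": 6.2,
        "conv_glue": 22.0, "memcpy": 0.36}
GAP_MS = 185.0  # step-time gap as quoted in the text (510 vs 326 ms, rounded)

# Per-component overage of llama relative to vLLM, and its share of the gap.
deltas = {name: llama[name] - vllm[name] for name in llama}
shares = {name: 100.0 * d / GAP_MS for name, d in deltas.items()}

for name in sorted(deltas, key=deltas.get, reverse=True):
    print(f"{name:12s} +{deltas[name]:6.1f} ms  {shares[name]:5.1f}% of gap")

itemized = sum(deltas.values())
print(f"itemized rows explain {itemized:.0f} of {GAP_MS:.0f} ms; "
      f"{GAP_MS - itemized:.0f} ms is not itemized in the table")
```

With these inputs the FP4 matmul row is the largest delta by a wide margin, and the itemized rows do not quite sum to the full gap, so the shares should be read as approximate.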
+
+Inside llama's FP4 path the dominant, untouched cost is **`mul_mat_vec_q` = 132 ms/step (26% of
+decode), 48 calls/step (exactly one per GDN layer), 2.75 ms/call, grid 5120x128**. This is the
+**FP4 GEMV ("vec_q") kernel running at decode batch 128** for the gated-DeltaNet in-projections -
+a non-tensor-core, memory-bound-style kernel doing M=128 work without GEMM-grade weight-read
+amortization. vLLM runs the equivalent projections through cutlass batched FP4 GEMM (tensor-core,
+weight read amortized across the 128-row batch) at a fraction of the cost. **There is no
+GEMV-at-batch-128 on the vLLM side at all.**
+
+Key cross-check with Track B P2a: P2a optimized `mul_mat_q` (the 17%/88 ms tensor-core GEMM,
+improving it by 24.7%) and decode stayed FLAT - because the BIG FP4 cost is `mul_mat_vec_q` (26%/132 ms),
+which P2a never touched. **Track B optimized the wrong FP4 kernel.** The lever is to route the
+GDN in-projection at M=128 through a tensor-core GEMM (mul_mat_q / MMQ) instead of the vec_q path,
+and to fuse the act-quant (L1) + the norm prologue (L2) so the 448 `quantize_mmq_nvfp4` launches
+fold away - exactly what `explore-other-levers` L1/L2 propose. My measurement RANKS them: the
+mul_mat_vec_q->GEMM routing is the single highest-value target (132 ms), then act-quant fusion
+(16 ms + 448 launches), then the GDN +19 ms.
+
+### 4. Reconciling with the `weight-bandwidth` section (unification, not contradiction)
+
+weight-bandwidth concluded "same 55.5 GB/step, llama 40% util vs vLLM 62% util -> utilization-bound."
+My per-kernel data LOCALIZES that utilization gap: it lives in the **FP4 matmul kernels** (which
+do the bulk of the ~15 GB weight read), NOT in the GDN state traffic. GDN moves its (equal) state
+bytes at comparable rate on both engines (4.03 vs 3.62 ms/call). So the "40% vs 62%" is the
+`mul_mat_vec_q`/`mul_mat_q` weight-read efficiency vs cutlass FP4 GEMM. Raising decode parity =
+raise the FP4-matmul achieved bandwidth (tensor-core GEMM routing + act/norm prologue fusion),
+not the GDN kernel and not byte-cutting.
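
The 40%-vs-62% utilization figures can be cross-checked against the step times measured here. This sketch assumes the weight-bandwidth section computed utilization as (bytes per step / step time) / peak bandwidth, with the 273 GB/s peak used earlier in this section; that formula is an assumption about the other section's method, not something stated in it.

```python
BYTES_PER_STEP_GB = 55.5  # total bytes/step quoted from the weight-bandwidth section
PEAK_GBPS = 273.0         # peak DRAM bandwidth (assumed to be the same basis)


def utilization(step_ms: float) -> float:
    """Achieved bandwidth as a fraction of peak for one decode step."""
    achieved_gbps = BYTES_PER_STEP_GB / (step_ms / 1000.0)
    return achieved_gbps / PEAK_GBPS


for engine, step_ms in {"llama": 510.0, "vLLM": 326.0}.items():
    print(f"{engine}: {utilization(step_ms) * 100:.0f}% of peak at {step_ms:.0f} ms/step")
```

Under that assumption the two step times reproduce the quoted utilization split, which supports reading the two sections as describing the same gap from different angles.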
+
+### Verdict (profiler)
+- Reproduced both engines at their true operating points: llama 245 / vLLM 394 eager / 420 graphs.
+  Graphs are not the lever (+6%). Both engines ~98% GPU-busy; gap is GPU kernel-time, not idle/host.
+- GDN compute is comparable (llama +11%/call, +19 ms/step) - NOT the dominant residual.
+- bytes/step: llama is not materially higher (131 vs 85 MB memcpy, both <0.5% of the step;
+  identical weight bytes); SSM fix's 18 GB/step DtoD removal CONFIRMED in-trace.
+- **The single biggest llama-specific overage is the FP4 matmul path: 236 vs 117 ms/step (+119 ms
+  = 64% of the 185 ms gap), dominated by `mul_mat_vec_q` (FP4 GEMV at batch 128, 132 ms/step, 26%,
+  one per GDN layer).** Highest-value lever = route the GDN in-projection through a tensor-core FP4
+  GEMM at M=128 + fuse act-quant/norm prologue (L1/L2). Track B optimized the wrong FP4 kernel.
+
+### Evidence (DGX, this agent)
+- `~/bench/postssm_decomp/postssm_base.{nsys-rep,sqlite,gpu_trace.csv,run.log}` (llama post-SSM).
+- `~/bench/postssm_decomp/vllm_decode.{nsys-rep,gpu_trace.csv}` (vLLM eager decode trace).
+- `~/bench/postssm_decomp/vllm_decode_g1.*` (vLLM graphs run), `~/bench/vllm_tps.py` (throughput).
+- Scripts: `~/bench/postssm_llama_decomp.sh`, `~/bench/vllm_nsys_run.sh`, `~/bench/decode_decomp2.py`
+  (decode-only windowed, overlap-correct, MB-memcpy, per-step reconstruction).
+
+Assisted-by: Claude:opus-4.8 [Claude Code]