docs(paged): GDN recurrence byte-gate SETTLED - re-stream ~1.0x, build bf16 state not fused kernel

Decisive measurement (ncu-byte-gate agent, DGX GB10). ncu HW DRAM counters were blocked (ERR_NVGPUCTRPERM, root-only NVreg param; no passwordless sudo), so the byte ratio was settled via CUPTI kernel timing + exact byte geometry: bytes moved <= peak_BW x duration caps the re-stream factor.

llama gated_delta_net_cuda decode (B=128, f32 state): 3.98 ms/call, 805 MB R+W, 202 GB/s = 74% of GB10 peak. vLLM fused_recurrent_packed_decode (B=128, bf16 state): 3.62 ms/call, 402 MB R+W, 111 GB/s = 41% peak. Both single-pass (load-once/store-once, verified in source). llama re-stream factor ~1.0x (hard cap <=1.33x; >=1.5x needs >peak BW = impossible).

VERDICT: NO-BUILD the fused single-pass recurrence - the kernel is already single-pass, coalesced, and MORE bandwidth-efficient than vLLM's triton kernel; the gate ops touch the tiny q/k/g/beta projections, not the 805 MB state, so fusion recovers ~0 state bytes. The entire 2x DRAM gap vs vLLM is f32 (llama) vs bf16 (vLLM) state-cache width.

BUILD bf16 SSM state instead: halves 805->413 MB, saves ~45-95 ms/step, step 384 -> 289-339 ms = parity-to-ahead of vLLM's 327 ms (non-bit-exact vs f32 but equal to vLLM's own bf16 precision).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 01:16:58 -04:00 · 2026-06-25 15:24:49 +00:00
parent 5825b073a5
commit fd4332e8f0
2 changed files with 310 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/BYTEGATE_PROGRESS.md
+++ b/backend/cpp/llama-cpp/patches/paged/BYTEGATE_PROGRESS.md
@@ -0,0 +1,53 @@
+# GDN Recurrence Byte-Gate - Progress (agent: ncu-byte-gate)
+
+## Hard blocker on direct DRAM counters
+- ncu HW perf counters: ERR_NVGPUCTRPERM (NVreg_RestrictProfilingToAdminUsers=restricted, root-only).
+- nsys --gpu-metrics-devices: same ERR_NVGPUCTRPERM.
+- No passwordless sudo on dgx.casa. DRAM byte counters UNOBTAINABLE without root.
+- FALLBACK (decisive, no perfcounters needed): CUPTI kernel TIMING (allowed) + exact byte
+  geometry from kernel source => implied effective BW + a hard mathematical cap on re-stream factor.
+
+## Byte geometry (exact, from gated_delta_net.cu + GGUF)
+- Qwen3.5 dense q36-27b-nvfp4: 48 GDN layers, H=48 v-heads, S_v=128 (square state 128x128/head).
+- State per (seq,head) = 128*128 f32 = 64 KiB. Per (seq,layer) = 48 heads * 64 KiB = 3.0 MiB.
+- Kernel is SINGLE-PASS by construction: loads s_shard[] ONCE into regs, recurrence in-register,
+  writes state ONCE (read_state coalesced 128 consecutive f32/warp; writeback coalesced).
+  l2norm/sigmoid/softplus/gate act on small q/k/g/beta (NOT the 805MB state); gather no-ops at
+  steady decode (identity seqs). => NO multi-pass state re-streaming exists to fuse away.
+- Minimal bytes/call (B=128): state R+W = 128*48*16384*4*2 = 805.3 MB; +q/k/v/out ~10 MB = ~816 MB.
+- Floor time @273 GB/s = 816MB/273 = 2.99 ms/call.
+
+## Measured (clean nsys CUDA timing, graphs OFF, npp8 ntg12 npl128, build-cuda-base df1cc97)
+- llama gated_delta_net_cuda steady decode: 480 calls, grid(48,128,32), avg 3.98 ms/call
+  (min 3.90, max 4.33; very tight => bandwidth-bound). 48 layers => 191 ms/step (50% of 384 ms).
+- Implied effective BW @1.0x bytes = 816MB/3.98ms = 205 GB/s = 75% of 273 peak.
+- HARD CAP: max bytes movable in 3.98ms @273 peak = 1.087 GB = 1.33x minimal.
+  => re-stream factor in [1.0x, 1.33x]. 2x re-streaming PHYSICALLY IMPOSSIBLE.
+  Source proves single-pass+coalesced => ~1.0x, kernel at ~75% peak.
+
+## Conv-path (same trace, steady-decode region kernels, per-call):
+- ssm_conv_f32: 672 calls whole-trace avg 135.9us (incl prefill); decode-region TBD
+- concat_cont: 576 calls avg 169.6us ; concat_non_cont 96 calls (prefill big)
+- cpy_scalar: 896 calls avg 123.7us ; gdn_gather_nonident 672 calls avg 153.9us (mostly no-op)
+
+## vLLM (apples-to-apples: NSEQ=128, enforce_eager=True; postssm_decomp/vllm_decode.sqlite)
+- vLLM state dtype = model_dtype = BF16 (_mamba_state_dtype default "auto"; config dtype=bfloat16).
+  Geometry identical to llama (H=48, k/v head_dim 128, S_v 128).
+- vLLM fused_recurrent_gated_delta_rule_packed_decode steady: 3.62 ms/call (grid 4x6144x1),
+  bf16 state R+W = 402.6 MB => 111 GB/s = 41% peak. SINGLE-PASS (load p_h0 once -> f32 regs ->
+  store bf16 once).
+- llama 3.98 ms/call, f32 805.3 MB => 202 GB/s = 74% peak. llama kernel is MORE BW-efficient.
+
+## Conv-path (llama steady decode, per call x48 layers)
+- concat_cont 169.6us (8.14 ms/step) + cpy_scalar 120.1us (5.76) + ssm_conv_f32 115.9us (5.56)
+  = ~19.5 ms/step. Conv state ~12.6 MB (tiny) => LAUNCH-bound, not byte-bound => fusion lever (~5%).
+- l2_norm 6.8us, gdn_gather 1.21us (no-op identity seqs => gather does NOT re-stream state).
+
+## FINAL VERDICT (DONE)
+- llama re-stream factor ~1.0x (hard cap <=1.33x; >=1.5x physically impossible @273 peak).
+- NO-BUILD fused single-pass recurrence: already single-pass, coalesced, 74% peak (> vLLM 41%);
+  gate ops touch tiny q/k/g/beta, not the 805MB state => recovers ~0 state bytes.
+- BUILD bf16 SSM state (design lever (2)): the 2x gap vs vLLM is 100% f32-vs-bf16 cache width.
+  805->413 MB => saves ~45-95 ms/step => step 384 -> 289-339 ms = parity-to-ahead of vLLM's 327 ms.
+  Non-bit-exact vs llama f32 but equal to vLLM's own bf16 precision.
+- Findings written: GDN_RECURRENCE_BYTE_GATE.md (MEASUREMENT + VERDICT section appended).
--- a/backend/cpp/llama-cpp/patches/paged/GDN_RECURRENCE_BYTE_GATE.md
+++ b/backend/cpp/llama-cpp/patches/paged/GDN_RECURRENCE_BYTE_GATE.md
@@ -0,0 +1,257 @@
+# GDN recurrence byte gate + fused single-pass kernel design
+
+Label: llama-fused-recurrence-design (READ-ONLY, no GPU). Source-and-math design only;
+the byte-ratio measurement itself is produced by the `ncu-byte-gate` agent.
+
+## TL;DR (the correction the workflow was set up to settle)
+
+**The recurrence kernel is ALREADY single-pass on the f32 state.** `gated_delta_net_cuda<128>`
+(after patches 0018 in-place write + 0019 fused gather) loads the whole `s0` column into registers
+ONCE (`s_shard[rows_per_lane]`), runs the entire token loop in registers, and writes the new state
+back ONCE - directly into the persistent cache slot (0018) or scratch. For decode `n_tokens==1`,
+`keep_rs_t==false`: one register load, one register store, no re-read of state from DRAM.
+
+The byte-gate's working hypothesis - "un-fused l2norm/gate/decay/recurrence/state-writeback/gather
+each touching the f32 state, so a fused pass halves DRAM bytes" - is **false for the state**. Only
+the recurrence kernel touches the 3 MB/seq state. The surrounding ops (`l2_norm`, `silu`, `sigmoid`,
+the `gate` exp/softplus, `ssm_conv`, `concat`, `cpy`) all operate on the **small activations**
+(q/k/v/g/beta), which are 100-800x smaller than the state. There is no 2x state re-streaming to
+recover; the recurrence kernel is byte-minimal on state by construction.
+
+Therefore a fused single-pass kernel **cannot move the dominant 196 ms recurrence** - that cost is
+f32-state read+write bandwidth, already a single pass. The two real levers are decoupled:
+
+1. **Fold the surrounding activation ops into the kernel** (MEDIUM effort): recovers the small
+   per-op buckets (`ssm_conv` 1.5% + `silu`/`sigmoid` 1.4% + 2x `l2_norm` + `concat` 2.1% + conv
+   `cpy` 2.0%, ~6-8% of the step) plus per-op launch overhead. Bit-exact. Ceiling ~93-96% of vLLM.
+2. **bf16 state cache** (HIGH effort, NON-bit-exact): halves the dominant byte stream. The only
+   large lever on the 196 ms. Target KL < 1e-3 by keeping f32 register accumulation, storing only
+   the persisted cache in bf16.
+
+Which of (1)/(2) is worth building hinges on the `ncu-byte-gate` byte ratio (below).
+
+## Byte arithmetic (dense q36-27b-nvfp4, decode, npl128, S_v=128, H_v=48, batch=128)
+
+State per (seq, GDN layer) = S_v^2 * H_v = 128*128*48 = 786,432 f32 = **3.0 MiB**.
+
+Per kernel call (one GDN layer, full 128-seq batch), single pass:
+- state read  = 786,432 * 128 * 4 = 402.65 MB
+- state write = 402.65 MB
+- **state R+W = 805.3 MB/call** (768 MiB)
+- activations (q,k 1 MB each; v 3 MB; attn-out 3 MB; g,beta tiny) ~= 8 MB/call = **<1%**.
+
+Measured 4.08 ms/call (node-level trace) -> effective **197.4 GB/s**.
+GB10 / DGX Spark LPDDR5X peak ~= **273 GB/s** -> **~72% of peak.**
+
+48 GDN layers/step -> 38.7 GB of state traffic/step -> 196 ms = ~51% of the 383.48 ms step. The
+~8 MB of activation traffic is noise; state is 99% of the recurrence bytes.
+
+### What this means for the open question
+- The recurrence is single-pass, coalesced (transposed layout: lane reads `state[col*S_v + i]`,
+  consecutive lanes -> consecutive `i`), running at ~72% of peak BW. It is NOT at the 85% hardware
+  floor, but it is NOT re-streaming state either. The 72->85% headroom (~30 ms, bit-exact) is an
+  occupancy/coalescing tune, NOT a fusion win.
+- vLLM `fused_recurrent_gated_delta_rule` does the SAME single-pass recurrence. If vLLM's recurrent
+  state cache is bf16 (model dtype) while llama's is f32, vLLM moves HALF the bytes on the dominant
+  stream - that alone is ~98 ms, i.e. essentially the whole residual decode gap. **This is the
+  single most decision-relevant number for the `ncu-byte-gate` agent to confirm: the dtype/bytes of
+  vLLM's GDN state cache vs llama's f32, plus llama's measured achieved-BW % on the recurrence
+  kernel.** If vLLM is bf16-state -> build (2). If vLLM is also f32-state and at ~85% -> llama is
+  at the floor, only (1) + coalescing remain and bit-exact parity tops out ~95%.
+
+## The fused single-pass kernel design
+
+Two deliverables, layered. Build (1) first (bit-exact, de-risks the graph), gate (2) on the byte
+verdict.
+
+### (1) `ggml_gated_delta_net_decode_fused` - fold the activation ops into the kernel
+
+Folds the pre-recurrence activation ops and the post-recurrence gated RMSNorm into the existing
+single-pass recurrence kernel, so q/k/v/g are produced and consumed in registers/shared and never
+make a separate DRAM round-trip, and the per-op launches collapse to one.
+
+Current decode op chain in `build_layer_attn_linear` (qwen35.cpp 386-461), per GDN layer:
+
+```
+wqkv GEMM -> qkv_mixed                                  (keep: GEMM, separate)
+wqkv_gate GEMM -> z                                     (keep: GEMM, separate)
+ssm_beta GEMM -> beta -> sigmoid                        [FOLD beta sigmoid]
+ssm_alpha GEMM -> alpha -> +ssm_dt -> softplus -> *ssm_a (gate) [FOLD softplus/mul -> per-head g]
+build_conv_state: reshape, transpose qkv, CONCAT, cpy   [concat/cpy -> conv-state plumbing, see note]
+ggml_ssm_conv(conv_input, conv_kernel)                  [FOLD depthwise conv, K=4]
+ggml_silu(conv_output)                                  [FOLD silu]
+views q_conv/k_conv/v_conv
+ggml_l2_norm(q_conv); ggml_l2_norm(k_conv)              [FOLD 2x l2norm]
+[repeat_4d skipped on fused path]
+ggml_gated_delta_net_inplace_ids(...)                   <-- THE recurrence kernel (196 ms)
+build_norm_gated(output, ssm_norm, z): RMSNorm + silu(z) + mul  [FOLD post gated-RMSNorm]
+ssm_out GEMM                                            (keep: GEMM, separate)
+```
+
+Fold list (what moves INTO the kernel):
+- `beta` sigmoid: scalar per (head,seq); apply in-kernel when reading beta.
+- `gate` g = softplus(alpha+dt)*a (GDA, g->ne0==1): scalar per (head,seq); compute/exp in-kernel.
+  The kernel already does `expf(*g_t)` (non-KDA path, line 85) - so feed RAW `alpha`+`dt` and the
+  `a` scale and do softplus+mul+exp in-kernel; removes the `add`/`softplus`/`mul` launches.
+- `ssm_conv` (depthwise causal conv1d, kernel width 4) + `silu`: per channel a length-4 dot of the
+  conv state with `ssm_conv1d` then silu. This is the prologue: each warp/thread, before loading
+  state, computes its q/k/v channel by reading 3 cached conv-state taps + the current qkv_mixed
+  token, dotting the 4-wide kernel, applying silu. The conv state (conv_kernel-1=3 taps x conv_dim)
+  is tiny and already cached; fold its read here and its 1-token shift write into the epilogue
+  (replaces the `concat`+`cpy` conv-state update).
+- `l2_norm` of q and k: a warp reduction over S_v of the per-head q/k vector. The recurrence kernel
+  already does warp reductions over S_v (the kv/attn dot products) - the l2norm reuses the same
+  warp-reduce primitive on q_reg/k_reg right after they are loaded, before the recurrence math.
+- Post: `build_norm_gated` = RMSNorm(output, ssm_norm) * silu(z). The kernel already holds the
+  attn output `attn_col` per (head,seq,col) in registers at the end; fold an S_v warp-reduce RMS,
+  multiply by `ssm_norm` weight and by `silu(z)` (z read once), and write the final gated output -
+  removing the `rms_norm`+`silu`+`mul` launches and one activation round-trip.
+
+State traffic UNCHANGED (still one read + one write). Activation traffic for conv/silu/l2norm/norm
+collapses into the kernel's register/shared path; ~6 separate launches become 0. Expected recovery:
+the ~6-8% surrounding-op buckets + launch overhead. **Bit-exact** if the numeric ordering is held
+(see Numeric notes). Conservative ceiling ~365-375 tok/s dense (~93-96% of vLLM 391).
+
+Data flow (per (h_idx=head, sequence=seq) block, decode n_tokens=1, S_v=128, num_warps=4):
+1. PDL sync.
+2. Prologue (per channel/lane): read 3 conv-state taps + current `qkv_mixed[t]` for this channel,
+   dot with `ssm_conv1d[0..3]`, add conv bias if any, `silu`. Produces this lane's q/k/v element.
+3. l2norm q,k: warp-reduce sum(q^2), sum(k^2) over the S_v dim; scale q_reg,k_reg by rsqrt(.+eps).
+4. Load `s0` column into `s_shard` (UNCHANGED single read).
+5. Recurrence (UNCHANGED math: g-decay, kv = S^T k, delta = (v - g*kv)*beta, S = g*S + k(x)delta,
+   attn = S^T q * scale).
+6. Write `s_shard` back to cache slot ONCE (UNCHANGED single write). Write the 1-token-shifted conv
+   state back to the conv cache (replaces concat+cpy).
+7. Epilogue gated-RMSNorm: warp-reduce sum(attn^2) over S_v -> RMS; multiply by `ssm_norm[col]` and
+   by `silu(z[col])` (z loaded once); write final output element. ssm_out GEMM stays separate.
+
+Inputs added to the op: `ssm_conv1d` weight, `ssm_norm` weight, `z`, conv-state cache view, raw
+`alpha`/`dt`/`a`, eps. This is a wider op signature (src[8..]) - acceptable; gate it behind a new
+`cparams.fused_gdn_decode` resolved exactly like `auto_fgdn` (graph_reserve + device-match probe,
+llama-context.cpp 518-595) so it silently falls back to the current op chain if any device lacks it.
+
+### (2) bf16 recurrent-state cache - the dominant-term lever (NON-bit-exact)
+
+Only build if `ncu-byte-gate` shows vLLM moves fewer state bytes (bf16) OR llama's f32 recurrence is
+already >=85% of peak (then f32 is at the floor and bf16 is the only way down).
+
+- Store `ssm_states_all` (the recurrent-state cache) as bf16. Halves the dominant 805 MB/call -> at
+  the same ~197 GB/s -> ~2.04 ms/call -> ~98 ms/step saved (196 -> ~98). Dense projected
+  335 -> ~440+ tok/s (>= vLLM 391) if BW-bound holds; smaller dtype usually achieves a HIGHER % of
+  peak, so likely better.
+- Kernel change: read state -> convert bf16->f32 into `s_shard` (registers stay f32); all recurrence
+  arithmetic in f32 (UNCHANGED); on write, convert f32->bf16. Accumulation precision is preserved
+  within a step; only the PERSISTED state is rounded to bf16 each step.
+- Numerics: the recurrent state decays geometrically (g<1), so per-step bf16 rounding does not
+  accumulate unboundedly, but it is NOT bit-exact. Validate KL < 1e-3 vs the f32-state build over a
+  256-token greedy run; if KL fails, fall back to f32 state (keep it a cparams toggle). This is the
+  ONLY path to bit-near parity-or-better on the dominant term; bit-EXACT parity on the 196 ms is
+  unreachable because the f32 state bytes are irreducible (single pass already).
+
+## Numeric / bit-exactness notes (for fold (1))
+- l2norm/RMS use f32 warp-reduce accumulation (matches `ggml_l2_norm`/`ggml_rms_norm` f32 sum).
+  Order of summation across lanes differs from the standalone op's sequential sum -> floating
+  reassociation. To stay bit-exact, replicate the standalone op's reduction order, OR accept a
+  tiny reassociation delta and gate on KL<1e-3 (the workflow's near-bit-exact target). Recommend:
+  ship fold (1) behind the cparams probe and assert greedy md5 match vs the current chain (0019
+  already established the harness: dense text md5, MoE byte-identical).
+- Recurrence math, scale, g-exp order, beta apply: keep EXACTLY as in `gated_delta_net_cuda` /
+  `ops.cpp` reference (lines 84-141 .cu, 10685-10730 ops.cpp). Do not reorder the
+  v - g*kv -> *beta -> S update -> S^T q sequence.
+- conv: depthwise dot of width-4 kernel in f32, then silu - identical to `ggml_ssm_conv`+`ggml_silu`
+  if done in the same order.
+- gate softplus: `softplus(x)=log1p(exp(x))`; match ggml's `ggml_softplus` (has the >20 fast path)
+  to stay bit-exact.
+
+## Implementation scope
+- (1) `.cu`: extend `gated_delta_net_cuda` with a decode-fused template specialization (or a new
+  kernel) that does conv+silu prologue, q/k l2norm, recurrence, conv-state shift write, gated-RMSNorm
+  epilogue. Add `ggml_cuda_op` dispatch. CPU mirror in `ops.cpp` for parity/CI.
+- (1) `ggml.h`/`ggml.c`: new builder `ggml_gated_delta_net_decode_fused` (extra src: ssm_conv1d,
+  ssm_norm, z, conv-cache view, alpha/dt/a, eps + op_params for eps).
+- (1) graph edits: `delta-net-base.cpp build_recurrent_attn` (add the decode-fused branch alongside
+  the existing fused/ids branch); `qwen35.cpp` + `qwen35moe.cpp` `build_layer_attn_linear` (route
+  the pre/post ops into the op when `cparams.fused_gdn_decode`); leave `qwen3next.cpp`,
+  `kimi-linear.cpp`, the non-fused and rollback (n_rs_seq>0) paths unchanged.
+- (1) `llama-context.cpp`: `auto_fgdn`-style device-match probe to enable/disable the decode-fused
+  op (silent fallback). `cparams.h`/`cparams.fused_gdn_decode`.
+- (2) bf16 state: cache dtype change in the recurrent-memory allocation + the kernel load/store
+  convert + a `cparams` toggle + KL gate. Touches `gated_delta_net.cu` load/store, the inplace/ids
+  builders' state asserts, and the recurrent cache type.
+
+## Risk register
+- (1) is MEDIUM effort, bit-exact-targetable, but bounded upside (~6-8% + launches; ceiling ~95% of
+  vLLM). Worth it only if the workflow wants >90% and accepts no bf16.
+- (2) is the only large lever on the dominant 196 ms but is NON-bit-exact (KL-gated). If vLLM is
+  f32-state, (2) takes llama BELOW vLLM's precision, not toward parity - a product call, not a perf
+  call.
+- The widened op signature (many srcs) raises maintenance cost and the device-match probe matters
+  (CPU offload of a GDN layer must fall back cleanly).
+- Do NOT expect a fused recurrence to cut the 196 ms: it is already one read + one write of f32
+  state. Re-confirm with the `ncu-byte-gate` achieved-BW number before committing HIGH effort.
+
+---
+
+# MEASUREMENT + VERDICT (label ncu-byte-gate, THE GPU agent) - GATE SETTLED
+
+The design above predicted the answer; this is the decisive measurement that confirms it.
+
+## VERDICT: NO-BUILD the fused single-pass recurrence. BUILD bf16 SSM state (design's lever (2)).
+
+Deciding number: **llama re-stream factor = ~1.0x** (mathematically capped at <=1.33x; >=1.5x is
+physically impossible). llama's recurrence kernel is ALREADY single-pass, coalesced, and at
+**74% of GB10 peak BW** - MORE bandwidth-efficient than vLLM's fused triton kernel (41% of peak).
+The whole 2x DRAM gap vs vLLM is **f32 (llama) vs bf16 (vLLM) state-cache width**, not re-streaming.
+
+## ncu HW counters were BLOCKED; timing + geometry gave the byte ratio anyway
+- `ncu dram__bytes` and `nsys --gpu-metrics-devices` both return `ERR_NVGPUCTRPERM`
+  (`NVreg_RestrictProfilingToAdminUsers` restricted, root-only; no passwordless sudo on dgx.casa).
+  DRAM byte counters are unobtainable on this box.
+- Decisive fallback (no perf counters): CUPTI kernel TIMING (allowed) + EXACT byte geometry from
+  the kernel source. bytes_moved <= peak_BW x duration gives a HARD CAP on the re-stream factor;
+  comparing implied effective BW between llama and vLLM (same model, same B, both eager) settles it.
+
+## Measured (clean nsys CUDA timing; build-cuda-base df1cc97 Lever-1; both B=128, both with CUDA graphs disabled)
+llama: `llama-batched-bench -npp 8 -ntg 12 -npl 128 -ub 2048`, GGML_CUDA_DISABLE_GRAPHS=1.
+vLLM:  postssm_decomp/vllm_decode.sqlite, NSEQ=128, enforce_eager=True (apples-to-apples).
+
+| kernel | state dtype | bytes R+W/call | duration/call (steady) | eff. BW | % of 273 peak | re-stream |
+|---|---|---|---|---|---|---|
+| llama gated_delta_net_cuda          | f32  | 805.3 MB | **3.98 ms** (min 3.90 max 4.33, grid 48x128x32) | 202 GB/s | **74%** | ~1.0x |
+| vLLM fused_recurrent...packed_decode | bf16 | 402.6 MB | **3.62 ms** (min 3.53 max 3.96, grid 4x6144x1)  | 111 GB/s | **41%** | ~1.0x |
+
+- llama recurrence/step = 3.98 x 48 = **191 ms** (50% of 384 ms step; matches STATE 196 ms).
+- vLLM recurrence/step  = 3.62 x 48 = **174 ms**. Per-call gap llama-vs-vLLM is only +10%, NOT 2.8x.
+  The old "1.47 ms near-vLLM" was prefill-contaminated; clean decode is 3.98 ms (confirms STATE).
+- Both kernels verified SINGLE-PASS in source (llama: s_shard load-once/store-once, 128 consecutive
+  f32/warp = coalesced; vLLM packed_decode: `b_h += load(p_h0).to(f32)` once, `store(p_ht, b_h.to(bf16))`
+  once). vLLM cache dtype = state_dtype = model_dtype = bf16 (`_mamba_state_dtype` default "auto" ->
+  model dtype; config.json dtype=bfloat16). Geometry identical (H=48, k/v head_dim 128, S_v 128).
+
+## Why re-stream ~1.0x (the gate number)
+Most bytes a 3.98 ms call could move at 273 GB/s peak = 1.087 GB = **1.33x the 816 MB minimal**.
+1.5x/2x re-stream would need >peak BW -> impossible. Source proves single-pass+coalesced -> the 1.0x end:
+805 MB of state at 202 GB/s = 74% peak (816 MB incl. activations = 205 GB/s = 75%). A fused single-pass rewrite recovers ~0 state bytes => NO-BUILD.
+
+## The lever: bf16 SSM state (design (2)) - confirmed, large, parity-to-ahead
+2x recurrence bytes vs vLLM = 100% f32-vs-bf16 cache. llama's kernel is the more efficient one
+(74% vs 41% peak), so bf16 state (cache + load/store bf16, f32 register compute, exactly as vLLM):
+- 805.3 -> ~413 MB => at 74% peak ~2.0 ms/call => 191 -> ~96 ms/step, save ~95 ms => step ~289 ms
+  (~443 tok/s, AHEAD of vLLM's 327 ms/step ~= 391 tok/s). Conservative (50% peak on smaller footprint):
+  ~3.0 ms/call => save ~45 ms => step ~339 ms = roughly vLLM parity. Range = parity-to-ahead.
+- NON-bit-exact vs llama's f32 reference, but EQUAL precision to vLLM (which is bf16). Gate on
+  PPL/KL vs the f32 build, not md5. "Bit-exact parity with vLLM" was never on the table - vLLM is bf16.
+
+## Conv-path (no-regret conv-fusion lever sizing), llama steady decode, per call x48
+concat_cont 169.6 us (8.14 ms/step) + cpy_scalar 120.1 us (5.76) + ssm_conv_f32 115.9 us (5.56)
+= ~19.5 ms/step (~5%). Conv STATE ~12.6 MB (tiny) -> this is LAUNCH/small-kernel overhead, not bytes
+-> a FUSION lever (design (1)), secondary to bf16 state. l2_norm 6.8 us, gdn_gather 1.21 us (no-op,
+identity seqs -> confirms gather does NOT re-stream state at steady decode).
+
+## One-line answer
+llama: 805 MB/call, 74% peak, re-stream ~1.0x (<=1.33x). vLLM: 402 MB/call (bf16), 41% peak.
+conv-path: ~12.6 MB (launch-bound ~19.5 ms/step, not byte-bound).
+=> NO-BUILD fused recurrence (already single-pass, more efficient than vLLM); BUILD bf16 state
+(halves the dominant 805 MB, saves ~45-95 ms/step, parity-to-ahead). Deciding number: re-stream ~1.0x.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
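
The byte-gate arithmetic in the notes above (state geometry, implied bandwidth, the hard cap on the re-stream factor, and the projected bf16 saving) can be re-derived with a short script. This is a minimal sketch, not part of the commit: the helper names are hypothetical, the ~10 MB activation figure is the approximate value quoted in the progress notes, and the timings and the 273 GB/s peak are copied from the measurements rather than re-measured.

```python
"""Back-of-envelope check of the byte-gate numbers (helper names are illustrative)."""

# Figures quoted in the notes above; everything else is derived from them.
PEAK_BW = 273e9            # GB10 LPDDR5X peak, bytes/s
BATCH, HEADS, S_V = 128, 48, 128
LAYERS = 48                # GDN layers per decode step
ACT_BYTES = 10e6           # ~10 MB of q/k/v/out traffic per call (approximate)


def state_rw_bytes(elem_size: int) -> int:
    """Bytes for one read plus one write of one layer's full-batch state."""
    return BATCH * HEADS * S_V * S_V * elem_size * 2


def effective_bw(nbytes: float, ms: float) -> float:
    """Implied bandwidth (bytes/s) if exactly nbytes moved in ms milliseconds."""
    return nbytes / (ms * 1e-3)


def restream_cap(min_bytes: float, ms: float) -> float:
    """Upper bound on the re-stream factor: a call cannot move more than peak_BW * t."""
    return PEAK_BW * ms * 1e-3 / min_bytes


def step_ms_saved(ms_now: float, new_bytes: float, frac_of_peak: float) -> float:
    """Per-step saving if each call moved new_bytes at frac_of_peak of peak bandwidth."""
    ms_new = new_bytes / (frac_of_peak * PEAK_BW) * 1e3
    return (ms_now - ms_new) * LAYERS


f32_state = state_rw_bytes(4)    # llama f32 state cache
bf16_state = state_rw_bytes(2)   # vLLM-style bf16 state cache

llama_bw = effective_bw(f32_state, 3.98)
vllm_bw = effective_bw(bf16_state, 3.62)
cap = restream_cap(f32_state + ACT_BYTES, 3.98)

# Projected bf16 saving at llama's current efficiency (74%) and a conservative 50%.
saved_hi = step_ms_saved(3.98, bf16_state + ACT_BYTES, 0.74)
saved_lo = step_ms_saved(3.98, bf16_state + ACT_BYTES, 0.50)

print(f"f32 state R+W : {f32_state / 1e6:.1f} MB, {llama_bw / 1e9:.0f} GB/s")
print(f"bf16 state R+W: {bf16_state / 1e6:.1f} MB, {vllm_bw / 1e9:.0f} GB/s")
print(f"re-stream cap : {cap:.2f}x")
print(f"bf16 saving   : {saved_lo:.0f}-{saved_hi:.0f} ms/step")
```

Without rounding the per-call time, the optimistic saving comes out at roughly 93 ms/step, slightly under the ~95 ms quoted above, which rounds 2.04 ms/call down to 2.0 ms. The cap depends only on the measured duration and the assumed peak bandwidth, so it holds even though the hardware DRAM counters were unavailable.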