From fd4332e8f09ce4d22be44bfdf870452c8d1f04dc Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Thu, 25 Jun 2026 15:24:49 +0000 Subject: [PATCH] docs(paged): GDN recurrence byte-gate SETTLED - re-stream ~1.0x, build bf16 state not fused kernel Decisive measurement (ncu-byte-gate agent, DGX GB10). ncu HW DRAM counters were blocked (ERR_NVGPUCTRPERM, root-only NVreg param; no passwordless sudo), so the byte ratio was settled via CUPTI kernel timing + exact byte geometry: bytes moved <= peak_BW x duration caps the re-stream factor. llama gated_delta_net_cuda decode (B=128, f32 state): 3.98 ms/call, 805 MB R+W, 202 GB/s = 74% of GB10 peak. vLLM fused_recurrent_packed_decode (B=128, bf16 state): 3.62 ms/call, 402 MB R+W, 111 GB/s = 41% peak. Both single-pass (load-once/store-once, verified in source). llama re-stream factor ~1.0x (hard cap <=1.33x; >=1.5x needs >peak BW = impossible). VERDICT: NO-BUILD the fused single-pass recurrence - the kernel is already single-pass, coalesced, and MORE bandwidth-efficient than vLLM's triton kernel; the gate ops touch the tiny q/k/g/beta projections, not the 805 MB state, so fusion recovers ~0 state bytes. The entire 2x DRAM gap vs vLLM is f32 (llama) vs bf16 (vLLM) state-cache width. BUILD bf16 SSM state instead: halves 805->413 MB, ~45-95 ms/step, step 384 -> 289-339 ms = parity-to-ahead of vLLM 327 (non-bit-exact vs f32 but equal to vLLM's own bf16 precision). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- .../patches/paged/BYTEGATE_PROGRESS.md | 53 ++++ .../patches/paged/GDN_RECURRENCE_BYTE_GATE.md | 257 ++++++++++++++++++ 2 files changed, 310 insertions(+) create mode 100644 backend/cpp/llama-cpp/patches/paged/BYTEGATE_PROGRESS.md create mode 100644 backend/cpp/llama-cpp/patches/paged/GDN_RECURRENCE_BYTE_GATE.md diff --git a/backend/cpp/llama-cpp/patches/paged/BYTEGATE_PROGRESS.md b/backend/cpp/llama-cpp/patches/paged/BYTEGATE_PROGRESS.md new file mode 100644 index 000000000..6a68cc504 --- /dev/null +++ b/backend/cpp/llama-cpp/patches/paged/BYTEGATE_PROGRESS.md @@ -0,0 +1,53 @@ +# GDN Recurrence Byte-Gate - Progress (agent: ncu-byte-gate) + +## Hard blocker on direct DRAM counters +- ncu HW perf counters: ERR_NVGPUCTRPERM (NVreg_RestrictProfilingToAdminUsers=restricted, root-only). +- nsys --gpu-metrics-devices: same ERR_NVGPUCTRPERM. +- No passwordless sudo on dgx.casa. DRAM byte counters UNOBTAINABLE without root. +- FALLBACK (decisive, no perfcounters needed): CUPTI kernel TIMING (allowed) + exact byte + geometry from kernel source => implied effective BW + a hard mathematical cap on re-stream factor. + +## Byte geometry (exact, from gated_delta_net.cu + GGUF) +- Qwen3.5 dense q36-27b-nvfp4: 48 GDN layers, H=48 v-heads, S_v=128 (square state 128x128/head). +- State per (seq,head) = 128*128 f32 = 64 KiB. Per seq = 48*64KiB = 3.0 MiB. +- Kernel is SINGLE-PASS by construction: loads s_shard[] ONCE into regs, recurrence in-register, + writes state ONCE (read_state coalesced 128 consecutive f32/warp; writeback coalesced). + l2norm/sigmoid/softplus/gate act on small q/k/g/beta (NOT the 805MB state); gather no-ops at + steady decode (identity seqs). => NO multi-pass state re-streaming exists to fuse away. +- Minimal bytes/call (B=128): state R+W = 128*48*16384*4*2 = 805.3 MB; +q/k/v/out ~10 MB = ~816 MB. +- Floor time @273 GB/s = 816MB/273 = 2.99 ms/call. + +## Measured (clean nsys CUDA timing, graphs OFF, npp8 ntg12 npl128, build-cuda-base df1cc97) +- llama gated_delta_net_cuda steady decode: 480 calls, grid(48,128,32), avg 3.98 ms/call + (min 3.90, max 4.33; very tight => bandwidth-bound). 48 layers => 191 ms/step (50% of 384 ms). +- Implied effective BW @1.0x bytes = 816MB/3.98ms = 205 GB/s = 75% of 273 peak. +- HARD CAP: max bytes movable in 3.98ms @273 peak = 1.087 GB = 1.33x minimal. + => re-stream factor in [1.0x, 1.33x]. 2x re-streaming PHYSICALLY IMPOSSIBLE. + Source proves single-pass+coalesced => ~1.0x, kernel at ~75% peak. + +## Conv-path (same trace, steady-decode region kernels, per-call): +- ssm_conv_f32: 672 calls whole-trace avg 135.9us (incl prefill); decode-region TBD +- concat_cont: 576 calls avg 169.6us ; concat_non_cont 96 calls (prefill big) +- cpy_scalar: 896 calls avg 123.7us ; gdn_gather_nonident 672 calls avg 153.9us (mostly no-op) + +## vLLM (apples-to-apples: NSEQ=128, enforce_eager=True; postssm_decomp/vllm_decode.sqlite) +- vLLM state dtype = model_dtype = BF16 (_mamba_state_dtype default "auto"; config dtype=bfloat16). + Geometry identical to llama (H=48, k/v head_dim 128, S_v 128). +- vLLM fused_recurrent_gated_delta_rule_packed_decode steady: 3.62 ms/call (grid 4x6144x1), + bf16 state R+W = 402.6 MB => 111 GB/s = 41% peak. SINGLE-PASS (load p_h0 once -> f32 regs -> + store bf16 once). +- llama 3.98 ms/call, f32 805.3 MB => 202 GB/s = 74% peak. llama kernel is MORE BW-efficient. + +## Conv-path (llama steady decode, per call x48 layers) +- concat_cont 169.6us (8.14 ms/step) + cpy_scalar 120.1us (5.76) + ssm_conv_f32 115.9us (5.56) + = ~19.5 ms/step. Conv state ~12.6 MB (tiny) => LAUNCH-bound, not byte-bound => fusion lever (~5%). +- l2_norm 6.8us, gdn_gather 1.21us (no-op identity seqs => gather does NOT re-stream state). + +## FINAL VERDICT (DONE) +- llama re-stream factor ~1.0x (hard cap <=1.33x; >=1.5x physically impossible @273 peak). +- NO-BUILD fused single-pass recurrence: already single-pass, coalesced, 74% peak (> vLLM 41%); + gate ops touch tiny q/k/g/beta, not the 805MB state => recovers ~0 state bytes. +- BUILD bf16 SSM state (design lever (2)): the 2x gap vs vLLM is 100% f32-vs-bf16 cache width. + 805->413 MB => ~45-95 ms/step => step 384 -> 289-339 ms = parity-to-ahead of vLLM 327. + Non-bit-exact vs llama f32 but equal to vLLM's own bf16 precision. +- Findings written: GDN_RECURRENCE_BYTE_GATE.md (MEASUREMENT + VERDICT section appended). diff --git a/backend/cpp/llama-cpp/patches/paged/GDN_RECURRENCE_BYTE_GATE.md b/backend/cpp/llama-cpp/patches/paged/GDN_RECURRENCE_BYTE_GATE.md new file mode 100644 index 000000000..3a9e30d84 --- /dev/null +++ b/backend/cpp/llama-cpp/patches/paged/GDN_RECURRENCE_BYTE_GATE.md @@ -0,0 +1,257 @@ +# GDN recurrence byte gate + fused single-pass kernel design + +Label: llama-fused-recurrence-design (READ-ONLY, no GPU). Source-and-math design only; +the byte-ratio measurement itself is produced by the `ncu-byte-gate` agent. + +## TL;DR (the correction the workflow was set up to settle) + +**The recurrence kernel is ALREADY single-pass on the f32 state.** `gated_delta_net_cuda<128>` +(after patches 0018 in-place write + 0019 fused gather) loads the whole `s0` column into registers +ONCE (`s_shard[rows_per_lane]`), runs the entire token loop in registers, and writes the new state +back ONCE - directly into the persistent cache slot (0018) or scratch. For decode `n_tokens==1`, +`keep_rs_t==false`: one register load, one register store, no re-read of state from DRAM. + +The byte-gate's working hypothesis - "un-fused l2norm/gate/decay/recurrence/state-writeback/gather +each touching the f32 state, so a fused pass halves DRAM bytes" - is **false for the state**. Only +the recurrence kernel touches the 3 MB/seq state. The surrounding ops (`l2_norm`, `silu`, `sigmoid`, +the `gate` exp/softplus, `ssm_conv`, `concat`, `cpy`) all operate on the **small activations** +(q/k/v/g/beta), which are 100-800x smaller than the state. There is no 2x state re-streaming to +recover; the recurrence kernel is byte-minimal on state by construction. + +Therefore a fused single-pass kernel **cannot move the dominant 196 ms recurrence** - that cost is +f32-state read+write bandwidth, already a single pass. The two real levers are decoupled: + +1. **Fold the surrounding activation ops into the kernel** (MEDIUM effort): recovers the small + per-op buckets (`ssm_conv` 1.5% + `silu`/`sigmoid` 1.4% + 2x `l2_norm` + `concat` 2.1% + conv + `cpy` 2.0%, ~6-8% of the step) plus per-op launch overhead. Bit-exact. Ceiling ~93-96% of vLLM. +2. **bf16 state cache** (HIGH effort, NON-bit-exact): halves the dominant byte stream. The only + large lever on the 196 ms. Target KL < 1e-3 by keeping f32 register accumulation, storing only + the persisted cache in bf16. + +Which of (1)/(2) is worth building hinges on the `ncu-byte-gate` byte ratio (below). + +## Byte arithmetic (dense q36-27b-nvfp4, decode, npl128, S_v=128, H_v=48, batch=128) + +State per (seq, GDN layer) = S_v^2 * H_v = 128*128*48 = 786,432 f32 = **3.0 MiB**. + +Per kernel call (one GDN layer, full 128-seq batch), single pass: +- state read = 786,432 * 128 * 4 = 402.65 MB +- state write = 402.65 MB +- **state R+W = 805.3 MB/call** (768 MiB) +- activations (q,k 1 MB each; v 3 MB; attn-out 3 MB; g,beta tiny) ~= 8 MB/call = **<1%**. + +Measured 4.08 ms/call (node-level trace) -> effective **197.4 GB/s**. +GB10 / DGX Spark LPDDR5X peak ~= **273 GB/s** -> **~72% of peak.** + +48 GDN layers/step -> 38.7 GB of state traffic/step -> 196 ms = 51.6% of the 383.48 ms step. v=8MB +activation traffic is noise; state is 99% of the recurrence bytes. + +### What this means for the open question +- The recurrence is single-pass, coalesced (transposed layout: lane reads `state[col*S_v + i]`, + consecutive lanes -> consecutive `i`), running at ~72% of peak BW. It is NOT at the 85% hardware + floor, but it is NOT re-streaming state either. The 72->85% headroom (~30 ms, bit-exact) is an + occupancy/coalescing tune, NOT a fusion win. +- vLLM `fused_recurrent_gated_delta_rule` does the SAME single-pass recurrence. If vLLM's recurrent + state cache is bf16 (model dtype) while llama's is f32, vLLM moves HALF the bytes on the dominant + stream - that alone is ~98 ms, i.e. essentially the whole residual decode gap. **This is the + single most decision-relevant number for the `ncu-byte-gate` agent to confirm: the dtype/bytes of + vLLM's GDN state cache vs llama's f32, plus llama's measured achieved-BW % on the recurrence + kernel.** If vLLM is bf16-state -> build (2). If vLLM is also f32-state and at ~85% -> llama is + at the floor, only (1) + coalescing remain and bit-exact parity tops out ~95%. + +## The fused single-pass kernel design + +Two deliverables, layered. Build (1) first (bit-exact, de-risks the graph), gate (2) on the byte +verdict. + +### (1) `ggml_gated_delta_net_decode_fused` - fold the activation ops into the kernel + +Folds the pre-recurrence activation ops and the post-recurrence gated RMSNorm into the existing +single-pass recurrence kernel, so q/k/v/g are produced and consumed in registers/shared and never +make a separate DRAM round-trip, and the per-op launches collapse to one. + +Current decode op chain in `build_layer_attn_linear` (qwen35.cpp 386-461), per GDN layer: + +``` +wqkv GEMM -> qkv_mixed (keep: GEMM, separate) +wqkv_gate GEMM -> z (keep: GEMM, separate) +ssm_beta GEMM -> beta -> sigmoid [FOLD beta sigmoid] +ssm_alpha GEMM -> alpha -> +ssm_dt -> softplus -> *ssm_a (gate) [FOLD softplus/mul -> per-head g] +build_conv_state: reshape, transpose qkv, CONCAT, cpy [concat/cpy -> conv-state plumbing, see note] +ggml_ssm_conv(conv_input, conv_kernel) [FOLD depthwise conv, K=4] +ggml_silu(conv_output) [FOLD silu] +views q_conv/k_conv/v_conv +ggml_l2_norm(q_conv); ggml_l2_norm(k_conv) [FOLD 2x l2norm] +[repeat_4d skipped on fused path] +ggml_gated_delta_net_inplace_ids(...) <-- THE recurrence kernel (196 ms) +build_norm_gated(output, ssm_norm, z): RMSNorm + silu(z) + mul [FOLD post gated-RMSNorm] +ssm_out GEMM (keep: GEMM, separate) +``` + +Fold list (what moves INTO the kernel): +- `beta` sigmoid: scalar per (head,seq); apply in-kernel when reading beta. +- `gate` g = softplus(alpha+dt)*a (GDA, g->ne0==1): scalar per (head,seq); compute/exp in-kernel. + The kernel already does `expf(*g_t)` (non-KDA path, line 85) - so feed RAW `alpha`+`dt` and the + `a` scale and do softplus+mul+exp in-kernel; removes the `add`/`softplus`/`mul` launches. +- `ssm_conv` (depthwise causal conv1d, kernel width 4) + `silu`: per channel a length-4 dot of the + conv state with `ssm_conv1d` then silu. This is the prologue: each warp/thread, before loading + state, computes its q/k/v channel by reading 3 cached conv-state taps + the current qkv_mixed + token, dotting the 4-wide kernel, applying silu. The conv state (conv_kernel-1=3 taps x conv_dim) + is tiny and already cached; fold its read here and its 1-token shift write into the epilogue + (replaces the `concat`+`cpy` conv-state update). +- `l2_norm` of q and k: a warp reduction over S_v of the per-head q/k vector. The recurrence kernel + already does warp reductions over S_v (the kv/attn dot products) - the l2norm reuses the same + warp-reduce primitive on q_reg/k_reg right after they are loaded, before the recurrence math. +- Post: `build_norm_gated` = RMSNorm(output, ssm_norm) * silu(z). The kernel already holds the + attn output `attn_col` per (head,seq,col) in registers at the end; fold an S_v warp-reduce RMS, + multiply by `ssm_norm` weight and by `silu(z)` (z read once), and write the final gated output - + removing the `rms_norm`+`silu`+`mul` launches and one activation round-trip. + +State traffic UNCHANGED (still one read + one write). Activation traffic for conv/silu/l2norm/norm +collapses into the kernel's register/shared path; ~6 separate launches become 0. Expected recovery: +the ~6-8% surrounding-op buckets + launch overhead. **Bit-exact** if the numeric ordering is held +(see Numeric notes). Conservative ceiling ~365-375 tok/s dense (~93-96% of vLLM 391). + +Data flow (per (h_idx=head, sequence=seq) block, decode n_tokens=1, S_v=128, num_warps=4): +1. PDL sync. +2. Prologue (per channel/lane): read 3 conv-state taps + current `qkv_mixed[t]` for this channel, + dot with `ssm_conv1d[0..3]`, add conv bias if any, `silu`. Produces this lane's q/k/v element. +3. l2norm q,k: warp-reduce sum(q^2), sum(k^2) over the S_v dim; scale q_reg,k_reg by rsqrt(.+eps). +4. Load `s0` column into `s_shard` (UNCHANGED single read). +5. Recurrence (UNCHANGED math: g-decay, kv = S^T k, delta = (v - g*kv)*beta, S = g*S + k(x)delta, + attn = S^T q * scale). +6. Write `s_shard` back to cache slot ONCE (UNCHANGED single write). Write the 1-token-shifted conv + state back to the conv cache (replaces concat+cpy). +7. Epilogue gated-RMSNorm: warp-reduce sum(attn^2) over S_v -> RMS; multiply by `ssm_norm[col]` and + by `silu(z[col])` (z loaded once); write final output element. ssm_out GEMM stays separate. + +Inputs added to the op: `ssm_conv1d` weight, `ssm_norm` weight, `z`, conv-state cache view, raw +`alpha`/`dt`/`a`, eps. This is a wider op signature (src[8..]) - acceptable; gate it behind a new +`cparams.fused_gdn_decode` resolved exactly like `auto_fgdn` (graph_reserve + device-match probe, +llama-context.cpp 518-595) so it silently falls back to the current op chain if any device lacks it. + +### (2) bf16 recurrent-state cache - the dominant-term lever (NON-bit-exact) + +Only build if `ncu-byte-gate` shows vLLM moves fewer state bytes (bf16) OR llama's f32 recurrence is +already >=85% of peak (then f32 is at the floor and bf16 is the only way down). + +- Store `ssm_states_all` (the recurrent-state cache) as bf16. Halves the dominant 805 MB/call -> at + the same ~197 GB/s -> ~2.04 ms/call -> ~98 ms/step saved (196 -> ~98). Dense projected + 335 -> ~440+ tok/s (>= vLLM 391) if BW-bound holds; smaller dtype usually achieves a HIGHER % of + peak, so likely better. +- Kernel change: read state -> convert bf16->f32 into `s_shard` (registers stay f32); all recurrence + arithmetic in f32 (UNCHANGED); on write, convert f32->bf16. Accumulation precision is preserved + within a step; only the PERSISTED state is rounded to bf16 each step. +- Numerics: the recurrent state decays geometrically (g<1), so per-step bf16 rounding does not + accumulate unboundedly, but it is NOT bit-exact. Validate KL < 1e-3 vs the f32-state build over a + 256-token greedy run; if KL fails, fall back to f32 state (keep it a cparams toggle). This is the + ONLY path to bit-near parity-or-better on the dominant term; bit-EXACT parity on the 196 ms is + unreachable because the f32 state bytes are irreducible (single pass already). + +## Numeric / bit-exactness notes (for fold (1)) +- l2norm/RMS use f32 warp-reduce accumulation (matches `ggml_l2_norm`/`ggml_rms_norm` f32 sum). + Order of summation across lanes differs from the standalone op's sequential sum -> floating + reassociation. To stay bit-exact, replicate the standalone op's reduction order, OR accept a + tiny reassociation delta and gate on KL<1e-3 (the workflow's near-bit-exact target). Recommend: + ship fold (1) behind the cparams probe and assert greedy md5 match vs the current chain (0019 + already established the harness: dense text md5, MoE byte-identical). +- Recurrence math, scale, g-exp order, beta apply: keep EXACTLY as in `gated_delta_net_cuda` / + `ops.cpp` reference (lines 84-141 .cu, 10685-10730 ops.cpp). Do not reorder the + v - g*kv -> *beta -> S update -> S^T q sequence. +- conv: depthwise dot of width-4 kernel in f32, then silu - identical to `ggml_ssm_conv`+`ggml_silu` + if done in the same order. +- gate softplus: `softplus(x)=log1p(exp(x))`; match ggml's `ggml_softplus` (has the >20 fast path) + to stay bit-exact. + +## Implementation scope +- (1) `.cu`: extend `gated_delta_net_cuda` with a decode-fused template specialization (or a new + kernel) that does conv+silu prologue, q/k l2norm, recurrence, conv-state shift write, gated-RMSNorm + epilogue. Add `ggml_cuda_op` dispatch. CPU mirror in `ops.cpp` for parity/CI. +- (1) `ggml.h`/`ggml.c`: new builder `ggml_gated_delta_net_decode_fused` (extra src: ssm_conv1d, + ssm_norm, z, conv-cache view, alpha/dt/a, eps + op_params for eps). +- (1) graph edits: `delta-net-base.cpp build_recurrent_attn` (add the decode-fused branch alongside + the existing fused/ids branch); `qwen35.cpp` + `qwen35moe.cpp` `build_layer_attn_linear` (route + the pre/post ops into the op when `cparams.fused_gdn_decode`); leave `qwen3next.cpp`, + `kimi-linear.cpp`, the non-fused and rollback (n_rs_seq>0) paths unchanged. +- (1) `llama-context.cpp`: `auto_fgdn`-style device-match probe to enable/disable the decode-fused + op (silent fallback). `cparams.h`/`cparams.fused_gdn_decode`. +- (2) bf16 state: cache dtype change in the recurrent-memory allocation + the kernel load/store + convert + a `cparams` toggle + KL gate. Touches `gated_delta_net.cu` load/store, the inplace/ids + builders' state asserts, and the recurrent cache type. + +## Risk register +- (1) is MEDIUM effort, bit-exact-targetable, but bounded upside (~6-8% + launches; ceiling ~95% of + vLLM). Worth it only if the workflow wants >90% and accepts no bf16. +- (2) is the only large lever on the dominant 196 ms but is NON-bit-exact (KL-gated). If vLLM is + f32-state, (2) takes llama BELOW vLLM's precision, not toward parity - a product call, not a perf + call. +- The widened op signature (many srcs) raises maintenance cost and the device-match probe matters + (CPU offload of a GDN layer must fall back cleanly). +- Do NOT expect a fused recurrence to cut the 196 ms: it is already one read + one write of f32 + state. Re-confirm with the `ncu-byte-gate` achieved-BW number before committing HIGH effort. + +--- + +# MEASUREMENT + VERDICT (label ncu-byte-gate, THE GPU agent) - GATE SETTLED + +The design above predicted the answer; this is the decisive measurement that confirms it. + +## VERDICT: NO-BUILD the fused single-pass recurrence. BUILD bf16 SSM state (design's lever (2)). + +Deciding number: **llama re-stream factor = ~1.0x** (mathematically capped at <=1.33x; >=1.5x is +physically impossible). llama's recurrence kernel is ALREADY single-pass, coalesced, and at +**74% of GB10 peak BW** - MORE bandwidth-efficient than vLLM's fused triton kernel (41% of peak). +The whole 2x DRAM gap vs vLLM is **f32 (llama) vs bf16 (vLLM) state-cache width**, not re-streaming. + +## ncu HW counters were BLOCKED; timing + geometry gave the byte ratio anyway +- `ncu dram__bytes` and `nsys --gpu-metrics-devices` both return `ERR_NVGPUCTRPERM` + (`NVreg_RestrictProfilingToAdminUsers` restricted, root-only; no passwordless sudo on dgx.casa). + DRAM byte counters are unobtainable on this box. +- Decisive fallback (no perf counters): CUPTI kernel TIMING (allowed) + EXACT byte geometry from + the kernel source. bytes_moved <= peak_BW x duration gives a HARD CAP on the re-stream factor; + comparing implied effective BW between llama and vLLM (same model, same B, both eager) settles it. + +## Measured (clean nsys CUDA timing; build-cuda-base df1cc97 Lever-1; both B=128, both graphs/eager-OFF) +llama: `llama-batched-bench -npp 8 -ntg 12 -npl 128 -ub 2048`, GGML_CUDA_DISABLE_GRAPHS=1. +vLLM: postssm_decomp/vllm_decode.sqlite, NSEQ=128, enforce_eager=True (apples-to-apples). + +| kernel | state dtype | bytes R+W/call | duration/call (steady) | eff. BW | % of 273 peak | re-stream | +|---|---|---|---|---|---|---| +| llama gated_delta_net_cuda | f32 | 805.3 MB | **3.98 ms** (min 3.90 max 4.33, grid 48x128x32) | 202 GB/s | **74%** | ~1.0x | +| vLLM fused_recurrent...packed_decode | bf16 | 402.6 MB | **3.62 ms** (min 3.53 max 3.96, grid 4x6144x1) | 111 GB/s | **41%** | ~1.0x | + +- llama recurrence/step = 3.98 x 48 = **191 ms** (50% of 384 ms step; matches STATE 196 ms). +- vLLM recurrence/step = 3.62 x 48 = **174 ms**. Per-call gap llama-vs-vLLM is only +10%, NOT 2.8x. + The old "1.47 ms near-vLLM" was prefill-contaminated; clean decode is 3.98 ms (confirms STATE). +- Both kernels verified SINGLE-PASS in source (llama: s_shard load-once/store-once, 128 consecutive + f32/warp = coalesced; vLLM packed_decode: `b_h += load(p_h0).to(f32)` once, `store(p_ht, b_h.to(bf16))` + once). vLLM cache dtype = state_dtype = model_dtype = bf16 (`_mamba_state_dtype` default "auto" -> + model dtype; config.json dtype=bfloat16). Geometry identical (H=48, k/v head_dim 128, S_v 128). + +## Why re-stream ~1.0x (the gate number) +Most bytes a 3.98 ms call could move at 273 GB/s peak = 1.087 GB = **1.33x the 816 MB minimal**. +1.5x/2x re-stream would need >peak BW -> impossible. Source proves single-pass+coalesced -> 1.0x end: +~816 MB at 202 GB/s = 74% peak. A fused single-pass rewrite recovers ~0 state bytes => NO-BUILD. + +## The lever: bf16 SSM state (design (2)) - confirmed, large, parity-to-ahead +2x recurrence bytes vs vLLM = 100% f32-vs-bf16 cache. llama's kernel is the more efficient one +(74% vs 41% peak), so bf16 state (cache + load/store bf16, f32 register compute, exactly as vLLM): +- 805.3 -> ~413 MB => at 74% peak ~2.0 ms/call => 191 -> ~96 ms/step, save ~95 ms => step ~289 ms + (~443 tok/s, AHEAD of vLLM 327). Conservative (50% peak on smaller footprint): ~3.0 ms/call => + save ~45 ms => step ~339 ms = vLLM parity. Range = parity-to-ahead. +- NON-bit-exact vs llama's f32 reference, but EQUAL precision to vLLM (which is bf16). Gate on + PPL/KL vs the f32 build, not md5. "Bit-exact parity with vLLM" was never on the table - vLLM is bf16. + +## Conv-path (no-regret conv-fusion lever sizing), llama steady decode, per call x48 +concat_cont 169.6 us (8.14 ms/step) + cpy_scalar 120.1 us (5.76) + ssm_conv_f32 115.9 us (5.56) += ~19.5 ms/step (~5%). Conv STATE ~12.6 MB (tiny) -> this is LAUNCH/small-kernel overhead, not bytes +-> a FUSION lever (design (1)), secondary to bf16 state. l2_norm 6.8 us, gdn_gather 1.21 us (no-op, +identity seqs -> confirms gather does NOT re-stream state at steady decode). + +## One-line answer +llama: 805 MB/call, 74% peak, re-stream ~1.0x (<=1.33x). vLLM: 402 MB/call (bf16), 41% peak. +conv-path: ~12.6 MB (launch-bound ~19.5 ms/step, not byte-bound). +=> NO-BUILD fused recurrence (already single-pass, more efficient than vLLM); BUILD bf16 state +(halves the dominant 805 MB, ~45-95 ms/step, parity-to-ahead). Deciding number: re-stream ~1.0x. + +Assisted-by: Claude:opus-4.8 [Claude Code]