docs(paged): FINAL DECISION - NO-BUILD fused recurrence, BUILD conv fusion + bf16 state

Synthesis of the byte-gate workflow (ncu-byte-gate measurement + vllm-fused-recurrence-study + llama-fused-recurrence-design + conv-fusion-design). The verdict closes all five decision points:

(1) Byte ratio: llama re-stream ~1.0x (cap <=1.33x); recurrence at 74% of GB10 peak, MORE BW-efficient than vLLM packed_decode at 41%. The 2x DRAM gap is 100% f32-vs-bf16 state-cache width, not extra passes.
(2) Fused single-pass recurrence: NO-BUILD - already one R + one W of f32 state; gate ops touch the tiny q/k/g/beta, not the 805 MB state -> recovers ~0 bytes.
(3) Conv-state in-place fusion: GO - bit-exact, no-regret, saves 12-14 ms/step (~+3%), eliminates concat_cont + cpy_scalar and folds silu.
(4) bf16 SSM state: BUILD (KL<1e-3 gated product call) - only lever on the dominant 50% recurrence term, saves 45-95 ms/step -> step 289-339 ms = parity-to-ahead of vLLM. Bit-exact parity is unreachable on this term (f32 bytes irreducible); bf16 = equal precision to vLLM, which is itself bf16.
(5) Build order: conv fusion next (no-regret, bit-exact), then bf16 state (highest value, gated). Confirming measurements stated per step.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 01:16:58 -04:00 · 2026-06-25 15:27:04 +00:00
parent fd4332e8f0
commit 2a8103c419
1 changed files with 87 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/GDN_RECURRENCE_BYTE_GATE.md
+++ b/backend/cpp/llama-cpp/patches/paged/GDN_RECURRENCE_BYTE_GATE.md
@@ -254,4 +254,91 @@ conv-path: ~12.6 MB (launch-bound ~19.5 ms/step, not byte-bound).
 => NO-BUILD fused recurrence (already single-pass, more efficient than vLLM); BUILD bf16 state
 (halves the dominant 805 MB, ~45-95 ms/step, parity-to-ahead). Deciding number: re-stream ~1.0x.

+---
+
+# FINAL DECISION (synthesis of all four agents) - the five points
+
+This closes the workflow. Inputs: `ncu-byte-gate` (measured byte ratio), `vllm-fused-recurrence-study`
+(vLLM's single-pass boundary), `llama-fused-recurrence-design` (the fold/levers), `conv-fusion-design`
+(the no-regret conv in-place lever). They agree on every number; the decision is unambiguous.
+
+## (1) Byte-ratio verdict - the decisive number
+
+**llama is at the hardware bandwidth floor, NOT re-streaming.** Re-stream factor = **~1.0x**, hard
+capped at **<=1.33x** (the most bytes a 3.98 ms call can move at 273 GB/s peak is 1.087 GB = 1.33x
+the 816 MB minimal; >=1.5x is physically impossible). The recurrence kernel runs at **74% of GB10
+peak BW** (805.3 MB R+W / 3.98 ms = 202 GB/s) - MORE bandwidth-efficient than vLLM's fused triton
+`packed_decode` at **41% of peak** (402.6 MB / 3.62 ms = 111 GB/s). Source confirms both are
+single-pass and coalesced (llama `s_shard` load-once/store-once, 128 consecutive f32/warp; vLLM
+`b_h = load(p_h0)` once -> f32 regs -> `store(p_ht, b_h.to(bf16))` once). The entire 2x DRAM gap
+vs vLLM is **100% f32 (llama) vs bf16 (vLLM) state-cache WIDTH**, not extra passes.
+
+## (2) Fused single-pass GDN recurrence: **NO-BUILD**
+
+A fused single-pass rewrite recovers **~0 state bytes** because the kernel is already one read + one
+write of the f32 state, and the un-fused l2norm/sigmoid/softplus/gate ops act on the tiny
+q/k/g/beta projections (8 MB/call, <1%), not the 805 MB state. There is no second pass to fuse away.
+Expected ceiling if built anyway: unchanged 191 ms recurrence -> no movement on the dominant 50% of
+the step. **Do not build it.** This refutes the workflow's founding hypothesis with a measured cap.
+
+## (3) Conv-state in-place fusion (`conv-fusion-design`): **GO - confirmed, bit-exact, no-regret**
+
+This is independent of the recurrence verdict and holds regardless. Build a fused
+`ggml_ssm_conv_update_inplace` (mirrors the 0018/0019 in-place pattern) that, at decode
+(`n_seq_tokens==1 && !keep && fused-AR && n_rs_seq==0`), assembles the width-4 conv window in
+registers from the cached K-1=3 taps + the native `qkv_mixed` token, computes the depthwise conv,
+folds `silu`, and writes the 1-token-shifted ring state back in place.
+- Eliminates `concat_cont` (8.14 ms/step), `cpy_scalar` (5.76 ms/step), the transpose
+  materialization, and the separate `ggml_silu`; replaces `ssm_conv` with a ~1.6x-byte fused kernel
+  (5.56 -> ~9 ms). **Net ~12-14 ms/step saved = +3.1 to +3.7%** -> dense 335 -> ~346-349 tok/s
+  @npl128 (88.5-89.3% of vLLM 391).
+- **Bit-exact**: the same ascending-j width-4 FMA order as `ssm_conv_f32` at i==0, same `silu`
+  primitive, same f32 state bytes written - only the producing node changes. Greedy output is
+  bit-identical to the 0018/0019 baseline. LOW risk, additive to everything else.
+
+## (4) Recurrence floor-mover: bf16 SSM state - **BUILD (gated product call)**, and the bit-exact question
+
+Since the recurrence is at the f32 byte floor, the **only** lever on the dominant 191 ms (50% of the
+step) is narrowing the state-cache width to bf16, exactly as vLLM does.
+- Store `ssm_states_all` in bf16; load bf16->f32 into `s_shard`, run ALL recurrence arithmetic in
+  f32 (UNCHANGED), store f32->bf16. 805.3 -> ~413 MB/call -> ~2.0-3.0 ms/call -> save
+  **~45-95 ms/step** -> step 384 -> **289-339 ms** = parity-to-ahead of vLLM (327 ms / 391 tok/s;
+  projected 360-443 tok/s @npl128).
+- **Bit-exact parity is UNREACHABLE on this term, by construction.** The f32 state bytes are
+  irreducible (single pass already), so matching vLLM's *speed* on the recurrence requires matching
+  vLLM's *width* (bf16). bf16 state is non-bit-exact vs llama's own f32 reference, but it is **equal
+  precision to vLLM** (vLLM's state cache is itself bf16). "Bit-exact parity with vLLM" was never on
+  the table - vLLM is the less-precise reference here. Gate the build on **KL < 1e-3 / PPL-delta**
+  over a 256-token greedy run, not on md5, with a `cparams` f32 fallback. The geometric state decay
+  (g<1) bounds per-step bf16 rounding, so accumulation is well-behaved.
+- Bit-exact gains that ARE reachable (vs llama f32): the conv fusion (section 3) and the
+  activation-fold lever (design lever (1)) - together ~9-11% - but they top out near ~93-96% of
+  vLLM and never touch the 50% recurrence term.
+
+## (5) Ranked build order + the single highest-value next step
+
+1. **Conv-state in-place fusion (BUILD NEXT - no-regret).** Bit-exact, LOW risk, saves 12-14
+   ms/step (~+3%), reuses the proven 0018/0019 in-place op pattern. Build this first because it is
+   risk-free, purely additive, and de-risks the in-place conv-cache plumbing the bf16 work also touches.
+   Confirming measurement: nsys decode trace shows `concat_cont` and `cpy_scalar` GONE, step
+   384 -> ~370-372 ms, and greedy md5 IDENTICAL to the 0019 baseline (dense text md5, MoE
+   byte-identical).
+2. **bf16 SSM state cache (HIGHEST-VALUE lever - gated product call).** The ONLY lever on the
+   dominant 50% recurrence term: saves 45-95 ms/step, step -> 289-339 ms = parity-to-ahead of vLLM.
+   Non-bit-exact vs llama f32, equal precision to vLLM. Confirming measurement: `gated_delta_net_cuda`
+   duration/call drops 3.98 -> 2.0-3.0 ms in nsys; **KL < 1e-3 / PPL-delta vs the f32 build over
+   256-token greedy** passes; step time and tok/s hit the 289-339 ms / 360-443 tok/s band; cparams
+   f32 fallback verified.
+3. **Activation-op fold, design lever (1) (OPTIONAL, bit-exact, diminishing).** After build item 1
+   (the conv fusion) takes the conv/silu buckets, the residual fold (q/k l2norm + gate
+   softplus/sigmoid + gated-RMSNorm epilogue + launch overhead) is ~3-5%; bit-exact but bounded.
+   Build only if the goal is >90% of vLLM with no bf16. Confirming measurement: per-op launch count
+   for the GDN layer collapses to ~1; greedy md5 unchanged.
+
+**Single highest-value next implementation step: bf16 SSM state cache (#2)** - it is the only change
+that moves the dominant 191 ms term and reaches vLLM parity-to-ahead. Its confirming measurement is
+the `gated_delta_net_cuda` per-call time dropping to ~2.0-3.0 ms AND the KL<1e-3 gate passing.
+**Recommended immediate build: the conv fusion (#1) first** (no-regret, bit-exact) so the bf16 work
+lands on an already-cleaned conv path; ship #2 as a `cparams`-gated, KL-validated product option.
+
 Assisted-by: Claude:opus-4.8 [Claude Code]
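
The byte-gate arithmetic in points (1) and (4) can be re-derived from the figures quoted in the diff. The following is a minimal Python sketch for checking those numbers; it is not part of the patch. The constant and function names are illustrative, and the calls-per-step count is an assumption derived here as 191 ms / 3.98 ms, since the document does not state it directly.

```python
# Figures copied from the decision document; all names are illustrative only.
PEAK_GBPS = 273.0       # GB10 peak DRAM bandwidth (GB/s)
MIN_BYTES_MB = 816.0    # minimal bytes per recurrence call
STATE_F32_MB = 805.3    # llama: f32 state read + write per call
VLLM_BYTES_MB = 402.6   # vLLM packed_decode: bytes per call (bf16 state)
LLAMA_CALL_MS = 3.98    # llama recurrence kernel duration per call
VLLM_CALL_MS = 3.62     # vLLM packed_decode duration per call
RECURRENCE_MS = 191.0   # recurrence total per decode step
STEP_MS = 384.0         # full decode step


def gbps(megabytes: float, millis: float) -> float:
    """MB per ms is numerically equal to GB per s (decimal units)."""
    return megabytes / millis


def byte_gate() -> dict:
    """Re-derive the point (1) figures: re-stream cap and bandwidth efficiency."""
    max_mb = PEAK_GBPS * LLAMA_CALL_MS  # GB/s * ms = MB movable in one call
    llama_bw = gbps(STATE_F32_MB, LLAMA_CALL_MS)
    vllm_bw = gbps(VLLM_BYTES_MB, VLLM_CALL_MS)
    return {
        "max_mb": max_mb,
        "restream_cap": max_mb / MIN_BYTES_MB,
        "llama_bw_gbps": llama_bw,
        "llama_frac": llama_bw / PEAK_GBPS,
        "vllm_bw_gbps": vllm_bw,
        "vllm_frac": vllm_bw / PEAK_GBPS,
        "width_ratio": STATE_F32_MB / VLLM_BYTES_MB,
    }


def bf16_savings(fast_ms: float = 2.0, slow_ms: float = 3.0) -> tuple:
    """Re-derive the point (4) band for a bf16 state cache.

    ASSUMPTION: calls per step is derived as RECURRENCE_MS / LLAMA_CALL_MS;
    the document does not give this count explicitly.
    """
    calls = RECURRENCE_MS / LLAMA_CALL_MS
    save_hi = (LLAMA_CALL_MS - fast_ms) * calls
    save_lo = (LLAMA_CALL_MS - slow_ms) * calls
    return save_lo, save_hi, STEP_MS - save_hi, STEP_MS - save_lo


if __name__ == "__main__":
    for key, value in byte_gate().items():
        print(f"{key}: {value:.3f}")
    lo, hi, step_best, step_worst = bf16_savings()
    print(f"bf16 saving: {lo:.1f}-{hi:.1f} ms/step")
    print(f"projected step: {step_best:.1f}-{step_worst:.1f} ms")
```

Under these inputs the re-stream cap lands at about 1.33x, llama at about 202 GB/s (74% of peak), and vLLM at about 111 GB/s (41% of peak), matching point (1). The derived bf16 saving comes out near 47-95 ms/step, so the document's ~45 ms low end (and the resulting 339 ms step) is slightly more conservative than this straight-line estimate.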