mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-26 09:26:55 -04:00
Synthesis of the byte-gate workflow (ncu-byte-gate measurement +
vllm-fused-recurrence-study + llama-fused-recurrence-design + conv-fusion-design).
Verdict closes all five decision points:
(1) Byte ratio: llama re-stream ~1.0x (cap <=1.33x); recurrence runs at 74% of GB10
peak, more BW-efficient than vLLM packed_decode at 41%. The 2x DRAM gap comes
entirely from f32-vs-bf16 state-cache width, not from extra passes.
(2) Fused single-pass recurrence: NO-BUILD - the recurrence already does one read +
one write of the f32 state; the gate ops touch only the tiny q/k/g/beta tensors,
not the 805 MB state, so fusion recovers ~0 bytes.
(3) Conv-state in-place fusion: GO - bit-exact, no-regret, saves 12-14 ms/step (~3%);
eliminates concat_cont + cpy_scalar and folds in silu.
(4) bf16 SSM state: BUILD (product call, gated on KL<1e-3) - the only lever on the
dominant 50% recurrence term; saves 45-95 ms/step -> step 289-339 ms, i.e.
parity-to-ahead of vLLM. Bit-exact parity is unreachable on this term (the f32
bytes are irreducible); bf16 state matches the precision of vLLM, which is
itself bf16.
(5) Build order: conv fusion next (no-regret, bit-exact), then bf16 state (highest
value, gated). Confirming measurements stated per step.
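
The byte and step-time arithmetic behind points (1)-(4) can be sketched as
below. This is an illustrative check, not part of the measurement tooling. It
assumes that 805 MB is the f32 state streamed once per step, and that the
baseline step time can be inferred from the quoted bf16 saving and projected
step ranges. All names are hypothetical.

```python
# Assumption: 805 MB is the f32 recurrent state read and written once per step.
F32_STATE_MB = 805.0
F32_BYTES, BF16_BYTES = 4, 2


def rw_traffic_mb(state_mb: float) -> float:
    """DRAM traffic for one read plus one write of the state."""
    return 2.0 * state_mb


f32_traffic = rw_traffic_mb(F32_STATE_MB)
bf16_traffic = rw_traffic_mb(F32_STATE_MB * BF16_BYTES / F32_BYTES)
dram_ratio = f32_traffic / bf16_traffic  # width-only gap, same number of passes

# Figures quoted in the verdict (ms/step).
BF16_SAVING_MS = (45, 95)
PROJECTED_STEP_MS = (289, 339)
CONV_SAVING_MS = (12, 14)

# Inferred, not stated: baseline = projected step + corresponding saving.
baseline_ms = PROJECTED_STEP_MS[0] + BF16_SAVING_MS[1]
assert baseline_ms == PROJECTED_STEP_MS[1] + BF16_SAVING_MS[0]

# Conv fusion saving as a share of the inferred baseline.
conv_pct = tuple(100.0 * s / baseline_ms for s in CONV_SAVING_MS)

if __name__ == "__main__":
    print(f"f32 R+W: {f32_traffic} MB, bf16 R+W: {bf16_traffic} MB, ratio {dram_ratio}x")
    print(f"inferred baseline: {baseline_ms} ms/step")
    print(f"conv fusion share: {conv_pct[0]:.1f}%-{conv_pct[1]:.1f}%")
```

Under these assumptions the state width alone accounts for the 2x DRAM gap, and
the conv-fusion saving lands near the ~3% quoted in (3).
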
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>