mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-26 01:16:58 -04:00
docs(paged): FINAL DECISION - NO-BUILD fused recurrence, BUILD conv fusion + bf16 state
Synthesis of the byte-gate workflow (ncu-byte-gate measurement +
vllm-fused-recurrence-study + llama-fused-recurrence-design + conv-fusion-design).
Verdict closes all five decision points:
(1) Byte ratio: llama re-stream ~1.0x (cap <=1.33x); recurrence at 74% GB10 peak,
MORE BW-efficient than vLLM packed_decode at 41%. The 2x DRAM gap is 100%
f32-vs-bf16 state-cache width, not extra passes.
(2) Fused single-pass recurrence: NO-BUILD - already one R + one W of f32 state,
gate ops touch tiny q/k/g/beta not the 805 MB state -> recovers ~0 bytes.
(3) Conv-state in-place fusion: GO - bit-exact, no-regret, +12-14 ms/step (~+3%),
eliminates concat_cont + cpy_scalar + folds silu.
(4) bf16 SSM state: BUILD (KL<1e-3 gated product call) - only lever on the dominant
50% recurrence term, +45-95 ms/step -> step 289-339 ms = parity-to-ahead of vLLM.
Bit-exact parity unreachable on this term (f32 bytes irreducible); bf16 = equal
precision to vLLM, which is itself bf16.
(5) Build order: conv fusion next (no-regret, bit-exact), then bf16 state (highest
value, gated). Confirming measurements stated per step.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
@@ -254,4 +254,91 @@ conv-path: ~12.6 MB (launch-bound ~19.5 ms/step, not byte-bound).
|
||||
=> NO-BUILD fused recurrence (already single-pass, more efficient than vLLM); BUILD bf16 state
|
||||
(halves the dominant 805 MB, ~45-95 ms/step, parity-to-ahead). Deciding number: re-stream ~1.0x.
|
||||
|
||||
---
|
||||
|
||||
# FINAL DECISION (synthesis of all four agents) - the five points
|
||||
|
||||
This closes the workflow. Inputs: `ncu-byte-gate` (measured byte ratio), `vllm-fused-recurrence-study`
|
||||
(vLLM's single-pass boundary), `llama-fused-recurrence-design` (the fold/levers), `conv-fusion-design`
|
||||
(the no-regret conv in-place lever). They agree on every number; the decision is unambiguous.
|
||||
|
||||
## (1) Byte-ratio verdict - the decisive number
|
||||
|
||||
**llama is at the hardware bandwidth floor, NOT re-streaming.** Re-stream factor = **~1.0x**, hard
|
||||
capped at **<=1.33x** (the most bytes a 3.98 ms call can move at 273 GB/s peak is 1.087 GB = 1.33x
|
||||
the 816 MB minimal; >=1.5x is physically impossible). The recurrence kernel runs at **74% of GB10
|
||||
peak BW** (805.3 MB R+W / 3.98 ms = 202 GB/s) - MORE bandwidth-efficient than vLLM's fused triton
|
||||
`packed_decode` at **41% of peak** (402.6 MB / 3.62 ms = 111 GB/s). Source confirms both are
|
||||
single-pass and coalesced (llama `s_shard` load-once/store-once, 128 consecutive f32/warp; vLLM
|
||||
`b_h = load(p_h0)` once -> f32 regs -> `store(p_ht, b_h.to(bf16))` once). The entire 2x DRAM gap
|
||||
vs vLLM is **100% f32 (llama) vs bf16 (vLLM) state-cache WIDTH**, not extra passes.
|
||||
|
||||
## (2) Fused single-pass GDN recurrence: **NO-BUILD**
|
||||
|
||||
A fused single-pass rewrite recovers **~0 state bytes** because the kernel is already one read + one
|
||||
write of the f32 state, and the un-fused l2norm/sigmoid/softplus/gate ops act on the tiny
|
||||
q/k/g/beta projections (8 MB/call, <1%), not the 805 MB state. There is no second pass to fuse away.
|
||||
Expected ceiling if built anyway: unchanged 191 ms recurrence -> no movement on the dominant 50% of
|
||||
the step. **Do not build it.** This refutes the workflow's founding hypothesis with a measured cap.
|
||||
|
||||
## (3) Conv-state in-place fusion (`conv-fusion-design`): **GO - confirmed, bit-exact, no-regret**
|
||||
|
||||
This is independent of the recurrence verdict and holds regardless. Build a fused
|
||||
`ggml_ssm_conv_update_inplace` (mirrors the 0018/0019 in-place pattern) that, at decode
|
||||
(`n_seq_tokens==1 && !keep && fused-AR && n_rs_seq==0`), assembles the width-4 conv window in
|
||||
registers from the cached K-1=3 taps + the native `qkv_mixed` token, computes the depthwise conv,
|
||||
folds `silu`, and writes the 1-token-shifted ring state back in place.
|
||||
- Eliminates `concat_cont` (8.14 ms/step), `cpy_scalar` (5.76 ms/step), the transpose
|
||||
materialization, and the separate `ggml_silu`; replaces `ssm_conv` with a ~1.6x-byte fused kernel
|
||||
(5.56 -> ~9 ms). **Net ~12-14 ms/step = +3.1 to +3.7%** -> dense 335 -> ~346-349 tok/s @npl128
|
||||
(88.5-89.3% of vLLM 391).
|
||||
- **Bit-exact**: identical ascending-j width-4 FMA order as `ssm_conv_f32` at i==0, same `silu`
|
||||
primitive, same f32 state bytes written - only the producing node changes. Greedy output is
|
||||
bit-identical to the 0018/0019 baseline. LOW risk, additive to everything else.
|
||||
|
||||
## (4) Recurrence floor-mover: bf16 SSM state - **BUILD (gated product call)**, and the bit-exact question
|
||||
|
||||
Since the recurrence is at the f32 byte floor, the **only** lever on the dominant 191 ms (50% of the
|
||||
step) is narrowing the state-cache width to bf16, exactly as vLLM does.
|
||||
- Store `ssm_states_all` in bf16; load bf16->f32 into `s_shard`, run ALL recurrence arithmetic in
|
||||
f32 (UNCHANGED), store f32->bf16. 805.3 -> ~413 MB/call -> ~2.0-3.0 ms/call -> save **~45-95 ms/
|
||||
step** -> step 384 -> **289-339 ms** = parity-to-ahead of vLLM (327 ms / 391 tok/s; projected
|
||||
360-443 tok/s @npl128).
|
||||
- **Bit-exact parity is UNREACHABLE on this term, by construction.** The f32 state bytes are
|
||||
irreducible (single pass already), so matching vLLM's *speed* on the recurrence requires matching
|
||||
vLLM's *width* (bf16). bf16 state is non-bit-exact vs llama's own f32 reference, but it is **equal
|
||||
precision to vLLM** (vLLM's state cache is itself bf16). "Bit-exact parity with vLLM" was never on
|
||||
the table - vLLM is the less-precise reference here. Gate the build on **KL < 1e-3 / PPL-delta**
|
||||
over a 256-token greedy run, not on md5, with a `cparams` f32 fallback. The geometric state decay
|
||||
(g<1) bounds per-step bf16 rounding, so accumulation is well-behaved.
|
||||
- Bit-exact gains that ARE reachable (vs llama f32): the conv fusion (3) and the activation-fold
|
||||
lever (1) - together ~9-11% - but they top out near ~93-96% of vLLM and never touch the 50%
|
||||
recurrence term.
|
||||
|
||||
## (5) Ranked build order + the single highest-value next step
|
||||
|
||||
1. **Conv-state in-place fusion (BUILD NEXT - no-regret).** Bit-exact, LOW risk, +12-14 ms (~+3%),
|
||||
reuses the proven 0018/0019 in-place op pattern. Build this first because it is risk-free, purely
|
||||
additive, and de-risks the in-place conv-cache plumbing the bf16 work also touches.
|
||||
Confirming measurement: nsys decode trace shows `concat_cont` and `cpy_scalar` GONE, step
|
||||
384 -> ~370-372 ms, and greedy md5 IDENTICAL to the 0019 baseline (dense text md5, MoE
|
||||
byte-identical).
|
||||
2. **bf16 SSM state cache (HIGHEST-VALUE lever - gated product call).** The ONLY lever on the
|
||||
dominant 50% recurrence term: +45-95 ms/step, step -> 289-339 ms = parity-to-ahead of vLLM.
|
||||
Non-bit-exact vs llama f32, equal precision to vLLM. Confirming measurement: `gated_delta_net_cuda`
|
||||
duration/call drops 3.98 -> 2.0-3.0 ms in nsys; **KL < 1e-3 / PPL-delta vs the f32 build over
|
||||
256-token greedy** passes; step time and tok/s hit the 289-339 ms / 360-443 tok/s band; cparams
|
||||
f32 fallback verified.
|
||||
3. **Activation-op fold, design lever (1) (OPTIONAL, bit-exact, diminishing).** After (1) takes the
|
||||
conv/silu buckets, the residual fold (q/k l2norm + gate softplus/sigmoid + gated-RMSNorm epilogue
|
||||
+ launch overhead) is ~3-5%; bit-exact but bounded. Build only if the goal is >90% of vLLM with
|
||||
no bf16. Confirming measurement: per-op launch count for the GDN layer collapses to ~1; greedy
|
||||
md5 unchanged.
|
||||
|
||||
**Single highest-value next implementation step: bf16 SSM state cache (#2)** - it is the only change
|
||||
that moves the dominant 191 ms term and reaches vLLM parity-to-ahead. Its confirming measurement is
|
||||
the `gated_delta_net_cuda` per-call time dropping to ~2.0-3.0 ms AND the KL<1e-3 gate passing.
|
||||
**Recommended immediate build: the conv fusion (#1) first** (no-regret, bit-exact) so the bf16 work
|
||||
lands on an already-cleaned conv path; ship #2 as a `cparams`-gated, KL-validated product option.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
Reference in New Issue
Block a user