LocalAI/backend/cpp/llama-cpp/patches/paged/A_HYBRID_PROGRESS.md at 33dfe7fd41ccfe11cb5ed33fa542c10229b3ff02

Ettore Di Giacinto 33dfe7fd41 feat(paged): qwen35 hybrid per-head f32/bf16 SSM state - carry fix + gate sweep (patch 0026)

Regenerate patch 0026 with the hybrid-decode carry fix and record the
KL/throughput gate-sweep results.

Fix: clear(data=true) zeroes the whole recurrent buffer including the head_slot
maps, which were uploaded only once at construction; after the post-warmup
reset every head read head_slot==0 (f32-local-0), collapsing the split and
producing incoherent decode. Persist head_slot_host and re-upload via
upload_head_slots() after every buffer clear. Hybrid decode is now coherent and
the cross-op state carry is byte-exact (write==read, both partitions).
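
The failure mode and fix described above can be sketched in isolation. This is a minimal simulation, not the patch itself: `RecurrentBuffer`, `zero_all()`, `head_slot()` and the plain `int32_t` slot encoding are hypothetical names and layout chosen for illustration; only `head_slot_host`, `upload_head_slots()` and `clear(data=true)` come from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical stand-in for the recurrent buffer: per-head state bytes
// followed by the per-head slot map, all inside one allocation.
struct RecurrentBuffer {
    std::vector<uint8_t> device;         // simulated device memory
    std::vector<int32_t> head_slot_host; // persistent host copy of the map
    size_t state_bytes;                  // size of the state region

    RecurrentBuffer(size_t state_bytes_, std::vector<int32_t> slots)
        : device(state_bytes_ + slots.size() * sizeof(int32_t), 0),
          head_slot_host(std::move(slots)),
          state_bytes(state_bytes_) {
        upload_head_slots(); // the initial upload at construction
    }

    // Copy the host-side map into the map region of the buffer.
    void upload_head_slots() {
        const size_t map_bytes = head_slot_host.size() * sizeof(int32_t);
        assert(device.size() == state_bytes + map_bytes);
        if (map_bytes == 0) return;
        std::memcpy(device.data() + state_bytes, head_slot_host.data(), map_bytes);
    }

    // Pre-fix behaviour: zeroing the whole buffer also wipes the slot map.
    void zero_all() { std::fill(device.begin(), device.end(), uint8_t{0}); }

    // Post-fix behaviour: every data clear is followed by a re-upload.
    void clear(bool data) {
        if (data) {
            zero_all();
            upload_head_slots();
        }
    }

    // Read one head's slot back from the simulated device memory.
    int32_t head_slot(size_t head) const {
        int32_t v = 0;
        std::memcpy(&v, device.data() + state_bytes + head * sizeof(int32_t), sizeof v);
        return v;
    }
};
```

Constructing `RecurrentBuffer buf(16, {0, 3, 1, 2})` and calling `buf.zero_all()` leaves every `buf.head_slot(i)` at 0, which mirrors the collapsed split; `buf.clear(true)` restores the map because the host copy outlives the clear.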

Gate result: de-risk PASS (test-backend-ops 84/84; T=0 md5 == 0023 baseline,
both models). Ship gate FAILS - no T_thresh meets MeanKLD<1e-3 AND
same-top-p>=99.5% with a meaningful speedup. The premise that the bf16 error
concentrates in long-memory heads is refuted: KL scales with the bf16 head
count and saturates at ~0.06 MeanKLD / ~91% same-top-p (MoE saturates at the
minimal split). The carry
is byte-exact, so this is genuine bf16 sensitivity, not a bug. The byte-saving
lever is real (dense +12.4%, MoE +11.5% decode @npl128 at T=128) but cannot
meet the strict KL bar. Shipped default-off (f32, bit-exact opt-out); hybrid is
opt-in only and not recommended in the gallery config. Full tables in
A_HYBRID_SSM_RESULTS.md.
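
The ship-gate condition stated above can be written as a small predicate. This is a sketch under stated assumptions: `GateSample`, `passes_ship_gate` and the `min_speedup_pct` parameter are hypothetical, and the commit message does not say what counts as a "meaningful" speedup, so that threshold is left to the caller.

```cpp
// Hypothetical record for one T_thresh setting of the sweep.
struct GateSample {
    double mean_kld;       // MeanKLD versus the f32 baseline
    double same_top_p_pct; // same-top-p agreement, in percent
    double speedup_pct;    // decode speedup versus the f32 baseline, in percent
};

// Both quality bounds come from the commit message (MeanKLD < 1e-3 AND
// same-top-p >= 99.5%); min_speedup_pct is an assumed, caller-chosen bar.
bool passes_ship_gate(const GateSample& s, double min_speedup_pct) {
    return s.mean_kld < 1e-3
        && s.same_top_p_pct >= 99.5
        && s.speedup_pct >= min_speedup_pct;
}
```

A setting near the saturated figures quoted above (MeanKLD about 0.06, same-top-p about 91%) fails on both quality bounds whatever its speedup, which is why the hybrid path stays opt-in.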

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

4.3 KiB

A-build: hybrid per-head f32/bf16 SSM state - BUILD PROGRESS

Design recap (from SPEEDUP_HUNT.md A-hybrid-design)

DE-RISK GATE (must pass before sweep)

KNOB SEMANTICS (IMPORTANT - brief endpoint wording corrected)

STATUS
