mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-30 03:17:01 -04:00
0042 fuses the pre-norm residual add into RMSNorm (+0.5% prefill, bit-exact). 0043 makes the full-step MoE decode CUDA graph default-on (+2-4% decode, bit-exact; removes ~18x per-step host kernel re-issue, A/B-confirmed). 0034 (native FP4-MMA W4A4) and 0035 (Marlin-style W4A16 grouped MoE GEMM) are correct + bit-exact but regress vs the int8 FP4-MMQ in-backend on GB10 (bf16 MMA is ~half the int8 rate); shipped default-off as validated mechanisms and recorded negatives per the parity methodology. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>