mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-27 01:47:18 -04:00
Mirror of llama.cpp dev-tree patch 0025 (qwen35moe NVFP4 MoE-decode re-graph) and the GPU-agent B findings in SPEEDUP_HUNT.md: re-confirmed MoE decode decomposition @npl128, the measured re-graph lever (+4.4%/+2.9%/+1.9% decode_agg at npl 32/64/128; bit-exact: test-backend-ops MUL_MAT_ID 806/806 + parallel-greedy np16 byte-identical ON==OFF), grouped-GEMM occupancy headroom (exhausted on this bandwidth-bound model), and the W4A16 assessment (rejected: non-bit-exact, slower BF16 MMA). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>