mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-24 00:28:55 -04:00
Mirror of the dev-tree engine patch (ggml mmq.cuh) into the paged patch set, plus its measurement writeup.

Adds LLAMA_MOE_MMQ_X, an opt-in env cap on the MoE grouped-GEMM token-tile (mmq_x) for the MUL_MAT_ID path; default-off = byte-identical to stock.

Honest result of the MoE near-term lever: the npl128 decode cliff does NOT exist on current HEAD (stock decode is monotonic 85/282/629/935/1295/1779 t/s at npl 1/8/32/64/128/256; the old cliff was fixed upstream by the sorted grouped FP4-MMA GEMM + MoE stream-k).

The cap is therefore not a cliff fix but a modest high-batch decode micro-optimization: cap64 gives +4.8% decode at npl128 and +2.3% at npl256 (reproducible, neutral at npl<=64) for a ~1.3% prefill cost; cap16/cap32 are net-negative (prefill -41% / -17%).

Full tables in MOE_TOKEN_TILE_CAP.md; durable density-aware follow-up in MOE_GROUPED_GEMM_SCOPE.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
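
For illustration only, an opt-in env cap of the kind described above could be sketched as follows. This is not the actual mmq.cuh change: the helper names `parse_moe_mmq_x_cap` and `apply_moe_mmq_x_cap`, the "0 means no cap" convention, and the validation rules are all assumptions made for this sketch. Only the variable name LLAMA_MOE_MMQ_X and the default-off behaviour come from the message itself.

```cpp
#include <algorithm>
#include <cstdlib>

// Hypothetical helper (not from the patch): turn the raw env string into a cap.
// Returns 0, meaning "no cap", when the variable is unset, empty, non-numeric,
// or non-positive, so the default path stays identical to stock.
static int parse_moe_mmq_x_cap(const char * s) {
    if (s == nullptr || *s == '\0') {
        return 0;
    }
    char * end = nullptr;
    const long v = std::strtol(s, &end, 10);
    if (end == s || *end != '\0' || v <= 0 || v > 1024) {
        return 0; // assumption: reject garbage instead of guessing a value
    }
    return static_cast<int>(v);
}

// Hypothetical helper: read the cap from the environment.
static int moe_mmq_x_cap_from_env() {
    return parse_moe_mmq_x_cap(std::getenv("LLAMA_MOE_MMQ_X"));
}

// Hypothetical helper: clamp the token-tile width the stock heuristic chose.
// A cap of 0 leaves the stock choice untouched; a cap can only shrink the tile.
static int apply_moe_mmq_x_cap(int mmq_x_stock, int cap) {
    if (cap <= 0) {
        return mmq_x_stock;
    }
    return std::min(mmq_x_stock, cap);
}
```

Under these assumptions, a caller on the MUL_MAT_ID path would compute `apply_moe_mmq_x_cap(mmq_x, moe_mmq_x_cap_from_env())` once per launch decision. The sketch does not model any tile-size granularity or per-architecture limits the real kernel may impose.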