mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-24 00:28:55 -04:00
Build-ready plan (not implemented) for matching or beating vLLM MoE grouped-GEMM efficiency on GB10 (sm_121) for Qwen3-30B-A3B mxfp4.

Honest reframe: the grouped GEMM the mission was scoped to build already exists upstream and runs on GB10 for mxfp4. should_use_mmq() routes MUL_MAT_ID to the grouped mmq path, which already contains both vLLM building blocks: mm_ids_helper (moe_align/scatter) and a persistent stream-k FP4-MMA grouped GEMM. The npl128 cliff was a since-fixed regression, not a batched-bench artifact; re-measured decode throughput is monotonic, 85->1771 t/s.

The one structural gap is M-tile sizing: ggml maximizes mmq_x over the aggregate token count, while vLLM uses a small per-expert BLOCK_SIZE_M, so each tiny per-expert M-tile is only 3-6% filled at decode density.

Scope is a surgical two-step delta (expert-aware mmq_x selection; block-padded moe_align), the parity gate (test_mul_mat_id bit-exact + ragged small-M), and a phased plan gated behind the GB10 W4A16 occupancy wall.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
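
Illustrative note on the M-tile sizing gap described in the message above. This is a minimal sketch, not code from ggml or vLLM: tile_fill is a hypothetical helper, the even split of routed rows across experts is a simplifying assumption, and the 128-expert / top-8 routing figures and the tile widths are assumed inputs chosen for illustration. It only shows why a tile width picked from the aggregate token count leaves per-expert tiles mostly empty, while a small per-expert tile fills them; it does not reproduce the measured 3-6% figure.

```python
def tile_fill(n_tokens: int, n_experts: int, top_k: int, tile_m: int) -> float:
    """Fraction of M-tile rows holding real tokens.

    Assumes (hypothetically) that the n_tokens * top_k routed rows are
    spread as evenly as possible across experts, and that each expert's
    rows are padded up to a whole number of tiles of height tile_m.
    """
    total_rows = n_tokens * top_k
    base, extra = divmod(total_rows, n_experts)
    # Rows per expert under the even-split assumption.
    rows = [base + 1 if e < extra else base for e in range(n_experts)]
    # Tiles launched per expert: ceil(rows / tile_m); idle experts launch none.
    tiles = sum(-(-r // tile_m) for r in rows)
    if tiles == 0:
        return 0.0
    return total_rows / (tiles * tile_m)


if __name__ == "__main__":
    # Assumed routing shape: 128 experts, 8 active per token; 128 tokens in flight.
    for tile_m in (128, 16, 8):
        fill = tile_fill(n_tokens=128, n_experts=128, top_k=8, tile_m=tile_m)
        print(f"tile_m={tile_m:3d}  fill={fill:.4f}")
```

Under these assumptions each expert sees 8 rows, so a tile of height 128 is 6.25% filled while a tile of height 8 is fully filled; real routing is ragged, which is why the plan pairs expert-aware mmq_x selection with a ragged small-M parity test.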