Files
LocalAI/backend/cpp
Ettore Di Giacinto 9f16a907be docs(paged): Lever 3 profiled + Q4/MXFP4 findings, auto-ubatch shipped
Prefill doesn't scale with bigger single prompts (attention O(N^2)); real gap
is batched MoE prefill (B=32: 27x vs vLLM, ~22 effective TFLOP/s). nsys pins
Lever 3 target: mul_mat_q<MXFP4> MoE GEMM 37% + un-fused act-quant 8%; native
FP4 MMA already engaged, inefficiency is the per-expert thin-tile scheduler.
Q4_K_M matches MXFP4 on decode (decode win is generic 4-bit); MXFP4's only edge
is prefill. Auto-ubatch=2048 on Blackwell shipped (PR #10411).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 20:56:46 +00:00
..