Files
LocalAI/backend/cpp/llama-cpp/patches/paged
Ettore Di Giacinto 3b59571579 docs(paged): both-engine MoE decode decomposition - the 15% is NOT the Marlin GEMM
Ground-truth side-by-side per-kernel ms/step of the MoE decode gap on DGX GB10.
llama (752 t/s, step 169.8ms) vs vLLM graphs-on (901-equiv, step 142.0ms): 27.8ms gap.

Headline: the grouped MoE-expert GEMM is a llama WIN - native FP4-MMA W4A4 47.3ms
vs vLLM Marlin W4A16 50.0ms at the tiny-M decode shape. A Marlin-style W4A16 MoE
GEMM would be slower; it is not the lever (extends the w4a16-marlin DENSE verdict).

The 15% lives elsewhere: bf16 projections + convert glue (+6.5ms), recurrence
state-gather plumbing (+6.6ms, led by k_get_rows 5.2ms), graph coverage + stream
overlap (~+7ms), W4A4 act-quant tax (+3.3ms), router/glue (+5.4ms).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 20:11:40 +00:00
..