mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 16:19:07 -04:00
Measured npl=128 cold A/B: NVFP4 decode 619 vs Q4_K 547 (+13%), closing the gap to vLLM (667) from ~22% to ~7%. NVFP4's FP4-MMA kernel is more bandwidth-efficient at the thin n=128 decode shape than Q4_K int8-MMQ (which ran 2.1x above the floor), so it IS the better int4 decode GEMM the diagnosis called for - no multi-day Marlin-for-K-quants needed. With NVFP4, llama.cpp on GB10 is ahead on prefill (1209 vs 800) and within ~7% on decode. Remaining 7% = optional FP4 kernel tuning. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>