LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Files

Ettore Di Giacinto 037ad82b7c docs(paged): MXFP4-dense vs Q4_K quality gate on GB10 (do not recommend)

Fair clean-source perplexity check on DGX Spark (GB10): quantize Qwen3-4B
from one BF16 source to both Q4_K_M and MXFP4 (no imatrix, identical recipe).
Q4_K_M is +2.6% PPL vs BF16; MXFP4-dense is +30.8% (+27.5% worse than Q4_K).
The existing 32B MXFP4 was confirmed double-quant (Q4_K_M -> MXFP4 via
--allow-requantize), but the clean 4B test shows the gap is intrinsic to the
format, not the double-quant. Output stays coherent. Verdict: the ~1.58x
prefill / ~1.2x decode win does not justify a Blackwell MXFP4-dense quality
recommendation; keep Q4_K_M the dense default, pursue NVFP4 instead.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-21 17:25:14 +00:00

paged

docs(paged): MXFP4-dense vs Q4_K quality gate on GB10 (do not recommend)

2026-06-21 17:25:14 +00:00

patches

bench(dense): root-cause the W4A4 NVFP4 hang; W4A16 vs Q4 is the headline

2026-06-20 06:59:50 +00:00

CMakeLists.txt

fix(turboquant): resolve common.h by detecting llama-common vs common target (#9413 )

2026-04-18 20:30:28 +02:00

grpc-server.cpp

feat: generic chat_template_kwargs (model config + per-request metadata) (#10359 )