docs(dllm): record Q4_K_M validation and quantization guidance

Q4_K_M validated on GB10: quality holds (cosine 0.9862, coherent generation, 19/48 stopper exit) but a forward step is ~5x slower than BF16 (27.5s vs 5.6s: native BF16 tensor cores vs K-quant MoE dequant). Guidance: prefer BF16 when it fits; Q4_K_M is the memory-bound option.

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-11 18:27:32 -04:00 · 2026-06-11 19:22:02 +00:00
parent ad6d1dbc8b
commit 8134d6db37
1 changed file with 2 additions and 0 deletions
--- a/docs/content/features/text-generation.md
+++ b/docs/content/features/text-generation.md
@@ -768,6 +768,8 @@ Honest numbers from validation on a DGX Spark (GB10, CUDA 13, BF16 26B model, fu

 On CPU the same forward step takes ~139 s (20 Grace cores): treat the CPU flavor as functional, not practical, for the 26B model.

+**Quantized models.** The Q4_K_M export (16.8 GB vs 50.5 GB BF16) was validated on the same GB10. It loads faster (~12.6 s vs ~32.7 s) and quality held up in validation (golden-logits cosine 0.9862, coherent generation on the same prompt as the BF16 run, EB stopper exiting at 19/48 steps, ~0.49 tok/s on that run), but a forward step takes ~27.5 s, about **5x slower than BF16** (~5.6 s/step) on this hardware. GB10-class GPUs run BF16 natively on tensor cores, while the K-quant MoE weights pay a dequantization cost on every denoise step. Choose Q4_K_M only when you are memory-bound; if BF16 fits, it is faster, and it is the file the engine's validation tolerances are calibrated for.
+
 #### Reference

 - [dllm.cpp](https://github.com/mudler/dllm.cpp)
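
The guidance in the added paragraph (prefer BF16 when it fits, fall back to Q4_K_M when memory-bound) can be sketched as a small calculation. This is a minimal illustration, not part of LocalAI or dllm.cpp: the sizes and step times are the figures quoted in the paragraph, while `pick_export`, `step_slowdown`, and the `headroom_gb` allowance are hypothetical names and assumptions introduced here.

```python
# Figures quoted in the added docs paragraph (GB10 validation run).
BF16_SIZE_GB = 50.5
Q4_K_M_SIZE_GB = 16.8
BF16_STEP_S = 5.6
Q4_K_M_STEP_S = 27.5


def step_slowdown(quant_step_s: float = Q4_K_M_STEP_S,
                  bf16_step_s: float = BF16_STEP_S) -> float:
    """Per-step latency ratio, Q4_K_M over BF16 (the "about 5x" figure)."""
    return quant_step_s / bf16_step_s


def pick_export(available_gb: float, headroom_gb: float = 0.0) -> str:
    """Hypothetical helper: prefer BF16 when its weights fit, else Q4_K_M.

    headroom_gb is an assumed allowance for everything besides the weights;
    it is not a measured number from the validation run.
    """
    if available_gb >= BF16_SIZE_GB + headroom_gb:
        return "BF16"
    if available_gb >= Q4_K_M_SIZE_GB + headroom_gb:
        return "Q4_K_M"
    return "none"


if __name__ == "__main__":
    print(f"Q4_K_M step slowdown vs BF16: {step_slowdown():.2f}x")
    for gb in (128.0, 32.0, 8.0):
        print(f"{gb:>6.1f} GB available -> {pick_export(gb)}")
```

The check compares weight file sizes only; real memory use also depends on context length and runtime buffers, so treat the thresholds as a lower bound.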