LocalAI/backend/cpp/llama-cpp/patches/paged
Ettore Di Giacinto 634c0e5a0f docs(paged): rms_norm->fp4 fold analysis - bit-exact decode ceiling at 95% of vLLM
The standalone quantize fold is empirically flat (Lever-2 precedent) and has
the worst gain/plumbing ratio; no bit-exact lever remains. Dense decode runs
at 371.81 t/s @npl128, about 95% of vLLM's 391 t/s, and the recurrence is
past vLLM at the LPDDR5x DRAM floor. All results are byte-identical to
llama f32. Only bf16 state (shelved) goes further.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-25 22:42:08 +00:00