LocalAI/backend/cpp/llama-cpp-localai-paged/docs
Ettore Di Giacinto 3159ed0637 docs(paged): record P6 fp8-KV measured NO-GO - throughput dead end, capacity-play open
Retry of P6 unblocked the prior infra-block (DGX reachable via ssh dgx.casa) and
ran the kill-gate. Two measured artifacts replace the analytical estimates:

Stage 0a decode ceiling (v2 per-kernel decode-isolation, cross-checked within
0.3% of the batched-bench wall t_tg): fp8-KV theoretical-MAX decode saving
(fa-only) tops out at +8.81% at ctx8192 x npl8 and clears +3% only at long context;
standard npl128 serving shapes reach +2.2/+3.4%. This refutes the earlier
analytical prior (0.65% std, +17.34% ctx8192) in both directions.
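
For readers unfamiliar with this kind of ceiling, here is a minimal sketch of
an Amdahl-style bound. It assumes only the flash-attention kernel time shrinks,
and by at most the KV byte ratio (2x for fp16 -> fp8). The function name, the
2x default and the sample timings are illustrative assumptions, not the
Stage 0a tooling or its data.

```python
def fa_only_ceiling(t_total_ms: float, t_fa_ms: float, kv_byte_ratio: float = 2.0):
    """Upper bound on decode gain if only the attention kernel gets faster.

    t_total_ms    -- wall time of one decode step (hypothetical input)
    t_fa_ms       -- portion of that step spent in the attention kernel
    kv_byte_ratio -- assumed best-case speedup of that kernel (2.0 = half the bytes)
    Returns (time_saving, throughput_gain) as fractions.
    """
    if not 0.0 <= t_fa_ms <= t_total_ms or kv_byte_ratio <= 0.0:
        raise ValueError("need 0 <= t_fa_ms <= t_total_ms and kv_byte_ratio > 0")
    # Everything outside attention is unchanged; attention shrinks by the ratio.
    t_new_ms = (t_total_ms - t_fa_ms) + t_fa_ms / kv_byte_ratio
    time_saving = 1.0 - t_new_ms / t_total_ms
    throughput_gain = t_total_ms / t_new_ms - 1.0
    return time_saving, throughput_gain


# Hypothetical numbers: attention is 20 ms of a 100 ms decode step.
saving, gain = fa_only_ceiling(100.0, 20.0)
print(f"time saving {saving:.2%}, throughput gain {gain:.2%}")
```

The bound is optimistic by construction: it charges nothing for dequantization,
so any measured A/B result should land at or below it.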

Stage 0b zero-code Q8_0-KV A/B proxy at the highest-ceiling shape (5 reps/arm):
dense ctx8192 +0.37% decode (flat), moe ctx8192 -2.63% decode REGRESSION. Even
Q8_0 - which wins on the integer DP4A fattn-vec dot that e4m3 cannot use -
realizes ~none of the ceiling; dequant-in-attention eats the KV-read BW saving,
re-confirming the historical Q8_0 +7.8% null. e4m3's KQ path is strictly worse
than Q8_0's, so the e4m3 throughput kernel is a definitive NO-GO and was not
built. The capacity-play (halving the 10/40 attention layers' KV footprint)
stays open as a footprint feature.
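
A minimal sketch of how a repeated A/B result like the one above could be
reduced to a single decode delta, assuming each arm yields one tokens/s figure
per rep. The helper name and the sample values are hypothetical; this is not
the bench harness used for Stage 0b.

```python
from statistics import mean


def ab_delta_pct(baseline_tps, candidate_tps):
    """Relative change of mean decode throughput, candidate vs baseline, in percent.

    Positive means the candidate arm (e.g. a quantized-KV run) is faster;
    negative means a regression.
    """
    if not baseline_tps or not candidate_tps:
        raise ValueError("each arm needs at least one rep")
    base = mean(baseline_tps)
    if base <= 0:
        raise ValueError("baseline throughput must be positive")
    return (mean(candidate_tps) / base - 1.0) * 100.0


# Hypothetical reps (tokens/s), five per arm.
f16_kv = [100.0, 101.0, 99.0, 100.5, 99.5]
q8_kv = [100.2, 100.9, 99.4, 100.6, 99.9]
print(f"decode delta: {ab_delta_pct(f16_kv, q8_kv):+.2f}%")
```

Comparing this delta against the per-shape ceiling shows how much of the
theoretical saving is actually realized.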

Default path measured green on the byte-identical worktree (canonical greedy-md5
re-run: MoE 8cb0ce23, dense 5951a5b4, paged). Fork localai-paged untouched at
653bb2f3d; topic branch p6-fp8-kv retained on the DGX, not pushed; series stays
46 patches (0001-0055). P3's landed program conclusion is preserved; only the
now-stale P6 status descriptors in it were corrected to the measured NO-GO.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 23:01:40 +00:00