mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
The P6 retry cleared the earlier infra block (the DGX is now reachable via ssh dgx.casa) and ran the kill-gate. Two measured artifacts replace the analytical estimates:

Stage 0a, decode ceiling (v2 per-kernel decode-isolation, cross-checked to within 0.3% of the batched-bench wall t_tg): the fp8-KV theoretical-maximum decode saving (fa-only) tops out at +8.81% at ctx8192 x npl8 and clears +3% only at long context; standard npl128 serving shapes reach +2.2/+3.4%. This refutes the earlier analytical prior (0.65% std, +17.34% ctx8192) in both directions.

Stage 0b, zero-code Q8_0-KV A/B proxy at the highest-ceiling shape (5 reps/arm): dense ctx8192 +0.37% decode (flat), moe ctx8192 -2.63% decode (a REGRESSION). Even Q8_0, which benefits from the integer DP4A fattn-vec dot that e4m3 cannot use, realizes ~none of the ceiling: dequant-in-attention eats the KV-read bandwidth saving, re-confirming the historical Q8_0 +7.8% null.

e4m3's KQ path is strictly worse than Q8_0's, so the e4m3 throughput kernel is a definitive NO-GO and was not built. The capacity play (halving the KV footprint of the 10/40 attention layers) stays open as a footprint feature.

The default path measured green on the byte-identical worktree (canonical greedy-md5 re-run: MoE 8cb0ce23, dense 5951a5b4, paged). The localai-paged fork is untouched at 653bb2f3d; the topic branch p6-fp8-kv is retained on the DGX and not pushed; the series stays at 46 patches (0001-0055). P3's landed program conclusion is preserved; only the now-stale P6 status descriptors in it were corrected to the measured NO-GO.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
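
For readers reproducing the Stage 0b comparison, the A/B arithmetic can be sketched roughly as below. This is a minimal illustration, not the script used for the measurements: the function names, the treatment of "decode" as tokens/s throughput, the use of +3% as the GO bar, and the per-rep numbers are all assumptions or placeholders.

```python
from statistics import mean

# Assumed GO threshold in percent; the message only mentions "+3%" as the bar
# the ceiling has to clear, so treating it as the gate is an assumption.
GATE_PCT = 3.0


def pct_delta(baseline_tps, candidate_tps):
    """Percent change in mean decode throughput, candidate vs. baseline."""
    base = mean(baseline_tps)
    return (mean(candidate_tps) - base) / base * 100.0


def kill_gate(baseline_tps, candidate_tps, gate_pct=GATE_PCT):
    """Return (delta_pct, verdict) for one benchmark shape."""
    delta = pct_delta(baseline_tps, candidate_tps)
    verdict = "GO" if delta >= gate_pct else "NO-GO"
    return delta, verdict


# Placeholder per-rep decode throughputs (tokens/s), 5 reps per arm.
# These are NOT the measured values from the DGX runs.
f16_kv = [100.0, 101.0, 99.0, 100.5, 99.5]
q8_kv = [100.2, 100.9, 99.6, 100.4, 100.1]

delta, verdict = kill_gate(f16_kv, q8_kv)
print(f"{delta:+.2f}% decode -> {verdict}")
```

With several reps per arm it is also worth comparing the delta against the rep-to-rep spread before calling a small difference a gain or a regression.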