LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-07-03 04:46:54 -04:00

Files

Ettore Di Giacinto 3159ed0637 docs(paged): record P6 fp8-KV measured NO-GO - throughput dead end, capacity-play open

The P6 retry cleared the earlier infra block (DGX now reachable via ssh dgx.casa)
and ran the kill-gate. Two measured artifacts replace the analytical estimates:

Stage 0a decode ceiling (v2 per-kernel decode-isolation, cross-checked within
0.3% of the batched-bench wall t_tg): the fp8-KV theoretical-MAX decode saving
(fa-only) tops out at +8.81% at ctx8192 x npl8 and clears +3% only at long
context; standard npl128 serving shapes reach +2.2/+3.4%. This refutes the
earlier analytical prior (0.65% std, +17.34% ctx8192) in both directions.
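One way to read a "theoretical-MAX (fa-only)" ceiling is as an Amdahl-style bound: only the attention kernel's share of decode time can shrink, and at best in proportion to the KV bytes it reads. The sketch below assumes exactly that model. The function names and the attention shares in the demo loop are hypothetical illustrations, not the measured values or the method used for Stage 0a.

```python
def decode_saving_ceiling(attn_fraction: float, kv_byte_ratio: float = 0.5) -> float:
    """Upper bound on the fraction of decode time saved.

    attn_fraction: share of per-token decode time spent in the attention kernel.
    kv_byte_ratio: new KV bytes / old KV bytes (0.5 assumes the footprint halves).
    Assumption: attention time scales linearly with KV bytes read and every
    other kernel is unchanged.
    """
    if not 0.0 <= attn_fraction <= 1.0:
        raise ValueError("attn_fraction must be in [0, 1]")
    if not 0.0 < kv_byte_ratio <= 1.0:
        raise ValueError("kv_byte_ratio must be in (0, 1]")
    return attn_fraction * (1.0 - kv_byte_ratio)


def throughput_gain(time_saving: float) -> float:
    """Convert a fractional time saving into a fractional throughput gain."""
    if not 0.0 <= time_saving < 1.0:
        raise ValueError("time_saving must be in [0, 1)")
    return 1.0 / (1.0 - time_saving) - 1.0


if __name__ == "__main__":
    # Hypothetical attention shares, for illustration only.
    for share in (0.05, 0.10, 0.20):
        saving = decode_saving_ceiling(share)
        print(f"attn share {share:.0%}: time saving <= {saving:.2%}, "
              f"throughput gain <= {throughput_gain(saving):.2%}")
```

Under this model a small attention share at short context caps the gain at a few percent however fast the fp8 kernel is, which matches the shape of the result above: the ceiling only becomes interesting at long context.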

Stage 0b zero-code Q8_0-KV A/B proxy at the highest-ceiling shape (5 reps/arm):
dense ctx8192 +0.37% decode (flat), moe ctx8192 -2.63% decode REGRESSION. Even
Q8_0 - which wins on the integer DP4A fattn-vec dot that e4m3 cannot use -
realizes ~none of the ceiling; dequant-in-attention eats the KV-read BW saving,
re-confirming the historical Q8_0 +7.8% null. e4m3's KQ path is strictly worse
than Q8_0's, so the e4m3 throughput kernel is a definitive NO-GO and was not
built. The capacity-play (halving the 10/40 attention layers' KV footprint)
stays open as a footprint feature.
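A minimal sketch of how an A/B decode delta like the ones above could be computed from repeated runs. It assumes each arm yields one decode tokens/s figure per rep and compares medians; the helper name and the sample numbers are hypothetical and are not the recorded measurements.

```python
from statistics import median


def percent_delta(baseline_tps: list[float], candidate_tps: list[float]) -> float:
    """Percent change in median decode throughput, candidate vs. baseline.

    Positive means the candidate arm is faster; negative is a regression.
    """
    if not baseline_tps or not candidate_tps:
        raise ValueError("each arm needs at least one rep")
    base = median(baseline_tps)
    cand = median(candidate_tps)
    if base <= 0:
        raise ValueError("baseline throughput must be positive")
    return (cand - base) / base * 100.0


if __name__ == "__main__":
    # Hypothetical tokens/s for 5 reps per arm.
    f16_kv = [100.2, 99.8, 100.0, 100.1, 99.9]
    q8_kv = [100.5, 100.3, 100.4, 100.2, 100.6]
    print(f"decode delta: {percent_delta(f16_kv, q8_kv):+.2f}%")
```

The median is used here so that a single slow rep does not dominate a 5-rep arm; a mean with a spread estimate would be an equally reasonable choice.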

Default path measured green on the byte-identical worktree (canonical greedy-md5
re-run: MoE 8cb0ce23, dense 5951a5b4, paged). Fork localai-paged untouched at
653bb2f3d; topic branch p6-fp8-kv retained on the DGX, not pushed; series stays
46 patches (0001-0055). P3's landed program conclusion is preserved; only the
now-stale P6 status descriptors in it were corrected to the measured NO-GO.
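The greedy-md5 check above can be pictured as hashing the deterministic (greedy) output and comparing a short digest prefix against a recorded value. This is a sketch under that assumption: the helper names and the 8-character prefix convention are inferred from the hashes quoted in this message and are not taken from the actual test harness.

```python
import hashlib


def greedy_md5(output: bytes, prefix_len: int = 8) -> str:
    """Return the first prefix_len hex characters of the MD5 of the output."""
    return hashlib.md5(output).hexdigest()[:prefix_len]


def matches_canonical(output: bytes, expected_prefix: str) -> bool:
    """True if the output hashes to the recorded canonical prefix."""
    return greedy_md5(output, len(expected_prefix)) == expected_prefix.lower()


if __name__ == "__main__":
    # Hypothetical stand-in for captured greedy decode output.
    sample = b"example greedy output\n"
    digest = greedy_md5(sample)
    print(digest, matches_canonical(sample, digest))
```

Because greedy decoding has no sampling randomness, any change in the digest signals that the default path is no longer byte-identical.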

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-07-02 23:01:40 +00:00

ds4

chore: ⬆️ Update antirez/ds4 to 80ebbc396aee40eedc1d829222f3362d10fa4c6c (#10378)

2026-06-18 00:32:13 +02:00

grpc

fix: speedup git submodule update with --single-branch (#2847)

2024-07-13 22:32:25 +02:00

ik-llama-cpp

fix(ik-llama): port multimodal path to mtmd API and bump to f96eaddb (#10534) (#10568)

2026-06-28 08:57:11 +02:00

llama-cpp

paged: drop bf16-tau (patch 0026), subsumed by decode fusions (tau=100000 flat, zero speed benefit)