mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
P6 (final program phase) could not run its kill-gate: the DGX/GB10 was unreachable for the entire window (cloudflared access via prem-vm returned HTTP 530 / websocket bad-handshake on every probe; re-confirmed with 5 fresh probes). Stage 0a (measured nsys graph-node decode ceiling) and Stage 0b (fp8-e4m3 kernel + kill-gate A/B) were physically impossible with no GPU.

Records the honest infra-block (NOT a measured NO-GO, NOT a NO-GO-by-ceiling) plus the load-bearing artifact: the analytical fp8-KV decode ceiling table. fp8 halves KV bytes -> theoretical-max decode saving = 0.5 x flash-attn share:

  ctx256   0.65%  (standard shape hard NO-GO)
  ctx1024  2.55%
  ctx2048  4.98%  (first crosses +3%)
  ctx4096  9.49%
  ctx8192  17.34%

The win, if realizable, lives only at ctx>=2048; the hybrid-GDN structure (10/40 layers carry KV, 30 GDN layers hold fixed-size recurrent state with no KV) caps what any KV-dtype lever can save. The dominant null stands unrefuted: Q8_0 KV was a measured +7.8% decode regression on GB10. Notes the capacity-play framing (fp8-KV as a memory feature remains open even if throughput-flat).

Fork localai-paged untouched at 653bb2f3d; series stays at 46 patches (0001-0055); P3's p3-w4a16-direct work undisturbed. Docs-only; no code, no topic branch, no patches. Not pushed.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
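
The ceiling arithmetic in the message above can be checked with a short sketch. This is illustrative only and is not part of the commit, which is docs-only. The savings figures are copied from the message; the flash-attn shares are back-derived from the stated relation (saving = 0.5 x share) and are not measurements. All names (`STATED_SAVING_PCT`, `implied_flash_attn_share`, `first_crossing`) are hypothetical, and treating +3% as a threshold to compare against is an assumption based on the "first crosses +3%" remark.

```python
# Theoretical-max decode saving per context length, in percent,
# copied from the commit message (analytical, not measured).
STATED_SAVING_PCT = {256: 0.65, 1024: 2.55, 2048: 4.98, 4096: 9.49, 8192: 17.34}

# fp8 halves KV bytes, so at most half of the flash-attn share can be saved.
KV_BYTES_SAVED_FRACTION = 0.5

# Assumption: the "+3%" mark mentioned in the message, used here as a threshold.
THRESHOLD_PCT = 3.0


def implied_flash_attn_share(saving_pct, saved_fraction=KV_BYTES_SAVED_FRACTION):
    """Invert saving = saved_fraction * share to recover the implied share (%)."""
    return saving_pct / saved_fraction


def first_crossing(table, threshold_pct=THRESHOLD_PCT):
    """Return the smallest context length whose ceiling reaches the threshold."""
    for ctx in sorted(table):
        if table[ctx] >= threshold_pct:
            return ctx
    return None


if __name__ == "__main__":
    for ctx in sorted(STATED_SAVING_PCT):
        saving = STATED_SAVING_PCT[ctx]
        share = implied_flash_attn_share(saving)
        print(f"ctx{ctx}: ceiling {saving:.2f}%, implied flash-attn share {share:.2f}%")
    print("first context at or above threshold:", first_crossing(STATED_SAVING_PCT))
```

These values are upper bounds under the stated model, so a measured A/B on the GPU could only land at or below them.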