mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-26 01:16:58 -04:00
Final synthesis of the critical-path gap analysis: the decode step is 99.94% GPU-busy single-stream (idle 0.225ms = 0.06%), so the 14% gap to vLLM is kernel GPU-time dominated by the bandwidth-bound gated_delta_net recurrence (196.37ms = 51.6%), not launch bubbles. Claims A/B/C all REFUTED as worded; the single residual is the unmeasured DRAM byte ratio of llama's recurrence vs vLLM's fused kernel. Ranked plan: single-pass fused GDN recurrence (gap-closer, gate on ncu byte-ratio test) + conv-state concat fusion (no-regret +2-3%, bit-exact); gate-fold alone tops out at ~89% of vLLM; bf16 state is the only floor-mover but breaks bit-exactness. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>