mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 16:19:07 -04:00
Plan A (Lever 3): phased path to FP4 MoE GEMM parity — cheap tweaks, act-quant fusion, then the real lever (tcgen05/CUTLASS grouped GEMM), full-model FP4. Plan B (paged attention): on-demand pool, gather-read + Gate 0, continuous batching, prefix sharing; benchmark in memory-pressured/mixed-length regimes. Upstream issue draft: GB10 numbers, nsys profile, ruled-out config knobs, tcgen05 proposal. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>