mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
Reframe the GB10 vLLM-parity gap from a per-lever "hardware floor" verdict to a ggml-execution-architecture-conditional one: the same-silicon 2-3x gap is software architecture, not silicon.

Add EXECUTION_REARCH_SCOPE.md, a phased additive program (P1 bf16-native stream, P2 expert-major fused MoE region, P3 Marlin large-M retry on P1+P2, P4 token-budget scheduler, P5 blocked-solve GDN, P6 fp8 KV), each with the ggml/fork seam, default-off env gate, per-path md5/KL correctness gate, a falsifiable P0 kill-gate, expected-recovery arithmetic grounded in the nsys buckets from both engines, and upstream-clash analysis.

Point the README docs list and the PARITY_HANDOFF forward-direction at it.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>