Reframe the GB10 vLLM-parity gap from a per-lever "hardware floor" verdict
to a ggml-execution-architecture-conditional one: same-silicon 2-3x is
software architecture, not silicon. Add EXECUTION_REARCH_SCOPE.md, a phased
additive program (P1 bf16-native stream, P2 expert-major fused MoE region,
P3 Marlin large-M retry on P1+P2, P4 token-budget scheduler, P5 blocked-solve
GDN, P6 fp8 KV), each with the ggml/fork seam, default-off env gate, per-path
md5/KL correctness gate, a falsifiable P0 kill-gate, expected-recovery
arithmetic grounded in the both-engine nsys buckets, and upstream-clash
analysis. Point the README docs list and PARITY_HANDOFF forward-direction at
it.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Record the phase 110-140 GDN/MoE campaign benchmark log and append the
series-trim decision to the parity handoff: keep the Phase135 routed-FFN
fused-quant line plus the MoE test sentinels and the MTP-draft correctness
fix; drop the W4A16 structural line, the trace/tile-policy patches, GPU-sort,
W4A16-direct-A, and the finalize fusion. Rejected/neutral levers are recorded
in the handoff and the per-phase bench artifacts. Fork re-mirrored on
51168c5ee: fd920cf8a a85c1e098 2fed6aacf f1d976f06 1edddc8fe (HEAD tree
097c862c).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Record Phase81 default-off BF16 persistent S-cache results, including md5 drift, op gates, decode profile, and KL smoke. Scope Phase82 as full f16-reference KL plus serving A/B before patch-series promotion.
Assisted-by: Codex:gpt-5
Record Phase58 prompt-backlog threshold A/B, DGX gates, MoE and dense serving results, and the repeat-before-default decision.
Assisted-by: Codex:gpt-5
Record Phase57 capped TTFT prefill-first sweep, DGX gates, and the decision to keep the cap as an A/B knob rather than a parity path.
Assisted-by: Codex:gpt-5
Record Phase56 MoE and lower-concurrency validation for the TTFT prefill-first policy, including DGX gates and the opt-in-only decision.
Assisted-by: Codex:gpt-5
Record Phase54 trace-only histogram work, DGX md5/op gates, dense serving histogram evidence, and the next scheduler decision.
Assisted-by: Codex:gpt-5
Add patch 0063 extending LLAMA_CUBLAS_ROUTE_TRACE with src0/src1/dst tensor names.
Record Phase 37 gates and the conclusion that SGEMM traces to MoE gate tensors.
Assisted-by: Codex:gpt-5
Add patch 0062 with default-off LLAMA_CUBLAS_ROUTE_TRACE instrumentation for generic cuBLAS MUL_MAT subroutes.
Record Phase 36 DGX gates, serving trace results, and the next projection follow-up scope.
Assisted-by: Codex:gpt-5
Add LocalAI patch 0061 from the llama.cpp fork and record Phase 35 gates, serving route counts, and the updated patch mirror invariant.
Assisted-by: Codex:gpt-5