Reframe the GB10 vLLM-parity gap from a per-lever "hardware floor" verdict
to one conditional on the ggml execution architecture: the 2-3x gap on the
same silicon is a software-architecture gap, not a silicon limit. Add
EXECUTION_REARCH_SCOPE.md, a phased
additive program (P1 bf16-native stream, P2 expert-major fused MoE region,
P3 Marlin large-M retry on P1+P2, P4 token-budget scheduler, P5 blocked-solve
GDN, P6 fp8 KV), each with the ggml/fork seam, default-off env gate, per-path
md5/KL correctness gate, a falsifiable P0 kill-gate, expected-recovery
arithmetic grounded in the both-engine nsys buckets, and upstream-clash
analysis. Point the README docs list and PARITY_HANDOFF forward-direction at
it.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The campaign patches 0048-0063 were added without matching fork commits.
After a keep/drop review, the series is trimmed and re-mirrored 1:1 onto the
fork branch mudler/llama.cpp:localai-paged (HEAD 1edddc8fe, on 51168c5ee).
Kept, renumbered from the fork (now carry Assisted-by + Signed-off-by):
- 0048 test(paged): cover MoE swiglu down chain (was 0051, fd920cf8a)
- 0049 test(paged): cover MoE weighted combine chain (was 0052, a85c1e098)
- 0050 test(paged): cover ragged MoE dispatch (was 0053, 2fed6aacf)
- 0051 fix(speculative): disable backend sampling for MTP drafts (was 0054, f1d976f06)
- 0052 feat(paged): whole-pattern MoE matcher + routed-FFN fused NVFP4-quant
  down MMQ (new, 1edddc8fe)
Dropped (no fork commits, removed from the series):
- 0048-0050 W4A16 grouped-tile pack/tune/pad: dead line, W4A16 ~1.5x slower
  than grouped-MMQ.
- 0055-0063 speculative/moe/mul-mat/cublas route traces + the rejected small-M
  tile-policy knob (0059).
- All other phase 110-140 campaign markers not needed by Phase135 (GPU-sort,
  W4A16-direct-A, boundary trace/timing, Phase133 sorted-F32, Phase134
  fused-SWIGLU, Phase138 finalize) carry no code in this tree.
Tree-hash proof (the mirror invariant): a fresh detached worktree at
LLAMA_VERSION 0ed235ea2c17a19fc8238668653946721ed136fd with every on-disk
patches/paged/0*.patch applied in numeric order (git apply) stages to tree
097c862c6834b7d8b90419b305b8402155ef8373, byte-identical to fork HEAD
1edddc8fe's tree. Series is 43 patches (0001-0047 unchanged + 0048-0052).
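The tree-hash check above can be reproduced with a sketch along these
lines. The function name verify_mirror_tree and its argument order are
hypothetical, and using `git apply --index` followed by `git write-tree`
is an assumption about how "stages to tree" was computed; the patch
directory must be given as an absolute path.

```shell
# Hypothetical helper (not part of the tree):
#   verify_mirror_tree <repo> <base-commit> <abs-patch-dir> <expected-tree>
# Prints the resulting tree hash; returns 0 only if it equals <expected-tree>.
verify_mirror_tree() {
    vmt_repo=$1; vmt_base=$2; vmt_patches=$3; vmt_expected=$4
    vmt_tmp=$(mktemp -d) || return 2
    vmt_wt="$vmt_tmp/wt"

    # Fresh detached worktree at the pinned base commit.
    git -C "$vmt_repo" worktree add --detach --quiet "$vmt_wt" "$vmt_base" || return 2

    # Glob expansion sorts lexically, which matches numeric order for
    # zero-padded names such as 0001-....patch.
    for vmt_p in "$vmt_patches"/0*.patch; do
        # --index applies to the working tree and stages the result.
        if ! git -C "$vmt_wt" apply --index "$vmt_p"; then
            echo "failed to apply: $vmt_p" >&2
            git -C "$vmt_repo" worktree remove --force "$vmt_wt"
            rm -rf "$vmt_tmp"
            return 2
        fi
    done

    # Hash of the staged tree, without creating a commit.
    vmt_actual=$(git -C "$vmt_wt" write-tree)

    git -C "$vmt_repo" worktree remove --force "$vmt_wt"
    rm -rf "$vmt_tmp"

    echo "$vmt_actual"
    [ "$vmt_actual" = "$vmt_expected" ]
}
```

For this series the call would pass a llama.cpp checkout, the LLAMA_VERSION
commit 0ed235ea2c17a19fc8238668653946721ed136fd, the absolute path of
patches/paged, and the expected tree
097c862c6834b7d8b90419b305b8402155ef8373.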
Gated on GB10 sm_121a: default md5 MoE 8cb0ce23 / dense 5951a5b4 unchanged;
opt-in md5-clean; MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46,
MOE_SWIGLU_DOWN 7/7, MUL_MAT_ID_RAGGED_MOE 6/6; six mmq_moe_quantized_raw
markers with zero sorted launches on the opt-in sentinel.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Record the phase 110-140 GDN/MoE campaign benchmark log and append the
series-trim decision to the parity handoff: keep the Phase135 routed-FFN
fused-quant line plus the MoE test sentinels and the MTP-draft correctness
fix; drop the W4A16 structural line, the trace/tile-policy patches, GPU-sort,
W4A16-direct-A, and the finalize fusion. Rejected/neutral levers are recorded
in the handoff and the per-phase bench artifacts. Fork re-mirrored on
51168c5ee: fd920cf8a a85c1e098 2fed6aacf f1d976f06 1edddc8fe (HEAD tree
097c862c).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Record Phase81 default-off BF16 persistent S-cache results, including md5 drift, op gates, decode profile, and KL smoke. Scope Phase82 as full f16-reference KL plus serving A/B before patch-series promotion.
Assisted-by: Codex:gpt-5
Record Phase58 prompt-backlog threshold A/B, DGX gates, MoE and dense serving results, and the repeat-before-default decision.
Assisted-by: Codex:gpt-5
Record Phase57 capped TTFT prefill-first sweep, DGX gates, and the decision to keep the cap as an A/B knob rather than a parity path.
Assisted-by: Codex:gpt-5
Record Phase56 MoE and lower-concurrency validation for the TTFT prefill-first policy, including DGX gates and the opt-in-only decision.
Assisted-by: Codex:gpt-5
Record Phase54 trace-only histogram work, DGX md5/op gates, dense serving histogram evidence, and the next scheduler decision.
Assisted-by: Codex:gpt-5
Add patch 0063 extending LLAMA_CUBLAS_ROUTE_TRACE with src0/src1/dst tensor names.
Record Phase 37 gates and the conclusion that SGEMM traces to MoE gate tensors.
Assisted-by: Codex:gpt-5
Add patch 0062 with default-off LLAMA_CUBLAS_ROUTE_TRACE instrumentation for generic cuBLAS MUL_MAT subroutes.
Record Phase 36 DGX gates, serving trace results, and the next projection follow-up scope.
Assisted-by: Codex:gpt-5