P4 (token-granular continuous-batching scheduler, LLAMA_CONTINUOUS_BATCH_V2,
default-off) stopped at the P0 perf kill-gate. The kill-gate subset (per-seq
chunked-prefill cursors + adaptive decode bucketing, server-side only, zero
ggml/ files, ~68 LOC + a new unit-tested server-admission-policy.h) was
implemented and passed every correctness gate:
- canonical md5 on both models, default-off AND cbv2-on: MoE 8cb0ce23, dense
  5951a5b4;
- test-backend-ops: MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET
  46/46;
- cursor interleave proven via LLAMA_CBV2_TRACE: decode and prefill co-batched,
  per-seq cursors advancing across steps, dbucket==n_decode (no padding);
- determinism-neutral: CBv2 diverges from control no more than control
  diverges from itself, because the paged concurrent-greedy path is already
  non-deterministic run-to-run in the baseline.
The kill-gate GO criterion - a >20% TTFT-under-load drop with md5 green and
serving-aggregate not regressed - was NOT demonstrated: the harness
force-terminated the staggered/burst TTFT A/B mid-run (CONTROL arm only, 30/60
raws), so the TTFT deltas are unmeasured placeholders, not evidence of
neutrality. Per the phased contract, go=false is the kill-gate default, so
nothing was built beyond P0 (no SLOT_STATE_PREEMPTED, no
aging/starvation-freedom) and nothing landed. This is
the scope-anticipated outcome - P4 is a GB10 TTFT/fairness/enabler lever, not a
throughput lever (decode is GPU-compute-bound), so a NO-GO on the TTFT gate is
expected and any throughput payoff is non-GB10.
Records the honest rejection in EXECUTION_REARCH_SCOPE.md (P4 RESULT subsection)
and PARITY_HANDOFF.md chronology, including the re-score path: read the finalized
DGX ~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md once the CANDIDATE arm
completes; a genuine >20% staggered-TTFT drop clearing max(2%, 3*stdev) re-scores
go=true and triggers the full P4 build-out. Fork localai-paged untouched at
653bb2f3d; LocalAI series stays at 46 patches; topic branch p4-cbv2 retained on
the DGX fork at ebb649335 (base 653bb2f3d, not pushed).
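The re-score rule above can be sketched as follows. This is a minimal
illustration, not the harness's actual scoring code: the function name, the
argument names, and the reading of "3*stdev" as three times the control arm's
sample standard deviation (expressed as a percentage of the control mean) are
all assumptions.

```python
from statistics import mean, stdev

def rescore_go(control_ttft_ms, candidate_ttft_ms, md5_green, aggregate_regressed):
    """Hypothetical re-score helper; returns (go, drop_pct, noise_floor_pct)."""
    ctrl = mean(control_ttft_ms)
    cand = mean(candidate_ttft_ms)
    # Positive drop_pct means the CANDIDATE arm has lower TTFT than CONTROL.
    drop_pct = 100.0 * (ctrl - cand) / ctrl
    # Assumption: noise floor = max(2%, 3*stdev), with stdev taken over the
    # CONTROL samples and expressed as a percentage of the CONTROL mean.
    noise_floor_pct = max(2.0, 3.0 * 100.0 * stdev(control_ttft_ms) / ctrl)
    go = (
        md5_green
        and not aggregate_regressed
        and drop_pct > 20.0
        and drop_pct > noise_floor_pct
    )
    return go, drop_pct, noise_floor_pct
```

With only the CONTROL arm recorded there is nothing to pass as
candidate_ttft_ms, which is why the gate stays at its go=false default until
the CANDIDATE arm completes.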
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
P2 (expert-major fused routed-FFN region executor, LLAMA_MOE_REGION_EXECUTOR,
default-off) is recorded as NO-GO on two independent signals; nothing built
beyond the P0 kill-gate, nothing landed, fork localai-paged HEAD untouched at
653bb2f3d (LocalAI series stays at 46 patches, 0001-0055).
(1) Primary GO metric flat: n=257 MOE_SWIGLU_DOWN region 1022.15 us vs
grouped-MMQ control 1021.61 us = -0.05%, i.e. marginally slower (needed >5%
faster); n=128 -0.34%; MUL_MAT_ID_RAGGED_MOE +0.48%/+0.28% (region never
engages). All deltas sit inside the 5-sample spread, reproducing the six prior
one-boundary transplants (phases 113/114/122/123/125/127). A compact
expert-major layout + single sort, with both GEMMs still ragged grouped-MMQ,
does not move the ragged-tile tax; that needs P3 Marlin persistent-CTA, not a
P2 layout swap.
(2) Decisive structural blocker: q36-35b-a3b-nvfp4 ships separate
ffn_gate_exps/ffn_up_exps (+ per-tensor .scale) with ggml_swiglu_split, not the
merged gate_up->VIEW->VIEW->SWIGLU->down shape the whole-pattern matcher
requires; the matcher, region executor, and pre-existing POC/fused-quant all
engage 0x on q36 in prefill and decode. KL delta 0.000000 is vacuous (0
engagement). Default md5 canonical both models (MoE 8cb0ce23, dense 5951a5b4);
test-backend-ops all green both arms.
Prerequisite handoff (gates P2 and P3): rebuild the seam for q36's
separate/scaled/swiglu-split FFN shape before any MoE-region lever can engage,
then re-evaluate a fused two-GEMM region (not a layout swap). Topic branch
p2-moe-region retained on the fork for forensics at 2d87564dd (base 653bb2f3d),
not pushed.
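The arithmetic behind signal (1) can be reproduced with a short sketch. The
helper names are hypothetical, and normalising by the control time is an
assumption; at this magnitude, normalising by the region time gives the same
rounded figure.

```python
def speedup_pct(control_us, candidate_us):
    """Percent by which the candidate is faster than control (negative = slower)."""
    return 100.0 * (control_us - candidate_us) / control_us

def p2_primary_go(control_us, candidate_us, required_pct=5.0):
    """Hypothetical gate check: GO only if the region is >required_pct faster."""
    return speedup_pct(control_us, candidate_us) > required_pct

# n=257 figures from signal (1): grouped-MMQ control vs MOE_SWIGLU_DOWN region.
n257 = speedup_pct(1021.61, 1022.15)
```

Here n257 rounds to -0.05, well short of the >5% threshold, so
p2_primary_go(1021.61, 1022.15) is False.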
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Adversarial verification against the canonical fork mudler/llama.cpp:localai-paged
HEAD 1edddc8fe found the scope doc's section-3 seam references were anchored to
the abandoned pre-trim tree 237ad9b96, which the immediately-preceding commit
b529cc5420 reset away. Two classes of defect, both corrected:
- Phantom scaffolding (honesty): the doc claimed "the team has already started
scaffolding P1 and P3" citing four commits (237ad9b96 bf16 GDN state cache,
afc2c7030 act-quant trace, ea0875d14 LLAMA_BF16_CUBLAS_F32_OUT, 7967ad47f
W4A16 direct-A stub) that b529cc5420 TRIMMED - none exist at 1edddc8fe (git
cat-file: not a valid object). w4a16-policy.h, test-cuda-w4a16-policy.cpp and
ggml_cuda_mul_mat_id_w4a16_grouped_direct_a are absent from the tree. Reworded
P1 plank-1 and the P3 mechanism/files/effort to say these must be re-introduced
on top of the surviving grouped W4A16 path (patch 0035), not "finished".
- Stale line numbers (additivity): every file:line was off (computed against the
larger 237ad9b96 tree). Re-anchored to 1edddc8fe: ggml_cuda_try_fuse 4232 (was
4661), capture loop 4908 (was 5444), moe whole-pattern matcher 4157 (was 4678),
routed_ffn_poc moe-ffn.cu:275 (was 456), grouped W4A16 hook ggml-cuda.cu:2797
(was 3093/3188; the direct-A hooks 3085/3171 never existed), concurrent_event
machinery 4769 (was 5305-5318), continuous-batch budget server-context.cpp
3083-3135 with LLAMA_MAX_BATCH_TOKENS at 3105 / prefill_budget_step at 3113
(was 3122-3200).
Numbers (attribution table, recovery arithmetic), the six P0 kill-gates, and the
unreachable-floor honesty were verified sound and left unchanged.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Reframe the GB10 vLLM-parity gap from a per-lever "hardware floor" verdict
to a ggml-execution-architecture-conditional one: same-silicon 2-3x is
software architecture, not silicon. Add EXECUTION_REARCH_SCOPE.md, a phased
additive program (P1 bf16-native stream, P2 expert-major fused MoE region,
P3 Marlin large-M retry on P1+P2, P4 token-budget scheduler, P5 blocked-solve
GDN, P6 fp8 KV), each with the ggml/fork seam, default-off env gate, per-path
md5/KL correctness gate, a falsifiable P0 kill-gate, expected-recovery
arithmetic grounded in the both-engine nsys buckets, and upstream-clash
analysis. Point the README docs list and PARITY_HANDOFF forward-direction at
it.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Record the phase 110-140 GDN/MoE campaign benchmark log and append the
series-trim decision to the parity handoff: keep the Phase135 routed-FFN
fused-quant line plus the MoE test sentinels and the MTP-draft correctness
fix; drop the W4A16 structural line, the trace/tile-policy patches, GPU-sort,
W4A16-direct-A, and the finalize fusion. Rejected/neutral levers are recorded
in the handoff and the per-phase bench artifacts. Fork re-mirrored on
51168c5ee: fd920cf8a a85c1e098 2fed6aacf f1d976f06 1edddc8fe (HEAD tree
097c862c).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Record Phase81 default-off BF16 persistent S-cache results, including md5 drift, op gates, decode profile, and KL smoke. Scope Phase82 as full f16-reference KL plus serving A/B before patch-series promotion.
Assisted-by: Codex:gpt-5
Record Phase58 prompt-backlog threshold A/B, DGX gates, MoE and dense serving results, and the repeat-before-default decision.
Assisted-by: Codex:gpt-5
Record Phase57 capped TTFT prefill-first sweep, DGX gates, and the decision to keep the cap as an A/B knob rather than a parity path.
Assisted-by: Codex:gpt-5
Record Phase56 MoE and lower-concurrency validation for the TTFT prefill-first policy, including DGX gates and the opt-in-only decision.
Assisted-by: Codex:gpt-5
Record Phase54 trace-only histogram work, DGX md5/op gates, dense serving histogram evidence, and the next scheduler decision.
Assisted-by: Codex:gpt-5