# PARITY_HANDOFF: how to pick up the GB10 vLLM-parity work > 2026-07-02 forward direction: the active plan is now > [`EXECUTION_REARCH_SCOPE.md`](EXECUTION_REARCH_SCOPE.md), which reframes the > per-lever "hardware floor" verdict as *ggml-execution-architecture-conditional* > (same-silicon 2-3x is software) and scopes an additive, phased (P1 bf16-native > stream, P2 expert-major fused MoE region, P3 Marlin large-M retry on top of > P1+P2, P4 token-budget scheduler, P5 blocked-solve GDN, P6 fp8 KV) program with > a falsifiable P0 kill-gate per phase. The port-forensics finding is that the > failed single-kernel/single-boundary A/Bs below failed on *integration tax* > (dropped into a materialize-every-node executor), not because the kernels are > GB10-hostile; the reject log below is the evidence that grounds those verdicts. > Read the scope doc first for what to build next. > > 2026-06-30 update: this handoff is now historical procedure, not the active > verdict. The GB10 investigation was reopened in `GB10_PARITY_REOPEN_SPEC.md` > and `GB10_PARITY_PHASE0_RESULTS.md`, with Phase 6 serving-nsys evidence and > the active follow-up plans under `docs/superpowers/plans/`. Use those files for > the current state before relying on the older "closed" conclusion below. > > 2026-07-01 Phase112 update: keep the new default-off > `LLAMA_W4A16_DIRECT_A=1` direct activation staging hook, especially combined > with Phase110 `LLAMA_MOE_GPU_SORT=1`. Artifact: > `/home/mudler/bench/phase112_w4a16_direct_a/20260701_231749_direct_a`. > Selected gates passed `13/13` for W4A16+GPU-sort, direct-A, and > direct-A+GPU-sort. Direct-A+GPU-sort improved the 257-token W4A16 fallback > rows versus W4A16+GPU-sort control (`MOE_SWIGLU_DOWN 1551.08 -> 1477.74 us`, > `MUL_MAT_ID_RAGGED_MOE 2278.50 -> 2166.22 us`) but was neutral/slightly > slower on 128-token rows. Canonical README md5 gates are green: MoE > `8cb0ce23777bf55f92f63d0292c756b0`, dense > `5951a5b4d624ce891e22ab5fca9bc439`; compact supported op gates are green > (`SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`, `GET_ROWS 49/49`, > `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`). > This is still default-off structural groundwork, not parity: W4A16 fallback > remains slower than the default grouped-MMQ path. Use the patch-series README > md5 command as canonical; the handoff `-no-cnv -c 4096` snippet produced > stable but non-canonical md5s for both candidate and control. > > 2026-07-01 Phase113 update: reject the combined direct-A GPU-tile descriptor > attempt. Artifact: > `/home/mudler/bench/phase113_w4a16_direct_a_gpu_tiles/20260701_233345_no_readback`. > The candidate (`LLAMA_W4A16_GPU_TILES=1` on top of Phase112 direct-A+GPU-sort) > avoided the `n_tiles` readback by launching over zero-initialized `max_tiles` > and returning early on `rows <= 0`. Selected correctness passed `13/13`, but > perf failed the keep gate: `MOE_SWIGLU_DOWN n=257` was flat > (`1478.16 -> 1476.36 us`) and `MUL_MAT_ID_RAGGED_MOE n=257` regressed > (`2148.44 -> 2214.23 us`). The source was reverted and post-revert > Phase112 direct-A+GPU-sort selected gates passed `13/13`. Next W4A16/MoE work > should not revisit compact GPU tile descriptors; use vLLM-style padded routing > metadata (`sorted_token_ids`, expert ids per M block, padded row count) if > continuing this line. > > 2026-07-01 Phase114 update: reject the naive padded routing implementation. > It implemented the vLLM-style metadata contract with separate padded source > ids and destination ids for llama.cpp, plus an expert-id W4A16 consumer mode > and a direct scatter that skipped compact `get_rows_cuda`. Correctness passed > (`13/13`) but perf failed: after a fix using `num_tokens_post_pad` early > returns, `MOE_SWIGLU_DOWN n=257` regressed `1477.88 -> 1726.27 us` and > `MUL_MAT_ID_RAGGED_MOE n=257` regressed `2163.35 -> 2650.93 us`. Artifacts: > `/home/mudler/bench/phase114_w4a16_padded_routing/20260701_234634_padded_meta` > and > `/home/mudler/bench/phase114_w4a16_padded_routing/20260701_235003_padded_meta_fix1`. > Source was reverted; post-revert Phase112 direct-A+GPU-sort selected gate > passed `13/13`. Padded metadata is not enough by itself on GB10 because sparse > expert occupancy makes padded activation/output traffic too expensive. > > 2026-07-02 Phase115 update: reject another small-M/tile-policy shortcut. > Phase115 re-tested the existing default-off `LLAMA_MOE_SMALL_M_TILE=16/32/64` > knob on the newer Phase108 whole-graph MoE sentinels. Artifact: > `/home/mudler/bench/phase115_moe_small_m_sentinel/20260702_020258`. > Control and all three tile caps passed selected correctness (`13/13` each), > but no candidate met the promotion rule. The 257-token ragged down row > regressed for every cap (`1452.30 us` control vs `1455.02`, `1458.71`, and > `1456.88 us`). Do not add name-based down special cases or another MMQ > tile-policy patch. The next credible target is a true fused routed-MoE kernel > or a graph-level fusion that removes materialized activation/output traffic. > > 2026-07-02 Phase116 update: reject the standalone graph-level > SwiGLU-to-MMQ-activation-quant fusion. The default-off candidate > `LLAMA_MOE_SWIGLU_DOWN_FUSED_QUANT=1` detected the plain > `GLU -> down MUL_MAT_ID` pattern and computed `silu(gate) * up` directly into > the grouped-MMQ NVFP4 activation buffer. Artifact: > `/home/mudler/bench/phase116_moe_swiglu_down_fused_quant/20260702_022611`. > Correctness passed (`13/13`) and the fix1 route emitted the fused marker > (`6` hits), but perf was not useful: `MOE_SWIGLU_DOWN n=257` was flat > (`1024.90 -> 1024.69 us`), `n=128` regressed (`806.33 -> 808.79 us`), and the > ragged sentinel drifted slower. Source was reverted and post-revert selected > gate passed `13/13`. Do not retry this narrow fused-quant route; the next > fused-MoE attempt must remove a larger boundary, such as route-once metadata > shared by both expert GEMMs plus fused GEMM1/activation/GEMM2 or > weighted-combine/scatter. > > 2026-07-02 Phase117 update: keep the default-off MoE boundary trace as > diagnostic instrumentation only. Artifact: > `/home/mudler/bench/phase117_moe_route_once_boundary/20260702_024140`. > The trace decomposes `MOE_SWIGLU_DOWN` into route-sort, activation > quantization, grouped-MMQ launch, GLU, and graph-pattern records under > `LLAMA_MOE_BOUNDARY_TRACE=1`; optional timing is gated by > `LLAMA_MOE_BOUNDARY_TIMING=1`. Inline CUDA event timing initially aborted > under CUDA graph capture, so the guarded trace emits `us=-1` while capturing > and only produces real event timings with `GGML_CUDA_DISABLE_GRAPHS=1`. > Post-guard selected gates passed (`13/13`), trace mode passed (`7/7`), and > canonical gates passed: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, > `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. The timing attribution does not > fund another local route-sort, tile, GLU, or activation-quant shortcut. The > next MoE source phase should own a larger pipeline boundary: shared > route-once metadata across gate_up/down and/or whole-pattern > GEMM1->activation->GEMM2 execution. > > 2026-07-02 Phase118 update: reject standalone route-metadata caching. > Artifact: > `/home/mudler/bench/phase118_moe_route_cache/20260702_030549`. The > default-off candidate `LLAMA_MOE_ROUTE_CACHE=1` stored ids-derived grouped-MMQ > route metadata in context-owned buffers and reused it within a graph > evaluation. It was correctness-clean (`13/13` default, opt-in, and > post-reject) and the trace showed reuse (`23` hits, `3` misses on > `MOE_SWIGLU_DOWN n=128`), but perf was too small: `MOE_SWIGLU_DOWN n=257` > improved only `1017.711 -> 1011.915 us` (`+0.57%`) and `n=128` regressed > `799.360 -> 803.738 us` (`-0.55%`). Runtime cache source was reverted; only a > local `ggml_cuda_mmq_ids_meta` helper refactor remains as low-conflict > groundwork. Do not retry metadata-cache-only work. The next attempt must own > more of the vLLM-style pipeline: GEMM1->activation->GEMM2 and/or > scatter/combine, not just skipping one `mm_ids_helper` launch. > > 2026-07-02 Phase119 update: keep the default-off whole-pattern contract trace > after fix1. Initial artifact: > `/home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_034729`; > fix1 artifact: > `/home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_035126_fix1`. > The initial trace proved coverage but missed the overhead rule on > `MOE_SWIGLU_DOWN n=257` (`1015.070 -> 1028.937 us`, `-1.35%`). Fix1 moved > detector work off the default path unless `LLAMA_MOE_WHOLE_PATTERN_TRACE` or > the existing boundary trace is enabled. Fix1 gates are green: selected > `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` `13/13`, trace `MOE_SWIGLU_DOWN` > `7/7`, canonical MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, > `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Trace overhead is now within > rule (`MOE_SWIGLU_DOWN n=128` `805.400 -> 805.584 us`, `-0.02%`; > `n=257` `1019.715 -> 1021.836 us`, `-0.21%`) and emits supported NVFP4 > markers for both `n_tokens=128` and `257`. This is diagnostic scaffolding, > not a runtime optimization. The next executor attempt should match at the > earlier `gate_up MUL_MAT_ID` node and skip through `VIEW, VIEW, GLU, down > MUL_MAT_ID`; the current `GLU -> down` hook is validation-only because GEMM1 > has already executed. > > 2026-07-02 Phase120 update: keep the default-off early whole-pattern matcher > after fix2. Initial artifact: > `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040153`; > fix2 artifact: > `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040725_fix2`. > The initial/fix1 versions proved `skip_ready=4` but emitted noisy unsupported > markers from unrelated `MUL_MAT_ID` candidates. Fix2 emits only the actual > early pattern and is clean: selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` > `13/13`, early trace `MOE_SWIGLU_DOWN` `7/7`, canonical MoE md5 > `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and > `MUL_MAT_ID 806/806`. It emits exactly six supported early markers for the > perf sentinels, covering `n_tokens=128` and `257`, with `skip_ready=4`, > `ids_match=1`, and `swiglu=1`. Trace overhead is within rule > (`MOE_SWIGLU_DOWN n=128` `803.937 -> 808.978 us`, `-0.62%`; > `n=257` `1020.412 -> 1026.073 us`, `-0.55%`). The next source phase can now > implement a guarded executor at this early matcher. First prove safe > ownership/skip accounting for the five-node sequence, then move route-plan > reuse and activation/down execution into the helper. > > 2026-07-02 Phase121 update: keep the default-off executor proof after fix1. > Initial artifact: > `/home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041543`; > fix1 artifact: > `/home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041739_fix1`. > The initial run passed correctness but emitted zero exec markers because the > exec branch was accidentally nested under the early-trace env condition. > Fix1 makes `LLAMA_MOE_WHOLE_PATTERN_EXEC=1` engage independently. Gates are > green: selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` `13/13`, exec > `MOE_SWIGLU_DOWN` `7/7`, canonical MoE md5 `8cb0ce23`, dense md5 > `5951a5b4`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Exec perf emits > six `skip=4` markers covering `n_tokens=128` and `257`, and target perf is > neutral (`MOE_SWIGLU_DOWN n=128` `807.772 -> 806.051 us`, `+0.21%`; > `n=257` `1021.115 -> 1020.839 us`, `+0.03%`). This proves ownership and skip > accounting only; it is not a fused-MoE speedup. The next source phase should > replace one internal boundary inside this helper, preferably route-plan reuse > or activation in route-slot order, with the same md5/op gates. > > 2026-07-02 Phase122 update: reject route-only metadata reuse inside the > Phase121 executor. Artifact: > `/home/mudler/bench/phase122_moe_shared_route_meta/20260702_043212`. > The candidate exposed `ggml_cuda_mmq_ids_meta` as a public MMQ helper and > used `LLAMA_MOE_WHOLE_PATTERN_SHARED_ROUTE=1` to build route metadata once > for both `gate_up` and `down`. Correctness passed (`13/13` selected and > `7/7` shared-route), but perf missed the keep gate: > `MOE_SWIGLU_DOWN n=128` regressed `808.190 -> 811.836 us` and `n=257` > regressed `1020.850 -> 1051.666 us` versus the Phase121 executor. Source was > reverted, including the public metadata API and shared-route env. Post-reject > gates on the reverted tree passed (`13/13` selected and `7/7` Phase121 exec) > with six retained exec markers. Do not retry route-only metadata reuse. The > next MoE executor scope should target activation/down data layout, direct > activation-to-down input, or a larger GEMM1->activation->GEMM2 fused boundary. > > 2026-07-02 Phase123 update: reject standalone fused-down activation > quantization inside the Phase121 executor. Artifact: > `/home/mudler/bench/phase123_moe_executor_fused_down_input/20260702_025811`; > red check: > `/home/mudler/bench/phase123_moe_executor_fused_down_input/red_20260702_025031`. > The candidate used `LLAMA_MOE_WHOLE_PATTERN_FUSED_DOWN=1` to run `gate_up`, > compute `silu(gate) * up` directly into the sorted NVFP4 down MMQ activation > buffer, and launch the existing down MMQ kernel. Correctness passed > (`13/13` selected, `7/7` fused-down, six fused markers), but perf was flat: > versus Phase121 exec, `MOE_SWIGLU_DOWN n=128` was > `811.153 -> 810.618 us` and `n=257` was `1023.090 -> 1023.657 us`. > Source was reverted, including the fused-down env, MMQ helper, and NVFP4 > fused quant kernel. Post-reject gates passed (`13/13` selected, `7/7` > Phase121 exec, six exec markers). Do not retry a single-boundary > SwiGLU-to-down-quant shortcut; if continuing MoE source work, scope a full > expert-major packed pipeline that owns `GEMM1->activation->GEMM2`, or pivot to > another measured bottleneck. > > 2026-07-02 Phase124 update: current-stack graph-node serving was refreshed > after the Phase122/123 rejections. Artifact: > `/home/mudler/bench/phase124_current_moe_profile/20260702_031205`. > Pre/post gates are green: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, > `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. At `N=128`, prompt `128`, > generation `64`, serving under graph-node profiling was > `agg_tps 206.2`, `decode_agg_tps 320.3`, `prefill_tps 1536.4`, wall > `39.738s`. Fine buckets are now `mmq_nvfp4 6074.78 ms` (`30.17%`) and > `gdn_core 5888.31 ms` (`29.25%`), with `act_quant` only `674.88 ms` > (`3.35%`). This explains why single-boundary activation/quant attempts were > flat. The next source work must reduce one of the two dominant buckets: > either a full expert-major MoE pipeline for `mmq_nvfp4`, or a default-off GDN > decode/core experiment for `gdn_core`. Do not spend more GB10 time on > route-only metadata reuse, fused-down quantization, or MoE tile-policy knobs > unless a new profile makes those buckets material. > > 2026-07-02 Phase125 scoping update: two independent code explorers and a > local GDN audit challenged the Phase124 fork in the road. The chosen next > source attempt is the MoE side, but only as a first maintainable slice: > implement a default-off MMQ sorted-output primitive behind > `LLAMA_MOE_EXPERT_MAJOR_SORTED_OUT=1`, immediately unsort as a proof, and > measure `MOE_SWIGLU_DOWN` before attempting the full > `gate_up -> SWIGLU -> down` expert-major executor. Rationale: vLLM's portable > advantage is keeping activations expert-major across both GEMMs and > unpermuting once; Phase122/123 failed because they only touched route metadata > or one activation boundary. Do not copy CUTLASS/FlashInfer pointer-array, TMA, > or FP4 scale-swizzle internals. A small GDN patch is not funded by current > evidence because previous decode/core micro-attempts already rejected the > obvious geometry/store/broadcast/conv-state shortcuts. Plan: > `docs/superpowers/plans/2026-07-02-moe-expert-major-sorted-output-phase125.md`. > > 2026-07-02 Phase125 result: reject the MMQ sorted-output plus immediate > unsort proof. Artifact: > `/home/mudler/bench/phase125_moe_expert_major_sorted_output/20260702_033931`; > post-reject: > `/home/mudler/bench/phase125_moe_expert_major_sorted_output/post_reject_20260702_034232`. > The candidate was default-off and correctness-clean (`13/13` default > selected, `7/7` opt-in `MOE_SWIGLU_DOWN`, 12 sorted markers), but perf failed > decisively: versus Phase121 exec, `MOE_SWIGLU_DOWN n=128` regressed > `805.16 -> 888.76 us` and `n=257` regressed `1023.83 -> 1192.05 us`. > Source was reverted. Post-reject gates are green: selected `13/13`, Phase121 > exec `7/7` with six markers, MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, > `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Do not retry a path that adds > a sorted-output temporary and immediately unsorts. A future expert-major MoE > attempt must keep sorted activations through the down GEMM and unpermute only > once after the full FFN, or pivot to a larger GDN recurrence design. > 2026-07-02 Phase126 result: keep the grouped-MMQ presorted helper scaffold. > The patch only touches `mmq.cu`/`mmq.cuh`, refactors the current MoE id path > behind explicit `ids_src1`/`ids_dst`/`expert_bounds` metadata, and exposes a > `src1_sorted` entry point for the future whole-MoE executor. Fixed artifact: > `/home/mudler/bench/phase126_mmq_presorted_helper/fix1_20260702_040858`. > Gates were green: selected `13/13`, MoE md5 > `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 > `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`, > `MUL_MAT_ID 806/806`. Focused perf was neutral: > `MOE_SWIGLU_DOWN n=128 805.99 us`, `MUL_MAT_ID_RAGGED_MOE n=128 > 1243.85 us`, `MOE_SWIGLU_DOWN n=257 1018.74 us`, > `MUL_MAT_ID_RAGGED_MOE n=257 1452.84 us`. This is not a parity win by > itself; it is the dependency for Phase127 to keep `gate_up -> SWIGLU -> down` > in expert-major order and unpermute only once after the full FFN. > 2026-07-02 Phase127 result: reject and revert the whole-MoE expert-major > executor built on the Phase126 helper. Red: > `/home/mudler/bench/phase127_moe_whole_expert_major/red_20260702_042125` > passed by fallback with zero markers. Candidate green: > `/home/mudler/bench/phase127_moe_whole_expert_major/green2_20260702_042916` > passed default selected `13/13` and opt-in `MOE_SWIGLU_DOWN 7/7`, emitting > six `LLAMA_MOE_WHOLE_EXPERT_MAJOR` markers after fixing the down-weight shape > interpretation (`down_w` is `[n_ff, n_embd, experts]`). Perf artifact: > `/home/mudler/bench/phase127_moe_whole_expert_major/perf_20260702_043104`. > It failed the keep rule: `MOE_SWIGLU_DOWN n=128` regressed > `802.57 -> 812.14 us`; `n=257` regressed `1023.25 -> 1039.36 us`; > ragged standalone was essentially flat. Source was reverted. Post-reject: > `/home/mudler/bench/phase127_moe_whole_expert_major/post_reject_20260702_043318` > passed selected `13/13`, MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, > `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Do not retry the same > fake-tensor whole-executor shape; the next MoE attempt must remove more > temporary traffic or become a real fused grouped MMQ/SWIGLU/down path. A > separate alternative is the previously scoped Qwen3Next BF16 GDN S-cache > experiment, but that needs non-md5 numerical gates. > 2026-07-02 Phase128 result: reject/revert the Qwen3Next BF16 GDN S-cache > selector probe for the current target. Artifact: > `/home/mudler/bench/phase128_qwen3next_gdn_bf16_s_cache/default_20260702_043939` > built and passed default gates (`GATED_DELTA_NET 48/48`, canonical MoE md5 > `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT`, `MUL_MAT_ID`). Verbose smoke > artifact: > `/home/mudler/bench/phase128_qwen3next_gdn_bf16_s_cache/smoke3_20260702_044434` > showed the active decision model is `qwen35moe`, not Qwen3Next, and S cache > remained `f32` under `LLAMA_QWEN3NEXT_GDN_S_CACHE_TYPE=bf16`. No true > Qwen3Next GGUF was found on DGX. The relevant Qwen35/Qwen35MoE BF16 S-cache > lever was already Phase81/82: it cut `gdn_core` but changed MoE md5 and > missed the full f16-reference KL acceptance band. Do not retry this exact > lever unless the quality gate is explicitly re-scoped or a real Qwen3Next > model artifact is available. > 2026-07-02 Phase129 result: reject/revert the Qwen35/Qwen35MoE grouped Q/K > broadcast probe for fused GDN. Plan: > `docs/superpowers/plans/2026-07-02-qwen35-gdn-qk-grouped-bcast-phase129.md`. > The candidate added a default-off `LLAMA_QWEN35_GDN_QK_BCAST=1` branch in > `src/models/qwen35.cpp` and `src/models/qwen35moe.cpp`, reusing the existing > Qwen3Next `ggml_gated_delta_net_set_bcast()` path. Default gates were green: > `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/default_20260702_065445` > passed MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 46/46`, > `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. A standalone opt-in gate > artifact at `optin_20260702_065604` was invalid because > `paged-inference-gates.sh` only passes completion env through `EXTRA_ENV`. > The valid opt-in pre-gate from > `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/decode_optin_20260702_070149/gate_pre` > changed MoE md5 to `b773e2f032aa0e992626d486b321808e`, so profiling was > stopped and the source was reverted. Post-reject: > `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/post_reject_20260702_070258` > passed canonical MoE/dense md5, `GATED_DELTA_NET 46/46`, `MUL_MAT 1146/1146`, > and `MUL_MAT_ID 806/806`; rebuilt `libllama.so` has zero > `LLAMA_QWEN35_GDN_QK_BCAST` strings. Do not retry this Qwen3Next > grouped-broadcast port for Qwen35/Qwen35MoE under the current bit-exact md5 > rule. > 2026-07-02 Phase130 result: current-stack graph-node serving profile refresh, > measurement-only. Artifact: > `/home/mudler/bench/phase130_current_stack_profile/20260702_070949`. Shape: > MoE `q36-35b-a3b-nvfp4`, `N=128`, prompt `128`, generation `64`, > `PARALLEL=128`, `CTX=131072`. Pre/post gates passed canonical MoE md5 > `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and > `MUL_MAT_ID 806/806`. Serving metrics: `agg_tps 208.0`, > `decode_agg_tps 326.9`, `prefill_tps 1519.6`, `TTFT mean 8170.6 ms`, wall > `39.38 s`, total kernel time `20.1559 s`. The profile confirms the live > bottleneck remains split between `mmq_nvfp4 6009.52 ms` (`29.82%`) and > `gdn_core 5891.40 ms` (`29.23%`). FA/mask cleanup is not funded: > `get_rows 280.62 ms` (`1.39%`) and `fa 257.38 ms` (`1.28%`). The next source > attempt must target a larger MoE/FFN-GEMM executor/kernel or a materially > different GDN recurrent-state/packed-decode design, not another paged-mask, > route-only, activation-only, grouped-broadcast, BF16-cache, or launch-geometry > shortcut. > 2026-07-02 Phase131 result: source-selection challenge, no source changes. > Plan: > `docs/superpowers/plans/2026-07-02-fused-routed-ffn-phase131.md`. Two > read-only explorers challenged the Phase130 fork. MoE/FFN-GEMM source work is > not funded unless it becomes a real fused routed-FFN kernel/executor; another > route-only, activation-only, W4A16, tile-policy, sorted-output, or fake > executor patch is expected to repeat Phases 110-127. GDN source work is not > funded unless it materially reduces f32 recurrent-state traffic without > BF16/quality drift; launch geometry, gather/identity, producer/store fusion, > BF16 S-cache, and grouped Q/K broadcast have already failed or changed md5s. > The next active line is to audit vLLM's fused MoE design and llama.cpp's > current whole-pattern executor hook for a default-off fused routed-FFN PoC. > If that audit does not produce a concrete low-conflict hook, require a > standalone CUDA PoC before touching llama.cpp source. > > 2026-07-02 Phase132 result: keep the new default-off routed-FFN PoC scaffold. > Plan: > `docs/superpowers/plans/2026-07-02-routed-ffn-poc-phase132.md`. Artifact: > `/home/mudler/bench/phase132_routed_ffn_poc/20260702_072725`. Source adds > `ggml/src/ggml-cuda/moe-ffn.cu/.cuh` and a narrow hook in > `ggml/src/ggml-cuda/ggml-cuda.cu` behind `LLAMA_MOE_ROUTED_FFN_POC=1`. > The helper currently executes the baseline `gate_up -> SWIGLU -> down` > sequence through the existing whole-pattern hook, so it is a scaffold, not a > parity speedup. Initial incremental build failed until CMake was reconfigured > to pick up the new globbed CUDA source; after `cmake -S . -B build`, build > passed. Selected default and opt-in gates passed > `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13`; opt-in emitted six exec > markers and `libggml-cuda.so` contains one `LLAMA_MOE_ROUTED_FFN_POC` string. > Default and opt-in canonical gates passed MoE md5 `8cb0ce23`, dense md5 > `5951a5b4`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, and > `MUL_MAT_ID 806/806`. Focused perf was neutral (`808.32 -> 804.87 us` at > n=128, `1023.36 -> 1022.71 us` at n=257). Next phase may replace the helper > internals with a real fused routed-FFN slice; do not claim Phase132 itself as > a speedup. > > 2026-07-02 Phase133 result: keep only as a default-off structural base, not a > speedup. Plan: > `docs/superpowers/plans/2026-07-02-routed-ffn-sorted-down-phase133.md`. > Artifact: > `/home/mudler/bench/phase133_routed_ffn_sorted_down/20260702_074651`. > Source exposes `ggml_cuda_mmq_ids_meta`, adds raw > `ggml_cuda_mul_mat_q_moe_sorted_f32(...)`, and adds > `LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1` on top of > `LLAMA_MOE_ROUTED_FFN_POC=1`. The path executes baseline `gate_up` and > `SWIGLU`, gathers the SWIGLU output into compact expert-sorted F32 rows, then > calls raw MMQ down without fake tensors. Selected default, Phase132, and > Phase133 gates passed `13/13`; Phase133 trace proved six > `mmq_moe_sorted_raw` launches. Default and Phase133 canonical gates passed > MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 48/48`, > `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Perf was not a win: > default `807.37/1020.76 us`, Phase132 `808.21/1018.87 us`, Phase133 > `808.85/1026.87 us` for `n=128/257`. Next phase must fuse SWIGLU-to-sorted > or SWIGLU-to-quant to remove this added gather/quant boundary; do not promote > sorted-down as-is. > > 2026-07-02 Phase134 result: keep only as default-off fused-SWIGLU structural > base, not a speedup. Plan: > `docs/superpowers/plans/2026-07-02-routed-ffn-fused-swiglu-phase134.md`. > Artifact: > `/home/mudler/bench/phase134_routed_ffn_fused_swiglu/20260702_075828`. > Source adds `LLAMA_MOE_ROUTED_FFN_FUSED_SWIGLU=1` on top of > `LLAMA_MOE_ROUTED_FFN_POC=1`, passes `gate/up` views into the routed-FFN > helper, computes `silu(gate) * up` directly into expert-sorted F32 rows, and > calls the raw sorted-F32 down MMQ helper. The fused flag now implies the > sorted-down path; `LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1` is not required. > Selected opt-in gates passed `13/13`; trace proved six `mmq_moe_sorted_raw` > launches; canonical opt-in gates passed MoE md5 `8cb0ce23`, dense md5 > `5951a5b4`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, and > `MUL_MAT_ID 806/806`. Perf is mixed: default `804.92/1026.02 us`, Phase132 > `808.00/1028.43 us`, Phase133 `808.07/1029.02 us`, Phase134 > `810.61/1025.68 us` for `n=128/257`. It recovers n=257 but regresses n=128; > next work must fuse SWIGLU directly into down-MMQ quant or remove another > launch/buffer before this becomes a parity lever. > > 2026-07-02 Phase135 result: keep as current best default-off routed-FFN base, > but not parity. Plan: > `docs/superpowers/plans/2026-07-02-routed-ffn-fused-quant-phase135.md`. > Focused artifact: > `/home/mudler/bench/phase135_routed_ffn_fused_quant/20260702_081723`. > Serving artifact: > `/home/mudler/bench/phase135_routed_ffn_fused_quant_serving/20260702_082102`. > Source adds `LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1` on top of > `LLAMA_MOE_ROUTED_FFN_POC=1`, computes `silu(gate) * up` directly into the > NVFP4 MMQ activation layout, and launches raw down MMQ via > `ggml_cuda_mul_mat_q_moe_quantized(...)`. Focused selected gates passed > `13/13`; trace proved six `mmq_moe_quantized_raw` launches and zero > `mmq_moe_sorted_raw` launches; canonical focused gates passed MoE md5 > `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 48/48`, > `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Focused perf: > default `805.92/1031.06 us`, Phase134 `807.65/1027.51 us`, Phase135 > `807.92/1024.97 us` for `n=128/257`. Serving at the Phase130 shape passed > pre/post gates and improved decode aggregate t/s `326.9 -> 332.7`, while > `mmq_nvfp4` dropped `6009.52 -> 5915.24 ms`; aggregate stayed `208.0`, prefill > worsened `1519.6 -> 1475.1`, and total kernel time rose slightly > `20.1559 -> 20.2498 s`. Next work should target remaining MoE overhead after > fused quant (`mmq_fixup`, route/writeback, weighted combine), not another F32 > intermediate. > > 2026-07-02 Phase136 result: reject and revert the separate post-down > weighted-combine fuse. Plan: > `docs/superpowers/plans/2026-07-02-routed-ffn-combine-phase136.md`. > Focused artifact: > `/home/mudler/bench/phase136_routed_ffn_combine/20260702_083727`. > Serving artifact: > `/home/mudler/bench/phase136_routed_ffn_combine_serving/20260702_085749`. > The candidate added `LLAMA_MOE_ROUTED_FFN_COMBINE=1` on top of Phase135 and > skipped the post-down `MUL(weights) -> VIEW* -> ADD*` tail with a separate > F32 weighted-combine kernel. It was correctness-clean: expanded selected > gates passed `20/20`, trace proved six combine markers plus six > `mmq_moe_quantized_raw` launches and zero sorted launches, canonical gates > passed MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 46/46`, > `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Focused full-tail perf > improved (`MOE_SWIGLU_COMBINE n=257` `428.53 -> 401.81 us` versus Phase135), > but serving regressed versus Phase135: aggregate/decode t/s > `208.0/332.7 -> 206.5/323.2`. Source and the sentinel test were reverted; > post-reject Phase135 selected gates passed `13/13`. Do not retry a standalone > post-MMQ combine launch as the next parity lever; any combine/finalize work > needs a larger serving-visible fused writeback/finalize design. > > 2026-07-02 Phase137 result: reject the GDN launch-geometry retune with no > source changes. Plan: > `docs/superpowers/plans/2026-07-02-gdn-geometry-sweep-phase137.md`. > Focused artifact: > `/home/mudler/bench/phase137_gdn_geometry_sweep/20260702_091441`. > Serving artifact: > `/home/mudler/bench/phase137_gdn_geometry_serving/20260702_091740`. > The env-only sweep tested existing `GDN_NW`/`GDN_CPW` knobs. The best focused > candidate, `GDN_NW=4 GDN_CPW=1`, improved 1-token GDN rows > (`hc=32,hs=128,kda=0` `6.793748 -> 4.713682 us`, KDA > `7.790557 -> 5.194275 us`, grouped broadcast `5.967364 -> 3.407998 us`), > but real serving regressed versus Phase135 despite clean pre/post gates: > MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and > `MUL_MAT_ID 806/806`. Aggregate/decode t/s moved > `208.0/332.7 -> 206.2/324.9`, total kernel time rose > `20.2498 -> 20.7530 s`, and `gdn_core` worsened > `5926.55 -> 6466.27 ms`. Do not promote or source-code a GDN geometry retune > for this target. The next scoped source line is default-off MoE > finalize/writeback inside the existing down-MMQ path, not a standalone > post-MMQ combine launch. > > 2026-07-02 Phase138 attempt 1 update: keep the default-off finalize trace and > full-tail sentinel scaffold; no runtime speedup claim yet. Plan: > `docs/superpowers/plans/2026-07-02-moe-down-mmq-finalize-phase138.md`. > Artifacts: > `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_092943` > (`MOE_SWIGLU_DOWN` trace-only), > `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093617_full_tail` > (new full-tail sentinel), and > `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093731_canonical` > (canonical gates). The old `MOE_SWIGLU_DOWN` sentinel emitted six early > routed-FFN records but no weighted tail. The new `MOE_SWIGLU_FINALIZE` > sentinel passed default and Phase135-opt-in correctness (`7/7` each) and > emitted six supported tail records with `tail_nodes=16`, `views=8`, and > `adds=7`. Canonical patched-Phase93 gates passed MoE md5 `8cb0ce23`, dense > md5 `5951a5b4`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Next work may > implement default-off down-MMQ finalize/writeback against this sentinel first; > keep serving promotion gated by Phase135 decode/aggregate/kernel-time > thresholds. > > 2026-07-02 Phase138 attempt 2 update: keep the default-off down-MMQ > finalize/writeback candidate as a narrow positive, but do not promote it or > call parity. Plan: > `docs/superpowers/plans/2026-07-02-moe-down-mmq-finalize-phase138.md`. > Focused artifact: > `/home/mudler/bench/phase138_moe_down_mmq_finalize/20260702_095927_focused`; > canonical gates: > `/home/mudler/bench/phase138_moe_down_mmq_finalize/20260702_100202_canonical`; > serving: > `/home/mudler/bench/phase138_moe_down_mmq_finalize_serving/20260702_100330`. > The candidate adds `LLAMA_MOE_ROUTED_FFN_FINALIZE_POC=1` on top of Phase135, > zeroes the final output, atomically accumulates `down_sum * router_weight` > from the down-MMQ path, and skips the strict weighted tail only after the > finalize helper is selected. Focused `MOE_SWIGLU_FINALIZE` correctness passed > for default, Phase135, and Phase138 (`7/7` each); canonical and serving > pre/post gates passed MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, > `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Serving versus Phase135 moved > aggregate/decode t/s `208.0/332.7 -> 209.3/333.5`, total kernel time > `20.2498 -> 20.0489 s`, and `mmq_nvfp4 5915.24 -> 5802.87 ms`; however > `ew_add` remains visible at `374.09 ms`, so this is only an incremental > default-off improvement. Next work should reduce the remaining fan-in/writeback > path more deeply or return to the dominant `gdn_core`/`mmq_nvfp4` buckets. > > 2026-07-02 Phase139 result: serving noise-floor repeat rejects treating the > Phase138 one-off serving gain as source-funding evidence. Spec: > `docs/superpowers/specs/2026-07-02-serving-noise-floor-phase139-design.md`. > Plan: > `docs/superpowers/plans/2026-07-02-serving-noise-floor-phase139.md`. > Artifact: > `/home/mudler/bench/phase139_serving_noise_floor/20260702_081901`. > Seven identical current-binary Phase138 serving/profile runs all passed > pre/post gates: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, > `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. The runtime variance was much > larger than Phase138's one-off delta: aggregate throughput median > `208.5 t/s`, stdev `2.8022`, CV `1.349%`, range `203.4..212.3`; wall CV > `1.347%`; `mmq_nvfp4` CV `3.351%`. Keep Phase138 default-off as > correctness-clean and focused-positive, but do not stack another > finalize/MMQ micro-patch from serving evidence alone. Future serving claims > need repeated A/B medians and must exceed `max(2.0%, 3 * same-binary stdev)`. > The next source phase should pivot to a larger measured bucket, with GDN > packed decode/prep now more defensible than another MoE finalize shortcut. > > 2026-07-02 Phase140 result: reject an immediate in-GDN Q/K > L2-normalization patch. Spec: > `docs/superpowers/specs/2026-07-02-gdn-decode-prep-trace-phase140-design.md`. > Plan: > `docs/superpowers/plans/2026-07-02-gdn-decode-prep-trace-phase140.md`. > Artifact: > `/home/mudler/bench/phase140_gdn_decode_prep_trace/20260702_085348`. > The current Phase138 opt-in serving/profile shape passed pre/post gates: > MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and > `MUL_MAT_ID 806/806`. Serving/profile reported aggregate/decode throughput > `207.3/328.9 t/s`, wall `39.501 s`, total kernel `20.2002 s`, `GDN > 6673.66 ms`, `gdn_core 5890.44 ms`, and `gdn_l2norm 100.30 ms`. The focused > SQLite summary had `l2_norm_f32 100.3024 ms` versus > `gated_delta_net_cuda 5804.7074 ms`. This is above the absolute > three-sigma floor from Phase139 (`53.433 ms`) but below the planned `3%` of > GDN-core materiality threshold at about `1.7%`, so prep-only L2 fusion is not > source-funded. Next GDN work should be recurrence-level, packed-state, or > datacenter-Blackwell-specific, not another prep micro-fusion. > > 2026-07-02 Phase141 result: decode-only GDN source claims must normalize by > launch count or tightly control the capture window. Spec: > `docs/superpowers/specs/2026-07-02-gdn-decode-noise-floor-phase141-design.md`. > Plan: > `docs/superpowers/plans/2026-07-02-gdn-decode-noise-floor-phase141.md`. > Artifact: > `/home/mudler/bench/phase141_gdn_decode_noise_floor/20260702_090428`. > Five identical current-binary decode-only captures all passed pre/post gates: > MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and > `MUL_MAT_ID 806/806`. Raw `gdn_core_ms` median/stdev/CV was > `1415.500/30.641/2.146%`, range `1410.300..1482.140 ms`, but launch counts > drifted (`597`, `598`, `600`, `630`). Normalized `gdn_core_ms_per_launch` > was stable: median/stdev/CV `2.359167/0.005399/0.229%`. Future GDN A/B > source claims need repeated medians and must beat either `6.49%` raw > `gdn_core` reduction or `2.0%` launch-normalized reduction. The small > default-off source follow-up now worth testing is scalar gate/beta hoisting > inside `gated_delta_net_cuda`; vLLM-style packed decode recurrence remains a > larger redesign. Audience: an agent with **zero prior context** who has been told to "continue the GB10 vLLM-parity investigation" on the `llama-cpp-localai-paged` backend. This file is the **operational how-to**. It is the companion to `VLLM_PARITY_FINAL.md`, which is the **why / authoritative record** ("never re-litigate"). If the two ever disagree on a *fact*, `VLLM_PARITY_FINAL.md` and the bench artifacts it cites win; this file wins on *procedure* (how to ssh, lock, build, bench, profile). Read order for a cold start: 1. This file (TL;DR + hard gates + quickstart). 2. `VLLM_PARITY_FINAL.md` (the closed record, every number cites its artifact). 3. `.agents/vllm-parity-methodology.md` (the methodology: bit-exact gating, profile-don't-assume, both-engine ground truth). 4. The patch-series `README.md` (~44 KB, canonical backend doc) and `PAGED_BITEXACT_NOTE.md`. --- ## 1. TL;DR STATE > 2026-07-01 Phase104-108 update: the current carried source line is still the > Phase93 Qwen3Next grouped Q/K broadcast plus the Phase101/102 default-off > cleanup candidates. Phase104/106 same-session serving showed the stack is > md5/op clean but still far from vLLM: at `N=128`, paged/vLLM was about > `0.66` on decode and `0.50-0.51` on aggregate; at `N=192/256`, vLLM remained > faster and TTFT stayed about `3x` lower. Phase105 refreshed the grouped-MMQ > trace and found no new host-side tile-policy lever. Phase107 proved the MoE > structural correctness gates exist (`MOE_SWIGLU_DOWN 7/7`, > `MOE_WEIGHTED_COMBINE 7/7`, `MUL_MAT_ID_RAGGED_MOE 6/6`) but also proved > `test-backend-ops perf` did not time those custom whole-graph cases. Phase108 > fixed that measurement gap in `tests/test-backend-ops.cpp`: perf mode now > includes those MoE cases at `n_tokens=128,257`, and CSV output includes > `time_us`, `flops`, `memory_kb`, and `n_runs`. The Phase108 artifact is > `/home/mudler/bench/phase108_moe_perf_csv/20260701_221559`; md5s and compact > op gates are green. Use Phase108 rows as the baseline for any fused routed-MoE > implementation. Current ranking: `MUL_MAT_ID_RAGGED_MOE` is `1239-1446 us/run`, > `MOE_SWIGLU_DOWN` is `802-1020 us/run`, and `MOE_WEIGHTED_COMBINE` is only > `28-68 us/run`, so do not spend the next patch on weighted-combine fusion > alone. > Phase109 then tested existing env-gated routes on the Phase108 rows: > `LLAMA_W4A16_PREFILL_M=128`, `LLAMA_FP4_PREFILL_M=128`, > `LLAMA_MOE_DENSITY_MAX=9`, and `LLAMA_MOE_MMQ_X=64` > (`/home/mudler/bench/phase109_existing_moe_prefill_ab/20260701_222559`). > All selected correctness gates passed (`13/13` per env), but W4A16 and FP4 > large-M regressed the 257-token rows badly, and density/tile retuning was > noise-level on `MUL_MAT_ID_RAGGED_MOE` while not helping `MOE_SWIGLU_DOWN`. > Do not spend another phase on MMQ tile-policy shortcuts. The next credible > implementation is structural: port the vLLM-style idea of GPU-side > token/expert routing metadata (`sorted_token_ids`, expert offsets/bounds, > inverse permutation) into llama.cpp's `mul_mat_id` host-sync fallback/grouped > W4A16 path, while leaving the graph-safe grouped-MMQ path untouched. > Phase110 implemented the first slice of that structural path as default-off > `LLAMA_MOE_GPU_SORT=1` in `ggml_cuda_mul_mat_id`, reusing the existing > `mm_ids_helper` GPU sort for fallback metadata. The initial branch failed > `3/13` selected opt-in rows because `mm_ids_helper` returns sorted-to-original > `ids_dst`, while fallback `get_rows_cuda()` needs original-to-sorted > `ids_from_sorted`; adding a tiny inverse-permutation kernel fixed correctness. > Accepted artifact: > `/home/mudler/bench/phase110_gpu_moe_sort/20260701_224446_fix1`. Gates are > green: canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 > `5951a5b4d624ce891e22ab5fca9bc439`, and supported compact ops > `SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`, `GET_ROWS 49/49`, > `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806` for both > default and `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`. Perf decision: > keep as a default-off structural base only. It improves W4A16 fallback > 257-token rows by `7.2%` (`MOE_SWIGLU_DOWN`) and `7.9%` > (`MUL_MAT_ID_RAGGED_MOE`), but the opt-in fallback is still about `1.5x` > slower than default grouped-MMQ. Phase111 must remove another fallback > bottleneck, such as the remaining `expert_bounds` host copy / host tile > descriptor build, before this line can matter for parity. > Phase111 tested that narrow follow-up as default-off `LLAMA_W4A16_GPU_TILES=1`: > W4A16 tile descriptors were built on GPU from `expert_bounds_dev` with an > atomic tile counter. It was correctness-clean after fixing a pointer mutability > compile error and a CUDA pool LIFO allocation bug, but clean perf was > flat-to-negative (`MUL_MAT_ID_RAGGED_MOE n=257` regressed about `2.0%` versus > Phase110 GPU-sort). Artifact: > `/home/mudler/bench/phase111_w4a16_gpu_tiles/20260701_230400_fix1`. The > Phase111 source was reverted, and post-revert W4A16+GPU-sort selected gates > passed `13/13`. Do not reopen a standalone GPU tile descriptor cleanup; the > next W4A16 attempt must remove a larger boundary, such as direct activation > consumption plus GPU descriptors together, or bypass the host-sync fallback > path entirely. > > 2026-07-01 active update: Phase50-59 reopened the dense and MoE serving > scheduler question. > True dense decode is much closer to vLLM (`383.66` vs `435.00` t/s, `88.2%`) > than the Phase47 h2h aggregate suggested, while traced serving still shows > no pure decode-only steps and high TTFT. Phase53 rejected static lower > admission budgets; Phase54 histograms show prompt admission concentrated in a > few large chunks (`prompt_hist=513+:12`) with mostly full-width decode > (`decode_hist=128-255:53`). Phase55 implemented that targeted > first-token A/B as `LLAMA_TTFT_PREFILL_FIRST=1`: on dense `n=128` it improved > aggregate throughput `138.2 -> 142.9`, mean TTFT `23231.9 -> 21520.8 ms`, and > wall `59.272 -> 57.323 s`, with md5/op gates green. Phase56 then showed the > policy helps dense `n=32` but regresses MoE `n=128` mean TTFT > `7168.1 -> 7615.3 ms` and aggregate `341.1 -> 339.9`; keep it opt-in only and > do not default it broadly. Phase57 tried a per-step defer cap; cap32 improved > MoE mean TTFT in one same-window sweep but still lost aggregate and wall, and > dense caps lost aggregate. Phase58 added a prompt-backlog threshold; min32 > improved MoE `n=128` aggregate `339.0 -> 341.9`, mean TTFT > `7743.1 -> 7420.1 ms`, and wall `24.167 -> 23.950 s` in the same window, while > dense `n=128` was mixed. Phase59 repeated MoE min32: aggregate stayed flat > (`336.6 -> 336.9`), mean TTFT improved (`7798.5 -> 7167.8 ms`), and wall stayed > flat (`24.334 -> 24.316 s`) with md5/op gates green. Matching vLLM was still > far ahead (`601.3` aggregate, `2968.1 ms` mean TTFT), so min32 is an opt-in > llama.cpp QoS knob, not a parity-closing lever. The trace and scheduler commits > are local and DGX-gated but not pushed, so the LocalAI patch series has not > been regenerated. > > 2026-07-01 Phase81-85 update: the next viable GDN lever is no longer launch > shape or gather removal. A default-off Qwen35/Qwen35MoE BF16 persistent > recurrent S-cache experiment (`LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16`) cut > same-source decode-only `gdn_core` from `1399.30 ms / 599 launches` > (`2.34 ms/launch`) to `863.57 ms / 720 launches` (`1.20 ms/launch`). Default > F32 md5 gates and op gates stayed green, and BF16 dense md5 stayed canonical, > but BF16 MoE md5 changed to `07db32c2bcb78d17a43ed18bc22705cd`. A quick > MoE KL smoke vs the same-source F32 base showed KLD `0.055499 +/- 0.001705`, > same-top-p `88.361%`, and PPL ratio `1.010356`. Phase82 then ran the full MoE > f16-reference gate at > `/home/mudler/bench/phase82_bf16_s_cache_f16_kl/20260701_183016`: same-source > F32 measured KLD `0.136563 +/- 0.003242`, while BF16 S-cache measured > `0.137162 +/- 0.003456` against the documented paged acceptance reference > `0.136000 +/- 0.003285`. Reject promotion and do not run serving A/B for this > candidate under the current hard KL rule. Phase83 then tested a bit-exact > KDA `expf(g)` register-cache shortcut in the GDN CUDA core. Md5 and op gates > stayed green, but same-window decode-only `gdn_core` moved > `1399.46 -> 1405.62 ms`, so reject that micro-optimization too. Phase84 > reduced in-place GDN op outputs to attention-only tensors and moved the CPU > ids fallback scratch to workspace; md5/op gates stayed green and startup free > CUDA memory improved `117472 -> 117855 MiB`, but same-window decode-only > `gdn_core` moved `1399.72 -> 1407.38 ms`. Treat Phase84 as a possible > memory-footprint cleanup only, not a speed parity lever. Phase85 added a > graph-reuse-safe identity-contiguous recurrent-state fast path: it calls > `ggml_gated_delta_net_inplace` on a direct state view when `s_copy_main` is > identity, otherwise keeps the ids path. Md5/op gates stayed green, the > `gdn_gather` fine bucket disappeared, GDN macro launches dropped > `3600 -> 2980`, and same-window `gdn_core` moved `1412.33 -> 1400.34 ms`. > Carry Phase85 only as a small cleanup candidate. Phase86 audited the producer > fusion idea against the Phase85 node-traced profile before coding it: the > whole `act/GDN-gate(shared)` macro is only `13.57 ms` of `3.6622 s`, beta > sigmoid is `2.73 ms`, and CUDA already fuses `UNARY + MUL` for softplus, > sigmoid, and SILU. Reject producer-only fusion as too small. Phase87 then > exposed an env-gated `GDN_NW=4 GDN_CPW=8` decode geometry probe to test a > vLLM-like `BV=32` tile shape. It was md5/op green, but same-source > decode-only `gdn_core` regressed `1390.56 -> 1417.13 ms`, so the source line > was reverted. Phase88 tried a first default-off `GDN_DECODE_PACK2=1` packed > decode CTA kernel. It built and CUDA op tests stayed green, but canonical md5 > failed for both MoE (`320b5ed...` vs `8cb0ce...`) and dense (`6a65e9...` vs > `5951a5...`), with visible output corruption, so it was reverted without > profiling. Phase89 tried to add that focused guardrail through > `test_gated_delta_net_inplace_ids`, but selecting that test class directly > already fails the pre-existing BF16 cases on CUDA, so the naive test addition > was also reverted. Phase90 fixed the fixture root cause for identity ids by > mirroring `state` into `state_dst` during initialization and added F32 > `S_v=128`, `n_seqs=2` cases that return `concat(out,state_dst)`, so the > backend comparator now checks both attention output and the side-effect state > write. DGX CUDA selected-op gate is green (`4/4`). Use this Phase90 guardrail > before any new packed-decode kernel, then still run canonical md5/op gates. > Phase91 retried the default-off `GDN_DECODE_PACK2=1` CTA sequence-packing > kernel under that guardrail. The first `n_seqs=2` guardrail passed but MoE md5 > failed for the single-sequence completion gate, exposing an uncovered odd/single > sequence PDL hazard. Moving inactive lanes past `ggml_cuda_pdl_sync()` and > adding `n_seqs=1,3` guardrail cases made the candidate md5/op clean > (`GATED_DELTA_NET 46/46`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), but > decode-only `gdn_core` regressed to `1425.44 ms`, so the runtime patch was > reverted. Keep the expanded guardrail; do not retry CTA-level sequence packing > unless it also reduces per-sequence GDN work. ids gather, producer overhead, > simple geometry changes, and ungated packed kernels are not acceptable parity > paths. Phase92 tried the next smallest scalar one-token recurrence > micro-optimization: a default-off `GDN_SCALAR_DECODE_STORE_FUSED=1` CUDA path > that stores final state inside the scalar update loop and skips the final > post-token register-store loop. It passed local CPU guardrail, DGX CUDA > guardrail, canonical md5s, `GATED_DELTA_NET 46/46`, `MUL_MAT 1146/1146`, and > `MUL_MAT_ID 806/806`, but decode-only `gdn_core` regressed further to > `1529.72 ms` (`/home/mudler/bench/phase92_gdn_scalar_store_fused/20260701_204718/decode_profile`), > so the runtime patch was reverted. Do not retry store-fusing without evidence > that the final state store loop is independently dominant. The next credible > scoped ideas from the vLLM audit are the larger packed decode contract and the > Qwen3Next GQA-repeat removal, each as a separate guarded phase. Phase93 > implemented the Qwen3Next GQA-repeat removal as an explicit grouped Q/K > broadcast mode on `GGML_OP_GATED_DELTA_NET` (`op_params[2]`), preserving the > existing modulo/tiled broadcast for Qwen35 while allowing Qwen3Next to map > `qk_head = value_head / (H_v / H_k)` and skip materializing repeated q/k heads > when the GDN op path is active. Local CPU `GATED_DELTA_NET` passed `48/48`, > local CPU in-place ids passed `6/6`, DGX CUDA `GATED_DELTA_NET` passed `48/48`, > DGX CUDA in-place ids passed `6/6`, canonical md5/op gates passed > (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), and > decode-only `gdn_core` improved to `1333.48 ms` > (`/home/mudler/bench/phase93_qwen3next_gqa_bcast/20260701_211019/decode_profile`). > Carry Phase93 as the current positive candidate. Phase94 then retested > decode geometry on top of Phase93 with env-only `GDN_NW=8 GDN_CPW=8`. It > stayed md5/op clean (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, > `MUL_MAT_ID 806/806`) but decode-only `gdn_core` regressed to `1440.79 ms` > (`/home/mudler/bench/phase94_gdn_geometry_phase93/20260701_211855/decode_profile_8x8`), > so reject 8x8 and keep Phase93's default 16x8 geometry. Phase93 trace evidence > also shows remaining producer-side GDN work is small (`l2_norm_f32 8.65 ms`, > GDN gate/sigmoid about `12.75 ms`, remaining repeat `5.34 ms`), so the next > useful lead should target recurrence work or a larger packed decode contract, > not another small producer-only fusion. Phase95 tested a default-off > `GDN_WARP_SCALAR_GATE=1` CUDA decode specialization on top of Phase93: lane 0 > computed the scalar non-KDA gate and broadcast it within the warp for the > one-token `S_v=128`, default `16x8` path. Local CPU guardrails passed > (`GATED_DELTA_NET 48/48`, in-place ids `6/6`), DGX CUDA guardrails passed > (`GATED_DELTA_NET 48/48`, in-place ids `6/6`), canonical md5/op gates passed > (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), but > decode-only `gdn_core` regressed to `1402.40 ms` > (`/home/mudler/bench/phase95_gdn_warp_scalar_gate/20260701_213311/decode_profile`). > The runtime patch was reverted. Do not retry scalar-gate warp broadcast unless > a future profile shows SFU pressure, rather than recurrent state traffic or > reductions, dominating the GDN core. Phase96 then tested the narrow > conv-state identity fast path suggested by the trace audit: when > `s_copy_main` was identity, `build_conv_state_fused` viewed the active > conv-cache slots directly and called `ggml_ssm_conv_update_inplace` instead of > the ids variant. Local CPU `SSM_CONV` passed `45/45`; DGX CUDA `SSM_CONV` > passed `45/45`; canonical gates passed (`SSM_CONV 45/45`, > `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`, md5s > canonical). Decode-only profile regressed to total kernel `3.6723 s`, > `gdn_core 1406.57 ms`, and `gdn_conv 70.42 ms` > (`/home/mudler/bench/phase96_conv_identity_fastpath/20260701_214141/decode_profile`). > The runtime model-graph patch was reverted. Do not retry the conv identity > branch as a speed lever unless a same-window trace proves the ids variant is > independently dominant. Phase97 then measured the carried Phase93 stack in an > end-to-end `n=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128` serving snapshot > against vLLM. Pre/post canonical gates stayed green. Paged Phase93 measured > `agg_tps 329.6`, `decode_agg_tps 669.8`, `prefill_tps 1734.5`, > `ttft_mean_ms 7415.4`, `wall_s 24.851`; vLLM measured `agg_tps 664.8`, > `decode_agg_tps 1029.4`, `prefill_tps 5271.8`, `ttft_mean_ms 2519.5`, > `wall_s 11.929` > (`/home/mudler/bench/phase97_phase93_serving_snapshot/20260701_214648`). > Phase93 therefore remains a decode-profile positive candidate, but it does not > close serving parity (`paged_decode_over_vllm=0.6507`). The next useful phase > needs a larger serving-impact lever; isolated GDN/conv micro-optimizations > have now repeatedly failed to move live serving enough. Phase98 profiled that > carried Phase93 serving window with graph-node CUDA tracing. Pre/post gates > stayed green. Total kernel time was `20.0411 s`; macro buckets were GDN > `6679.96 ms` (`33.33%`), MoE/FFN-GEMM `6034.52 ms` (`30.11%`), > bf16/fp8-proj `2766.06 ms` (`13.80%`), and layout-copy `1257.60 ms` > (`6.28%`). Fine buckets were led by `gdn_core 5892.99 ms` (`29.40%`) and > `mmq_nvfp4 5809.55 ms` (`28.99%`), followed by `convert_dtype 663.45 ms`, > `gdn_conv 457.11 ms`, and `concat_layout 430.25 ms` > (`/home/mudler/bench/phase98_phase93_serving_profile/20260701_215715`). > This re-ranks the next work: do not spend more time on scalar GDN, conv > identity, or gather-only shortcuts. Either attribute and remove a proven > material layout-copy node, or pursue a larger GDN-core/MMQ serving lever with a > standalone PoC gate. Phase99 then used the existing default-off > `LLAMA_LAYOUT_TRACE` hook on the same Phase93 serving profile shape > (`N=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`). Trace-enabled gates stayed > green (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`, > canonical MoE/dense md5s). Serving remained comparable (`total kernel > 20.2408 s`, `layout-copy 1269.35 ms`). The trace attributed > `concat_layout 440.01 ms` almost entirely to > `conv_input-* = concat(conv_states_reshaped-*, qkv_mixed_transposed-*)` before > `SSM_CONV`; `copy_layout 119.16 ms` includes `conv_state_update-*` writeback. > The larger `convert_dtype 662.34 ms` bucket is mostly unnamed F32-to-F16 `CPY` > rows and needs stronger attribution before coding. Decision: Phase99 is > measurement-only; do not retry the Phase96-style conv-state identity branch. > The only conv-side patch worth funding is a larger two-source `SSM_CONV` > contract that reads `(conv_states, qkv_mixed)` as a logical concat, or else > extend trace attribution for the unnamed `convert_dtype` bucket first > (`/home/mudler/bench/phase99_layout_trace/20260701_200835/serving_profile`). > Phase100 extended that trace with `dst_view`, `src0_view`, and `src1_view` > names. The trace-only patch built locally and on DGX, and trace-enabled gates > stayed green (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, > `MUL_MAT_ID 806/806`, canonical MoE/dense md5s). Serving stayed comparable > (`total kernel 20.3464 s`, `convert_dtype 661.73 ms`, `concat_layout > 438.15 ms`). The new fields identify a concrete `convert_dtype` source: > `GET_ROWS` reads F16 `cache_k_l*` / `cache_v_l*` into F32 `node_*`, then > `CPY` downcasts views such as `src0_view=node_358` / `node_365` to F16 > attention-shaped tensors. This repeats across attention layers > (`cache_k_l3/v_l3`, `cache_k_l7/v_l7`, `cache_k_l11/v_l11`, ...). Some F32->F16 > rows remain unnamed, so the next runtime phase should be a narrow K/V cache > get_rows dtype A/B, not a broad layout rewrite > (`/home/mudler/bench/phase100_layout_view_trace/20260701_201800/serving_profile`). > Phase101 implemented that narrow A/B as default-off > `LLAMA_PAGED_KV_GET_ROWS_F16=1`: add `ggml_get_rows_type`, support CPU F16 > source -> F16 destination row copy, and use typed F16 `GET_ROWS` only for > paged K/V gather when the cache tensor is F16. Local and DGX builds completed; > CUDA `GET_ROWS` passed `49/49` including the new F16-output cases; default and > opt-in md5/op gates stayed green (`GET_ROWS 49/49`, `GATED_DELTA_NET 48/48`, > `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`, canonical MoE/dense md5s). > Serving profile under opt-in measured `total kernel 20.1989 s`, `agg_tps > 206.4`, `decode_agg_tps 328.0`, and `ttft_mean_ms 8211.1`. It reduced > `copy_layout 116.25 -> 80.32 ms` and macro `layout-copy 1262.58 -> 1220.30 ms` > versus Phase100, but `convert_dtype` stayed flat (`661.73 -> 661.35 ms`) and > serving throughput did not improve. Carry Phase101 only as a small default-off > cleanup candidate pending repeat A/B; do not promote it as a parity lever > (`/home/mudler/bench/phase101_kv_get_rows_f16/20260701_203930/serving_profile`). > Phase102 then implemented the funded two-source `SSM_CONV` contract as > default-off `LLAMA_SSM_CONV_SPLIT=1`: `ggml_ssm_conv_split(ctx, conv_states, > x_cur, conv_kernel)` reuses `GGML_OP_SSM_CONV`, reads > `[K-1,channels,n_seqs]` cached taps plus native `[channels,n_tokens,n_seqs]` > qkv tokens as a logical concat, and is wired into Qwen3Next/Qwen35/Qwen35MoE > only for multi-token, non-rollback batches with `n_seq_tokens >= K-1`. The > initial semantic test exposed a harness issue (`split-base` has an exactly > zero CPU reference, so normalized MSE reported `ERR=inf`); direct split > CUDA-vs-CPU passed `6/6`, and the final test keeps `split-base` with absolute > max error. Local and DGX builds passed; default, standalone opt-in, and > serving pre/post gates stayed green (`SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`, > `GET_ROWS 49/49`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, > `MUL_MAT_ID 806/806`, canonical MoE/dense md5s). Opt-in serving measured > `total kernel 19.5482 s`, `agg_tps 206.1`, `decode_agg_tps 320.0`, > `prefill_tps 1538.0`, and `ttft_mean_ms 7928.4`. It removed the traced concat > materialization (`concat_layout 433.13 -> 4.59 ms` versus Phase101 and > `layout-copy 1220.30 -> 826.87 ms`), but live serving throughput still did not > improve. Carry Phase102 as a default-off cleanup/follow-up base only; do not > promote it as parity-closing without a repeat A/B or an additional state-update > fusion. The remaining high-value targets are still `gdn_core`, `mmq_nvfp4`, or > a larger serving scheduler/packed-decode contract > (`/home/mudler/bench/phase102_ssm_conv_split/20260701_210907/serving_profile`). > Phase103 measured Phase101+Phase102 together, with no new source changes: > `LLAMA_SSM_CONV_SPLIT=1 LLAMA_PAGED_KV_GET_ROWS_F16=1`. Standalone and > serving pre/post gates stayed green (`SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`, > `GET_ROWS 49/49`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, > `MUL_MAT_ID 806/806`, canonical MoE/dense md5s). Combined serving improved > over Phase102 (`agg_tps 206.1 -> 212.3`, `decode_agg_tps 320.0 -> 331.5`, > `prefill_tps 1538.0 -> 1569.1`, `wall_s 39.743 -> 38.575`) and reduced > `layout-copy 826.87 -> 798.52 ms`; it also preserved most of the split > SSM_CONV concat removal and recovered the F16 K/V `copy_layout` reduction > (`copy_layout 112.53 -> 78.22 ms`). This proves the two cleanup candidates are > compatible, but not parity-closing: `gdn_core 5930.47 ms` and `mmq_nvfp4 > 6001.77 ms` still dominate. Carry the combined env as the cleanup comparison > baseline; do not rerun isolated layout cleanup unless it changes a larger > serving contract > (`/home/mudler/bench/phase103_combined_layout_cleanups/20260701_211821/serving_profile`). > Phase104 then measured that combined cleanup stack in the normal same-session > serving harness against vLLM at `N=128`, `PTOK=128`, `GEN=64`, > `PARALLEL=128`. Pre/post gates stayed green with the same expanded op set and > canonical md5s. Paged combined measured `agg_tps 338.6`, > `decode_agg_tps 675.8`, `prefill_tps 1813.0`, `ttft_mean_ms 7121.6`, and > `wall_s 24.196`; vLLM measured `agg_tps 661.1`, `decode_agg_tps 1028.0`, > `prefill_tps 5208.7`, `ttft_mean_ms 2572.3`, and `wall_s 11.980`. This is a > small serving improvement over Phase97 (`agg_tps +2.73%`, `prefill_tps > +4.53%`, `TTFT -3.96%`), but still not parity: `paged_decode_over_vllm=0.6574` > and `paged_agg_over_vllm=0.5122`. Carry the combined cleanup stack as the best > current comparison baseline. The next useful phase must attack a larger > serving-impact contract or the dominant GDN/MMQ buckets, not more isolated > layout-copy cleanup > (`/home/mudler/bench/phase104_combined_serving_snapshot/20260701_212551`). > Phase105 refreshed grouped-MMQ evidence on that current stack without source > changes. `MUL_MAT_ID_RAGGED_MOE` stayed green both default and trace-enabled > (`6/6`), full `MUL_MAT_ID` stayed green (`806/806`), and the live serving > retry returned a non-empty response while recording `120` shape and launch > lines. The live sample was prefill-like (`ncols_max=317`, density `10`, > `mmq_x_best=112`, `stream_k=1`) with no small-M lines; all launches had > `fixup=0`, `stream_k_blocks == ntiles_dst`, and efficiency `100`. This > confirms the current cleanup stack did not open a new cheap MMQ shortcut. > Do not add another host-side MMQ tile policy; only revisit MMQ for a > genuinely structural kernel or serving-contract change > (`/home/mudler/bench/phase105_mmq_current_shape/20260701_214129_serving_retry`). > Phase106 tested the remaining low-conflict C1 operating-point hypothesis on > the current stack: same-session `N=128/192/256` with `PARALLEL=256`, > `VLLM_MAX_NUM_SEQS=256`, and the combined cleanup env. Pre/post gates stayed > green with canonical md5s and the expanded op set. vLLM completed all legs and > stayed ahead: at `N=256`, paged measured `agg_tps 338.4`, > `decode_agg_tps 824.6`, `ttft_mean_ms 14933.5`, while vLLM measured > `agg_tps 723.8`, `decode_agg_tps 1320.4`, `ttft_mean_ms 4999.0`. Reject C1 > for the current GB10 stack. The next source phase should be structural > persistent-batch/fused-MoE/GDN work, not another scheduler shortcut > (`/home/mudler/bench/phase106_max_concurrency_current_stack/20260701_214907`). > Phase107 established the fused-MoE structural guardrail surface before coding: > `MOE_SWIGLU_DOWN 7/7`, `MOE_WEIGHTED_COMBINE 7/7`, and > `MUL_MAT_ID_RAGGED_MOE 6/6` passed on CUDA0. However, > `test-backend-ops perf` did not provide usable timing rows for these custom > whole-graph cases; the broad `MUL_MAT_ID` perf CSV reported support metadata > only. The next source patch should be measurement-only: add a narrow MoE > fusion timing harness with explicit GPU synchronization and CSV timing before > funding any fused routed-MoE kernel > (`/home/mudler/bench/phase107_moe_fusion_guardrail/20260701_220227`). - Historical verdict: the older investigation marked GB10 parity **CLOSED** and unreachable. Treat that as superseded where Phase50-54 provide newer dense serving evidence. - **Prefill** is a genuine floor at **~36% (MoE) / ~43% (dense)** of vLLM. Prefill is **not** CUDA-graph-replayed, so these numbers are real, not measurement artifacts. - **Decode** is **near-parity: ~86% of vLLM's TRUE GPU-steady decode** (924 vs 1078 t/s). The long-standing **~56% headline was a CUDA-graph measurement artifact** (nsys without `--cuda-graph-trace=node` collapses each graph replay into one opaque launch). Decode is also **ahead of vLLM at low concurrency** (dense 116.7% at N=8) and uses **1.5-3x less memory**, bit-exact per-path. - The lever search was **exhaustive**: every attempt (prefill GEMM, GDN chunked scan, decode fusions, serving/scheduler) is recorded with its verdict and number so it is **not re-run**. - **The path to parity is different hardware: datacenter Blackwell** (B200, HBM, native tcgen05 / CUTLASS FP4). Do NOT reopen GB10 kernels. Re-run the methodology on the new silicon, where vLLM's GB10-losing FLA/Marlin kernels invert. --- ## 2. THE HARD GATES YOU MUST NOT VIOLATE These are non-negotiable. Violating any of them invalidates the result or the contribution. ### 2.1 The per-path greedy-md5 bit-exact gate (sacred) The gate is **per-path**: paged vs non-paged attention legitimately produce different (equivalent) FP-reduction orders. Each path is gated against **its own** reference, validated benign by KL-divergence to the f16 reference. Canonical greedy md5s: | Path | Model | Canonical md5 | |---|---|---| | non-paged | MoE q36-35b-a3b-nvfp4 | `07db32c2bcb78d17a43ed18bc22705cd` | | **paged** | MoE q36-35b-a3b-nvfp4 | `8cb0ce23777bf55f92f63d0292c756b0` | | non-paged | dense q36-27b-nvfp4 | `5951a5b4d624ce891e22ab5fca9bc439` | | paged | dense q36-27b-nvfp4 | `5951a5b4d624ce891e22ab5fca9bc439` (bit-exact to non-paged) | - **Compare paged-to-paged only.** Future paged-MoE regressions compare to `8cb0ce23`, NOT `07db32c2`. - **Why paged-MoE differs (benign, KL-validated):** `llama-perplexity --kl-divergence` on the MoE GGUF (16 chunks, f16 base PPL 7.3734) shows non-paged-vs-f16 KLD 0.136597 and paged-vs-f16 KLD 0.136000, i.e. paged does NOT diverge from f16 ground truth more than non-paged does. Paged and non-paged are two equivalent FP-reorderings of the same 4-bit model. This holds on the 0028 baseline and with `LLAMA_MOE_FORCE_GRAPHS`/0029 on or off, so it is a property of the paged path, not any one lever. - **Every bit-exact patch is gated two ways:** greedy md5 (per path) AND `test-backend-ops` vs the CPU oracle for every touched op. ### 2.2 The KL-gate for opt-in lossy paths Any path that is NOT byte-identical (e.g. 0033 dequant-bf16, the 0034/0035 large-M FP paths, FP8-KV) ships **default-off** and is gated by a **KL-divergence band**: it requires `KLD(new||f16) <= KLD(FP4-MMQ||f16)` and PPL within the established band. Lossy levers never ship default-on. ### 2.3 In-backend A/B is the only proof (hard methodology rule) A lever compiled into the binary is **NOT** isolated by a runtime flag alone. It needs a **separately-built in-backend A/B**. Precedents that burned this in: 0031 chunking math was correct yet -22% in-backend; 0034 had a standalone PoC win that did not hold in-backend. ### 2.4 Contribution / commit gates (LocalAI policy) - **DCO sign-off is human-only:** do not add an AI `Signed-off-by` trailer. - **AI attribution via `Assisted-by:` trailer:** `Assisted-by: Codex:gpt-5`. - **NEVER add `Co-Authored-By:` (AI) trailers** and never add an AI `Signed-off-by`. - **No em-dashes** anywhere in output (use `-`, `:`, parentheses, or rephrase). - **Ask before every `git push`.** Prior approval does not carry over. ### 2.5 Fork-first is MANDATORY (the fork is canonical) - The **canonical source of truth is the fork branch `mudler/llama.cpp:localai-paged`** (= pin commit + paged patch commits in order). It is canonical for ALL paged-backend kernel/patch work. The shipped `patches/paged/*.patch` series is a **derivative**: the fork is the source. - **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export. - **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source. - The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated. - **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. Current verified state: fork HEAD `2d590d770` is mirrored by worktree patch `0063-feat-cuda-trace-cublas-tensor-names.patch`; applying all `54` patch files on `0ed235ea2c17a19fc8238668653946721ed136fd` produces tree `dedb1182910eafe9f6875588dc8285bfb544cce5`, exactly matching the fork. ### 2.6 Bench hygiene gates - **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS"). - Do **not** set `GDN_TC`, `GDN_CHUNK_MIN`, or `LLAMA_PAGED_DECODE_STABLE` in parity benches. Production defaults are compiled in: **GDN M5 on (`GDN_TC=5`, `GDN_CHUNK_MIN=64`), S1 decode-graph on, S3 off.** - **Decode profiling MUST use `nsys --cuda-graph-trace=node`** (see section 3.4). This is a gate, not a suggestion. --- ## 3. OPERATIONAL QUICKSTART (copy-pasteable) ### 3.0 Host ``` ssh dgx.casa # resolves to hostname promaxgb10-4ad8; GPU = NVIDIA GB10 (unified LPDDR5x, ~273 GB/s, the bandwidth floor) ``` `nvidia-smi` reports memory as `[N/A]` (unified memory). CUDA 13 / sm_121. ### 3.1 GPU lock protocol (`~/gpu_bench_lock`) - TWO conventions, reconcile carefully There are two conventions in flight: - **Old harnesses** (`combined_definitive.sh`, `fuse_validate.sh`, `fuse_profile.sh`) treat it as an **empty mutex dir**: `mkdir ~/gpu_bench_lock` to acquire, `rmdir` to release. - **Newer harnesses** (`fp4norm_profile.sh`) use an **owner-file convention**: `mkdir -p ~/gpu_bench_lock` then `echo "$ME $(date +%s)" > ~/gpu_bench_lock/owner`. They poll until `nvidia-smi --query-compute-apps=pid` count is 0 AND `owner` is `FREE*`/absent for 2 consecutive checks, and clear a stale `~/gpu_bench_lock/release` file. Release **writes** `FREE released-by-... $(date +%s)` to `owner` (it does NOT remove the dir). Because the dir now permanently contains an `owner` file, **release with `rm -rf ~/gpu_bench_lock`, NOT `rmdir`** (rmdir fails on the non-empty dir). Recommended procedure for a future agent: 1. Read `~/gpu_bench_lock/owner`. `FREE*`/absent + 0 compute-apps means free. 2. Acquire via `mkdir -p ~/gpu_bench_lock` + write `owner`. 3. Release by writing `FREE ...` to `owner` (or `rm -rf ~/gpu_bench_lock`). A separate 0-byte `~/bench/gpu.lock` is legacy/unrelated - ignore. **Always gate on ALL THREE** before benching or building on DGX: `nvidia-smi --query-compute-apps=pid` count == 0, `owner` FREE, and `docker ps` shows no running containers. In particular, do not start work while a `local-ai-worker` container is running. Concurrent jobs share this GPU: an offline-repack Marlin workflow, an `~/.cache/autoresearch-quant/` quant pipeline (this is the `llama-imatrix` class of job), finetune trees, and LocalAI worker containers. The canonical harnesses poll for GPU-idle up to 2h. ### 3.2 Build (long; run detached + poll) - **Mainline / canonical grpc-server + binaries: CUDA arch `121`** (`-DCMAKE_CUDA_ARCHITECTURES=121`). Runtime banner shows `ARCHS = 1210 | BLACKWELL_NATIVE_FP4 = 1`. - **FP4-MMA / tensor-core experimental kernels: the accelerated `121a` gencode** (`arch=compute_121a,code=[compute_121a,sm_121a]`). The `a` suffix unlocks tcgen05 / native FP4-MMA intrinsics. `121a` lives ONLY in the DGX experimental build scripts (`~/gdn_cc.sh` standalone nvcc, `~/gdn_bv_build.sh` `-DCMAKE_CUDA_ARCHITECTURES=121a`, `~/paged-build.sh` `--build-arg CUDA_DOCKER_ARCH=121a`), not in the worktree build files. Supply it at build time via `CMAKE_CUDA_ARCHITECTURES` / `CUDA_DOCKER_ARCH`. - **Long builds: run detached and poll for a marker.** Pattern: `nohup ... > build.log 2>&1 &` then poll for a `.DONE`/`.done` file. Do NOT block a foreground shell. Built binaries live at `dgx:~/llama-paged-dev/build-cuda/bin/` (`llama-server`, `llama-batched-bench`, `llama-completion`; thin ~70 KB dynamic wrappers). ### 3.3 The standard bench env + commands ``` cd /home/mudler/llama-paged-dev/build-cuda/bin L="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1" # GGML_NO_BACKTRACE is log-hygiene, not a lever MOE=/home/mudler/bench/q36-35b-a3b-nvfp4.gguf # arch qwen35moe, ~22.2 GiB DENSE=/home/mudler/bench/q36-27b-nvfp4.gguf # arch qwen35, ~17.5 GiB # (1) Bit-exact / coherence gate. stdin MUST be /dev/null or it hangs in conv mode. env $L ./llama-completion -m "$MOE" -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -no-cnv \ -p "The capital of France is" /home/mudler/bench/paged_server.log 2>&1 & # poll http://127.0.0.1:8090/health for '"ok"', then: python3 /home/mudler/bench/h2h_cli3.py # OpenAI /v1/completions, ignore_eos, fresh-nonce, ptok128 gen128, NPL sweep 8/32/128/256 ``` **vLLM side** (for both-engine parity): `~/vllm-bench/bin/vllm` (version **0.23.0**), served `gpu-util 0.85 max-model-len 4096 max-num-seqs 256 tp1`, models `~/bench/q36-35b-a3b-nvfp4-vllm/` and `~/bench/q36-27b-nvfp4-vllm/`. **Current-stack serving snapshots use `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`.** It targets the clean `~/llama-phase6-source` mirror, checks docker/`local-ai-worker`/GPU-idle state, uses the owner-file lock, runs pre/post inference gates, then compares paged and vLLM with the same h2h client. The older `dgx:~/bench/combined_definitive.sh` is historical: do not reuse it without first porting away from stale `~/llama-paged-dev` paths and old lock assumptions. The harness also writes `hardware.txt` before any server starts, including `DRY_RUN=1`, so every new snapshot records the GPU model, driver, compute capability when exposed by `nvidia-smi`, and a conservative hardware class. Full runs also write `gate_summary.tsv` after the post gate, summarizing pre/post MoE md5, dense md5, and backend-op checks; use `paged-current-serving-snapshot.sh --summarize-gates ART` to backfill or audit an existing snapshot without starting servers. ### 3.4 THE DECODE-PROFILING RULE (this trap caused 4 wrong analyses) Decode runs as a **replayed CUDA graph**. `nsys` **without** `--cuda-graph-trace=node` collapses each graph replay into ONE opaque launch, so every per-kernel attribution becomes an artifact. This is exactly what made the old "paged 159 us/tok, GPU ~16% busy, host-bound, 5.4x more GPU-efficient" story wrong, and produced the wrong ~56% headline. Mandatory method for any decode profile: - Use **`nsys --cuda-graph-trace=node`**. - Decompose with the **difference method**: per-token cost = (ntg=64 profile) - (ntg=16 profile). Under the correct method, paged decode at npl=256 is **99% GPU-busy (1.4% idle), NOT host-bound** - the opposite of the collapsed-graph reading. The clean graph-node-traced profiles are at `~/highN_prof2/*.nsys-rep` (paged, npl=256) and `~/highN_vllm/*.nsys-rep` (vLLM), captured 2026-06-30. They **supersede every earlier decode decomposition.** ### 3.5 Models + artifacts (all on DGX) GGUF (paged): `~/bench/q36-35b-a3b-nvfp4.gguf` (MoE, qwen35moe), `~/bench/q36-27b-nvfp4.gguf` (dense, qwen35). vLLM safetensors: `~/bench/q36-35b-a3b-nvfp4-vllm/` (has `hf_quant_config.json` confirming MIXED_PRECISION / FP8-proj), `~/bench/q36-27b-nvfp4-vllm/`. Authoritative run: `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, per-engine `COMBINED_*_server.log`). A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`. NOTE: the `*_RESULTS*`/`*_MAP*` docs live only in the worktree `docs/`, not on the DGX. --- ## 4. THE COMPLETE LEVER MAP (do NOT re-run the rejected ones) Verdicts and numbers are from `VLLM_PARITY_FINAL.md` + the cited artifacts. "BE" = greedy-md5 bit-exact; "KL-benign" = lossy path inside the KL band. ### 4.1 Prefill weight-GEMM track - WHOLE TRACK REJECTED (FP4-MMQ is optimal on GB10) Decisive surprise: on sm_121 **vLLM itself does NOT run native FP4** - it runs **Marlin W4A16** (FP4 dequant->bf16 in-register + bf16 GEMM) for experts and FP8 projections, capped at ~half FP4 peak, because native CUTLASS NVFP4 grouped-GEMM is broken on consumer Blackwell (TMA-WS init failure, CUTLASS #3096; no tcgen05/TMEM). So MMQ's native FP4 is already structurally competitive here. | Lever | What | Verdict | Key number | |---|---|---|---| | 0033 dequant->bf16 cuBLAS | route large-M NVFP4 dense GEMM to dequant->bf16 cuBLAS | REJECTED, ships default-off | dense S_PP -49%/-42%/-29% at M=512/1024/2048; BE + KL-better | | dense-cuBLAS reroute (full) | same across dense+MoE prefill | REJECTED | -31% to -62% band | | 0034 native FP4-MMA W4A4 | Blackwell `mxf4nvf4` OMMA large-M | REJECTED in-backend | PoC 103 TFLOP/s (57.7% FP4 peak, NMSE=0) but win did not hold in-backend | | 0035 W4A16-Marlin grouped MoE | FP4->bf16 in-register + bf16 mma, zero act-quant tax | REJECTED (perf) | correct + KL-benign-and-better but **-39%** S_PP vs MMQ | | 0045/0046 offline-repack / vLLM-verbatim Marlin | repack to Marlin layout; port vLLM kernel verbatim | REJECTED | verbatim correct but -39%; offline-repack same bf16-peak ceiling, no win | Why it loses: bf16 TC peak on GB10 is ~half FP4 peak, so any dequant->bf16 kernel caps at ~half FP4-MMQ; the dequant write is an un-amortized weight-sized memory pass (~8x the FP4-read traffic). **The GEMM bucket is not winnable on GB10 with available kernels.** ### 4.2 Prefill GDN chunked-scan track - M5 tf32 C=16 is the SHIPPED winner GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_gated_delta_rule` runs the same math at 36.5 vs paged 95.7 us/tok = 2.62x via tensor-core intra-chunk Gram products. | Lever | What | Verdict | Key number | |---|---|---|---| | 0031 scalar-serial chunked scan | FLA-style scalar/serial (`GDN_TC=0`) | superseded | correct but ~22% slower at forced C=16 | | **0047 / M5 tf32 tensor-core scan** | tf32 `m16n8k8` mma form-T solve, f32-only | **SHIPPED default-on under paged** | MoE prefill +3.5% @npp512, +17.7% @npp2048; decode unchanged; BE-benign | | bf16 CONFIG-C (M8) | bf16 Kc/Qc + 2 C*C scratch, C->64 | REJECTED (not in f32 series) | confirmed geometry then dropped | | bf16-C16 | bf16 Gram at C=16 | REJECTED | no win; bf16 mantissa unsafe on state-coupled products | | BV block-occupancy A/B (tf32) | raise blocks/SM | REJECTED (occupancy NOT the bound) | 1844 vs 1814 S_PP (-1.04%, within noise) | | bf16-C64 | bf16 Gram at C=64 | REJECTED | -18.75%; O(C^2) intra-chunk + serial recurrence dominates | | Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 | | Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | REJECTED | md5-clean, but slightly slower: MoE 2048 2441.54 -> 2420.26; dense 2048 1021.06 -> 1015.77 | | Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | GO to one default-off prototype | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic | | Phase 13 Global-Ai32 | precompute f32 Ai once, consume from two C32 `dv_tile=64` slabs | REJECTED | md5-clean, but slower: MoE 2048 2425.10 -> 2097.76; dense 2048 1016.14 -> 918.19 | Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). M5 captures the tractable TC part; it does not fully close 2.62x because vLLM's FLA blocked-solve is a more complete TC implementation. Phase 13 closes the caveat: the default-off `GDN_GLOBAL_AI32=1` prototype was correctness-clean but slower. Stop GDN kernel work on GB10 instead of iterating into f16 Ai or more local reorders. ### 4.3 Decode / fusion levers - all REJECTED (near-parity already at ~86% true GPU-steady) | Lever | What | Verdict | Key number | |---|---|---|---| | act-quant folded into ggml MMQ | inline y-quant in MoE expert MMQ | REJECTED | **-79.4%**; ggml MMQ re-quantizes y per weight-row-tile x stream-k split, no TC for inline quant | | norm+quant+silu fusion | one launch (vLLM Triton kernel) | REJECTED (infeasible) | `ggml_cuda_can_fuse` cannot express it: FP4 quant is a mul_mat-internal prologue, silu separated from norm by 2 GEMMs + router | | Q8_0 / FP8 projection | quantize bf16 GDN/attn projections | REJECTED (regime error) | vLLM DOES use FP8 proj, but at N>=128 proj is only ~12% of stream, closes <=6% | | NVFP4 the projections | drop proj to NVFP4 | REJECTED | KL-fail, ~+6% PPL; vLLM keeps SAME bf16/FP8 proj, never NVFP4 | | W4A16-Marlin MoE decode | Marlin grouped expert GEMM at decode | REJECTED | BW-floored wash, ~5% slower | | bf16-tau per-head SSM (0026) | per-head bf16 tau on SSM decode | DROPPED | flat 780.6 vs 780.0 t/s; earlier "+12%" subsumed by 0028/0029 | | D3 FA-split / D4 GDN-width-adaptive | older off-critical-path levers | SUPERSEDED reasoning | were rejected via the debunked "5.4x/host-bound" reading; under HNP the GDN scan IS critical path (51%) but is the shared BW floor where paged leads (83% vs 79%), so still not a win | Dense decode is **AHEAD at low N (116.7% @ N=8)** - the one operating point where paged is unambiguously faster. ### 4.4 Serving / engine levers - host loop and scheduler CLOSED | Lever | What | Verdict | Key number | |---|---|---|---| | **0040 / S1** paged decode-graph reuse | `can_reuse` keyed on bucketed block-table dims | SHIPPED default-on | serving reuse 0% -> 72.2% (with S3); static 0% -> 95.5% | | **0041 / S3** decode-shape-stable scheduling (`LLAMA_PAGED_DECODE_STABLE`) | keep prefill out of decode steps | SHIPPED **default-OFF** (opt-in) | recovers the ~17 pt graph-reuse overhead at a TTFT cost; default-on regressed real serving (2.5x worse TTFT, 20-29% lower e2e throughput) | | **0043 / D1** full-step MoE decode CUDA graph | graph whole decode step incl. grouped-MMQ MoE dispatch | SHIPPED default-on | +2.6% (npl128) to +5-13% (npl32); D1 premise "host-sync on MoE readback" REFUTED (sync count identical 1457 on/off) | | S2 double-buffer set_inputs | overlap host input build with GPU | DROPPED | `set_inputs` ~0.05 ms/step, nothing to recover | | whole-step graph / host loop | host loop as serving residual | CLOSED (~0-1%) | reuse 0% (757.6) == S1+S3 72% (763.3); hostproc only ~4-8% of step wall | | padded / fixed-slot decode | pad decode width to `--parallel` for ~100% reuse | **REJECTED (built, GPU-tested, commit b028c81e)** | inert (BE) but regresses everywhere; N=8 burst 28.16->6.05 tok/s/seq; serving decode is GPU-compute-bound, dummy-row compute > reuse recovered | | speculative decode (MTP) | draft + verify | **REJECTED for current GB10 serving** | Phase 14 safety passed, but Phase 15 serving A/B regressed hard: n128 decode agg 662.4 -> 138.5 tok/s; likely graph/batch-shape disruption (`graphs reused` 361 -> 1) | ### 4.5 SHIPPED WINS (all BE / KL-benign) - keep these, do not regress - **FP4-MMQ MoE/dense GEMM** (native Blackwell FP4-MMA at the FP4 weight-BW floor; reason 4.1 stays default-off). - **M5 tf32 tensor-core chunked GDN prefill (patch 0047)**, default-on under `LLAMA_KV_PAGED` (`GDN_TC=5` + `GDN_CHUNK_MIN=64`). - **0042 fused residual-add + RMSNorm + weight-mul** (dense S_PP +0.5%, BE). - **0044 fused GatedRMSNorm + SiLU gate-mul** (672 -> 336 launches @npp512; dense +1.1%, MoE +0.9%, test-backend-ops 12979/12979). - **0046 GDN-prefill geometry gate** (gates 0022's decode retune by scan length; recovers +7.2% dense prefill, keeps the decode win, BE). - **SSM decode-fusion stack (0018-0022, 0028)**: in-place state (+23.5%/+18.9%), fused gather (+37.8%/+35.3%), o_proj reshape (+31.7%/+23.3%), conv in-place (+3.2%/+3.5%), occupancy retune (+11.1%/+8.3%) = the **2.26x / 2.46x over stock** decode multiplier. - **Serving host loop closed (0040 S1, 0043 D1).** - **The memory advantage** (1.5-3x lower VRAM, NVFP4-resident, no persistent bf16 dequant copies). - **Low-N decode lead** (dense 116.7% @ N=8). **Bit-exact output per-path** through the whole series. ### 4.6 REMAINING / unattempted levers + EV - **Multi-week persistent-Marlin decode kernel** (vLLM's fused-Marlin MoE persistent-tiling + Triton elementwise): the only path to the residual ~14 pt GPU-steady decode gap. **Low-EV**: decode-only ~4-14%, our own ggml Marlin port already lost -19.6%, needs mature tiling + multi-stream overlap (hard inside a single-stream CUDA graph), GB10-uncertain, and **cannot lift the prefill floor**. Not a free bit-exact lever. - **Datacenter-Blackwell pivot** (B200, ~8 TB/s HBM, native tcgen05/CUTLASS FP4, TMEM): lifts the LPDDR5x GDN bandwidth floor ~30x and restores exactly the vLLM advantages that lose on GB10. **This is the documented path to parity.** Re-run the methodology on the new silicon, do not reopen GB10 levers. The `VLLM_PARITY_LEVER_MAP.md` "pursue list" (A1-A7/B1-B7/C1: graph-safe ragged grouped FP4-MMA MoE kernel, FP8 paged KV, MTP spec-decode, etc.) is the **earlier working brainstorm written before the final profiling**. `VLLM_PARITY_FINAL.md` is the authoritative supersession; treat those buckets as rejected / infeasible / different-hardware unless re-validated on new silicon. Phase 14 re-validated the MTP bucket as safe, then Phase 15 rejected it as a current GB10 serving-throughput lever. Do not enable it by default and do not keep tuning draft length blindly. The only plausible follow-up is a graph-reuse and speculative verification batch-shape profile with `nsys --cuda-graph-trace=node`. Phase 16 ran that profile and supported the root cause: small-shape baseline reused graphs (`graphs reused = 62`) while MTP did not (`graphs reused = 1`) and did ~2.3x more GPU kernel work. The fixed safety gates stayed green before and after the failed serving A/B: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. Phase 17 source inspection found no tiny additive graph-reuse fix. MTP verification rows are real target decode/output rows (`K + 1` per speculative slot), so fake padding would touch KV, positions, logits, MTP nextn state, and rollback semantics. If reopened, start with a server-only shape counter around `server_slot::handle_last_sampled_token()`. Only then consider an opt-in group/defer-by-draft-length scheduler experiment, with TTFT/throughput and md5/op gates as kill criteria. Phase 18 added the server-only shape trace as patch 0055. Set `LLAMA_SPEC_SHAPE_TRACE=1` to log `kind=decode` rows and MTP `kind=verify` `K + 1` row/output shapes from `server_slot::handle_last_sampled_token()`. This is default-off instrumentation only. DGX green check after the patch saw MTP verify shapes vary (`rows=4`, then `rows=3`) on a tiny request, while the env-unset run emitted no `spec shape:` lines. Canonical post-patch gates passed: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. Artifacts: `/home/mudler/bench/phase18_mtp_shape_trace_green` and `/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after`. Next MTP step, if any: trace real serving shape entropy first. Do not implement a scheduler change until the trace shows repeatable draft-length buckets worth grouping. Any scheduler experiment must be opt-in/default-off and killed by TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP rollback/prefix gate failure. Phase 19 ran that trace-only serving measurement and rejected the scheduler shortcut. Artifact: `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`. Pre/post gates passed with canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. Serving result: | n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms | |---|---------------------|----------------|----------------|------------------|-------------| | 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 | | 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 | | 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 | Shape result: `draft=3` already accounts for 96.2-96.9% of verify slots, so group/defer-by-draft has little to recover. Full in-flight steps already mostly use all-`draft=3` vectors; the remaining churn is active-slot/tail churn plus the real `K + 1` verification-row expansion. Do not build a Phase 20 scheduler experiment on this evidence. Future MTP work would need a deeper target-verify graph/state design, not another small server scheduling shortcut. Phase 62 ran that gated verify-cost recheck. Artifact: `/home/mudler/bench/phase62_mtp_verify_cost/20260701_134125`. Pre/post gates passed with canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. MTP acceptance was high (`7372/9340 = 0.789`, mean acceptance length `3.33`), but throughput remained far below the keep threshold: `0.420x`, `0.274x`, and `0.213x` baseline decode at n8/n32/n128. Shape trace again showed `draft=3` / `rows=4` dominance (`95.6%`), with `graphs reused = 1`. Keep current MTP rejected unless a later target-verify/output-row graph-cost design exists; do not tune `spec-draft-n-max` blindly. Phase 20 refreshed the current-stack MoE serving snapshot against vLLM using the clean `~/llama-phase6-source` mirror (`f2521ab12`) rather than the stale `llama-paged-dev` benchmark tree. Artifact: `/home/mudler/bench/phase20_current_snapshot/20260701_050621`. Pre/post gates passed with canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. Current MoE serving snapshot (`PTOK=128`, `GEN=64`): | n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg | |---|------------------|-----------------|-------------------|-----------|----------|----------------| | 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% | | 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% | | 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% | TTFT remains the clearest user-visible gap: paged is 2.88x/3.36x/3.11x slower than vLLM at n8/n32/n128, and paged prefill_tps is roughly one-third of vLLM. This keeps the GB10 shortcut closure intact: do not reopen MTP or small scheduler work. The credible next parity path is a datacenter-Blackwell rerun or a larger fused-kernel project outside this low-conflict patch stack. Phase 21 added a reusable current-stack serving harness: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`. It defaults to `~/llama-phase6-source`, validates docker/`local-ai-worker`/GPU idle state, uses the owner-file lock, runs pre/post inference gates, compares paged and vLLM with h2h, and writes ratio summaries. DGX dry run passed at `/home/mudler/bench/phase21_harness_dryrun/20260701_051757`. Use this harness for future current-stack GB10 snapshots. Do not reuse `~/bench/combined_definitive.sh` unless it is first ported away from stale `~/llama-paged-dev` paths and old lock assumptions. Phase 31 re-verified the patch-series mirror invariant after patch `0057`: applying every LocalAI `patches/paged/0*.patch` with strict `git apply` on top of Makefile pin `0ed235ea2c17a19fc8238668653946721ed136fd` produced tree `4eae628e4ba6f2defa14a19d19f7e4abef9a2647`, exactly matching fork branch `localai-paged` HEAD `c78e537b5 feat(cuda): trace moe mmq launch shapes`. Phase 24 extended `paged-current-serving-snapshot.sh` to write the snapshot hardware report. DGX dry run passed at `/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741`; it recorded `GPU 0: NVIDIA GB10`, driver `580.159.03`, compute capability `12.1`, and `hardware_class=gb10_or_workstation_blackwell`. This makes future parity artifacts self-describing: GB10/workstation Blackwell results must not be used as datacenter-Blackwell parity evidence. Phase 25 extended the same harness to write `gate_summary.tsv`. The summary was backfilled on the Phase 20 artifact at `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`; it records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806` as `ok`. Phase 26 ran the full audited current-stack snapshot with `hardware.txt`, pre/post gates, same-session paged and vLLM serving runs, `summary.tsv`, and `gate_summary.tsv`. Artifact: `/home/mudler/bench/phase26_audited_snapshot/20260701_053650`. Hardware was recorded as `hardware_class=gb10_or_workstation_blackwell`, GPU `NVIDIA GB10`, driver `580.159.03`, compute capability `12.1`. Every compact gate row was `ok`: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`, both before and after the serving run. Audited current MoE serving snapshot (`PTOK=128`, `GEN=64`): | n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg | |---|------------------|-----------------|-------------------|-----------|----------|----------------| | 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% | | 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% | | 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% | Use Phase 26 as the current audit-grade GB10 snapshot. It keeps the Phase 20 verdict intact, but the artifact is more useful for future regressions because it carries hardware classification and compact pre/post inference gates. Phase 27 re-profiled the current clean llama.cpp n128 serving path with `nsys --cuda-graph-trace=node`. Artifact: `/home/mudler/bench/phase27_graph_node_serving/20260701_055519`. The run matched Phase 26 throughput closely (`675.5` vs `673.4` decode_agg_tps) and kept gates green before and after the profile (post retry): MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. The node-traced buckets still put the work in `gdn_core` (`29.59%`) and `mmq_nvfp4` (`28.44%`); helper dispatch remains too small (`mm_ids` `0.61%`, `gather_mmq` `0.37%`, `argsort_topk` `0.40%`). Do not reopen metadata/helper-only MoE dispatch work on GB10. Phase 28 tested the remaining low-conflict NVFP4 grouped-MMQ occupancy knobs. Artifact: `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`. `GGML_CUDA_FP4_MINBLOCKS=2` passed md5/op gates before and after serving (MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`) but regressed n128 same-session decode serving (`705.1 -> 689.9` decode_agg_tps, `0.9784x`). `GGML_CUDA_FP4_MMQ_Y=64` failed to compile because the NVFP4 writeback specialization asserts `nwarps*tile_C::I == mmq_y`. Do not promote either knob; future grouped-MMQ work must be structural kernel work. Phase 29 added the default-off grouped-MMQ shape trace as patch `0056`. Artifact: `/home/mudler/bench/phase29_mmq_shape_trace/20260701_042428`. Fork commit: `20a99518a feat(cuda): trace moe mmq batch shapes`. The helper was added test-first (`test-cuda-mmq-shape-trace`) and built under CUDA on DGX. Default-off and `LLAMA_MOE_MMQ_SHAPE_TRACE=4` gates both passed: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. The trace-enabled gate emitted exactly four `[LLAMA_MOE_MMQ_SHAPE]` lines. This is evidence-only instrumentation; it does not close the speed gap. Phase 30 used patch `0056` for a live n128 serving shape trace. Artifact: `/home/mudler/bench/phase30_mmq_shape_serving/20260701_043300`. The first 4096 grouped-MMQ calls split into 1200 decode-like calls (`ncols_max <= 128`) and 2896 prefill-like calls. Decode-like calls had density `1-4` and selected `mmq_x_best` only in `{32,40,48,64}`; prefill-like calls were mostly density `16` and selected `mmq_x_best=128`. All traced calls had `stream_k=1`. Post-run gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. Phase 31 added patch `0057` for default-off grouped-MMQ launch tracing. Artifact: `/home/mudler/bench/phase31_mmq_launch_trace/20260701_064424`. Fork commit: `c78e537b5 feat(cuda): trace moe mmq launch shapes`; DGX mirror commit: `8b75905e9`. The trace adds `[LLAMA_MOE_MMQ_LAUNCH]` lines under `LLAMA_MOE_MMQ_SHAPE_TRACE=`, recording `ntiles_dst`, `stream_k_blocks`, tile efficiency, `fixup`, `ntx/nty/ntzw`, and compiled `mmq_x/mmq_y`. Default off, trace-enabled, and post-serving gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. The n128 serving trace showed decode-like `4800/4800` and prefill-like `4920/4920` launch lines with `fixup=0` and `stream_k_blocks == ntiles_dst`. Do not pursue a no-fixup/no-stream-k shortcut for this workload; the remaining grouped-MMQ work is structural small-M kernel work. Phase 32 added patch `0058` for default-off small-M grouped-MMQ candidate tracing. Artifact: `/home/mudler/bench/phase32_small_m_classifier/20260701_070127`. Fork commit: `2a9964d29 feat(cuda): trace moe small-m mmq candidates`; DGX mirror commit: `024f494d0`. The trace adds `[LLAMA_MOE_MMQ_SMALL_M]` lines under `LLAMA_MOE_MMQ_SMALL_M_TRACE=` for decode-like low-density grouped-MMQ MoE calls (`ncols_max <= 128`, density `<=4`, `mmq_x_best <=64`). Default-off, trace-enabled, and post-serving gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. The n128 serving trace found 4096 candidate calls, mostly `mmq_x_best=64` (1800) and `48` (1096). Phase 33 should A/B a default-off small-M tile policy starting at `mmq_x=16`. Phase 33 added patch `0059`, default-off `LLAMA_MOE_SMALL_M_TILE=`, and rejected the simple smaller-tile policy. Artifact: `/home/mudler/bench/phase33_small_m_tile_policy/20260701_071136`. Fork commit: `fbed2abaa feat(cuda): gate moe small-m mmq tile policy`; DGX mirror commit: `dfd1eaea8`. Default-off, tile16, tile8, and post-serving gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. Same-session n128 serving rejected both caps: baseline `672.1` decode_agg_tps, tile16 `640.3` (`0.953x`), tile8 `583.2` (`0.868x`). Do not promote smaller `mmq_x` caps. Phase 34 added patch `0060`, default-off `LLAMA_MOE_MMID_ROUTE_TRACE=`, to classify the live `MUL_MAT_ID` dispatch route without changing route behavior. Artifact: `/home/mudler/bench/phase34_mmid_route_trace/20260701_072737`. Fork commit: `6c332094c feat(cuda): trace moe mmid routes`; DGX mirror commit: `34a256d14`. Default-off, trace-enabled, and post-serving gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. Live n128 serving with trace cap 4096 found `mmq=2776`, `mmvq=1320`, and `host_sync=0/4096`. Treat the old current-stack host-sync-fallback concern as refuted for this workload; the remaining MoE work is grouped-MMQ small-M efficiency or another measured bucket. Phase 35 added patch `0061`, default-off `LLAMA_MUL_MAT_ROUTE_TRACE=`, to classify regular `MUL_MAT` routes for the projection-heavy serving bucket. Artifact: `/home/mudler/bench/phase35_mul_mat_route_trace/20260701_074359`. Fork commit: `486c28c63 feat(cuda): trace mul mat routes`; DGX mirror commit: `18f7ad005`. Default-off, trace-enabled, and post-serving gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`. Live n128 serving with trace cap 8192 found `mat_f=2888`, `op_cublas=2292`, `mmq=1328`, `vec_q=1214`, `vec_f=470`; BF16 (`type=30`) was split `mat_f=2485`, `op_cublas=1330`. Next projection work should target BF16 `mat_f`/`op_cublas` subroute evidence or route policy, not batched cuBLAS. Phase 36 added patch `0062`, default-off `LLAMA_CUBLAS_ROUTE_TRACE=`, to classify the generic cuBLAS `MUL_MAT` subroute without changing branch behavior. Artifact: `/home/mudler/bench/phase36_cublas_route_trace/20260701_081228`. Fork commit: `38c4ef2e4 feat(cuda): trace cublas routes`; DGX mirror commit: `e0224393a`. Default-off, trace-enabled, and post-serving gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`. Live n128 serving with trace cap 8192 found `bf16_tc=5681` and `sgemm=2511`. The next projection phase should explain whether the F32 SGEMM shapes are expected glue tensors or a missed BF16 route; do not chase NVFP4 cuBLAS or batched cuBLAS for this measured bucket. Phase 37 added patch `0063`, extending `LLAMA_CUBLAS_ROUTE_TRACE=` with `src0`, `src1`, and `dst` tensor names. Artifact: `/home/mudler/bench/phase37_cublas_name_trace/20260701_083227`. Fork commit: `2d590d770 feat(cuda): trace cublas tensor names`; DGX mirror commit: `2cbb61969`. Default-off, trace-enabled, and post-serving gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`. Live n128 trace found `bf16_tc=2884`, `sgemm=1212`. The `sgemm` bucket is `blk.N.ffn_gate_inp.weight -> ffn_moe_logits-N` and `blk.N.ffn_gate_inp_shexp.weight -> shared_expert_gate-N`; do not force BF16 without first inspecting model-load tensor types and running KL validation. Phase 38 is the current gate-projection policy checkpoint. Artifact: `/home/mudler/bench/phase38_gate_baseline/20260701_084410`. Preflight showed docker `0`, `local-ai-worker` `0`, compute apps `0`, and GB10 driver `580.159.03`. Fresh baseline gates against the Phase37 build passed: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`. Source comparison found llama.cpp and vLLM both keep router and shared-expert gate weights unquantized; vLLM's relevant idea is fused F32 gate weight concatenation, not BF16/NVFP4 routing. Future fused-gate work must be default-off, preserve F32 semantics, and pass md5/op gates before benchmarking; if md5 changes, run KL first. Phase 39 closes the naive fused-gate shortcut. Artifact: `/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis`. Re-analysis of the Phase27 graph-node serving profile showed total kernel time `20.0372s`, `concat_layout=459.84ms` (`2.29%`, `2250` instances), `cublas_bf16_gemm=1892.81ms` (`9.45%`), and `cutlass_bf16_gemm=684.01ms` (`3.41%`). Do not implement graph-time `ggml_concat()` of `ffn_gate_inp.weight` plus `ffn_gate_inp_shexp.weight`; it risks increasing an existing layout-copy bucket. The only future fused-gate design worth scoping is a persistent/load-time F32 combined gate weight with output views, default-off until MoE/dense md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass. Phase 40 closes the tested GB10 max-concurrency C1 shortcut. Artifact: `/home/mudler/bench/phase40_max_concurrency/20260701_090012`. The snapshot ran with `PARALLEL=256`, `CTX=262144`, `PTOK=128`, `GEN=64`, `NPL="128 192 256"`, and `OPS=MUL_MAT,MUL_MAT_ID`. Pre/post gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`. Paged safely served `n=256`, but vLLM also fit and remained faster: `paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`, `paged_ttft_over_vllm=2.9401`. Do not claim GB10 parity from higher max concurrency at this prompt/gen length and `n<=256`; a future C1 retry must push beyond this tested point and keep the same md5/op gates. Phase 41 records the low-concurrency counterpart to the Phase40 high-concurrency check. Artifact: `/home/mudler/bench/phase41_low_concurrency/20260701_091437`. The snapshot ran with `PARALLEL=32`, `CTX=32768`, `PTOK=128`, `GEN=64`, `NPL="1 8 32"`, and `OPS=MUL_MAT,MUL_MAT_ID`. Pre/post gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`. Paged is about `0.75x` vLLM decode at `n=1/8` and `0.665x` at `n=32`; TTFT is `1.38x`, `3.14x`, and `3.40x` vLLM respectively. Do not reopen D1 from this result: `0043` already ships grouped-MMQ full-step graph capture default-on, Phase34 found `host_sync=0/4096`, and S3 is default-off because it regressed TTFT/end-to-end throughput. Phase 42 reconciles the target list after parallel read-only review. D1 is closed on the current GB10 path; GDN low-conflict work is exhausted after `0046`/`0047` plus the rejected C32/QS-early/Global-Ai32 follow-ups; W4A16/GEMM micro-tweaks are exhausted after `0033`-`0035` and `0048`-`0050`. It nominated the Phase38/39 persistent/load-time F32 combined gate projection as the last small GB10 source candidate. Phase 43 rejects that gate-fusion candidate as a small shortcut after source inspection. `ffn_gate_inp.weight` and `ffn_gate_inp_shexp.weight` are separate GGUF tensors; the Qwen35MoE graph consumes them in separate matmuls; the loader can create tensors from GGUF metadata or views of existing tensors, but not a new persistent derived concatenated weight. A correct implementation would need a general derived-weight allocation/materialization path across mmap, offload, split buffers, and MTP blocks. Do not implement a Qwen-only loader hack, and do not fall back to graph-time `ggml_concat()`. After Phase43 there is no remaining low-conflict GB10 shortcut justified by current evidence; future work is either a larger kernel/loader design or a hardware-pivot benchmark, still gated by MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` and KL if md5 changes. Phase 44 makes the current-stack serving snapshot harness ready for hardware pivots by parameterizing the vLLM side instead of hardcoding the GB10 defaults. `paged-current-serving-snapshot.sh` now accepts `VLLM_GPU_MEMORY_UTILIZATION`, `VLLM_MAX_MODEL_LEN`, `VLLM_MAX_NUM_SEQS`, `VLLM_TENSOR_PARALLEL_SIZE`, and whitespace-split `VLLM_EXTRA_ARGS`, and prints the resolved values during `DRY_RUN=1`. This is not a new benchmark and does not change inference code or gate behavior. Use it when the next parity run targets datacenter Blackwell or another non-GB10 vLLM serving shape, while keeping `hardware.txt`, pre/post MoE/dense md5, `MUL_MAT`/`MUL_MAT_ID`, and KL-if-md5-changes as mandatory gates. Phase 45 records the immediate inference-safety guard after Phase44. Artifact: `/home/mudler/bench/phase45_inference_gate_guard/20260701_094320`. The DGX phase36 build passed MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` `806/806`. Docker, `local-ai-worker`, and GPU compute preflight were all zero before and after the run. Phase 46 removes the last hardcoded `q36` served-model name from the audited serving snapshot harness. Set `SERVED_MODEL_NAME` to drive vLLM `--served-model-name`, the vLLM readiness check, and h2h `--model` on both engines. DGX dry run: `/home/mudler/bench/phase46_served_model_name_dryrun/20260701_094849`, with `SERVED_MODEL_NAME=dense-q36` printed during `DRY_RUN=1`. This is harness-only hardware-pivot readiness, not a throughput result. Phase 47 attempted the first dense serving snapshot using the Phase46 override. Dry-run artifact: `/home/mudler/bench/phase47_dense_serving_dryrun/20260701_095141`; incomplete full artifact: `/home/mudler/bench/phase47_dense_serving/20260701_095151`. Pre-gates were green and the paged dense arm completed through `n=128`, but the artifact is not a dense parity result because vLLM produced no result JSONs. Root cause: dense vLLM startup exceeded the old fixed readiness budget, and the cleanup path could wait indefinitely on the server PID after `SIGTERM`. Phase 48 hardens the serving snapshot harness for that failure mode. It adds `LLAMA_READY_ATTEMPTS` and `VLLM_READY_ATTEMPTS`, bounds HTTP readiness probes with `curl --max-time 2`, and uses bounded server cleanup that escalates from `SIGTERM` to `SIGKILL`. Dry-run artifact: `/home/mudler/bench/phase48_readiness_harness_dryrun/20260701_100533`, with `VLLM_READY_ATTEMPTS=700` printed and clean DGX preflight. Phase 47 retry completed after Phase48. Artifact: `/home/mudler/bench/phase47_dense_serving_retry/20260701_100811`. Pre/post gates were green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`. Dense paged decode beats vLLM at low concurrency (`1.3434x` at `n=1`, `1.1560x` at `n=8`) but falls behind at `n=32/128` (`0.9036x`, `0.7912x`), and TTFT remains `1.87x` to `4.05x` vLLM. This does not change the GB10 conclusion. Phase 49 removes vLLM log noise from harness-owned environment variables. The `vllm serve` child now unsets `VLLM_MODEL`, `VLLM_BIN`, `VLLM_READY_ATTEMPTS`, `VLLM_GPU_MEMORY_UTILIZATION`, `VLLM_MAX_MODEL_LEN`, `VLLM_MAX_NUM_SEQS`, `VLLM_TENSOR_PARALLEL_SIZE`, and `VLLM_EXTRA_ARGS` while preserving intentional vLLM runtime variables such as `VLLM_LOGGING_LEVEL`. Dry run: `/home/mudler/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138`. Phase 50 resolves the dense high-N decode-accounting question with a graph-node difference-method profile. Artifact: `/home/mudler/bench/phase50_dense_true_decode/20260701_103120`. Pre/post inference gates on the profiled `build-cuda` binary stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` `806/806`. Dense `npl=128`, `npp=128` true decode is `383.66 t/s` for paged and `435.00 t/s` for vLLM, ratio `0.8820`. This means Phase47's `0.7912` h2h decode ratio and `0.5071` aggregate ratio include scheduler/admission and prefill-overlap/accounting effects beyond the real GPU-steady decode gap. Next GB10 code work should instrument batch composition/admission in `server_context::pre_decode()` before attempting another kernel shortcut. Phase 51 implements that admission trace in the llama.cpp fork. Local fork commit: `c6cb8460e feat(server): trace serving admission batches`. The trace is default-off behind `LLAMA_SERVING_TRACE=1`, adds a small unit-tested accumulator, and records aggregate `pre_decode()` scheduler shape: decode tokens, prompt tokens admitted, waiting prompt slots, started/continued prompt slots, decode-only steps, `n_batch`, `n_ubatch`, `prefill_budget_step`, and `prefill_cap_per_slot`. DGX artifact: `/home/mudler/bench/phase51_serving_admission_trace/20260701_110130`. The patched `build-cuda` CTest passed and inference gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`. Push and LocalAI patch-series regeneration are still pending because push requires explicit approval. Phase 52 uses the Phase51 trace on DGX for dense `n=128`, `ptok=128`, `gen=64`. Artifact: `/home/mudler/bench/phase52_dense_admission_trace/20260701_111017`. Pre/post md5 and op gates stayed green. The clean traced h2h row was `decode_agg_tps=360.5`, `prefill_tps=629.5`, `ttft_mean_ms=23171.5`, wall `58.921s`. The admission trace reported `steps=76`, `decode_only_steps=0`, `decode_tokens=8064`, `prompt_tokens=22785`, `max_waiting_prompt_slots=35`, `started_prompt_slots=128`, `continued_prompt_slots=139`, `prefill_budget_step=0`, and `prefill_cap_per_slot=0`. The prompt token count matches h2h exactly, so this is the target request. The next GB10 lever should be a default-off scheduler/admission A/B or a per-step histogram trace, not an immediate GDN/GEMM rewrite. Phase 53 tested the existing runtime admission-budget knobs instead of adding new code. Artifact: `/home/mudler/bench/phase53_dense_admission_budget_sweep/20260701_111915`. Pre/post gates stayed green. Dense `n=128` results: default Phase52 `agg=139.0`, `decode_agg=360.5`, `prefill=629.5`, `TTFT=23171.5ms`, wall `58.921s`; `T=1536 cap=512` `agg=134.4`, `decode_agg=376.7`, `prefill=607.0`, `TTFT=22263.7ms`, wall `60.968s`; `T=1024 cap=512` `agg=130.0`, `decode_agg=392.4`, `prefill=565.2`, `TTFT=23234.3ms`, wall `63.003s`. Decision: simple budget shrinkage is rejected. It raises h2h decode-agg while lowering aggregate/prefill throughput, and it does not materially solve TTFT. Next scheduler work should be per-step histograms or a targeted first-token admission policy. Phase 54 through Phase 59 tested that targeted scheduler path. The fork commits are still local-only and default-off: - `c6cb8460e feat(server): trace serving admission batches` - `bd7b2e952 feat(server): add admission trace histograms` - `8a97629a4 feat(server): add TTFT prefill-first scheduler mode` - `3b6ab5fa8 feat(server): cap TTFT prefill-first decode deferral` - `8759213e3 feat(server): gate TTFT defer by prompt backlog` Phase59 is the current verdict. Artifact: `/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`. Pre/post llama gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`. MoE `n=128`, `ptok=128`, `gen=64` repeated the Phase58 min32 signal: llama default `agg=336.6`, `TTFT=7798.5ms`, wall `24.334s`; llama min32 `agg=336.9`, `TTFT=7167.8ms`, wall `24.316s`. Matching vLLM was still `agg=601.3`, `TTFT=2968.1ms`, wall `13.563s`. Decision: keep `LLAMA_TTFT_PREFILL_FIRST=1` and `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` as an opt-in llama.cpp latency/QoS knob. It does not prove vLLM parity progress by itself. Do not default it until more workload coverage exists, and do not regenerate LocalAI patches until the fork commits are pushed with explicit approval. Phase 60 re-profiled the current W4A16 grouped MoE prefill path to check whether there was still a low-conflict W4A16 shortcut after Phase1-5. Artifact: `/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915`. Pre/post gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`. Default FP4-MMQ S_PP was `2327.69` at `npp=512` and `2423.20` at `npp=2048`; forced W4A16 was `1451.00` and `1482.76`, only `0.623x` and `0.612x` of default. The `npp=512` profile showed W4A16 still dominated by `w4a16_grouped_kernel` (`4.142s`, `42.5%`) plus sorted activation gathers (`1.094s`, `11.2%`), while the cast kernel was only `0.517s` (`5.3%`). Decision: do not add another small W4A16 metadata/body/cast patch. Future W4A16 work needs a larger redesign that improves the grouped kernel body and removes or fuses sorted activation movement. Near-term GB10 parity work should return to broader prefill/GDN/MoE design or hardware-pivot benchmarking. Phase61 is scoped as that larger W4A16 kill-gate, not as a committed code change: `docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61.md`. It proposes a default-off `LLAMA_W4A16_DIRECT_A=1` experiment that consumes the original activation tensor plus the existing `ids_to_sorted` map directly, removing Phase60's sorted activation gather and separate cast kernels before any grouped-kernel body rewrite. Keep it only if it improves forced W4A16 S_PP by at least `+12%` and reaches at least `0.75x` default FP4-MMQ; otherwise reject and do not continue W4A16 body tuning. Phase61 result: rejected. The direct-A kernel passed correctness after matching `get_rows_cuda` flat-row addressing (`MUL_MAT_ID` `806/806`; forced/direct-A MoE transcript md5 both `07db32c2bcb78d17a43ed18bc22705cd`) and default gates remained green (`8cb0ce23`, `5951a5b4`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`). But direct-A only improved forced W4A16 S_PP `1471.05 -> 1566.30` at `npp=512` and `1502.46 -> 1605.82` at `npp=2048` (`+6.5%` / `+6.9%`), still just `0.67x` / `0.66x` of default FP4-MMQ. The direct kernel diff was not committed; only the safe policy/routing stub remains in the fork. Do not pursue more W4A16 body tuning on GB10 as the next parity lever. --- ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes) 1. **Profile, don't assume. The analysts were wrong 4 times.** Every one was caught only by an in-backend A/B or a corrected profile: - **GDN-scalar grep** (assumed the scan was scalar/serial from reading source) - wrong, retired by the tensor-core port. - **dense-cuBLAS reroute** (assumed dequant->bf16 would win) - wrong, -31% to -62%. - **occupancy** (assumed blocks/SM was the GDN bound) - wrong, 1844 vs 1814 within noise. - **projection-regime** (assumed FP8/NVFP4 projections were a big lever) - wrong, projections are ~12% of the decode stream at high N. **In-backend A/B is the only truth.** A standalone PoC win (0034) is not a result. 2. **Per-kernel us/tok overstates end-to-end S_PP/S_TG.** A kernel that is X% faster in isolation does not move throughput X%; always confirm against the end-to-end batched-bench / serving number. 3. **The CUDA-graph-trace decode artifact (the big one).** Decode is a replayed graph; nsys without `--cuda-graph-trace=node` collapses it and lies. This single trap produced the wrong "host-bound / 159 us/tok / 56%" story across multiple analyses. Always graph-node-trace + difference method (section 3.4). 4. **Beware GPU contention skewing absolutes.** The box runs concurrent quant/repack/finetune jobs. Gate on idle GPU + free lock; prefer the same-session both-engine harness so both numbers move together. 5. **The vLLM server number is inflated ~8 pt vs its true GPU-steady.** vLLM's chunked-prefill-overlap inflates its own server-measured decode window (1177 server vs 1078 true GPU-steady). Compare GPU-steady to GPU-steady, or you will chase a phantom gap. The reconciliation chain that must sum: vLLM server 1177 (100%) -> vLLM true GPU-steady 1078 (92%) -> llama GPU-steady 924 (78.5% of 1177, = 86% of 1078) -> llama server 718 (60.7%, the S3-recoverable serving overhead). --- ## 6. THE THREE FORWARD DIRECTIONS ### (a) Close / ship the record (lowest effort, do this first) The investigation is closed for GB10 shortcuts, and the closeout chores below are now done: - patch `0044` is tracked in the LocalAI series; - the Makefile pin `0ed235ea2c17a19fc8238668653946721ed136fd` is the authoritative paged pin; - Phase 20 re-ran the current-stack serving snapshot on the clean mirror; - Phase 22 re-verified the patch-series mirror invariant after `0055`. For future release checks, run `paged-inference-gates.sh` and `paged-current-serving-snapshot.sh` from the LocalAI backend tree. The inference gate now defaults to both `MUL_MAT` and `MUL_MAT_ID`; set `OPS=` only for a focused diagnostic run. ### (b) Datacenter-Blackwell pivot (THE real parity path) The thesis: every vLLM advantage that wins on GB10 is a kernel that is **broken or capped on consumer Blackwell** and **inverts on datacenter Blackwell** (B200): FLA blocked-solve GDN, Marlin/CUTLASS grouped FP4, HBM-tuned full-cudagraph decode, native tcgen05/TMEM. ~8 TB/s HBM lifts the LPDDR5x GDN bandwidth floor ~30x. Concrete first steps: 1. Acquire a B200 (or equivalent HBM tcgen05 part). Reproduce the **both-engine same-session harness** there (`combined_definitive.sh` discipline): build the stock and paged binaries, build vLLM 0.23.0+, run MoE + dense prefill + serving for both engines. 2. Re-measure the FP4 path: on B200, native CUTLASS NVFP4 grouped-GEMM should work (the CUTLASS #3096 / TMA-WS failure is consumer-Blackwell-specific). Confirm whether vLLM now runs **native FP4** instead of Marlin W4A16. If so, the 4.1 GEMM track must be re-evaluated from scratch (it was rejected on a GB10-specific ceiling). 3. Re-take the decode profile with `--cuda-graph-trace=node`; the GDN scan that floors at 273 GB/s on GB10 should no longer dominate at HBM bandwidth - re-derive the per-token decomposition before choosing any lever. ### (c) Multi-week persistent-Marlin decode kernel (decode-only, low-EV, CANNOT reach parity) Only pursue if (a)+(b) are not options and someone explicitly wants the residual decode gap closed on GB10. It targets the ~14 pt GPU-steady decode gap (vLLM's fused-Marlin MoE persistent-tiling + single Triton elementwise). Concrete first steps: 1. Re-confirm the ceiling first: our own ggml Marlin port already lost -19.6% at decode (4.3), so the bar is "beat that and beat FP4-MMQ at the decode BW floor". 2. Prototype the persistent-tiling grouped-FP4 MoE kernel **standalone**, then prove it **in-backend** (a PoC win is not a result, per 0034). It must live inside a single-stream CUDA graph or bring its own multi-stream overlap. 3. Bound the upside honestly: this is **decode-only ~4-14%** and **does nothing for the prefill floor (36-43%)**, so it does not reach parity. Record the verdict either way. --- ## 7. KEY FILE / ARTIFACT INDEX ### Fork (canonical source of truth) - Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `2d590d770` ("trace cublas tensor names", patch `0063`). - DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `2cbb61969` with the Phase 37 cuBLAS tensor-name trace patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes. - Historical DGX dev tree: `dgx:~/llama-paged-dev`, branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). It is an old experimental tree and must not be treated as canonical. ### LocalAI worktree - Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (currently 246 ahead, 31 behind `origin/master`; recompute before reporting). - Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`). - `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`. - `patches/paged/`: **54** `.patch` files spanning 0001-0063 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055; MoE MMQ selector/launch/candidate/tile-policy/route instrumentation is 0056-0060; regular MUL_MAT route instrumentation is 0061; cuBLAS route instrumentation is 0062-0063. ### Bench artifacts (DGX) - `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - historical same-session both-engine run. - `~/bench/phase20_current_snapshot/20260701_050621` - current clean-stack paged-vs-vLLM MoE serving snapshot. - `~/bench/phase21_harness_dryrun/20260701_051757` - current snapshot harness dry-run artifact. - `~/bench/phase24_hardware_report_dryrun/20260701_052741` - current snapshot harness dry run proving `hardware.txt` captures the DGX as `hardware_class=gb10_or_workstation_blackwell`. - `~/bench/phase25_gate_summary_dryrun/20260701_053353` - dry run after adding `gate_summary.tsv` support; normal dry-run still writes `hardware.txt` and does not emit a gate summary before gates exist. - `~/bench/phase26_audited_snapshot/20260701_053650` - current audit-grade full paged-vs-vLLM MoE serving snapshot with `hardware.txt`, pre/post gates, `summary.tsv`, and `gate_summary.tsv`. - `~/bench/phase27_graph_node_serving/20260701_055519` - current clean llama.cpp n128 serving profile captured with `--cuda-graph-trace=node`, pre/post retry gates green. - `~/bench/phase28_mmq_occupancy/20260701_040450` - NVFP4 MMQ occupancy build-knob A/B; `MINBLOCKS=2` gate-safe but serving-regressed, `MMQ_Y=64` compile-rejected. - `~/bench/phase29_mmq_shape_trace/20260701_042428` - default-off MoE MMQ shape trace patch `0056`; CUDA build plus default/trace md5 gates green. - `~/bench/phase30_mmq_shape_serving/20260701_043300` - live n128 serving MMQ shape distribution from patch `0056`; post-run md5/op gates green. - `~/bench/phase31_mmq_launch_trace/20260701_064424` - default-off MoE MMQ launch trace patch `0057`; default/trace/post-serving md5 gates green; n128 launch trace rejects stream-k/fixup shortcut (`fixup=0`, `stream_k_blocks == ntiles_dst`). - `~/bench/phase32_small_m_classifier/20260701_070127` - default-off MoE MMQ small-M classifier patch `0058`; default/trace/post-serving md5 gates green; n128 trace found 4096 candidate calls. - `~/bench/phase33_small_m_tile_policy/20260701_071136` - default-off MoE MMQ small-M tile policy patch `0059`; tile16/tile8 md5/op safe but both slower in n128 serving. - `~/bench/phase34_mmid_route_trace/20260701_072737` - default-off MoE MMID route trace patch `0060`; default/trace/post-serving md5 gates green; n128 route trace found `mmq=2776`, `mmvq=1320`, `host_sync=0/4096`. - `~/bench/phase35_mul_mat_route_trace/20260701_074359` - default-off regular MUL_MAT route trace patch `0061`; default/trace/post-serving md5 gates green; n128 route trace found BF16 `mat_f=2485`, `op_cublas=1330`. - `~/bench/phase36_cublas_route_trace/20260701_081228` - default-off cuBLAS subroute trace patch `0062`; default/trace/post-serving md5 and op gates green; n128 route trace found `bf16_tc=5681`, `sgemm=2511`. - `~/bench/phase37_cublas_name_trace/20260701_083227` - cuBLAS tensor-name trace patch `0063`; default/trace/post-serving md5 and op gates green; n128 trace identified `sgemm` as MoE gate logits and shared-expert gate projections. - `~/bench/phase38_gate_baseline/20260701_084410` - current Phase37 build baseline before gate-projection policy work; docker/local-ai-worker/GPU idle preflight green; MoE/dense md5 green; `MUL_MAT` `1146/1146`; `MUL_MAT_ID` `806/806`. - `~/bench/phase39_gate_sgemm_profile/20260701_085211` - short completion profile, diagnostic only because `-n 32` is not a canonical md5 gate; useful for confirming graph-time concat is a real kernel path. - `~/bench/phase39_gate_sgemm_profile/phase27_reanalysis` - Phase27 serving profile re-analysis used to reject graph-time fused gate weight concat; `concat_layout=459.84ms` (`2.29%`) in the serving kernel window. - `~/bench/phase40_max_concurrency/20260701_090012` - max-concurrency C1 check at `NPL=128/192/256`, `PTOK=128`, `GEN=64`, `PARALLEL=256`, `CTX=262144`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green, but vLLM also fit at `n=256` and stayed ahead (`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`). - `~/bench/phase41_low_concurrency/20260701_091437` - low-concurrency serving check at `NPL=1/8/32`, `PTOK=128`, `GEN=64`, `PARALLEL=32`, `CTX=32768`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green; paged is `0.7493`, `0.7518`, and `0.6649` of vLLM decode at `n=1/8/32`, with TTFT still much worse by `n=8/32`; does not reopen D1. - `~/bench/phase44_hardware_pivot_harness_dryrun/20260701_094038` - harness-only dry-run artifact proving the vLLM serving config overrides are printed and preflighted before any server starts. - `~/bench/phase45_inference_gate_guard/20260701_094320` - post-Phase44 inference guard; MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` backend-op gates green. - `~/bench/phase46_served_model_name_dryrun/20260701_094849` - harness-only dry-run artifact proving `SERVED_MODEL_NAME` is printed and preflighted before any server starts. - `~/bench/phase47_dense_serving_dryrun/20260701_095141` - dense serving dry-run with `SERVED_MODEL_NAME=dense-q36`. - `~/bench/phase47_dense_serving/20260701_095151` - incomplete dense serving attempt; pre-gates and paged arm completed, vLLM did not produce result JSONs under the old readiness budget. - `~/bench/phase48_readiness_harness_dryrun/20260701_100533` - harness dry-run proving configurable readiness budgets and clean preflight before retrying dense serving. - `~/bench/phase47_dense_serving_retry/20260701_100811` - completed dense serving snapshot after Phase48; pre/post md5 and op gates green; paged low-N decode ahead, high-N aggregate and TTFT behind. - `~/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138` - harness dry-run after scrubbing harness-owned `VLLM_*` variables from the `vllm serve` child environment. - `~/bench/phase50_dense_true_decode/20260701_103120` - dense graph-node difference-method profile at `npl=128`, `npp=128`; `build-cuda` pre/post md5 and op gates green; true decode paged `383.66 t/s`, vLLM `435.00 t/s`, ratio `0.8820`, pointing next at serving admission/scheduler tracing. - `~/bench/phase51_serving_admission_trace/20260701_110130` - default-off `LLAMA_SERVING_TRACE=1` fork commit `c6cb8460e`; DGX patched `build-cuda` CTest and md5/op gates green; push and LocalAI patch-series mirror pending approval. - `~/bench/phase52_dense_admission_trace/20260701_111017` - clean dense `n=128` admission trace; pre/post gates green; `decode_only_steps=0`, `prompt_tokens=22785`, `max_waiting_prompt_slots=35`; next lever is scheduler/admission A/B or per-step histogram trace. - `~/bench/phase53_dense_admission_budget_sweep/20260701_111915` - runtime sweep of `LLAMA_MAX_BATCH_TOKENS=1536/1024` with `LLAMA_PREFILL_CAP=512`; pre/post gates green; simple budget shrinkage rejected because aggregate/prefill throughput regressed and TTFT did not materially improve. - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`. - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30. - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`. ### Recent context commits - `6edbb56b0` "docs(paged): definitive vLLM-parity final-state record (GB10, CLOSED)" - adds `VLLM_PARITY_FINAL.md`. - `baf102524` "docs(paged): correct decode-serving record to ~86% GPU-steady parity (graph-node-traced)" - the ~56% -> ~86% correction. - `bd100dd20` "fix(paged): repair the patch series, sync to the fork branch" - dropped dev-tree 0044/0045, added f32-only M5 as 0047. - `b028c81ed` "docs(paged): record padded/fixed-slot decode shape as tested-and-rejected". ### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels) 1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building. 2. **Current fork/mirror are clean and verified.** Local fork HEAD is `2d590d770`, DGX clean mirror HEAD is `2cbb61969`, and Phase 37 should be treated as the current patch-series tip. The old `llama-paged-dev` tree is historical only. 3. **Worktree patch series is tracked through 0063.** The only expected unrelated untracked path in this worktree is `.claude/`. 4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch). 5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign. --- ## 8. PHASE63 RESULT: PREFILL BUCKET ATTRIBUTION Phase63 is complete as a measurement-only no-go. The plan is `docs/superpowers/plans/2026-07-01-prefill-bucket-attribution-phase63.md`; the DGX artifact is `/home/mudler/bench/phase63_prefill_bucket/20260701_140127`. Pre/post gates stayed green: - MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`; - dense md5 `5951a5b4d624ce891e22ab5fca9bc439`; - `MUL_MAT` `1146/1146`; - `MUL_MAT_ID` `806/806`. The candidate paged FlashAttention mask/block-table cleanup is rejected for now: llama.cpp FA is only `0.71%` at `npp=512` and `1.18%` at `npp=2048`; the `npp=2048` cross-engine FA delta is about `1.7 us/tok`, not the `15 us/tok` needed to fund source work. No llama.cpp source files were modified. *Status: Phase63 closed. `VLLM_PARITY_FINAL.md` remains the GB10 shortcut record; the remaining measured buckets are still MoE/FFN GEMM, GDN, bf16 projections, layout copies, and activation quantization.* ## 9. PHASE64 RESULT: LAYOUT TRACE Phase64 added default-off layout attribution in the llama.cpp fork: `fa944bb5f feat(cuda): trace layout tensor names`. The env gate is `LLAMA_LAYOUT_TRACE=`. It traces CUDA `GET_ROWS`, `CPY`, `CONT`, `DUP`, and `CONCAT` runtime dispatch with tensor names, types, shapes, and contiguity flags. DGX artifact: `/home/mudler/bench/phase64_layout_trace/20260701_142519`. Patched build gates stayed green: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. Trace result at MoE `npp=512`, `ntg=4`, `npl=32`: - `get_rows`: `7268` - `cpy`: `2008` - `cont`: `1734` - `concat`: `990` The named layout sources are GDN conv-state gather/concat/update (`cache_r_lN`, `conv_states_reshaped-N`, `qkv_mixed_transposed-N`, `conv_input-N`, `conv_state_update-N`), MoE top-k fan-in gathers (`ffn_moe_probs-N`, `ffn_moe_topk-N`, `ffn_moe_weights-N`), and paged-attention mask/KV reshape/copy paths. This does not fund a clean layout optimization yet; it gives Phase65 the exact names needed to either remove one repeated chain or reject it with evidence. ## 10. PHASE65 RESULT: QUANT TRACE Phase65 added default-off activation-quant route attribution in the llama.cpp fork: `afc2c7030 feat(cuda): trace activation quant routes`. The env gate is `LLAMA_QUANT_TRACE=`. DGX mirror commit: `7863194bd`. DGX artifact: `/home/mudler/bench/phase65_quant_trace/20260701_143729`. Patched build gates stayed green: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. Trace result at MoE `npp=512`, `ntg=4`, `npl=32`: - `mmq_dense`: `4444` - `mmq_moe_dedup_unique`: `2960` - `mmq_moe_gather`: `2960` - `mmq_moe_flat`: `1480` The dominant default-path shapes are MoE gate/up expert activation quant deduplication (`K=2048`, `rows=512`) followed by gather to expert-token rows (`rows=4096`), shared-expert dense gate/up quantization (`K=2048`, `rows=512`), MoE down expert flat quantization (`K=512`, `rows=4096`), and shared-expert down quantization (`K=512`, `rows=512`). This confirms the activation-quant bucket is concentrated in named MoE/shared-expert FFN paths, but it does not prove whether `gather_mmq_fp4` is material or just a cheap cost of the existing dedup win. Phase66 should time `quantize_mmq_nvfp4` versus `gather_mmq_fp4` with nsys/NVTX before funding any behavior-changing source patch. ## 11. PHASE66 RESULT: QUANT KERNEL TIMING Phase66 timed the Phase65 candidate kernels directly with Nsight Systems. Artifact: `/home/mudler/bench/phase66_quant_kernel_timing/20260701_144256`. Profile: `quant_npp512.nsys-rep`; summary: `quant_npp512_kern_sum_cuda_gpu_kern_sum.csv`. Shape: MoE `npp=512`, `ntg=4`, `npl=32`. Total GPU kernel time: `7108388986 ns`. | kernel | time | instances | share | |--------|-----:|----------:|------:| | `quantize_mmq_nvfp4` | `317205504 ns` | `8884` | `4.46%` | | `gather_mmq_fp4` | `45374880 ns` | `2960` | `0.64%` | | combined | `362580384 ns` | - | `5.10%` | Decision: reject a Phase66 gather/quant source patch. The gather is too small to target, and quantize plus gather is below the `8%` source-funding threshold. Do not reopen W4A16/no-activation-quant from this evidence; that larger rewrite was already rejected in earlier phases. ## 12. PHASE67 RESULT: BF16 CUBLAS F32 OUTPUT Phase67 added a default-off BF16 projection shortcut in the llama.cpp fork: `ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output`. The env gate is `LLAMA_BF16_CUBLAS_F32_OUT=1`. DGX mirror commit: `14fd69f1e`. DGX artifact: `/home/mudler/bench/phase67_bf16_f32_out/20260701_144909`. Default and opt-in gates stayed green: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`. Same-window MoE prefill A/B: | npp | default S_PP | opt-in S_PP | change | |-----|-------------:|------------:|-------:| | `512` | `2347.41` | `2402.34` | `+2.34%` | | `2048` | `2440.18` | `2456.54` | `+0.67%` | The opt-in `npp=512` profile removed the BF16-to-F32 conversion row: `convert_unary<__nv_bfloat16, float>` became `0 ns`, `0` instances. Keep this as default-off for now. It is correctness-clean and measurable, but the win is small and needs dense plus serving A/B before any default-on decision. ## 13. PHASE68 RESULT: BF16 F32 OUTPUT DENSE + SERVING A/B Phase68 reused Phase67 source unchanged. Plan: `docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`. DGX artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710`; serving A/B artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`. Correctness basis for the exact source commit remains Phase67: default and `LLAMA_BF16_CUBLAS_F32_OUT=1` both produced MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, and `MUL_MAT 1146/1146`. Dense prefill stayed positive but tiny: | npp | default S_PP | opt-in S_PP | change | |-----|-------------:|------------:|-------:| | `512` | `973.13` | `975.52` | `+0.25%` | | `2048` | `1019.88` | `1021.39` | `+0.15%` | MoE serving A/B at `N=128`, prompt `128`, generation `128`, `--parallel 128`: | metric | default | opt-in | change | |--------|--------:|-------:|-------:| | `agg_tps` | `409.8` | `415.0` | `+1.27%` | | `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` | | `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` | | `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` | | `wall_s` | `39.978` | `39.480` | `-1.25%` | Decision: carry the shortcut as a default-off opt-in candidate. It is no longer just a prefill-only win, but Phase68 is not enough to default it on. Any future default-on proposal must mirror the fork commit into the LocalAI patch series and rerun a broader current serving snapshot with pre/post md5 and op gates. ## 14. PHASE69 RESULT: PATCH SERIES MIRROR READINESS Phase69 checked the patch-series state without pushing and without editing generated patch files. Plan: `docs/superpowers/plans/2026-07-01-patch-series-mirror-readiness-phase69.md`. Current committed LocalAI patches still match the Phase37 fork tip: ```text base=0ed235ea2c17a19fc8238668653946721ed136fd applied_tree=dedb1182910eafe9f6875588dc8285bfb544cce5 patch_tip_tree=dedb1182910eafe9f6875588dc8285bfb544cce5 fork_head_tree=fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4 match_patch_tip=yes match_fork_head=no patch_count=54 ``` Dry-run export from `2d590d770..ea0875d14` produced ten additive source-only patches, projected as `0064..0073`. Applying current `0001..0063` plus temp `0064..0073` onto the pin exactly reconstructed current fork HEAD: ```text applied_plus_missing_tree=fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4 fork_head_tree=fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4 match_fork_head=yes current_patch_count=54 missing_patch_count=10 projected_patch_count=64 ``` Projected patch tail: - `0064` serving admission trace (`c6cb8460e`) - `0065` admission histograms (`bd7b2e952`) - `0066..0068` TTFT prefill-first scheduler knobs (`8a97629a4`, `3b6ab5fa8`, `8759213e3`) - `0069..0070` W4A16 direct-activation policy/stub (`41be3da5b`, `7967ad47f`) - `0071` layout trace (`fa944bb5f`) - `0072` quant trace (`afc2c7030`) - `0073` BF16 cuBLAS F32 output (`ea0875d14`) Decision: mirror regeneration is technically ready but not executed. The local fork is `26` commits ahead of `fork/localai-paged`, and the fork-first policy requires pushing before regenerating the LocalAI series. Do not push without explicit approval. After approval, push the fork, regenerate `0064..0073`, rerun the same tree-hash check, and then run the broader serving gates before any default-on BF16 policy change. ## 15. PHASE70 RESULT: BF16 F32 OUTPUT BROADER SERVING Phase70 broadened the Phase68 serving evidence without source changes. Plan: `docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md`. Benchmark ledger: `backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`. DGX artifact: `/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500`. Gates stayed green. Default pre/post gates matched MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Opt-in pre/post gates matched MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, and `MUL_MAT 1146/1146`. Serving shape: MoE `NPL=8 32 128`, prompt `128`, generation `64`, `PARALLEL=128`. | n | opt/default agg | opt/default decode | opt/default TTFT | default decode/vLLM | opt decode/vLLM | |---:|----------------:|-------------------:|-----------------:|--------------------:|----------------:| | `8` | `0.8896` | `0.8998` | `1.1247` | `0.8100` | `0.7289` | | `32` | `0.9912` | `0.9974` | `1.0320` | `0.6882` | `0.6864` | | `128` | `1.0071` | `0.9882` | `0.9852` | `0.6921` | `0.6839` | Decision: reject default-on for `LLAMA_BF16_CUBLAS_F32_OUT=1`. The shortcut is correctness-clean, but it materially regressed low-concurrency serving and slightly widened the vLLM decode gap at `n=32` and `n=128`. Keep it default-off only and move the next parity effort to a different lever. ## 16. PHASE71 RESULT: GDN TENSOR-CORE REVALIDATION Phase71 challenged the stale GDN planning docs before starting more source work. Plan: `docs/superpowers/plans/2026-07-01-gdn-tc-revalidation-phase71.md`. Benchmark ledger: `backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`. DGX artifact: `/home/mudler/bench/phase71_gdn_tc_revalidation/20260701_153425`. Source under test stayed at DGX mirror commit `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. No llama.cpp source was changed. Canonical gates matched for all four GDN modes: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, and `GATED_DELTA_NET 46/46`. Default also passed `MUL_MAT 1146/1146` and `MUL_MAT_ID 806/806`. MoE prefill, `PP=512,2048`, `TG=4`, `B=32`, `CTX=131072`: | arm | npp512 S_PP | npp2048 S_PP | |-----|------------:|-------------:| | default | `2313.57` | `2422.88` | | sequential-disabled (`GDN_CHUNK_MIN=2147483647`) | `2198.28` | `2361.22` | | serial-chunked (`GDN_TC=0 GDN_CHUNK_MIN=64`) | `1787.49` | `1699.77` | | forced M5 (`GDN_TC=4 GDN_CHUNK_MIN=64`) | `2323.18` | `2420.52` | Decision: keep shipped GDN M5 default behavior. It still beats sequential-disabled by `+5.24%`/`+2.61%`, beats serial-chunked by `+29.43%`/`+42.54%`, and forced M5 is within noise of the current default. Do not reopen smaller GDN C32/QS/global-Ai32/kernel-reorder work on GB10. Post-Phase71 do-not-reopen list for GB10: - Smaller W4A16/MoE GEMM body, metadata, direct-activation, or quant/gather shortcuts. - GDN C32 slab, QS-early, Global-Ai32, or another low-conflict M5 reorder. - BF16 cuBLAS F32 output as a default-on policy. The only GDN work that should be reconsidered is a larger FLA/CuteDSL-class blocked-solve implementation or a hardware pivot where the GB10 constraints no longer apply. ## 17. PHASE72 RESULT: TTFT MIN32 BROADER SERVING Phase72 broadened the Phase59 min32 scheduler result to the same serving shape used by Phase70. Plan: `docs/superpowers/plans/2026-07-01-ttft-min32-serving-phase72.md`. Benchmark ledger: `backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`. DGX artifact: `/home/mudler/bench/phase72_ttft_min32_serving/20260701_160730`. Source under test stayed at DGX mirror commit `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. No llama.cpp source was changed. Gates stayed green. Pre default matched MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Pre/post min32 and post default md5 gates also matched MoE `8cb0ce23` and dense `5951a5b4`. Serving shape: MoE `NPL=8 32 128`, prompt `128`, generation `64`, `PARALLEL=128`. | n | min32/default agg | min32/default decode | min32/default TTFT | default decode/vLLM | min32 decode/vLLM | |---:|------------------:|---------------------:|-------------------:|--------------------:|------------------:| | `8` | `0.9302` | `0.9442` | `1.0379` | `0.7561` | `0.7140` | | `32` | `0.9414` | `0.9570` | `1.0977` | `0.7158` | `0.6850` | | `128` | `0.9699` | `0.9775` | `1.0300` | `0.6935` | `0.6779` | Decision: keep `LLAMA_TTFT_PREFILL_FIRST=1` plus `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in only. It regressed aggregate, decode, TTFT, and wall time at every tested concurrency in the broader shape, and widened the vLLM decode gap. Do not default this scheduler policy on GB10. ## 18. PHASE73 RESULT: DATACENTER BLACKWELL RERUN READINESS Phase73 is a no-new-benchmark decision/spec phase. Plan: `docs/superpowers/plans/2026-07-01-datacenter-blackwell-rerun-readiness-phase73.md`. Benchmark ledger: `backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`. No GPU benchmark was run and no llama.cpp source was changed. Source baseline remains DGX mirror commit `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. Decision: - Do not start more GB10 grouped-MMQ/W4A16 source work. Phase61 direct-A was the last structurally distinct W4A16 shortcut and failed its keep gate; Phase66 quantize plus gather was only `5.10%`, below the source-funding threshold. - Do not start GDN backend source work until a standalone C=64 blocked-solve PoC proves timing and numerical viability. Phase71 kept M5 as shipped; the remaining GDN gap is a larger FLA/CuteDSL-class blocked-solve/register-state implementation, not another C32/QS/global-Ai/local reorder. - The next parity evidence should come from datacenter Blackwell hardware with the existing same-session serving harness plus graph-node decode profiles. B200 rerun checklist: 1. Build and verify the llama.cpp paged binary on B200 or equivalent datacenter Blackwell hardware with the correct CUDA architecture/settings. 2. Install and verify vLLM `0.23.0+` with the intended Blackwell backend stack. 3. Confirm both model forms exist: `q36-35b-a3b-nvfp4.gguf` and `q36-35b-a3b-nvfp4-vllm`. 4. Run `paged-current-serving-snapshot.sh` with `NPL="8 32 128"`, `PTOK=128`, `GEN=64`, `PARALLEL=128`, `CTX=131072`, and B200-specific `VLLM_GPU_MEMORY_UTILIZATION`, `VLLM_MAX_NUM_SEQS`, and `VLLM_TENSOR_PARALLEL_SIZE`. 5. Before interpreting the artifact, require `hardware.txt` to say `hardware_class=datacenter_blackwell`, `gate_summary.tsv` to be green, pre/post MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT` and `MUL_MAT_ID` op gates green, and `summary.tsv` rows for both paged and vLLM. 6. Run decode/profile reruns with `nsys --cuda-graph-trace=node` and inspect whether vLLM is using native FP4/CUTLASS/FlashInfer rather than the GB10 Marlin fallback. Phase74 standalone GDN source-work gate result: ```sh nvcc -O3 -arch=sm_121a \ ~/scratch_tc_gdn_poc/gdn_blocked_solve_bench.cu \ -o ~/scratch_tc_gdn_poc/gdn_blocked_solve_bench ~/scratch_tc_gdn_poc/gdn_blocked_solve_bench \ --c 64 --dk 128 --dv 128 \ --iters 1000 \ --precision tf32,offdiag3x,apply3x \ --oracle f64 \ --dump-json ~/bench/phase74_gdn_blocked_solve_poc/20260701_143711/phase74_gdn_blocked_solve_poc.json ``` Artifact: `/home/mudler/bench/phase74_gdn_blocked_solve_poc/20260701_143711`. The standalone C=64 shared-memory explicit inverse-plus-apply scaffold did not fund backend source work: - weak decay: direct solve/apply `3.263936 ms`; inverse-plus-apply `5.493515 ms`; inverse/direct speed `0.5941x`; inverse NMSE `2.755e-15`; - mixed decay: direct solve/apply `3.275959 ms`; inverse-plus-apply `5.527584 ms`; inverse/direct speed `0.5927x`; inverse NMSE `7.541e-16`; - shared memory was already near the GB10 cap: direct `81920` bytes, inverse-plus-apply `98304` bytes, with `99 KB` opt-in available. Decision: do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for this C=64 inverse scaffold on GB10. A future GDN source-work gate must be a substantially different tensor-core blocked-solve/register-state design that shows a material timing win before backend changes. Phase75 follow-up audit: - llama.cpp already ships the M5 tensor-core GDN path default-on under paged KV: `KK/QK`, `KS/QS`, `P*U`, explicit `T=A^-1`, `U=T*RHS`, and `Kc^T*DU` state carry are covered in the current `C=16` GB10 path. - vLLM has a distinct one-token recurrent decode path that updates state directly and a packed decode path that avoids Q/K/V materialization copies, but this is not source-funded in llama.cpp without a fresh profile: prior parity evidence showed llama.cpp GDN decode already faster than vLLM and decode serving dominated by host/MoE synchronization. - vLLM's CuTeDSL GDN prefill path is useful reference material for datacenter Blackwell, but depends on SM10x/CUDA-13 features such as TMA/tcgen05/CUTLASS DSL and should not be treated as a portable GB10 patch base until the local toolchain proves support. Phase76 current-stack GB10 graph-node profile: - Artifact: `/home/mudler/bench/phase76_current_moe_profile/20260701_145116`. - Shape: MoE `q36-35b-a3b-nvfp4`, `n=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`, `CTX=131072`, production defaults. - Pre/post gates were green: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. - Serving under graph-node profiling: aggregate `204.1 t/s`, decode aggregate `320.7 t/s`, prefill `1490.1 t/s`, TTFT mean `8365.1 ms`, wall `40.146 s`. - Bucket result: GDN was the largest macro bucket, `6669.16 ms` (`32.88%`), ahead of MoE/FFN-GEMM `6264.88 ms` (`30.88%`) and BF16 projections `2772.38 ms` (`13.67%`). `gdn_core` alone was `5876.94 ms` (`28.97%`). This supersedes the Phase75 "datacenter only unless fresh profile" wording: Phase76 is that fresh profile. It does **not** justify an immediate backend patch because it is llama-only and graph-node tracing depresses absolute throughput, but it does fund one narrow GB10 follow-up before waiting for B200: prove whether vLLM's direct recurrent/packed decode idea can reduce the current `gdn_core` bucket. Current next gate: 1. Keep the B200/B100/GB200 Phase72 same-session rerun as the hardware-pivot gate when datacenter Blackwell is available. 2. In parallel on GB10, run a Phase77 GDN decode proof with pre/post md5 and op gates. Accept only if it materially reduces the Phase76 `gdn_core` bucket and does not regress serving throughput or canonical output md5. 3. Do not merge or default-on any `gated_delta_net.cu` change from this evidence alone; Phase76 is a profile gate, not a source patch gate. Phase77 decode-only profile result: - Artifact: `/home/mudler/bench/phase77_moe_decode_only_profile/20260701_150134`. - Shape: MoE `q36-35b-a3b-nvfp4`, `N=128`, long-running `/completion` requests, `N_PREDICT=2048`, capture after active decode. - Capture window: active slots `128`; median decoded depth `67` at start and `89` mid-capture. - Pre/post gates were green: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. - Bucket result: GDN `1489.71 ms` (`41.20%`) and MoE/FFN-GEMM `1400.77 ms` (`38.74%`). Fine bucket `gdn_core` was `1408.33 ms` (`38.95%`), slightly larger than `mmq_nvfp4` at `1383.50 ms` (`38.26%`). Phase77 supersedes the Phase75 "no GB10 GDN source work" stance for decode only. Do **not** reopen the failed C=64 prefill inverse scaffold. The funded GB10 source path is now a narrow, default-off GDN decode A/B or standalone PoC based on vLLM's direct recurrent/packed decode structure. The next patch must prove a material reduction in the Phase77 `gdn_core` bucket, keep canonical md5 and op gates green, and avoid serving/decode throughput regression under the same decode-only capture shape before it can be considered for merge or default. Phase78 launch-shape sweep: - Baseline: Phase77 default launch shape (`GDN_NW=16 GDN_CPW=8`) had `gdn_core 1408.33 ms` (`38.95%`) in the decode-only window. - `GDN_NW=8 GDN_CPW=8` artifact: `/home/mudler/bench/phase78_gdn_launch_sweep/nw8_cpw8_20260701_150654`. Gates were green, but `gdn_core` worsened to `1443.55 ms` (`39.68%`). - `GDN_NW=16 GDN_CPW=4` artifact: `/home/mudler/bench/phase78_gdn_launch_sweep/nw16_cpw4_20260701_150954`. Rejected before profiling: `MUL_MAT_ID` failed `805/806`. Decision: keep default `GDN_NW=16 GDN_CPW=8`. Do not retry existing `GDN_NW`/`GDN_CPW` launch-shape retunes unless a new profile gives a specific reason. The next GB10 source-funded work must be structural, default-off, and measured against the Phase77 decode-only `gdn_core` bucket. Phase79 BV32 decode source A/B: - Artifact root: `/home/mudler/bench/phase79_gdn_decode_bv32_ab/20260701_152530`. - Candidate source tree: `/home/mudler/llama-phase79-gdn-source`. - Candidate patch: one-file default-off CUDA decode-only kernel in `ggml/src/ggml-cuda/gated_delta_net.cu`, enabled by `GDN_DECODE_BV32=1`. Scope was `S_v=128`, one-token decode, scalar gate, final-state write-back. - Candidate build completed for `llama-completion`, `llama-batched-bench`, `test-backend-ops`, and `llama-server`. - Safety gates were green for the candidate default and opt-in paths: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. - A/B baseline post-gate initially failed `MUL_MAT 1145/1146` on a q4_1 case after profiling, but immediate retry in `A_baseline/gate_post_retry` was green; treat the first failure as a gate hiccup, not as accepted evidence. Result: | arm | env | GDN ms | `gdn_core` ms | `gdn_core` launches | `gdn_core`/launch | |-----|-----|-------:|--------------:|--------------------:|------------------:| | baseline | none | `1493.14` | `1411.46` | `600` | `2.352 ms` | | BV32 | `GDN_DECODE_BV32=1` | `1502.89` | `1426.17` | `570` | `2.502 ms` | Decision: reject the BV32 decode topology. It passed md5/op gates but worsened normalized `gdn_core` by about `6.4%` per launch and increased the GDN macro bucket. Do not carry this source patch forward. The next GDN source hypothesis should target recurrent-state precision/traffic or another structural delta from vLLM; reduced-precision recurrent state is promising but invasive and needs a separate scope. Phase80 identity-ids shortcut source A/B: - Artifact root: `/home/mudler/bench/phase80_gdn_identity_ids_ab/20260701_153927`. - Candidate source tree: `/home/mudler/llama-phase80-gdn-identity-source`. - Candidate patch: one-file default-off shortcut in `ggml/src/ggml-cuda/gated_delta_net.cu`, enabled by `GDN_ASSUME_IDENTITY_IDS=1`. It skips the GDN scratch gather for one-token final-state decode by assuming `ids[s] == rs_head+s` and reading from `state_dst` directly. - Candidate build completed for `llama-completion`, `llama-batched-bench`, `test-backend-ops`, and `llama-server`. - Baseline and candidate pre/post gates were green: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. Result: | arm | env | GDN ms | `gdn_core` ms | `gdn_gather` ms | GDN macro launches | |-----|-----|-------:|--------------:|----------------:|------------------:| | baseline | none | `1493.57` | `1411.65` | `0.79` | `3600` | | identity shortcut | `GDN_ASSUME_IDENTITY_IDS=1` | `1489.96` | `1409.28` | not present | `3000` | Decision: reject carry-forward/default. The shortcut safely removes the tiny `gdn_gather` bucket in this shape, but `gdn_core` is unchanged and the identity assumption is too narrow for a sub-millisecond capture-level win. Do not spend more parity time on gather-only GDN shortcuts unless a future profile makes gather material. The next serious GDN scope remains recurrent-state precision/traffic. ## Series trim (phases 110-140 review, 2026-07-02) The campaign's on-disk patches `0048-0063` were added without matching fork commits (a fork-first policy violation). After a keep/drop review of the phase 110-140 work, the series was trimmed to a single kept line plus the gate harness, and re-mirrored to the fork: - KEEP - test sentinels (the MoE gate harness): `MOE_SWIGLU_DOWN`, `MOE_SWIGLU_COMBINE`, `MUL_MAT_ID_RAGGED_MOE` (old `0051-0053`). - KEEP - the MTP-draft correctness fix (old `0054`): forces target-side sampler acceptance for MTP drafts (backend draft sampling can request multiple output rows per sequence); the backend ships `-mtp` gallery models. - KEEP - the Phase135 routed-FFN fused-quant line: whole-pattern MoE matcher + routed-FFN executor hook (Phase120/121), the routed-FFN PoC scaffold `moe-ffn.{cu,cuh}` (Phase132), and the fused SwiGLU-to-NVFP4-quant + raw down MMQ (`ggml_cuda_mul_mat_q_moe_quantized` + local `ggml_cuda_mmq_ids_meta` refactor, Phase135). All default-off, md5-clean opt-in, six `mmq_moe_quantized_raw` markers with zero sorted launches on the sentinel. - DROP - W4A16 grouped-tile pack/tune/pad (old `0048-0050`): dead line, W4A16 is ~1.5x slower than grouped-MMQ. - DROP - speculative/trace/cublas-route/mmid-route/mul-mat-route traces + the rejected small-M tile-policy knob (old `0055-0063`). - DROP - all other campaign keep-markers not needed by Phase135: GPU-sort (Phase110), W4A16-direct-A (Phase112), boundary trace/timing (Phase117), Phase133 sorted-F32 down, Phase134 fused-SWIGLU-only, Phase138 finalize/weighted-combine. The final fork tree carries zero of these markers. Fork branch `mudler/llama.cpp:localai-paged` re-mirrored on top of `51168c5ee` (LocalAI series `0001-0047`): - `fd920cf8a` test(paged): cover MoE swiglu down chain - `a85c1e098` test(paged): cover MoE weighted combine chain - `2fed6aacf` test(paged): cover ragged MoE dispatch - `f1d976f06` fix(speculative): disable backend sampling for MTP drafts - `1edddc8fe` feat(paged): whole-pattern MoE matcher + routed-FFN fused NVFP4-quant down MMQ New fork HEAD `1edddc8fe`, tree `097c862c`. The rejected/neutral levers of the 110-140 campaign are recorded above and in the per-phase bench artifacts. ## P1 bf16-native execution pass - LANDED (2026-07-02) First phase of the `EXECUTION_REARCH_SCOPE.md` additive program to land. `LLAMA_BF16_STREAM` (default-off) runs a bf16-resident residual-segment executor for the q36 MoE decision model's projection boundaries, deleting the per-op `f32->bf16` convert the stock cuBLAS-bf16 path pays at the projection `src1`. See the "P1 RESULT" subsection in `EXECUTION_REARCH_SCOPE.md` for the full record; summary and provenance: - **Verdict: GO / SHIP.** P0 kill-gate GO, P1 build-out and independent verify all correctness gates green, prefill positive-and-reproducible, KL-improving. - **Key reframe:** q36 GDN/attention projections (attn_qkv/gate, ssm_alpha/beta/out) are **BF16 weights, not NVFP4** - only the MoE experts (`ffn_*_exps`) are NVFP4. The convert tax lives at the BF16 cuBLAS projection boundary (`op_mul_mat_cublas` src0==BF16), so bf16-stream is a **MoE-model lever**; the dense model quantizes those projections to NVFP4 and engages nothing (stays bit-identical). - **Engagement:** P0 = 960 gate_norm->ssm_out segments/prefill; full build-out = 2240 (960 single-consumer 0044 ssm_out + 1280 multi-consumer plain-rms_norm -> {attn q/k/v, GDN in_proj}). - **Prefill (MoE @512 B=32):** +1.99% (2361.67 vs 2315.52 t/s, all 5 bf16 > all 5 ctrl; reproduced +1.89%); @2048 +0.95%; dense no-op (-0.09%). Recovered ~8.44 us/tok @512. At the noise floor -> classified neutral but reproducible; no regression. - **KL (MoE):** bf16 KLD 0.136042 vs control 0.136563 => delta -0.00052 (bf16 slightly better via the `LLAMA_BF16_CUBLAS_F32_OUT` plank keeping the full f32 GEMM result); same-top-p 84.461% vs 83.725% (>= 84% baseline). Dense: 0 engagements => bit-identical. - **Correctness:** default md5 canonical both models (MoE `8cb0ce23`, dense `5951a5b4`) present-but-off and env-on (small-M bails); `test-backend-ops` MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46, MOE_SWIGLU_DOWN 7/7, MUL_MAT_ID_RAGGED_MOE 6/6, BF16_STREAM_SEGMENT 4/4. - **Honest scope:** targets prefill bucket 3 (the ~4.8%-of-wall convert/glue tax) only, and owns the projection-boundary portion of it (~40% end-to-end) - not the GDN-scan (bucket 1, P5) or GEMM-tiling (bucket 2, P2/P3) buckets. Well below the scope's optimistic ~45 us/tok target by construction. Next increment = own the bf16->f32 dst direction + the remaining attn_norm-fed projection src1 converts. - **Deferred (blocked by an external imatrix job contending the GB10, NOT a failed gate):** the nsys graph-node bucket table, decode S_TG @npl128, and the Phase130 serving A/B need a clean idle-GPU re-run. Fork branch `mudler/llama.cpp:localai-paged` fast-forwarded on top of `1edddc8fe` (LocalAI series `0001-0052`) with three P1 commits: - `1271488fc` feat(paged): P1 bf16-stream residual-segment executor + norm-bf16 kernels (+ the re-introduced `LLAMA_BF16_CUBLAS_F32_OUT` plank) - `91373e1b9` feat(paged): P1 bf16-stream bf16 residual-add + rope op-variants - `653bb2f3d` test(paged): P1 bf16-stream BF16_STREAM_SEGMENT sentinel New fork HEAD `653bb2f3d`, tree `6cf1523047`. LocalAI series regenerated additively as `0053-0055` (46 patches total, `0001-0052` untouched); kill-gate at pin `0ed235ea` applied all patches and staged tree `6cf1523047` byte-for-byte == fork HEAD tree. Nothing pushed. Artifacts: `~/bench/p1_bf16_stream/killgate_20260702_135544` and `.../verify_20260702_161229` on the DGX; fork topic branch `p1-bf16-stream` retained for forensics. ## P2 expert-major fused MoE region - NO-GO (recorded 2026-07-02) Second phase of the `EXECUTION_REARCH_SCOPE.md` additive program. The P0 kill-gate for `LLAMA_MOE_REGION_EXECUTOR` (default-off) returned **NO-GO on two independent signals**, so per the phased contract nothing was built beyond P0 and nothing landed. See the "P2 RESULT" subsection in `EXECUTION_REARCH_SCOPE.md` for the full record; summary and provenance: - **Verdict: NO-GO / DO-NOT-SHIP.** The expected-recovery line (~40 of the +56.5 bucket-2 prefill tax + ~11 ms decode residual) was **not** delivered - the layout-only expert-major region is flat on its own sentinel and engages 0x on the decision model. - **(1) Primary GO metric flat.** Kill-gate needed the n=257 batched-large-M `MOE_SWIGLU_DOWN` rows to beat the grouped-MMQ control by > 5%. Measured (5x medians): control 1021.61 us, region 1022.15 us => **-0.05%** (marginally slower); n=128 -0.34%; `MUL_MAT_ID_RAGGED_MOE` (region never engages) n=257 +0.48% / n=128 +0.28% (noise). All four inside the 5-sample spread. This reproduces the six prior one-boundary transplants (phases 113/114/122/123/125/ 127) - the null hypothesis P2 had to beat. A compact expert-major *layout* + a single sort, with both GEMMs still ragged grouped-MMQ, does not change the ragged-tile tiling that owns the +56.5 tax; that needs P3's Marlin persistent-CTA, not a P2 layout swap. (Sentinel caveat: `eval_perf` duplicates only the down node ~n_runs times, so the region invocation is ~1/n_runs of the signal => under-sensitive; reported as the requested metric, corroborated by signal 2.) - **(2) Decisive structural blocker (prerequisite gap).** `q36-35b-a3b-nvfp4.gguf` ships **separate** `ffn_gate_exps` + `ffn_up_exps` (+ per-tensor `.scale`/`.input_scale`), NOT a merged `ffn_gate_up_exps` (GGUF tensor-name scan). `llama-graph.cpp` `build_moe_ffn` takes the separate-gate/up + `ggml_swiglu_split` branch, so the whole-pattern matcher's merged `gate_up(MUL_MAT_ID)->VIEW->VIEW->SWIGLU->down` shape is **absent**. The matcher, the region executor, AND the pre-existing POC/fused-quant all engage **0x** on q36 in prefill and decode. The region only engages on the synthetic merged-shape test sentinel. Even a positive sentinel could not translate to q36 without first rebuilding the seam for the separate/scaled/swiglu-split shape. - **KL: vacuously identical.** control and region KLD both 0.136563, same-top-p both 83.725% => delta 0.000000 (byte-identical only because the region engages 0x on q36; not an executor KL-neutrality claim). - **S_PP @512 (5x):** control 2320.62 vs region 2316.70 t/s = -0.17% (flat, region == control at 0 engagement; stdev 0.24% => capture-stable, no re-capture thrash). - **Correctness GREEN, both arms** (default AND env-on): MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46, MOE_SWIGLU_DOWN 8/8, MUL_MAT_ID_RAGGED_MOE 6/6, BF16_STREAM_SEGMENT 4/4. Default md5 canonical both models (MoE `8cb0ce23`, dense `5951a5b4`); env-on canonical (small-M bails). - **Prerequisite handoff (gates P2 AND P3).** Before any MoE-region lever can engage on q36, re-scope and rebuild the seam (whole-pattern matcher + POC/fused-quant + region executor) for q36's separate `ffn_gate_exps`/ `ffn_up_exps` + per-tensor `.scale` + `ggml_swiglu_split` FFN shape. Then re-evaluate a *fused two-GEMM* region (not a layout swap), per the scope's null hypothesis that the win exists only as the complete fused kernel that never materialises the intermediates. Implementation (correct, committed, NOT pushed, ~407 LOC / 6 files): `moe-ffn.cu` `ggml_cuda_moe_region_executor` (one route-sort ids_meta; gate_up grouped NVFP4 MMQ writes a compact expert-major buffer via iota ids_dst, token-order intermediate never materialised; `moe_swiglu_nvfp4_quant_compact_kernel` reads by route-slot; down MMQ unpermutes) + strict all-consumers guard `ggml_cuda_moe_region_consumers_ok` + `LLAMA_MOE_REGION_TRACE`. Fork `localai-paged` HEAD **untouched at `653bb2f3d`**; LocalAI series stays at 46 patches (`0001-0055`). Topic branch `mudler/llama.cpp:p2-moe-region` retained for forensics at `2d87564ddfa26f6c275dad0e1f0e3d8d5413e337` (base `653bb2f3d`, NOT pushed). Artifacts on the DGX: `~/bench/p2_moe_region/focused_20260702_172644/` (sentinels 5x, correctness OFF+ON, md5, S_PP@512 5x, KL) + `RESULTS.txt`, `.../killgate_20260702_171826/` (engagement proof, 0x on both models), `.../build_20260702_145928/` (build logs). ## P4 token-granular continuous-batching scheduler (CBv2) - NO-GO at the perf kill-gate (recorded 2026-07-02) Fourth phase of the `EXECUTION_REARCH_SCOPE.md` additive program. The P0 kill-gate subset for `LLAMA_CONTINUOUS_BATCH_V2` (default-off) was **implemented and correctness-proven green**, but the kill-gate's stated GO criterion - a **> 20% TTFT-under-load drop** with md5 green and serving-aggregate not regressed - was **NOT demonstrated**, so per the phased contract `go=false` was the kill-gate default, nothing was built beyond P0, and nothing landed. See the "P4 RESULT" subsection in `EXECUTION_REARCH_SCOPE.md` for the full record; summary and provenance: - **Verdict: NO-GO / DO-NOT-SHIP at the perf gate (scope-anticipated).** The P4 section frames CBv2 on GB10 as a TTFT + fairness + architecture-enabler lever, **not** a throughput lever (decode is GPU-compute-bound; the host-loop-dead measurement is real), so a NO-GO on the TTFT perf gate is the expected outcome and any throughput payoff is non-GB10 (out of scope). - **FINAL MEASURED VERDICT (A/B completed autonomously after the forced report; 60/60 raws, 5 reps/arm/shape; `dgx:~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md`): NO-GO CONFIRMED, and stronger than flat: CBv2-at-this-granularity REGRESSES.** TTFT-GO shapes: NONE. staggered N=32 TTFT p50 **+33.6% WORSE** (4559 -> 6091 ms, clears noise), mean +31.4% worse; staggered N=128 TTFT p50 +15.5% / mean +17.9% worse AND **aggregate/decode-agg -6.9% regressed beyond noise**; burst N=128 TTFT +10-13% worse, agg -3.9%; N=8 shapes neutral; the one positive was burst N=32 decode-agg +36.3% on a very noisy shape. ANALYSIS (do not re-litigate): fair-share chunked prefill is processor-sharing and delays every near-uniform prompt's prefill completion versus run-to-completion admission, so TTFT rises by construction; the "TTFT scaling is scheduler-shaped" premise is PARTIALLY REFUTED for GB10 - patch 0016's decode-first budget already captures the schedulable win, and vLLM's TTFT advantage here is dominated by its 2.6-2.8x prefill compute. TTFT parity routes through P3/P5 (prefill compute), not the scheduler. Fair-share may still pay on mixed long/short-prompt workloads and non-GB10 (host-bound) silicon; out of scope. - **Correctness gates all GREEN (DGX GB10, sm_121a), the substantive P0 result.** Behind `LLAMA_CONTINUOUS_BATCH_V2=1` (default OFF, byte-identical off): - **(a) canonical md5 GREEN both models, default-off AND cbv2-on:** paged-MoE `8cb0ce23`, dense `5951a5b4`. - **(c) `test-backend-ops` GREEN (zero-ggml side-effect proof):** MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46. - **(c) cursor-interleave PROVEN** (`LLAMA_CBV2_TRACE`, staggered N=20): steps co-batch decode AND prefill tokens with per-slot cursors advancing across steps (step=6: `n_decode_toks=5 n_prefill_toks=1535 n_seqs=20`, 15 partial cursors; slot s112 144/523 -> 281 -> 418 -> 519 over steps 6-9 while decode runs); adaptive fair-share cap tracks live load (410@5w, 171@12, 137@15); `dbucket==n_decode` => no fixed pad-to-parallel. - **(b) determinism = CBv2 NEUTRAL / correctness-preserving.** The paged concurrent greedy path is inherently non-deterministic run-to-run in the BASELINE too (a benign near-tied-argmax / co-batch FP-reduction-order property, `PAGED_BITEXACT_NOTE`), so the literal exact-match gate is unsatisfiable by any scheduler (control fails it too). The discriminating test - does CBv2 diverge from control more than control diverges from itself - PASSES across 8 configs {dense,moe} x {degenerate,natural} x {gen8,gen64}: cross-arm divergence tracks the within-arm baseline to +/-1-3 of 32. Single-sequence greedy is fully deterministic (the md5 gate). - **What would change the verdict (re-score path).** Read the finalized DGX `~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md` once the CANDIDATE arm completes (`p4_agg.py` auto-writes medians+stdev with the `> 20%`-drop GO logic baked in). If it shows a genuine `> 20%` staggered-TTFT drop clearing `max(2%, 3*stdev)` with md5 green and aggregate not regressed, re-score `go=true` and trigger the full P4 build-out: `SLOT_STATE_PREEMPTED` + release-KV-keep-prompt re-admit (paged burst-reclaim 0024 + `paged-alloc.cpp` defrag), aging/starvation-freedom + a constructed starvation test, preemption/aging unit tests, and a forced-preemption byte-identical-resume determinism gate. Else this NO-GO stands. Implementation (kill-gate subset only; correct, committed, NOT pushed; server-side only, ZERO `ggml/` files, ~68 LOC): `tools/server/server-context.cpp` thin integration + a NEW pure unit-tested header `tools/server/server-admission-policy.h` (namespace `cbv2`) + `server-admission-policy-test.cpp`. (1) Per-seq chunked-prefill cursors with a load-adaptive fair-share cap `ceil(prefill_leftover/n_waiting)` floored at `LLAMA_CBV2_CHUNK_MIN` (default 128, NOT `n_ubatch`, so a 512 prompt actually chunks under load); CBv2 activates the shipped 0016 decode-first budget by default (`T=n_batch`) and replaces 0016's fixed cap; cursor = `slot.prompt.n_tokens()`. (2) Adaptive decode bucket policy (`LLAMA_CBV2_DECODE_PAD` default 0 => `bucket==n_decode`, no padding per `DECODE_SERVING_SCOPE.md` net-negative; policy computed+traced only, never fed to batch formation => bit-exact-safe; row-emission is the deferred [Build phase]). Trace under `LLAMA_CBV2_TRACE=1`. Series-numbering flag: P0 comments label `[paged 0056]` per the fork's next slot, but the LocalAI worktree README is already ahead at `0056-0061` (MoE MMQ trace series) - reconcile on landing (likely `0062`). Fork `localai-paged` HEAD **untouched at `653bb2f3d`**; LocalAI series stays at 46 patches (`0001-0055`). Topic branch `mudler/llama.cpp:p4-cbv2` retained at `ebb649335fe7686524a3630ee2fdffce44be6d52` (base `653bb2f3d`, NOT pushed). Artifacts on the DGX `~/bench/p4_cbv2/`: `build_20260702_192141/` (build.log), `gates_20260702_192632/` (SUMMARY.txt: md5 x4, test-backend-ops, cbv2_trace.txt, determinism tsvs), `det2_20260702_193123/` + `det3_20260702_193649/` + `det4_20260702_194040/` (determinism diff-matrix), `perf_20260702_194359/` (raw_*.json + auto-written RESULTS.md). Environment: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`, `LLAMA_MAX_BATCH_TOKENS` unset, sm_121a, GPU lock held. ## P5 FLA-faithful GDN prefill scan (blocked solve_tril port) - NO-GO at the perf kill-gate; the GDN prefill bucket is a CONFIRMED SHARED-HARDWARE FLOOR (recorded 2026-07-02) Fifth phase of the `EXECUTION_REARCH_SCOPE.md` additive program, and its **strictest kill-gate**. The full six-kernel vLLM-FLA `chunk_gated_delta_rule_fwd` pipeline was **ported to CUDA tf32 mma, per-kernel validated vs a host fp64 reference, integrated behind `LLAMA_GDN_FLA_CHUNK=1` (default-off), and A/B'd in-backend** against the shipped M5 f32 chunked scan. It lost decisively and by the wrong sign, so `go=false` was the kill-gate default, nothing was built beyond P0, and nothing landed. See the "P5 RESULT" subsection in `EXECUTION_REARCH_SCOPE.md` for the full record; this closes the last speculative prefill lever in the program. Summary and provenance: - **Verdict: NO-GO / DO-NOT-SHIP at the perf gate (the scope-anticipated "expected null").** The P5 section framed this as the strictest kill-gate given Phase74's standalone blocked-inverse 0.59x. P5 delivered the one thing prior evidence lacked: **the whole FLA pipeline in-backend with the register/smem-resident inter-chunk state and the chunk loop in-kernel** - the form that "was never actually tested in-backend." It was tested here and **settles the GDN prefill bucket (bucket 1, +59.2, the single largest prefill lever) as a shared-hardware / memory-bandwidth floor on GB10.** - **PERF GO GATE FAILED DECISIVELY.** GO required the in-pipeline blocked `solve_tril` to beat M5 by **> 10% at npp2048**. Measured (nsys `--cuda-graph-trace=node`, MoE `q36-35b-a3b-nvfp4`, per distinct token over 30 GDN layers): **npp2048 M5 56.31 vs FLA 119.46 us/tok = FLA 2.12x SLOWER** (`gdn_delta_pct_2048 = -112.1`); **npp512 M5 51.23 vs FLA 117.35 = 2.29x slower.** End-to-end **S_PP regressed MoE -13.33% @npp2048 / -13.12% @npp512** (3-rep medians; wrong sign, no 3-sigma question). The shipped M5 stays `gdn_core` at **56.31 us/tok = 64.82% of vLLM's FLA chunk-64 36.5 us/tok**; the rejected FLA port was only **30.55% of vLLM** (36.5/119.46). Reproduces Phase74's standalone 0.59x and extends bf16-C64 (-18.75%), now **confirmed in-backend.** - **WHERE THE TIME WENT (the novel, valuable decomposition - why this NO-GO matters).** Per-kernel nsys share of the FLA bucket: **blocked `solve_tril` only ~2.8% (55.6 ms)** - the algorithm the phase was about is *cheap*. Dominated by **`fwd_h` 46.2% (903 ms) + `fwd_o` 31.5% (617 ms)**: the inter-chunk state-recurrence GEMMs + the **per-chunk h-state materialization to global LPDDR5x** that FLA's split-kernel structure forces (`fwd_h` writes `h_pre` per chunk, `fwd_o` re-reads it). The fused M5 single kernel keeps the 128x128 state **resident in smem, never materializes per-chunk h**, so it is **2.1x faster on GB10's low-bandwidth memory.** Novel finding vs all prior evidence: **the blocked solve is not the floor - the floor is the state-GEMM + h-materialization region, which the FLA structure makes WORSE than M5.** The binding silicon property is **LPDDR5x memory bandwidth** (per-chunk h round-trip), compounded by the **99 KB smem cap** that forces the `fwd_h`/`fwd_o` split - not mma shapes or wave count. - **Correctness / cap gates (recorded, not decisive):** - **SMEM GATE PASSES** (all six kernels under the 99 KB opt-in cap at C=64; `cudaOccupancyMaxActiveBlocksPerMultiprocessor`): `k_kkt` 48 KB / 2 blk, `k_solve` 38 KB / 2 blk, `k_wu` 48 KB / 2 blk, `k_fwdh` 80 KB / 1 blk, `k_fwdo` 96 KB / 1 blk - **max 96 KB < 99 KB.** The kernels fit; they are bandwidth-floored above M5. - **KL BAND GREEN / in-band:** FLA `KLD 0.137028` vs control `0.136563` = **delta +0.000465 < 0.01**; same-top-p **84.61% vs 83.73%** control (>= 84% baseline). Per-kernel bring-up vs host fp64: **o NMSE 2.2e-7, final-state 1.2e-7** (before integration, per the "do not debug six kernels blind" rule). - **DEFAULT PATH UNTOUCHED:** canonical md5 GREEN both models, **default-off AND `LLAMA_GDN_FLA_CHUNK`-on** (paged-MoE `8cb0ce23`, dense `5951a5b4`; small-M greedy bails to M5). `test-backend-ops GATED_DELTA_NET` **DEFAULT 46/46 OK.** Decode untouched (`GDN_CHUNK_MIN` untouched; decode stays on the sequential recurrence). - **`test-backend-ops` env-on = 43-44/46** (`gdn_op_tests_env_on_green=false`): the FLA-engaged `head_size=128, n_seq_tokens>=64` cases marginally exceed `1e-7` (**ERR 1.03-1.06e-7**, fluctuating across the boundary) because the port uses plain **tf32** where M5 uses **3xtf32 (CUTLASS fp32-emulation)** for the decay-coupled compounding state products; M5-chunked passes the SAME cases at `< 1e-7`. Judgment: marginal tf32-vs-3xtf32 gap, benign at model level (KL green); 3xtf32 would add mma count and deepen the perf NO-GO, so not pursued. - **Engagement PROVEN:** `LLAMA_GDN_FLA_TRACE` fired `[gdn-fla] engage H=32 ...` in `batched-bench`; nsys shows all six `gdn_fla::` kernels under `LLAMA_GDN_FLA_CHUNK=1` and none under default. - **Honest delta vs the +59.2 expectation.** Scope expected `~0-10 of the +59.2, "likely a shared-hardware floor."` Delivered: **0 recovery, a -63 us/tok regression on the FLA arm; the floor is confirmed.** M5's fused smem-resident chunked scan (56.31 us/tok) is the winner and is at/near the GB10 memory-bandwidth floor for this op. What binds is silicon (LPDDR5x bandwidth on the per-chunk h round-trip + the 99 KB smem cap forcing the split), not the algorithm; it lifts only on datacenter Blackwell (HBM + larger smem + TMEM), consistent with the scope's section-4 framing. Protocols honored: GPU lock held throughout and released; `LLAMA_MAX_BATCH_TOKENS` unset; sm_121a; nsys `--cuda-graph-trace=node`; 3+ iter S_PP medians; no external contention. WIP on the DGX fork topic branch `p5-fla-gdn` at `2d64c37f08ad323038a44a89ab32189527c6ba29` (base `localai-paged` `653bb2f3d`, **NOT pushed, NOT landed**): new `ggml/src/ggml-cuda/gdn-blocked-solve.cu` + narrow dispatch in `gated_delta_net.cu` / `gated_delta_net.cuh`. Fork `localai-paged` HEAD **untouched at `653bb2f3d`**; the LocalAI series **stays at 46 patches (`0001-0055`)**; topic branches `p1-bf16-stream` / `p2-moe-region` / `p4-cbv2` left intact. Artifacts on the DGX `~/bench/p5_fla_gdn/`: `killgate_20260702_204225/` (RESULTS.md, spp_control.txt, spp_fla.txt, `nsys_{ctrl,fla}{2048,512}.{nsys-rep,kern.csv}`, GATES.txt, `kl_moe_{ctrl,fla}.log`, occupancy.txt, gdn-blocked-solve.cu, p5_fla_test.cu) and `standalone_20260702_203434/` (RESULTS.txt + p5_fla_test.cu, p5_m5_time.cu, m5_kernel_body.cuh). ## P6 fp8-e4m3 KV cache (final program phase) - BLOCKED-ON-INFRA; kill-gate never ran; the analytical decode ceiling is the recorded artifact (recorded 2026-07-02) Sixth and final phase of the `EXECUTION_REARCH_SCOPE.md` additive program. **It did not run the kill-gate.** The DGX/GB10 - the only box with the GPU, the fork (`localai-paged` `653bb2f3d`), and the models - was **unreachable for the entire P6 window** (P0 session, build session, and this recording session). Its sole access path (cloudflared `access ssh` via `prem-vm` -> `jp-6.prem.io/c1f2af2fae580`) returned **HTTP 530 / "websocket: bad handshake" / "Connection closed by UNKNOWN port 65535" on every probe** (10+ attempts across sessions; re-confirmed 2026-07-02 with 5 fresh probes). This recording box (`mudler-ubuntu-box`) has **no GPU** and no local fork checkout, so **Stage 0a** (measured nsys `--cuda-graph-trace=node` decode ceiling) and **Stage 0b** (fp8-e4m3 kernel + kill-gate A/B) were **physically impossible.** `go=false` because the gate could not execute; `stopped_at_ceiling=false` because the analytical ceiling does not uniformly kill the lever. **This is an honest infra-block, NOT a measured NO-GO and NOT a NO-GO-by-ceiling.** See the "P6 RESULT" subsection in `EXECUTION_REARCH_SCOPE.md` for the full record. Summary: - **THE ANALYTICAL DECODE CEILING (the valuable artifact; ESTIMATES, UNMEASURED).** From the HNP decode decomposition (`VLLM_PARITY_FINAL.md` 2b) + the flash-attn ~14 us/tok / ~1.3% prior. Decode step ~1082 us/tok GPU-steady, of which **~1068 is context-INDEPENDENT** (GDN scan 553, NVFP4 expert GEMM 254, bf16 proj 73, elementwise 57, ssm conv 31); **only flash-attn is LINEAR in context.** fp8-e4m3 halves KV bytes -> theoretical-MAX decode saving = **0.5 x fa_kvread_share**: **ctx256 0.65%** (standard shape - hard NO-GO), **ctx1024 2.55%**, **ctx2048 4.98%** (first crosses +3%), **ctx4096 9.49%**, **ctx8192 17.34%.** Any realizable win lives **ONLY at ctx >= 2048**; standard serving shapes (ctx ~256, npl 128/256) are a **definitive ceiling NO-GO.** Matches lever-map B2. - **HYBRID-GDN STRUCTURAL CAP.** q36 is hybrid GDN: **only 10 of 40 layers are full attention with KV**; the other **30 are GDN** with a fixed-size recurrent state and **NO KV** (does not grow with context). fp8 can only touch the 10/40 KV slice - it cannot move the 30 GDN layers, which is why flash-attn is such a small decode fraction at modest ctx. - **THE DOMINANT NULL STANDS UNREFUTED.** The ceiling is a theoretical MAX; realized A/B = ceiling minus fused-dequant-in-attention cost minus non-KV-read flash-attn minus paged block-table gather indirection. **Q8_0 KV was a MEASURED +7.8% decode REGRESSION on GB10** (2026-06-23, dense-32B era) where flash-attn was a LARGER decode fraction - dequant cost exceeded the BW saving even in a more favorable regime. fp8-e4m3's only edge is its cheaper hw-convert dequant plus **vLLM shipping fp8-e4m3 KV on this exact model without visible penalty** (mechanism sound in principle). Whether OUR fa/paged-attn path realizes it on GB10 long-ctx shapes is exactly what Stage 0a/0b must MEASURE; the null predicts the residual may go **negative even at long context.** - **CAPACITY-PLAY FRAMING (remains OPEN).** As a **throughput** lever fp8-KV is a ceiling NO-GO at standard shapes and null-dominated at long ctx. As a **memory / capacity** feature it is a different, un-run gate: halving stored KV bytes for the 10/40 attention layers is a real long-context / high-concurrency capacity win (more sequences or longer contexts per fixed VRAM), independent of any throughput delta. **fp8-KV as a capacity feature stays open for a future capacity-motivated effort even if throughput-flat.** - **DEFAULT PATH: PROVABLY UNDISTURBED, NOT RE-VERIFIED THIS SESSION.** No P6 code exists, so there is provably zero diff vs `653bb2f3d` and zero overlap with P3's `w4a16*`/`mmq*` files. Canonical md5s (MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`) are documented green-with-code-present from prior phases but were NOT rebuilt/re-run this session (no GPU); recorded as such, not overclaimed. Provenance: **no `p6-fp8-kv` topic branch created** (DGX down). Fork `localai-paged` HEAD **untouched at `653bb2f3d`**; the LocalAI series **stays at 46 patches (`0001-0055`)**; P3's `p3-w4a16-direct` work **untouched**. The only P6 artifacts are **local-only staging scripts** (not on the DGX): `scratchpad/p6_stage0a_ceiling.sh` (staged Stage 0a nsys difference-method profiler: standard npp512/npl128 + long-ctx npp4096/8192 x npl8/32, both models, honors the shared GPU lock) and `scratchpad/p6_ceiling_extract.py` (fa-bucket analyzer). `~/bench/p6_fp8_kv/` was never created. HANDOFF (box-up required): scp the two scripts to the DGX; `git -C ~/llama-paged-fork worktree add ~/llama-paged-p6 -b p6-fp8-kv 653bb2f3d` (SEPARATE worktree, no collision with P3's checkout); build unmodified (sm_121a, nohup+poll), share `~/gpu_bench_lock` politely with P3; run Stage 0a measured graph-node decode profiles at standard + long-ctx shapes on both models; if the measured long-ctx ceiling stays < +3% at all realistic shapes -> NO-GO-BY-CEILING, record, stop; else Stage 0b fp8-e4m3 behind `LLAMA_KV_FP8=1` with **static** per-tensor/per-head scales, dequant **FUSED** in the fa/paged-attn read (the P5 lesson), then the kill-gate A/B at long-ctx with per-path KL (paged AND non-paged, both models, KLD delta < 0.01 + same-top-p >= 84%) + default md5 + test-backend-ops. Number the series dynamically at land time (P3 may land 0056+ first).