docs(paged): record phases 112-140 + series trim decision

Record the phase 110-140 GDN/MoE campaign benchmark log and append the series-trim decision to the parity handoff: keep the Phase135 routed-FFN fused-quant line plus the MoE test sentinels and the MTP-draft correctness fix; drop the W4A16 structural line, the trace/tile-policy patches, GPU-sort, W4A16-direct-A, and the finalize fusion. Rejected/neutral levers are recorded in the handoff and the per-phase bench artifacts. Fork re-mirrored on 51168c5ee: fd920cf8a a85c1e098 2fed6aacf f1d976f06 1edddc8fe (HEAD tree 097c862c). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-03 04:46:54 -04:00 · 2026-07-02 10:16:53 +00:00
parent 67d2c4c9d4
commit 1aba41082b
2 changed files with 4967 additions and 14 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -5,6 +5,575 @@
 > and `GB10_PARITY_PHASE0_RESULTS.md`, with Phase 6 serving-nsys evidence and
 > the active follow-up plans under `docs/superpowers/plans/`. Use those files for
 > the current state before relying on the older "closed" conclusion below.
+>
+> 2026-07-01 Phase112 update: keep the new default-off
+> `LLAMA_W4A16_DIRECT_A=1` direct activation staging hook, especially combined
+> with Phase110 `LLAMA_MOE_GPU_SORT=1`. Artifact:
+> `/home/mudler/bench/phase112_w4a16_direct_a/20260701_231749_direct_a`.
+> Selected gates passed `13/13` for W4A16+GPU-sort, direct-A, and
+> direct-A+GPU-sort. Direct-A+GPU-sort improved the 257-token W4A16 fallback
+> rows versus W4A16+GPU-sort control (`MOE_SWIGLU_DOWN 1551.08 -> 1477.74 us`,
+> `MUL_MAT_ID_RAGGED_MOE 2278.50 -> 2166.22 us`) but was neutral/slightly
+> slower on 128-token rows. Canonical README md5 gates are green: MoE
+> `8cb0ce23777bf55f92f63d0292c756b0`, dense
+> `5951a5b4d624ce891e22ab5fca9bc439`; compact supported op gates are green
+> (`SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`, `GET_ROWS 49/49`,
+> `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`).
+> This is still default-off structural groundwork, not parity: W4A16 fallback
+> remains slower than the default grouped-MMQ path. Use the patch-series README
+> md5 command as canonical; the handoff `-no-cnv -c 4096` snippet produced
+> stable but non-canonical md5s for both candidate and control.
+>
+> 2026-07-01 Phase113 update: reject the combined direct-A GPU-tile descriptor
+> attempt. Artifact:
+> `/home/mudler/bench/phase113_w4a16_direct_a_gpu_tiles/20260701_233345_no_readback`.
+> The candidate (`LLAMA_W4A16_GPU_TILES=1` on top of Phase112 direct-A+GPU-sort)
+> avoided the `n_tiles` readback by launching over zero-initialized `max_tiles`
+> and returning early on `rows <= 0`. Selected correctness passed `13/13`, but
+> perf failed the keep gate: `MOE_SWIGLU_DOWN n=257` was flat
+> (`1478.16 -> 1476.36 us`) and `MUL_MAT_ID_RAGGED_MOE n=257` regressed
+> (`2148.44 -> 2214.23 us`). The source was reverted and post-revert
+> Phase112 direct-A+GPU-sort selected gates passed `13/13`. Next W4A16/MoE work
+> should not revisit compact GPU tile descriptors; use vLLM-style padded routing
+> metadata (`sorted_token_ids`, expert ids per M block, padded row count) if
+> continuing this line.
+>
+> 2026-07-01 Phase114 update: reject the naive padded routing implementation.
+> It implemented the vLLM-style metadata contract with separate padded source
+> ids and destination ids for llama.cpp, plus an expert-id W4A16 consumer mode
+> and a direct scatter that skipped compact `get_rows_cuda`. Correctness passed
+> (`13/13`) but perf failed: after a fix using `num_tokens_post_pad` early
+> returns, `MOE_SWIGLU_DOWN n=257` regressed `1477.88 -> 1726.27 us` and
+> `MUL_MAT_ID_RAGGED_MOE n=257` regressed `2163.35 -> 2650.93 us`. Artifacts:
+> `/home/mudler/bench/phase114_w4a16_padded_routing/20260701_234634_padded_meta`
+> and
+> `/home/mudler/bench/phase114_w4a16_padded_routing/20260701_235003_padded_meta_fix1`.
+> Source was reverted; post-revert Phase112 direct-A+GPU-sort selected gate
+> passed `13/13`. Padded metadata is not enough by itself on GB10 because sparse
+> expert occupancy makes padded activation/output traffic too expensive.
+>
+> 2026-07-02 Phase115 update: reject another small-M/tile-policy shortcut.
+> Phase115 re-tested the existing default-off `LLAMA_MOE_SMALL_M_TILE=16/32/64`
+> knob on the newer Phase108 whole-graph MoE sentinels. Artifact:
+> `/home/mudler/bench/phase115_moe_small_m_sentinel/20260702_020258`.
+> Control and all three tile caps passed selected correctness (`13/13` each),
+> but no candidate met the promotion rule. The 257-token ragged down row
+> regressed for every cap (`1452.30 us` control vs `1455.02`, `1458.71`, and
+> `1456.88 us`). Do not add name-based down special cases or another MMQ
+> tile-policy patch. The next credible target is a true fused routed-MoE kernel
+> or a graph-level fusion that removes materialized activation/output traffic.
+>
+> 2026-07-02 Phase116 update: reject the standalone graph-level
+> SwiGLU-to-MMQ-activation-quant fusion. The default-off candidate
+> `LLAMA_MOE_SWIGLU_DOWN_FUSED_QUANT=1` detected the plain
+> `GLU -> down MUL_MAT_ID` pattern and computed `silu(gate) * up` directly into
+> the grouped-MMQ NVFP4 activation buffer. Artifact:
+> `/home/mudler/bench/phase116_moe_swiglu_down_fused_quant/20260702_022611`.
+> Correctness passed (`13/13`) and the fix1 route emitted the fused marker
+> (`6` hits), but perf was not useful: `MOE_SWIGLU_DOWN n=257` was flat
+> (`1024.90 -> 1024.69 us`), `n=128` regressed (`806.33 -> 808.79 us`), and the
+> ragged sentinel drifted slower. Source was reverted and post-revert selected
+> gate passed `13/13`. Do not retry this narrow fused-quant route; the next
+> fused-MoE attempt must remove a larger boundary, such as route-once metadata
+> shared by both expert GEMMs plus fused GEMM1/activation/GEMM2 or
+> weighted-combine/scatter.
+>
+> 2026-07-02 Phase117 update: keep the default-off MoE boundary trace as
+> diagnostic instrumentation only. Artifact:
+> `/home/mudler/bench/phase117_moe_route_once_boundary/20260702_024140`.
+> The trace decomposes `MOE_SWIGLU_DOWN` into route-sort, activation
+> quantization, grouped-MMQ launch, GLU, and graph-pattern records under
+> `LLAMA_MOE_BOUNDARY_TRACE=1`; optional timing is gated by
+> `LLAMA_MOE_BOUNDARY_TIMING=1`. Inline CUDA event timing initially aborted
+> under CUDA graph capture, so the guarded trace emits `us=-1` while capturing
+> and only produces real event timings with `GGML_CUDA_DISABLE_GRAPHS=1`.
+> Post-guard selected gates passed (`13/13`), trace mode passed (`7/7`), and
+> canonical gates passed: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`,
+> `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. The timing attribution does not
+> fund another local route-sort, tile, GLU, or activation-quant shortcut. The
+> next MoE source phase should own a larger pipeline boundary: shared
+> route-once metadata across gate_up/down and/or whole-pattern
+> GEMM1->activation->GEMM2 execution.
+>
+> 2026-07-02 Phase118 update: reject standalone route-metadata caching.
+> Artifact:
+> `/home/mudler/bench/phase118_moe_route_cache/20260702_030549`. The
+> default-off candidate `LLAMA_MOE_ROUTE_CACHE=1` stored ids-derived grouped-MMQ
+> route metadata in context-owned buffers and reused it within a graph
+> evaluation. It was correctness-clean (`13/13` default, opt-in, and
+> post-reject) and the trace showed reuse (`23` hits, `3` misses on
+> `MOE_SWIGLU_DOWN n=128`), but perf was too small: `MOE_SWIGLU_DOWN n=257`
+> improved only `1017.711 -> 1011.915 us` (`+0.57%`) and `n=128` regressed
+> `799.360 -> 803.738 us` (`-0.55%`). Runtime cache source was reverted; only a
+> local `ggml_cuda_mmq_ids_meta` helper refactor remains as low-conflict
+> groundwork. Do not retry metadata-cache-only work. The next attempt must own
+> more of the vLLM-style pipeline: GEMM1->activation->GEMM2 and/or
+> scatter/combine, not just skipping one `mm_ids_helper` launch.
+>
+> 2026-07-02 Phase119 update: keep the default-off whole-pattern contract trace
+> after fix1. Initial artifact:
+> `/home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_034729`;
+> fix1 artifact:
+> `/home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_035126_fix1`.
+> The initial trace proved coverage but missed the overhead rule on
+> `MOE_SWIGLU_DOWN n=257` (`1015.070 -> 1028.937 us`, `-1.35%`). Fix1 moved
+> detector work off the default path unless `LLAMA_MOE_WHOLE_PATTERN_TRACE` or
+> the existing boundary trace is enabled. Fix1 gates are green: selected
+> `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` `13/13`, trace `MOE_SWIGLU_DOWN`
+> `7/7`, canonical MoE md5 `8cb0ce23`, dense md5 `5951a5b4`,
+> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Trace overhead is now within
+> rule (`MOE_SWIGLU_DOWN n=128` `805.400 -> 805.584 us`, `-0.02%`;
+> `n=257` `1019.715 -> 1021.836 us`, `-0.21%`) and emits supported NVFP4
+> markers for both `n_tokens=128` and `257`. This is diagnostic scaffolding,
+> not a runtime optimization. The next executor attempt should match at the
+> earlier `gate_up MUL_MAT_ID` node and skip through `VIEW, VIEW, GLU, down
+> MUL_MAT_ID`; the current `GLU -> down` hook is validation-only because GEMM1
+> has already executed.
+>
+> 2026-07-02 Phase120 update: keep the default-off early whole-pattern matcher
+> after fix2. Initial artifact:
+> `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040153`;
+> fix2 artifact:
+> `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040725_fix2`.
+> The initial/fix1 versions proved `skip_ready=4` but emitted noisy unsupported
+> markers from unrelated `MUL_MAT_ID` candidates. Fix2 emits only the actual
+> early pattern and is clean: selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE`
+> `13/13`, early trace `MOE_SWIGLU_DOWN` `7/7`, canonical MoE md5
+> `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and
+> `MUL_MAT_ID 806/806`. It emits exactly six supported early markers for the
+> perf sentinels, covering `n_tokens=128` and `257`, with `skip_ready=4`,
+> `ids_match=1`, and `swiglu=1`. Trace overhead is within rule
+> (`MOE_SWIGLU_DOWN n=128` `803.937 -> 808.978 us`, `-0.62%`;
+> `n=257` `1020.412 -> 1026.073 us`, `-0.55%`). The next source phase can now
+> implement a guarded executor at this early matcher. First prove safe
+> ownership/skip accounting for the five-node sequence, then move route-plan
+> reuse and activation/down execution into the helper.
+>
+> 2026-07-02 Phase121 update: keep the default-off executor proof after fix1.
+> Initial artifact:
+> `/home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041543`;
+> fix1 artifact:
+> `/home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041739_fix1`.
+> The initial run passed correctness but emitted zero exec markers because the
+> exec branch was accidentally nested under the early-trace env condition.
+> Fix1 makes `LLAMA_MOE_WHOLE_PATTERN_EXEC=1` engage independently. Gates are
+> green: selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` `13/13`, exec
+> `MOE_SWIGLU_DOWN` `7/7`, canonical MoE md5 `8cb0ce23`, dense md5
+> `5951a5b4`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Exec perf emits
+> six `skip=4` markers covering `n_tokens=128` and `257`, and target perf is
+> neutral (`MOE_SWIGLU_DOWN n=128` `807.772 -> 806.051 us`, `+0.21%`;
+> `n=257` `1021.115 -> 1020.839 us`, `+0.03%`). This proves ownership and skip
+> accounting only; it is not a fused-MoE speedup. The next source phase should
+> replace one internal boundary inside this helper, preferably route-plan reuse
+> or activation in route-slot order, with the same md5/op gates.
+>
+> 2026-07-02 Phase122 update: reject route-only metadata reuse inside the
+> Phase121 executor. Artifact:
+> `/home/mudler/bench/phase122_moe_shared_route_meta/20260702_043212`.
+> The candidate exposed `ggml_cuda_mmq_ids_meta` as a public MMQ helper and
+> used `LLAMA_MOE_WHOLE_PATTERN_SHARED_ROUTE=1` to build route metadata once
+> for both `gate_up` and `down`. Correctness passed (`13/13` selected and
+> `7/7` shared-route), but perf missed the keep gate:
+> `MOE_SWIGLU_DOWN n=128` regressed `808.190 -> 811.836 us` and `n=257`
+> regressed `1020.850 -> 1051.666 us` versus the Phase121 executor. Source was
+> reverted, including the public metadata API and shared-route env. Post-reject
+> gates on the reverted tree passed (`13/13` selected and `7/7` Phase121 exec)
+> with six retained exec markers. Do not retry route-only metadata reuse. The
+> next MoE executor scope should target activation/down data layout, direct
+> activation-to-down input, or a larger GEMM1->activation->GEMM2 fused boundary.
+>
+> 2026-07-02 Phase123 update: reject standalone fused-down activation
+> quantization inside the Phase121 executor. Artifact:
+> `/home/mudler/bench/phase123_moe_executor_fused_down_input/20260702_025811`;
+> red check:
+> `/home/mudler/bench/phase123_moe_executor_fused_down_input/red_20260702_025031`.
+> The candidate used `LLAMA_MOE_WHOLE_PATTERN_FUSED_DOWN=1` to run `gate_up`,
+> compute `silu(gate) * up` directly into the sorted NVFP4 down MMQ activation
+> buffer, and launch the existing down MMQ kernel. Correctness passed
+> (`13/13` selected, `7/7` fused-down, six fused markers), but perf was flat:
+> versus Phase121 exec, `MOE_SWIGLU_DOWN n=128` was
+> `811.153 -> 810.618 us` and `n=257` was `1023.090 -> 1023.657 us`.
+> Source was reverted, including the fused-down env, MMQ helper, and NVFP4
+> fused quant kernel. Post-reject gates passed (`13/13` selected, `7/7`
+> Phase121 exec, six exec markers). Do not retry a single-boundary
+> SwiGLU-to-down-quant shortcut; if continuing MoE source work, scope a full
+> expert-major packed pipeline that owns `GEMM1->activation->GEMM2`, or pivot to
+> another measured bottleneck.
+>
+> 2026-07-02 Phase124 update: current-stack graph-node serving was refreshed
+> after the Phase122/123 rejections. Artifact:
+> `/home/mudler/bench/phase124_current_moe_profile/20260702_031205`.
+> Pre/post gates are green: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`,
+> `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. At `N=128`, prompt `128`,
+> generation `64`, serving under graph-node profiling was
+> `agg_tps 206.2`, `decode_agg_tps 320.3`, `prefill_tps 1536.4`, wall
+> `39.738s`. Fine buckets are now `mmq_nvfp4 6074.78 ms` (`30.17%`) and
+> `gdn_core 5888.31 ms` (`29.25%`), with `act_quant` only `674.88 ms`
+> (`3.35%`). This explains why single-boundary activation/quant attempts were
+> flat. The next source work must reduce one of the two dominant buckets:
+> either a full expert-major MoE pipeline for `mmq_nvfp4`, or a default-off GDN
+> decode/core experiment for `gdn_core`. Do not spend more GB10 time on
+> route-only metadata reuse, fused-down quantization, or MoE tile-policy knobs
+> unless a new profile makes those buckets material.
+>
+> 2026-07-02 Phase125 scoping update: two independent code explorers and a
+> local GDN audit challenged the Phase124 fork in the road. The chosen next
+> source attempt is the MoE side, but only as a first maintainable slice:
+> implement a default-off MMQ sorted-output primitive behind
+> `LLAMA_MOE_EXPERT_MAJOR_SORTED_OUT=1`, immediately unsort as a proof, and
+> measure `MOE_SWIGLU_DOWN` before attempting the full
+> `gate_up -> SWIGLU -> down` expert-major executor. Rationale: vLLM's portable
+> advantage is keeping activations expert-major across both GEMMs and
+> unpermuting once; Phase122/123 failed because they only touched route metadata
+> or one activation boundary. Do not copy CUTLASS/FlashInfer pointer-array, TMA,
+> or FP4 scale-swizzle internals. A small GDN patch is not funded by current
+> evidence because previous decode/core micro-attempts already rejected the
+> obvious geometry/store/broadcast/conv-state shortcuts. Plan:
+> `docs/superpowers/plans/2026-07-02-moe-expert-major-sorted-output-phase125.md`.
+>
+> 2026-07-02 Phase125 result: reject the MMQ sorted-output plus immediate
+> unsort proof. Artifact:
+> `/home/mudler/bench/phase125_moe_expert_major_sorted_output/20260702_033931`;
+> post-reject:
+> `/home/mudler/bench/phase125_moe_expert_major_sorted_output/post_reject_20260702_034232`.
+> The candidate was default-off and correctness-clean (`13/13` default
+> selected, `7/7` opt-in `MOE_SWIGLU_DOWN`, 12 sorted markers), but perf failed
+> decisively: versus Phase121 exec, `MOE_SWIGLU_DOWN n=128` regressed
+> `805.16 -> 888.76 us` and `n=257` regressed `1023.83 -> 1192.05 us`.
+> Source was reverted. Post-reject gates are green: selected `13/13`, Phase121
+> exec `7/7` with six markers, MoE md5 `8cb0ce23`, dense md5 `5951a5b4`,
+> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Do not retry a path that adds
+> a sorted-output temporary and immediately unsorts. A future expert-major MoE
+> attempt must keep sorted activations through the down GEMM and unpermute only
+> once after the full FFN, or pivot to a larger GDN recurrence design.
+
+> 2026-07-02 Phase126 result: keep the grouped-MMQ presorted helper scaffold.
+> The patch only touches `mmq.cu`/`mmq.cuh`, refactors the current MoE id path
+> behind explicit `ids_src1`/`ids_dst`/`expert_bounds` metadata, and exposes a
+> `src1_sorted` entry point for the future whole-MoE executor. Fixed artifact:
+> `/home/mudler/bench/phase126_mmq_presorted_helper/fix1_20260702_040858`.
+> Gates were green: selected `13/13`, MoE md5
+> `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+> `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`,
+> `MUL_MAT_ID 806/806`. Focused perf was neutral:
+> `MOE_SWIGLU_DOWN n=128 805.99 us`, `MUL_MAT_ID_RAGGED_MOE n=128
+> 1243.85 us`, `MOE_SWIGLU_DOWN n=257 1018.74 us`,
+> `MUL_MAT_ID_RAGGED_MOE n=257 1452.84 us`. This is not a parity win by
+> itself; it is the dependency for Phase127 to keep `gate_up -> SWIGLU -> down`
+> in expert-major order and unpermute only once after the full FFN.
+
+> 2026-07-02 Phase127 result: reject and revert the whole-MoE expert-major
+> executor built on the Phase126 helper. Red:
+> `/home/mudler/bench/phase127_moe_whole_expert_major/red_20260702_042125`
+> passed by fallback with zero markers. Candidate green:
+> `/home/mudler/bench/phase127_moe_whole_expert_major/green2_20260702_042916`
+> passed default selected `13/13` and opt-in `MOE_SWIGLU_DOWN 7/7`, emitting
+> six `LLAMA_MOE_WHOLE_EXPERT_MAJOR` markers after fixing the down-weight shape
+> interpretation (`down_w` is `[n_ff, n_embd, experts]`). Perf artifact:
+> `/home/mudler/bench/phase127_moe_whole_expert_major/perf_20260702_043104`.
+> It failed the keep rule: `MOE_SWIGLU_DOWN n=128` regressed
+> `802.57 -> 812.14 us`; `n=257` regressed `1023.25 -> 1039.36 us`;
+> ragged standalone was essentially flat. Source was reverted. Post-reject:
+> `/home/mudler/bench/phase127_moe_whole_expert_major/post_reject_20260702_043318`
+> passed selected `13/13`, MoE md5 `8cb0ce23`, dense md5 `5951a5b4`,
+> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Do not retry the same
+> fake-tensor whole-executor shape; the next MoE attempt must remove more
+> temporary traffic or become a real fused grouped MMQ/SWIGLU/down path. A
+> separate alternative is the previously scoped Qwen3Next BF16 GDN S-cache
+> experiment, but that needs non-md5 numerical gates.
+
+> 2026-07-02 Phase128 result: reject/revert the Qwen3Next BF16 GDN S-cache
+> selector probe for the current target. Artifact:
+> `/home/mudler/bench/phase128_qwen3next_gdn_bf16_s_cache/default_20260702_043939`
+> built and passed default gates (`GATED_DELTA_NET 48/48`, canonical MoE md5
+> `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT`, `MUL_MAT_ID`). Verbose smoke
+> artifact:
+> `/home/mudler/bench/phase128_qwen3next_gdn_bf16_s_cache/smoke3_20260702_044434`
+> showed the active decision model is `qwen35moe`, not Qwen3Next, and S cache
+> remained `f32` under `LLAMA_QWEN3NEXT_GDN_S_CACHE_TYPE=bf16`. No true
+> Qwen3Next GGUF was found on DGX. The relevant Qwen35/Qwen35MoE BF16 S-cache
+> lever was already Phase81/82: it cut `gdn_core` but changed MoE md5 and
+> missed the full f16-reference KL acceptance band. Do not retry this exact
+> lever unless the quality gate is explicitly re-scoped or a real Qwen3Next
+> model artifact is available.
+
+> 2026-07-02 Phase129 result: reject/revert the Qwen35/Qwen35MoE grouped Q/K
+> broadcast probe for fused GDN. Plan:
+> `docs/superpowers/plans/2026-07-02-qwen35-gdn-qk-grouped-bcast-phase129.md`.
+> The candidate added a default-off `LLAMA_QWEN35_GDN_QK_BCAST=1` branch in
+> `src/models/qwen35.cpp` and `src/models/qwen35moe.cpp`, reusing the existing
+> Qwen3Next `ggml_gated_delta_net_set_bcast()` path. Default gates were green:
+> `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/default_20260702_065445`
+> passed MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 46/46`,
+> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. A standalone opt-in gate
+> artifact at `optin_20260702_065604` was invalid because
+> `paged-inference-gates.sh` only passes completion env through `EXTRA_ENV`.
+> The valid opt-in pre-gate from
+> `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/decode_optin_20260702_070149/gate_pre`
+> changed MoE md5 to `b773e2f032aa0e992626d486b321808e`, so profiling was
+> stopped and the source was reverted. Post-reject:
+> `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/post_reject_20260702_070258`
+> passed canonical MoE/dense md5, `GATED_DELTA_NET 46/46`, `MUL_MAT 1146/1146`,
+> and `MUL_MAT_ID 806/806`; rebuilt `libllama.so` has zero
+> `LLAMA_QWEN35_GDN_QK_BCAST` strings. Do not retry this Qwen3Next
+> grouped-broadcast port for Qwen35/Qwen35MoE under the current bit-exact md5
+> rule.
+
+> 2026-07-02 Phase130 result: current-stack graph-node serving profile refresh,
+> measurement-only. Artifact:
+> `/home/mudler/bench/phase130_current_stack_profile/20260702_070949`. Shape:
+> MoE `q36-35b-a3b-nvfp4`, `N=128`, prompt `128`, generation `64`,
+> `PARALLEL=128`, `CTX=131072`. Pre/post gates passed canonical MoE md5
+> `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and
+> `MUL_MAT_ID 806/806`. Serving metrics: `agg_tps 208.0`,
+> `decode_agg_tps 326.9`, `prefill_tps 1519.6`, `TTFT mean 8170.6 ms`, wall
+> `39.38 s`, total kernel time `20.1559 s`. The profile confirms the live
+> bottleneck remains split between `mmq_nvfp4 6009.52 ms` (`29.82%`) and
+> `gdn_core 5891.40 ms` (`29.23%`). FA/mask cleanup is not funded:
+> `get_rows 280.62 ms` (`1.39%`) and `fa 257.38 ms` (`1.28%`). The next source
+> attempt must target a larger MoE/FFN-GEMM executor/kernel or a materially
+> different GDN recurrent-state/packed-decode design, not another paged-mask,
+> route-only, activation-only, grouped-broadcast, BF16-cache, or launch-geometry
+> shortcut.
+
+> 2026-07-02 Phase131 result: source-selection challenge, no source changes.
+> Plan:
+> `docs/superpowers/plans/2026-07-02-fused-routed-ffn-phase131.md`. Two
+> read-only explorers challenged the Phase130 fork. MoE/FFN-GEMM source work is
+> not funded unless it becomes a real fused routed-FFN kernel/executor; another
+> route-only, activation-only, W4A16, tile-policy, sorted-output, or fake
+> executor patch is expected to repeat Phases 110-127. GDN source work is not
+> funded unless it materially reduces f32 recurrent-state traffic without
+> BF16/quality drift; launch geometry, gather/identity, producer/store fusion,
+> BF16 S-cache, and grouped Q/K broadcast have already failed or changed md5s.
+> The next active line is to audit vLLM's fused MoE design and llama.cpp's
+> current whole-pattern executor hook for a default-off fused routed-FFN PoC.
+> If that audit does not produce a concrete low-conflict hook, require a
+> standalone CUDA PoC before touching llama.cpp source.
+>
+> 2026-07-02 Phase132 result: keep the new default-off routed-FFN PoC scaffold.
+> Plan:
+> `docs/superpowers/plans/2026-07-02-routed-ffn-poc-phase132.md`. Artifact:
+> `/home/mudler/bench/phase132_routed_ffn_poc/20260702_072725`. Source adds
+> `ggml/src/ggml-cuda/moe-ffn.cu/.cuh` and a narrow hook in
+> `ggml/src/ggml-cuda/ggml-cuda.cu` behind `LLAMA_MOE_ROUTED_FFN_POC=1`.
+> The helper currently executes the baseline `gate_up -> SWIGLU -> down`
+> sequence through the existing whole-pattern hook, so it is a scaffold, not a
+> parity speedup. Initial incremental build failed until CMake was reconfigured
+> to pick up the new globbed CUDA source; after `cmake -S . -B build`, build
+> passed. Selected default and opt-in gates passed
+> `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13`; opt-in emitted six exec
+> markers and `libggml-cuda.so` contains one `LLAMA_MOE_ROUTED_FFN_POC` string.
+> Default and opt-in canonical gates passed MoE md5 `8cb0ce23`, dense md5
+> `5951a5b4`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, and
+> `MUL_MAT_ID 806/806`. Focused perf was neutral (`808.32 -> 804.87 us` at
+> n=128, `1023.36 -> 1022.71 us` at n=257). Next phase may replace the helper
+> internals with a real fused routed-FFN slice; do not claim Phase132 itself as
+> a speedup.
+>
+> 2026-07-02 Phase133 result: keep only as a default-off structural base, not a
+> speedup. Plan:
+> `docs/superpowers/plans/2026-07-02-routed-ffn-sorted-down-phase133.md`.
+> Artifact:
+> `/home/mudler/bench/phase133_routed_ffn_sorted_down/20260702_074651`.
+> Source exposes `ggml_cuda_mmq_ids_meta`, adds raw
+> `ggml_cuda_mul_mat_q_moe_sorted_f32(...)`, and adds
+> `LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1` on top of
+> `LLAMA_MOE_ROUTED_FFN_POC=1`. The path executes baseline `gate_up` and
+> `SWIGLU`, gathers the SWIGLU output into compact expert-sorted F32 rows, then
+> calls raw MMQ down without fake tensors. Selected default, Phase132, and
+> Phase133 gates passed `13/13`; Phase133 trace proved six
+> `mmq_moe_sorted_raw` launches. Default and Phase133 canonical gates passed
+> MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 48/48`,
+> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Perf was not a win:
+> default `807.37/1020.76 us`, Phase132 `808.21/1018.87 us`, Phase133
+> `808.85/1026.87 us` for `n=128/257`. Next phase must fuse SWIGLU-to-sorted
+> or SWIGLU-to-quant to remove this added gather/quant boundary; do not promote
+> sorted-down as-is.
+>
+> 2026-07-02 Phase134 result: keep only as default-off fused-SWIGLU structural
+> base, not a speedup. Plan:
+> `docs/superpowers/plans/2026-07-02-routed-ffn-fused-swiglu-phase134.md`.
+> Artifact:
+> `/home/mudler/bench/phase134_routed_ffn_fused_swiglu/20260702_075828`.
+> Source adds `LLAMA_MOE_ROUTED_FFN_FUSED_SWIGLU=1` on top of
+> `LLAMA_MOE_ROUTED_FFN_POC=1`, passes `gate/up` views into the routed-FFN
+> helper, computes `silu(gate) * up` directly into expert-sorted F32 rows, and
+> calls the raw sorted-F32 down MMQ helper. The fused flag now implies the
+> sorted-down path; `LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1` is not required.
+> Selected opt-in gates passed `13/13`; trace proved six `mmq_moe_sorted_raw`
+> launches; canonical opt-in gates passed MoE md5 `8cb0ce23`, dense md5
+> `5951a5b4`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, and
+> `MUL_MAT_ID 806/806`. Perf is mixed: default `804.92/1026.02 us`, Phase132
+> `808.00/1028.43 us`, Phase133 `808.07/1029.02 us`, Phase134
+> `810.61/1025.68 us` for `n=128/257`. It recovers n=257 but regresses n=128;
+> next work must fuse SWIGLU directly into down-MMQ quant or remove another
+> launch/buffer before this becomes a parity lever.
+>
+> 2026-07-02 Phase135 result: keep as current best default-off routed-FFN base,
+> but not parity. Plan:
+> `docs/superpowers/plans/2026-07-02-routed-ffn-fused-quant-phase135.md`.
+> Focused artifact:
+> `/home/mudler/bench/phase135_routed_ffn_fused_quant/20260702_081723`.
+> Serving artifact:
+> `/home/mudler/bench/phase135_routed_ffn_fused_quant_serving/20260702_082102`.
+> Source adds `LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1` on top of
+> `LLAMA_MOE_ROUTED_FFN_POC=1`, computes `silu(gate) * up` directly into the
+> NVFP4 MMQ activation layout, and launches raw down MMQ via
+> `ggml_cuda_mul_mat_q_moe_quantized(...)`. Focused selected gates passed
+> `13/13`; trace proved six `mmq_moe_quantized_raw` launches and zero
+> `mmq_moe_sorted_raw` launches; canonical focused gates passed MoE md5
+> `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 48/48`,
+> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Focused perf:
+> default `805.92/1031.06 us`, Phase134 `807.65/1027.51 us`, Phase135
+> `807.92/1024.97 us` for `n=128/257`. Serving at the Phase130 shape passed
+> pre/post gates and improved decode aggregate t/s `326.9 -> 332.7`, while
+> `mmq_nvfp4` dropped `6009.52 -> 5915.24 ms`; aggregate stayed `208.0`, prefill
+> worsened `1519.6 -> 1475.1`, and total kernel time rose slightly
+> `20.1559 -> 20.2498 s`. Next work should target remaining MoE overhead after
+> fused quant (`mmq_fixup`, route/writeback, weighted combine), not another F32
+> intermediate.
+>
+> 2026-07-02 Phase136 result: reject and revert the separate post-down
+> weighted-combine fuse. Plan:
+> `docs/superpowers/plans/2026-07-02-routed-ffn-combine-phase136.md`.
+> Focused artifact:
+> `/home/mudler/bench/phase136_routed_ffn_combine/20260702_083727`.
+> Serving artifact:
+> `/home/mudler/bench/phase136_routed_ffn_combine_serving/20260702_085749`.
+> The candidate added `LLAMA_MOE_ROUTED_FFN_COMBINE=1` on top of Phase135 and
+> skipped the post-down `MUL(weights) -> VIEW* -> ADD*` tail with a separate
+> F32 weighted-combine kernel. It was correctness-clean: expanded selected
+> gates passed `20/20`, trace proved six combine markers plus six
+> `mmq_moe_quantized_raw` launches and zero sorted launches, canonical gates
+> passed MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 46/46`,
+> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Focused full-tail perf
+> improved (`MOE_SWIGLU_COMBINE n=257` `428.53 -> 401.81 us` versus Phase135),
+> but serving regressed versus Phase135: aggregate/decode t/s
+> `208.0/332.7 -> 206.5/323.2`. Source and the sentinel test were reverted;
+> post-reject Phase135 selected gates passed `13/13`. Do not retry a standalone
+> post-MMQ combine launch as the next parity lever; any combine/finalize work
+> needs a larger serving-visible fused writeback/finalize design.
+>
+> 2026-07-02 Phase137 result: reject the GDN launch-geometry retune with no
+> source changes. Plan:
+> `docs/superpowers/plans/2026-07-02-gdn-geometry-sweep-phase137.md`.
+> Focused artifact:
+> `/home/mudler/bench/phase137_gdn_geometry_sweep/20260702_091441`.
+> Serving artifact:
+> `/home/mudler/bench/phase137_gdn_geometry_serving/20260702_091740`.
+> The env-only sweep tested existing `GDN_NW`/`GDN_CPW` knobs. The best focused
+> candidate, `GDN_NW=4 GDN_CPW=1`, improved 1-token GDN rows
+> (`hc=32,hs=128,kda=0` `6.793748 -> 4.713682 us`, KDA
+> `7.790557 -> 5.194275 us`, grouped broadcast `5.967364 -> 3.407998 us`),
+> but real serving regressed versus Phase135 despite clean pre/post gates:
+> MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and
+> `MUL_MAT_ID 806/806`. Aggregate/decode t/s moved
+> `208.0/332.7 -> 206.2/324.9`, total kernel time rose
+> `20.2498 -> 20.7530 s`, and `gdn_core` worsened
+> `5926.55 -> 6466.27 ms`. Do not promote or source-code a GDN geometry retune
+> for this target. The next scoped source line is default-off MoE
+> finalize/writeback inside the existing down-MMQ path, not a standalone
+> post-MMQ combine launch.
+>
+> 2026-07-02 Phase138 attempt 1 update: keep the default-off finalize trace and
+> full-tail sentinel scaffold; no runtime speedup claim yet. Plan:
+> `docs/superpowers/plans/2026-07-02-moe-down-mmq-finalize-phase138.md`.
+> Artifacts:
+> `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_092943`
+> (`MOE_SWIGLU_DOWN` trace-only),
+> `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093617_full_tail`
+> (new full-tail sentinel), and
+> `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093731_canonical`
+> (canonical gates). The old `MOE_SWIGLU_DOWN` sentinel emitted six early
+> routed-FFN records but no weighted tail. The new `MOE_SWIGLU_FINALIZE`
+> sentinel passed default and Phase135-opt-in correctness (`7/7` each) and
+> emitted six supported tail records with `tail_nodes=16`, `views=8`, and
+> `adds=7`. Canonical patched-Phase93 gates passed MoE md5 `8cb0ce23`, dense
+> md5 `5951a5b4`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Next work may
+> implement default-off down-MMQ finalize/writeback against this sentinel first;
+> keep serving promotion gated by Phase135 decode/aggregate/kernel-time
+> thresholds.
+>
+> 2026-07-02 Phase138 attempt 2 update: keep the default-off down-MMQ
+> finalize/writeback candidate as a narrow positive, but do not promote it or
+> call parity. Plan:
+> `docs/superpowers/plans/2026-07-02-moe-down-mmq-finalize-phase138.md`.
+> Focused artifact:
+> `/home/mudler/bench/phase138_moe_down_mmq_finalize/20260702_095927_focused`;
+> canonical gates:
+> `/home/mudler/bench/phase138_moe_down_mmq_finalize/20260702_100202_canonical`;
+> serving:
+> `/home/mudler/bench/phase138_moe_down_mmq_finalize_serving/20260702_100330`.
+> The candidate adds `LLAMA_MOE_ROUTED_FFN_FINALIZE_POC=1` on top of Phase135,
+> zeroes the final output, atomically accumulates `down_sum * router_weight`
+> from the down-MMQ path, and skips the strict weighted tail only after the
+> finalize helper is selected. Focused `MOE_SWIGLU_FINALIZE` correctness passed
+> for default, Phase135, and Phase138 (`7/7` each); canonical and serving
+> pre/post gates passed MoE md5 `8cb0ce23`, dense md5 `5951a5b4`,
+> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Serving versus Phase135 moved
+> aggregate/decode t/s `208.0/332.7 -> 209.3/333.5`, total kernel time
+> `20.2498 -> 20.0489 s`, and `mmq_nvfp4 5915.24 -> 5802.87 ms`; however
+> `ew_add` remains visible at `374.09 ms`, so this is only an incremental
+> default-off improvement. Next work should reduce the remaining fan-in/writeback
+> path more deeply or return to the dominant `gdn_core`/`mmq_nvfp4` buckets.
+>
+> 2026-07-02 Phase139 result: serving noise-floor repeat rejects treating the
+> Phase138 one-off serving gain as source-funding evidence. Spec:
+> `docs/superpowers/specs/2026-07-02-serving-noise-floor-phase139-design.md`.
+> Plan:
+> `docs/superpowers/plans/2026-07-02-serving-noise-floor-phase139.md`.
+> Artifact:
+> `/home/mudler/bench/phase139_serving_noise_floor/20260702_081901`.
+> Seven identical current-binary Phase138 serving/profile runs all passed
+> pre/post gates: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`,
+> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. The runtime variance was much
+> larger than Phase138's one-off delta: aggregate throughput median
+> `208.5 t/s`, stdev `2.8022`, CV `1.349%`, range `203.4..212.3`; wall CV
+> `1.347%`; `mmq_nvfp4` CV `3.351%`. Keep Phase138 default-off as
+> correctness-clean and focused-positive, but do not stack another
+> finalize/MMQ micro-patch from serving evidence alone. Future serving claims
+> need repeated A/B medians and must exceed `max(2.0%, 3 * same-binary stdev)`.
+> The next source phase should pivot to a larger measured bucket, with GDN
+> packed decode/prep now more defensible than another MoE finalize shortcut.
+>
+> 2026-07-02 Phase140 result: reject an immediate in-GDN Q/K
+> L2-normalization patch. Spec:
+> `docs/superpowers/specs/2026-07-02-gdn-decode-prep-trace-phase140-design.md`.
+> Plan:
+> `docs/superpowers/plans/2026-07-02-gdn-decode-prep-trace-phase140.md`.
+> Artifact:
+> `/home/mudler/bench/phase140_gdn_decode_prep_trace/20260702_085348`.
+> The current Phase138 opt-in serving/profile shape passed pre/post gates:
+> MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and
+> `MUL_MAT_ID 806/806`. Serving/profile reported aggregate/decode throughput
+> `207.3/328.9 t/s`, wall `39.501 s`, total kernel `20.2002 s`, `GDN
+> 6673.66 ms`, `gdn_core 5890.44 ms`, and `gdn_l2norm 100.30 ms`. The focused
+> SQLite summary had `l2_norm_f32 100.3024 ms` versus
+> `gated_delta_net_cuda 5804.7074 ms`. This is above the absolute
+> three-sigma floor from Phase139 (`53.433 ms`) but below the planned `3%` of
+> GDN-core materiality threshold at about `1.7%`, so prep-only L2 fusion is not
+> source-funded. Next GDN work should be recurrence-level, packed-state, or
+> datacenter-Blackwell-specific, not another prep micro-fusion.
+>
+> 2026-07-02 Phase141 result: decode-only GDN source claims must normalize by
+> launch count or tightly control the capture window. Spec:
+> `docs/superpowers/specs/2026-07-02-gdn-decode-noise-floor-phase141-design.md`.
+> Plan:
+> `docs/superpowers/plans/2026-07-02-gdn-decode-noise-floor-phase141.md`.
+> Artifact:
+> `/home/mudler/bench/phase141_gdn_decode_noise_floor/20260702_090428`.
+> Five identical current-binary decode-only captures all passed pre/post gates:
+> MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and
+> `MUL_MAT_ID 806/806`. Raw `gdn_core_ms` median/stdev/CV was
+> `1415.500/30.641/2.146%`, range `1410.300..1482.140 ms`, but launch counts
+> drifted (`597`, `598`, `600`, `630`). Normalized `gdn_core_ms_per_launch`
+> was stable: median/stdev/CV `2.359167/0.005399/0.229%`. Future GDN A/B
+> source claims need repeated medians and must beat either `6.49%` raw
+> `gdn_core` reduction or `2.0%` launch-normalized reduction. The small
+> default-off source follow-up now worth testing is scalar gate/beta hoisting
+> inside `gated_delta_net_cuda`; vLLM-style packed decode recurrence remains a
+> larger redesign.

 Audience: an agent with **zero prior context** who has been told to "continue the GB10 vLLM-parity investigation" on the `llama-cpp-localai-paged` backend.

@@ -20,6 +589,69 @@ Read order for a cold start:

 ## 1. TL;DR STATE

+> 2026-07-01 Phase104-108 update: the current carried source line is still the
+> Phase93 Qwen3Next grouped Q/K broadcast plus the Phase101/102 default-off
+> cleanup candidates. Phase104/106 same-session serving showed the stack is
+> md5/op clean but still far from vLLM: at `N=128`, paged/vLLM was about
+> `0.66` on decode and `0.50-0.51` on aggregate; at `N=192/256`, vLLM remained
+> faster and TTFT stayed about `3x` lower. Phase105 refreshed the grouped-MMQ
+> trace and found no new host-side tile-policy lever. Phase107 proved the MoE
+> structural correctness gates exist (`MOE_SWIGLU_DOWN 7/7`,
+> `MOE_WEIGHTED_COMBINE 7/7`, `MUL_MAT_ID_RAGGED_MOE 6/6`) but also proved
+> `test-backend-ops perf` did not time those custom whole-graph cases. Phase108
+> fixed that measurement gap in `tests/test-backend-ops.cpp`: perf mode now
+> includes those MoE cases at `n_tokens=128,257`, and CSV output includes
+> `time_us`, `flops`, `memory_kb`, and `n_runs`. The Phase108 artifact is
+> `/home/mudler/bench/phase108_moe_perf_csv/20260701_221559`; md5s and compact
+> op gates are green. Use Phase108 rows as the baseline for any fused routed-MoE
+> implementation. Current ranking: `MUL_MAT_ID_RAGGED_MOE` is `1239-1446 us/run`,
+> `MOE_SWIGLU_DOWN` is `802-1020 us/run`, and `MOE_WEIGHTED_COMBINE` is only
+> `28-68 us/run`, so do not spend the next patch on weighted-combine fusion
+> alone.
+> Phase109 then tested existing env-gated routes on the Phase108 rows:
+> `LLAMA_W4A16_PREFILL_M=128`, `LLAMA_FP4_PREFILL_M=128`,
+> `LLAMA_MOE_DENSITY_MAX=9`, and `LLAMA_MOE_MMQ_X=64`
+> (`/home/mudler/bench/phase109_existing_moe_prefill_ab/20260701_222559`).
+> All selected correctness gates passed (`13/13` per env), but W4A16 and FP4
+> large-M regressed the 257-token rows badly, and density/tile retuning was
+> noise-level on `MUL_MAT_ID_RAGGED_MOE` while not helping `MOE_SWIGLU_DOWN`.
+> Do not spend another phase on MMQ tile-policy shortcuts. The next credible
+> implementation is structural: port the vLLM-style idea of GPU-side
+> token/expert routing metadata (`sorted_token_ids`, expert offsets/bounds,
+> inverse permutation) into llama.cpp's `mul_mat_id` host-sync fallback/grouped
+> W4A16 path, while leaving the graph-safe grouped-MMQ path untouched.
+> Phase110 implemented the first slice of that structural path as default-off
+> `LLAMA_MOE_GPU_SORT=1` in `ggml_cuda_mul_mat_id`, reusing the existing
+> `mm_ids_helper` GPU sort for fallback metadata. The initial branch failed
+> `3/13` selected opt-in rows because `mm_ids_helper` returns sorted-to-original
+> `ids_dst`, while fallback `get_rows_cuda()` needs original-to-sorted
+> `ids_from_sorted`; adding a tiny inverse-permutation kernel fixed correctness.
+> Accepted artifact:
+> `/home/mudler/bench/phase110_gpu_moe_sort/20260701_224446_fix1`. Gates are
+> green: canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+> `5951a5b4d624ce891e22ab5fca9bc439`, and supported compact ops
+> `SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`, `GET_ROWS 49/49`,
+> `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806` for both
+> default and `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`. Perf decision:
+> keep as a default-off structural base only. It improves W4A16 fallback
+> 257-token rows by `7.2%` (`MOE_SWIGLU_DOWN`) and `7.9%`
+> (`MUL_MAT_ID_RAGGED_MOE`), but the opt-in fallback is still about `1.5x`
+> slower than default grouped-MMQ. Phase111 must remove another fallback
+> bottleneck, such as the remaining `expert_bounds` host copy / host tile
+> descriptor build, before this line can matter for parity.
+> Phase111 tested that narrow follow-up as default-off `LLAMA_W4A16_GPU_TILES=1`:
+> W4A16 tile descriptors were built on GPU from `expert_bounds_dev` with an
+> atomic tile counter. It was correctness-clean after fixing a pointer mutability
+> compile error and a CUDA pool LIFO allocation bug, but clean perf was
+> flat-to-negative (`MUL_MAT_ID_RAGGED_MOE n=257` regressed about `2.0%` versus
+> Phase110 GPU-sort). Artifact:
+> `/home/mudler/bench/phase111_w4a16_gpu_tiles/20260701_230400_fix1`. The
+> Phase111 source was reverted, and post-revert W4A16+GPU-sort selected gates
+> passed `13/13`. Do not reopen a standalone GPU tile descriptor cleanup; the
+> next W4A16 attempt must remove a larger boundary, such as direct activation
+> consumption plus GPU descriptors together, or bypass the host-sync fallback
+> path entirely.
+>
 > 2026-07-01 active update: Phase50-59 reopened the dense and MoE serving
 > scheduler question.
 > True dense decode is much closer to vLLM (`383.66` vs `435.00` t/s, `88.2%`)
@@ -46,7 +678,7 @@ Read order for a cold start:
 > are local and DGX-gated but not pushed, so the LocalAI patch series has not
 > been regenerated.
 >
-> 2026-07-01 Phase81 update: the next viable GDN lever is no longer launch
+> 2026-07-01 Phase81-85 update: the next viable GDN lever is no longer launch
 > shape or gather removal. A default-off Qwen35/Qwen35MoE BF16 persistent
 > recurrent S-cache experiment (`LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16`) cut
 > same-source decode-only `gdn_core` from `1399.30 ms / 599 launches`
@@ -54,9 +686,257 @@ Read order for a cold start:
 > F32 md5 gates and op gates stayed green, and BF16 dense md5 stayed canonical,
 > but BF16 MoE md5 changed to `07db32c2bcb78d17a43ed18bc22705cd`. A quick
 > MoE KL smoke vs the same-source F32 base showed KLD `0.055499 +/- 0.001705`,
-> same-top-p `88.361%`, and PPL ratio `1.010356`. Treat this as a promising
-> default-off candidate only. Phase82 must run the full f16-reference KL gate
-> plus serving A/B before regenerating LocalAI patches or considering promotion.
+> same-top-p `88.361%`, and PPL ratio `1.010356`. Phase82 then ran the full MoE
+> f16-reference gate at
+> `/home/mudler/bench/phase82_bf16_s_cache_f16_kl/20260701_183016`: same-source
+> F32 measured KLD `0.136563 +/- 0.003242`, while BF16 S-cache measured
+> `0.137162 +/- 0.003456` against the documented paged acceptance reference
+> `0.136000 +/- 0.003285`. Reject promotion and do not run serving A/B for this
+> candidate under the current hard KL rule. Phase83 then tested a bit-exact
+> KDA `expf(g)` register-cache shortcut in the GDN CUDA core. Md5 and op gates
+> stayed green, but same-window decode-only `gdn_core` moved
+> `1399.46 -> 1405.62 ms`, so reject that micro-optimization too. Phase84
+> reduced in-place GDN op outputs to attention-only tensors and moved the CPU
+> ids fallback scratch to workspace; md5/op gates stayed green and startup free
+> CUDA memory improved `117472 -> 117855 MiB`, but same-window decode-only
+> `gdn_core` moved `1399.72 -> 1407.38 ms`. Treat Phase84 as a possible
+> memory-footprint cleanup only, not a speed parity lever. Phase85 added a
+> graph-reuse-safe identity-contiguous recurrent-state fast path: it calls
+> `ggml_gated_delta_net_inplace` on a direct state view when `s_copy_main` is
+> identity, otherwise keeps the ids path. Md5/op gates stayed green, the
+> `gdn_gather` fine bucket disappeared, GDN macro launches dropped
+> `3600 -> 2980`, and same-window `gdn_core` moved `1412.33 -> 1400.34 ms`.
+> Carry Phase85 only as a small cleanup candidate. Phase86 audited the producer
+> fusion idea against the Phase85 node-traced profile before coding it: the
+> whole `act/GDN-gate(shared)` macro is only `13.57 ms` of `3.6622 s`, beta
+> sigmoid is `2.73 ms`, and CUDA already fuses `UNARY + MUL` for softplus,
+> sigmoid, and SILU. Reject producer-only fusion as too small. Phase87 then
+> exposed an env-gated `GDN_NW=4 GDN_CPW=8` decode geometry probe to test a
+> vLLM-like `BV=32` tile shape. It was md5/op green, but same-source
+> decode-only `gdn_core` regressed `1390.56 -> 1417.13 ms`, so the source line
+> was reverted. Phase88 tried a first default-off `GDN_DECODE_PACK2=1` packed
+> decode CTA kernel. It built and CUDA op tests stayed green, but canonical md5
+> failed for both MoE (`320b5ed...` vs `8cb0ce...`) and dense (`6a65e9...` vs
+> `5951a5...`), with visible output corruption, so it was reverted without
+> profiling. Phase89 tried to add that focused guardrail through
+> `test_gated_delta_net_inplace_ids`, but selecting that test class directly
+> already fails the pre-existing BF16 cases on CUDA, so the naive test addition
+> was also reverted. Phase90 fixed the fixture root cause for identity ids by
+> mirroring `state` into `state_dst` during initialization and added F32
+> `S_v=128`, `n_seqs=2` cases that return `concat(out,state_dst)`, so the
+> backend comparator now checks both attention output and the side-effect state
+> write. DGX CUDA selected-op gate is green (`4/4`). Use this Phase90 guardrail
+> before any new packed-decode kernel, then still run canonical md5/op gates.
+> Phase91 retried the default-off `GDN_DECODE_PACK2=1` CTA sequence-packing
+> kernel under that guardrail. The first `n_seqs=2` guardrail passed but MoE md5
+> failed for the single-sequence completion gate, exposing an uncovered odd/single
+> sequence PDL hazard. Moving inactive lanes past `ggml_cuda_pdl_sync()` and
+> adding `n_seqs=1,3` guardrail cases made the candidate md5/op clean
+> (`GATED_DELTA_NET 46/46`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), but
+> decode-only `gdn_core` regressed to `1425.44 ms`, so the runtime patch was
+> reverted. Keep the expanded guardrail; do not retry CTA-level sequence packing
+> unless it also reduces per-sequence GDN work. ids gather, producer overhead,
+> simple geometry changes, and ungated packed kernels are not acceptable parity
+> paths. Phase92 tried the next smallest scalar one-token recurrence
+> micro-optimization: a default-off `GDN_SCALAR_DECODE_STORE_FUSED=1` CUDA path
+> that stores final state inside the scalar update loop and skips the final
+> post-token register-store loop. It passed local CPU guardrail, DGX CUDA
+> guardrail, canonical md5s, `GATED_DELTA_NET 46/46`, `MUL_MAT 1146/1146`, and
+> `MUL_MAT_ID 806/806`, but decode-only `gdn_core` regressed further to
+> `1529.72 ms` (`/home/mudler/bench/phase92_gdn_scalar_store_fused/20260701_204718/decode_profile`),
+> so the runtime patch was reverted. Do not retry store-fusing without evidence
+> that the final state store loop is independently dominant. The next credible
+> scoped ideas from the vLLM audit are the larger packed decode contract and the
+> Qwen3Next GQA-repeat removal, each as a separate guarded phase. Phase93
+> implemented the Qwen3Next GQA-repeat removal as an explicit grouped Q/K
+> broadcast mode on `GGML_OP_GATED_DELTA_NET` (`op_params[2]`), preserving the
+> existing modulo/tiled broadcast for Qwen35 while allowing Qwen3Next to map
+> `qk_head = value_head / (H_v / H_k)` and skip materializing repeated q/k heads
+> when the GDN op path is active. Local CPU `GATED_DELTA_NET` passed `48/48`,
+> local CPU in-place ids passed `6/6`, DGX CUDA `GATED_DELTA_NET` passed `48/48`,
+> DGX CUDA in-place ids passed `6/6`, canonical md5/op gates passed
+> (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), and
+> decode-only `gdn_core` improved to `1333.48 ms`
+> (`/home/mudler/bench/phase93_qwen3next_gqa_bcast/20260701_211019/decode_profile`).
+> Carry Phase93 as the current positive candidate. Phase94 then retested
+> decode geometry on top of Phase93 with env-only `GDN_NW=8 GDN_CPW=8`. It
+> stayed md5/op clean (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`,
+> `MUL_MAT_ID 806/806`) but decode-only `gdn_core` regressed to `1440.79 ms`
+> (`/home/mudler/bench/phase94_gdn_geometry_phase93/20260701_211855/decode_profile_8x8`),
+> so reject 8x8 and keep Phase93's default 16x8 geometry. Phase93 trace evidence
+> also shows remaining producer-side GDN work is small (`l2_norm_f32 8.65 ms`,
+> GDN gate/sigmoid about `12.75 ms`, remaining repeat `5.34 ms`), so the next
+> useful lead should target recurrence work or a larger packed decode contract,
+> not another small producer-only fusion. Phase95 tested a default-off
+> `GDN_WARP_SCALAR_GATE=1` CUDA decode specialization on top of Phase93: lane 0
+> computed the scalar non-KDA gate and broadcast it within the warp for the
+> one-token `S_v=128`, default `16x8` path. Local CPU guardrails passed
+> (`GATED_DELTA_NET 48/48`, in-place ids `6/6`), DGX CUDA guardrails passed
+> (`GATED_DELTA_NET 48/48`, in-place ids `6/6`), canonical md5/op gates passed
+> (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), but
+> decode-only `gdn_core` regressed to `1402.40 ms`
+> (`/home/mudler/bench/phase95_gdn_warp_scalar_gate/20260701_213311/decode_profile`).
+> The runtime patch was reverted. Do not retry scalar-gate warp broadcast unless
+> a future profile shows SFU pressure, rather than recurrent state traffic or
+> reductions, dominating the GDN core. Phase96 then tested the narrow
+> conv-state identity fast path suggested by the trace audit: when
+> `s_copy_main` was identity, `build_conv_state_fused` viewed the active
+> conv-cache slots directly and called `ggml_ssm_conv_update_inplace` instead of
+> the ids variant. Local CPU `SSM_CONV` passed `45/45`; DGX CUDA `SSM_CONV`
+> passed `45/45`; canonical gates passed (`SSM_CONV 45/45`,
+> `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`, md5s
+> canonical). Decode-only profile regressed to total kernel `3.6723 s`,
+> `gdn_core 1406.57 ms`, and `gdn_conv 70.42 ms`
+> (`/home/mudler/bench/phase96_conv_identity_fastpath/20260701_214141/decode_profile`).
+> The runtime model-graph patch was reverted. Do not retry the conv identity
+> branch as a speed lever unless a same-window trace proves the ids variant is
+> independently dominant. Phase97 then measured the carried Phase93 stack in an
+> end-to-end `n=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128` serving snapshot
+> against vLLM. Pre/post canonical gates stayed green. Paged Phase93 measured
+> `agg_tps 329.6`, `decode_agg_tps 669.8`, `prefill_tps 1734.5`,
+> `ttft_mean_ms 7415.4`, `wall_s 24.851`; vLLM measured `agg_tps 664.8`,
+> `decode_agg_tps 1029.4`, `prefill_tps 5271.8`, `ttft_mean_ms 2519.5`,
+> `wall_s 11.929`
+> (`/home/mudler/bench/phase97_phase93_serving_snapshot/20260701_214648`).
+> Phase93 therefore remains a decode-profile positive candidate, but it does not
+> close serving parity (`paged_decode_over_vllm=0.6507`). The next useful phase
+> needs a larger serving-impact lever; isolated GDN/conv micro-optimizations
+> have now repeatedly failed to move live serving enough. Phase98 profiled that
+> carried Phase93 serving window with graph-node CUDA tracing. Pre/post gates
+> stayed green. Total kernel time was `20.0411 s`; macro buckets were GDN
+> `6679.96 ms` (`33.33%`), MoE/FFN-GEMM `6034.52 ms` (`30.11%`),
+> bf16/fp8-proj `2766.06 ms` (`13.80%`), and layout-copy `1257.60 ms`
+> (`6.28%`). Fine buckets were led by `gdn_core 5892.99 ms` (`29.40%`) and
+> `mmq_nvfp4 5809.55 ms` (`28.99%`), followed by `convert_dtype 663.45 ms`,
+> `gdn_conv 457.11 ms`, and `concat_layout 430.25 ms`
+> (`/home/mudler/bench/phase98_phase93_serving_profile/20260701_215715`).
+> This re-ranks the next work: do not spend more time on scalar GDN, conv
+> identity, or gather-only shortcuts. Either attribute and remove a proven
+> material layout-copy node, or pursue a larger GDN-core/MMQ serving lever with a
+> standalone PoC gate. Phase99 then used the existing default-off
+> `LLAMA_LAYOUT_TRACE` hook on the same Phase93 serving profile shape
+> (`N=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`). Trace-enabled gates stayed
+> green (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`,
+> canonical MoE/dense md5s). Serving remained comparable (`total kernel
+> 20.2408 s`, `layout-copy 1269.35 ms`). The trace attributed
+> `concat_layout 440.01 ms` almost entirely to
+> `conv_input-* = concat(conv_states_reshaped-*, qkv_mixed_transposed-*)` before
+> `SSM_CONV`; `copy_layout 119.16 ms` includes `conv_state_update-*` writeback.
+> The larger `convert_dtype 662.34 ms` bucket is mostly unnamed F32-to-F16 `CPY`
+> rows and needs stronger attribution before coding. Decision: Phase99 is
+> measurement-only; do not retry the Phase96-style conv-state identity branch.
+> The only conv-side patch worth funding is a larger two-source `SSM_CONV`
+> contract that reads `(conv_states, qkv_mixed)` as a logical concat, or else
+> extend trace attribution for the unnamed `convert_dtype` bucket first
+> (`/home/mudler/bench/phase99_layout_trace/20260701_200835/serving_profile`).
+> Phase100 extended that trace with `dst_view`, `src0_view`, and `src1_view`
+> names. The trace-only patch built locally and on DGX, and trace-enabled gates
+> stayed green (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`,
+> `MUL_MAT_ID 806/806`, canonical MoE/dense md5s). Serving stayed comparable
+> (`total kernel 20.3464 s`, `convert_dtype 661.73 ms`, `concat_layout
+> 438.15 ms`). The new fields identify a concrete `convert_dtype` source:
+> `GET_ROWS` reads F16 `cache_k_l*` / `cache_v_l*` into F32 `node_*`, then
+> `CPY` downcasts views such as `src0_view=node_358` / `node_365` to F16
+> attention-shaped tensors. This repeats across attention layers
+> (`cache_k_l3/v_l3`, `cache_k_l7/v_l7`, `cache_k_l11/v_l11`, ...). Some F32->F16
+> rows remain unnamed, so the next runtime phase should be a narrow K/V cache
+> get_rows dtype A/B, not a broad layout rewrite
+> (`/home/mudler/bench/phase100_layout_view_trace/20260701_201800/serving_profile`).
+> Phase101 implemented that narrow A/B as default-off
+> `LLAMA_PAGED_KV_GET_ROWS_F16=1`: add `ggml_get_rows_type`, support CPU F16
+> source -> F16 destination row copy, and use typed F16 `GET_ROWS` only for
+> paged K/V gather when the cache tensor is F16. Local and DGX builds completed;
+> CUDA `GET_ROWS` passed `49/49` including the new F16-output cases; default and
+> opt-in md5/op gates stayed green (`GET_ROWS 49/49`, `GATED_DELTA_NET 48/48`,
+> `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`, canonical MoE/dense md5s).
+> Serving profile under opt-in measured `total kernel 20.1989 s`, `agg_tps
+> 206.4`, `decode_agg_tps 328.0`, and `ttft_mean_ms 8211.1`. It reduced
+> `copy_layout 116.25 -> 80.32 ms` and macro `layout-copy 1262.58 -> 1220.30 ms`
+> versus Phase100, but `convert_dtype` stayed flat (`661.73 -> 661.35 ms`) and
+> serving throughput did not improve. Carry Phase101 only as a small default-off
+> cleanup candidate pending repeat A/B; do not promote it as a parity lever
+> (`/home/mudler/bench/phase101_kv_get_rows_f16/20260701_203930/serving_profile`).
+> Phase102 then implemented the funded two-source `SSM_CONV` contract as
+> default-off `LLAMA_SSM_CONV_SPLIT=1`: `ggml_ssm_conv_split(ctx, conv_states,
+> x_cur, conv_kernel)` reuses `GGML_OP_SSM_CONV`, reads
+> `[K-1,channels,n_seqs]` cached taps plus native `[channels,n_tokens,n_seqs]`
+> qkv tokens as a logical concat, and is wired into Qwen3Next/Qwen35/Qwen35MoE
+> only for multi-token, non-rollback batches with `n_seq_tokens >= K-1`. The
+> initial semantic test exposed a harness issue (`split-base` has an exactly
+> zero CPU reference, so normalized MSE reported `ERR=inf`); direct split
+> CUDA-vs-CPU passed `6/6`, and the final test keeps `split-base` with absolute
+> max error. Local and DGX builds passed; default, standalone opt-in, and
+> serving pre/post gates stayed green (`SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`,
+> `GET_ROWS 49/49`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`,
+> `MUL_MAT_ID 806/806`, canonical MoE/dense md5s). Opt-in serving measured
+> `total kernel 19.5482 s`, `agg_tps 206.1`, `decode_agg_tps 320.0`,
+> `prefill_tps 1538.0`, and `ttft_mean_ms 7928.4`. It removed the traced concat
+> materialization (`concat_layout 433.13 -> 4.59 ms` versus Phase101 and
+> `layout-copy 1220.30 -> 826.87 ms`), but live serving throughput still did not
+> improve. Carry Phase102 as a default-off cleanup/follow-up base only; do not
+> promote it as parity-closing without a repeat A/B or an additional state-update
+> fusion. The remaining high-value targets are still `gdn_core`, `mmq_nvfp4`, or
+> a larger serving scheduler/packed-decode contract
+> (`/home/mudler/bench/phase102_ssm_conv_split/20260701_210907/serving_profile`).
+> Phase103 measured Phase101+Phase102 together, with no new source changes:
+> `LLAMA_SSM_CONV_SPLIT=1 LLAMA_PAGED_KV_GET_ROWS_F16=1`. Standalone and
+> serving pre/post gates stayed green (`SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`,
+> `GET_ROWS 49/49`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`,
+> `MUL_MAT_ID 806/806`, canonical MoE/dense md5s). Combined serving improved
+> over Phase102 (`agg_tps 206.1 -> 212.3`, `decode_agg_tps 320.0 -> 331.5`,
+> `prefill_tps 1538.0 -> 1569.1`, `wall_s 39.743 -> 38.575`) and reduced
+> `layout-copy 826.87 -> 798.52 ms`; it also preserved most of the split
+> SSM_CONV concat removal and recovered the F16 K/V `copy_layout` reduction
+> (`copy_layout 112.53 -> 78.22 ms`). This proves the two cleanup candidates are
+> compatible, but not parity-closing: `gdn_core 5930.47 ms` and `mmq_nvfp4
+> 6001.77 ms` still dominate. Carry the combined env as the cleanup comparison
+> baseline; do not rerun isolated layout cleanup unless it changes a larger
+> serving contract
+> (`/home/mudler/bench/phase103_combined_layout_cleanups/20260701_211821/serving_profile`).
+> Phase104 then measured that combined cleanup stack in the normal same-session
+> serving harness against vLLM at `N=128`, `PTOK=128`, `GEN=64`,
+> `PARALLEL=128`. Pre/post gates stayed green with the same expanded op set and
+> canonical md5s. Paged combined measured `agg_tps 338.6`,
+> `decode_agg_tps 675.8`, `prefill_tps 1813.0`, `ttft_mean_ms 7121.6`, and
+> `wall_s 24.196`; vLLM measured `agg_tps 661.1`, `decode_agg_tps 1028.0`,
+> `prefill_tps 5208.7`, `ttft_mean_ms 2572.3`, and `wall_s 11.980`. This is a
+> small serving improvement over Phase97 (`agg_tps +2.73%`, `prefill_tps
+> +4.53%`, `TTFT -3.96%`), but still not parity: `paged_decode_over_vllm=0.6574`
+> and `paged_agg_over_vllm=0.5122`. Carry the combined cleanup stack as the best
+> current comparison baseline. The next useful phase must attack a larger
+> serving-impact contract or the dominant GDN/MMQ buckets, not more isolated
+> layout-copy cleanup
+> (`/home/mudler/bench/phase104_combined_serving_snapshot/20260701_212551`).
+> Phase105 refreshed grouped-MMQ evidence on that current stack without source
+> changes. `MUL_MAT_ID_RAGGED_MOE` stayed green both default and trace-enabled
+> (`6/6`), full `MUL_MAT_ID` stayed green (`806/806`), and the live serving
+> retry returned a non-empty response while recording `120` shape and launch
+> lines. The live sample was prefill-like (`ncols_max=317`, density `10`,
+> `mmq_x_best=112`, `stream_k=1`) with no small-M lines; all launches had
+> `fixup=0`, `stream_k_blocks == ntiles_dst`, and efficiency `100`. This
+> confirms the current cleanup stack did not open a new cheap MMQ shortcut.
+> Do not add another host-side MMQ tile policy; only revisit MMQ for a
+> genuinely structural kernel or serving-contract change
+> (`/home/mudler/bench/phase105_mmq_current_shape/20260701_214129_serving_retry`).
+> Phase106 tested the remaining low-conflict C1 operating-point hypothesis on
+> the current stack: same-session `N=128/192/256` with `PARALLEL=256`,
+> `VLLM_MAX_NUM_SEQS=256`, and the combined cleanup env. Pre/post gates stayed
+> green with canonical md5s and the expanded op set. vLLM completed all legs and
+> stayed ahead: at `N=256`, paged measured `agg_tps 338.4`,
+> `decode_agg_tps 824.6`, `ttft_mean_ms 14933.5`, while vLLM measured
+> `agg_tps 723.8`, `decode_agg_tps 1320.4`, `ttft_mean_ms 4999.0`. Reject C1
+> for the current GB10 stack. The next source phase should be structural
+> persistent-batch/fused-MoE/GDN work, not another scheduler shortcut
+> (`/home/mudler/bench/phase106_max_concurrency_current_stack/20260701_214907`).
+> Phase107 established the fused-MoE structural guardrail surface before coding:
+> `MOE_SWIGLU_DOWN 7/7`, `MOE_WEIGHTED_COMBINE 7/7`, and
+> `MUL_MAT_ID_RAGGED_MOE 6/6` passed on CUDA0. However,
+> `test-backend-ops perf` did not provide usable timing rows for these custom
+> whole-graph cases; the broad `MUL_MAT_ID` perf CSV reported support metadata
+> only. The next source patch should be measurement-only: add a narrow MoE
+> fusion timing harness with explicit GPU synchronization and CSV timing before
+> funding any fused routed-MoE kernel
+> (`/home/mudler/bench/phase107_moe_fusion_guardrail/20260701_220227`).

 - Historical verdict: the older investigation marked GB10 parity **CLOSED** and
  unreachable. Treat that as superseded where Phase50-54 provide newer dense
@@ -1416,3 +2296,44 @@ assumption is too narrow for a sub-millisecond capture-level win. Do not spend
 more parity time on gather-only GDN shortcuts unless a future profile makes
 gather material. The next serious GDN scope remains recurrent-state
 precision/traffic.
+
+## Series trim (phases 110-140 review, 2026-07-02)
+
+The campaign's on-disk patches `0048-0063` were added without matching fork
+commits (a fork-first policy violation). After a keep/drop review of the
+phase 110-140 work, the series was trimmed to a single kept line plus the
+gate harness, and re-mirrored to the fork:
+
+- KEEP - test sentinels (the MoE gate harness): `MOE_SWIGLU_DOWN`,
+  `MOE_SWIGLU_COMBINE`, `MUL_MAT_ID_RAGGED_MOE` (old `0051-0053`).
+- KEEP - the MTP-draft correctness fix (old `0054`): forces target-side
+  sampler acceptance for MTP drafts (backend draft sampling can request
+  multiple output rows per sequence); the backend ships `-mtp` gallery models.
+- KEEP - the Phase135 routed-FFN fused-quant line: whole-pattern MoE matcher +
+  routed-FFN executor hook (Phase120/121), the routed-FFN PoC scaffold
+  `moe-ffn.{cu,cuh}` (Phase132), and the fused SwiGLU-to-NVFP4-quant + raw down
+  MMQ (`ggml_cuda_mul_mat_q_moe_quantized` + local `ggml_cuda_mmq_ids_meta`
+  refactor, Phase135). All default-off, md5-clean opt-in, six
+  `mmq_moe_quantized_raw` markers with zero sorted launches on the sentinel.
+
+- DROP - W4A16 grouped-tile pack/tune/pad (old `0048-0050`): dead line, W4A16
+  is ~1.5x slower than grouped-MMQ.
+- DROP - speculative/trace/cublas-route/mmid-route/mul-mat-route traces + the
+  rejected small-M tile-policy knob (old `0055-0063`).
+- DROP - all other campaign keep-markers not needed by Phase135: GPU-sort
+  (Phase110), W4A16-direct-A (Phase112), boundary trace/timing (Phase117),
+  Phase133 sorted-F32 down, Phase134 fused-SWIGLU-only, Phase138
+  finalize/weighted-combine. The final fork tree carries zero of these markers.
+
+Fork branch `mudler/llama.cpp:localai-paged` re-mirrored on top of
+`51168c5ee` (LocalAI series `0001-0047`):
+
+- `fd920cf8a` test(paged): cover MoE swiglu down chain
+- `a85c1e098` test(paged): cover MoE weighted combine chain
+- `2fed6aacf` test(paged): cover ragged MoE dispatch
+- `f1d976f06` fix(speculative): disable backend sampling for MTP drafts
+- `1edddc8fe` feat(paged): whole-pattern MoE matcher + routed-FFN fused
+  NVFP4-quant down MMQ
+
+New fork HEAD `1edddc8fe`, tree `097c862c`. The rejected/neutral levers of
+the 110-140 campaign are recorded above and in the per-phase bench artifacts.