docs(paged): record phases 112-140 + series trim decision

Record the phase 110-140 GDN/MoE campaign benchmark log and append the
series-trim decision to the parity handoff: keep the Phase135 routed-FFN
fused-quant line plus the MoE test sentinels and the MTP-draft correctness
fix; drop the W4A16 structural line, the trace/tile-policy patches, GPU-sort,
W4A16-direct-A, and the finalize fusion. Rejected/neutral levers are recorded
in the handoff and the per-phase bench artifacts. Fork re-mirrored on
51168c5ee: fd920cf8a a85c1e098 2fed6aacf f1d976f06 1edddc8fe (HEAD tree
097c862c).

Assisted-by: Claude:opus-4.8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-07-02 10:16:53 +00:00
parent 67d2c4c9d4
commit 1aba41082b
2 changed files with 4967 additions and 14 deletions

View File

File diff suppressed because it is too large Load Diff

View File

@@ -5,6 +5,575 @@
> and `GB10_PARITY_PHASE0_RESULTS.md`, with Phase 6 serving-nsys evidence and
> the active follow-up plans under `docs/superpowers/plans/`. Use those files for
> the current state before relying on the older "closed" conclusion below.
>
> 2026-07-01 Phase112 update: keep the new default-off
> `LLAMA_W4A16_DIRECT_A=1` direct activation staging hook, especially combined
> with Phase110 `LLAMA_MOE_GPU_SORT=1`. Artifact:
> `/home/mudler/bench/phase112_w4a16_direct_a/20260701_231749_direct_a`.
> Selected gates passed `13/13` for W4A16+GPU-sort, direct-A, and
> direct-A+GPU-sort. Direct-A+GPU-sort improved the 257-token W4A16 fallback
> rows versus W4A16+GPU-sort control (`MOE_SWIGLU_DOWN 1551.08 -> 1477.74 us`,
> `MUL_MAT_ID_RAGGED_MOE 2278.50 -> 2166.22 us`) but was neutral/slightly
> slower on 128-token rows. Canonical README md5 gates are green: MoE
> `8cb0ce23777bf55f92f63d0292c756b0`, dense
> `5951a5b4d624ce891e22ab5fca9bc439`; compact supported op gates are green
> (`SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`, `GET_ROWS 49/49`,
> `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`).
> This is still default-off structural groundwork, not parity: W4A16 fallback
> remains slower than the default grouped-MMQ path. Use the patch-series README
> md5 command as canonical; the handoff `-no-cnv -c 4096` snippet produced
> stable but non-canonical md5s for both candidate and control.
>
> 2026-07-01 Phase113 update: reject the combined direct-A GPU-tile descriptor
> attempt. Artifact:
> `/home/mudler/bench/phase113_w4a16_direct_a_gpu_tiles/20260701_233345_no_readback`.
> The candidate (`LLAMA_W4A16_GPU_TILES=1` on top of Phase112 direct-A+GPU-sort)
> avoided the `n_tiles` readback by launching over zero-initialized `max_tiles`
> and returning early on `rows <= 0`. Selected correctness passed `13/13`, but
> perf failed the keep gate: `MOE_SWIGLU_DOWN n=257` was flat
> (`1478.16 -> 1476.36 us`) and `MUL_MAT_ID_RAGGED_MOE n=257` regressed
> (`2148.44 -> 2214.23 us`). The source was reverted and post-revert
> Phase112 direct-A+GPU-sort selected gates passed `13/13`. Next W4A16/MoE work
> should not revisit compact GPU tile descriptors; use vLLM-style padded routing
> metadata (`sorted_token_ids`, expert ids per M block, padded row count) if
> continuing this line.
>
> 2026-07-01 Phase114 update: reject the naive padded routing implementation.
> It implemented the vLLM-style metadata contract with separate padded source
> ids and destination ids for llama.cpp, plus an expert-id W4A16 consumer mode
> and a direct scatter that skipped compact `get_rows_cuda`. Correctness passed
> (`13/13`) but perf failed: after a fix using `num_tokens_post_pad` early
> returns, `MOE_SWIGLU_DOWN n=257` regressed `1477.88 -> 1726.27 us` and
> `MUL_MAT_ID_RAGGED_MOE n=257` regressed `2163.35 -> 2650.93 us`. Artifacts:
> `/home/mudler/bench/phase114_w4a16_padded_routing/20260701_234634_padded_meta`
> and
> `/home/mudler/bench/phase114_w4a16_padded_routing/20260701_235003_padded_meta_fix1`.
> Source was reverted; post-revert Phase112 direct-A+GPU-sort selected gate
> passed `13/13`. Padded metadata is not enough by itself on GB10 because sparse
> expert occupancy makes padded activation/output traffic too expensive.
>
> 2026-07-02 Phase115 update: reject another small-M/tile-policy shortcut.
> Phase115 re-tested the existing default-off `LLAMA_MOE_SMALL_M_TILE=16/32/64`
> knob on the newer Phase108 whole-graph MoE sentinels. Artifact:
> `/home/mudler/bench/phase115_moe_small_m_sentinel/20260702_020258`.
> Control and all three tile caps passed selected correctness (`13/13` each),
> but no candidate met the promotion rule. The 257-token ragged down row
> regressed for every cap (`1452.30 us` control vs `1455.02`, `1458.71`, and
> `1456.88 us`). Do not add name-based down special cases or another MMQ
> tile-policy patch. The next credible target is a true fused routed-MoE kernel
> or a graph-level fusion that removes materialized activation/output traffic.
>
> 2026-07-02 Phase116 update: reject the standalone graph-level
> SwiGLU-to-MMQ-activation-quant fusion. The default-off candidate
> `LLAMA_MOE_SWIGLU_DOWN_FUSED_QUANT=1` detected the plain
> `GLU -> down MUL_MAT_ID` pattern and computed `silu(gate) * up` directly into
> the grouped-MMQ NVFP4 activation buffer. Artifact:
> `/home/mudler/bench/phase116_moe_swiglu_down_fused_quant/20260702_022611`.
> Correctness passed (`13/13`) and the fix1 route emitted the fused marker
> (`6` hits), but perf was not useful: `MOE_SWIGLU_DOWN n=257` was flat
> (`1024.90 -> 1024.69 us`), `n=128` regressed (`806.33 -> 808.79 us`), and the
> ragged sentinel drifted slower. Source was reverted and post-revert selected
> gate passed `13/13`. Do not retry this narrow fused-quant route; the next
> fused-MoE attempt must remove a larger boundary, such as route-once metadata
> shared by both expert GEMMs plus fused GEMM1/activation/GEMM2 or
> weighted-combine/scatter.
>
> 2026-07-02 Phase117 update: keep the default-off MoE boundary trace as
> diagnostic instrumentation only. Artifact:
> `/home/mudler/bench/phase117_moe_route_once_boundary/20260702_024140`.
> The trace decomposes `MOE_SWIGLU_DOWN` into route-sort, activation
> quantization, grouped-MMQ launch, GLU, and graph-pattern records under
> `LLAMA_MOE_BOUNDARY_TRACE=1`; optional timing is gated by
> `LLAMA_MOE_BOUNDARY_TIMING=1`. Inline CUDA event timing initially aborted
> under CUDA graph capture, so the guarded trace emits `us=-1` while capturing
> and only produces real event timings with `GGML_CUDA_DISABLE_GRAPHS=1`.
> Post-guard selected gates passed (`13/13`), trace mode passed (`7/7`), and
> canonical gates passed: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`,
> `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. The timing attribution does not
> fund another local route-sort, tile, GLU, or activation-quant shortcut. The
> next MoE source phase should own a larger pipeline boundary: shared
> route-once metadata across gate_up/down and/or whole-pattern
> GEMM1->activation->GEMM2 execution.
>
> 2026-07-02 Phase118 update: reject standalone route-metadata caching.
> Artifact:
> `/home/mudler/bench/phase118_moe_route_cache/20260702_030549`. The
> default-off candidate `LLAMA_MOE_ROUTE_CACHE=1` stored ids-derived grouped-MMQ
> route metadata in context-owned buffers and reused it within a graph
> evaluation. It was correctness-clean (`13/13` default, opt-in, and
> post-reject) and the trace showed reuse (`23` hits, `3` misses on
> `MOE_SWIGLU_DOWN n=128`), but perf was too small: `MOE_SWIGLU_DOWN n=257`
> improved only `1017.711 -> 1011.915 us` (`+0.57%`) and `n=128` regressed
> `799.360 -> 803.738 us` (`-0.55%`). Runtime cache source was reverted; only a
> local `ggml_cuda_mmq_ids_meta` helper refactor remains as low-conflict
> groundwork. Do not retry metadata-cache-only work. The next attempt must own
> more of the vLLM-style pipeline: GEMM1->activation->GEMM2 and/or
> scatter/combine, not just skipping one `mm_ids_helper` launch.
>
> 2026-07-02 Phase119 update: keep the default-off whole-pattern contract trace
> after fix1. Initial artifact:
> `/home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_034729`;
> fix1 artifact:
> `/home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_035126_fix1`.
> The initial trace proved coverage but missed the overhead rule on
> `MOE_SWIGLU_DOWN n=257` (`1015.070 -> 1028.937 us`, `-1.35%`). Fix1 moved
> detector work off the default path unless `LLAMA_MOE_WHOLE_PATTERN_TRACE` or
> the existing boundary trace is enabled. Fix1 gates are green: selected
> `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` `13/13`, trace `MOE_SWIGLU_DOWN`
> `7/7`, canonical MoE md5 `8cb0ce23`, dense md5 `5951a5b4`,
> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Trace overhead is now within
> rule (`MOE_SWIGLU_DOWN n=128` `805.400 -> 805.584 us`, `-0.02%`;
> `n=257` `1019.715 -> 1021.836 us`, `-0.21%`) and emits supported NVFP4
> markers for both `n_tokens=128` and `257`. This is diagnostic scaffolding,
> not a runtime optimization. The next executor attempt should match at the
> earlier `gate_up MUL_MAT_ID` node and skip through `VIEW, VIEW, GLU, down
> MUL_MAT_ID`; the current `GLU -> down` hook is validation-only because GEMM1
> has already executed.
>
> 2026-07-02 Phase120 update: keep the default-off early whole-pattern matcher
> after fix2. Initial artifact:
> `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040153`;
> fix2 artifact:
> `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040725_fix2`.
> The initial/fix1 versions proved `skip_ready=4` but emitted noisy unsupported
> markers from unrelated `MUL_MAT_ID` candidates. Fix2 emits only the actual
> early pattern and is clean: selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE`
> `13/13`, early trace `MOE_SWIGLU_DOWN` `7/7`, canonical MoE md5
> `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and
> `MUL_MAT_ID 806/806`. It emits exactly six supported early markers for the
> perf sentinels, covering `n_tokens=128` and `257`, with `skip_ready=4`,
> `ids_match=1`, and `swiglu=1`. Trace overhead is within rule
> (`MOE_SWIGLU_DOWN n=128` `803.937 -> 808.978 us`, `-0.62%`;
> `n=257` `1020.412 -> 1026.073 us`, `-0.55%`). The next source phase can now
> implement a guarded executor at this early matcher. First prove safe
> ownership/skip accounting for the five-node sequence, then move route-plan
> reuse and activation/down execution into the helper.
>
> 2026-07-02 Phase121 update: keep the default-off executor proof after fix1.
> Initial artifact:
> `/home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041543`;
> fix1 artifact:
> `/home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041739_fix1`.
> The initial run passed correctness but emitted zero exec markers because the
> exec branch was accidentally nested under the early-trace env condition.
> Fix1 makes `LLAMA_MOE_WHOLE_PATTERN_EXEC=1` engage independently. Gates are
> green: selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` `13/13`, exec
> `MOE_SWIGLU_DOWN` `7/7`, canonical MoE md5 `8cb0ce23`, dense md5
> `5951a5b4`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Exec perf emits
> six `skip=4` markers covering `n_tokens=128` and `257`, and target perf is
> neutral (`MOE_SWIGLU_DOWN n=128` `807.772 -> 806.051 us`, `+0.21%`;
> `n=257` `1021.115 -> 1020.839 us`, `+0.03%`). This proves ownership and skip
> accounting only; it is not a fused-MoE speedup. The next source phase should
> replace one internal boundary inside this helper, preferably route-plan reuse
> or activation in route-slot order, with the same md5/op gates.
>
> 2026-07-02 Phase122 update: reject route-only metadata reuse inside the
> Phase121 executor. Artifact:
> `/home/mudler/bench/phase122_moe_shared_route_meta/20260702_043212`.
> The candidate exposed `ggml_cuda_mmq_ids_meta` as a public MMQ helper and
> used `LLAMA_MOE_WHOLE_PATTERN_SHARED_ROUTE=1` to build route metadata once
> for both `gate_up` and `down`. Correctness passed (`13/13` selected and
> `7/7` shared-route), but perf missed the keep gate:
> `MOE_SWIGLU_DOWN n=128` regressed `808.190 -> 811.836 us` and `n=257`
> regressed `1020.850 -> 1051.666 us` versus the Phase121 executor. Source was
> reverted, including the public metadata API and shared-route env. Post-reject
> gates on the reverted tree passed (`13/13` selected and `7/7` Phase121 exec)
> with six retained exec markers. Do not retry route-only metadata reuse. The
> next MoE executor scope should target activation/down data layout, direct
> activation-to-down input, or a larger GEMM1->activation->GEMM2 fused boundary.
>
> 2026-07-02 Phase123 update: reject standalone fused-down activation
> quantization inside the Phase121 executor. Artifact:
> `/home/mudler/bench/phase123_moe_executor_fused_down_input/20260702_025811`;
> red check:
> `/home/mudler/bench/phase123_moe_executor_fused_down_input/red_20260702_025031`.
> The candidate used `LLAMA_MOE_WHOLE_PATTERN_FUSED_DOWN=1` to run `gate_up`,
> compute `silu(gate) * up` directly into the sorted NVFP4 down MMQ activation
> buffer, and launch the existing down MMQ kernel. Correctness passed
> (`13/13` selected, `7/7` fused-down, six fused markers), but perf was flat:
> versus Phase121 exec, `MOE_SWIGLU_DOWN n=128` was
> `811.153 -> 810.618 us` and `n=257` was `1023.090 -> 1023.657 us`.
> Source was reverted, including the fused-down env, MMQ helper, and NVFP4
> fused quant kernel. Post-reject gates passed (`13/13` selected, `7/7`
> Phase121 exec, six exec markers). Do not retry a single-boundary
> SwiGLU-to-down-quant shortcut; if continuing MoE source work, scope a full
> expert-major packed pipeline that owns `GEMM1->activation->GEMM2`, or pivot to
> another measured bottleneck.
>
> 2026-07-02 Phase124 update: current-stack graph-node serving was refreshed
> after the Phase122/123 rejections. Artifact:
> `/home/mudler/bench/phase124_current_moe_profile/20260702_031205`.
> Pre/post gates are green: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`,
> `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. At `N=128`, prompt `128`,
> generation `64`, serving under graph-node profiling was
> `agg_tps 206.2`, `decode_agg_tps 320.3`, `prefill_tps 1536.4`, wall
> `39.738s`. Fine buckets are now `mmq_nvfp4 6074.78 ms` (`30.17%`) and
> `gdn_core 5888.31 ms` (`29.25%`), with `act_quant` only `674.88 ms`
> (`3.35%`). This explains why single-boundary activation/quant attempts were
> flat. The next source work must reduce one of the two dominant buckets:
> either a full expert-major MoE pipeline for `mmq_nvfp4`, or a default-off GDN
> decode/core experiment for `gdn_core`. Do not spend more GB10 time on
> route-only metadata reuse, fused-down quantization, or MoE tile-policy knobs
> unless a new profile makes those buckets material.
>
> 2026-07-02 Phase125 scoping update: two independent code explorers and a
> local GDN audit challenged the Phase124 fork in the road. The chosen next
> source attempt is the MoE side, but only as a first maintainable slice:
> implement a default-off MMQ sorted-output primitive behind
> `LLAMA_MOE_EXPERT_MAJOR_SORTED_OUT=1`, immediately unsort as a proof, and
> measure `MOE_SWIGLU_DOWN` before attempting the full
> `gate_up -> SWIGLU -> down` expert-major executor. Rationale: vLLM's portable
> advantage is keeping activations expert-major across both GEMMs and
> unpermuting once; Phase122/123 failed because they only touched route metadata
> or one activation boundary. Do not copy CUTLASS/FlashInfer pointer-array, TMA,
> or FP4 scale-swizzle internals. A small GDN patch is not funded by current
> evidence because previous decode/core micro-attempts already rejected the
> obvious geometry/store/broadcast/conv-state shortcuts. Plan:
> `docs/superpowers/plans/2026-07-02-moe-expert-major-sorted-output-phase125.md`.
>
> 2026-07-02 Phase125 result: reject the MMQ sorted-output plus immediate
> unsort proof. Artifact:
> `/home/mudler/bench/phase125_moe_expert_major_sorted_output/20260702_033931`;
> post-reject:
> `/home/mudler/bench/phase125_moe_expert_major_sorted_output/post_reject_20260702_034232`.
> The candidate was default-off and correctness-clean (`13/13` default
> selected, `7/7` opt-in `MOE_SWIGLU_DOWN`, 12 sorted markers), but perf failed
> decisively: versus Phase121 exec, `MOE_SWIGLU_DOWN n=128` regressed
> `805.16 -> 888.76 us` and `n=257` regressed `1023.83 -> 1192.05 us`.
> Source was reverted. Post-reject gates are green: selected `13/13`, Phase121
> exec `7/7` with six markers, MoE md5 `8cb0ce23`, dense md5 `5951a5b4`,
> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Do not retry a path that adds
> a sorted-output temporary and immediately unsorts. A future expert-major MoE
> attempt must keep sorted activations through the down GEMM and unpermute only
> once after the full FFN, or pivot to a larger GDN recurrence design.
> 2026-07-02 Phase126 result: keep the grouped-MMQ presorted helper scaffold.
> The patch only touches `mmq.cu`/`mmq.cuh`, refactors the current MoE id path
> behind explicit `ids_src1`/`ids_dst`/`expert_bounds` metadata, and exposes a
> `src1_sorted` entry point for the future whole-MoE executor. Fixed artifact:
> `/home/mudler/bench/phase126_mmq_presorted_helper/fix1_20260702_040858`.
> Gates were green: selected `13/13`, MoE md5
> `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
> `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`,
> `MUL_MAT_ID 806/806`. Focused perf was neutral:
> `MOE_SWIGLU_DOWN n=128 805.99 us`, `MUL_MAT_ID_RAGGED_MOE n=128
> 1243.85 us`, `MOE_SWIGLU_DOWN n=257 1018.74 us`,
> `MUL_MAT_ID_RAGGED_MOE n=257 1452.84 us`. This is not a parity win by
> itself; it is the dependency for Phase127 to keep `gate_up -> SWIGLU -> down`
> in expert-major order and unpermute only once after the full FFN.
> 2026-07-02 Phase127 result: reject and revert the whole-MoE expert-major
> executor built on the Phase126 helper. Red:
> `/home/mudler/bench/phase127_moe_whole_expert_major/red_20260702_042125`
> passed by fallback with zero markers. Candidate green:
> `/home/mudler/bench/phase127_moe_whole_expert_major/green2_20260702_042916`
> passed default selected `13/13` and opt-in `MOE_SWIGLU_DOWN 7/7`, emitting
> six `LLAMA_MOE_WHOLE_EXPERT_MAJOR` markers after fixing the down-weight shape
> interpretation (`down_w` is `[n_ff, n_embd, experts]`). Perf artifact:
> `/home/mudler/bench/phase127_moe_whole_expert_major/perf_20260702_043104`.
> It failed the keep rule: `MOE_SWIGLU_DOWN n=128` regressed
> `802.57 -> 812.14 us`; `n=257` regressed `1023.25 -> 1039.36 us`;
> ragged standalone was essentially flat. Source was reverted. Post-reject:
> `/home/mudler/bench/phase127_moe_whole_expert_major/post_reject_20260702_043318`
> passed selected `13/13`, MoE md5 `8cb0ce23`, dense md5 `5951a5b4`,
> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Do not retry the same
> fake-tensor whole-executor shape; the next MoE attempt must remove more
> temporary traffic or become a real fused grouped MMQ/SWIGLU/down path. A
> separate alternative is the previously scoped Qwen3Next BF16 GDN S-cache
> experiment, but that needs non-md5 numerical gates.
> 2026-07-02 Phase128 result: reject/revert the Qwen3Next BF16 GDN S-cache
> selector probe for the current target. Artifact:
> `/home/mudler/bench/phase128_qwen3next_gdn_bf16_s_cache/default_20260702_043939`
> built and passed default gates (`GATED_DELTA_NET 48/48`, canonical MoE md5
> `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT`, `MUL_MAT_ID`). Verbose smoke
> artifact:
> `/home/mudler/bench/phase128_qwen3next_gdn_bf16_s_cache/smoke3_20260702_044434`
> showed the active decision model is `qwen35moe`, not Qwen3Next, and S cache
> remained `f32` under `LLAMA_QWEN3NEXT_GDN_S_CACHE_TYPE=bf16`. No true
> Qwen3Next GGUF was found on DGX. The relevant Qwen35/Qwen35MoE BF16 S-cache
> lever was already Phase81/82: it cut `gdn_core` but changed MoE md5 and
> missed the full f16-reference KL acceptance band. Do not retry this exact
> lever unless the quality gate is explicitly re-scoped or a real Qwen3Next
> model artifact is available.
> 2026-07-02 Phase129 result: reject/revert the Qwen35/Qwen35MoE grouped Q/K
> broadcast probe for fused GDN. Plan:
> `docs/superpowers/plans/2026-07-02-qwen35-gdn-qk-grouped-bcast-phase129.md`.
> The candidate added a default-off `LLAMA_QWEN35_GDN_QK_BCAST=1` branch in
> `src/models/qwen35.cpp` and `src/models/qwen35moe.cpp`, reusing the existing
> Qwen3Next `ggml_gated_delta_net_set_bcast()` path. Default gates were green:
> `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/default_20260702_065445`
> passed MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 46/46`,
> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. A standalone opt-in gate
> artifact at `optin_20260702_065604` was invalid because
> `paged-inference-gates.sh` only passes completion env through `EXTRA_ENV`.
> The valid opt-in pre-gate from
> `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/decode_optin_20260702_070149/gate_pre`
> changed MoE md5 to `b773e2f032aa0e992626d486b321808e`, so profiling was
> stopped and the source was reverted. Post-reject:
> `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/post_reject_20260702_070258`
> passed canonical MoE/dense md5, `GATED_DELTA_NET 46/46`, `MUL_MAT 1146/1146`,
> and `MUL_MAT_ID 806/806`; rebuilt `libllama.so` has zero
> `LLAMA_QWEN35_GDN_QK_BCAST` strings. Do not retry this Qwen3Next
> grouped-broadcast port for Qwen35/Qwen35MoE under the current bit-exact md5
> rule.
> 2026-07-02 Phase130 result: current-stack graph-node serving profile refresh,
> measurement-only. Artifact:
> `/home/mudler/bench/phase130_current_stack_profile/20260702_070949`. Shape:
> MoE `q36-35b-a3b-nvfp4`, `N=128`, prompt `128`, generation `64`,
> `PARALLEL=128`, `CTX=131072`. Pre/post gates passed canonical MoE md5
> `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and
> `MUL_MAT_ID 806/806`. Serving metrics: `agg_tps 208.0`,
> `decode_agg_tps 326.9`, `prefill_tps 1519.6`, `TTFT mean 8170.6 ms`, wall
> `39.38 s`, total kernel time `20.1559 s`. The profile confirms the live
> bottleneck remains split between `mmq_nvfp4 6009.52 ms` (`29.82%`) and
> `gdn_core 5891.40 ms` (`29.23%`). FA/mask cleanup is not funded:
> `get_rows 280.62 ms` (`1.39%`) and `fa 257.38 ms` (`1.28%`). The next source
> attempt must target a larger MoE/FFN-GEMM executor/kernel or a materially
> different GDN recurrent-state/packed-decode design, not another paged-mask,
> route-only, activation-only, grouped-broadcast, BF16-cache, or launch-geometry
> shortcut.
> 2026-07-02 Phase131 result: source-selection challenge, no source changes.
> Plan:
> `docs/superpowers/plans/2026-07-02-fused-routed-ffn-phase131.md`. Two
> read-only explorers challenged the Phase130 fork. MoE/FFN-GEMM source work is
> not funded unless it becomes a real fused routed-FFN kernel/executor; another
> route-only, activation-only, W4A16, tile-policy, sorted-output, or fake
> executor patch is expected to repeat Phases 110-127. GDN source work is not
> funded unless it materially reduces f32 recurrent-state traffic without
> BF16/quality drift; launch geometry, gather/identity, producer/store fusion,
> BF16 S-cache, and grouped Q/K broadcast have already failed or changed md5s.
> The next active line is to audit vLLM's fused MoE design and llama.cpp's
> current whole-pattern executor hook for a default-off fused routed-FFN PoC.
> If that audit does not produce a concrete low-conflict hook, require a
> standalone CUDA PoC before touching llama.cpp source.
>
> 2026-07-02 Phase132 result: keep the new default-off routed-FFN PoC scaffold.
> Plan:
> `docs/superpowers/plans/2026-07-02-routed-ffn-poc-phase132.md`. Artifact:
> `/home/mudler/bench/phase132_routed_ffn_poc/20260702_072725`. Source adds
> `ggml/src/ggml-cuda/moe-ffn.cu/.cuh` and a narrow hook in
> `ggml/src/ggml-cuda/ggml-cuda.cu` behind `LLAMA_MOE_ROUTED_FFN_POC=1`.
> The helper currently executes the baseline `gate_up -> SWIGLU -> down`
> sequence through the existing whole-pattern hook, so it is a scaffold, not a
> parity speedup. Initial incremental build failed until CMake was reconfigured
> to pick up the new globbed CUDA source; after `cmake -S . -B build`, build
> passed. Selected default and opt-in gates passed
> `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13`; opt-in emitted six exec
> markers and `libggml-cuda.so` contains one `LLAMA_MOE_ROUTED_FFN_POC` string.
> Default and opt-in canonical gates passed MoE md5 `8cb0ce23`, dense md5
> `5951a5b4`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, and
> `MUL_MAT_ID 806/806`. Focused perf was neutral (`808.32 -> 804.87 us` at
> n=128, `1023.36 -> 1022.71 us` at n=257). Next phase may replace the helper
> internals with a real fused routed-FFN slice; do not claim Phase132 itself as
> a speedup.
>
> 2026-07-02 Phase133 result: keep only as a default-off structural base, not a
> speedup. Plan:
> `docs/superpowers/plans/2026-07-02-routed-ffn-sorted-down-phase133.md`.
> Artifact:
> `/home/mudler/bench/phase133_routed_ffn_sorted_down/20260702_074651`.
> Source exposes `ggml_cuda_mmq_ids_meta`, adds raw
> `ggml_cuda_mul_mat_q_moe_sorted_f32(...)`, and adds
> `LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1` on top of
> `LLAMA_MOE_ROUTED_FFN_POC=1`. The path executes baseline `gate_up` and
> `SWIGLU`, gathers the SWIGLU output into compact expert-sorted F32 rows, then
> calls raw MMQ down without fake tensors. Selected default, Phase132, and
> Phase133 gates passed `13/13`; Phase133 trace proved six
> `mmq_moe_sorted_raw` launches. Default and Phase133 canonical gates passed
> MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 48/48`,
> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Perf was not a win:
> default `807.37/1020.76 us`, Phase132 `808.21/1018.87 us`, Phase133
> `808.85/1026.87 us` for `n=128/257`. Next phase must fuse SWIGLU-to-sorted
> or SWIGLU-to-quant to remove this added gather/quant boundary; do not promote
> sorted-down as-is.
>
> 2026-07-02 Phase134 result: keep only as default-off fused-SWIGLU structural
> base, not a speedup. Plan:
> `docs/superpowers/plans/2026-07-02-routed-ffn-fused-swiglu-phase134.md`.
> Artifact:
> `/home/mudler/bench/phase134_routed_ffn_fused_swiglu/20260702_075828`.
> Source adds `LLAMA_MOE_ROUTED_FFN_FUSED_SWIGLU=1` on top of
> `LLAMA_MOE_ROUTED_FFN_POC=1`, passes `gate/up` views into the routed-FFN
> helper, computes `silu(gate) * up` directly into expert-sorted F32 rows, and
> calls the raw sorted-F32 down MMQ helper. The fused flag now implies the
> sorted-down path; `LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1` is not required.
> Selected opt-in gates passed `13/13`; trace proved six `mmq_moe_sorted_raw`
> launches; canonical opt-in gates passed MoE md5 `8cb0ce23`, dense md5
> `5951a5b4`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, and
> `MUL_MAT_ID 806/806`. Perf is mixed: default `804.92/1026.02 us`, Phase132
> `808.00/1028.43 us`, Phase133 `808.07/1029.02 us`, Phase134
> `810.61/1025.68 us` for `n=128/257`. It recovers n=257 but regresses n=128;
> next work must fuse SWIGLU directly into down-MMQ quant or remove another
> launch/buffer before this becomes a parity lever.
>
> 2026-07-02 Phase135 result: keep as current best default-off routed-FFN base,
> but not parity. Plan:
> `docs/superpowers/plans/2026-07-02-routed-ffn-fused-quant-phase135.md`.
> Focused artifact:
> `/home/mudler/bench/phase135_routed_ffn_fused_quant/20260702_081723`.
> Serving artifact:
> `/home/mudler/bench/phase135_routed_ffn_fused_quant_serving/20260702_082102`.
> Source adds `LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1` on top of
> `LLAMA_MOE_ROUTED_FFN_POC=1`, computes `silu(gate) * up` directly into the
> NVFP4 MMQ activation layout, and launches raw down MMQ via
> `ggml_cuda_mul_mat_q_moe_quantized(...)`. Focused selected gates passed
> `13/13`; trace proved six `mmq_moe_quantized_raw` launches and zero
> `mmq_moe_sorted_raw` launches; canonical focused gates passed MoE md5
> `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 48/48`,
> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Focused perf:
> default `805.92/1031.06 us`, Phase134 `807.65/1027.51 us`, Phase135
> `807.92/1024.97 us` for `n=128/257`. Serving at the Phase130 shape passed
> pre/post gates and improved decode aggregate t/s `326.9 -> 332.7`, while
> `mmq_nvfp4` dropped `6009.52 -> 5915.24 ms`; aggregate stayed `208.0`, prefill
> worsened `1519.6 -> 1475.1`, and total kernel time rose slightly
> `20.1559 -> 20.2498 s`. Next work should target remaining MoE overhead after
> fused quant (`mmq_fixup`, route/writeback, weighted combine), not another F32
> intermediate.
>
> 2026-07-02 Phase136 result: reject and revert the separate post-down
> weighted-combine fuse. Plan:
> `docs/superpowers/plans/2026-07-02-routed-ffn-combine-phase136.md`.
> Focused artifact:
> `/home/mudler/bench/phase136_routed_ffn_combine/20260702_083727`.
> Serving artifact:
> `/home/mudler/bench/phase136_routed_ffn_combine_serving/20260702_085749`.
> The candidate added `LLAMA_MOE_ROUTED_FFN_COMBINE=1` on top of Phase135 and
> skipped the post-down `MUL(weights) -> VIEW* -> ADD*` tail with a separate
> F32 weighted-combine kernel. It was correctness-clean: expanded selected
> gates passed `20/20`, trace proved six combine markers plus six
> `mmq_moe_quantized_raw` launches and zero sorted launches, canonical gates
> passed MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 46/46`,
> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Focused full-tail perf
> improved (`MOE_SWIGLU_COMBINE n=257` `428.53 -> 401.81 us` versus Phase135),
> but serving regressed versus Phase135: aggregate/decode t/s
> `208.0/332.7 -> 206.5/323.2`. Source and the sentinel test were reverted;
> post-reject Phase135 selected gates passed `13/13`. Do not retry a standalone
> post-MMQ combine launch as the next parity lever; any combine/finalize work
> needs a larger serving-visible fused writeback/finalize design.
>
> 2026-07-02 Phase137 result: reject the GDN launch-geometry retune with no
> source changes. Plan:
> `docs/superpowers/plans/2026-07-02-gdn-geometry-sweep-phase137.md`.
> Focused artifact:
> `/home/mudler/bench/phase137_gdn_geometry_sweep/20260702_091441`.
> Serving artifact:
> `/home/mudler/bench/phase137_gdn_geometry_serving/20260702_091740`.
> The env-only sweep tested existing `GDN_NW`/`GDN_CPW` knobs. The best focused
> candidate, `GDN_NW=4 GDN_CPW=1`, improved 1-token GDN rows
> (`hc=32,hs=128,kda=0` `6.793748 -> 4.713682 us`, KDA
> `7.790557 -> 5.194275 us`, grouped broadcast `5.967364 -> 3.407998 us`),
> but real serving regressed versus Phase135 despite clean pre/post gates:
> MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and
> `MUL_MAT_ID 806/806`. Aggregate/decode t/s moved
> `208.0/332.7 -> 206.2/324.9`, total kernel time rose
> `20.2498 -> 20.7530 s`, and `gdn_core` worsened
> `5926.55 -> 6466.27 ms`. Do not promote or source-code a GDN geometry retune
> for this target. The next scoped source line is default-off MoE
> finalize/writeback inside the existing down-MMQ path, not a standalone
> post-MMQ combine launch.
>
> 2026-07-02 Phase138 attempt 1 update: keep the default-off finalize trace and
> full-tail sentinel scaffold; no runtime speedup claim yet. Plan:
> `docs/superpowers/plans/2026-07-02-moe-down-mmq-finalize-phase138.md`.
> Artifacts:
> `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_092943`
> (`MOE_SWIGLU_DOWN` trace-only),
> `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093617_full_tail`
> (new full-tail sentinel), and
> `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093731_canonical`
> (canonical gates). The old `MOE_SWIGLU_DOWN` sentinel emitted six early
> routed-FFN records but no weighted tail. The new `MOE_SWIGLU_FINALIZE`
> sentinel passed default and Phase135-opt-in correctness (`7/7` each) and
> emitted six supported tail records with `tail_nodes=16`, `views=8`, and
> `adds=7`. Canonical patched-Phase93 gates passed MoE md5 `8cb0ce23`, dense
> md5 `5951a5b4`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Next work may
> implement default-off down-MMQ finalize/writeback against this sentinel first;
> keep serving promotion gated by Phase135 decode/aggregate/kernel-time
> thresholds.
>
> 2026-07-02 Phase138 attempt 2 update: keep the default-off down-MMQ
> finalize/writeback candidate as a narrow positive, but do not promote it or
> call parity. Plan:
> `docs/superpowers/plans/2026-07-02-moe-down-mmq-finalize-phase138.md`.
> Focused artifact:
> `/home/mudler/bench/phase138_moe_down_mmq_finalize/20260702_095927_focused`;
> canonical gates:
> `/home/mudler/bench/phase138_moe_down_mmq_finalize/20260702_100202_canonical`;
> serving:
> `/home/mudler/bench/phase138_moe_down_mmq_finalize_serving/20260702_100330`.
> The candidate adds `LLAMA_MOE_ROUTED_FFN_FINALIZE_POC=1` on top of Phase135,
> zeroes the final output, atomically accumulates `down_sum * router_weight`
> from the down-MMQ path, and skips the strict weighted tail only after the
> finalize helper is selected. Focused `MOE_SWIGLU_FINALIZE` correctness passed
> for default, Phase135, and Phase138 (`7/7` each); canonical and serving
> pre/post gates passed MoE md5 `8cb0ce23`, dense md5 `5951a5b4`,
> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Serving versus Phase135 moved
> aggregate/decode t/s `208.0/332.7 -> 209.3/333.5`, total kernel time
> `20.2498 -> 20.0489 s`, and `mmq_nvfp4 5915.24 -> 5802.87 ms`; however
> `ew_add` remains visible at `374.09 ms`, so this is only an incremental
> default-off improvement. Next work should reduce the remaining fan-in/writeback
> path more deeply or return to the dominant `gdn_core`/`mmq_nvfp4` buckets.
>
> 2026-07-02 Phase139 result: serving noise-floor repeat rejects treating the
> Phase138 one-off serving gain as source-funding evidence. Spec:
> `docs/superpowers/specs/2026-07-02-serving-noise-floor-phase139-design.md`.
> Plan:
> `docs/superpowers/plans/2026-07-02-serving-noise-floor-phase139.md`.
> Artifact:
> `/home/mudler/bench/phase139_serving_noise_floor/20260702_081901`.
> Seven identical current-binary Phase138 serving/profile runs all passed
> pre/post gates: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`,
> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. The runtime variance was much
> larger than Phase138's one-off delta: aggregate throughput median
> `208.5 t/s`, stdev `2.8022`, CV `1.349%`, range `203.4..212.3`; wall CV
> `1.347%`; `mmq_nvfp4` CV `3.351%`. Keep Phase138 default-off as
> correctness-clean and focused-positive, but do not stack another
> finalize/MMQ micro-patch from serving evidence alone. Future serving claims
> need repeated A/B medians and must exceed `max(2.0%, 3 * same-binary stdev)`.
> The next source phase should pivot to a larger measured bucket, with GDN
> packed decode/prep now more defensible than another MoE finalize shortcut.
>
> 2026-07-02 Phase140 result: reject an immediate in-GDN Q/K
> L2-normalization patch. Spec:
> `docs/superpowers/specs/2026-07-02-gdn-decode-prep-trace-phase140-design.md`.
> Plan:
> `docs/superpowers/plans/2026-07-02-gdn-decode-prep-trace-phase140.md`.
> Artifact:
> `/home/mudler/bench/phase140_gdn_decode_prep_trace/20260702_085348`.
> The current Phase138 opt-in serving/profile shape passed pre/post gates:
> MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and
> `MUL_MAT_ID 806/806`. Serving/profile reported aggregate/decode throughput
> `207.3/328.9 t/s`, wall `39.501 s`, total kernel `20.2002 s`, `GDN
> 6673.66 ms`, `gdn_core 5890.44 ms`, and `gdn_l2norm 100.30 ms`. The focused
> SQLite summary had `l2_norm_f32 100.3024 ms` versus
> `gated_delta_net_cuda 5804.7074 ms`. This is above the absolute
> three-sigma floor from Phase139 (`53.433 ms`) but below the planned `3%` of
> GDN-core materiality threshold at about `1.7%`, so prep-only L2 fusion is not
> source-funded. Next GDN work should be recurrence-level, packed-state, or
> datacenter-Blackwell-specific, not another prep micro-fusion.
>
> 2026-07-02 Phase141 result: decode-only GDN source claims must normalize by
> launch count or tightly control the capture window. Spec:
> `docs/superpowers/specs/2026-07-02-gdn-decode-noise-floor-phase141-design.md`.
> Plan:
> `docs/superpowers/plans/2026-07-02-gdn-decode-noise-floor-phase141.md`.
> Artifact:
> `/home/mudler/bench/phase141_gdn_decode_noise_floor/20260702_090428`.
> Five identical current-binary decode-only captures all passed pre/post gates:
> MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and
> `MUL_MAT_ID 806/806`. Raw `gdn_core_ms` median/stdev/CV was
> `1415.500/30.641/2.146%`, range `1410.300..1482.140 ms`, but launch counts
> drifted (`597`, `598`, `600`, `630`). Normalized `gdn_core_ms_per_launch`
> was stable: median/stdev/CV `2.359167/0.005399/0.229%`. Future GDN A/B
> source claims need repeated medians and must beat either `6.49%` raw
> `gdn_core` reduction or `2.0%` launch-normalized reduction. The small
> default-off source follow-up now worth testing is scalar gate/beta hoisting
> inside `gated_delta_net_cuda`; vLLM-style packed decode recurrence remains a
> larger redesign.
Audience: an agent with **zero prior context** who has been told to "continue the GB10 vLLM-parity investigation" on the `llama-cpp-localai-paged` backend.
@@ -20,6 +589,69 @@ Read order for a cold start:
## 1. TL;DR STATE
> 2026-07-01 Phase104-108 update: the current carried source line is still the
> Phase93 Qwen3Next grouped Q/K broadcast plus the Phase101/102 default-off
> cleanup candidates. Phase104/106 same-session serving showed the stack is
> md5/op clean but still far from vLLM: at `N=128`, paged/vLLM was about
> `0.66` on decode and `0.50-0.51` on aggregate; at `N=192/256`, vLLM remained
> faster and TTFT stayed about `3x` lower. Phase105 refreshed the grouped-MMQ
> trace and found no new host-side tile-policy lever. Phase107 proved the MoE
> structural correctness gates exist (`MOE_SWIGLU_DOWN 7/7`,
> `MOE_WEIGHTED_COMBINE 7/7`, `MUL_MAT_ID_RAGGED_MOE 6/6`) but also proved
> `test-backend-ops perf` did not time those custom whole-graph cases. Phase108
> fixed that measurement gap in `tests/test-backend-ops.cpp`: perf mode now
> includes those MoE cases at `n_tokens=128,257`, and CSV output includes
> `time_us`, `flops`, `memory_kb`, and `n_runs`. The Phase108 artifact is
> `/home/mudler/bench/phase108_moe_perf_csv/20260701_221559`; md5s and compact
> op gates are green. Use Phase108 rows as the baseline for any fused routed-MoE
> implementation. Current ranking: `MUL_MAT_ID_RAGGED_MOE` is `1239-1446 us/run`,
> `MOE_SWIGLU_DOWN` is `802-1020 us/run`, and `MOE_WEIGHTED_COMBINE` is only
> `28-68 us/run`, so do not spend the next patch on weighted-combine fusion
> alone.
> Phase109 then tested existing env-gated routes on the Phase108 rows:
> `LLAMA_W4A16_PREFILL_M=128`, `LLAMA_FP4_PREFILL_M=128`,
> `LLAMA_MOE_DENSITY_MAX=9`, and `LLAMA_MOE_MMQ_X=64`
> (`/home/mudler/bench/phase109_existing_moe_prefill_ab/20260701_222559`).
> All selected correctness gates passed (`13/13` per env), but W4A16 and FP4
> large-M regressed the 257-token rows badly, and density/tile retuning was
> noise-level on `MUL_MAT_ID_RAGGED_MOE` while not helping `MOE_SWIGLU_DOWN`.
> Do not spend another phase on MMQ tile-policy shortcuts. The next credible
> implementation is structural: port the vLLM-style idea of GPU-side
> token/expert routing metadata (`sorted_token_ids`, expert offsets/bounds,
> inverse permutation) into llama.cpp's `mul_mat_id` host-sync fallback/grouped
> W4A16 path, while leaving the graph-safe grouped-MMQ path untouched.
> Phase110 implemented the first slice of that structural path as default-off
> `LLAMA_MOE_GPU_SORT=1` in `ggml_cuda_mul_mat_id`, reusing the existing
> `mm_ids_helper` GPU sort for fallback metadata. The initial branch failed
> `3/13` selected opt-in rows because `mm_ids_helper` returns sorted-to-original
> `ids_dst`, while fallback `get_rows_cuda()` needs original-to-sorted
> `ids_from_sorted`; adding a tiny inverse-permutation kernel fixed correctness.
> Accepted artifact:
> `/home/mudler/bench/phase110_gpu_moe_sort/20260701_224446_fix1`. Gates are
> green: canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
> `5951a5b4d624ce891e22ab5fca9bc439`, and supported compact ops
> `SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`, `GET_ROWS 49/49`,
> `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806` for both
> default and `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`. Perf decision:
> keep as a default-off structural base only. It improves W4A16 fallback
> 257-token rows by `7.2%` (`MOE_SWIGLU_DOWN`) and `7.9%`
> (`MUL_MAT_ID_RAGGED_MOE`), but the opt-in fallback is still about `1.5x`
> slower than default grouped-MMQ. Phase111 must remove another fallback
> bottleneck, such as the remaining `expert_bounds` host copy / host tile
> descriptor build, before this line can matter for parity.
> Phase111 tested that narrow follow-up as default-off `LLAMA_W4A16_GPU_TILES=1`:
> W4A16 tile descriptors were built on GPU from `expert_bounds_dev` with an
> atomic tile counter. It was correctness-clean after fixing a pointer mutability
> compile error and a CUDA pool LIFO allocation bug, but clean perf was
> flat-to-negative (`MUL_MAT_ID_RAGGED_MOE n=257` regressed about `2.0%` versus
> Phase110 GPU-sort). Artifact:
> `/home/mudler/bench/phase111_w4a16_gpu_tiles/20260701_230400_fix1`. The
> Phase111 source was reverted, and post-revert W4A16+GPU-sort selected gates
> passed `13/13`. Do not reopen a standalone GPU tile descriptor cleanup; the
> next W4A16 attempt must remove a larger boundary, such as direct activation
> consumption plus GPU descriptors together, or bypass the host-sync fallback
> path entirely.
>
> 2026-07-01 active update: Phase50-59 reopened the dense and MoE serving
> scheduler question.
> True dense decode is much closer to vLLM (`383.66` vs `435.00` t/s, `88.2%`)
@@ -46,7 +678,7 @@ Read order for a cold start:
> are local and DGX-gated but not pushed, so the LocalAI patch series has not
> been regenerated.
>
> 2026-07-01 Phase81 update: the next viable GDN lever is no longer launch
> 2026-07-01 Phase81-85 update: the next viable GDN lever is no longer launch
> shape or gather removal. A default-off Qwen35/Qwen35MoE BF16 persistent
> recurrent S-cache experiment (`LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16`) cut
> same-source decode-only `gdn_core` from `1399.30 ms / 599 launches`
@@ -54,9 +686,257 @@ Read order for a cold start:
> F32 md5 gates and op gates stayed green, and BF16 dense md5 stayed canonical,
> but BF16 MoE md5 changed to `07db32c2bcb78d17a43ed18bc22705cd`. A quick
> MoE KL smoke vs the same-source F32 base showed KLD `0.055499 +/- 0.001705`,
> same-top-p `88.361%`, and PPL ratio `1.010356`. Treat this as a promising
> default-off candidate only. Phase82 must run the full f16-reference KL gate
> plus serving A/B before regenerating LocalAI patches or considering promotion.
> same-top-p `88.361%`, and PPL ratio `1.010356`. Phase82 then ran the full MoE
> f16-reference gate at
> `/home/mudler/bench/phase82_bf16_s_cache_f16_kl/20260701_183016`: same-source
> F32 measured KLD `0.136563 +/- 0.003242`, while BF16 S-cache measured
> `0.137162 +/- 0.003456` against the documented paged acceptance reference
> `0.136000 +/- 0.003285`. Reject promotion and do not run serving A/B for this
> candidate under the current hard KL rule. Phase83 then tested a bit-exact
> KDA `expf(g)` register-cache shortcut in the GDN CUDA core. Md5 and op gates
> stayed green, but same-window decode-only `gdn_core` moved
> `1399.46 -> 1405.62 ms`, so reject that micro-optimization too. Phase84
> reduced in-place GDN op outputs to attention-only tensors and moved the CPU
> ids fallback scratch to workspace; md5/op gates stayed green and startup free
> CUDA memory improved `117472 -> 117855 MiB`, but same-window decode-only
> `gdn_core` moved `1399.72 -> 1407.38 ms`. Treat Phase84 as a possible
> memory-footprint cleanup only, not a speed parity lever. Phase85 added a
> graph-reuse-safe identity-contiguous recurrent-state fast path: it calls
> `ggml_gated_delta_net_inplace` on a direct state view when `s_copy_main` is
> identity, otherwise keeps the ids path. Md5/op gates stayed green, the
> `gdn_gather` fine bucket disappeared, GDN macro launches dropped
> `3600 -> 2980`, and same-window `gdn_core` moved `1412.33 -> 1400.34 ms`.
> Carry Phase85 only as a small cleanup candidate. Phase86 audited the producer
> fusion idea against the Phase85 node-traced profile before coding it: the
> whole `act/GDN-gate(shared)` macro is only `13.57 ms` of `3.6622 s`, beta
> sigmoid is `2.73 ms`, and CUDA already fuses `UNARY + MUL` for softplus,
> sigmoid, and SILU. Reject producer-only fusion as too small. Phase87 then
> exposed an env-gated `GDN_NW=4 GDN_CPW=8` decode geometry probe to test a
> vLLM-like `BV=32` tile shape. It was md5/op green, but same-source
> decode-only `gdn_core` regressed `1390.56 -> 1417.13 ms`, so the source line
> was reverted. Phase88 tried a first default-off `GDN_DECODE_PACK2=1` packed
> decode CTA kernel. It built and CUDA op tests stayed green, but canonical md5
> failed for both MoE (`320b5ed...` vs `8cb0ce...`) and dense (`6a65e9...` vs
> `5951a5...`), with visible output corruption, so it was reverted without
> profiling. Phase89 tried to add that focused guardrail through
> `test_gated_delta_net_inplace_ids`, but selecting that test class directly
> already fails the pre-existing BF16 cases on CUDA, so the naive test addition
> was also reverted. Phase90 fixed the fixture root cause for identity ids by
> mirroring `state` into `state_dst` during initialization and added F32
> `S_v=128`, `n_seqs=2` cases that return `concat(out,state_dst)`, so the
> backend comparator now checks both attention output and the side-effect state
> write. DGX CUDA selected-op gate is green (`4/4`). Use this Phase90 guardrail
> before any new packed-decode kernel, then still run canonical md5/op gates.
> Phase91 retried the default-off `GDN_DECODE_PACK2=1` CTA sequence-packing
> kernel under that guardrail. The first `n_seqs=2` guardrail passed but MoE md5
> failed for the single-sequence completion gate, exposing an uncovered odd/single
> sequence PDL hazard. Moving inactive lanes past `ggml_cuda_pdl_sync()` and
> adding `n_seqs=1,3` guardrail cases made the candidate md5/op clean
> (`GATED_DELTA_NET 46/46`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), but
> decode-only `gdn_core` regressed to `1425.44 ms`, so the runtime patch was
> reverted. Keep the expanded guardrail; do not retry CTA-level sequence packing
> unless it also reduces per-sequence GDN work. ids gather, producer overhead,
> simple geometry changes, and ungated packed kernels are not acceptable parity
> paths. Phase92 tried the next smallest scalar one-token recurrence
> micro-optimization: a default-off `GDN_SCALAR_DECODE_STORE_FUSED=1` CUDA path
> that stores final state inside the scalar update loop and skips the final
> post-token register-store loop. It passed local CPU guardrail, DGX CUDA
> guardrail, canonical md5s, `GATED_DELTA_NET 46/46`, `MUL_MAT 1146/1146`, and
> `MUL_MAT_ID 806/806`, but decode-only `gdn_core` regressed further to
> `1529.72 ms` (`/home/mudler/bench/phase92_gdn_scalar_store_fused/20260701_204718/decode_profile`),
> so the runtime patch was reverted. Do not retry store-fusing without evidence
> that the final state store loop is independently dominant. The next credible
> scoped ideas from the vLLM audit are the larger packed decode contract and the
> Qwen3Next GQA-repeat removal, each as a separate guarded phase. Phase93
> implemented the Qwen3Next GQA-repeat removal as an explicit grouped Q/K
> broadcast mode on `GGML_OP_GATED_DELTA_NET` (`op_params[2]`), preserving the
> existing modulo/tiled broadcast for Qwen35 while allowing Qwen3Next to map
> `qk_head = value_head / (H_v / H_k)` and skip materializing repeated q/k heads
> when the GDN op path is active. Local CPU `GATED_DELTA_NET` passed `48/48`,
> local CPU in-place ids passed `6/6`, DGX CUDA `GATED_DELTA_NET` passed `48/48`,
> DGX CUDA in-place ids passed `6/6`, canonical md5/op gates passed
> (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), and
> decode-only `gdn_core` improved to `1333.48 ms`
> (`/home/mudler/bench/phase93_qwen3next_gqa_bcast/20260701_211019/decode_profile`).
> Carry Phase93 as the current positive candidate. Phase94 then retested
> decode geometry on top of Phase93 with env-only `GDN_NW=8 GDN_CPW=8`. It
> stayed md5/op clean (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`,
> `MUL_MAT_ID 806/806`) but decode-only `gdn_core` regressed to `1440.79 ms`
> (`/home/mudler/bench/phase94_gdn_geometry_phase93/20260701_211855/decode_profile_8x8`),
> so reject 8x8 and keep Phase93's default 16x8 geometry. Phase93 trace evidence
> also shows remaining producer-side GDN work is small (`l2_norm_f32 8.65 ms`,
> GDN gate/sigmoid about `12.75 ms`, remaining repeat `5.34 ms`), so the next
> useful lead should target recurrence work or a larger packed decode contract,
> not another small producer-only fusion. Phase95 tested a default-off
> `GDN_WARP_SCALAR_GATE=1` CUDA decode specialization on top of Phase93: lane 0
> computed the scalar non-KDA gate and broadcast it within the warp for the
> one-token `S_v=128`, default `16x8` path. Local CPU guardrails passed
> (`GATED_DELTA_NET 48/48`, in-place ids `6/6`), DGX CUDA guardrails passed
> (`GATED_DELTA_NET 48/48`, in-place ids `6/6`), canonical md5/op gates passed
> (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), but
> decode-only `gdn_core` regressed to `1402.40 ms`
> (`/home/mudler/bench/phase95_gdn_warp_scalar_gate/20260701_213311/decode_profile`).
> The runtime patch was reverted. Do not retry scalar-gate warp broadcast unless
> a future profile shows SFU pressure, rather than recurrent state traffic or
> reductions, dominating the GDN core. Phase96 then tested the narrow
> conv-state identity fast path suggested by the trace audit: when
> `s_copy_main` was identity, `build_conv_state_fused` viewed the active
> conv-cache slots directly and called `ggml_ssm_conv_update_inplace` instead of
> the ids variant. Local CPU `SSM_CONV` passed `45/45`; DGX CUDA `SSM_CONV`
> passed `45/45`; canonical gates passed (`SSM_CONV 45/45`,
> `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`, md5s
> canonical). Decode-only profile regressed to total kernel `3.6723 s`,
> `gdn_core 1406.57 ms`, and `gdn_conv 70.42 ms`
> (`/home/mudler/bench/phase96_conv_identity_fastpath/20260701_214141/decode_profile`).
> The runtime model-graph patch was reverted. Do not retry the conv identity
> branch as a speed lever unless a same-window trace proves the ids variant is
> independently dominant. Phase97 then measured the carried Phase93 stack in an
> end-to-end `n=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128` serving snapshot
> against vLLM. Pre/post canonical gates stayed green. Paged Phase93 measured
> `agg_tps 329.6`, `decode_agg_tps 669.8`, `prefill_tps 1734.5`,
> `ttft_mean_ms 7415.4`, `wall_s 24.851`; vLLM measured `agg_tps 664.8`,
> `decode_agg_tps 1029.4`, `prefill_tps 5271.8`, `ttft_mean_ms 2519.5`,
> `wall_s 11.929`
> (`/home/mudler/bench/phase97_phase93_serving_snapshot/20260701_214648`).
> Phase93 therefore remains a decode-profile positive candidate, but it does not
> close serving parity (`paged_decode_over_vllm=0.6507`). The next useful phase
> needs a larger serving-impact lever; isolated GDN/conv micro-optimizations
> have now repeatedly failed to move live serving enough. Phase98 profiled that
> carried Phase93 serving window with graph-node CUDA tracing. Pre/post gates
> stayed green. Total kernel time was `20.0411 s`; macro buckets were GDN
> `6679.96 ms` (`33.33%`), MoE/FFN-GEMM `6034.52 ms` (`30.11%`),
> bf16/fp8-proj `2766.06 ms` (`13.80%`), and layout-copy `1257.60 ms`
> (`6.28%`). Fine buckets were led by `gdn_core 5892.99 ms` (`29.40%`) and
> `mmq_nvfp4 5809.55 ms` (`28.99%`), followed by `convert_dtype 663.45 ms`,
> `gdn_conv 457.11 ms`, and `concat_layout 430.25 ms`
> (`/home/mudler/bench/phase98_phase93_serving_profile/20260701_215715`).
> This re-ranks the next work: do not spend more time on scalar GDN, conv
> identity, or gather-only shortcuts. Either attribute and remove a proven
> material layout-copy node, or pursue a larger GDN-core/MMQ serving lever with a
> standalone PoC gate. Phase99 then used the existing default-off
> `LLAMA_LAYOUT_TRACE` hook on the same Phase93 serving profile shape
> (`N=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`). Trace-enabled gates stayed
> green (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`,
> canonical MoE/dense md5s). Serving remained comparable (`total kernel
> 20.2408 s`, `layout-copy 1269.35 ms`). The trace attributed
> `concat_layout 440.01 ms` almost entirely to
> `conv_input-* = concat(conv_states_reshaped-*, qkv_mixed_transposed-*)` before
> `SSM_CONV`; `copy_layout 119.16 ms` includes `conv_state_update-*` writeback.
> The larger `convert_dtype 662.34 ms` bucket is mostly unnamed F32-to-F16 `CPY`
> rows and needs stronger attribution before coding. Decision: Phase99 is
> measurement-only; do not retry the Phase96-style conv-state identity branch.
> The only conv-side patch worth funding is a larger two-source `SSM_CONV`
> contract that reads `(conv_states, qkv_mixed)` as a logical concat, or else
> extend trace attribution for the unnamed `convert_dtype` bucket first
> (`/home/mudler/bench/phase99_layout_trace/20260701_200835/serving_profile`).
> Phase100 extended that trace with `dst_view`, `src0_view`, and `src1_view`
> names. The trace-only patch built locally and on DGX, and trace-enabled gates
> stayed green (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`,
> `MUL_MAT_ID 806/806`, canonical MoE/dense md5s). Serving stayed comparable
> (`total kernel 20.3464 s`, `convert_dtype 661.73 ms`, `concat_layout
> 438.15 ms`). The new fields identify a concrete `convert_dtype` source:
> `GET_ROWS` reads F16 `cache_k_l*` / `cache_v_l*` into F32 `node_*`, then
> `CPY` downcasts views such as `src0_view=node_358` / `node_365` to F16
> attention-shaped tensors. This repeats across attention layers
> (`cache_k_l3/v_l3`, `cache_k_l7/v_l7`, `cache_k_l11/v_l11`, ...). Some F32->F16
> rows remain unnamed, so the next runtime phase should be a narrow K/V cache
> get_rows dtype A/B, not a broad layout rewrite
> (`/home/mudler/bench/phase100_layout_view_trace/20260701_201800/serving_profile`).
> Phase101 implemented that narrow A/B as default-off
> `LLAMA_PAGED_KV_GET_ROWS_F16=1`: add `ggml_get_rows_type`, support CPU F16
> source -> F16 destination row copy, and use typed F16 `GET_ROWS` only for
> paged K/V gather when the cache tensor is F16. Local and DGX builds completed;
> CUDA `GET_ROWS` passed `49/49` including the new F16-output cases; default and
> opt-in md5/op gates stayed green (`GET_ROWS 49/49`, `GATED_DELTA_NET 48/48`,
> `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`, canonical MoE/dense md5s).
> Serving profile under opt-in measured `total kernel 20.1989 s`, `agg_tps
> 206.4`, `decode_agg_tps 328.0`, and `ttft_mean_ms 8211.1`. It reduced
> `copy_layout 116.25 -> 80.32 ms` and macro `layout-copy 1262.58 -> 1220.30 ms`
> versus Phase100, but `convert_dtype` stayed flat (`661.73 -> 661.35 ms`) and
> serving throughput did not improve. Carry Phase101 only as a small default-off
> cleanup candidate pending repeat A/B; do not promote it as a parity lever
> (`/home/mudler/bench/phase101_kv_get_rows_f16/20260701_203930/serving_profile`).
> Phase102 then implemented the funded two-source `SSM_CONV` contract as
> default-off `LLAMA_SSM_CONV_SPLIT=1`: `ggml_ssm_conv_split(ctx, conv_states,
> x_cur, conv_kernel)` reuses `GGML_OP_SSM_CONV`, reads
> `[K-1,channels,n_seqs]` cached taps plus native `[channels,n_tokens,n_seqs]`
> qkv tokens as a logical concat, and is wired into Qwen3Next/Qwen35/Qwen35MoE
> only for multi-token, non-rollback batches with `n_seq_tokens >= K-1`. The
> initial semantic test exposed a harness issue (`split-base` has an exactly
> zero CPU reference, so normalized MSE reported `ERR=inf`); direct split
> CUDA-vs-CPU passed `6/6`, and the final test keeps `split-base` with absolute
> max error. Local and DGX builds passed; default, standalone opt-in, and
> serving pre/post gates stayed green (`SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`,
> `GET_ROWS 49/49`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`,
> `MUL_MAT_ID 806/806`, canonical MoE/dense md5s). Opt-in serving measured
> `total kernel 19.5482 s`, `agg_tps 206.1`, `decode_agg_tps 320.0`,
> `prefill_tps 1538.0`, and `ttft_mean_ms 7928.4`. It removed the traced concat
> materialization (`concat_layout 433.13 -> 4.59 ms` versus Phase101 and
> `layout-copy 1220.30 -> 826.87 ms`), but live serving throughput still did not
> improve. Carry Phase102 as a default-off cleanup/follow-up base only; do not
> promote it as parity-closing without a repeat A/B or an additional state-update
> fusion. The remaining high-value targets are still `gdn_core`, `mmq_nvfp4`, or
> a larger serving scheduler/packed-decode contract
> (`/home/mudler/bench/phase102_ssm_conv_split/20260701_210907/serving_profile`).
> Phase103 measured Phase101+Phase102 together, with no new source changes:
> `LLAMA_SSM_CONV_SPLIT=1 LLAMA_PAGED_KV_GET_ROWS_F16=1`. Standalone and
> serving pre/post gates stayed green (`SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`,
> `GET_ROWS 49/49`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`,
> `MUL_MAT_ID 806/806`, canonical MoE/dense md5s). Combined serving improved
> over Phase102 (`agg_tps 206.1 -> 212.3`, `decode_agg_tps 320.0 -> 331.5`,
> `prefill_tps 1538.0 -> 1569.1`, `wall_s 39.743 -> 38.575`) and reduced
> `layout-copy 826.87 -> 798.52 ms`; it also preserved most of the split
> SSM_CONV concat removal and recovered the F16 K/V `copy_layout` reduction
> (`copy_layout 112.53 -> 78.22 ms`). This proves the two cleanup candidates are
> compatible, but not parity-closing: `gdn_core 5930.47 ms` and `mmq_nvfp4
> 6001.77 ms` still dominate. Carry the combined env as the cleanup comparison
> baseline; do not rerun isolated layout cleanup unless it changes a larger
> serving contract
> (`/home/mudler/bench/phase103_combined_layout_cleanups/20260701_211821/serving_profile`).
> Phase104 then measured that combined cleanup stack in the normal same-session
> serving harness against vLLM at `N=128`, `PTOK=128`, `GEN=64`,
> `PARALLEL=128`. Pre/post gates stayed green with the same expanded op set and
> canonical md5s. Paged combined measured `agg_tps 338.6`,
> `decode_agg_tps 675.8`, `prefill_tps 1813.0`, `ttft_mean_ms 7121.6`, and
> `wall_s 24.196`; vLLM measured `agg_tps 661.1`, `decode_agg_tps 1028.0`,
> `prefill_tps 5208.7`, `ttft_mean_ms 2572.3`, and `wall_s 11.980`. This is a
> small serving improvement over Phase97 (`agg_tps +2.73%`, `prefill_tps
> +4.53%`, `TTFT -3.96%`), but still not parity: `paged_decode_over_vllm=0.6574`
> and `paged_agg_over_vllm=0.5122`. Carry the combined cleanup stack as the best
> current comparison baseline. The next useful phase must attack a larger
> serving-impact contract or the dominant GDN/MMQ buckets, not more isolated
> layout-copy cleanup
> (`/home/mudler/bench/phase104_combined_serving_snapshot/20260701_212551`).
> Phase105 refreshed grouped-MMQ evidence on that current stack without source
> changes. `MUL_MAT_ID_RAGGED_MOE` stayed green both default and trace-enabled
> (`6/6`), full `MUL_MAT_ID` stayed green (`806/806`), and the live serving
> retry returned a non-empty response while recording `120` shape and launch
> lines. The live sample was prefill-like (`ncols_max=317`, density `10`,
> `mmq_x_best=112`, `stream_k=1`) with no small-M lines; all launches had
> `fixup=0`, `stream_k_blocks == ntiles_dst`, and efficiency `100`. This
> confirms the current cleanup stack did not open a new cheap MMQ shortcut.
> Do not add another host-side MMQ tile policy; only revisit MMQ for a
> genuinely structural kernel or serving-contract change
> (`/home/mudler/bench/phase105_mmq_current_shape/20260701_214129_serving_retry`).
> Phase106 tested the remaining low-conflict C1 operating-point hypothesis on
> the current stack: same-session `N=128/192/256` with `PARALLEL=256`,
> `VLLM_MAX_NUM_SEQS=256`, and the combined cleanup env. Pre/post gates stayed
> green with canonical md5s and the expanded op set. vLLM completed all legs and
> stayed ahead: at `N=256`, paged measured `agg_tps 338.4`,
> `decode_agg_tps 824.6`, `ttft_mean_ms 14933.5`, while vLLM measured
> `agg_tps 723.8`, `decode_agg_tps 1320.4`, `ttft_mean_ms 4999.0`. Reject C1
> for the current GB10 stack. The next source phase should be structural
> persistent-batch/fused-MoE/GDN work, not another scheduler shortcut
> (`/home/mudler/bench/phase106_max_concurrency_current_stack/20260701_214907`).
> Phase107 established the fused-MoE structural guardrail surface before coding:
> `MOE_SWIGLU_DOWN 7/7`, `MOE_WEIGHTED_COMBINE 7/7`, and
> `MUL_MAT_ID_RAGGED_MOE 6/6` passed on CUDA0. However,
> `test-backend-ops perf` did not provide usable timing rows for these custom
> whole-graph cases; the broad `MUL_MAT_ID` perf CSV reported support metadata
> only. The next source patch should be measurement-only: add a narrow MoE
> fusion timing harness with explicit GPU synchronization and CSV timing before
> funding any fused routed-MoE kernel
> (`/home/mudler/bench/phase107_moe_fusion_guardrail/20260701_220227`).
- Historical verdict: the older investigation marked GB10 parity **CLOSED** and
unreachable. Treat that as superseded where Phase50-54 provide newer dense
@@ -1416,3 +2296,44 @@ assumption is too narrow for a sub-millisecond capture-level win. Do not spend
more parity time on gather-only GDN shortcuts unless a future profile makes
gather material. The next serious GDN scope remains recurrent-state
precision/traffic.
## Series trim (phases 110-140 review, 2026-07-02)
The campaign's on-disk patches `0048-0063` were added without matching fork
commits (a fork-first policy violation). After a keep/drop review of the
phase 110-140 work, the series was trimmed to a single kept line plus the
gate harness, and re-mirrored to the fork:
- KEEP - test sentinels (the MoE gate harness): `MOE_SWIGLU_DOWN`,
`MOE_SWIGLU_COMBINE`, `MUL_MAT_ID_RAGGED_MOE` (old `0051-0053`).
- KEEP - the MTP-draft correctness fix (old `0054`): forces target-side
sampler acceptance for MTP drafts (backend draft sampling can request
multiple output rows per sequence); the backend ships `-mtp` gallery models.
- KEEP - the Phase135 routed-FFN fused-quant line: whole-pattern MoE matcher +
routed-FFN executor hook (Phase120/121), the routed-FFN PoC scaffold
`moe-ffn.{cu,cuh}` (Phase132), and the fused SwiGLU-to-NVFP4-quant + raw down
MMQ (`ggml_cuda_mul_mat_q_moe_quantized` + local `ggml_cuda_mmq_ids_meta`
refactor, Phase135). All default-off, md5-clean opt-in, six
`mmq_moe_quantized_raw` markers with zero sorted launches on the sentinel.
- DROP - W4A16 grouped-tile pack/tune/pad (old `0048-0050`): dead line, W4A16
is ~1.5x slower than grouped-MMQ.
- DROP - speculative/trace/cublas-route/mmid-route/mul-mat-route traces + the
rejected small-M tile-policy knob (old `0055-0063`).
- DROP - all other campaign keep-markers not needed by Phase135: GPU-sort
(Phase110), W4A16-direct-A (Phase112), boundary trace/timing (Phase117),
Phase133 sorted-F32 down, Phase134 fused-SWIGLU-only, Phase138
finalize/weighted-combine. The final fork tree carries zero of these markers.
Fork branch `mudler/llama.cpp:localai-paged` re-mirrored on top of
`51168c5ee` (LocalAI series `0001-0047`):
- `fd920cf8a` test(paged): cover MoE swiglu down chain
- `a85c1e098` test(paged): cover MoE weighted combine chain
- `2fed6aacf` test(paged): cover ragged MoE dispatch
- `f1d976f06` fix(speculative): disable backend sampling for MTP drafts
- `1edddc8fe` feat(paged): whole-pattern MoE matcher + routed-FFN fused
NVFP4-quant down MMQ
New fork HEAD `1edddc8fe`, tree `097c862c`. The rejected/neutral levers of
the 110-140 campaign are recorded above and in the per-phase bench artifacts.