mirror of https://github.com/mudler/LocalAI.git synced 2026-07-03 04:46:54 -04:00

Ettore Di Giacinto 1aba41082b docs(paged): record phases 112-140 + series trim decision

Record the phase 110-140 GDN/MoE campaign benchmark log and append the
series-trim decision to the parity handoff: keep the Phase135 routed-FFN
fused-quant line plus the MoE test sentinels and the MTP-draft correctness
fix; drop the W4A16 structural line, the trace/tile-policy patches, GPU-sort,
W4A16-direct-A, and the finalize fusion. Rejected/neutral levers are recorded
in the handoff and the per-phase bench artifacts. Fork re-mirrored on
51168c5ee: fd920cf8a a85c1e098 2fed6aacf f1d976f06 1edddc8fe (HEAD tree
097c862c).

Assisted-by: Claude:opus-4.8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-07-02 10:16:53 +00:00

llama.cpp vLLM Parity Benchmark Ledger

This file tracks each parity attempt from Phase70 onward, plus the immediate context needed to interpret the current record. Append every new attempt here with artifact path, gates, benchmark rows, and decision.

Current Status

Goal: reach vLLM speed parity in llama.cpp on GB10.
Current decision model: MoE q36-35b-a3b-nvfp4.
Canonical paged MoE md5: 8cb0ce23777bf55f92f63d0292c756b0.
Canonical dense md5: 5951a5b4d624ce891e22ab5fca9bc439.
Current tested source: DGX mirror /home/mudler/llama-phase93-qwen3next-gqa-bcast, local guardrail stack plus Qwen3Next grouped Q/K broadcast for fused GDN.
Latest attempt: Phase141 GDN decode-only noise-floor repeat.
Latest decision: recurrence-level GDN source A/B must normalize by launch count or control the decode capture window tightly. Phase141 ran five identical current-binary decode-only captures with pre/post gates green. Raw gdn_core_ms had median 1415.500, stdev 30.641, CV 2.146%, and range 1410.300..1482.140 ms, mostly because capture windows recorded 597, 598, 600, or 630 gdn_core launches. Normalized gdn_core_ms_per_launch was much steadier: median 2.359167, stdev 0.005399, CV 0.229%, range 2.352603..2.366917 ms. A future recurrence-level source patch must beat max(2.0%, 3 * same-binary stdev) on repeated A/B medians, using per-launch GDN core when launch counts drift; for Phase141 that means at least 6.49% raw gdn_core reduction or 2.0% launch-normalized reduction. Phase140 still rejects prep-only L2 fusion. The most defensible small source follow-up is a default-off scalar gate/beta hoist inside gated_delta_net_cuda; the vLLM-style packed decode recurrence remains a larger redesign, not a shortcut. Phase137 was rejected with no source changes: GDN_NW=4 GDN_CPW=1 improved isolated 1-token GDN rows but regressed real serving versus Phase135 (208.0/332.7 -> 206.2/324.9 aggregate/decode t/s, gdn_core 5926.55 -> 6466.27 ms). Phase135 remains the current best default-off routed-FFN base without Phase138 finalize, but not parity. Phase135 adds LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1 on top of LLAMA_MOE_ROUTED_FFN_POC=1: it computes silu(gate) * up directly into the NVFP4 MMQ activation layout and launches raw down MMQ, skipping both the sorted F32 buffer and the separate activation-quant kernel. Focused gates and canonical opt-in gates passed; trace proved six mmq_moe_quantized_raw launches and zero mmq_moe_sorted_raw launches. Focused perf was mixed but better at the larger sentinel: default 805.92/1031.06 us, Phase135 807.92/1024.97 us for n=128/257. 
The same opt-in serving profile at the Phase130 shape passed pre/post gates and improved decode aggregate t/s 326.9 -> 332.7, while mmq_nvfp4 dropped 6009.52 -> 5915.24 ms; total kernel time still rose slightly (20.1559 -> 20.2498 s) because GDN and projection buckets moved up. Next work should either make this path default-off-clean enough for broader serving comparisons, or attack the remaining MoE launch/writeback overhead (mmq_fixup, route metadata, and direct weighted combine) rather than another F32 intermediate. Phase134 is kept as a default-off fused-SWIGLU structural base, not as a promoted speedup. Phase134 adds LLAMA_MOE_ROUTED_FFN_FUSED_SWIGLU=1 on top of LLAMA_MOE_ROUTED_FFN_POC=1: it executes gate_up, computes silu(gate) * up directly into expert-sorted F32 rows, then calls the raw MMQ down helper. Selected opt-in gates passed 13/13; trace proved six raw sorted launches; canonical opt-in gates passed MoE/dense md5, GATED_DELTA_NET 48/48, MUL_MAT 1146/1146, and MUL_MAT_ID 806/806. Focused perf was mixed: default 804.92/1026.02 us, Phase134 810.61/1025.68 us for n=128/257. It removes the Phase133 standalone glu -> get_rows boundary and recovers n=257, but the extra fused-SWIGLU kernel is still slower at n=128. Next work should fuse SWIGLU directly into the down-MMQ quant buffer, or otherwise remove one more launch/buffer. Phase133 remains only as a default-off structural base for the next fused routed-FFN slice, not as a speedup. Phase133 adds LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1 on top of LLAMA_MOE_ROUTED_FFN_POC=1: it keeps baseline gate_up and SWIGLU, gathers the computed SWIGLU output into expert-sorted compact F32 rows, and calls a raw MMQ down helper without constructing fake tensors. Default and opt-in canonical gates passed with canonical MoE/dense md5s, GATED_DELTA_NET 48/48, MUL_MAT 1146/1146, and MUL_MAT_ID 806/806; selected default/Phase132/Phase133 gates passed 13/13, and trace proved six mmq_moe_sorted_raw launches. 
Focused perf was not a win: default 807.37/1020.76 us, Phase132 808.21/1018.87 us, Phase133 808.85/1026.87 us for n=128/257. The next phase must fuse SWIGLU-to-sorted or SWIGLU-to-quant to remove the added gather/quant boundary; do not promote sorted-down as-is. Phase132 remains the cleaner default-off scaffold if Phase133 needs to be bypassed. Phase131 challenged the Phase130 fork with two read-only source explorers. Both rejected another cheap source patch: MoE/FFN-GEMM work should not continue unless it funds a real fused routed-FFN kernel/executor, and GDN work should not continue unless it materially changes the f32 recurrent-state traffic without BF16/quality drift. The next active line is therefore a default-off fused routed-FFN PoC scoped from vLLM's real fused MoE design and llama.cpp's current gate_up -> SWIGLU -> down executor hook. Phase131 is a no-source decision/architecture attempt, not a speedup claim. Keep carrying the Phase93 Qwen3Next GQA-repeat removal candidate as a decode-profile positive, but it does not close serving parity. Phase130 refreshed the current-stack graph-node serving profile after the Phase129 rejection. Pre/post gates stayed green and the profile confirms the live serving bottleneck remains split between mmq_nvfp4 (6009.52 ms, 29.82%) and gdn_core (5891.40 ms, 29.23%), with FA only 1.28% and get-rows only 1.39%. This rejects the paged-mask/F16 get-rows idea as the next source patch and keeps the next credible work on either a larger MoE/FFN-GEMM executor/kernel or a larger GDN recurrence redesign. Phase129 tested a default-off Qwen35/Qwen35MoE grouped Q/K broadcast probe for fused GDN, reusing the existing Qwen3Next op-param path. The default path was md5/op clean, but the valid opt-in gate changed the MoE greedy md5 to b773e2f032aa0e992626d486b321808e, so the source was rejected and reverted. Do not port Qwen3Next grouped-broadcast semantics to Qwen35/Qwen35MoE under the current bit-exact rule. 
Phase128 scoped the Qwen3Next BF16 GDN S-cache idea and rejected/reverted the source probe for the current target: the active q36-35b-a3b-nvfp4.gguf model loads as qwen35moe, no true Qwen3Next GGUF was found on DGX, and the existing Qwen35/Qwen35MoE BF16 S-cache lever was already rejected by the Phase82 f16-reference KL gate. Phase127 tested the first whole-MoE expert-major executor using the Phase126 helper; it passed selected correctness and emitted expert-major markers, but was rejected and reverted because focused perf regressed MOE_SWIGLU_DOWN at both n=128 and n=257. Phase126 remains the kept scaffold. Phase104 measured the combined cleanup stack in the normal same-session serving harness against vLLM at N=128. It is md5/op clean and modestly improves paged serving versus Phase97 (agg_tps 329.6 -> 338.6, prefill_tps 1734.5 -> 1813.0, TTFT 7415.4 -> 7121.6 ms), but it is not parity-closing: paged/vLLM is 0.6574 on decode and 0.5122 on aggregate. Phase105 refreshed the current-stack grouped-MMQ evidence: ragged MoE and full MUL_MAT_ID gates still pass, serving launch traces still have fixup=0 and stream_k_blocks == ntiles_dst, and the simple live request landed in density-10 prefill-like shapes (mmq_x_best=112) rather than a new small-M decode opportunity. Phase106 then tested the C1 high-concurrency operating-point hypothesis at N=128/192/256; vLLM completed all legs and stayed ahead, so C1 is rejected for the current GB10 stack. Do not add another MMQ micro-policy patch or scheduler shortcut. Phase107 established the existing fused-MoE correctness guardrails and found that test-backend-ops perf did not emit timing rows for these custom whole-graph cases. Phase108 added the missing measurement-only harness by exposing the existing MoE whole-graph cases to perf mode and expanding CSV output to include timing fields. Use these timings to rank fused routed-MoE work; do not start a fused kernel without improving one of these rows and preserving md5/op gates. 
Phase109 tested the existing default-off W4A16 and FP4 large-M MoE routes, plus the cheapest grouped-MMQ density/tile-policy knobs, on the Phase108 rows. All selected op gates passed, but none of the env-only routes is a useful parity lever: W4A16 and FP4 large-M are much slower at n_tokens=257, while LLAMA_MOE_DENSITY_MAX=9 / LLAMA_MOE_MMQ_X=64 are noise-level on MUL_MAT_ID_RAGGED_MOE and do not help MOE_SWIGLU_DOWN. The next credible implementation target is GPU-side routed-MoE metadata construction for the host-sync fallback/grouped path, taking the vLLM moe_align_block_size / permute-unpermute design as the reference, not importing vLLM wholesale. Phase110 implemented that first default-off CUDA metadata branch behind LLAMA_MOE_GPU_SORT=1, reusing mm_ids_helper and adding a tiny inverse permutation kernel for the fallback get_rows contract. The initial branch failed 3/13 selected opt-in rows because mm_ids_helper's ids_dst is sorted-to-original while fallback get_rows needs original-to-sorted; the inversion fix made default, W4A16, and W4A16+GPU-sort selected gates 13/13, and canonical md5/op gates stayed green. Keep Phase110 as a default-off structural base only: it improves W4A16 fallback 257-token rows by 7-8%, but remains ~1.5x slower than default grouped-MMQ, so it is not a parity win by itself. Phase111 then tried to remove the remaining W4A16 fallback host descriptor construction by building w4a16_tile_desc on GPU from expert_bounds_dev. The first compile needed a pointer mutability fix, then the first runtime attempt hit a CUDA pool LIFO assertion because the outer expert-bounds allocation was freed after an inner later allocation. After fixing that, selected gates passed for the new LLAMA_W4A16_GPU_TILES=1 path, but clean perf was flat-to-negative versus Phase110 (MUL_MAT_ID_RAGGED_MOE n=257 regressed about 2.0%). The Phase111 source was reverted; post-revert W4A16+GPU-sort selected gates passed 13/13. 
Do not carry a GPU tile descriptor path unless it is part of a larger direct-A or graph-safe W4A16 redesign that removes more than one host-sync/launch bottleneck. Phase112 implemented the existing default-off LLAMA_W4A16_DIRECT_A=1 hook for W4A16 grouped MoE, staging bf16 activations directly from original src1 through ids_to_sorted instead of materializing a sorted f32 buffer and then casting it. Selected gates passed for W4A16+GPU-sort, direct-A alone, and direct-A+GPU-sort (13/13 each). The useful arm is direct-A+GPU-sort: MUL_MAT_ID_RAGGED_MOE n=257 improved 2278.50 -> 2166.22 us (+4.93%) and MOE_SWIGLU_DOWN n=257 improved 1551.08 -> 1477.74 us (+4.73%) versus Phase112's W4A16+GPU-sort control, while the 128-token rows were neutral/slightly negative. Canonical README md5 gates are green (8cb0ce23, 5951a5b4) and compact op gates are green on the supported rows. Keep Phase112 default-off as the next structural base; do not make it default-on because W4A16 fallback remains slower than the default grouped-MMQ path. Phase113 tried the combined follow-up: LLAMA_W4A16_DIRECT_A=1 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1. It built W4A16 tile descriptors from GPU expert bounds and launched over a zero-initialized max_tiles grid to avoid even the one-int tile-count readback. Selected correctness stayed green (13/13), but perf did not meet the keep threshold: MOE_SWIGLU_DOWN n=257 was effectively flat (1478.16 -> 1476.36 us) and MUL_MAT_ID_RAGGED_MOE n=257 regressed (2148.44 -> 2214.23 us). The Phase113 source was reverted; post-revert Phase112 direct-A+GPU-sort selected gates passed 13/13. Phase114 then implemented the vLLM-style padded routing contract behind LLAMA_W4A16_PADDED_META=1: separate padded source ids, padded destination ids, expert ids per M block, a padded W4A16 expert-id consumer mode, and a direct scatter that skipped the old compact get_rows_cuda restore. It was correctness-clean (13/13) but failed the performance gate. 
Initial artifact: /home/mudler/bench/phase114_w4a16_padded_routing/20260701_234634_padded_meta; fix1 artifact: /home/mudler/bench/phase114_w4a16_padded_routing/20260701_235003_padded_meta_fix1. Fix1 added num_tokens_post_pad early returns for padded gather/scatter, but 257-token rows still regressed (MOE_SWIGLU_DOWN 1477.88 -> 1726.27 us, MUL_MAT_ID_RAGGED_MOE 2163.35 -> 2650.93 us). The source was reverted and post-revert Phase112 direct-A+GPU-sort selected gates passed 13/13. Phase115 then re-tested the existing default-off MoE small-M MMQ tile knob on the current Phase108 whole-graph sentinels rather than adding another patch. Artifact: /home/mudler/bench/phase115_moe_small_m_sentinel/20260702_020258. Control and LLAMA_MOE_SMALL_M_TILE=16/32/64 all passed the selected MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE correctness gate (13/13 each), but none met the promotion rule. The best 128-token rows were tiny/noise-level wins, while every capped env regressed the 257-token ragged row (1452.30 us control vs 1455.02, 1458.71, 1456.88 us). Reject small-M row shaping as a parity lever; the next phase should scope a true fused routed-MoE kernel or a graph-level fusion target that removes materialized activation/output traffic. Phase116 implemented that graph-level probe as a default-off CUDA-only detector for the plain GLU -> down MUL_MAT_ID pattern: LLAMA_MOE_SWIGLU_DOWN_FUSED_QUANT=1. The candidate computed silu(gate) * up directly into the existing grouped-MMQ NVFP4 activation buffer, leaving the MMQ kernel and graph API unchanged. Artifact: /home/mudler/bench/phase116_moe_swiglu_down_fused_quant/20260702_022611. Correctness passed (13/13) and the fix1 route emitted the fused trace marker (6 hits), but perf failed the promotion gate: MOE_SWIGLU_DOWN n=257 was flat (1024.90 -> 1024.69 us), n=128 regressed (806.33 -> 808.79 us), and the non-fused ragged sentinel drifted slower. Source was reverted and the post-revert selected gate passed 13/13. 
Do not retry a standalone fused SwiGLU-to-MMQ-activation-quant path; the next fused-MoE attempt must remove a larger boundary than one activation materialization. Phase117 added default-off boundary tracing/timing around the route-sort, activation quantization, grouped-MMQ launch, GLU, and whole-graph pattern detector. Artifact: /home/mudler/bench/phase117_moe_route_once_boundary/20260702_024140. The first timing run proved inline CUDA events are incompatible with CUDA graph capture (cudaEventSynchronize on a capturing stream), so the trace was guarded to emit us=-1 during capture and real timings only with GGML_CUDA_DISABLE_GRAPHS=1. Post-guard selected gates passed (13/13), trace mode passed (7/7), and canonical gates passed: MoE md5 8cb0ce23, dense md5 5951a5b4, MUL_MAT 1146/1146, MUL_MAT_ID 806/806. No new runtime optimization is promoted from Phase117. The timing attribution rejects another small route-sort or standalone GLU/quant shortcut; the next funded MoE source phase needs a larger pipeline boundary: shared route metadata across gate_up/down and/or an executor that owns GEMM1->activation->GEMM2 rather than another local micro-fusion. Phase118 tested a default-off route metadata cache/reuse prototype. Artifact: /home/mudler/bench/phase118_moe_route_cache/20260702_030549. The first preflight command falsely detected local-ai-worker because the check matched its own shell text; the corrected pgrep -x local-ai-worker preflight was clean. The cache candidate (LLAMA_MOE_ROUTE_CACHE=1) was correctness-clean and did hit (23 hits, 3 misses on the trace row), but did not meet the keep rule: MOE_SWIGLU_DOWN n=257 improved only 1017.711 -> 1011.915 us (+0.57%) and n=128 regressed 799.360 -> 803.738 us (-0.55%). Runtime cache source was reverted; the post-reject selected gate passed 13/13. Keep only the local ids metadata helper refactor if final checks remain clean. 
This closes route-cache as a standalone parity lever; next MoE work needs a larger executor boundary than skipping one metadata build. Phase119 added a default-off whole-pattern contract trace for gate_up MUL_MAT_ID -> views -> SWIGLU -> down MUL_MAT_ID. Initial artifact: /home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_034729; fix1 artifact: /home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_035126_fix1. The initial trace proved coverage but exceeded the trace-overhead rule on MOE_SWIGLU_DOWN n=257 (1015.070 -> 1028.937 us, -1.35%). Fix1 moved detector work fully off the default path unless a trace env is enabled. It is correctness-clean (13/13 selected, 7/7 trace), canonical md5/op clean (MoE 8cb0ce23, dense 5951a5b4, MUL_MAT 1146/1146, MUL_MAT_ID 806/806), and trace overhead is within rule: MOE_SWIGLU_DOWN n=128 805.400 -> 805.584 us (-0.02%) and n=257 1019.715 -> 1021.836 us (-0.21%). Keep Phase119 as default-off diagnostic/contract scaffolding only. The next source phase is allowed to implement a guarded executor, but the executor must match at the earlier gate_up MUL_MAT_ID node so it can own GEMM1->activation->GEMM2 and skip the remaining nodes; the current GLU hook is validation-only because GEMM1 has already executed. Phase120 added that earlier default-off matcher/trace at the gate_up MUL_MAT_ID node. Initial artifact: /home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040153; fix2 artifact: /home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040725_fix2. The initial/fix1 traces proved skip_ready=4 but emitted noisy unsupported candidates from unrelated MUL_MAT_ID rows; fix2 gates output on the actual gate/up view pair only. 
Fix2 is correctness-clean (13/13 selected, 7/7 early trace), canonical md5/op clean (MoE 8cb0ce23, dense 5951a5b4, MUL_MAT 1146/1146, MUL_MAT_ID 806/806), and early trace overhead stays within rule: MOE_SWIGLU_DOWN n=128 803.937 -> 808.978 us (-0.62%) and n=257 1020.412 -> 1026.073 us (-0.55%). Keep Phase120 as the executor entry-point scaffold. The next source phase should add a default-off executor that starts from this early matcher, first proving safe ownership/skip accounting, then moving route-plan reuse and fused activation into that helper. Phase121 added that default-off executor proof behind LLAMA_MOE_WHOLE_PATTERN_EXEC=1. Initial artifact: /home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041543; fix1 artifact: /home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041739_fix1. The initial run passed gates but emitted zero exec markers because the exec path was incorrectly nested under the early-trace env. Fix1 made exec detection depend on either exec or trace env. It is correctness-clean (13/13 selected, 7/7 exec), canonical md5/op clean (MoE 8cb0ce23, dense 5951a5b4, MUL_MAT 1146/1146, MUL_MAT_ID 806/806), and emits skip=4 markers for the six supported MoE rows. Perf is neutral for the target sentinel: MOE_SWIGLU_DOWN n=128 807.772 -> 806.051 us (+0.21%) and n=257 1021.115 -> 1020.839 us (+0.03%). Keep Phase121 as the executor ownership/skip-accounting proof only. The next real optimization phase should replace one internal boundary inside this helper, starting with route-plan reuse or activation-in-route-order, while preserving this md5/op contract. Phase122 tested route-plan reuse inside the Phase121 executor by exposing ggml_cuda_mmq_ids_meta and passing one built route to both gate_up and down MMQ calls behind LLAMA_MOE_WHOLE_PATTERN_SHARED_ROUTE=1. Artifact: /home/mudler/bench/phase122_moe_shared_route_meta/20260702_043212. 
Correctness was clean (13/13 selected, 7/7 shared-route), but the target MOE_SWIGLU_DOWN n=257 row regressed versus the Phase121 executor (1020.850 -> 1051.666 us, -3.02%) and n=128 also missed the keep threshold (808.190 -> 811.836 us, -0.45%). The source was reverted, including the public MMQ metadata API. Post-reject gates on the reverted tree passed (13/13 selected, 7/7 executor) with six retained Phase121 exec markers. Do not retry route-only metadata reuse; the next MoE executor phase should attack activation/down data layout, direct activation-to-down input, or a larger fused GEMM1->activation->GEMM2 boundary. Phase123 tested that direct activation-to-down input boundary inside the Phase121 executor. Artifact: /home/mudler/bench/phase123_moe_executor_fused_down_input/20260702_025811. The candidate added an NVFP4-only fused silu(gate) * up -> down MMQ activation buffer path behind LLAMA_MOE_WHOLE_PATTERN_FUSED_DOWN=1. Correctness passed (13/13 selected, 7/7 fused-down, six fused markers), but perf was flat and missed the keep rule: versus Phase121 exec, MOE_SWIGLU_DOWN n=128 was 811.153 -> 810.618 us (+0.07%) and n=257 was 1023.090 -> 1023.657 us (-0.06%). Source was reverted; post-reject selected and Phase121 exec gates passed (13/13, 7/7, six exec markers). Do not retry standalone fused-down quantization. The next MoE source attempt must either own the full expert-major packed pipeline GEMM1->activation->GEMM2 or pivot to another measured bottleneck. Phase124 refreshed the current-stack graph-node serving profile after the Phase122/123 rejections. Artifact: /home/mudler/bench/phase124_current_moe_profile/20260702_031205. Pre/post gates were green (MoE md5 8cb0ce23, dense md5 5951a5b4, MUL_MAT 1146/1146, MUL_MAT_ID 806/806). Serving under graph-node profiling at N=128, prompt 128, generation 64 was agg_tps 206.2, decode_agg_tps 320.3, prefill_tps 1536.4, wall 39.738s. 
The fine buckets explain the Phase122/123 failures: mmq_nvfp4 is now the largest fine bucket (6074.78 ms, 30.17%) and gdn_core remains essentially tied (5888.31 ms, 29.25%), while act_quant is only 674.88 ms (3.35%). Next work should target either a full expert-major MoE pipeline that materially reduces mmq_nvfp4 or a GDN source experiment that materially reduces gdn_core; one-boundary activation/route shortcuts are no longer funded. Phase125 scoping used two independent code explorers plus a local GDN audit. The challenged conclusion is that another GDN micro-patch is not funded: prior geometry/store/broadcast and conv-state attempts already exhausted the small safe space, while a useful GDN change would be a larger recurrence redesign. The next source attempt should therefore test the first maintainable slice of a vLLM-style expert-major MoE pipeline: a default-off MMQ sorted-output primitive that still uses expert bounds but writes sorted rows, then immediately unsorts as a proof. Only if that primitive is correctness clean and materially improves MOE_SWIGLU_DOWN should the following phase proceed to a full gate_up -> SWIGLU -> down expert-major executor.
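The Phase110 ids_dst fix noted above (sorted-to-original versus original-to-sorted) amounts to inverting a permutation. A minimal host-side sketch in plain Python, with hypothetical names; the actual fix is a small CUDA kernel and nothing here reflects its real signature:

```python
def invert_permutation(sorted_to_original):
    """Given p with p[sorted_pos] = original_row, return q with q[original_row] = sorted_pos."""
    original_to_sorted = [0] * len(sorted_to_original)
    for sorted_pos, original_row in enumerate(sorted_to_original):
        original_to_sorted[original_row] = sorted_pos
    return original_to_sorted


# Illustrative 4-row routing: sorted position 0 holds original row 2, and so on.
sorted_to_original = [2, 0, 3, 1]
original_to_sorted = invert_permutation(sorted_to_original)

# A get_rows-style restore indexes the sorted buffer by original row.
sorted_rows = ["row2", "row0", "row3", "row1"]
restored = [sorted_rows[original_to_sorted[i]] for i in range(len(sorted_rows))]
```

Feeding the un-inverted map to the restore step would gather rows in the wrong order, which is consistent with the 3/13 failures recorded before the inversion fix.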

Phase141: GDN Decode-Only Noise Floor

Date: 2026-07-02.
Spec: docs/superpowers/specs/2026-07-02-gdn-decode-noise-floor-phase141-design.md.
Plan: docs/superpowers/plans/2026-07-02-gdn-decode-noise-floor-phase141.md.
Result type: measurement-only; no llama.cpp source changes.
Artifact: /home/mudler/bench/phase141_gdn_decode_noise_floor/20260702_090428.
Summary files:
- /home/mudler/bench/phase141_gdn_decode_noise_floor/20260702_090428/summary.tsv
- /home/mudler/bench/phase141_gdn_decode_noise_floor/20260702_090428/runs.tsv

Setup:

Current patched Phase93 binary: /home/mudler/llama-phase93-qwen3next-gqa-bcast/build/bin.
Env: LLAMA_MOE_ROUTED_FFN_POC=1, LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1, LLAMA_MOE_ROUTED_FFN_FINALIZE_POC=1.
Harness: /home/mudler/bench/phase77_moe_decode_only_profile.sh.
Shape: N=128 N_PREDICT=2048 DEPTH_TARGET=64 CAPTURE_SECONDS=4 CTX=131072 PARALLEL=128 BATCH=2048 UBATCH=512.

Gates:

All five runs passed pre/post canonical gates: MoE md5 8cb0ce23777bf55f92f63d0292c756b0, dense md5 5951a5b4d624ce891e22ab5fca9bc439, MUL_MAT 1146/1146, and MUL_MAT_ID 806/806.

Run summary:

run	total kernel s	GDN ms	GDN launches	`gdn_core` ms	`gdn_core` launches	`gdn_core` ms/launch	`mmq_nvfp4` ms	`mmq_nvfp4` launches
1	`3.553400`	`1500.210000`	`3000`	`1420.150000`	`600`	`2.366917`	`1315.460000`	`4816`
2	`3.708300`	`1492.230000`	`2994`	`1410.300000`	`598`	`2.358361`	`1470.550000`	`4801`
3	`3.678100`	`1566.780000`	`3150`	`1482.140000`	`630`	`2.352603`	`1336.250000`	`5061`
4	`3.698400`	`1495.970000`	`3000`	`1415.500000`	`600`	`2.359167`	`1458.510000`	`4820`
5	`3.620900`	`1490.630000`	`2985`	`1410.870000`	`597`	`2.363266`	`1389.990000`	`4784`

Variance summary:

metric	median	mean	stdev	CV	min	max
`total_kernel_s`	`3.678100`	`3.651820`	`0.064600`	`1.769%`	`3.553400`	`3.708300`
`gdn_ms`	`1495.970000`	`1509.164000`	`32.419626`	`2.148%`	`1490.630000`	`1566.780000`
`gdn_core_ms`	`1415.500000`	`1427.792000`	`30.641160`	`2.146%`	`1410.300000`	`1482.140000`
`mmq_nvfp4_ms`	`1389.990000`	`1394.152000`	`69.894566`	`5.013%`	`1315.460000`	`1470.550000`
`gdn_core_ms_per_launch`	`2.359167`	`2.360063`	`0.005399`	`0.229%`	`2.352603`	`2.366917`
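
The gdn_core rows above can be re-derived from the run table. A minimal sketch, assuming the recorded stdev is the sample standard deviation (n - 1) and CV is stdev divided by mean, which matches the recorded figures; the names are illustrative and not part of the harness:

```python
import statistics

# (gdn_core ms, gdn_core launches) for runs 1..5, copied from the run summary.
runs = [
    (1420.15, 600),
    (1410.30, 598),
    (1482.14, 630),
    (1415.50, 600),
    (1410.87, 597),
]


def summarize(values):
    """Return median, mean, sample stdev, and CV (stdev / mean, in percent)."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample stdev (n - 1); assumed, matches the table
    return {
        "median": statistics.median(values),
        "mean": mean,
        "stdev": stdev,
        "cv_pct": 100.0 * stdev / mean,
    }


raw = summarize([ms for ms, _ in runs])
per_launch = summarize([ms / launches for ms, launches in runs])

print(f"gdn_core_ms            CV {raw['cv_pct']:.3f}%")
print(f"gdn_core_ms_per_launch CV {per_launch['cv_pct']:.3f}%")
```

Dividing by the launch count removes most of the spread because run 3 captured 630 launches against 597..600 in the other runs.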

Decision:

Raw decode-only gdn_core is not a reliable keep/reject metric by itself unless capture launch counts are fixed; run 3 recorded 630 core launches while the other runs recorded 597..600.
For future GDN source A/B, require repeated medians and either:
- raw gdn_core reduction above max(2.0%, 3 * 30.641160 / 1415.500000) = 6.49%, or
- launch-normalized gdn_core_ms_per_launch reduction above 2.0% (3 * 0.005399 / 2.359167 = 0.69%, so the explicit floor dominates).
This supports a very small default-off scalar gate/beta hoist probe if it can be kept bit-exact and measured per launch. It does not support large packed decode recurrence source work yet; that should wait for a broader spec.
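
The keep rule above can be written as a small check. This is a sketch under stated assumptions, not harness code: the function names are hypothetical, the threshold is taken relative to the baseline median as in the arithmetic above, and the candidate series is invented purely for illustration.

```python
import statistics


def keep_threshold_pct(baseline, floor_pct=2.0, sigmas=3.0):
    """max(floor, sigmas * same-binary stdev), as a percentage of the baseline median."""
    noise_pct = 100.0 * sigmas * statistics.stdev(baseline) / statistics.median(baseline)
    return max(floor_pct, noise_pct)


def reduction_pct(baseline, candidate):
    """Percent reduction of the candidate median relative to the baseline median."""
    base = statistics.median(baseline)
    return 100.0 * (base - statistics.median(candidate)) / base


def passes_keep_rule(baseline, candidate):
    return reduction_pct(baseline, candidate) > keep_threshold_pct(baseline)


# Phase141 same-binary repeats (gdn_core ms and launch counts).
raw_ms = [1420.15, 1410.30, 1482.14, 1415.50, 1410.87]
launches = [600, 598, 630, 600, 597]
per_launch_ms = [ms / n for ms, n in zip(raw_ms, launches)]

raw_threshold = keep_threshold_pct(raw_ms)                # noise term dominates
per_launch_threshold = keep_threshold_pct(per_launch_ms)  # explicit 2.0% floor dominates

# Hypothetical candidate that is 3% faster per launch (not a measured result).
candidate_per_launch = [v * 0.97 for v in per_launch_ms]
candidate_raw = [v * 0.97 for v in raw_ms]
```

With these inputs the invented 3% candidate would clear the launch-normalized rule but not the raw one, which is why the per-launch metric is preferred when launch counts drift.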

Phase140: GDN Decode Prep Trace

Date: 2026-07-02.
Spec: docs/superpowers/specs/2026-07-02-gdn-decode-prep-trace-phase140-design.md.
Plan: docs/superpowers/plans/2026-07-02-gdn-decode-prep-trace-phase140.md.
Result type: measurement-only; no llama.cpp source changes.
Artifact: /home/mudler/bench/phase140_gdn_decode_prep_trace/20260702_085348.
Summary file: /home/mudler/bench/phase140_gdn_decode_prep_trace/20260702_085348/gdn_prep_kernel_summary.tsv.

Setup:

Current patched Phase93 binary: /home/mudler/llama-phase93-qwen3next-gqa-bcast/build/bin.
Env: LLAMA_MOE_ROUTED_FFN_POC=1, LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1, LLAMA_MOE_ROUTED_FFN_FINALIZE_POC=1, plus route/layout trace envs.
Shape: N=128 PTOK=128 GEN=64 CTX=131072 PARALLEL=128 BATCH=2048 UBATCH=512.

Gates:

gate	MoE md5	dense md5	`MUL_MAT`	`MUL_MAT_ID`
pre	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`
post	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`

Serving/profile result:

metric	value
`agg_tps`	`207.3`
`decode_agg_tps`	`328.9`
`decode_perseq_tps`	`2.11`
`prefill_tps`	`1490.6`
`ttft_mean_ms`	`8325.9`
`ttft_max_ms`	`14593.3`
`wall_s`	`39.501`
total kernel time	`20.2002 s`

Key buckets:

bucket	ms
`GDN`	`6673.66`
`gdn_core`	`5890.44`
`MoE/FFN-GEMM`	`6144.19`
`mmq_nvfp4`	`5918.31`
`gdn_conv`	`454.99`
`gdn_gather`	`227.92`
`gdn_l2norm`	`100.30`
`gdn_sigmoid`	`22.68`

Focused kernel summary:

kernel	count	ms	avg us
`gated_delta_net_cuda`	`4650`	`5804.7074`	`1248.3242`
`k_bin_bcast`	`89426`	`1155.3901`	`12.9201`
`convert_unary`	`52060`	`659.7529`	`12.6729`
`concat_non_cont`	`2130`	`441.9353`	`207.4814`
`ssm_conv_update_ids_f32`	`2610`	`227.8964`	`87.3166`
`mul_mat_f`	`3670`	`227.7857`	`62.0669`
`ssm_conv_long_token_f32`	`1110`	`190.6664`	`171.7715`
`unary_gated_op_kernel`	`14340`	`184.3254`	`12.8539`
`rms_norm_gate_mul_f32`	`4740`	`170.0508`	`35.8757`
`rms_norm_f32`	`9798`	`114.3863`	`11.6745`
`rms_norm_pre_add_mul_f32`	`6160`	`108.2927`	`17.5800`
`cpy_scalar`	`5130`	`106.8951`	`20.8373`
`l2_norm_f32`	`9480`	`100.3024`	`10.5804`
`gated_delta_net_chunked_cuda`	`90`	`85.7367`	`952.6300`

Decision:

Reject an immediate in-GDN Q/K L2-normalization source patch for this shape.
l2_norm_f32 (100.30 ms) is above the absolute Phase139 noise floor (3 * 17.8110 ms = 53.433 ms) but only about 1.7% of gdn_core (5890.44 ms), below the phase's 3% materiality rule.
Do not spend another phase on prep-only GDN micro-fusion unless a future profile shows prep kernels above the materiality gate.
Next GDN work should be recurrence-level, packed-state, or datacenter Blackwell-specific, and still default-off with md5/op gates.
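
The two-part test behind this rejection (absolute noise floor, then share of gdn_core) can be sketched as follows. The function name and return shape are hypothetical; the inputs are the Phase140 and Phase139 figures recorded in this ledger.

```python
def materiality_check(kernel_ms, core_ms, stdev_ms, materiality_pct=3.0, sigmas=3.0):
    """Return (above_noise, share_pct, material) for one prep kernel.

    above_noise: kernel time exceeds sigmas * same-binary stdev of the core bucket
    share_pct:   kernel time as a percentage of the core bucket
    material:    both the noise test and the materiality share pass
    """
    above_noise = kernel_ms > sigmas * stdev_ms
    share_pct = 100.0 * kernel_ms / core_ms
    return above_noise, share_pct, above_noise and share_pct >= materiality_pct


above_noise, share_pct, material = materiality_check(
    kernel_ms=100.3024,   # Phase140 l2_norm_f32 total
    core_ms=5890.44,      # Phase140 gdn_core bucket
    stdev_ms=17.8110,     # Phase139 gdn_core same-binary stdev
)
```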

Phase139: Serving Noise-Floor Repeat

Date: 2026-07-02.
Spec: docs/superpowers/specs/2026-07-02-serving-noise-floor-phase139-design.md.
Plan: docs/superpowers/plans/2026-07-02-serving-noise-floor-phase139.md.
Result type: measurement-only; no llama.cpp source changes.
Artifact: /home/mudler/bench/phase139_serving_noise_floor/20260702_081901.
Summary files:
- /home/mudler/bench/phase139_serving_noise_floor/20260702_081901/summary.tsv
- /home/mudler/bench/phase139_serving_noise_floor/20260702_081901/runs.tsv

Setup:

Current patched Phase93 binary: /home/mudler/llama-phase93-qwen3next-gqa-bcast/build/bin.
Env: LLAMA_MOE_ROUTED_FFN_POC=1, LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1, LLAMA_MOE_ROUTED_FFN_FINALIZE_POC=1.
Shape: N=128 PTOK=128 GEN=64 CTX=131072 PARALLEL=128 BATCH=2048 UBATCH=512.
Harness: /home/mudler/bench/phase76_current_moe_profile.sh.

Gates:

All seven runs passed pre/post canonical gates: MoE md5 8cb0ce23777bf55f92f63d0292c756b0, dense md5 5951a5b4d624ce891e22ab5fca9bc439, MUL_MAT 1146/1146, and MUL_MAT_ID 806/806.

Run summary:

run	agg t/s	decode agg t/s	wall s	kernel s	MoE ms	mmq_nvfp4 ms	gdn_core ms	mmq_fixup ms	ew_add ms
1	`212.3`	`333.6`	`38.586`	`19.5196`	`5642.07`	`5464.17`	`5877.57`	`104.64`	`371.81`
2	`208.6`	`330.1`	`39.272`	`19.8779`	`5927.18`	`5719.41`	`5886.67`	`104.49`	`353.07`
3	`206.8`	`327.2`	`39.606`	`20.0228`	`5983.97`	`5756.85`	`5906.11`	`105.76`	`369.31`
4	`208.5`	`331.4`	`39.284`	`19.8543`	`5921.30`	`5702.74`	`5911.82`	`104.31`	`371.32`
5	`208.8`	`335.6`	`39.240`	`20.0571`	`5950.46`	`5720.96`	`5913.65`	`104.53`	`371.59`
6	`203.4`	`319.7`	`40.277`	`20.3933`	`6285.32`	`6049.05`	`5914.11`	`104.98`	`379.23`
7	`205.7`	`320.4`	`39.818`	`20.1422`	`6173.88`	`5978.03`	`5929.75`	`106.28`	`355.59`

Variance summary:

metric	median	mean	stdev	CV	min	max
`agg_tps`	`208.5000`	`207.7286`	`2.8022`	`1.349%`	`203.4000`	`212.3000`
`decode_agg_tps`	`330.1000`	`328.2857`	`6.2157`	`1.893%`	`319.7000`	`335.6000`
`wall_s`	`39.2840`	`39.4404`	`0.5312`	`1.347%`	`38.5860`	`40.2770`
`kernel_s`	`20.0228`	`19.9810`	`0.2717`	`1.360%`	`19.5196`	`20.3933`
`moe_ms`	`5950.4600`	`5983.4543`	`204.9581`	`3.425%`	`5642.0700`	`6285.3200`
`mmq_nvfp4_ms`	`5720.9600`	`5770.1729`	`193.3642`	`3.351%`	`5464.1700`	`6049.0500`
`gdn_ms`	`6695.0800`	`6690.3629`	`17.4585`	`0.261%`	`6656.7100`	`6705.9100`
`gdn_core_ms`	`5911.8200`	`5905.6686`	`17.8110`	`0.302%`	`5877.5700`	`5929.7500`
`mmq_fixup_ms`	`104.6400`	`104.9986`	`0.7420`	`0.707%`	`104.3100`	`106.2800`
`ew_add_ms`	`371.3200`	`367.4171`	`9.4938`	`2.584%`	`353.0700`	`379.2300`

Decision:

Phase138 remains md5/op clean and focused-positive, but its one-off serving gain (+0.63% aggregate, +0.24% decode) is inside same-binary noise.
Do not use Phase138's single serving run as evidence to stack another finalize/MMQ micro-patch.
Future serving claims need repeated A/B medians and must exceed max(2.0%, 3 * same-binary stdev) on aggregate throughput. With this Phase139 stdev (2.8022 t/s), 3 * stdev is about 8.41 t/s, roughly 4.0% of the 208.5 t/s median, well above the Phase138 one-off delta of 1.3 t/s (+0.63%).
Bucket attribution also needs repeated evidence: the same binary had mmq_nvfp4 CV 3.351%, so a small MMQ movement is not enough. GDN was much steadier (gdn_core CV 0.302%), making a measured GDN-side source attempt the more defensible next phase.
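
The noise-floor rule in this decision can be checked numerically. The following is a minimal Python sketch, not part of the bench harness: the function name `noise_floor_threshold` and the choice to express 3 * stdev as a percentage of the median are assumptions made for illustration. It uses the seven aggregate t/s values from the run summary.

```python
import statistics

# Aggregate t/s for the seven same-binary Phase139 runs (from the run summary).
agg_tps = [212.3, 208.6, 206.8, 208.5, 208.8, 203.4, 205.7]

def noise_floor_threshold(samples, floor_pct=2.0, k=3.0):
    """Return the required improvement in percent: max(floor_pct, k * stdev).

    ASSUMPTION: stdev is the sample standard deviation, converted to a
    percentage of the median so that both terms share the same unit.
    """
    median = statistics.median(samples)
    stdev = statistics.stdev(samples)
    return max(floor_pct, 100.0 * k * stdev / median)

median = statistics.median(agg_tps)
stdev = statistics.stdev(agg_tps)
cv_pct = 100.0 * stdev / statistics.mean(agg_tps)
threshold_pct = noise_floor_threshold(agg_tps)

# Phase138 one-off serving delta versus Phase135 (209.3 vs 208.0 aggregate t/s).
phase138_delta_pct = 100.0 * (209.3 - 208.0) / 208.0

print(f"median={median:.1f} stdev={stdev:.4f} CV={cv_pct:.3f}%")
print(f"required improvement: {threshold_pct:.2f}%")
print(f"Phase138 one-off delta: {phase138_delta_pct:.2f}%")
```

Under these assumptions the 3 * stdev term, not the 2.0% floor, sets the bar for aggregate throughput.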

Phase138 Attempt 2: Down-MMQ Finalize Writeback

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-moe-down-mmq-finalize-phase138.md.
Result type: kept source candidate, default-off; narrow serving-positive result, not parity and not default-on.
Focused artifact: /home/mudler/bench/phase138_moe_down_mmq_finalize/20260702_095927_focused.
Canonical gate artifact: /home/mudler/bench/phase138_moe_down_mmq_finalize/20260702_100202_canonical.
Serving/profile artifact: /home/mudler/bench/phase138_moe_down_mmq_finalize_serving/20260702_100330.
Source files changed:
- ggml/src/ggml-cuda/ggml-cuda.cu
- ggml/src/ggml-cuda/mmq.cu
- ggml/src/ggml-cuda/mmq.cuh
- ggml/src/ggml-cuda/moe-ffn.cu
- ggml/src/ggml-cuda/moe-ffn.cuh
- tests/test-backend-ops.cpp

Implementation:

Added default-off LLAMA_MOE_ROUTED_FFN_FINALIZE_POC=1, requiring both LLAMA_MOE_ROUTED_FFN_POC=1 and LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1.
Added a finalize helper that zeroes the final output, sends router weights and the final output pointer into the grouped down-MMQ path, and skips the strict weighted tail only after the helper is selected.
Added optional finalize metadata to MMQ and stream-k/fixup writeback. The finalize branch uses the routed destination id to derive (token, slot) and atomically accumulates sum * weight into the final token row.
All existing non-finalize MMQ call sites leave the finalize metadata disabled by default.
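
The finalize writeback described here can be pictured with a small CPU reference. This is a sketch only, not the CUDA code: the mapping `dst = token * n_slots + slot` is an assumption about how the routed destination id encodes (token, slot), and a sequential loop stands in for the atomic accumulation.

```python
def finalize_writeback(down_rows, dst_ids, weights, n_tokens, n_slots):
    """Accumulate weight * down_row into the final per-token rows.

    down_rows: one row (list of floats) per routed (token, slot) pair, in
               whatever order the grouped down-MMQ produced them.
    dst_ids:   routed destination id per row. ASSUMPTION for this sketch:
               dst = token * n_slots + slot.
    weights:   weights[token][slot], the router weights.
    """
    n_embd = len(down_rows[0])
    # The helper zeroes the final output before any accumulation happens.
    final = [[0.0] * n_embd for _ in range(n_tokens)]
    for row, dst in zip(down_rows, dst_ids):
        token, slot = divmod(dst, n_slots)
        w = weights[token][slot]
        for j, value in enumerate(row):
            # On the GPU this would be an atomic add; here it is sequential.
            final[token][j] += value * w
    return final

# Two tokens, two routed slots each, rows arriving in expert-sorted order.
down_rows = [[1.0, 2.0], [10.0, 20.0], [3.0, 4.0], [30.0, 40.0]]
dst_ids = [0, 2, 3, 1]  # (t0,s0), (t1,s0), (t1,s1), (t0,s1)
weights = [[0.5, 0.25], [1.0, 2.0]]
out = finalize_writeback(down_rows, dst_ids, weights, n_tokens=2, n_slots=2)
print(out)
```

Floating-point addition is not associative, so accumulation order can matter in general; the md5/op gates in this attempt are what establish that the real path stays clean.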

Focused gates and trace:

route	result
`MOE_SWIGLU_FINALIZE` default	`7/7`
`MOE_SWIGLU_FINALIZE` Phase135 opt-in	`7/7`
`MOE_SWIGLU_FINALIZE` Phase138 finalize opt-in	`7/7`
Phase138 exec trace	`6` records, `FINALIZE_EXEC skip=20 tail_nodes=16`

Canonical gates on patched Phase93 binary:

route	MoE md5	dense md5	`MUL_MAT`	`MUL_MAT_ID`
Phase138 via `EXTRA_ENV`	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`

Focused perf:

row	default	Phase135	Phase138 finalize
`MOE_SWIGLU_FINALIZE nvfp4 n_tokens=128`	`198.021937 us`	`197.301518 us`	`187.134493 us`
`MOE_SWIGLU_FINALIZE nvfp4 n_tokens=257`	`429.235219 us`	`428.697087 us`	`384.673195 us`

Serving comparison:

metric	Phase135 opt-in	Phase138 finalize opt-in
aggregate t/s	`208.0`	`209.3`
decode aggregate t/s	`332.7`	`333.5`
decode per-seq t/s	`2.12`	`2.13`
prefill t/s	`1475.1`	`1492.8`
TTFT mean	`8468.1 ms`	`8382.5 ms`
wall	`39.375 s`	`39.144 s`
total kernel time	`20.2498 s`	`20.0489 s`

Serving buckets:

bucket	Phase135 opt-in	Phase138 finalize opt-in
`gdn_core`	`5926.55 ms`	`5914.04 ms`
`mmq_nvfp4`	`5915.24 ms`	`5802.87 ms`
`ew_mul`	`727.04 ms`	`723.65 ms`
`act_quant`	`677.59 ms`	`678.17 ms`
`get_rows`	`283.62 ms`	`283.80 ms`
`mmq_fixup`	`104.81 ms`	`106.06 ms`
`ew_add`	not listed in Phase135 top rows	`374.09 ms`

Serving pre/post gates:

phase	MoE md5	dense md5	`MUL_MAT`	`MUL_MAT_ID`
pre	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`
post	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`

Decision:

Keep Phase138 default-off. It passes md5/op gates and beats Phase135 on the configured keep thresholds: aggregate/decode throughput, total kernel time, and mmq_nvfp4.
Do not promote or enable by default. The serving delta is small and the weighted fan-in still appears as ew_add 374.09 ms, so this is not a complete tail removal and not parity.
Next work should either reduce the remaining fan-in/writeback path more deeply, or pivot back to the two dominant buckets: gdn_core and mmq_nvfp4.

Phase138 Attempt 1: MoE Finalize Trace And Full-Tail Sentinel

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-moe-down-mmq-finalize-phase138.md.
Result type: kept trace/test scaffold, default-off; no runtime speedup claim.
Trace-only MOE_SWIGLU_DOWN artifact: /home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_092943.
Traced canonical gate artifact using the old default gate binary, superseded: /home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093003_gate.
Traced canonical gate artifact using patched Phase93 binary: /home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093141_gate_phase93.
Traced early-pattern gate artifact using patched Phase93 binary: /home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093243_gate_phase93_early.
Full-tail sentinel artifact: /home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093617_full_tail.
Canonical gate artifact: /home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093731_canonical.
Source files changed:
- ggml/src/ggml-cuda/ggml-cuda.cu
- tests/test-backend-ops.cpp

Implementation:

Added default-off LLAMA_MOE_ROUTED_FFN_FINALIZE_TRACE.
Added a trace-only strict tail scanner for down -> MUL(weights) -> VIEW/ADD rank reduction.
Added MOE_SWIGLU_FINALIZE, a whole-graph backend-op sentinel that composes the existing gate_up -> SWIGLU -> down graph with the existing router-weighted rank-add tail.
No production finalize/writeback kernel was added in this attempt.
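
A strict tail scan of this kind can be sketched in a few lines. The Python below is an illustration under assumptions, not the scanner added to ggml-cuda.cu: it treats the graph as a flat list of op-name strings, and the function name `scan_weighted_tail` and the `supported` rule (n views reduced by n - 1 adds) are hypothetical.

```python
def scan_weighted_tail(ops, start):
    """Scan ops[start:] for MUL followed by a run of VIEW/ADD nodes.

    Returns counts plus supported=True only when the run looks like a rank
    reduction: n views combined by n - 1 adds.
    ASSUMPTION: ops is a flat list of op-name strings in graph order.
    """
    if start >= len(ops) or ops[start] != "MUL":
        return {"supported": False, "tail_nodes": 0, "views": 0, "adds": 0}
    views = adds = 0
    i = start + 1
    while i < len(ops) and ops[i] in ("VIEW", "ADD"):
        if ops[i] == "VIEW":
            views += 1
        else:
            adds += 1
        i += 1
    supported = views >= 2 and adds == views - 1
    return {"supported": supported, "tail_nodes": i - start, "views": views, "adds": adds}

# Eight routed experts: one MUL(weights), eight VIEWs, seven ADDs.
ops = ["MUL_MAT_ID", "MUL", "VIEW", "VIEW", "ADD"] + ["VIEW", "ADD"] * 6 + ["RMS_NORM"]
result = scan_weighted_tail(ops, start=1)
print(result)
```

With eight routed experts the counts come out as one MUL, eight VIEWs, and seven ADDs, which is consistent with the `tail_nodes` 16, `views` 8, `adds` 7 values in the representative finalize trace row.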

Focused gates:

route	result
`MOE_SWIGLU_DOWN` + Phase135 opt-in + finalize trace	`6` early records, `0` supported tail records
`MOE_SWIGLU_FINALIZE` default	`7/7`
`MOE_SWIGLU_FINALIZE` + Phase135 opt-in + finalize trace	`7/7`, `6` supported tail records

Representative finalize trace row:

field	value
`supported`	`1`
`tail_nodes`	`16`
`views`	`8`
`adds`	`7`
`down_ne`	`2048x8x128` on the 128-token row
`weights_ne`	`1x8x128`
`weights_nb`	`4,4,32`
`final_ne`	`2048x128x1`
`final_nb`	`4,8192,1048576`
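
The `weights_nb` and `final_nb` values in this row are what contiguous F32 tensors of those shapes would have. A short Python check, assuming 4-byte elements and the usual ggml rule that `nb[0]` is the element size and `nb[i] = nb[i-1] * ne[i-1]` for unquantized types (`contiguous_nb` is a hypothetical helper name):

```python
def contiguous_nb(ne, type_size=4):
    """Byte strides for a contiguous tensor with dimension sizes ne.

    nb[0] is the element size; each later stride is the previous stride
    multiplied by the previous dimension size.
    """
    nb = [type_size]
    for dim in ne[:-1]:
        nb.append(nb[-1] * dim)
    return nb

weights_nb = contiguous_nb([1, 8, 128])
final_nb = contiguous_nb([2048, 128, 1])
print(weights_nb, final_nb)
```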

Canonical gates on patched Phase93 binary:

MoE md5	dense md5	`MUL_MAT`	`MUL_MAT_ID`
`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`

Decision:

Keep the trace/test scaffold as Phase138 groundwork.
Proceed next to the default-off down-MMQ finalize/writeback implementation, but only against MOE_SWIGLU_FINALIZE first.
Do not claim a speedup from this attempt; it only proves graph availability and preserves md5/op gates.

Phase136: Routed-FFN Post-Down Weighted Combine

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-routed-ffn-combine-phase136.md.
Result type: rejected source probe; source and sentinel test reverted.
Focused artifact: /home/mudler/bench/phase136_routed_ffn_combine/20260702_083727.
Serving/profile artifact: /home/mudler/bench/phase136_routed_ffn_combine_serving/20260702_085749.
Source files tested and reverted:
- ggml/src/ggml-cuda/moe-ffn.cuh
- ggml/src/ggml-cuda/moe-ffn.cu
- ggml/src/ggml-cuda/ggml-cuda.cu
- tests/test-backend-ops.cpp

Implementation tested:

Added LLAMA_MOE_ROUTED_FFN_COMBINE=1 on top of Phase135.
Extended the early routed-FFN graph hook to skip the post-down MUL(weights) -> VIEW* -> ADD* tail.
Added a separate F32 weighted-combine kernel that preserved expert-rank accumulation order.
Added a temporary full-tail MOE_SWIGLU_COMBINE sentinel for focused correctness/perf.

Focused gates:

route	result
default selected + full-tail sentinel	`MOE_SWIGLU_DOWN,MOE_SWIGLU_COMBINE,MUL_MAT_ID_RAGGED_MOE 20/20`
Phase135 selected + full-tail sentinel	`20/20`
Phase136 selected + full-tail sentinel	`20/20`
Phase136 trace	`6` combine markers, `6` `mmq_moe_quantized_raw`, `0` `mmq_moe_sorted_raw`
post-reject Phase135 selected	`MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13`

Canonical focused gates:

route	MoE md5	dense md5	`GATED_DELTA_NET`	`MUL_MAT`	`MUL_MAT_ID`
Phase136 via `EXTRA_ENV`	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`46/46`	`1146/1146`	`806/806`

Focused perf:

row	default	Phase135	Phase136
`MOE_SWIGLU_DOWN n_tokens=128`	`803.97 us`	`805.77 us`	`806.75 us`
`MOE_SWIGLU_DOWN n_tokens=257`	`1020.15 us`	`1016.53 us`	`1017.11 us`
`MOE_SWIGLU_COMBINE n_tokens=128`	`197.98 us`	`197.74 us`	`191.04 us`
`MOE_SWIGLU_COMBINE n_tokens=257`	`429.22 us`	`428.53 us`	`401.81 us`

Serving/profile gate:

phase	MoE md5	dense md5	`MUL_MAT`	`MUL_MAT_ID`
pre	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`
post	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`

Serving metrics at Phase130 shape:

metric	Phase135 opt-in	Phase136 opt-in
aggregate t/s	`208.0`	`206.5`
decode aggregate t/s	`332.7`	`323.2`
decode per-seq t/s	`2.12`	`2.07`
prefill t/s	`1475.1`	`1519.5`
TTFT mean ms	`8468.1`	`8080.6`
wall s	`39.375`	`39.668`
total kernel time	`20.2498 s`	`19.9778 s`

Serving fine buckets:

bucket	Phase135 opt-in	Phase136 opt-in
`mmq_nvfp4`	`5915.24 ms`	`5885.05 ms`
`gdn_core`	`5926.55 ms`	`5912.65 ms`
`cublas_bf16_gemm`	`1782.58 ms`	`1728.15 ms`
`cutlass_bf16_gemm`	`756.98 ms`	`767.94 ms`
`ew_mul`	`727.04 ms`	`712.97 ms`
`ew_add`	not listed in Phase135 top rows	`374.70 ms`
`act_quant`	`677.59 ms`	`677.60 ms`
`get_rows`	`283.62 ms`	`278.31 ms`
`mmq_fixup`	`104.81 ms`	`103.73 ms`

Decision:

Reject and revert Phase136. The focused synthetic full-tail row improved, but serving aggregate and decode throughput regressed versus Phase135.
Keep Phase135 as the current default-off routed-FFN source base.
Do not retry a separate post-MMQ weighted-combine launch next. A future combine/finalize attempt needs to remove a larger serving-visible boundary, likely by integrating finalize/writeback with the down projection or by changing graph scheduling enough to reduce launches without hurting decode.

Phase137: GDN Geometry Sweep

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-gdn-geometry-sweep-phase137.md.
Result type: rejected env-only serving probe; no source changes.
Focused artifact: /home/mudler/bench/phase137_gdn_geometry_sweep/20260702_091441.
Serving/profile artifact: /home/mudler/bench/phase137_gdn_geometry_serving/20260702_091740.

Implementation tested:

No source edits.
Swept existing GDN_NW/GDN_CPW runtime knobs: default (16,8), (8,8), (16,4), (8,4), and (4,1).
Ran serving only for the best focused candidate: LLAMA_MOE_ROUTED_FFN_POC=1 LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1 GDN_NW=4 GDN_CPW=1.

Focused GDN perf:

row	default	`8x8`	`16x4`	`8x4`	`4x1`
`hc=32,hs=128,nt=1,kda=0`	`6.793748 us`	`6.992506 us`	`6.161572 us`	`5.501046 us`	`4.713682 us`
`hc=32,hs=128,nt=1,kda=1`	`7.790557 us`	`7.639035 us`	`6.553847 us`	`5.772280 us`	`5.194275 us`
`hc=4,hs=128,nt=1,nseq=2,vrep=2,bcast=1`	`5.967364 us`	`4.721621 us`	`3.759859 us`	`3.747508 us`	`3.407998 us`
`hc=32,hs=128,nt=64,kda=0`	`153.718880 us`	`152.660797 us`	`119.964294 us`	`94.862477 us`	`125.016141 us`
`hc=32,hs=128,nt=256,kda=0`	`491.066095 us`	`678.143207 us`	`495.650551 us`	`454.202876 us`	`489.942166 us`
`hc=32,hs=128,nt=512,kda=0`	`1033.510463 us`	`2081.115639 us`	`1197.792952 us`	`1143.683921 us`	`1025.449339 us`
`hc=32,hs=128,nt=1024,kda=0`	`2060.529106 us`	`4382.363825 us`	`2403.995842 us`	`2310.580042 us`	`2060.707900 us`
`hc=4,hs=128,nt=64,kda=0`	`151.409035 us`	`142.777045 us`	`82.000488 us`	`78.839499 us`	`26.777607 us`
`hc=4,hs=128,nt=256,kda=0`	`102.606410 us`	`564.485714 us`	`311.945543 us`	`301.296947 us`	`102.232357 us`
`hc=4,hs=128,nt=512,kda=0`	`198.996831 us`	`1127.205870 us`	`620.111479 us`	`600.911809 us`	`198.595701 us`
`hc=4,hs=128,nt=1024,kda=0`	`396.210102 us`	`2249.487113 us`	`1240.201770 us`	`1200.476178 us`	`395.850039 us`
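
Picking the best geometry per row makes the mixed picture explicit. The Python sketch below copies three of the `hc=32` rows from the table; the helper name `best_config` is hypothetical and the script is not part of the sweep harness.

```python
# us/run for three hc=32,hs=128,kda=0 rows from the focused GDN perf table.
rows = {
    "nt=1": {"default": 6.793748, "8x8": 6.992506, "16x4": 6.161572,
             "8x4": 5.501046, "4x1": 4.713682},
    "nt=64": {"default": 153.718880, "8x8": 152.660797, "16x4": 119.964294,
              "8x4": 94.862477, "4x1": 125.016141},
    "nt=512": {"default": 1033.510463, "8x8": 2081.115639, "16x4": 1197.792952,
               "8x4": 1143.683921, "4x1": 1025.449339},
}

def best_config(row):
    """Return (config name, percent reduction versus default) for one row."""
    name = min(row, key=row.get)
    gain_pct = 100.0 * (row["default"] - row[name]) / row["default"]
    return name, gain_pct

for label, row in rows.items():
    name, gain = best_config(row)
    print(f"{label}: best={name} ({gain:.1f}% below default)")
```

No single geometry wins every row, and an isolated-row winner is only a candidate: the serving run is what decides.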

Serving/profile gate:

phase	MoE md5	dense md5	`MUL_MAT`	`MUL_MAT_ID`
pre	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`
post	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`

Serving metrics at Phase130 shape:

metric	Phase135 opt-in	Phase137 `GDN_NW=4 GDN_CPW=1`
aggregate t/s	`208.0`	`206.2`
decode aggregate t/s	`332.7`	`324.9`
decode per-seq t/s	`2.12`	`2.08`
prefill t/s	`1475.1`	`1499.4`
TTFT mean ms	`8468.1`	`8209.4`
TTFT max ms	not recorded	`14511.2`
wall s	`39.375`	`39.719`
total kernel time	`20.2498 s`	`20.7530 s`

Serving fine buckets:

bucket	Phase135 opt-in	Phase137 `GDN_NW=4 GDN_CPW=1`
`gdn_core`	`5926.55 ms`	`6466.27 ms`
`mmq_nvfp4`	`5915.24 ms`	`5978.87 ms`
`cublas_bf16_gemm`	`1782.58 ms`	`1726.10 ms`
`cutlass_bf16_gemm`	`756.98 ms`	`745.00 ms`
`ew_mul`	`727.04 ms`	`711.72 ms`
`ew_add`	not listed in Phase135 top rows	`367.85 ms`
`act_quant`	`677.59 ms`	`681.32 ms`
`get_rows`	`283.62 ms`	`284.31 ms`
`mmq_fixup`	`104.81 ms`	`103.26 ms`

Decision:

Reject Phase137. The isolated 1-token GDN rows improved, but real serving decode, aggregate throughput, total kernel time, gdn_core, and mmq_nvfp4 all regressed versus Phase135.
Do not edit source for a GDN launch-geometry retune.
Next scoped source line: a default-off MoE finalize/writeback integration in down-MMQ that removes the serving-visible MUL(weights) -> VIEW* -> ADD* tail without adding a standalone combine launch.

Phase135: Routed-FFN Fused SWIGLU-to-NVFP4 Quant

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-routed-ffn-fused-quant-phase135.md.
Result type: source structural base, default-off, serving-profile positive on decode but not parity-closing.
Focused artifact: /home/mudler/bench/phase135_routed_ffn_fused_quant/20260702_081723.
Serving/profile artifact: /home/mudler/bench/phase135_routed_ffn_fused_quant_serving/20260702_082102.
Source files:
- ggml/src/ggml-cuda/mmq.cuh
- ggml/src/ggml-cuda/mmq.cu
- ggml/src/ggml-cuda/moe-ffn.cu

Implementation:

Added LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1 on top of LLAMA_MOE_ROUTED_FFN_POC=1.
Added ggml_cuda_mul_mat_q_moe_quantized(...), a raw MMQ launcher that accepts a caller-owned quantized activation buffer.
Added a Blackwell/NVFP4-only fused kernel that reads gate/up views, uses the existing ids metadata ordering, computes silu(gate) * up, and writes block_fp4_mmq activation layout directly.
MXFP4 and unsupported shapes fall back to earlier paths.
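
The idea of computing silu(gate) * up straight into a block-quantized activation can be illustrated on the CPU. This Python sketch is heavily simplified and rests on assumptions: it uses the eight E2M1 magnitudes and 16-element scaling blocks commonly described for NVFP4, keeps each block scale as a plain float, and does not pack codes, so it is not the block_fp4_mmq layout and not the CUDA kernel. All names are hypothetical.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 (E2M1) magnitudes
BLOCK = 16  # ASSUMPTION: one scale per 16 values

def silu(x):
    return x / (1.0 + math.exp(-x))

def fused_swiglu_quant(gate, up):
    """Compute silu(gate) * up and quantize it block by block.

    Returns (scales, codes): one float scale per block and, per element, a
    signed index into E2M1_GRID. SIMPLIFICATION: scales stay Python floats
    and codes are not packed into nibbles.
    """
    assert len(gate) == len(up) and len(gate) % BLOCK == 0
    scales, codes = [], []
    for start in range(0, len(gate), BLOCK):
        # The activation only exists in this block-sized temporary; no
        # full-size F32 intermediate buffer is written.
        act = [silu(g) * u
               for g, u in zip(gate[start:start + BLOCK], up[start:start + BLOCK])]
        amax = max(abs(a) for a in act)
        scale = amax / E2M1_GRID[-1] if amax > 0.0 else 1.0
        scales.append(scale)
        for a in act:
            idx = min(range(len(E2M1_GRID)),
                      key=lambda i: abs(abs(a) / scale - E2M1_GRID[i]))
            codes.append(-idx if a < 0.0 else idx)
    return scales, codes

def dequantize(scales, codes):
    """Map signed grid indices back to floats using the per-block scale."""
    return [math.copysign(E2M1_GRID[abs(c)], c) * scales[i // BLOCK]
            for i, c in enumerate(codes)]

gate = [0.1 * (i - 8) for i in range(32)]
up = [1.0 + 0.05 * i for i in range(32)]
scales, codes = fused_swiglu_quant(gate, up)
approx = dequantize(scales, codes)
```

The point of the sketch is the data flow: the activation is produced and quantized in the same pass, which is the boundary Phase135 removes relative to the sorted-F32 path.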

Focused gates:

route	result
Phase135 selected	`MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13`
Phase135 trace	`6` `mmq_moe_quantized_raw` launches, `0` `mmq_moe_sorted_raw` launches

Canonical focused gates:

route	MoE md5	dense md5	`GATED_DELTA_NET`	`MUL_MAT`	`MUL_MAT_ID`
Phase135 via `EXTRA_ENV`	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`48/48`	`1146/1146`	`806/806`

Focused perf:

row	default	Phase134	Phase135
`MOE_SWIGLU_DOWN n_tokens=128`	`805.920354 us`	`807.650845 us`	`807.921963 us`
`MOE_SWIGLU_DOWN n_tokens=257`	`1031.064815 us`	`1027.513292 us`	`1024.971370 us`

Serving/profile gate:

phase	MoE md5	dense md5	`MUL_MAT`	`MUL_MAT_ID`
pre	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`
post	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`

Serving metrics at Phase130 shape:

metric	Phase130 default	Phase135 opt-in
aggregate t/s	`208.0`	`208.0`
decode aggregate t/s	`326.9`	`332.7`
decode per-seq t/s	`2.1`	`2.12`
prefill t/s	`1519.6`	`1475.1`
TTFT mean ms	`8170.6`	`8468.1`
wall s	`39.38`	`39.375`
total kernel time	`20.1559 s`	`20.2498 s`

Serving fine buckets:

bucket	Phase130 default	Phase135 opt-in
`mmq_nvfp4`	`6009.52 ms`	`5915.24 ms`
`gdn_core`	`5891.40 ms`	`5926.55 ms`
`cublas_bf16_gemm`	`1735.98 ms`	`1782.58 ms`
`cutlass_bf16_gemm`	`749.64 ms`	`756.98 ms`
`act_quant`	`675.67 ms`	`677.59 ms`
`get_rows`	`280.62 ms`	`283.62 ms`
`mmq_fixup`	not listed in Phase130 top rows	`104.81 ms`

Decision:

Keep Phase135 as the best current default-off routed-FFN base. It is canonical-clean and reduces the dominant mmq_nvfp4 serving bucket.
Do not promote it as parity: aggregate serving is unchanged, prefill/TTFT are worse, and total kernel time is slightly higher due to other buckets.
Next work should target remaining MoE overhead after fused quant, especially mmq_fixup, route/writeback, and weighted-combine/scatter boundaries, or run a broader serving comparison to determine whether the decode improvement persists outside this graph-node profile.

Phase134: Routed-FFN Fused SWIGLU-to-Sorted

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-routed-ffn-fused-swiglu-phase134.md.
Result type: source structural base, default-off, mixed perf.
Artifact: /home/mudler/bench/phase134_routed_ffn_fused_swiglu/20260702_075828.
Source files:
- ggml/src/ggml-cuda/moe-ffn.cuh
- ggml/src/ggml-cuda/moe-ffn.cu
- ggml/src/ggml-cuda/ggml-cuda.cu

Implementation:

Added LLAMA_MOE_ROUTED_FFN_FUSED_SWIGLU=1 on top of LLAMA_MOE_ROUTED_FFN_POC=1.
Passes gate and up views into the Phase132 routed-FFN helper.
Executes gate_up, builds ids metadata, launches a CUDA kernel to write silu(gate) * up directly into expert-sorted F32 rows, then calls Phase133's raw sorted-F32 down MMQ helper.
The fused flag now implies the sorted-down machinery; it does not require LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1.

Selected and trace gates:

route	result
Phase134 selected	`MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13`
Phase134 trace	`MOE_SWIGLU_DOWN 7/7`, `6` `mmq_moe_sorted_raw` launches

Canonical gates:

route	MoE md5	dense md5	`GATED_DELTA_NET`	`MUL_MAT`	`MUL_MAT_ID`
Phase134 via `EXTRA_ENV`	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`48/48`	`1146/1146`	`806/806`

Focused perf sanity:

row	default	Phase132	Phase133	Phase134
`MOE_SWIGLU_DOWN n_tokens=128`	`804.920354 us`	`807.999195 us`	`808.068383 us`	`810.614642 us`
`MOE_SWIGLU_DOWN n_tokens=257`	`1026.024540 us`	`1028.434560 us`	`1029.015432 us`	`1025.682004 us`

Decision:

Keep Phase134 only as default-off structural plumbing. It removes the standalone glu -> get_rows boundary and recovers the n=257 regression, but the extra fused-SWIGLU kernel is still slower at n=128.
Do not promote LLAMA_MOE_ROUTED_FFN_FUSED_SWIGLU=1 as a speedup.
Next work must remove one more boundary, likely by fusing SWIGLU directly into the down-MMQ quant buffer rather than writing an intermediate sorted F32 buffer.

Phase133: Routed-FFN Sorted-Down Raw MMQ

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-routed-ffn-sorted-down-phase133.md.
Result type: source structural base, default-off, not a speedup.
Artifact: /home/mudler/bench/phase133_routed_ffn_sorted_down/20260702_074651.
Source files:
- ggml/src/ggml-cuda/mmq.cuh
- ggml/src/ggml-cuda/mmq.cu
- ggml/src/ggml-cuda/moe-ffn.cu

Implementation:

Exposed ggml_cuda_mmq_ids_meta from mmq.cuh so the routed-FFN helper can reuse the existing GPU ids metadata (ids_src1, ids_dst, expert_bounds).
Added ggml_cuda_mul_mat_q_moe_sorted_f32(...), a raw sorted-F32 MMQ entry that accepts a compact F32 activation pointer plus ids_dst and expert_bounds directly.
Added LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1 on top of LLAMA_MOE_ROUTED_FFN_POC=1. The opt-in path executes baseline gate_up and SWIGLU, gathers SWIGLU output into compact expert-sorted F32 rows, then runs the raw MMQ down helper. It falls back to Phase132 if strict shape/type checks fail.
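
The ids metadata amounts to a counting sort of routed (token, slot) pairs by expert. The Python below is a sketch of that idea only; the exact meaning of `ids_src1`, `ids_dst`, and `expert_bounds` in mmq.cuh is assumed here, as spelled out in the docstring, and `build_ids_meta` is a hypothetical name.

```python
def build_ids_meta(ids, n_experts):
    """Counting sort of routed (token, slot) pairs by expert.

    ids[token][slot] is the expert chosen for that slot.
    ASSUMED semantics for this sketch:
      ids_src1[r]      = token whose activation feeds sorted row r (for an
                         input with one row per pair, this would instead be
                         the flat pair index)
      ids_dst[r]       = flat destination index token * n_slots + slot
      expert_bounds[e] = first sorted row of expert e; the final entry is
                         the total number of routed rows
    """
    n_slots = len(ids[0])
    counts = [0] * n_experts
    for row in ids:
        for e in row:
            counts[e] += 1
    expert_bounds = [0]
    for c in counts:
        expert_bounds.append(expert_bounds[-1] + c)
    cursor = expert_bounds[:-1].copy()  # next free sorted row per expert
    total = expert_bounds[-1]
    ids_src1 = [0] * total
    ids_dst = [0] * total
    for token, row in enumerate(ids):
        for slot, e in enumerate(row):
            r = cursor[e]
            cursor[e] += 1
            ids_src1[r] = token
            ids_dst[r] = token * n_slots + slot
    return ids_src1, ids_dst, expert_bounds

# Three tokens, two routed slots each, three experts.
ids = [[2, 0], [0, 1], [2, 1]]
ids_src1, ids_dst, expert_bounds = build_ids_meta(ids, n_experts=3)
print(ids_src1, ids_dst, expert_bounds)
```

Rows `expert_bounds[e]` up to `expert_bounds[e + 1]` form one expert's compact slice, which is what lets a raw helper consume sorted activations without a fake tensor.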

Selected op gates:

route	result	marker
default	`MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13`	none
Phase132 `LLAMA_MOE_ROUTED_FFN_POC=1`	`13/13`	`6` whole-pattern exec markers
Phase133 `LLAMA_MOE_ROUTED_FFN_POC=1 LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1`	`13/13`	`6` whole-pattern exec markers

Trace proof:

LLAMA_QUANT_TRACE=32 with Phase133 opt-in passed MOE_SWIGLU_DOWN 7/7.
grep -c mmq_moe_sorted_raw phase133_quant_trace.log returned 6, proving the raw sorted-down helper engaged for the NVFP4 rows.

Canonical gates:

route	MoE md5	dense md5	`GATED_DELTA_NET`	`MUL_MAT`	`MUL_MAT_ID`
default	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`48/48`	`1146/1146`	`806/806`
Phase133 via `EXTRA_ENV`	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`48/48`	`1146/1146`	`806/806`

Focused perf sanity:

row	default	Phase132	Phase133
`MOE_SWIGLU_DOWN n_tokens=128`	`807.369268 us`	`808.213194 us`	`808.848753 us`
`MOE_SWIGLU_DOWN n_tokens=257`	`1020.762195 us`	`1018.870935 us`	`1026.874233 us`

Decision:

Keep Phase133 only as default-off structural plumbing. It is correctness-clean and proves the fake-tensor boundary can be replaced with a raw helper, but it adds a separate gather into sorted F32 rows and is not faster.
Do not promote LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1 as a runtime speedup.
Next work must remove the new overhead by fusing SWIGLU directly into sorted rows or directly into the down-MMQ quant buffer. A standalone sorted-down gather is not a parity lever.

Phase132: Default-Off Routed-FFN PoC Scaffold

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-routed-ffn-poc-phase132.md.
Result type: source scaffold, default-off, no math change intended.
Artifact: /home/mudler/bench/phase132_routed_ffn_poc/20260702_072725.
Source files:
- ggml/src/ggml-cuda/moe-ffn.cuh
- ggml/src/ggml-cuda/moe-ffn.cu
- ggml/src/ggml-cuda/ggml-cuda.cu

Build:

First incremental build failed at link because the existing CMake build directory had not reconfigured its globbed CUDA source list, so the new moe-ffn.cu object was not compiled.
Re-running cmake -S . -B build in the DGX mirror picked up moe-ffn.cu; cmake --build build --target test-backend-ops -j"$(nproc)" then passed.
Symbol/string evidence: strings build/bin/libggml-cuda.so | grep -c LLAMA_MOE_ROUTED_FFN_POC returned 1.

Selected op gates:

route	result	trace
default	`MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13`	no opt-in markers
`LLAMA_MOE_ROUTED_FFN_POC=1`	`MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13`	`6` `LLAMA_MOE_WHOLE_PATTERN_EXEC` markers

Canonical gates:

route	MoE md5	dense md5	`GATED_DELTA_NET`	`MUL_MAT`	`MUL_MAT_ID`
default	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`48/48`	`1146/1146`	`806/806`
`LLAMA_MOE_ROUTED_FFN_POC=1` via `EXTRA_ENV`	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`48/48`	`1146/1146`	`806/806`

Focused perf sanity:

row	default	opt-in	delta
`MOE_SWIGLU_DOWN n_tokens=128`	`808.318584 us`	`804.868061 us`	`+0.43%`
`MOE_SWIGLU_DOWN n_tokens=257`	`1023.355828 us`	`1022.713701 us`	`+0.06%`

Decision:

Keep the Phase132 scaffold. It is correctness-clean and neutral, and it gives the next patch a low-conflict helper boundary for a real fused routed-FFN slice.
Do not present Phase132 as a speedup. The helper currently executes the same baseline gate_up, SWIGLU, and down nodes; it only proves default-off ownership, capability gating, and reachability.
Next source phase should replace one internal helper boundary with real work, preferably a routed-FFN packed workspace or direct sorted activation/down path that removes more traffic than Phase116/123.

Phase131: Fused Routed-FFN Scoping Challenge

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-fused-routed-ffn-phase131.md.
Result type: source-selection and design-gate phase; no source changes and no DGX benchmark artifact.
Inputs:
- Phase130 current-stack serving profile: /home/mudler/bench/phase130_current_stack_profile/20260702_070949.
- MoE explorer: 019f2140-de84-7eb2-8ab5-0c7d7de336bd.
- GDN explorer: 019f2141-0af2-7480-bf66-4fd7e67716c5.

Decision:

Reject another incremental MoE/FFN-GEMM shortcut for Phase131. The current stack already includes default grouped FP4-MMQ, default-off W4A16 fallback routes, route metadata scaffolding, and whole-pattern executor ownership proof. Prior route-only, activation-only, tile-policy, W4A16, sorted-output, and fake-executor attempts either regressed or were noise-level.
Reject another incremental GDN shortcut for Phase131. The remaining GDN bucket is dominated by the f32 recurrent-state scan; the safe space around launch geometry, gather/identity, producer fusion, store fusion, BF16 S-cache, and grouped Q/K broadcast has already been tested and rejected under canonical md5/KL gates.
Continue only with a larger default-off fused routed-FFN PoC if the vLLM and llama.cpp audits identify a concrete low-conflict hook. Otherwise, require a standalone CUDA PoC before touching llama.cpp source.

Gates:

No correctness or performance gates were run for this no-source decision phase.
Any follow-up source phase must use the canonical MoE md5 8cb0ce23777bf55f92f63d0292c756b0, dense md5 5951a5b4d624ce891e22ab5fca9bc439, GATED_DELTA_NET, MUL_MAT 1146/1146, MUL_MAT_ID 806/806, and selected MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE op gates before claiming a speedup.

Phase130: Current-Stack Serving Profile Refresh

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-current-stack-serving-profile-phase130.md.
Result type: measurement-only profile; no source changes.
Artifact: /home/mudler/bench/phase130_current_stack_profile/20260702_070949.
Shape: MoE q36-35b-a3b-nvfp4, N=128, prompt 128, generation 64, PARALLEL=128, CTX=131072, graph-node CUDA tracing.

Gates:

phase	MoE md5	dense md5	`MUL_MAT`	`MUL_MAT_ID`
pre	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`
post	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`

Serving metrics:

metric	value
aggregate t/s	`208.0`
decode aggregate t/s	`326.9`
decode per-seq t/s	`2.1`
prefill t/s	`1519.6`
TTFT mean ms	`8170.6`
TTFT max ms	`14315.6`
wall s	`39.38`
total kernel time	`20.1559 s`

Macro buckets:

bucket	time	share
GDN	`6646.64 ms`	`32.98%`
MoE/FFN-GEMM	`6213.70 ms`	`30.83%`
bf16/fp8-proj	`2734.06 ms`	`13.56%`
layout-copy	`1260.74 ms`	`6.25%`
act-quant	`675.67 ms`	`3.35%`
gather	`280.62 ms`	`1.39%`
FA	`267.02 ms`	`1.32%`

Fine buckets:

bucket	time	share
`mmq_nvfp4`	`6009.52 ms`	`29.82%`
`gdn_core`	`5891.40 ms`	`29.23%`
`cublas_bf16_gemm`	`1735.98 ms`	`8.61%`
`cutlass_bf16_gemm`	`749.64 ms`	`3.72%`
`act_quant`	`675.67 ms`	`3.35%`
`convert_dtype`	`656.25 ms`	`3.26%`
`concat_layout`	`443.94 ms`	`2.20%`
`gdn_conv`	`443.80 ms`	`2.20%`
`get_rows`	`280.62 ms`	`1.39%`
`fa`	`257.38 ms`	`1.28%`
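
The share column can be reproduced from the bucket times. A small Python check, assuming the denominator is the total kernel time from the serving metrics (`share_pct` is a hypothetical helper name):

```python
TOTAL_KERNEL_S = 20.1559  # total kernel time from the serving metrics

# A few fine buckets copied from the table, in milliseconds.
fine_buckets_ms = {
    "mmq_nvfp4": 6009.52,
    "gdn_core": 5891.40,
    "get_rows": 280.62,
    "fa": 257.38,
}

def share_pct(bucket_ms, total_kernel_s=TOTAL_KERNEL_S):
    """Bucket time as a percentage of total kernel time."""
    return 100.0 * bucket_ms / (total_kernel_s * 1000.0)

shares = {name: round(share_pct(ms), 2) for name, ms in fine_buckets_ms.items()}
# Gap between the two dominant buckets, as a share of total kernel time.
gap_pct = share_pct(fine_buckets_ms["mmq_nvfp4"] - fine_buckets_ms["gdn_core"])

print(shares)
print(f"mmq_nvfp4 - gdn_core gap: {gap_pct:.2f}% of kernel time")
```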

Decision:

The current serving profile remains a tied two-bucket problem: mmq_nvfp4 and gdn_core are effectively equal and far larger than every candidate cleanup bucket.
Do not spend the next source attempt on paged mask/F16 get-rows or FA cleanup: get_rows and FA are below 1.5% each in this profile, matching the older Phase63 no-go.
The next credible source attempt must either reduce the MoE/FFN-GEMM bucket with a larger executor/kernel than the rejected route/activation shortcuts, or reduce GDN with a materially different recurrent-state/packed-decode design rather than the rejected grouped-broadcast/BF16-cache/geometry/store shapes.

Phase129: Qwen35 GDN Q/K Grouped Broadcast Probe

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-qwen35-gdn-qk-grouped-bcast-phase129.md.
Result type: source attempted, rejected, and reverted.
Default gate artifact: /home/mudler/bench/phase129_qwen35_gdn_qk_bcast/default_20260702_065445.
Focused GDN perf artifact: /home/mudler/bench/phase129_qwen35_gdn_qk_bcast/perf_20260702_065728.
Default decode-profile artifact: /home/mudler/bench/phase129_qwen35_gdn_qk_bcast/decode_default_20260702_065847.
Valid opt-in reject artifact: /home/mudler/bench/phase129_qwen35_gdn_qk_bcast/decode_optin_20260702_070149/gate_pre.
Post-reject artifact: /home/mudler/bench/phase129_qwen35_gdn_qk_bcast/post_reject_20260702_070258.
Candidate env: LLAMA_QWEN35_GDN_QK_BCAST=1.

Candidate implementation:

Added a default-off qk_bcast_grouped branch to src/models/qwen35.cpp and src/models/qwen35moe.cpp.
When enabled, the branch skipped explicit Q/K repeat and called the state-taking build_recurrent_attn(..., state, il, true) overload so the existing ggml_gated_delta_net_set_bcast() op parameter could use grouped Q/K indexing.
Default source behavior remained unchanged when the env was unset.

Evidence:

Default canonical gates passed:
- MoE md5 8cb0ce23777bf55f92f63d0292c756b0;
- dense md5 5951a5b4d624ce891e22ab5fca9bc439;
- GATED_DELTA_NET 46/46;
- MUL_MAT 1146/1146;
- MUL_MAT_ID 806/806.
The first standalone opt-in gate artifact /home/mudler/bench/phase129_qwen35_gdn_qk_bcast/optin_20260702_065604 was not valid evidence because paged-inference-gates.sh only injects model env through EXTRA_ENV.
The valid opt-in gate from the decode harness used PROFILE_ENV="LLAMA_QWEN35_GDN_QK_BCAST=1" and failed before profiling: MoE md5 became b773e2f032aa0e992626d486b321808e instead of the canonical 8cb0ce23777bf55f92f63d0292c756b0.
Focused test-backend-ops perf -o GATED_DELTA_NET was effectively neutral because it exercises op fixtures, not the Qwen35 model-builder branch. The representative rows were:

row	default us/run	opt-in us/run
`head_count=32,head_size=128,n_seq_tokens=1024,qk_bcast_grouped=0`	`2064.48`	`2060.23`
`head_count=4,head_size=128,n_seq_tokens=256,qk_bcast_grouped=0`	`101.69`	`101.61`
`head_count=4,head_size=128,n_seq_tokens=64,v_repeat=2,qk_bcast_grouped=1`	`151.32`	`151.39`

Default decode-profile baseline, before the valid opt-in reject:

metric	default
total kernel time	`3.6916 s`
GDN macro	`1491.99 ms` (`40.42%`)
`gdn_core`	`1411.34 ms` (`38.23%`)
MoE/FFN-GEMM macro	`1475.96 ms` (`39.98%`)
`mmq_nvfp4`	`1458.54 ms` (`39.51%`)

Post-reject rebuild removed the env string from libllama.so (strings ... | grep -c LLAMA_QWEN35_GDN_QK_BCAST == 0) and post-reject gates passed: MoE md5 canonical, dense md5 canonical, GATED_DELTA_NET 46/46, MUL_MAT 1146/1146, MUL_MAT_ID 806/806.

Decision:

Reject and revert Phase129 source. The candidate is not bit-exact for the current qwen35moe decision model.
Do not retry the same Qwen3Next grouped Q/K broadcast port for Qwen35 or Qwen35MoE unless the quality rule is explicitly changed. The current bit-exact md5 gate rejects it before any perf profile is meaningful.
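
A bit-exact gate of this kind reduces to comparing one digest against the canonical value. The Python below is a sketch under an assumption: it hashes raw output bytes, whereas what paged-inference-gates.sh actually hashes is not shown in this ledger. `md5_gate` is a hypothetical name.

```python
import hashlib

CANONICAL_MOE_MD5 = "8cb0ce23777bf55f92f63d0292c756b0"

def md5_gate(output_bytes, canonical=CANONICAL_MOE_MD5):
    """Return (passed, digest) for a bit-exact output check.

    ASSUMPTION: the gate hashes the raw bytes of a deterministic generation.
    Any single-bit difference in the output changes the digest, so there is
    no notion of "close enough" in this gate.
    """
    digest = hashlib.md5(output_bytes).hexdigest()
    return digest == canonical, digest

passed, digest = md5_gate(b"example output that differs from the canonical run")
print(passed, digest)
```

This is why a numerically small change such as a different Q/K indexing path is rejected outright: the gate has no tolerance band.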

Phase128: Qwen3Next GDN BF16 S-Cache Scope

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-qwen3next-gdn-bf16-s-cache-phase128.md.
Result type: source probe rejected and reverted.
Default gate artifact: /home/mudler/bench/phase128_qwen3next_gdn_bf16_s_cache/default_20260702_043939.
Verbose smoke artifact: /home/mudler/bench/phase128_qwen3next_gdn_bf16_s_cache/smoke3_20260702_044434.

Candidate implementation:

Temporarily generalized the Qwen35/Qwen35MoE GDN S-cache selector in src/llama-model.cpp to accept LLAMA_QWEN3NEXT_GDN_S_CACHE_TYPE=bf16 for LLM_ARCH_QWEN3NEXT.
Preserved the existing LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16 behavior.
Reverted the source probe after validation showed it does not apply to the current decision model and no true Qwen3Next artifact is available.

Evidence:

Default GATED_DELTA_NET op gate passed 48/48.
Default canonical gates passed:
- MoE md5 8cb0ce23777bf55f92f63d0292c756b0;
- dense md5 5951a5b4d624ce891e22ab5fca9bc439;
- MUL_MAT passed;
- MUL_MAT_ID passed.
Verbose smoke showed the active model metadata: general.architecture = qwen35moe, print_info: arch = qwen35moe.
With LLAMA_QWEN3NEXT_GDN_S_CACHE_TYPE=bf16, recurrent cache logs still showed S (f32): 60.00 MiB, as expected for a qwen35moe model.
DGX search found no true Qwen3Next GGUF under /home/mudler/bench or /home/mudler.

Decision:

Reject and revert the Qwen3Next selector change for the current parity run.
Do not retry the existing Qwen35/Qwen35MoE BF16 S-cache lever under the current rules: Phase81 showed it reduced gdn_core, but Phase82 rejected it because MoE md5 changed and the full f16-reference KL gate missed the hard acceptance band.
A future BF16-S-cache attempt needs either a deliberately re-scoped quality gate or an actual Qwen3Next model artifact to validate.

Phase127: Whole-MoE Expert-Major Executor

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-moe-whole-expert-major-phase127.md.
Result type: source attempted, rejected, and reverted. Phase126 helper remains.
Red artifact: /home/mudler/bench/phase127_moe_whole_expert_major/red_20260702_042125.
Green artifact: /home/mudler/bench/phase127_moe_whole_expert_major/green2_20260702_042916.
Perf artifact: /home/mudler/bench/phase127_moe_whole_expert_major/perf_20260702_043104.
Post-reject artifact: /home/mudler/bench/phase127_moe_whole_expert_major/post_reject_20260702_043318.
Candidate env: LLAMA_MOE_WHOLE_EXPERT_MAJOR=1 LLAMA_MOE_WHOLE_EXPERT_MAJOR_TRACE=128.

Candidate implementation:

Added an opt-in executor at the existing early whole-pattern match.
Built route metadata once with ggml_cuda_launch_mm_ids_helper().
Wrote gate_up to a sorted F32 temporary using identity ids_dst.
Ran SWIGLU on a fake contiguous split-half [2*n_ff, ne_get_rows] tensor.
Ran down MMQ from sorted activations through the Phase126 ggml_cuda_mul_mat_q_moe_with_ids(..., src1_sorted=true) helper.
Unpermuted once after down into the real graph destination.
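
The six steps above can be sketched in plain Python to show why a single unpermute is sufficient. This is a toy model under stated assumptions, not the CUDA executor: `moe_token_major`, `moe_expert_major`, and the tiny shapes are hypothetical, weights are stored as row-major lists, and the split-half layout is assumed to put gate in the first `n_ff` values and up in the last `n_ff`.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def swiglu_split_half(gate_up_row, n_ff):
    """SWIGLU on a split-half row (assumed layout: gate first, up second)."""
    gate, up = gate_up_row[:n_ff], gate_up_row[n_ff:]
    return [silu(g) * u for g, u in zip(gate, up)]

def moe_token_major(x, ids, gate_up_w, down_w, n_ff):
    """Reference: handle every (token, slot) pair in its original order."""
    out = []
    for t, slots in enumerate(ids):
        row = []
        for e in slots:
            act = swiglu_split_half(matvec(gate_up_w[e], x[t]), n_ff)
            row.append(matvec(down_w[e], act))
        out.append(row)
    return out

def moe_expert_major(x, ids, gate_up_w, down_w, n_ff):
    # 1. Build route metadata once: (expert, token, slot), sorted by expert.
    routes = sorted((e, t, s) for t, slots in enumerate(ids)
                    for s, e in enumerate(slots))
    # 2. gate_up into a sorted temporary (row i belongs to routes[i]).
    gate_up_sorted = [matvec(gate_up_w[e], x[t]) for e, t, _ in routes]
    # 3. SWIGLU directly on the sorted split-half rows.
    act_sorted = [swiglu_split_half(r, n_ff) for r in gate_up_sorted]
    # 4. The down projection consumes the sorted activations.
    down_sorted = [matvec(down_w[e], a)
                   for (e, _, _), a in zip(routes, act_sorted)]
    # 5. Unpermute once into the (token, slot) destination layout.
    out = [[None] * len(slots) for slots in ids]
    for (_, t, s), y in zip(routes, down_sorted):
        out[t][s] = y
    return out

# Hypothetical toy shapes; each down_w[e] has n_embd rows of n_ff values.
n_embd, n_ff, n_expert = 2, 2, 3
gate_up_w = [[[0.1 * (e + r + c + 1) for c in range(n_embd)]
              for r in range(2 * n_ff)] for e in range(n_expert)]
down_w = [[[0.05 * (e - r + c + 1) for c in range(n_ff)]
           for r in range(n_embd)] for e in range(n_expert)]
x = [[1.0, -2.0], [0.5, 0.25], [-1.5, 3.0]]
ids = [[2, 0], [1, 2], [0, 1]]
```

Both functions perform the same arithmetic per route, so the outputs match exactly. The sketch says nothing about cost: the temporary buffers in steps 2 to 4 are the traffic that the perf rows below measure.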

Attempt notes:

The red gate passed by fallback and emitted zero LLAMA_MOE_WHOLE_EXPERT_MAJOR markers.
The first green attempt aborted because the executor interpreted down_w as [n_embd, n_ff, experts]. A debug trace proved the correct shape is [n_ff, n_embd, experts]; the dimension fix made the selected green gate pass.

Gates:

gate	result
red `MOE_SWIGLU_DOWN`	`7/7`, zero expert-major markers
default selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE`	`13/13`
opt-in `MOE_SWIGLU_DOWN`	`7/7`, six expert-major markers
candidate canonical md5/op	skipped because perf rejected source
post-reject selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE`	`13/13`
post-reject MoE md5	`8cb0ce23777bf55f92f63d0292c756b0`
post-reject dense md5	`5951a5b4d624ce891e22ab5fca9bc439`
post-reject `MUL_MAT`	`1146/1146`
post-reject `MUL_MAT_ID`	`806/806`

Focused perf:

arm	`MOE_SWIGLU_DOWN n=128`	`MUL_MAT_ID_RAGGED_MOE n=128`	`MOE_SWIGLU_DOWN n=257`	`MUL_MAT_ID_RAGGED_MOE n=257`
default	`802.57 us`	`1236.67 us`	`1023.25 us`	`1455.65 us`
expert-major opt-in	`812.14 us`	`1238.50 us`	`1039.36 us`	`1455.06 us`

Decision:

Reject and revert Phase127 source. The path passed correctness but missed the keep rule: MOE_SWIGLU_DOWN n=128 regressed about 1.2% and n=257 regressed about 1.6%; no row reached the required >=3% improvement.
Do not retry the same fake-tensor whole-executor shape. It removes the early unsort boundary but adds enough temporary traffic and quant/layout work to lose on the focused rows. The next MoE attempt must reduce temporary traffic or move closer to a real fused grouped MMQ/SWIGLU/down path; otherwise pivot to the scoped GDN BF16 S-cache experiment with non-md5 numerical gates.
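
The keep-rule arithmetic behind this decision can be reproduced from the focused perf table. The sketch below is a minimal check, assuming the keep rule means a latency reduction of at least 3% relative to the default arm on some row; the helper names are hypothetical.

```python
def regression_pct(default_us, candidate_us):
    """Latency increase of the candidate relative to default, in percent."""
    return (candidate_us / default_us - 1.0) * 100.0

def meets_keep_rule(default_us, candidate_us, min_gain_pct=3.0):
    """True when the candidate is at least min_gain_pct faster than default."""
    return -regression_pct(default_us, candidate_us) >= min_gain_pct

# (default us, expert-major opt-in us) from the Phase127 focused perf table.
phase127_rows = {
    "MOE_SWIGLU_DOWN n=128": (802.57, 812.14),
    "MUL_MAT_ID_RAGGED_MOE n=128": (1236.67, 1238.50),
    "MOE_SWIGLU_DOWN n=257": (1023.25, 1039.36),
    "MUL_MAT_ID_RAGGED_MOE n=257": (1455.65, 1455.06),
}

for name, (default_us, candidate_us) in phase127_rows.items():
    delta = regression_pct(default_us, candidate_us)
    print(f"{name}: {delta:+.2f}% keep={meets_keep_rule(default_us, candidate_us)}")
```

No row passes, and both MOE_SWIGLU_DOWN rows move in the wrong direction, which matches the rejection above.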

Phase126: MMQ Presorted Helper Scaffold

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-mmq-presorted-helper-phase126.md.
Result type: source scaffold kept; no default behavior change intended.
Artifact: /home/mudler/bench/phase126_mmq_presorted_helper/fix1_20260702_040858.
Source scope:
- ggml/src/ggml-cuda/mmq.cu
- ggml/src/ggml-cuda/mmq.cuh
Candidate implementation:
- refactored the current MoE ggml_cuda_mul_mat_q() id path into an internal helper that accepts prebuilt ids_src1, ids_dst, and expert_bounds;
- added the public CUDA-internal wrapper ggml_cuda_mul_mat_q_moe_with_ids(..., bool src1_sorted);
- preserved current behavior by having the existing path build metadata and call the helper with src1_sorted=false;
- added src1_sorted=true support for the future whole-MoE executor without wiring that executor in this phase.

Attempt notes:

The initial Phase126 build/gate attempt compiled and passed the selected gates, but local review found the helper had widened the default MMQ q-buffer stride from n_expert_used to ne_get_rows. The fix1 attempt restored the old stride for src1_sorted=false; that is the accepted artifact below.
One canonical gate invocation failed because it was nested under an outer DGX lock while paged-inference-gates.sh owns the lock itself. The gate was rerun cleanly outside the outer lock.

Gates:

gate	result
build `test-backend-ops llama-completion`	passed
selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE`	`13/13`
MoE md5	`8cb0ce23777bf55f92f63d0292c756b0`
dense md5	`5951a5b4d624ce891e22ab5fca9bc439`
`MUL_MAT`	`1146/1146`
`MUL_MAT_ID`	`806/806`

Focused perf:

row	runs	us/run	TFLOPS
`MOE_SWIGLU_DOWN n=128`	`1243`	`805.99`	`11.99`
`MUL_MAT_ID_RAGGED_MOE n=128`	`832`	`1243.85`	`2.59`
`MOE_SWIGLU_DOWN n=257`	`984`	`1018.74`	`19.05`
`MUL_MAT_ID_RAGGED_MOE n=257`	`704`	`1452.84`	`4.45`

Decision:

Keep the scaffold as a Phase127 dependency. This phase is perf-neutral versus the Phase125 baseline/control band and preserves canonical md5/op gates.
Do not claim parity progress from Phase126 alone. The useful next step is to use this helper inside the whole-pattern executor so gate_up output, SWIGLU, and down input stay in expert-major order, with one unpermute after the full FFN.

Phase125: Expert-Major Sorted Output Scope

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-moe-expert-major-sorted-output-phase125.md.
Result type: source implementation spec, followed by a scoped source attempt that was tested, rejected, and reverted.
Subagent findings:
- llama.cpp audit: the full expert-major executor is credible but too large for a first patch. The first slice should add a sorted-output grouped MMQ mode so expert_bounds can be used without scattering through ids_dst.
- vLLM audit: portable ideas are expert-major layout across both GEMMs, one permute/unpermute boundary, expert offsets for activation quant/scales, and whole-layer measurement. CUTLASS/FlashInfer pointer-array, TMA, and FP4 scale-swizzle contracts should not be copied into GGML/MMQ.
- local GDN challenge: Phase124's gdn_core bucket is material, but prior small GDN attempts already rejected the obvious decode/core knobs. A new GDN win would need a larger recurrence redesign, not a Phase125 shortcut.
Decision:
- Phase125 source was tested and rejected. Do not carry LLAMA_MOE_EXPERT_MAJOR_SORTED_OUT, the mmq_args identity-destination flag, the MMQ sorted-output temporary, or the immediate unsort proof path.
- The full expert-major gate_up -> SWIGLU -> down executor remains the right conceptual MoE target, but the first slice proved that sorted-output plus immediate unsort is too expensive to be a stepping stone by itself. Any follow-up must avoid adding an extra unsort boundary and must consume sorted activations directly in the down GEMM.
Red/baseline attempt:
- Red artifact: /home/mudler/bench/phase125_moe_expert_major_sorted_output/red_valid_20260702_032918.
- Baseline artifact: /home/mudler/bench/phase125_moe_expert_major_sorted_output/baseline_valid_20260702_032923.
- Red env: LLAMA_MOE_EXPERT_MAJOR_SORTED_OUT=1 LLAMA_MOE_EXPERT_MAJOR_SORTED_TRACE=32.
- Red result: test-backend-ops perf -o MOE_SWIGLU_DOWN exited 0 and emitted 0 LLAMA_MOE_EXPERT_MAJOR_SORTED markers, as expected before implementation.
- Baseline selected gate: test-backend-ops test -o MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE passed 13/13.

Baseline perf rows:

row	runs	us/run	GFLOP/run	TFLOPS
`MOE_SWIGLU_DOWN n=128`	`1243`	`809.70`	`9.66`	`11.93`
`MUL_MAT_ID_RAGGED_MOE n=128`	`832`	`1244.18`	`3.22`	`2.59`
`MOE_SWIGLU_DOWN n=257`	`984`	`1016.44`	`19.40`	`19.09`
`MUL_MAT_ID_RAGGED_MOE n=257`	`688`	`1453.65`	`6.47`	`4.45`

Source attempt:

Artifact: /home/mudler/bench/phase125_moe_expert_major_sorted_output/20260702_033931.
Candidate env: LLAMA_MOE_EXPERT_MAJOR_SORTED_OUT=1 LLAMA_MOE_EXPERT_MAJOR_SORTED_TRACE=32.
Candidate implementation:
- added an internal mmq_args identity-destination flag;
- wrote NVFP4 grouped MMQ output to a sorted temporary when the env was set;
- inverted ids_dst on GPU and immediately used get_rows_cuda to restore the normal destination layout;
- emitted bounded LLAMA_MOE_EXPERT_MAJOR_SORTED trace markers.
Correctness:
- default selected MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE: 13/13;
- opt-in sorted MOE_SWIGLU_DOWN: 7/7;
- opt-in correctness markers: 12 (gate_up and down for six NVFP4 rows).

Perf:

arm	`MOE_SWIGLU_DOWN n=128`	`MUL_MAT_ID_RAGGED_MOE n=128`	`MOE_SWIGLU_DOWN n=257`	`MUL_MAT_ID_RAGGED_MOE n=257`
control	`806.13 us`	`1250.99 us`	`1027.15 us`	`1457.69 us`
Phase121 exec	`805.16 us`	`1247.92 us`	`1023.83 us`	`1457.67 us`
sorted-output proof	`888.76 us`	`1283.17 us`	`1192.05 us`	`1528.27 us`

Rejection:

Reject and revert. The proof passed correctness, but it badly missed the keep rule: versus Phase121 exec, MOE_SWIGLU_DOWN n=128 regressed by about 10.4% and n=257 regressed by about 16.4%. The standalone MUL_MAT_ID_RAGGED_MOE rows also regressed, by about 2.8% at n=128 and 4.8% at n=257.
Post-reject artifact: /home/mudler/bench/phase125_moe_expert_major_sorted_output/post_reject_20260702_034232.
Post-reject gates:
- build: 0;
- selected MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE: 13/13;
- retained Phase121 exec MOE_SWIGLU_DOWN: 7/7, six exec markers;
- MoE md5: 8cb0ce23777bf55f92f63d0292c756b0;
- dense md5: 5951a5b4d624ce891e22ab5fca9bc439;
- MUL_MAT: 1146/1146;
- MUL_MAT_ID: 806/806.

Phase124: Current MoE Serving Graph-Node Refresh

Date: 2026-07-02.
Artifact: /home/mudler/bench/phase124_current_moe_profile/20260702_031205.
Result type: current-stack llama.cpp graph-node serving profile; no source change.
Shape: MoE q36-35b-a3b-nvfp4, N=128, PTOK=128, GEN=64, PARALLEL=128, CTX=131072, BATCH=2048, UBATCH=512.
Profiler: nsys launch --cuda-graph-trace=node, bucketed with /home/mudler/bench/bucket2.py.

Gates:

phase	MoE md5	dense md5	`MUL_MAT`	`MUL_MAT_ID`
pre	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`
post	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`

Serving result under graph-node profiling:

n	agg_tps	decode_agg_tps	decode_perseq_tps	prefill_tps	ttft_mean_ms	wall_s
`128`	`206.2`	`320.3`	`2.11`	`1536.4`	`8826.7`	`39.738`

Macro buckets:

bucket	time ms	share	instances
GDN	`6665.04`	`33.10%`	`20790`
MoE/FFN-GEMM	`6246.97`	`31.03%`	`52484`
bf16/fp8-proj	`2687.28`	`13.35%`	`51960`
layout-copy	`1259.59`	`6.26%`	`79100`
ew-mul(weight/norm/GDN)	`728.03`	`3.62%`	`50422`
act-quant	`674.88`	`3.35%`	`36084`
FA	`264.14`	`1.31%`	`3530`

Fine buckets:

bucket	macro	time ms	share	instances
`mmq_nvfp4`	MoE/FFN-GEMM	`6074.78`	`30.17%`	`33204`
`gdn_core`	GDN	`5888.31`	`29.25%`	`4500`
`cublas_bf16_gemm`	bf16/fp8-proj	`1722.37`	`8.55%`	`21970`
`cutlass_bf16_gemm`	bf16/fp8-proj	`766.57`	`3.81%`	`26380`
`ew_mul`	ew-mul(weight/norm/GDN)	`723.07`	`3.59%`	`46494`
`act_quant`	act-quant	`674.88`	`3.35%`	`36084`
`convert_dtype`	layout-copy	`660.48`	`3.28%`	`51300`
`gdn_conv`	GDN	`457.10`	`2.27%`	`6960`
`concat_layout`	layout-copy	`440.02`	`2.19%`	`2040`

Decision:

Phase124 confirms the current serving gap is still a two-bucket problem: mmq_nvfp4 and gdn_core together account for about 59.4% of kernel time.
The act_quant bucket is only 3.35%, explaining why Phase116/123 fused-activation shortcuts did not move end-to-end rows.
Do not fund more route-only, activation-only, or tile-policy MoE shortcuts. Next source work must either own the full expert-major MoE pipeline to reduce mmq_nvfp4, or attack gdn_core with a default-off GDN decode experiment measured against this Phase124/Phase77 bucket.
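
The bucket argument above is an Amdahl-style bound, and a short calculation makes it concrete. The sketch below is illustrative only: it assumes the fine-bucket shares are fractions of total profiled kernel time and that speeding up one bucket leaves the others unchanged; `max_speedup` is a hypothetical helper.

```python
# Fine-bucket shares (percent of kernel time) from the Phase124 table.
fine_share = {"mmq_nvfp4": 30.17, "gdn_core": 29.25, "act_quant": 3.35}

def max_speedup(share_pct, bucket_speedup=float("inf")):
    """Amdahl bound on whole-profile speedup when one bucket gets faster."""
    f = share_pct / 100.0
    return 1.0 / ((1.0 - f) + f / bucket_speedup)

two_bucket_share = fine_share["mmq_nvfp4"] + fine_share["gdn_core"]
print(f"mmq_nvfp4 + gdn_core share: {two_bucket_share:.1f}%")
print(f"act_quant removed entirely: {max_speedup(fine_share['act_quant']):.3f}x")
print(f"mmq_nvfp4 made 2x faster:   {max_speedup(fine_share['mmq_nvfp4'], 2.0):.3f}x")
```

Under these assumptions, deleting act_quant outright caps the gain at roughly 3.5% of kernel time, while halving mmq_nvfp4 alone would be worth several times more.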

Phase123: MoE Executor Fused Down Input

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-moe-executor-fused-down-input-phase123.md.
Artifact: /home/mudler/bench/phase123_moe_executor_fused_down_input/20260702_025811.
Red check artifact: /home/mudler/bench/phase123_moe_executor_fused_down_input/red_20260702_025031.
Candidate env: LLAMA_MOE_WHOLE_PATTERN_EXEC=1 LLAMA_MOE_WHOLE_PATTERN_FUSED_DOWN=1.
Source decision: reject and revert. Do not carry the LLAMA_MOE_WHOLE_PATTERN_FUSED_DOWN env, NVFP4 fused SwiGLU quant kernel, or ggml_cuda_mul_mat_q_moe_swiglu_down() helper.

Gates:

gate	result	trace markers
red check fused-down trace before implementation	`7/7` test rows	`0` fused-down markers
default selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE`	`13/13`	n/a
fused-down `MOE_SWIGLU_DOWN`	`7/7`	`6` fused-down markers
post-reject selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE`	`13/13`	n/a
post-reject Phase121 exec `MOE_SWIGLU_DOWN`	`7/7`	`6` exec markers

Perf:

arm	`MOE_SWIGLU_DOWN n=128`	`MUL_MAT_ID_RAGGED_MOE n=128`	`MOE_SWIGLU_DOWN n=257`	`MUL_MAT_ID_RAGGED_MOE n=257`
control	`812.340097 us`	`1242.909856 us`	`1021.592480 us`	`1461.043605 us`
Phase121 exec	`811.152856 us`	`1248.876202 us`	`1023.089980 us`	`1455.405523 us`
fused-down	`810.617860 us`	`1250.528750 us`	`1023.657464 us`	`1459.239826 us`

Decision:

Reject the standalone fused-down activation quantization path. It passed correctness, but the target row was flat-to-negative and far below the 2% keep rule.
Keep Phase121 executor proof only. The next MoE attempt should not be another one-boundary activation materialization shortcut; it needs a full expert-major packed pipeline or a different measured bottleneck.

Phase122: MoE Shared Route Metadata

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-moe-shared-route-meta-phase122.md.
Artifact: /home/mudler/bench/phase122_moe_shared_route_meta/20260702_043212.
Candidate env: LLAMA_MOE_WHOLE_PATTERN_EXEC=1 LLAMA_MOE_WHOLE_PATTERN_SHARED_ROUTE=1.
Source decision: reject and revert. Do not carry the public ggml_cuda_mmq_ids_meta API, shared-route executor helper, or LLAMA_MOE_WHOLE_PATTERN_SHARED_ROUTE env.

Gates:

gate	result	trace markers
default selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE`	`13/13`	n/a
shared-route `MOE_SWIGLU_DOWN`	`7/7`	`6` shared-route markers
post-reject selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE`	`13/13`	n/a
post-reject Phase121 exec `MOE_SWIGLU_DOWN`	`7/7`	`6` exec markers

Perf:

arm	`MOE_SWIGLU_DOWN n=128`	`MUL_MAT_ID_RAGGED_MOE n=128`	`MOE_SWIGLU_DOWN n=257`	`MUL_MAT_ID_RAGGED_MOE n=257`
control	`808.519710 us`	`1245.913462 us`	`1022.664622 us`	`1457.690407 us`
Phase121 exec	`808.189863 us`	`1250.302500 us`	`1020.849593 us`	`1461.318314 us`
shared-route	`811.836039 us`	`1246.143029 us`	`1051.665618 us`	`1449.548295 us`

Decision:

Reject the shared-route metadata API/path: it did not meet the keep rule and regressed the target MOE_SWIGLU_DOWN n=257 row by about 3% versus the Phase121 executor.
Keep Phase121 executor proof only. Route-only reuse is closed as a parity lever; the next executor scope must remove a larger activation/down boundary.

Phase121: MoE Whole-Pattern Exec Proof

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-moe-whole-pattern-exec-proof-phase121.md.
Initial artifact: /home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041543.
Fix1 artifact: /home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041739_fix1.
Source decision: keep fix1 default-off executor proof; it proves ownership and skip accounting but does not yet fuse work.

Gates:

run	result
fix1 selected default, `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE`	`13/13`
fix1 exec proof, `LLAMA_MOE_WHOLE_PATTERN_EXEC=1 MOE_SWIGLU_DOWN`	`7/7`
fix1 MoE md5	`8cb0ce23777bf55f92f63d0292c756b0`
fix1 dense md5	`5951a5b4d624ce891e22ab5fca9bc439`
fix1 `MUL_MAT` gate	`1146/1146`
fix1 `MUL_MAT_ID` gate	`806/806`

Perf:

row	control us	exec us	change
`MOE_SWIGLU_DOWN n_tokens=128`	`807.772325`	`806.051488`	`+0.21%`
`MOE_SWIGLU_DOWN n_tokens=257`	`1021.114837`	`1020.839431`	`+0.03%`
`MUL_MAT_ID_RAGGED_MOE n=128`	`1243.250000`	`1243.313702`	`-0.01%`
`MUL_MAT_ID_RAGGED_MOE n=257`	`1450.889205`	`1456.279070`	`-0.37%`

Trace:

Initial run passed correctness but emitted 0 exec markers because the exec branch was accidentally nested under the early trace env condition.
Fix1 exec gate emitted 6 skip=4 markers for the supported correctness rows.
Fix1 exec perf emitted 6 skip=4 markers covering n_tokens=128 and n_tokens=257.

Decision:

Keep the default-off executor proof.
It changes no default behavior and proves that the early matcher can own gate_up, skip both views, execute GLU and down, and return a skip count of 4.
Next phase should turn the proof helper into a useful executor by replacing one internal boundary at a time. The most defensible next slice is route-plan reuse inside the helper or activation in route-slot order, not another graph detector.

Phase120: MoE Early Whole-Pattern Matcher

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-moe-early-whole-pattern-phase120.md.
Initial artifact: /home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040153.
Fix1 artifact: /home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040515_fix1.
Fix2 artifact: /home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040725_fix2.
Source decision: keep fix2 default-off early matcher/trace; no execution is skipped yet.

Gates:

run	result
fix2 selected default, `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE`	`13/13`
fix2 early trace, `LLAMA_MOE_WHOLE_PATTERN_EARLY_TRACE=16 MOE_SWIGLU_DOWN`	`7/7`
fix2 MoE md5	`8cb0ce23777bf55f92f63d0292c756b0`
fix2 dense md5	`5951a5b4d624ce891e22ab5fca9bc439`
fix2 `MUL_MAT` gate	`1146/1146`
fix2 `MUL_MAT_ID` gate	`806/806`

Perf:

row	control us	early trace us	change
`MOE_SWIGLU_DOWN n_tokens=128`	`803.937002`	`808.978278`	`-0.62%`
`MOE_SWIGLU_DOWN n_tokens=257`	`1020.411585`	`1026.072597`	`-0.55%`
`MUL_MAT_ID_RAGGED_MOE n=128`	`1246.259615`	`1243.800481`	`+0.20%`
`MUL_MAT_ID_RAGGED_MOE n=257`	`1456.428779`	`1456.109012`	`+0.02%`

Trace:

Initial artifact emitted 96 early markers with only 6 supported rows; fix1 emitted 104 markers with only 6 supported rows.
Fix2 emits exactly 6 early markers, all supported, covering n_tokens=128 and n_tokens=257.
The fix2 marker proves the executor entry contract before GEMM1 dispatch: skip_ready=4, ids_match=1, swiglu=1, n_used=8, experts=128, n_embd=2048, n_ff=768.

Decision:

Keep the default-off early matcher/trace.
This does not improve runtime by itself; it establishes the correct hook for the next executor attempt.
Next phase should add a guarded executor at this matcher. First prove that it can own the five-node sequence and return 4 only after reproducing the existing outputs, then move useful work into the helper: route-plan reuse across both expert GEMMs, activation in route-slot order, and later direct weighted combine.

Phase119: MoE Whole-Pattern Contract

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-moe-whole-pattern-contract-phase119.md.
Initial artifact: /home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_034729.
Fix1 artifact: /home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_035126_fix1.
Source decision: keep default-off contract trace after fix1; no runtime executor yet.

Gates:

run	result
fix1 selected default, `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE`	`13/13`
fix1 trace gate, `LLAMA_MOE_WHOLE_PATTERN_TRACE=16 MOE_SWIGLU_DOWN`	`7/7`
fix1 MoE md5	`8cb0ce23777bf55f92f63d0292c756b0`
fix1 dense md5	`5951a5b4d624ce891e22ab5fca9bc439`
fix1 `MUL_MAT` gate	`1146/1146`
fix1 `MUL_MAT_ID` gate	`806/806`

Initial perf:

row	control us	trace us	change
`MOE_SWIGLU_DOWN n_tokens=128`	`809.251810`	`811.777597`	`-0.31%`
`MOE_SWIGLU_DOWN n_tokens=257`	`1015.069697`	`1028.937243`	`-1.35%`
`MUL_MAT_ID_RAGGED_MOE n=128`	`1247.114183`	`1247.876202`	`-0.06%`
`MUL_MAT_ID_RAGGED_MOE n=257`	`1450.355114`	`1456.109012`	`-0.40%`

Fix1 perf:

row	control us	trace us	change
`MOE_SWIGLU_DOWN n_tokens=128`	`805.399839`	`805.584071`	`-0.02%`
`MOE_SWIGLU_DOWN n_tokens=257`	`1019.715447`	`1021.836382`	`-0.21%`
`MUL_MAT_ID_RAGGED_MOE n=128`	`1247.504808`	`1247.542067`	`-0.00%`
`MUL_MAT_ID_RAGGED_MOE n=257`	`1458.351744`	`1454.090116`	`+0.29%`

Trace:

Initial and fix1 trace perf emitted 6 whole-pattern markers.
Fix1 covered supported NVFP4 contract rows at n_tokens=128 and n_tokens=257: view_pair=1, ids_match=1, swiglu=1, n_used=8, experts=128, n_embd=2048, n_ff=768.
The trace gate also covered smaller correctness shapes; the F32 row reports supported=0 by design because the executor target is native FP4.

Decision:

Keep the default-off trace/contract scaffold.
This phase does not promote a runtime optimization.
The next executor attempt should be matched from the earlier gate_up MUL_MAT_ID node, not from the current GLU -> down validation hook, so it can own route-plan reuse, GEMM1, activation, GEMM2, and later weighted combine.

Phase118: MoE Route Cache

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-moe-route-cache-phase118.md.
Artifact: /home/mudler/bench/phase118_moe_route_cache/20260702_030549.
Source decision: reject and revert runtime cache; keep helper refactor only.

Preflight note:

The initial pgrep -af "[l]ocal-ai-worker" preflight was a false positive because the remote shell contained the literal text local-ai-worker busy. Corrected follow-up used pgrep -x local-ai-worker; Docker, worker, and GPU compute-app checks were clean.
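
The false positive can be simulated without a DGX. The sketch below only approximates pgrep matching with Python regexes, and the process table is hypothetical; it shows that the bracket trick hides the pgrep invocation itself but still matches any other command line that carries the literal worker name.

```python
import re

def pgrep_f(pattern, procs):
    """Rough model of `pgrep -f`: regex-search each full command line."""
    return [cmd for _, cmd in procs if re.search(pattern, cmd)]

def pgrep_x(name, procs):
    """Rough model of `pgrep -x`: exact match on the process name."""
    return [cmd for proc_name, cmd in procs if proc_name == name]

# Hypothetical process table: (process name, full command line).
procs = [
    ("pgrep", "pgrep -af [l]ocal-ai-worker"),
    ("bash", 'bash -c pgrep -af "[l]ocal-ai-worker" && echo local-ai-worker busy'),
    ("sshd", "sshd: user@pts/0"),
]

print(pgrep_f("[l]ocal-ai-worker", procs))  # only the wrapping shell matches
print(pgrep_x("local-ai-worker", procs))    # no worker process is running
```

The wrapping shell matches because its command line contains the plain text `local-ai-worker busy`, which is the situation described above; matching on the exact process name avoids it.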

Gates:

run	result
helper refactor selected gate	`13/13`
cache default selected gate	`13/13`
cache opt-in selected gate, `LLAMA_MOE_ROUTE_CACHE=1`	`13/13`
post-reject selected gate	`13/13`

Perf:

row	baseline us	cache us	change
`MOE_SWIGLU_DOWN n_tokens=128`	`799.360447`	`803.738437`	`-0.55%`
`MOE_SWIGLU_DOWN n_tokens=257`	`1017.711382`	`1011.915152`	`+0.57%`
`MUL_MAT_ID_RAGGED_MOE n=128`	`1239.332933`	`1239.560096`	`-0.02%`
`MUL_MAT_ID_RAGGED_MOE n=257`	`1447.588068`	`1441.795455`	`+0.40%`

Trace:

LLAMA_MOE_ROUTE_CACHE=1 LLAMA_MOE_ROUTE_CACHE_TRACE=128 on MOE_SWIGLU_DOWN n_tokens=128: 23 hits, 3 misses.

Decision:

Reject and revert the runtime route cache. It proves reuse is possible, but the win is too small for the additional context-owned state and graph-capture lifetime surface.
Keep only the local ggml_cuda_mmq_ids_meta helper refactor as low-conflict groundwork for a future whole-pattern executor.

Phase117: MoE Route-Once Boundary Timing

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-moe-route-once-boundary-phase117.md.
Artifact: /home/mudler/bench/phase117_moe_route_once_boundary/20260702_024140.
Trace env: LLAMA_MOE_BOUNDARY_TRACE=1; optional timings with LLAMA_MOE_BOUNDARY_TIMING=1.
Source decision: keep default-off diagnostic trace only; no runtime optimization promoted.

Gates:

run	result
post-guard selected default, `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE`	`13/13`
post-guard trace/timing, `MOE_SWIGLU_DOWN`	`7/7`, `50` trace lines
canonical MoE md5	`8cb0ce23777bf55f92f63d0292c756b0`
canonical dense md5	`5951a5b4d624ce891e22ab5fca9bc439`
canonical `MUL_MAT`	`1146/1146`
canonical `MUL_MAT_ID`	`806/806`

Perf / timing:

row	perf us	boundary medians
graph-enabled `MOE_SWIGLU_DOWN n=128`, trace+timing guarded	`806.271923`	capture emits `us=-1` after graph warmup
no-graph `MOE_SWIGLU_DOWN n=128`	`821.530713`	gate_up: sort `8.992`, quant `103.840`, mmq `1218.656`; down: sort `8.800`, quant `50.720`, mmq `632.768`; GLU `26.240`
no-graph `MOE_SWIGLU_DOWN n=257`	`1079.544086`	gate_up: sort `13.376`, quant `185.632`, mmq `1297.728`; down: sort `13.952`, quant `83.808`, mmq `672.096`; GLU `51.232`
no-graph `MUL_MAT_ID_RAGGED_MOE n=128`	`1255.156250`	sort `8.896`, quant `99.232`, mmq `1133.472`
no-graph `MUL_MAT_ID_RAGGED_MOE n=257`	`1531.667683`	sort `14.624`, quant `174.464`, mmq `1263.360`

Notes:

Inline CUDA events cannot be synchronized inside CUDA graph capture. The guard is required: graph-enabled timing no longer aborts, but captured sections report us=-1; use GGML_CUDA_DISABLE_GRAPHS=1 only for boundary attribution.
The route-sort bucket is small, and after the flat Phase116 result a standalone GLU/down-quant fusion is not enough either. Do not fund another small sort/tile/quant shortcut from this evidence.
Next source work should be a larger MoE pipeline: route-once metadata shared by both expert GEMMs and/or whole-pattern GEMM1->activation->GEMM2 ownership.
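
To put numbers on "the route-sort bucket is small", the no-graph boundary medians can be turned into shares. This is a minimal sketch that assumes the three medians of a standalone ragged row are comparable and roughly additive; the `shares` helper is hypothetical.

```python
def shares(boundaries):
    """Percent of the summed boundary medians spent in each boundary."""
    total = sum(boundaries.values())
    return {name: 100.0 * value / total for name, value in boundaries.items()}

# No-graph boundary medians for MUL_MAT_ID_RAGGED_MOE from the table above.
ragged_n128 = {"sort": 8.896, "quant": 99.232, "mmq": 1133.472}
ragged_n257 = {"sort": 14.624, "quant": 174.464, "mmq": 1263.360}

for label, row in (("n=128", ragged_n128), ("n=257", ragged_n257)):
    line = ", ".join(f"{k} {v:.2f}%" for k, v in shares(row).items())
    print(f"{label}: {line}")
```

On both rows sort stays near 1% of the attributed time and the MMQ launch itself dominates, which is consistent with redirecting effort away from sort-only shortcuts.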

Phase116: MoE SwiGLU Down Fused Quant

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-02-moe-swiglu-down-fused-quant-phase116.md.
Artifact: /home/mudler/bench/phase116_moe_swiglu_down_fused_quant/20260702_022611.
Env under test: LLAMA_MOE_SWIGLU_DOWN_FUSED_QUANT=1.
Source decision: rejected and reverted.

Selected gates:

run	selected gate	route marker
control	`13/13`	n/a
initial candidate	`13/13`	absent
fix1 candidate	`13/13`	present, `6` hits
post-revert	`13/13`	n/a

Perf:

op	shape	control us	fused us	candidate change
`MOE_SWIGLU_DOWN`	`n_tokens=128`	`806.332261`	`808.791633`	`-0.30%`
`MUL_MAT_ID_RAGGED_MOE`	`n=128`	`1241.147837`	`1245.063702`	`-0.32%`
`MOE_SWIGLU_DOWN`	`n_tokens=257`	`1024.895706`	`1024.685072`	`+0.02%`
`MUL_MAT_ID_RAGGED_MOE`	`n=257`	`1454.116279`	`1455.965116`	`-0.13%`

Decision:

Reject and revert Phase116.
The route is technically feasible without a new ggml op or MMQ kernel change, but fusing only SWIGLU into MMQ activation quantization is too small to move GB10 parity.
Do not retry this exact standalone fused-quant path. The next credible fused routed-MoE phase needs route-once metadata shared by both expert GEMMs plus a larger fused GEMM1/activation/GEMM2 or weighted-combine/scatter boundary.

Phase115: MoE Small-M Sentinel A/B

Date: 2026-07-02.
Plan: docs/superpowers/plans/2026-07-01-moe-small-m-sentinel-phase115.md.
Artifact: /home/mudler/bench/phase115_moe_small_m_sentinel/20260702_020258.
Env under test: LLAMA_MOE_SMALL_M_TILE=16, LLAMA_MOE_SMALL_M_TILE=32, LLAMA_MOE_SMALL_M_TILE=64.
Source decision: no source change; reject as a parity lever.

Selected gates:

env	selected gate
control	`13/13`
`LLAMA_MOE_SMALL_M_TILE=16`	`13/13`
`LLAMA_MOE_SMALL_M_TILE=32`	`13/13`
`LLAMA_MOE_SMALL_M_TILE=64`	`13/13`

Perf:

env	`MOE_SWIGLU_DOWN` 128 us	`MUL_MAT_ID_RAGGED_MOE` 128 us	`MOE_SWIGLU_DOWN` 257 us	`MUL_MAT_ID_RAGGED_MOE` 257 us
control	`809.814159`	`1247.719952`	`1021.508130`	`1452.301136`
`LLAMA_MOE_SMALL_M_TILE=16`	`804.780370`	`1241.008413`	`1020.710366`	`1455.017442`
`LLAMA_MOE_SMALL_M_TILE=32`	`809.751408`	`1242.140625`	`1021.155488`	`1458.712209`
`LLAMA_MOE_SMALL_M_TILE=64`	`807.938858`	`1247.765625`	`1021.431911`	`1456.875000`

Decision:

Reject small-M row shaping for the current stack.
This confirms the older Phase33 serving-level rejection on the newer whole-graph sentinels: smaller MoE token tiles are correctness-safe, but the 257-token ragged down path does not improve.
Do not add a down-name special case or another tile-policy shortcut. Phase116 should scope a fused routed-MoE kernel or graph-level fusion that avoids materializing intermediate activation/output traffic.

Phase114: W4A16 Padded Routing

Date: 2026-07-01.
Plan: docs/superpowers/plans/2026-07-01-w4a16-padded-routing-phase114.md.
Initial artifact: /home/mudler/bench/phase114_w4a16_padded_routing/20260701_234634_padded_meta.
Fix1 artifact: /home/mudler/bench/phase114_w4a16_padded_routing/20260701_235003_padded_meta_fix1.
Env under test: LLAMA_W4A16_PREFILL_M=128 LLAMA_W4A16_DIRECT_A=1 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_PADDED_META=1.
Source decision: rejected and reverted.

Selected gates:

run	control	candidate
initial padded metadata	`13/13`	`13/13`
fix1 with `num_tokens_post_pad` early returns	`13/13`	`13/13`
post-revert Phase112 control	`13/13`	n/a

Fix1 perf:

op	shape	Phase112 control us	Phase114 fix1 us	candidate change
`MOE_SWIGLU_DOWN`	`n_tokens=128`	`805.094932`	`804.176236`	`+0.11%`
`MUL_MAT_ID_RAGGED_MOE`	`n=128`	`1243.722356`	`1245.055288`	`-0.11%`
`MOE_SWIGLU_DOWN`	`n_tokens=257`	`1477.876106`	`1726.273196`	`-16.81%`
`MUL_MAT_ID_RAGGED_MOE`	`n=257`	`2163.346983`	`2650.932292`	`-22.54%`
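
When reading the change columns across phases, note that they appear to be computed in two slightly different ways. The sketch below is an observation from the recorded numbers, not a documented convention: the Phase114 rows reproduce with the delta normalized by control latency, while the Phase119 initial MOE_SWIGLU_DOWN n_tokens=257 row reproduces with control/candidate - 1. The function names are hypothetical.

```python
def latency_change_pct(control_us, candidate_us):
    """Positive means the candidate is faster; normalized by control latency."""
    return (control_us - candidate_us) / control_us * 100.0

def throughput_change_pct(control_us, candidate_us):
    """Positive means the candidate is faster; equals control/candidate - 1."""
    return (control_us / candidate_us - 1.0) * 100.0

# Phase114 fix1, MOE_SWIGLU_DOWN n_tokens=257 (recorded change: -16.81%).
p114 = (1477.876106, 1726.273196)
# Phase119 initial, MOE_SWIGLU_DOWN n_tokens=257 (recorded change: -1.35%).
p119 = (1015.069697, 1028.937243)

for label, (control_us, candidate_us) in (("Phase114", p114), ("Phase119", p119)):
    print(label,
          f"latency-normalized {latency_change_pct(control_us, candidate_us):+.2f}%",
          f"throughput-normalized {throughput_change_pct(control_us, candidate_us):+.2f}%")
```

The two forms agree to two decimals for sub-1% deltas, so the difference only matters for large regressions such as the n=257 rows here.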

Decision:

Reject and revert Phase114.
The vLLM-style padded metadata contract is correctness-feasible in llama.cpp, but a naive padded consumer does too much padded gather/GEMM/scatter work for sparse expert occupancy on these GB10 test rows.
Do not retry this exact padded-W4A16 route unless the kernel is changed to avoid padded activation/output traffic, or the work shifts to a true fused routed-MoE kernel where padding is part of the native tile scheduler.
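
A rough occupancy calculation shows how padding can inflate work when experts are sparsely populated. Everything in this sketch is an assumption for illustration: routing is taken as perfectly uniform, the shape (n_used=8, experts=128) is borrowed from the Phase120 marker, and the block sizes are hypothetical rather than the values the Phase114 kernel used.

```python
import math

def uniform_routing(n_tokens, n_used, n_expert):
    """Spread n_tokens * n_used routed rows as evenly as possible over experts."""
    base, extra = divmod(n_tokens * n_used, n_expert)
    return [base + (1 if e < extra else 0) for e in range(n_expert)]

def padded_rows(rows_per_expert, block_m):
    """Rows after padding each non-empty expert up to a multiple of block_m."""
    return sum(math.ceil(r / block_m) * block_m for r in rows_per_expert if r > 0)

rows = uniform_routing(n_tokens=257, n_used=8, n_expert=128)
real = sum(rows)
for block_m in (16, 64):  # hypothetical block sizes
    padded = padded_rows(rows, block_m)
    print(f"block_m={block_m}: {real} real rows -> {padded} padded rows "
          f"({padded / real:.2f}x)")
```

With about 16 rows per expert, a small block adds little, but a 64-row block roughly quadruples the gathered and scattered rows. Real routing is not uniform, so this is only a scale argument for why a naive padded consumer can lose.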

Phase113: W4A16 Direct-A GPU Tiles

Date: 2026-07-01.
Plan: docs/superpowers/plans/2026-07-01-w4a16-direct-a-gpu-tiles-phase113.md.
Artifact: /home/mudler/bench/phase113_w4a16_direct_a_gpu_tiles/20260701_233345_no_readback.
Env under test: LLAMA_W4A16_PREFILL_M=128 LLAMA_W4A16_DIRECT_A=1 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1.
Source decision: rejected and reverted.

Selected gates:

env	selected gate
Phase112 control, `DIRECT_A=1 MOE_GPU_SORT=1`	`13/13`
Phase113 candidate, plus `W4A16_GPU_TILES=1`	`13/13`
post-revert Phase112 control	`13/13`

Perf:

op	shape	Phase112 control us	Phase113 candidate us	candidate change
`MOE_SWIGLU_DOWN`	`n_tokens=128`	`808.130330`	`803.574960`	`+0.56%`
`MUL_MAT_ID_RAGGED_MOE`	`n=128`	`1242.206731`	`1239.567308`	`+0.21%`
`MOE_SWIGLU_DOWN`	`n_tokens=257`	`1478.156342`	`1476.355457`	`+0.12%`
`MUL_MAT_ID_RAGGED_MOE`	`n=257`	`2148.437500`	`2214.230603`	`-3.06%`

Canonical gates:

Skipped for the candidate because the perf gate failed.
Post-revert selected gate passed 13/13, restoring the accepted Phase112 state on DGX.

Decision:

Reject and revert Phase113.
Do not spend more time on compact GPU tile descriptors for W4A16 unless the GEMM itself consumes a vLLM-style padded metadata contract directly.
The next credible MoE phase should move toward padded aligned metadata (sorted_token_ids, expert-per-block ids, and padded row count) rather than compact descriptors plus a ragged tile map.

Phase112: W4A16 Direct Activation Staging

Date: 2026-07-01.
Plan: docs/superpowers/plans/2026-07-01-w4a16-direct-a-phase112.md.
Artifact: /home/mudler/bench/phase112_w4a16_direct_a/20260701_231749_direct_a.
Env under test: LLAMA_W4A16_PREFILL_M=128 LLAMA_W4A16_DIRECT_A=1 LLAMA_MOE_GPU_SORT=1.
Source decision: keep default-off.

Selected gates:

env	selected gate
`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`	`13/13`
`LLAMA_W4A16_PREFILL_M=128 LLAMA_W4A16_DIRECT_A=1`	`13/13`
`LLAMA_W4A16_PREFILL_M=128 LLAMA_W4A16_DIRECT_A=1 LLAMA_MOE_GPU_SORT=1`	`13/13`

Perf:

op	shape	W4A16+GPU-sort us	direct-A us	direct-A+GPU-sort us	direct-A+GPU-sort change vs control
`MOE_SWIGLU_DOWN`	`n_tokens=128`	`807.219630`	`805.847949`	`809.409493`	`-0.27%`
`MUL_MAT_ID_RAGGED_MOE`	`n=128`	`1242.664663`	`1245.671875`	`1247.674279`	`-0.40%`
`MOE_SWIGLU_DOWN`	`n_tokens=257`	`1551.081790`	`1576.045597`	`1477.738938`	`+4.73%`
`MUL_MAT_ID_RAGGED_MOE`	`n=257`	`2278.504464`	`2347.164352`	`2166.224138`	`+4.93%`

Canonical gates for direct-A+GPU-sort:

gate	result
README MoE md5	`8cb0ce23777bf55f92f63d0292c756b0`
README dense md5	`5951a5b4d624ce891e22ab5fca9bc439`
`SSM_CONV`	`45/45`
`SSM_CONV_SPLIT`	`6/6`
`GET_ROWS`	`49/49` supported rows
`GATED_DELTA_NET`	`48/48`
`MUL_MAT`	`1146/1146` supported rows
`MUL_MAT_ID`	`806/806`

Note: the older handoff snippet with -no-cnv -c 4096 produced stable but non-canonical md5s (18a4e85031694388bab85e5f5b03effc and 0764361176d94719ab94f82da12eed65) for both the direct-A candidate and the W4A16+GPU-sort control. Treat that as a harness mismatch, not a sanctioned gate. The patch-series README gate without -no-cnv and without explicit -c 4096 is the canonical md5 gate used above.

Decision:

Carry Phase112 as default-off only.
The improvement is real for the larger Phase108 MoE rows, but it only narrows the fallback path. W4A16 fallback is still not the default grouped-MMQ parity path.
Next target: either remove another W4A16 fallback boundary that remains after direct-A, or shift to a fused routed-MoE kernel that avoids fallback entirely while preserving the same md5/op gates.

Current Serving Record

Phase72 broader serving snapshot, MoE PTOK=128, GEN=64, PARALLEL=128.

Artifact:

/home/mudler/bench/phase72_ttft_min32_serving/20260701_160730

arm	n	agg_tps	decode_agg_tps	decode_perseq_tps	prefill_tps	ttft_mean_ms	wall_s
llama default	`8`	`170.4`	`231.3`	`28.42`	`1693.4`	`786.4`	`3.004`
llama min32	`8`	`158.5`	`218.4`	`26.27`	`1547.8`	`816.2`	`3.230`
vLLM	`8`	`260.0`	`305.9`	`37.32`	`4659.7`	`266.4`	`1.915`
llama default	`32`	`257.8`	`430.2`	`12.09`	`1720.4`	`2625.2`	`7.943`
llama min32	`32`	`242.7`	`411.7`	`11.58`	`1617.4`	`2881.6`	`8.439`
vLLM	`32`	`463.6`	`601.0`	`17.60`	`5496.2`	`773.7`	`4.357`
llama default	`128`	`325.8`	`714.0`	`3.92`	`1628.8`	`7822.5`	`25.148`
llama min32	`128`	`316.0`	`697.9`	`3.81`	`1606.0`	`8056.9`	`25.926`
vLLM	`128`	`666.4`	`1029.5`	`6.81`	`5292.5`	`2511.7`	`11.933`

Ratios:

n	min32/default agg	min32/default decode	min32/default TTFT	default decode/vLLM	min32 decode/vLLM
`8`	`0.9302`	`0.9442`	`1.0379`	`0.7561`	`0.7140`
`32`	`0.9414`	`0.9570`	`1.0977`	`0.7158`	`0.6850`
`128`	`0.9699`	`0.9775`	`1.0300`	`0.6935`	`0.6779`
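The ratio rows can be re-derived from the serving table. The following is a minimal Python sketch, assuming each ratio is a plain quotient of the corresponding table cells rounded to four decimals. The `PHASE72` dictionary and the `phase72_ratios` helper are illustrative names, not part of the serving harness.

```python
# Phase72 rows copied from the serving table: (agg_tps, decode_agg_tps, ttft_mean_ms).
PHASE72 = {
    8:   {"default": (170.4, 231.3, 786.4),
          "min32":   (158.5, 218.4, 816.2),
          "vllm":    (260.0, 305.9, 266.4)},
    32:  {"default": (257.8, 430.2, 2625.2),
          "min32":   (242.7, 411.7, 2881.6),
          "vllm":    (463.6, 601.0, 773.7)},
    128: {"default": (325.8, 714.0, 7822.5),
          "min32":   (316.0, 697.9, 8056.9),
          "vllm":    (666.4, 1029.5, 2511.7)},
}

def phase72_ratios(n):
    """Return the five ratio columns for concurrency n, rounded to 4 decimals."""
    d, m, v = (PHASE72[n][arm] for arm in ("default", "min32", "vllm"))
    return {
        "min32/default agg":    round(m[0] / d[0], 4),
        "min32/default decode": round(m[1] / d[1], 4),
        "min32/default TTFT":   round(m[2] / d[2], 4),
        "default decode/vLLM":  round(d[1] / v[1], 4),
        "min32 decode/vLLM":    round(m[1] / v[1], 4),
    }

if __name__ == "__main__":
    for n in sorted(PHASE72):
        print(n, phase72_ratios(n))
```

A throughput ratio below 1.0 and a TTFT ratio above 1.0 both mean the min32 arm is worse than the default arm.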

Decision:

Reject default-on for LLAMA_TTFT_PREFILL_FIRST=1 LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32.
Keep min32 as opt-in only.
The opt-in regressed aggregate throughput, decode throughput, TTFT, and wall time at every tested concurrency and widened the vLLM decode gap (decode/vLLM 0.68-0.71 with min32 versus 0.69-0.76 by default).

Attempt Log

Phase111: W4A16 GPU Tile Descriptor Probe

Date: 2026-07-01.
Plan: docs/superpowers/plans/2026-07-01-w4a16-gpu-tile-descriptors-phase111.md.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: rejected and reverted.
- Probe added default-off LLAMA_W4A16_GPU_TILES=1.
- It built W4A16 tile descriptors on GPU from Phase110 expert_bounds_dev with an atomic tile counter, then copied back one n_tiles integer for the grouped W4A16 launch dimension.
- The final source returned to the Phase110 LLAMA_MOE_GPU_SORT=1 state.
Failed build/runtime artifact: /home/mudler/bench/phase111_w4a16_gpu_tiles/20260701_230216.
Measured artifact: /home/mudler/bench/phase111_w4a16_gpu_tiles/20260701_230400_fix1.

Failure/fix notes:

attempt	result	cause
initial DGX compile	failed	`expert_bounds_for_w4a16` was typed `const int32_t *` but `mm_ids_helper` writes expert bounds
first runtime artifact `20260701_230216`	aborted	CUDA pool LIFO assert: outer `expert_bounds_dev` was allocated after inner `ids_dst_dev` but freed later
fix1 artifact `20260701_230400_fix1`	selected gates passed	allocation order corrected; `LLAMA_W4A16_GPU_TILES=1` branch traced
post-revert gate	`13/13`	source restored to Phase110 behavior
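The runtime abort in the second row is an ordering problem rather than a numerical one. The toy Python model below sketches it, assuming a pool that only allows the most recent allocation to be released. `LifoPool`, `broken_order`, and `fixed_order` are hypothetical illustrations and do not reproduce the ggml CUDA pool code.

```python
class LifoPool:
    """Toy stack allocator: the most recent allocation must be freed first."""

    def __init__(self):
        self._stack = []

    def alloc(self, name):
        self._stack.append(name)
        return name

    def free(self, name):
        # Mirrors a LIFO assert: only the top of the stack may be released.
        top = self._stack[-1] if self._stack else None
        if top != name:
            raise AssertionError(f"LIFO violation: freeing {name!r}, top is {top!r}")
        self._stack.pop()

    def live(self):
        return list(self._stack)


def broken_order(pool):
    pool.alloc("ids_dst_dev")        # inner-scope buffer allocated first
    pool.alloc("expert_bounds_dev")  # outer-scope buffer allocated second
    pool.free("ids_dst_dev")         # inner scope ends first, but it is not on top
    pool.free("expert_bounds_dev")


def fixed_order(pool):
    pool.alloc("expert_bounds_dev")  # outer-scope buffer allocated first
    pool.alloc("ids_dst_dev")        # inner-scope buffer allocated second
    pool.free("ids_dst_dev")         # inner scope ends first and is on top
    pool.free("expert_bounds_dev")
```

In this model the fix is the same as the one recorded for `fix1`: allocate the longer-lived buffer before the shorter-lived one so that scope exit order matches stack order.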

Selected gates:

env	selected gate result
`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`	`13/13`
`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1`	`13/13`
post-revert `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`	`13/13`

Clean perf A/B:

env	case	`n_tokens`	time_us	n_runs	vs Phase110 GPU-sort
`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`	`MOE_SWIGLU_DOWN`	`128`	`807.037812`	`1243`	`1.000`
`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`	`MOE_SWIGLU_DOWN`	`257`	`1531.958716`	`654`	`1.000`
`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1`	`MOE_SWIGLU_DOWN`	`128`	`802.969697`	`1254`	`0.995`
`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1`	`MOE_SWIGLU_DOWN`	`257`	`1538.542813`	`654`	`1.004`
`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`	`MUL_MAT_ID_RAGGED_MOE`	`128`	`1244.568510`	`832`	`1.000`
`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`	`MUL_MAT_ID_RAGGED_MOE`	`257`	`2250.435268`	`448`	`1.000`
`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1`	`MUL_MAT_ID_RAGGED_MOE`	`128`	`1243.544471`	`832`	`0.999`
`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1`	`MUL_MAT_ID_RAGGED_MOE`	`257`	`2295.743304`	`448`	`1.020`

Trace facts:

MOE_SWIGLU_DOWN n=257 built 128 W4A16 tiles for 2056 rows.
MUL_MAT_ID_RAGGED_MOE n=257 built 288 W4A16 tiles for 2056 rows.
The clean perf rerun omitted LLAMA_W4A16_GPU_TILES_TRACE=1; the earlier traced perf leg is preserved in the artifact but should not be used for timing.
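For orientation, the descriptor build that Phase111 moved to the GPU can be sketched on the host as a walk over per-expert row ranges. This is a minimal Python sketch under stated assumptions: `expert_bounds` is taken to be a prefix-offset array with one more entry than there are experts, and `tile_m` plus the row distribution in the usage note are hypothetical values chosen only so the totals match the traced `MOE_SWIGLU_DOWN` line. The real tile height and routing distribution are not recorded in this ledger.

```python
from itertools import accumulate


def build_tiles(expert_bounds, tile_m):
    """Split each expert's contiguous row range into tiles of at most tile_m rows.

    expert_bounds: prefix offsets into the expert-sorted rows (len = n_experts + 1).
    Returns a list of (expert, row_start, row_count) descriptors.
    """
    tiles = []
    for expert in range(len(expert_bounds) - 1):
        start, end = expert_bounds[expert], expert_bounds[expert + 1]
        for row in range(start, end, tile_m):
            tiles.append((expert, row, min(tile_m, end - row)))
    return tiles


def bounds_from_counts(rows_per_expert):
    """Turn per-expert row counts into the prefix-offset form used above."""
    return [0] + list(accumulate(rows_per_expert))
```

As a hypothetical usage example, `build_tiles(bounds_from_counts([17] * 8 + [16] * 120), 64)` yields 128 descriptors covering 2056 rows, one tile per expert. A GPU version has to assign tile slots concurrently, which is where the atomic tile counter described in the patch status comes in.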

Decision:

Reject and revert Phase111 source. Moving only the W4A16 tile descriptor build to GPU is correctness-clean after fixes, but it does not improve the parity row and slightly regresses the most relevant 257-token ragged row (2250.44 -> 2295.74 us, 1.020x).
Do not spend another phase on a one-piece W4A16 host-metadata cleanup. The next W4A16 attempt must remove a larger boundary, such as direct activation consumption plus GPU descriptors in one path, or avoid the host-sync fallback path entirely.

Phase110: GPU MoE Routing Metadata for Fallback/W4A16

Date: 2026-07-01.
Plan: docs/superpowers/plans/2026-07-01-gpu-moe-routing-metadata-phase110.md.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: new default-off CUDA source change in ggml/src/ggml-cuda/ggml-cuda.cu.
- Add LLAMA_MOE_GPU_SORT=1 to route fallback ggml_cuda_mul_mat_id metadata construction through existing ggml_cuda_launch_mm_ids_helper().
- Add a local inverse-permutation kernel because mm_ids_helper returns sorted-to-original ids_dst, while fallback get_rows_cuda() needs original-to-sorted ids_from_sorted.
- Leave graph-safe grouped-MMQ untouched.
Failed first artifact: /home/mudler/bench/phase110_gpu_moe_sort/20260701_224103.
Accepted artifact: /home/mudler/bench/phase110_gpu_moe_sort/20260701_224446_fix1.
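The permutation-direction issue behind the first artifact's `10/13` result can be shown with plain lists. This is a minimal Python sketch of inverting a permutation, assuming `ids_dst[sorted_pos]` holds the original row index. The function names are illustrative and this is not the CUDA kernel from the patch.

```python
def invert_permutation(ids_dst):
    """ids_dst[sorted_pos] = original_pos -> ids_from_sorted[original_pos] = sorted_pos."""
    ids_from_sorted = [0] * len(ids_dst)
    for sorted_pos, original_pos in enumerate(ids_dst):
        # Every original_pos occurs exactly once, so this scatter never collides.
        ids_from_sorted[original_pos] = sorted_pos
    return ids_from_sorted


def gather(rows, index):
    """out[i] = rows[index[i]], the access pattern of a row gather."""
    return [rows[i] for i in index]


if __name__ == "__main__":
    original = ["r0", "r1", "r2", "r3"]
    ids_dst = [2, 0, 3, 1]                      # sorted-to-original
    sorted_rows = gather(original, ids_dst)     # rows in expert-sorted order
    restored = gather(sorted_rows, invert_permutation(ids_dst))
    print(sorted_rows, restored)
```

Gathering the sorted rows with `ids_dst` itself instead of its inverse scrambles the output for any permutation that is not its own inverse. Because each target index is written exactly once, the same scatter can be done by one GPU thread per element without atomics.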

Initial failure and fix:

artifact	env	selected gate result	reason
`20260701_224103`	default	`13/13`	baseline clean
`20260701_224103`	`LLAMA_W4A16_PREFILL_M=128`	`13/13`	fallback baseline clean
`20260701_224103`	`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`	`10/13`	wrong permutation direction for fallback `get_rows`
`20260701_224446_fix1`	default	`13/13`	accepted fix
`20260701_224446_fix1`	`LLAMA_W4A16_PREFILL_M=128`	`13/13`	accepted fix
`20260701_224446_fix1`	`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`	`13/13`	accepted fix; trace showed branch execution

Canonical gates:

env	MoE md5	dense md5	`SSM_CONV`	`SSM_CONV_SPLIT`	`GET_ROWS`	`GATED_DELTA_NET`	`MUL_MAT`	`MUL_MAT_ID`
default	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`45/45`	`6/6`	`49/49`	`48/48`	`1146/1146`	`806/806`
`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`45/45`	`6/6`	`49/49`	`48/48`	`1146/1146`	`806/806`

Perf A/B:

env	case	`n_tokens`	time_us	n_runs	vs W4A16	vs default
default	`MOE_SWIGLU_DOWN`	`128`	`806.724859`	`1243`	n/a	`1.000`
default	`MOE_SWIGLU_DOWN`	`257`	`1022.161585`	`984`	n/a	`1.000`
`LLAMA_W4A16_PREFILL_M=128`	`MOE_SWIGLU_DOWN`	`128`	`809.339501`	`1243`	`1.000`	`1.003`
`LLAMA_W4A16_PREFILL_M=128`	`MOE_SWIGLU_DOWN`	`257`	`1656.102310`	`606`	`1.000`	`1.620`
`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`	`MOE_SWIGLU_DOWN`	`128`	`807.311344`	`1243`	`0.997`	`1.001`
`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`	`MOE_SWIGLU_DOWN`	`257`	`1536.868502`	`654`	`0.928`	`1.504`
default	`MUL_MAT_ID_RAGGED_MOE`	`128`	`1242.343750`	`832`	n/a	`1.000`
default	`MUL_MAT_ID_RAGGED_MOE`	`257`	`1453.979651`	`688`	n/a	`1.000`
`LLAMA_W4A16_PREFILL_M=128`	`MUL_MAT_ID_RAGGED_MOE`	`128`	`1248.412260`	`832`	`1.000`	`1.005`
`LLAMA_W4A16_PREFILL_M=128`	`MUL_MAT_ID_RAGGED_MOE`	`257`	`2428.586538`	`416`	`1.000`	`1.670`
`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`	`MUL_MAT_ID_RAGGED_MOE`	`128`	`1247.145433`	`832`	`0.999`	`1.004`
`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`	`MUL_MAT_ID_RAGGED_MOE`	`257`	`2237.145089`	`448`	`0.921`	`1.539`

Decision:

Keep Phase110 as a default-off structural base. It is md5/op clean after the inverse-permutation fix and confirms vLLM-style GPU route metadata can replace the CPU id scan for the host-sync fallback path.
Do not promote it as a speed parity lever by itself. With GPU sort, the W4A16 fallback improves by 7.2% on MOE_SWIGLU_DOWN n=257 and 7.9% on MUL_MAT_ID_RAGGED_MOE n=257, but it remains about 1.5x slower than the default grouped-MMQ path (1.504x and 1.539x).
Phase111 should only build on this if it removes another fallback bottleneck: either the remaining expert_bounds host copy / host tile descriptor build, or a grouped W4A16 path that can consume GPU expert bounds directly.

Phase109: Existing MoE Prefill and Tile-Policy A/B

Date: 2026-07-01.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: no new source changes. This was an env-only benchmark attempt using the Phase108 perf CSV harness.
Artifact: /home/mudler/bench/phase109_existing_moe_prefill_ab/20260701_222559.

Perf A/B:

env	case	`n_tokens`	time_us	n_runs	vs default
default	`MOE_SWIGLU_DOWN`	`128`	`800.802233`	`1254`	`1.000`
default	`MOE_SWIGLU_DOWN`	`257`	`1008.593373`	`996`	`1.000`
`LLAMA_W4A16_PREFILL_M=128`	`MOE_SWIGLU_DOWN`	`128`	`805.747385`	`1243`	`1.006`
`LLAMA_W4A16_PREFILL_M=128`	`MOE_SWIGLU_DOWN`	`257`	`1646.679739`	`612`	`1.633`
`LLAMA_FP4_PREFILL_M=128`	`MOE_SWIGLU_DOWN`	`128`	`806.103781`	`1243`	`1.007`
`LLAMA_FP4_PREFILL_M=128`	`MOE_SWIGLU_DOWN`	`257`	`4070.191057`	`246`	`4.035`
`LLAMA_MOE_DENSITY_MAX=9`	`MOE_SWIGLU_DOWN`	`128`	`810.080451`	`1243`	`1.012`
`LLAMA_MOE_DENSITY_MAX=9`	`MOE_SWIGLU_DOWN`	`257`	`1024.869121`	`978`	`1.016`
`LLAMA_MOE_MMQ_X=64`	`MOE_SWIGLU_DOWN`	`128`	`806.358005`	`1243`	`1.007`
`LLAMA_MOE_MMQ_X=64`	`MOE_SWIGLU_DOWN`	`257`	`1008.191767`	`996`	`1.000`
default	`MUL_MAT_ID_RAGGED_MOE`	`128`	`1241.417067`	`832`	`1.000`
default	`MUL_MAT_ID_RAGGED_MOE`	`257`	`1445.333807`	`704`	`1.000`
`LLAMA_W4A16_PREFILL_M=128`	`MUL_MAT_ID_RAGGED_MOE`	`128`	`1242.049279`	`832`	`1.001`
`LLAMA_W4A16_PREFILL_M=128`	`MUL_MAT_ID_RAGGED_MOE`	`257`	`2518.852500`	`400`	`1.743`
`LLAMA_FP4_PREFILL_M=128`	`MUL_MAT_ID_RAGGED_MOE`	`128`	`1244.775240`	`832`	`1.003`
`LLAMA_FP4_PREFILL_M=128`	`MUL_MAT_ID_RAGGED_MOE`	`257`	`2898.838068`	`352`	`2.006`
`LLAMA_MOE_DENSITY_MAX=9`	`MUL_MAT_ID_RAGGED_MOE`	`128`	`1247.564904`	`832`	`1.005`
`LLAMA_MOE_DENSITY_MAX=9`	`MUL_MAT_ID_RAGGED_MOE`	`257`	`1438.245739`	`704`	`0.995`
`LLAMA_MOE_MMQ_X=64`	`MUL_MAT_ID_RAGGED_MOE`	`128`	`1246.139423`	`832`	`1.004`
`LLAMA_MOE_MMQ_X=64`	`MUL_MAT_ID_RAGGED_MOE`	`257`	`1434.058239`	`704`	`0.992`

MOE_WEIGHTED_COMBINE spot rows:

env	`n_tokens=128`	`n_tokens=257`
default	`27.695333`	`67.423746`
`LLAMA_W4A16_PREFILL_M=128`	`27.502254`	`95.550477`
`LLAMA_FP4_PREFILL_M=128`	`27.687500`	`229.421474`

Correctness gates:

env	selected gate result
default	`13/13`
`LLAMA_W4A16_PREFILL_M=128`	`13/13`
`LLAMA_FP4_PREFILL_M=128`	`13/13`
`LLAMA_MOE_DENSITY_MAX=9`	`13/13`
`LLAMA_MOE_MMQ_X=64`	`13/13`

Trace notes:

The default/density route remained CUDA-graph-safe grouped MMQ: route=mmq host_sync=0.
For the 257-token ragged row the traced launch uses ncols_dst=2056, ncols_max=257, mmq_x=96, stream_k_blocks == ntiles_dst, and fixup=0.
For 128-token rows the current default already selects mmq_x=64; raising density or forcing 64 does not open a new path.

Decision:

Reject existing W4A16 and FP4 large-M env routes for these Phase108 MoE sentinel rows. They are correctness-clean but slower, especially at n_tokens=257.
Reject LLAMA_MOE_DENSITY_MAX=9 and LLAMA_MOE_MMQ_X=64 as parity levers. The best MUL_MAT_ID_RAGGED_MOE improvement is only 0.5-0.8% and MOE_SWIGLU_DOWN is flat or worse.
Do not spend Phase110 on another MMQ tile-policy shortcut.
Next implementation should target the structural gap identified by the vLLM audit: build routed-MoE sorted token/expert metadata on GPU and remove the host ID readback/sync path from the grouped fallback/W4A16 path, while keeping the graph-safe MMQ path untouched.

Phase108: MoE Whole-Graph Perf CSV Harness

Date: 2026-07-01.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: measurement-only source change in tests/test-backend-ops.cpp.
- Add existing MOE_SWIGLU_DOWN, MOE_WEIGHTED_COMBINE, and MUL_MAT_ID_RAGGED_MOE whole-graph cases to make_test_cases_perf() for n_tokens=128 and 257.
- Expand --output csv to use test_result::get_fields(), which includes time_us, flops, bandwidth_gb_s, memory_kb, and n_runs.
Artifact: /home/mudler/bench/phase108_moe_perf_csv/20260701_221559.

RED condition from Phase107:

command	Phase107 result
`test-backend-ops perf -b CUDA0 -o MOE_SWIGLU_DOWN --output csv`	zero rows
`test-backend-ops perf -b CUDA0 -o MOE_WEIGHTED_COMBINE --output csv`	zero rows
`test-backend-ops perf -b CUDA0 -o MUL_MAT_ID_RAGGED_MOE --output csv`	zero rows

Perf rows after patch:

case	params	time_us	n_runs	flops
`MOE_SWIGLU_DOWN`	`type_a=nvfp4,n_mats=128,n_used=8,n_ff=768,n_tokens=128,n_embd=2048`	`801.764753`	`1254`	`12053007297164.449219`
`MOE_SWIGLU_DOWN`	`type_a=nvfp4,n_mats=128,n_used=8,n_ff=768,n_tokens=257,n_embd=2048`	`1019.953252`	`984`	`19023274120980.359375`
`MOE_WEIGHTED_COMBINE`	`type_a=nvfp4,n_mats=128,n_used=8,n_ff=768,n_tokens=128,n_embd=2048`	`27.550055`	`36320`	`117074893979840.453125`
`MOE_WEIGHTED_COMBINE`	`type_a=nvfp4,n_mats=128,n_used=8,n_ff=768,n_tokens=257,n_embd=2048`	`67.593041`	`14800`	`95809244446043.828125`
`MUL_MAT_ID_RAGGED_MOE`	`type_a=nvfp4,n_mats=256,n_used=8,m=768,n=128,k=2048`	`1239.103365`	`832`	`2599642259062.170898`
`MUL_MAT_ID_RAGGED_MOE`	`type_a=nvfp4,n_mats=256,n_used=8,m=768,n=257,k=2048`	`1445.950284`	`704`	`4472917803025.495117`
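When reading the `flops` column, note that the two `MUL_MAT_ID_RAGGED_MOE` values are numerically consistent with a rate (operations per second) of `2 * m * k * n * n_used` operations per run divided by `time_us`. This is an inference from the numbers in the table, not something the ledger states, and the helper below is a hypothetical Python sketch for checking it.

```python
def mul_mat_id_flops_per_s(m, n, k, n_used, time_us):
    """Assumed cost model: one multiply-add (2 ops) per (m, k, n, n_used) element."""
    ops_per_run = 2 * m * k * n * n_used
    return ops_per_run / (time_us * 1e-6)


if __name__ == "__main__":
    # Shapes and timings copied from the two MUL_MAT_ID_RAGGED_MOE rows.
    for n, time_us in ((128, 1239.103365), (257, 1445.950284)):
        print(n, mul_mat_id_flops_per_s(768, n, 2048, 8, time_us))
```

Under this model the n=257 row does about twice the work of the n=128 row in only about 17% more time, which is why its rate is higher.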

Safety gates:

gate	result
MoE md5	`8cb0ce23777bf55f92f63d0292c756b0`
dense md5	`5951a5b4d624ce891e22ab5fca9bc439`
`MOE_SWIGLU_DOWN`	`7/7`
`MOE_WEIGHTED_COMBINE`	`7/7`
`MUL_MAT_ID_RAGGED_MOE`	`6/6`
`SSM_CONV`	`45/45`
`SSM_CONV_SPLIT`	`6/6`
`GET_ROWS`	`49/49`
`GATED_DELTA_NET`	`48/48`
`MUL_MAT`	`1146/1146`
`MUL_MAT_ID`	`806/806`

Notes:

The first md5 attempt in gates/ used -no-cnv and intentionally failed against the canonical chat-template hashes. The corrected historical gate is in gates_chat/ and passed.
CSV output is now a usable perf ledger for these cases; the schema includes timing columns instead of support metadata only.

Decision:

Phase108 closes the Phase107 measurement gap; it is not a parity-improving runtime patch by itself.
The dominant focused rows are MUL_MAT_ID_RAGGED_MOE (1239-1446 us/run) and MOE_SWIGLU_DOWN (802-1020 us/run), not MOE_WEIGHTED_COMBINE (28-68 us/run).
Next fused-MoE work should target the routed matmul/SWIGLU/down chain and must report deltas against these Phase108 rows plus the same md5/op gates.

Phase107: Fused-MoE Structural Guardrail

Date: 2026-07-01.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: no new source changes. This was a correctness and measurement-surface attempt for the next structural fused routed-MoE path.
Artifact: /home/mudler/bench/phase107_moe_fusion_guardrail/20260701_220227.

Correctness guardrails:

guard	result
`MOE_SWIGLU_DOWN`	`7/7`
`MOE_WEIGHTED_COMBINE`	`7/7`
`MUL_MAT_ID_RAGGED_MOE`	`6/6`

Perf-output check:

command	result
`test-backend-ops perf -b CUDA0 -o MOE_SWIGLU_DOWN --output csv`	zero rows
`test-backend-ops perf -b CUDA0 -o MOE_WEIGHTED_COMBINE --output csv`	zero rows
`test-backend-ops perf -b CUDA0 -o MUL_MAT_ID_RAGGED_MOE --output csv`	zero rows
`test-backend-ops perf -b CUDA0 -o MUL_MAT_ID --output csv`	`116` support rows, `63` relevant rows, but no timing columns

Decision:

Existing correctness guardrails are sufficient to protect the three structural MoE surfaces before a future source change.
Existing test-backend-ops perf output is not sufficient as a performance guard for these custom whole-graph cases because it emits support metadata, not timings.
The next source patch should be measurement-only: a narrow MoE fusion timing harness that emits case,iterations,total_ms,mean_ms for the selected MOE_SWIGLU_DOWN, MOE_WEIGHTED_COMBINE, and MUL_MAT_ID_RAGGED_MOE shapes.
Do not start fused routed-MoE kernel implementation until that timing harness proves which sub-surface is large enough to move Phase104/106 serving.

Phase106: Max-Concurrency Current-Stack Serving

Date: 2026-07-01.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: no new source changes. This was a measurement-only serving-contract attempt on top of the carried Phase101/102 default-off cleanup candidates.
Harness: streamed paged-current-serving-snapshot.sh with:
- source-log workaround for the non-git DGX mirror,
- paged env LLAMA_SSM_CONV_SPLIT=1 LLAMA_PAGED_KV_GET_ROWS_F16=1,
- expanded gate ops: SSM_CONV,SSM_CONV_SPLIT,GET_ROWS,GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID,
- NPL=128 192 256, PTOK=128, GEN=64, PARALLEL=256, CTX=131072, BATCH=2048, UBATCH=512, VLLM_MAX_NUM_SEQS=256.
Artifacts:
- dry-run: /home/mudler/bench/phase106_max_concurrency_current_stack/20260701_214839_dryrun,
- full sweep: /home/mudler/bench/phase106_max_concurrency_current_stack/20260701_214907.

Safety gates:

phase	env	MoE md5	dense md5	`SSM_CONV`	`SSM_CONV_SPLIT`	`GET_ROWS`	`GATED_DELTA_NET`	`MUL_MAT`	`MUL_MAT_ID`
pre	split + F16 K/V rows	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`45/45`	`6/6`	`49/49`	`48/48`	`1146/1146`	`806/806`
post	split + F16 K/V rows	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`45/45`	`6/6`	`49/49`	`48/48`	`1146/1146`	`806/806`

Serving snapshot:

arm	n	agg_tps	decode_agg_tps	decode_perseq_tps	prefill_tps	ttft_mean_ms	wall_s
paged combined	`128`	`331.8`	`678.9`	`3.90`	`1734.1`	`7392.5`	`24.689`
paged combined	`192`	`318.4`	`681.8`	`2.50`	`1602.4`	`11058.0`	`38.595`
paged combined	`256`	`338.4`	`824.6`	`2.10`	`1542.8`	`14933.5`	`48.410`
vLLM	`128`	`663.4`	`1029.8`	`6.78`	`5228.9`	`2514.6`	`11.970`
vLLM	`192`	`709.8`	`1202.4`	`4.98`	`4881.5`	`3674.8`	`16.769`
vLLM	`256`	`723.8`	`1320.4`	`3.94`	`4520.9`	`4999.0`	`21.931`

Ratios:

n	paged decode/vLLM	paged perseq/vLLM	paged agg/vLLM	paged TTFT/vLLM
`128`	`0.6593`	`0.5752`	`0.5002`	`2.9398`
`192`	`0.5670`	`0.5020`	`0.4486`	`3.0091`
`256`	`0.6245`	`0.5330`	`0.4675`	`2.9873`

Decision:

Reject C1 as a GB10 parity lever for the current stack.
llama.cpp completed N=256, but vLLM also completed N=256 under the same harness cap and remained materially faster.
Higher concurrency did not reveal an aggregate operating point where llama.cpp catches vLLM: paged aggregate stayed around 318-338 t/s, while vLLM rose to 724 t/s.
TTFT widened with higher concurrency on llama.cpp (7392.5 -> 14933.5 ms) and stayed much lower on vLLM (2514.6 -> 4999.0 ms).
The next phase should not be another scheduler or MMQ micro-policy. The remaining plausible source work is structural: persistent batch state, fused routed-MoE dispatch, or a larger GDN/packed-decode design with new guardrails.

Phase105: Current-Stack MoE MMQ Shape Refresh

Date: 2026-07-01.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: no new source changes. This was a measurement-only attempt on top of the carried Phase101/102 default-off cleanup candidates.
Env for trace legs: LLAMA_SSM_CONV_SPLIT=1 LLAMA_PAGED_KV_GET_ROWS_F16=1.
Artifacts:
- gates: /home/mudler/bench/phase105_mmq_current_shape/20260701_213927,
- serving trace retry: /home/mudler/bench/phase105_mmq_current_shape/20260701_214129_serving_retry.

Safety gates:

gate	env	result
`MUL_MAT_ID_RAGGED_MOE`	default	`6/6`
`MUL_MAT_ID_RAGGED_MOE`	split + F16 K/V rows + shape traces	`6/6`
`MUL_MAT_ID`	split + F16 K/V rows	`806/806`

Trace refresh:

source	shape lines	launch lines	small-M lines	shape summary	launch summary
ragged gate	`3`	`3`	`2`	density `2/4/9`, `mmq_x_best 40/64/96`	`fixup=0`, `stream_k_blocks == ntiles_dst`
one live serving request	`120`	`120`	`0`	`ncols_max=317`, density `10`, `mmq_x_best=112`, `stream_k=1`	`fixup=0`, `stream_k_blocks == ntiles_dst` (`120/120`), efficiency `100`

Notes:

The first live-serving trace leg used the wrong model path and exited before loading the model. It is preserved in the gate artifact as a harness hiccup, not an inference failure.
The serving retry used ~/bench/q36-35b-a3b-nvfp4.gguf; the request returned a non-empty response (3648 bytes). The wrapper's nonzero exit came from grep under pipefail: grep exits with status 1 when it matches no lines, and there were zero SMALL_M lines.

Decision:

The current Phase104 stack did not create a new cheap grouped-MMQ lever.
The trace reconfirms that no-fixup/no-stream-k shortcuts are closed for this workload, and the live sampled shape is prefill-like rather than a new small-M decode class.
Do not pursue another host-side MMQ tile policy. Any next MMQ work must be a structural kernel or serving-contract change with a clear path to reducing the dominant mmq_nvfp4 bucket.
Given prior GDN micro-kernel rejections, the next high-value phase should be a larger serving contract or a new structural design, not more isolated micro-knobs.

Phase104: Combined Cleanup Normal Serving Snapshot vs vLLM

Date: 2026-07-01.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: no new source changes beyond the carried Phase101/102 default-off runtime candidates.
Harness: streamed paged-current-serving-snapshot.sh with:
- source-log workaround for the non-git DGX mirror,
- paged env LLAMA_SSM_CONV_SPLIT=1 LLAMA_PAGED_KV_GET_ROWS_F16=1,
- expanded gate ops: SSM_CONV,SSM_CONV_SPLIT,GET_ROWS,GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID,
- NPL=128, PTOK=128, GEN=64, PARALLEL=128, CTX=131072, BATCH=2048, UBATCH=512.
Artifact: /home/mudler/bench/phase104_combined_serving_snapshot/20260701_212551.

Safety gates:

phase	env	MoE md5	dense md5	`SSM_CONV`	`SSM_CONV_SPLIT`	`GET_ROWS`	`GATED_DELTA_NET`	`MUL_MAT`	`MUL_MAT_ID`
pre	split + F16 K/V rows	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`45/45`	`6/6`	`49/49`	`48/48`	`1146/1146`	`806/806`
post	split + F16 K/V rows	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`45/45`	`6/6`	`49/49`	`48/48`	`1146/1146`	`806/806`

Serving snapshot, MoE PTOK=128, GEN=64, PARALLEL=128, N=128:

arm	n	agg_tps	decode_agg_tps	decode_perseq_tps	prefill_tps	ttft_mean_ms	wall_s
paged combined	`128`	`338.6`	`675.8`	`3.93`	`1813.0`	`7121.6`	`24.196`
vLLM	`128`	`661.1`	`1028.0`	`6.80`	`5208.7`	`2572.3`	`11.980`

Ratios:

n	paged decode/vLLM	paged perseq/vLLM	paged agg/vLLM	paged TTFT/vLLM
`128`	`0.6574`	`0.5779`	`0.5122`	`2.7686`

Comparison to Phase97 Phase93-only normal serving:

metric	Phase97	Phase104 combined	change
`agg_tps`	`329.6`	`338.6`	`+2.73%`
`decode_agg_tps`	`669.8`	`675.8`	`+0.90%`
`prefill_tps`	`1734.5`	`1813.0`	`+4.53%`
`ttft_mean_ms`	`7415.4`	`7121.6`	`-3.96%`
`wall_s`	`24.851`	`24.196`	`-2.64%`
`paged_decode_over_vllm`	`0.6507`	`0.6574`	`+0.0067`
`paged_agg_over_vllm`	`0.4958`	`0.5122`	`+0.0164`

Decision:

The combined cleanup stack has a small real serving benefit outside nsys.
It does not change the parity conclusion: vLLM is still about 1.52x faster on decode aggregate and 1.95x faster on aggregate throughput at this shape.
Carry the combined cleanup env as the best current comparison baseline.
Next source work should target the remaining high-impact gap, not another isolated layout cleanup. The current evidence points to larger serving contracts or the dominant GDN/MMQ buckets.

Phase103: Combined Layout Cleanup Stack

Date: 2026-07-01.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: no new source changes beyond the Phase101 and Phase102 default-off runtime candidates.
Env: LLAMA_SSM_CONV_SPLIT=1 LLAMA_PAGED_KV_GET_ROWS_F16=1.
Artifacts:
- standalone combined gates: /home/mudler/bench/phase103_combined_layout_cleanups/20260701_211632/gates_combined,
- combined serving profile: /home/mudler/bench/phase103_combined_layout_cleanups/20260701_211821/serving_profile.

Safety gates:

gate	env	MoE md5	dense md5	`SSM_CONV`	`SSM_CONV_SPLIT`	`GET_ROWS`	`GATED_DELTA_NET`	`MUL_MAT`	`MUL_MAT_ID`
standalone combined	split + F16 K/V rows	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`45/45`	`6/6`	`49/49`	`48/48`	`1146/1146`	`806/806`
serving pre combined	split + F16 K/V rows	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`45/45`	`6/6`	`49/49`	`48/48`	`1146/1146`	`806/806`
serving post combined	split + F16 K/V rows	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`45/45`	`6/6`	`49/49`	`48/48`	`1146/1146`	`806/806`

Serving under combined graph-node profiling:

metric	value
aggregate t/s	`212.3`
decode aggregate t/s	`331.5`
decode per-seq t/s	`2.13`
prefill t/s	`1569.1`
TTFT mean ms	`7858.5`
wall s	`38.575`
total kernel time	`19.5519 s`

Fine bucket comparison:

bucket	Phase101 opt-in	Phase102 opt-in	Phase103 combined	Phase103 vs Phase102
`convert_dtype`	`661.35 ms`	`663.99 ms`	`662.36 ms`	`-1.63 ms`
`copy_layout`	`80.32 ms`	`112.53 ms`	`78.22 ms`	`-34.31 ms`
`concat_layout`	`433.13 ms`	`4.59 ms`	`12.51 ms`	`+7.92 ms`
`layout-copy` macro	`1220.30 ms`	`826.87 ms`	`798.52 ms`	`-28.35 ms`
`get_rows`	`277.67 ms`	`278.61 ms`	`278.61 ms`	`0.00 ms`
`gdn_conv`	`453.54 ms`	`383.90 ms`	`390.08 ms`	`+6.18 ms`
`gdn_core`	`5886.76 ms`	`5940.33 ms`	`5930.47 ms`	`-9.86 ms`
`mmq_nvfp4`	`6193.70 ms`	`5987.09 ms`	`6001.77 ms`	`+14.68 ms`

Decision:

Correctness-clean combined stack. The two cleanup candidates are compatible.
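The combination improves traced serving over Phase102 (aggregate 206.1 -> 212.3 t/s, decode 320.0 -> 331.5 t/s) and recovers the Phase101 copy_layout reduction while largely preserving the Phase102 concat removal (concat_layout 12.51 ms versus 4.59 ms).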
It is still not a parity-closing lever. Dominant buckets remain gdn_core 5930.47 ms and mmq_nvfp4 6001.77 ms, far larger than the residual layout buckets.
Carry Phase101+Phase102 as a combined default-off cleanup stack for future comparisons. Next source work should not spend more time on isolated layout-copy cleanup unless it also changes a serving-critical contract.

Phase102: Split-Input `SSM_CONV` Prefill Path

Date: 2026-07-01.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: default-off runtime candidate:
- adds ggml_ssm_conv_split(ctx, conv_states, x_cur, conv_kernel) while reusing GGML_OP_SSM_CONV,
- adds CPU and CUDA split-input implementations plus SSM_CONV_SPLIT tests,
- wires Qwen3Next/Qwen35/Qwen35MoE through LLAMA_SSM_CONV_SPLIT=1 only for n_seq_tokens > 1, n_seq_tokens >= K-1, and cparams.n_rs_seq == 0,
- keeps decode fused and rollback/short-prefill cases on the existing path.
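The routing conditions listed above can be summarized as one predicate. This is a minimal Python sketch, assuming `K` is the convolution kernel width (the `d_conv` value mentioned in the debug note). The function name and argument layout are hypothetical and do not mirror the model-graph code.

```python
def use_ssm_conv_split(env_enabled, n_seq_tokens, d_conv, n_rs_seq):
    """Return True when the split-input SSM_CONV path would be selected.

    env_enabled:  LLAMA_SSM_CONV_SPLIT=1 is set.
    n_seq_tokens: tokens per sequence in the current ubatch.
    d_conv:       conv kernel width K (assumption: K == d_conv).
    n_rs_seq:     cparams.n_rs_seq; nonzero means the rollback path stays on the old route.
    """
    return (
        env_enabled
        and n_seq_tokens > 1            # decode (one token per sequence) stays fused
        and n_seq_tokens >= d_conv - 1  # short prefill keeps the existing path
        and n_rs_seq == 0
    )
```

With `d_conv=4`, a 2-token chunk is rejected because it is shorter than `K-1`, while a chunk of 3 or more tokens qualifies.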
Local build: cmake --build build --target test-backend-ops -j $(nproc).
DGX build: cmake --build /home/mudler/llama-phase93-qwen3next-gqa-bcast/build --target llama-server llama-completion test-backend-ops -j $(nproc).
Debug note: the first split-minus-base test used the default normalized-MSE metric and failed with ERR = inf for d_conv=4 because the CPU reference is exactly zero. A direct split CUDA-vs-CPU diagnostic passed 6/6; the final semantic test keeps split - base and uses absolute max error.
Artifacts:
- default/opt-in standalone gates: /home/mudler/bench/phase102_ssm_conv_split/20260701_210559,
- opt-in serving profile: /home/mudler/bench/phase102_ssm_conv_split/20260701_210907/serving_profile.
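The metric problem from the debug note can be reproduced in a few lines. This is a minimal Python sketch, assuming the normalized metric divides the squared error by the squared norm of the reference. The function names are illustrative and the zero-denominator branch imitates IEEE float division, which Python would otherwise turn into an exception.

```python
import math


def nmse(out, ref):
    """Normalized MSE: sum((out - ref)^2) / sum(ref^2), with IEEE-style division by zero."""
    num = sum((o - r) ** 2 for o, r in zip(out, ref))
    den = sum(r * r for r in ref)
    if den == 0.0:
        return math.inf if num > 0.0 else math.nan
    return num / den


def max_abs_err(out, ref):
    """Absolute max error, well defined even when the reference is all zeros."""
    return max(abs(o - r) for o, r in zip(out, ref))


if __name__ == "__main__":
    ref = [0.0, 0.0, 0.0, 0.0]   # split - base is exactly zero on the reference backend
    out = [1e-7, 0.0, 0.0, 0.0]  # hypothetical tiny rounding difference on the device
    print(nmse(out, ref), max_abs_err(out, ref))
```

Any nonzero device-side difference against an all-zero reference makes the normalized metric infinite, so a difference test of this kind needs an absolute tolerance instead.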

Safety gates:

gate	env	MoE md5	dense md5	`SSM_CONV`	`SSM_CONV_SPLIT`	`GET_ROWS`	`GATED_DELTA_NET`	`MUL_MAT`	`MUL_MAT_ID`
default	none	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`45/45`	`6/6`	`49/49`	`48/48`	`1146/1146`	`806/806`
standalone opt-in	`LLAMA_SSM_CONV_SPLIT=1`	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`45/45`	`6/6`	`49/49`	`48/48`	`1146/1146`	`806/806`
serving pre opt-in	`LLAMA_SSM_CONV_SPLIT=1`	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`45/45`	`6/6`	`49/49`	`48/48`	`1146/1146`	`806/806`
serving post opt-in	`LLAMA_SSM_CONV_SPLIT=1`	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`45/45`	`6/6`	`49/49`	`48/48`	`1146/1146`	`806/806`

Serving under opt-in graph-node profiling:

metric	value
aggregate t/s	`206.1`
decode aggregate t/s	`320.0`
decode per-seq t/s	`2.06`
prefill t/s	`1538.0`
TTFT mean ms	`7928.4`
wall s	`39.743`
total kernel time	`19.5482 s`

Fine bucket comparison:

bucket	Phase100	Phase101 opt-in	Phase102 opt-in	Phase102 vs Phase101
`convert_dtype`	`661.73 ms`	`661.35 ms`	`663.99 ms`	`+2.64 ms`
`copy_layout`	`116.25 ms`	`80.32 ms`	`112.53 ms`	`+32.21 ms`
`concat_layout`	`438.15 ms`	`433.13 ms`	`4.59 ms`	`-428.54 ms`
`layout-copy` macro	`1262.58 ms`	`1220.30 ms`	`826.87 ms`	`-393.43 ms`
`get_rows`	`283.47 ms`	`277.67 ms`	`278.61 ms`	`+0.94 ms`
`gdn_conv`	`458.13 ms`	`453.54 ms`	`383.90 ms`	`-69.64 ms`
`gdn_core`	`5919.48 ms`	`5886.76 ms`	`5940.33 ms`	`+53.57 ms`
`mmq_nvfp4`	`6127.44 ms`	`6193.70 ms`	`5987.09 ms`	`-206.61 ms`

Decision:

Correctness-clean and structurally useful: the split op removes the large concat materialization from the eligible prefill/microbatch path.
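It does not improve live serving throughput in the profiled N=128, PTOK=128, GEN=64, PARALLEL=128 window: aggregate (206.1 t/s) and decode (320.0 t/s) are below the Phase100 (207.0/327.9) and Phase101 (206.4/328.0) traced profiles despite lower total kernel time (19.5482 s versus 20.3464 s and 20.1989 s).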
Carry as a default-off cleanup candidate pending repeat A/B or a follow-up that fuses the remaining state update/copy work. Do not promote as a parity lever by itself.
Next higher-value work should target the still-dominant buckets: gdn_core and mmq_nvfp4, or a larger serving scheduler/packed-decode contract.

Phase101: Paged K/V F16 `GET_ROWS` A/B

Date: 2026-07-01.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: default-off runtime candidate:
- ggml_get_rows_type(ctx, a, b, type) helper added while preserving stock ggml_get_rows widening semantics,
- CPU reference supports F16 source -> F16 output row copy,
- CUDA already supports F16 GET_ROWS output through get_rows_cuda,
- paged attention K/V gather calls typed F16 GET_ROWS only when LLAMA_PAGED_KV_GET_ROWS_F16=1 and the K/V cache tensor is F16,
- tests add F16-output GET_ROWS cases.
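Skipping the widen-then-downcast step is value-preserving because every finite F16 value is exactly representable in F32. The Python sketch below checks that over all finite half-precision bit patterns using the standard `struct` half-float format. It models only the numeric round trip, not the gather itself, and the function names are illustrative.

```python
import struct


def f16_bits_to_f32(bits):
    """Decode a 16-bit pattern as IEEE half, then force it through binary32."""
    (half_value,) = struct.unpack("<e", struct.pack("<H", bits))
    (f32_value,) = struct.unpack("<f", struct.pack("<f", half_value))
    return f32_value


def f32_to_f16_bits(value):
    """Round a float to IEEE half and return its 16-bit pattern."""
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    return bits


def widening_path(cache_row):
    """Stock route: F16 cache -> F32 gather output -> F16 copy."""
    return [f32_to_f16_bits(f16_bits_to_f32(b)) for b in cache_row]


def typed_path(cache_row):
    """Typed route: F16 cache -> F16 gather output, a plain row copy."""
    return list(cache_row)


if __name__ == "__main__":
    # All patterns whose exponent field is not all ones (skips inf and NaN).
    finite = [b for b in range(0x10000) if (b & 0x7C00) != 0x7C00]
    print(widening_path(finite) == typed_path(finite))
```

Bit-identical results on both routes are consistent with the md5 gates staying unchanged under the opt-in.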
Local build: cmake --build build --target test-backend-ops -j $(nproc).
DGX build: cmake --build /home/mudler/llama-phase93-qwen3next-gqa-bcast/build --target llama-server llama-completion test-backend-ops -j $(nproc).
Artifacts:
- default gates: /home/mudler/bench/phase101_kv_get_rows_f16/20260701_203621/gates_default,
- opt-in gates: /home/mudler/bench/phase101_kv_get_rows_f16/20260701_203754/gates_optin,
- opt-in serving profile: /home/mudler/bench/phase101_kv_get_rows_f16/20260701_203930/serving_profile.

Safety gates:

gate	env	MoE md5	dense md5	`GET_ROWS`	`GATED_DELTA_NET`	`MUL_MAT`	`MUL_MAT_ID`
default	none	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`49/49`	`48/48`	`1146/1146`	`806/806`
standalone opt-in	`LLAMA_PAGED_KV_GET_ROWS_F16=1`	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`49/49`	`48/48`	`1146/1146`	`806/806`
serving pre opt-in raw log	`LLAMA_PAGED_KV_GET_ROWS_F16=1`	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`49/49`	`48/48`	`1146/1146`	`806/806`
serving post opt-in raw log	`LLAMA_PAGED_KV_GET_ROWS_F16=1`	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`49/49`	`48/48`	`1146/1146`	`806/806`

Serving under opt-in graph-node profiling:

metric	value
aggregate t/s	`206.4`
decode aggregate t/s	`328.0`
decode per-seq t/s	`2.08`
prefill t/s	`1479.6`
TTFT mean ms	`8211.1`
wall s	`39.678`
total kernel time	`20.1989 s`

Fine bucket comparison against Phase100:

bucket	Phase100	Phase101 opt-in	change
`convert_dtype`	`661.73 ms`	`661.35 ms`	`-0.38 ms`
`copy_layout`	`116.25 ms`	`80.32 ms`	`-35.93 ms`
`concat_layout`	`438.15 ms`	`433.13 ms`	`-5.02 ms`
`layout-copy` macro	`1262.58 ms`	`1220.30 ms`	`-42.28 ms`
`get_rows`	`283.47 ms`	`277.67 ms`	`-5.80 ms`
`gdn_core`	`5919.48 ms`	`5886.76 ms`	`-32.72 ms`
`mmq_nvfp4`	`6127.44 ms`	`6193.70 ms`	`+66.26 ms`

Decision:

Correctness-clean but not parity-closing.
The hypothesis that a typed F16 K/V gather would materially reduce convert_dtype did not hold for this serving window; convert_dtype stayed flat (661.73 -> 661.35 ms).
The patch does remove some copy_layout work and keeps md5/op gates green, so it can remain as a small default-off cleanup candidate, but it should not be promoted or treated as the main parity path without a repeat serving A/B.
Next higher-value runtime work remains either the two-source SSM_CONV contract for conv_input or a larger GDN/MMQ serving lever.

Phase100: Layout Trace View-Source Attribution

Date: 2026-07-01.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: trace-only source change in ggml/src/ggml-cuda/ggml-cuda.cu; LLAMA_LAYOUT_TRACE now prints dst_view, src0_view, and src1_view. Default execution is unchanged.
Local build: cmake --build build --target test-backend-ops -j $(nproc).
DGX build: cmake --build /home/mudler/llama-phase93-qwen3next-gqa-bcast/build --target llama-server llama-completion test-backend-ops -j $(nproc).
Harness:
- trace gate: EXTRA_ENV=LLAMA_LAYOUT_TRACE=128 OPS=GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID,
- serving profile: streamed /home/mudler/bench/phase76_current_moe_profile.sh with source logging fixed for the mirror, GATED_DELTA_NET gates, and LLAMA_LAYOUT_TRACE=30000 on llama-server,
- N=128, PTOK=128, GEN=64, PARALLEL=128, CTX=131072.
Artifacts:
- trace gate: /home/mudler/bench/phase100_layout_view_trace/20260701_201635/trace_gates,
- serving profile: /home/mudler/bench/phase100_layout_view_trace/20260701_201800/serving_profile.

Safety gates:

gate	MoE md5	dense md5	`GATED_DELTA_NET`	`MUL_MAT`	`MUL_MAT_ID`
trace-enabled standalone	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`48/48`	`1146/1146`	`806/806`
serving pre raw log	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`48/48`	`1146/1146`	`806/806`
serving post raw log	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`48/48`	`1146/1146`	`806/806`

Serving under graph-node profiling plus view-source layout trace:

metric	value
aggregate t/s	`207.0`
decode aggregate t/s	`327.9`
decode per-seq t/s	`2.10`
prefill t/s	`1490.9`
TTFT mean ms	`8302.7`
wall s	`39.578`
total kernel time	`20.3464 s`

Fine buckets:

bucket	time	share	launches
`mmq_nvfp4`	`6127.44 ms`	`30.12%`	`33682`
`gdn_core`	`5919.48 ms`	`29.09%`	`4680`
`convert_dtype`	`661.73 ms`	`3.25%`	`52060`
`gdn_conv`	`458.13 ms`	`2.25%`	`7230`
`concat_layout`	`438.15 ms`	`2.15%`	`2130`
`copy_layout`	`116.25 ms`	`0.57%`	`8090`
`ew_repeat`	`46.45 ms`	`0.23%`	`18720`

View-source trace findings:

finding	evidence
K/V cache reads feed F32->F16 converts	For attention layers, `GET_ROWS` outputs F32 `node_*` from F16 `cache_k_l*` / `cache_v_l*`, then a `CPY` downcasts a view of that node to F16. Examples: `node_358 <- cache_k_l3` and `node_365 <- cache_v_l3`, followed by `cpy` rows with `src0_view=node_358` / `node_365`, `src0_type=f32`, `src1_type=f16`, and shapes like `256x64x2x8`, `256x128x2x8`, `256x162x2x8`.
The pattern repeats across attention layers	The same pair pattern appears for `cache_k_l7/cache_v_l7` (`node_798/node_805`), `cache_k_l11/cache_v_l11` (`node_1238/node_1245`), and later attention layers.
Some converts remain anonymous	`959` F32->F16 `CPY` trace rows still had no tensor or view names; do not assume the K/V path accounts for the full `convert_dtype` bucket without a targeted A/B.
Phase99 conv attribution is confirmed	`concat` rows show `conv_input-*` from `conv_states_reshaped-*` and `qkv_mixed_transposed-*`; the new view fields map `qkv_mixed_transposed-*` back to layer-local `node_*` producers.

Decision:

Carry the trace-only Phase100 patch as default-off instrumentation.
The next runtime source candidate should target the attention K/V cache gather dtype path: avoid GET_ROWS producing F32 only to downcast to F16 when the consumer wants F16. This is more directly connected to the convert_dtype bucket than a generic copy/layout tweak.
Keep the two-source SSM_CONV contract as a separate later phase for concat_layout; do not mix it with the K/V dtype experiment.
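
The proposed K/V gather change rests on one numeric fact: widening an F16 value to F32 and narrowing it back to F16 is lossless, so skipping the F32 intermediate should not change the stored values. The sketch below checks that round trip exhaustively over all non-NaN F16 bit patterns using only the Python standard library. The helper names are illustrative, and the check says nothing about the ggml kernels themselves or about whether the md5 gates stay green.

```python
import struct


def f16_bits_to_float(bits: int) -> float:
    """Decode a 16-bit pattern as an IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def float_to_f16_bits(x: float) -> int:
    """Encode a Python float as half precision and return its bit pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def through_f32(x: float) -> float:
    """Round a Python float through single precision (the F32 intermediate)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def is_nan_pattern(bits: int) -> bool:
    """NaN payloads are skipped: their bit patterns need not survive re-encoding."""
    return (bits & 0x7C00) == 0x7C00 and (bits & 0x03FF) != 0


def roundtrip_mismatches() -> int:
    """Count F16 patterns that change after F16 -> F32 -> F16."""
    mismatches = 0
    for bits in range(1 << 16):
        if is_nan_pattern(bits):
            continue
        widened = through_f32(f16_bits_to_float(bits))
        if float_to_f16_bits(widened) != bits:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("mismatching F16 patterns:", roundtrip_mismatches())
```

If the count is zero, a gather that keeps F16 end to end is value-equivalent to the traced `GET_ROWS` plus `CPY` pair for finite cache contents; any speed effect still needs the serving A/B.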

Phase99: Serving Layout Trace Attribution

Date: 2026-07-01.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: no source change; the default-off LLAMA_LAYOUT_TRACE hook was already present in the fork and DGX mirror.
Harness:
- trace gate: EXTRA_ENV=LLAMA_LAYOUT_TRACE=128 OPS=GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID,
- serving profile: streamed /home/mudler/bench/phase76_current_moe_profile.sh with measurement-only edits for source logging, GATED_DELTA_NET gates, and LLAMA_LAYOUT_TRACE=30000 on llama-server,
- N=128, PTOK=128, GEN=64, PARALLEL=128, CTX=131072.
Artifacts:
- trace gate: /home/mudler/bench/phase99_layout_trace/20260701_200637/trace_gates,
- serving profile: /home/mudler/bench/phase99_layout_trace/20260701_200835/serving_profile.

Safety gates:

gate	MoE md5	dense md5	`GATED_DELTA_NET`	`MUL_MAT`	`MUL_MAT_ID`
trace-enabled standalone	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`48/48`	`1146/1146`	`806/806`
serving pre raw log	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`48/48`	`1146/1146`	`806/806`
serving post raw log	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`48/48`	`1146/1146`	`806/806`

Serving under graph-node profiling plus layout trace:

metric	value
aggregate t/s	`208.2`
decode aggregate t/s	`332.9`
decode per-seq t/s	`2.12`
prefill t/s	`1476.8`
TTFT mean ms	`8466.3`
wall s	`39.341`
total kernel time	`20.2408 s`

Macro buckets:

bucket	time	share
GDN	`6709.45 ms`	`33.15%`
MoE/FFN-GEMM	`6158.11 ms`	`30.42%`
bf16/fp8-proj	`2786.81 ms`	`13.77%`
layout-copy	`1269.35 ms`	`6.27%`
ew-mul(weight/norm/GDN)	`729.08 ms`	`3.60%`
act-quant	`686.52 ms`	`3.39%`
FA	`268.04 ms`	`1.32%`

Fine buckets:

bucket	time	share	launches
`mmq_nvfp4`	`5936.34 ms`	`29.33%`	`34162`
`gdn_core`	`5920.40 ms`	`29.25%`	`4710`
`convert_dtype`	`662.34 ms`	`3.27%`	`52440`
`gdn_conv`	`457.47 ms`	`2.26%`	`7290`
`concat_layout`	`440.01 ms`	`2.17%`	`2130`
`copy_layout`	`119.16 ms`	`0.59%`	`8110`
`ew_repeat`	`47.83 ms`	`0.24%`	`18840`

Layout trace summary:

route	trace lines
`get_rows`	`18779`
`cpy`	`4638`
`cont`	`4384`
`concat`	`2199`

Top attribution:

finding	evidence
`concat_layout` is conv input materialization	`conv_input-* = concat(conv_states_reshaped-*, qkv_mixed_transposed-*)`; top shapes include `45x8192x12x1 = 3x8192x12x1 + 42x8192x12x1` (`450` trace lines) and `49x8192x11x1 = 3x8192x11x1 + 46x8192x11x1` (`180` trace lines).
`copy_layout` includes conv state writeback	`conv_state_update-* = cpy(conv_state_last-*, conv_state_update-*)`; top grouped shapes include `24576x12x1x1 <- 3x8192x12x1` (`780` trace lines), `24576x11x1x1` (`420`), and `24576x13x1x1` (`270`).
`convert_dtype` needs stronger attribution	the trace sees many unnamed `CPY` rows with F32 source and F16 destination, e.g. `256x166x2x11`, `256x166x2x12`, and similar attention/KV-shaped tensors; names are not preserved by the current dispatch trace.

Decision:

Phase99 is a measurement-only phase; no runtime patch was carried or reverted.
Do not spend more time on the Phase96-style conv-state identity shortcut. The serving hot layout path is the prefill/microbatch conv_input concat feeding SSM_CONV, not just decode update writeback.
A conv-side source phase must be a larger two-source SSM_CONV contract that reads (conv_states, qkv_mixed) as a logical concatenation; anything smaller does not remove enough work to justify a phase. If that contract is not being coded, first extend trace attribution for the larger unnamed F32->F16 convert_dtype bucket.
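
The idea of reading (conv_states, qkv_mixed) as a logical concatenation can be sketched in plain Python. This is an illustration only: the `[time][channel]` list layout, the function names, and the depthwise window of `d_conv = len(conv_states) + 1` taps are assumptions made for readability, not the ggml tensor layout or the actual SSM_CONV signature.

```python
def conv_materialized(conv_states, qkv_mixed, weights):
    """Reference path: build the concatenated conv input, then convolve."""
    conv_input = conv_states + qkv_mixed  # the copy a concat would materialize
    d_conv = len(weights)
    n_channels = len(weights[0])
    return [
        [sum(weights[k][c] * conv_input[t + k][c] for k in range(d_conv))
         for c in range(n_channels)]
        for t in range(len(qkv_mixed))
    ]


def conv_two_source(conv_states, qkv_mixed, weights):
    """Two-source path: pick the buffer that holds logical row i, no concat."""
    d_conv = len(weights)
    n_state = len(conv_states)  # assumed to equal d_conv - 1
    n_channels = len(weights[0])

    def row(i):
        # Logical rows [0, n_state) live in the state, the rest in qkv_mixed.
        return conv_states[i] if i < n_state else qkv_mixed[i - n_state]

    return [
        [sum(weights[k][c] * row(t + k)[c] for k in range(d_conv))
         for c in range(n_channels)]
        for t in range(len(qkv_mixed))
    ]


# Tiny hypothetical case: 3 state rows, 2 new tokens, 2 channels, 4 taps.
conv_states = [[1, 2], [3, 4], [5, 6]]
qkv_mixed = [[7, 8], [9, 10]]
weights = [[1, 0], [0, 1], [2, 0], [0, 2]]

reference = conv_materialized(conv_states, qkv_mixed, weights)
two_source = conv_two_source(conv_states, qkv_mixed, weights)
```

Both paths perform the same multiplications in the same order, so the results should match exactly; the only difference is that the second never allocates the concatenated buffer.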

Phase98: Phase93 Serving Graph-Node Profile

Date: 2026-07-01.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: no source change; this measured the carried Phase93 stack after Phase95 and Phase96 reverts.
Harness:
- streamed /home/mudler/bench/phase76_current_moe_profile.sh with two measurement-only edits:
  - source logging does not call git because the DGX Phase93 mirror is a source copy without .git,
  - pre/post gate ops include GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID,
- SRC=/home/mudler/llama-phase93-qwen3next-gqa-bcast,
- BIN=/home/mudler/llama-phase93-qwen3next-gqa-bcast/build/bin,
- N=128, PTOK=128, GEN=64, PARALLEL=128, CTX=131072.
Artifact: /home/mudler/bench/phase98_phase93_serving_profile/20260701_215715.

Safety gates:

phase	MoE md5	dense md5	`GATED_DELTA_NET`	`MUL_MAT`	`MUL_MAT_ID`
pre	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`48/48`	`1146/1146`	`806/806`
post	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`48/48`	`1146/1146`	`806/806`

Serving under graph-node profiling, MoE N=128, PTOK=128, GEN=64, PARALLEL=128:

metric	value
aggregate t/s	`208.4`
decode aggregate t/s	`332.0`
decode per-seq t/s	`2.12`
prefill t/s	`1488.1`
TTFT mean ms	`8315.5`
wall s	`39.296`
total kernel time	`20.0411 s`

Macro buckets:

bucket	time	share
GDN	`6679.96 ms`	`33.33%`
MoE/FFN-GEMM	`6034.52 ms`	`30.11%`
bf16/fp8-proj	`2766.06 ms`	`13.80%`
layout-copy	`1257.60 ms`	`6.28%`
ew-mul(weight/norm/GDN)	`726.03 ms`	`3.62%`
act-quant	`686.69 ms`	`3.43%`
FA	`265.00 ms`	`1.32%`

Fine buckets:

bucket	time	share	launches
`gdn_core`	`5892.99 ms`	`29.40%`	`4680`
`mmq_nvfp4`	`5809.55 ms`	`28.99%`	`33442`
`cublas_bf16_gemm`	`1745.83 ms`	`8.71%`	`22200`
`cutlass_bf16_gemm`	`740.22 ms`	`3.69%`	`26190`
`ew_mul`	`720.94 ms`	`3.60%`	`48326`
`act_quant`	`686.69 ms`	`3.43%`	`37526`
`convert_dtype`	`663.45 ms`	`3.31%`	`51300`
`gdn_conv`	`457.11 ms`	`2.28%`	`7260`
`concat_layout`	`430.25 ms`	`2.15%`	`2100`
`get_rows`	`283.56 ms`	`1.41%`	`27978`
`gdn_gather`	`231.32 ms`	`1.15%`	`360`
`mm_ids`	`119.93 ms`	`0.60%`	`16680`
`gdn_l2norm`	`98.54 ms`	`0.49%`	`9360`
`gemv_moe_q`	`81.77 ms`	`0.41%`	`1560`

Decision:

Phase98 confirms the serving hot path is still a two-bucket problem: gdn_core and mmq_nvfp4 together account for 58.39% of kernel time.
The repeated negative GDN micro-tries (Phase91, Phase92, Phase95, Phase96) argue against more scalar/launch/gather shortcuts. A credible GDN follow-up needs a larger recurrence design with a measured PoC, not another local tweak.
layout-copy is now large enough (6.28%, led by convert_dtype and concat_layout) to deserve attribution before code changes, but it is not parity-closing by itself.
Next phase should either:
- attribute convert_dtype/concat_layout to exact graph nodes and remove a proven material copy, or
- pursue a larger gdn_core/mmq_nvfp4 serving lever with a strict PoC gate.

Phase97: Phase93 Serving Snapshot, N=128

Date: 2026-07-01.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: no source change; this measured the carried Phase93 stack after Phase95 and Phase96 reverts.
Harness:
- streamed paged-current-serving-snapshot.sh with a one-line source-log workaround because the DGX Phase93 mirror is a source copy without .git,
- SRC=/home/mudler/llama-phase93-qwen3next-gqa-bcast,
- BUILD_DIR=/home/mudler/llama-phase93-qwen3next-gqa-bcast/build,
- BIN=/home/mudler/llama-phase93-qwen3next-gqa-bcast/build/bin,
- NPL=128, PTOK=128, GEN=64, PARALLEL=128, CTX=131072,
- gate ops: GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID.
Artifact: /home/mudler/bench/phase97_phase93_serving_snapshot/20260701_214648.

Safety gates:

phase	MoE md5	dense md5	`GATED_DELTA_NET`	`MUL_MAT`	`MUL_MAT_ID`
pre	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`48/48`	`1146/1146`	`806/806`
post	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`48/48`	`1146/1146`	`806/806`

Serving snapshot, MoE PTOK=128, GEN=64, PARALLEL=128, N=128:

arm	n	agg_tps	decode_agg_tps	decode_perseq_tps	prefill_tps	ttft_mean_ms	wall_s
paged Phase93	`128`	`329.6`	`669.8`	`3.85`	`1734.5`	`7415.4`	`24.851`
vLLM	`128`	`664.8`	`1029.4`	`6.79`	`5271.8`	`2519.5`	`11.929`

Ratios:

n	paged decode/vLLM	paged perseq/vLLM	paged agg/vLLM	paged TTFT/vLLM
`128`	`0.6507`	`0.5670`	`0.4958`	`2.9432`
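
The ratio row can be reproduced from the snapshot table with a few divisions. The dictionary keys below mirror the table's column names; the variable names are otherwise illustrative.

```python
# Values copied from the Phase97 serving snapshot table (n=128).
paged = {"agg_tps": 329.6, "decode_agg_tps": 669.8,
         "decode_perseq_tps": 3.85, "ttft_mean_ms": 7415.4}
vllm = {"agg_tps": 664.8, "decode_agg_tps": 1029.4,
        "decode_perseq_tps": 6.79, "ttft_mean_ms": 2519.5}

# Throughput ratios below 1.0 mean paged is slower; a TTFT ratio above 1.0
# means paged takes longer to produce the first token.
ratios = {key: paged[key] / vllm[key] for key in paged}

if __name__ == "__main__":
    for key, value in ratios.items():
        print(f"{key}: {value:.4f}")
```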

Decision:

Phase93 remains a valid decode-profile improvement, but it is not serving-parity at n=128.
The Phase97 paged aggregate is slightly above the Phase72 default snapshot (329.6 vs 325.8), and TTFT improves (7415.4 ms vs 7822.5 ms), but decode aggregate is lower than Phase72 (669.8 vs 714.0) while vLLM stays essentially unchanged (1029.4 vs 1029.5).
Treat Phase93 as worth carrying for source quality and decode-profile gain, but the next parity phase needs a larger serving-impact lever. More isolated GDN/conv micro-optimizations are unlikely to close the live serving gap.

Phase96: Conv-State Identity Fast Path

Date: 2026-07-01.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: runtime model-graph change reverted after profiling; Phase93 is still the current carried source.
Rationale:
- The Phase93 decode profile showed ssm_conv_update_ids_f32/gdn_conv around the 66-72 ms range, larger than the cleanly attributable remaining GDN producer math.
- The recurrent GDN path already uses a direct in-place op when s_copy_main is identity. This trial added the same shape of branch to build_conv_state_fused: when inp->s_copy_main_identity was true, it viewed the active conv-state cache slots directly and called ggml_ssm_conv_update_inplace instead of the ids variant.
- The existing build_rs zero/extra-state maintenance stayed around the lambda, and the CUDA update kernel loads the conv window before writing the same slot, so the identity aliasing was expected to be safe.
Gate and profile artifacts:
- canonical gates: /home/mudler/bench/phase96_conv_identity_fastpath/20260701_214023/canonical_gates,
- decode-only profile: /home/mudler/bench/phase96_conv_identity_fastpath/20260701_214141/decode_profile.

Safety gates:

check	result
local build	`cmake --build build --target test-backend-ops -j $(nproc)` OK
local CPU `SSM_CONV`	`45/45`
DGX CUDA `SSM_CONV`	`45/45`, `Backend CUDA0: OK`
DGX CUDA `GATED_DELTA_NET_INPLACE_IDS`	`6/6`, `Backend CUDA0: OK`
canonical MoE md5	`8cb0ce23777bf55f92f63d0292c756b0`
canonical dense md5	`5951a5b4d624ce891e22ab5fca9bc439`
canonical `SSM_CONV`	`45/45`, `Backend CUDA0: OK`
canonical `GATED_DELTA_NET`	`48/48`, `Backend CUDA0: OK`
canonical `MUL_MAT`	`1146/1146`, `Backend CUDA0: OK`
canonical `MUL_MAT_ID`	`806/806`, `Backend CUDA0: OK`
profile pre/post md5/op gates	all OK

Decode-only profile, MoE N=128, N_PREDICT=2048, capture after median depth 74 -> 96, default env:

arm	total kernel s	GDN ms	`gdn_core` ms	`gdn_core` launches	`gdn_conv` ms	`mmq_nvfp4` ms
Phase93 default	`3.5476`	`1409.19`	`1333.48`	`570`	about `66.40` to `72.26`	`1421.63`
Phase96 conv identity	`3.6723`	`1486.12`	`1406.57`	`600`	`70.42`	`1433.84`

Decision:

Reject the conv-state identity fast path. It is inference-safe, but it did not improve gdn_conv and worsened total kernel time and gdn_core versus Phase93.
Revert the runtime model-graph change and keep Phase93 as the current carried candidate.
Do not retry the conv identity branch as a speed lever unless a same-window trace shows the ids variant itself is materially slower than the direct variant independent of launch-count/capture variance.
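
Because the two arms captured different `gdn_core` launch counts (`570` vs `600`), the launch-count caveat above can be checked by dividing bucket time by launches. The sketch below is a hedged illustration using only the numbers in the Phase96 table; the function names are hypothetical and this normalization was not part of the Phase96 decision rule.

```python
def per_launch_ms(bucket_ms: float, launches: int) -> float:
    """Average kernel time per launch for one profiling bucket."""
    return bucket_ms / launches


def pct_change(baseline: float, candidate: float) -> float:
    """Signed percentage change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


# (gdn_core ms, gdn_core launches) from the decode-only profile table.
phase93 = (1333.48, 570)
phase96 = (1406.57, 600)

raw_pct = pct_change(phase93[0], phase96[0])
normalized_pct = pct_change(per_launch_ms(*phase93), per_launch_ms(*phase96))

if __name__ == "__main__":
    print(f"raw gdn_core change:        {raw_pct:+.2f}%")
    print(f"per-launch gdn_core change: {normalized_pct:+.2f}%")
```

On these numbers the raw regression is roughly 5.5% while the per-launch difference is roughly 0.2%, so much of the raw gap tracks the larger capture window. Either way the candidate shows no gain, which is consistent with the rejection.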

Phase95: GDN Warp Scalar-Gate Broadcast

Date: 2026-07-01.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: runtime CUDA change reverted after profiling; Phase93 is still the current carried source.
Env:
- GDN_WARP_SCALAR_GATE=1
Rationale:
- After Phase93, the remaining GDN producer buckets are small while gdn_core remains the largest target.
- The scalar non-KDA decode path loads one scalar gate value per (head, seq, token), but every lane computes expf(*g_t). This default-off trial computed the scalar gate on lane 0 and broadcast it within the warp for the one-token S_v=128, non-KDA, default 16x8 decode path.
- The recurrence order, reductions, state update, and stores were unchanged.
Gate and profile artifacts:
- canonical gates: /home/mudler/bench/phase95_gdn_warp_scalar_gate/20260701_213150/canonical_gates,
- decode-only profile: /home/mudler/bench/phase95_gdn_warp_scalar_gate/20260701_213311/decode_profile.

Safety gates:

check	result
local build	`cmake --build build --target test-backend-ops -j $(nproc)` OK
local CPU `GATED_DELTA_NET`	`48/48`
local CPU `GATED_DELTA_NET_INPLACE_IDS`	`6/6`
DGX CUDA `GATED_DELTA_NET`, `GDN_WARP_SCALAR_GATE=1`	`48/48`, `Backend CUDA0: OK`
DGX CUDA `GATED_DELTA_NET_INPLACE_IDS`, `GDN_WARP_SCALAR_GATE=1`	`6/6`, `Backend CUDA0: OK`
canonical MoE md5	`8cb0ce23777bf55f92f63d0292c756b0`
canonical dense md5	`5951a5b4d624ce891e22ab5fca9bc439`
canonical `GATED_DELTA_NET`	`48/48`, `Backend CUDA0: OK`
canonical `MUL_MAT`	`1146/1146`, `Backend CUDA0: OK`
canonical `MUL_MAT_ID`	`806/806`, `Backend CUDA0: OK`
profile pre/post md5/op gates	all OK

Decode-only profile, MoE N=128, N_PREDICT=2048, capture after median depth 65 -> 87, PROFILE_ENV=GDN_WARP_SCALAR_GATE=1:

arm	total kernel s	GDN ms	GDN %	`gdn_core` ms	`gdn_core` launches	`mmq_nvfp4` ms
Phase93 default	`3.5476`	`1409.19`	`39.72%`	`1333.48`	`570`	`1421.63`
Phase95 warp scalar gate	`3.6317`	`1483.44`	`40.85%`	`1402.40`	`599`	`1402.88`

Decision:

Reject GDN_WARP_SCALAR_GATE=1. It is inference-safe, but worsens the target gdn_core bucket by +68.92 ms and total kernel time by +84.1 ms versus Phase93.
Revert the runtime CUDA change and keep Phase93 as the current carried candidate.
Do not retry scalar-gate warp broadcast unless a future profile shows SFU pressure, rather than recurrent state traffic/reductions, dominating the decode GDN core.

Phase94: Phase93 GDN Geometry Reprobe, 8x8

Date: 2026-07-01.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: no source change; env-only geometry probe rejected.
Env:
- GDN_NW=8
- GDN_CPW=8
Rationale:
- Phase93 changed the active GDN launch mix and dropped gdn_core to the current best 1333.48 ms.
- The 8x8 geometry keeps a single S_v=128 column tile (grid.z=1) like the default 16x8 path, but halves threads per block. This tested whether lower block occupancy pressure helped after grouped Q/K broadcast.
Gate and profile artifacts:
- canonical gates: /home/mudler/bench/phase94_gdn_geometry_phase93/20260701_211730/canonical_gates_8x8,
- decode-only profile: /home/mudler/bench/phase94_gdn_geometry_phase93/20260701_211855/decode_profile_8x8.

Safety gates:

check	result
DGX CUDA `GATED_DELTA_NET`, `GDN_NW=8 GDN_CPW=8`	`48/48`, `Backend CUDA0: OK`
canonical MoE md5	`8cb0ce23777bf55f92f63d0292c756b0`
canonical dense md5	`5951a5b4d624ce891e22ab5fca9bc439`
canonical `GATED_DELTA_NET`	`48/48`, `Backend CUDA0: OK`
canonical `MUL_MAT`	`1146/1146`, `Backend CUDA0: OK`
canonical `MUL_MAT_ID`	`806/806`, `Backend CUDA0: OK`
profile pre/post md5/op gates	all OK

Decode-only profile, MoE N=128, N_PREDICT=2048, capture after median depth 74 -> 96, PROFILE_ENV=GDN_NW=8 GDN_CPW=8:

arm	total kernel s	GDN ms	GDN %	`gdn_core` ms	`gdn_core` launches	`mmq_nvfp4` ms
Phase93 default geometry	`3.5476`	`1409.19`	`39.72%`	`1333.48`	`570`	`1421.63`
Phase94 8x8 geometry	`3.6223`	`1522.02`	`42.02%`	`1440.79`	`600`	`1352.68`

Decision:

Reject GDN_NW=8 GDN_CPW=8 for Phase93. It is inference-safe, but worsens the target gdn_core bucket by +107.31 ms and total kernel time by +74.7 ms.
Keep the Phase93 default 16x8 geometry.
The profile also shows remaining producer-side GDN work is small compared with recurrence core: l2_norm_f32 8.65 ms, GDN gate/sigmoid kernels about 12.75 ms, and remaining repeat 5.34 ms in the Phase93 default trace. The next candidate should target recurrence work or a larger packed decode contract, not another small producer-only fusion.

Phase93: Qwen3Next Grouped Q/K Broadcast for Fused GDN

Date: 2026-07-01.
Source: /home/mudler/llama-phase93-qwen3next-gqa-bcast.
Local patch status: carried as a positive candidate.
Patch scope:
- added ggml_gated_delta_net_set_bcast(tensor, grouped) using op_params[2],
- kept default GDN Q/K head mapping as the existing tiled/modulo behavior,
- added grouped mapping for opt-in GDN calls: qk_head = value_head / (H_v / H_k),
- threaded the grouped flag through CPU GDN, CUDA sequential decode, and CUDA chunked prefill kernels,
- changed Qwen3Next to skip the explicit q/k repeat only when the GDN op path can consume grouped broadcast,
- added grouped broadcast backend-op coverage for one-token and prompt-sized GATED_DELTA_NET.
Build artifact: /home/mudler/llama-phase93-qwen3next-gqa-bcast/build.
Gate and profile artifacts:
- canonical gates: /home/mudler/bench/phase93_qwen3next_gqa_bcast/20260701_210857/canonical_gates,
- decode-only profile: /home/mudler/bench/phase93_qwen3next_gqa_bcast/20260701_211019/decode_profile.

Safety gates:

check	result
local build	`cmake --build build --target test-backend-ops -j $(nproc)` OK
local CPU `GATED_DELTA_NET`	`48/48`, includes grouped AR and PP cases
local CPU `GATED_DELTA_NET_INPLACE_IDS`	`6/6`
DGX CUDA `GATED_DELTA_NET`	`48/48`, includes grouped AR and PP cases
DGX CUDA `GATED_DELTA_NET_INPLACE_IDS`	`6/6`
canonical MoE md5	`8cb0ce23777bf55f92f63d0292c756b0`
canonical dense md5	`5951a5b4d624ce891e22ab5fca9bc439`
canonical `GATED_DELTA_NET`	`48/48`, `Backend CUDA0: OK`
canonical `MUL_MAT`	`1146/1146`, `Backend CUDA0: OK`
canonical `MUL_MAT_ID`	`806/806`, `Backend CUDA0: OK`
profile pre/post md5/op gates	all OK

Decode-only profile, MoE N=128, N_PREDICT=2048, capture after median depth 73 -> 94, default env:

arm	total kernel s	GDN ms	GDN %	`gdn_core` ms	`gdn_core` launches	`mmq_nvfp4` ms
Phase87 same-source default	`3.6310`	`1471.27`	`40.52%`	`1390.56`	`598`	`1416.46`
Phase91 pack2 PDL-fix	`3.5813`	`1505.91`	`42.05%`	`1425.44`	`598`	`1333.39`
Phase92 store-fused	`3.7419`	`1609.81`	`43.02%`	`1529.72`	`600`	`1383.82`
Phase93 Qwen3Next grouped broadcast	`3.5476`	`1409.19`	`39.72%`	`1333.48`	`570`	`1421.63`

Decision:

Carry Phase93. It is md5/op clean and reduces the target gdn_core bucket by 57.08 ms vs Phase87 same-source default (1390.56 ms), 66.86 ms vs Phase85 identity-state (1400.34 ms), and 91.96 ms vs the rejected Phase91 pack2 trial (1425.44 ms).
The win is consistent with the intended work reduction: Qwen3Next stops materializing repeated q/k heads for fused GDN and lets the op map value heads to grouped q/k heads directly.
Next follow-up should profile/count node-level repeat/layout buckets around Qwen3Next GDN to confirm whether more vLLM-style packed decode producer work remains worth porting.
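
The difference between the default tiled mapping and the opt-in grouped mapping can be shown with a small index sketch. It assumes "tiled/modulo" means `value_head % H_k` and uses made-up head counts (`H_v=8`, `H_k=2`); neither the function names nor those sizes come from the patch.

```python
def qk_head_tiled(value_head: int, h_v: int, h_k: int) -> int:
    """Assumed default mapping: q/k heads repeat as a tile (0,1,0,1,...)."""
    return value_head % h_k


def qk_head_grouped(value_head: int, h_v: int, h_k: int) -> int:
    """Grouped mapping from the patch scope: value_head / (H_v / H_k)."""
    assert h_v % h_k == 0, "H_v must be a multiple of H_k"
    return value_head // (h_v // h_k)


H_V, H_K = 8, 2  # hypothetical head counts, for illustration only

tiled = [qk_head_tiled(h, H_V, H_K) for h in range(H_V)]
grouped = [qk_head_grouped(h, H_V, H_K) for h in range(H_V)]

if __name__ == "__main__":
    print("tiled:  ", tiled)    # consecutive value heads cycle through q/k heads
    print("grouped:", grouped)  # consecutive value heads share one q/k head
```

With grouped indexing each block of `H_v / H_k` value heads reads the same q/k head in place, which is why the explicit q/k repeat can be skipped on that path.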

Phase92: Scalar Decode Store-Fused GDN Trial

Date: 2026-07-01.
Source: /home/mudler/llama-phase92-gdn-store-fused, default-off CUDA experiment on top of the Phase90/91 guardrail stack.
Local patch status: runtime CUDA changes reverted after profiling; guardrail stack remains.
Patch scope:
- added a STORE_FUSED CUDA kernel instantiation behind GDN_SCALAR_DECODE_STORE_FUSED=1,
- gated it to S_v=128, scalar-gate, final-state, one-token, in-place decode with default geometry,
- wrote state_dst inside the scalar update loop and skipped the final post-token register-store loop for that instantiation.
Build artifact: /home/mudler/llama-phase92-gdn-store-fused/build.
Guardrail and gate artifacts:
- canonical gates: /home/mudler/bench/phase92_gdn_scalar_store_fused/20260701_204550/canonical_gates,
- decode-only profile: /home/mudler/bench/phase92_gdn_scalar_store_fused/20260701_204718/decode_profile.

Safety gates:

check	result
local build	`cmake --build build --target test-backend-ops -j $(nproc)` OK
local CPU guardrail	`GATED_DELTA_NET_INPLACE_IDS` `6/6`, `Backend CPU: OK`
DGX CUDA guardrail, `GDN_SCALAR_DECODE_STORE_FUSED=1`	`6/6`, `Backend CUDA0: OK`
canonical MoE md5	`8cb0ce23777bf55f92f63d0292c756b0`
canonical dense md5	`5951a5b4d624ce891e22ab5fca9bc439`
canonical `GATED_DELTA_NET`	`46/46`, `Backend CUDA0: OK`
canonical `MUL_MAT`	`1146/1146`, `Backend CUDA0: OK`
canonical `MUL_MAT_ID`	`806/806`, `Backend CUDA0: OK`
profile pre/post md5/op gates	all OK

Decode-only profile, MoE N=128, N_PREDICT=2048, capture after median depth 72 -> 94, PROFILE_ENV=GDN_SCALAR_DECODE_STORE_FUSED=1:

arm	total kernel s	GDN ms	GDN %	`gdn_core` ms	`gdn_core` launches	`mmq_nvfp4` ms
Phase87 same-source default	`3.6310`	`1471.27`	`40.52%`	`1390.56`	`598`	`1416.46`
Phase91 pack2 PDL-fix	`3.5813`	`1505.91`	`42.05%`	`1425.44`	`598`	`1333.39`
Phase92 store-fused	`3.7419`	`1609.81`	`43.02%`	`1529.72`	`600`	`1383.82`

Decision:

Reject and revert the store-fused runtime patch. It is inference-safe under the current md5/op gates, but it worsens the target gdn_core bucket by +139.16 ms vs Phase87 same-source default and +104.28 ms vs the already rejected Phase91 pack2 trial.
The extra in-loop global stores likely increase pressure/ordering cost enough to outweigh removing the final register pass. Do not retry this shape unless a profile shows the final store loop as independently dominant.
Next higher-value direction from the vLLM code audit is not another recurrence micro-loop tweak; scope the larger packed decode contract or the Qwen3Next GQA-repeat removal as separate, guarded phases.

Phase91: Default-off PACK=2 Decode Kernel, Guarded Retry

Date: 2026-07-01.
Source: /home/mudler/llama-phase91-gdn-pack2-guarded-source, default-off CUDA experiment on top of the Phase90 guardrail stack.
Local patch status: runtime CUDA changes reverted after profiling; Phase90 test guardrail remains.
Patch scope:
- reintroduced a GDN_DECODE_PACK2=1 F32 scalar-gate, one-token, in-place decode kernel that packs two sequences into one CTA,
- added a PDL-safety fix after the first canonical md5 failure: inactive odd/single sequence lanes now call ggml_cuda_pdl_sync() before returning,
- extended the guardrail with F32 n_seqs=1 and n_seqs=3 output-plus-state cases.
Build artifact: /home/mudler/llama-phase91-gdn-pack2-guarded-source/build.
Guardrail artifacts:
- initial n_seqs=2 guardrail pass: /home/mudler/bench/phase91_gdn_pack2_guarded/20260701_201943/guardrail,
- initial canonical md5 failure: /home/mudler/bench/phase91_gdn_pack2_guarded/20260701_202024/canonical_gates,
- PDL-fix expanded guardrail pass: /home/mudler/bench/phase91_gdn_pack2_guarded/20260701_202140/guardrail_pdl_fix,
- PDL-fix canonical gates with GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID: /home/mudler/bench/phase91_gdn_pack2_guarded/20260701_202154/canonical_gates_pdl_fix,
- decode-only profile: /home/mudler/bench/phase91_gdn_pack2_guarded/20260701_202425/decode_profile_pdl_fix.

Safety gates:

check	result
initial Phase90 guardrail, `GDN_DECODE_PACK2=1`	`4/4`, `Backend CUDA0: OK`
initial canonical MoE md5	failed: `b93724e88460d90379c5009df0e1f2b6` vs `8cb0ce23777bf55f92f63d0292c756b0`
expanded guardrail after PDL fix	`6/6`, covers F32 `n_seqs=1,2,3` output-plus-state
PDL-fix MoE md5	`8cb0ce23777bf55f92f63d0292c756b0`
PDL-fix dense md5	`5951a5b4d624ce891e22ab5fca9bc439`
PDL-fix `GATED_DELTA_NET`	`46/46`, `Backend CUDA0: OK`
PDL-fix `MUL_MAT`	`1146/1146`, `Backend CUDA0: OK`
PDL-fix `MUL_MAT_ID`	`806/806`, `Backend CUDA0: OK`

Decode-only profile, MoE N=128, N_PREDICT=2048, capture after median depth 66 -> 88, PROFILE_ENV=GDN_DECODE_PACK2=1:

arm	total kernel s	GDN ms	GDN %	`gdn_core` ms	`gdn_core` launches	`mmq_nvfp4` ms
Phase87 same-source default	`3.6310`	`1471.27`	`40.52%`	`1390.56`	`598`	`1416.46`
Phase85 identity state	`3.6622`	`1480.21`	`40.42%`	`1400.34`	`596`	`1437.53`
Phase91 pack2 PDL-fix	`3.5813`	`1505.91`	`42.05%`	`1425.44`	`598`	`1333.39`

Decision:

Reject and revert the pack2 runtime patch. It is inference-safe after the PDL fix, but it worsens the target gdn_core bucket by +34.88 ms vs the Phase87 same-source default and +25.10 ms vs Phase85.
Keep the expanded Phase90/91 GATED_DELTA_NET_INPLACE_IDS guardrail cases because they caught the missing odd/single sequence coverage.
Do not retry CTA-level sequence packing without a different per-sequence work reduction; packing alone raises GDN's share of total kernel time.
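
Why the odd and single-sequence cases needed their own guardrail coverage can be seen from a simple pairing sketch. It assumes sequences are packed in consecutive pairs, one pair per CTA; the real kernel's indexing is not recorded in this ledger, so the function below is purely illustrative.

```python
def pack2_slots(n_seqs: int):
    """Return (first_seq, second_seq) per CTA; None marks an inactive slot."""
    n_ctas = (n_seqs + 1) // 2  # round up so a trailing odd sequence gets a CTA
    slots = []
    for cta in range(n_ctas):
        first = 2 * cta
        second = first + 1 if first + 1 < n_seqs else None
        slots.append((first, second))
    return slots


def has_inactive_slot(n_seqs: int) -> bool:
    """True when some CTA carries only one live sequence."""
    return any(second is None for _, second in pack2_slots(n_seqs))


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, pack2_slots(n), "inactive slot:", has_inactive_slot(n))
```

Under this assumption `n_seqs=2` never exercises an inactive slot, which matches the ledger's observation that the initial two-sequence guardrail passed while the canonical md5 failed.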

Phase90: In-place GDN Decode State Guardrail

Date: 2026-07-01.
Source: /home/mudler/llama-phase90-gdn-inplace-ids-guardrail-source, test-only experiment on top of the current Phase85 carry-forward stack.
Local patch status: kept as a guardrail candidate in tests/test-backend-ops.cpp.
Patch scope:
- fixes the in-place ids fixture initialization by mirroring the identity source cache bytes into state_dst after random tensor initialization,
- adds F32 serving-shape cases: head_count=4, head_size=128, n_seqs=2, scalar gate and KDA,
- makes those F32 cases return concat(flatten(out), flatten(state_dst)), so the normal backend comparator validates both attention output and the recurrent-state side effect.
Build artifact: /home/mudler/llama-phase90-gdn-inplace-ids-guardrail-source/build.
Gate artifacts:
- stale-source assertion: /home/mudler/bench/phase90_gdn_inplace_ids_guardrail/20260701_200946/direct,
- output-only corrected pass: /home/mudler/bench/phase90_gdn_inplace_ids_guardrail/20260701_201058/direct,
- output-plus-state corrected pass: /home/mudler/bench/phase90_gdn_inplace_ids_guardrail/20260701_201257/direct.

DGX verification:

check	result
local build	`cmake --build build --target test-backend-ops -j $(nproc)` completed
local CPU selected op	`4/4`, including F32 `check_state=1` cases
DGX CUDA selected op, stale source	failed before comparison on BF16 `state_dst` F32-only assert
DGX CUDA selected op, corrected output-only source	`4/4`, `Backend CUDA0: OK`
DGX CUDA selected op, output plus state	`4/4`, `Backend CUDA0: OK`

Decision:

Keep this as the minimum guardrail for the next packed decode attempt. It covers the Phase88 target shape (S_v=128, one-token decode, two sequences) and observes the side-effect state_dst update for F32 scalar-gate and KDA cases.
BF16 in-place ids cases remain output-only in this fixture; use canonical md5 gates for full-model BF16 inference safety.
Do not profile Phase90: it is a test harness/guardrail attempt, not a runtime performance candidate.
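
The value of returning `concat(flatten(out), flatten(state_dst))` can be illustrated with a toy comparator. This is a sketch in plain Python with hypothetical helper names and made-up numbers; it is not the `test-backend-ops` comparator.

```python
def flatten(tensor):
    """Flatten arbitrarily nested lists into one flat list of numbers."""
    if isinstance(tensor, (list, tuple)):
        return [value for item in tensor for value in flatten(item)]
    return [tensor]


def output_plus_state(out, state_dst):
    """Single comparable vector covering the output and the state side effect."""
    return flatten(out) + flatten(state_dst)


def matches(a, b, tol=1e-6):
    """Elementwise comparison with an absolute tolerance."""
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


# Reference result versus a backend that computes the right output
# but writes one wrong value into the recurrent state.
ref_out, ref_state = [[0.5, -1.0]], [[1.0, 2.0], [3.0, 4.0]]
bad_out, bad_state = [[0.5, -1.0]], [[1.0, 2.0], [3.0, 0.0]]

output_only_ok = matches(flatten(ref_out), flatten(bad_out))
combined_ok = matches(output_plus_state(ref_out, ref_state),
                      output_plus_state(bad_out, bad_state))
```

An output-only check passes on this example while the combined vector fails, which is the class of state-update bug the guardrail is meant to expose.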

Phase89: In-place GDN Decode Test Guardrail Attempt

Date: 2026-07-01.
Source: /home/mudler/llama-phase89-gdn-decode-gate-source, test-only experiment on top of the reverted Phase88 source.
Local patch status: reverted after the targeted test filter failed.
Patch scope:
- temporarily added two test_gated_delta_net_inplace_ids cases in tests/test-backend-ops.cpp:
  - F32, head_count=4, head_size=128, n_seqs=2, scalar gate,
  - F32, head_count=4, head_size=128, n_seqs=2, KDA.
Build artifact: /home/mudler/llama-phase89-gdn-decode-gate-source/build-cuda.
Build logs:
- /home/mudler/llama-phase89-gdn-decode-gate-source/configure.phase89.log
- /home/mudler/llama-phase89-gdn-decode-gate-source/build.phase89.log
Gate artifact: /home/mudler/bench/phase89_gdn_decode_gate/20260701_175903/direct.

DGX verification:

check	result
local build	`cmake --build build --target test-backend-ops -j 8` completed
local run	local CPU backend skipped for this op set
CUDA `GATED_DELTA_NET` filter	`46/46`, `Backend CUDA0: OK`
CUDA `GATED_DELTA_NET_INPLACE_IDS` filter	failed `0/4`, including both newly added F32 cases and the two pre-existing BF16 cases

Decision:

Reject and revert the test-only change. The direct GATED_DELTA_NET_INPLACE_IDS filter is not currently a reliable green guardrail, because the existing BF16 cases fail when selected directly.
Do not add more packed decode source until there is a focused harness for the serving decode shape that compares both attention output and the side-effect state_dst update against the existing sequential kernel.

Phase88: Default-off PACK=2 Decode CTA Kernel

Date: 2026-07-01.
Source: /home/mudler/llama-phase88-gdn-pack2-source, one-file CUDA experiment on top of Phase85.
Local patch status: reverted after md5 failure.
Patch scope:
- added gated_delta_net_decode_pack2_cuda in ggml/src/ggml-cuda/gated_delta_net.cu,
- gated it behind GDN_DECODE_PACK2=1,
- limited it to F32 state, scalar-gate, S_v == 128, n_tokens == 1, in-place decode, with no GDN_NW/GDN_CPW override,
- attempted to preserve the existing (16,8) per-column math order while packing two independent sequences into one CTA.
Build artifact: /home/mudler/llama-phase88-gdn-pack2-source/build-cuda.
Build logs:
- /home/mudler/llama-phase88-gdn-pack2-source/configure.phase88.log
- /home/mudler/llama-phase88-gdn-pack2-source/build.phase88.log
Gate artifact: /home/mudler/bench/phase88_gdn_pack2_gates/20260701_175059/direct.
Profile artifact: none. Profiling was skipped because the md5 gate failed.

DGX gates with GDN_DECODE_PACK2=1:

check	result
MoE md5	failed, got `320b5ed679844cbfd6f18d85d7ae32b0`, expected `8cb0ce23777bf55f92f63d0292c756b0`
dense md5	failed, got `6a65e9d9e47321ebce9e461c8abf036c`, expected `5951a5b4d624ce891e22ab5fca9bc439`
`GATED_DELTA_NET`	`Backend CUDA0: OK`
`MUL_MAT`	`Backend CUDA0: OK`
`MUL_MAT_ID`	`Backend CUDA0: OK`

Observed output symptom:

MoE output duplicated the opening <think> marker.
Dense output degenerated into repeated / characters immediately after the opening <think> marker.

Decision:

Reject and revert. The sacred greedy md5 gate failed, so no profile was run.
The existing test-backend-ops -o GATED_DELTA_NET set did not catch this because it does not cover the exact serving decode shape that triggers the pack2 path. Before another packed decode attempt, add or script a focused n_seq_tokens=1, n_seqs > 1, in-place F32 state equivalence gate against the existing sequential kernel.
Do not carry the pack2 kernel in the patch stack.
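
The equivalence gate requested above could be prototyped on the host before any CUDA work. The following is a minimal sketch, not the ggml kernel: it assumes the standard one-token gated delta rule with a scalar gate, and `gdn_step`, `run_sequential`, `run_packed2`, and `equivalence_gate` are hypothetical names. It compares both the attention output and the in-place state against a sequential reference, and includes a deliberately broken packed variant to show that the gate can fail.

```python
import copy
import math
import random


def gdn_step(state, q, k, v, g, beta):
    """Toy one-token gated delta rule update for a single head.

    state[i][j] (key dim i, value dim j) is updated in place, standing in
    for the state_dst side effect. Returns the attention output vector.
    """
    decay = math.exp(g)  # scalar gate
    out = []
    for j in range(len(v)):  # one value column at a time, fixed order
        kv = 0.0
        for i in range(len(k)):
            state[i][j] *= decay
            kv += state[i][j] * k[i]
        delta = (v[j] - kv) * beta
        acc = 0.0
        for i in range(len(k)):
            state[i][j] += k[i] * delta
            acc += state[i][j] * q[i]
        out.append(acc)
    return out


def make_case(n_seqs, dk, dv, seed):
    """Random states and inputs for n_seqs independent sequences."""
    rng = random.Random(seed)

    def vec(n):
        return [rng.uniform(-1.0, 1.0) for _ in range(n)]

    states = [[vec(dv) for _ in range(dk)] for _ in range(n_seqs)]
    inputs = [
        {"q": vec(dk), "k": vec(dk), "v": vec(dv),
         "g": -rng.random(), "beta": rng.random()}
        for _ in range(n_seqs)
    ]
    return states, inputs


def run_sequential(states, inputs):
    """Reference: one sequence per step, one token per sequence."""
    return [gdn_step(s, **x) for s, x in zip(states, inputs)]


def run_packed2(states, inputs, buggy=False):
    """Candidate: two independent sequences per outer iteration."""
    outs = [None] * len(states)
    for base in range(0, len(states), 2):
        for lane in range(min(2, len(states) - base)):
            idx = base + lane
            # buggy=True aliases both lanes onto lane 0's state on purpose.
            src = base if buggy else idx
            outs[idx] = gdn_step(states[src], **inputs[idx])
    return outs


def equivalence_gate(candidate, n_seqs=3, dk=4, dv=4, seed=0):
    """True only if outputs AND final states match the reference exactly."""
    ref_states, inputs = make_case(n_seqs, dk, dv, seed)
    cand_states = copy.deepcopy(ref_states)
    ref_out = run_sequential(ref_states, inputs)
    cand_out = candidate(cand_states, inputs)
    return ref_out == cand_out and ref_states == cand_states


print("packed2 equivalent:", equivalence_gate(run_packed2))
print("buggy packed2 equivalent:",
      equivalence_gate(lambda s, x: run_packed2(s, x, buggy=True)))
```

Exact equality is used rather than a tolerance because the md5 gate is bit-exact. An odd `n_seqs` exercises the unpaired tail sequence. A real harness would drive the CUDA kernels at `S_v == 128` with F32 state instead of these toy dimensions.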

Phase87: Decode Geometry Probe `(GDN_NW=4, GDN_CPW=8)`

Date: 2026-07-01.
Source: /home/mudler/llama-phase87-gdn-4x8-source, one-line CUDA dispatcher experiment on top of Phase85: expose launch_gdn_variant<128, ..., NUM_WARPS=4, COLS_PER_WARP=8> through the existing GDN_NW/GDN_CPW env sweep.
Local patch status: reverted after profiling. The attempt was env-gated and never made default.
Build artifact: /home/mudler/llama-phase87-gdn-4x8-source/build-cuda.
Build logs:
- /home/mudler/llama-phase87-gdn-4x8-source/configure.phase87.log
- /home/mudler/llama-phase87-gdn-4x8-source/build.phase87.log
Gate artifact: /home/mudler/bench/phase87_gdn_4x8_gates/20260701_174014/direct.
Profile artifact: /home/mudler/bench/phase87_gdn_4x8_profile/20260701_174310.
Result type: source geometry probe. The hypothesis was that a 32-column tile (4 warps * 8 columns per warp) would be closer to vLLM's BV=32 decode program shape while preserving the existing per-column reduction order.

DGX gates with GDN_NW=4 GDN_CPW=8:

check	result
MoE md5	`8cb0ce23777bf55f92f63d0292c756b0`
dense md5	`5951a5b4d624ce891e22ab5fca9bc439`
`GATED_DELTA_NET`	`Backend CUDA0: OK`
`MUL_MAT`	`Backend CUDA0: OK`
`MUL_MAT_ID`	`Backend CUDA0: OK`

Same-source decode-only profile:

arm	source	env	active slots	depth start	depth mid	total kernel s	GDN ms	GDN share	`gdn_core` ms	`gdn_core` launches	`mmq_nvfp4` ms
default geometry	`/home/mudler/llama-phase87-gdn-4x8-source`	default `(16,8)`	`128`	`74`	`96`	`3.6310`	`1471.27`	`40.52%`	`1390.56`	`598`	`1416.46`
Phase87 4x8	`/home/mudler/llama-phase87-gdn-4x8-source`	`GDN_NW=4 GDN_CPW=8`	`128`	`71`	`92`	`3.5988`	`1493.66`	`41.50%`	`1417.13`	`569`	`1396.11`

Decision:

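Reject. The target gdn_core bucket regressed by 26.57 ms (+1.91%), from 1390.56 to 1417.13 ms, even though the candidate capture recorded fewer gdn_core launches (569 vs 598). Normalized per launch, the cost rose from about 2.325 to about 2.491 ms (+7.1%). The lower total kernel time is attributed to unrelated mmq_nvfp4 variance.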
Reverted the one-line dispatcher addition. Do not carry this in the patch stack.
The subagent/code audit points to a different Phase88 shape: keep the current (16,8) per-column math order and pack two independent sequences per CTA, or implement a fuller vLLM-style packed decode kernel that fuses producer math and recurrence.
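
Because the two capture windows recorded different launch counts, the raw bucket delta understates the change. A minimal sketch of the per-launch normalization, using only numbers from the Phase87 table above and the Phase79 table later in this ledger; the helper names are illustrative and are not part of the bench harness:

```python
def per_launch_ms(bucket_ms, launches):
    """Average gdn_core cost per launch inside one capture window."""
    return bucket_ms / launches


def relative_change(base, cand):
    """Fractional change of cand relative to base (positive = regression)."""
    return (cand - base) / base


# (gdn_core ms, gdn_core launches) copied from the ledger tables.
ARMS = {
    "phase87_4x8": {"base": (1390.56, 598), "cand": (1417.13, 569)},
    "phase79_bv32": {"base": (1411.46, 600), "cand": (1426.17, 570)},
}

for name, arm in ARMS.items():
    raw = relative_change(arm["base"][0], arm["cand"][0])
    base_pl = per_launch_ms(*arm["base"])
    cand_pl = per_launch_ms(*arm["cand"])
    norm = relative_change(base_pl, cand_pl)
    print(f"{name}: raw {raw:+.2%}, per-launch "
          f"{base_pl:.3f} -> {cand_pl:.3f} ms ({norm:+.2%})")
```

Per-launch cost is the steadier comparison whenever launch counts drift between arms, which is the same reasoning Phase79 uses to reject the BV32 patch.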

Phase86: Producer-fusion Scope Audit

Date: 2026-07-01.
Source: no source patch. This is a profile-backed scope rejection using the Phase85 node-traced DGX artifact before spending code on a small-ceiling fusion.
Input profile artifact: /home/mudler/bench/phase85_gdn_identity_state_profile/20260701_171856.
Source audit:
- ggml/src/ggml-cuda/ggml-cuda.cu already fuses { GGML_OP_UNARY, GGML_OP_MUL } for SILU, SIGMOID, and SOFTPLUS, covering the expensive part of alpha_softplus * ssm_a.
- Qwen35 and Qwen35MoE still compute beta sigmoid and the alpha bias/softplus producer as separate graph pieces, but those pieces are small in the decode-only trace.
- vLLM's Triton producer fusion remains a useful design reference, but its isolated producer scope is not the main GB10 bottleneck in this llama.cpp profile.
Gate artifact: not applicable, no binary changed.
Result type: no-code benchmark/scope attempt. The benchmark record below is copied from the Phase85 candidate profile because Phase86 deliberately asks whether a source patch is worth writing.

Same-window profile evidence:

bucket	time	share	launches	interpretation
total kernel time	`3.6622 s`	`100.00%`	-	Phase85 identity-state candidate capture
`GDN` macro	`1480.21 ms`	`40.42%`	`2980`	target family remains dominant
`gdn_core`	`1400.34 ms`	`38.24%`	`596`	real parity lever must reduce this bucket
`act/GDN-gate(shared)` macro	`13.57 ms`	`0.37%`	`3771`	entire producer/gate-side ceiling is tiny
`gated_act_silu_sigmoid`	`10.84 ms`	`0.30%`	`1786`	already includes fused unary-gated kernels
`gdn_sigmoid`	`2.73 ms`	`0.07%`	`1985`	beta sigmoid ceiling
`unary_op_kernel<&op_softplus>`	about `1.08 ms`	about `0.03%`	`596`	alpha softplus standalone signal from `nsys stats`

Decision:

Reject a narrow Phase86 producer-only implementation. Even deleting the whole act/GDN-gate(shared) macro would improve the captured total by only 0.37%, and deleting only the still-unfused beta sigmoid would be about 0.07%.
Do not modify or gate source for this phase. It would add upstream conflict surface without meaningful parity upside.
Phase87 should target a packed decode GDN kernel, inspired by vLLM's decode path, that reduces launches and memory traffic inside gdn_core itself while preserving the default F32 recurrent S-cache and md5/op gates.
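
The scope rejection rests on a simple ceiling argument: deleting a bucket entirely cannot reduce total kernel time by more than that bucket's share. A minimal sketch with the Phase85 capture numbers from the table above; `removal_ceiling` is an illustrative helper, not part of bucket2.py:

```python
TOTAL_KERNEL_MS = 3662.2  # 3.6622 s total kernel time in the capture

BUCKET_MS = {
    "act/GDN-gate(shared)": 13.57,
    "gdn_sigmoid": 2.73,
    "gdn_core": 1400.34,
}


def removal_ceiling(bucket_ms, total_ms=TOTAL_KERNEL_MS):
    """Largest possible fractional reduction if the bucket cost became zero."""
    return bucket_ms / total_ms


for name, ms in BUCKET_MS.items():
    print(f"{name}: at most {removal_ceiling(ms):.2%} of total kernel time")
```

The ceiling ignores second-order effects such as launch overhead or overlap, so it is an upper bound on what a producer-only fusion could recover, not a prediction.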

Phase85: Identity-contiguous GDN State Fast Path

Date: 2026-07-01.
Source: /home/mudler/llama-phase85-gdn-identity-state-source, local eight-file experiment on top of fork commit 237ad9b96 feat(cuda): add BF16 Qwen GDN state cache.
Local patch scope:
- carry forward Phase84 attention-only in-place GDN output cleanup,
- add a side-effect-free llama_memory_recurrent_context::s_copy_main_is_identity,
- store that identity bit in llm_graph_input_rs,
- include it in base and hybrid graph reuse checks,
- call ggml_gated_delta_net_inplace on a direct state view when active recurrent rows are identity-contiguous, otherwise keep the ids path.
Build artifact: /home/mudler/llama-phase85-gdn-identity-state-source/build-cuda.
Build logs:
- /home/mudler/llama-phase85-gdn-identity-state-source/configure.phase85.log
- /home/mudler/llama-phase85-gdn-identity-state-source/build.phase85.log
Gate artifact: /home/mudler/bench/phase85_gdn_identity_state_gates/20260701_171733/direct.
Profile artifact: /home/mudler/bench/phase85_gdn_identity_state_profile/20260701_171856.
Result type: source cleanup / small performance experiment. This reuses the existing F32 recurrent-state CUDA kernel and changes only the source-state view used for identity-contiguous decode windows. It avoids the ids scratch allocation and no-op gdn_gather_nonident_kernel launch in that graph shape.

Local verification:

check	result
local build	`cmake --build build --target test-backend-ops llama-server -j 8` completed
local note	`llama-server` build used the UI archive fallback after local npm engine warning; target completed

DGX gates:

check	result
MoE md5	`8cb0ce23777bf55f92f63d0292c756b0`
dense md5	`5951a5b4d624ce891e22ab5fca9bc439`
`GATED_DELTA_NET`	`46/46`, `Backend CUDA0: OK`
`MUL_MAT`	`1146/1146`, `Backend CUDA0: OK`
`MUL_MAT_ID`	`806/806`, `Backend CUDA0: OK`

Same-window decode-only profile:

arm	source	active slots	depth start	depth mid	total kernel s	GDN ms	GDN share	`gdn_core` ms	`gdn_core` launches	`gdn_gather` ms	GDN macro launches	`mmq_nvfp4` ms
baseline F32	`/home/mudler/llama-phase81-bf16-state-source`	`128`	`73`	`95`	`3.7081`	`1493.78`	`40.28%`	`1412.33`	`600`	`0.89`	`3600`	`1473.60`
Phase85 identity state	`/home/mudler/llama-phase85-gdn-identity-state-source`	`128`	`72`	`94`	`3.6622`	`1480.21`	`40.42%`	`1400.34`	`596`	not present	`2980`	`1437.53`

Server log signal:

arm	CUDA free memory at startup	graph reuse
baseline F32	`116418 MiB`	`105/122 = 86.1%`
Phase85 identity state	`117857 MiB`	`105/123 = 85.4%`

Decision:

Carry forward only as a small cleanup candidate. The patch is md5/op green, removes the explicit gdn_gather bucket, and reduces GDN macro launches.
Do not treat it as a parity-closing speed lever: the directly removed work was only 0.89 ms over the capture, and gdn_core improved by only 0.85% (1412.33 -> 1400.34 ms) in a noisy same-window run.
Keep the next speed-focused scope on either producer fusion (alpha softplus * A, beta sigmoid) or a larger packed decode kernel. The remaining GDN gap is not explained by ids gather overhead.
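
The fast-path choice can be pictured as a predicate over the recurrent-row source ids. This is a hypothetical sketch only: the exact definition used by `llama_memory_recurrent_context::s_copy_main_is_identity` is not reproduced in this ledger, and the sketch assumes "identity-contiguous" means that row `i` reads its state from cell `base + i`. The function names and path labels are illustrative.

```python
def is_identity_contiguous(ids, base=0):
    """True if row i sources its state from cell base + i (assumed meaning)."""
    return all(src == base + row for row, src in enumerate(ids))


def choose_state_path(ids, base=0):
    """Pick the direct state view when no gather is needed, else the ids path."""
    return "direct_view" if is_identity_contiguous(ids, base) else "ids_gather"


print(choose_state_path([0, 1, 2, 3]))       # identity window
print(choose_state_path([0, 2, 1, 3]))       # permuted rows need the ids path
print(choose_state_path([5, 6, 7], base=5))  # contiguous window at an offset
```

Keeping the ids path as the fallback is what lets the patch stay md5/op green: the direct view is taken only when the gather would have been a no-op.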

Phase84: Attention-only Outputs for In-place GDN

Date: 2026-07-01.
Source: /home/mudler/llama-phase84-attn-only-source, local three-file experiment on top of fork commit 237ad9b96 feat(cuda): add BF16 Qwen GDN state cache.
Local patch files:
- ggml/src/ggml.c
- ggml/src/ggml-cpu/ggml-cpu.c
- ggml/src/ggml-cpu/ops.cpp
Build artifact: /home/mudler/llama-phase84-attn-only-source/build-cuda.
Build logs:
- /home/mudler/llama-phase84-attn-only-source/configure.phase84.log
- /home/mudler/llama-phase84-attn-only-source/build.phase84.log
Gate artifact: /home/mudler/bench/phase84_attn_only_gates/20260701_165952/direct.
Profile artifact: /home/mudler/bench/phase84_attn_only_profile/20260701_170131.
Result type: source cleanup / memory experiment. ggml_gated_delta_net_inplace and ggml_gated_delta_net_inplace_ids now allocate only the attention-score output tensor because final recurrent state is written as a side effect into state_dst. The CPU inplace_ids non-identity fallback was moved from the old unused output tail to explicit workspace so CPU/CUDA semantics remain aligned.

Local verification:

check	result
local build	`cmake --build build --target test-backend-ops -j 8` completed
local GDN subset	no non-CPU backend locally, so CPU was skipped by `test-backend-ops`

DGX gates:

check	result
MoE md5	`8cb0ce23777bf55f92f63d0292c756b0`
dense md5	`5951a5b4d624ce891e22ab5fca9bc439`
`GATED_DELTA_NET`	`46/46`, `Backend CUDA0: OK`
`MUL_MAT`	`1146/1146`, `Backend CUDA0: OK`
`MUL_MAT_ID`	`806/806`, `Backend CUDA0: OK`

Same-window decode-only profile:

arm	source	active slots	depth start	depth mid	total kernel s	GDN ms	GDN share	`gdn_core` ms	`gdn_core` launches	`gdn_core`/launch	`mmq_nvfp4` ms
baseline F32	`/home/mudler/llama-phase81-bf16-state-source`	`128`	`74`	`96`	`3.6464`	`1481.59`	`40.63%`	`1399.72`	`599`	`2.337 ms`	`1418.47`
Phase84 attention-only	`/home/mudler/llama-phase84-attn-only-source`	`128`	`65`	`87`	`3.5814`	`1489.33`	`41.59%`	`1407.38`	`598`	`2.353 ms`	`1349.11`

Server log memory signal:

arm	CUDA free memory at startup	graph reuse
baseline F32	`117472 MiB`	`107/124 = 86.3%`
Phase84 attention-only	`117855 MiB`	`98/115 = 85.2%`

Decision:

Do not count Phase84 as a speed parity win. The target GDN bucket moved 1399.72 -> 1407.38 ms (+0.55%), and the lower total kernel time is again explained by unrelated mmq_nvfp4 variance (1418.47 -> 1349.11 ms).
Keep as a possible memory-footprint cleanup only if upstream maintainability is acceptable: gates are green and the server startup memory signal improved by about 383 MiB in the same profile window.
Do not regenerate the LocalAI patch series until a follow-up decides whether this memory-only cleanup belongs in the fork commit stack.

Phase83: KDA GDN exp-cache Decode Shortcut

Date: 2026-07-01.
Source: /home/mudler/llama-phase83-kda-gexp-source, local one-file CUDA experiment on top of fork commit 237ad9b96 feat(cuda): add BF16 Qwen GDN state cache.
Build artifact: /home/mudler/llama-phase83-kda-gexp-source/build-cuda.
Build log: /home/mudler/llama-phase83-kda-gexp-source/build.phase83.log.
Gate artifact: /home/mudler/bench/phase83_kda_gexp_gates/20260701_184237/direct_retry.
Profile artifact: /home/mudler/bench/phase83_kda_gexp_profile/20260701_164731.
Result type: source micro-optimization. Cache the KDA per-row expf(g_t[i]) value in a register once per token/thread in ggml/src/ggml-cuda/gated_delta_net.cu, then reuse it in both the KDA kv and S-update loops. This preserves the same recurrence storage, operation order at the algorithm level, and F32 state path.

Gate harness notes:

First copied-harness attempt used a LocalAI worktree path that was not present on DGX and failed before running gates.
Second harness attempt refused to run because this job already owned the GPU lock.
First direct gate script had an awk quoting bug after producing partial output.
Corrected direct retry completed and is the valid gate artifact.

Gates:

check	result
MoE md5	`8cb0ce23777bf55f92f63d0292c756b0`
dense md5	`5951a5b4d624ce891e22ab5fca9bc439`
`GATED_DELTA_NET`	`46/46`, `Backend CUDA0: OK`
`MUL_MAT`	`1146/1146`, `Backend CUDA0: OK`
`MUL_MAT_ID`	`806/806`, `Backend CUDA0: OK`

Same-window decode-only profile:

arm	source	active slots	depth start	depth mid	total kernel s	GDN ms	GDN share	`gdn_core` ms	`gdn_core` launches	`gdn_core`/launch	`mmq_nvfp4` ms
baseline F32	`/home/mudler/llama-phase81-bf16-state-source`	`128`	`73`	`95`	`3.6487`	`1481.06`	`40.59%`	`1399.46`	`597`	`2.344 ms`	`1424.65`
Phase83 exp-cache	`/home/mudler/llama-phase83-kda-gexp-source`	`128`	`66`	`88`	`3.5501`	`1487.71`	`41.91%`	`1405.62`	`600`	`2.343 ms`	`1317.98`

Decision:

Reject carry-forward. The target GDN bucket was flat-to-slightly worse: gdn_core changed 1399.46 -> 1405.62 ms (+0.44%), while per-launch cost stayed effectively unchanged (2.344 -> 2.343 ms).
The lower total kernel time is not credited to the shortcut because the unrelated mmq_nvfp4 bucket dropped by 106.67 ms in the candidate sample.
Do not regenerate LocalAI patch-series output for this experiment. Next GDN work should target a structural traffic or launch-shape change, not single-expression reuse inside the current core loop.

Phase82: BF16 Persistent GDN S-Cache f16 KL Gate

Date: 2026-07-01.
Source: /home/mudler/llama-phase81-bf16-state-source, fork commit 237ad9b96 feat(cuda): add BF16 Qwen GDN state cache.
Build artifact: /home/mudler/llama-phase81-bf16-state-source/build-cuda.
KL artifact: /home/mudler/bench/phase82_bf16_s_cache_f16_kl/20260701_183016.
Result type: full MoE f16-reference KL gate for the Phase81 default-off BF16 persistent GDN S-cache candidate.
Reference base: /home/mudler/bench/l4gate/klbase_moe.dat, generated from /home/mudler/work/darwin_36b_opus/f16.gguf at -c 512 -b 2048 --chunks 16 with f16 PPL 7.3760 +/- 0.29100.
Acceptance reference from PAGED_BITEXACT_NOTE.md: paged FP4-MMQ vs f16 KLD 0.136000 +/- 0.003285, PPL 7.4009; non-paged FP4-MMQ vs f16 KLD 0.136597 +/- 0.003157.
Run note: the script metadata hash lines hit an awk quoting issue, so BASE_SHA256 and MODEL_SHA256_HEAD are blank in meta.txt; both KL passes completed and produced full logs. Treat the blank hashes as harness metadata noise, not a model-output failure.

Result:

arm	env	KLD vs f16	PPL(Q)	PPL ratio vs f16	same-top-p	max KLD
same-source F32	`LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`	`0.136563 +/- 0.003242`	`7.418401 +/- 0.296694`	`1.006105 +/- 0.008899`	`83.725 +/- 0.578%`	`3.602697`
BF16 S-cache	`LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16` plus same env	`0.137162 +/- 0.003456`	`7.321044 +/- 0.290693`	`0.992902 +/- 0.008714`	`84.240 +/- 0.571%`	`5.973692`

Decision:

Reject promotion of the BF16 persistent GDN S-cache patch.
Do not run serving A/B for this candidate under the current rules: the hard lossy-path gate requires KLD(new||f16) <= KLD(FP4-MMQ||f16), and the BF16 S-cache mean KLD is above both the documented paged reference (0.136000) and the same-source F32 measurement (0.136563).
Keep the Phase81 source only as a local experimental branch unless the gate is deliberately re-scoped. The next source attempt should preserve F32 recurrent S-cache quality or reduce traffic without changing the MoE f16 KL band.
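
The hard lossy-path gate quoted above reduces to a comparison of mean KLD values. A minimal sketch, assuming the rule is applied to means exactly as written; `passes_lossy_gate` is an illustrative name, and the numbers are copied from the result table and the acceptance reference:

```python
REFERENCE_KLD = {
    "paged FP4-MMQ vs f16 (documented)": 0.136000,
    "same-source F32 (this run)": 0.136563,
}
BF16_S_CACHE_KLD = 0.137162


def passes_lossy_gate(candidate_kld, reference_klds):
    """KLD(new||f16) must not exceed any accepted FP4-MMQ reference KLD."""
    return all(candidate_kld <= ref for ref in reference_klds)


for label, ref in REFERENCE_KLD.items():
    print(f"{label}: margin {BF16_S_CACHE_KLD - ref:+.6f}")
print("gate passed:",
      passes_lossy_gate(BF16_S_CACHE_KLD, REFERENCE_KLD.values()))
```

The check compares means only, as the rule is written; it does not use the reported +/- intervals.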

Phase81: Qwen35 BF16 Persistent GDN S-Cache

Date: 2026-07-01.
Source: /home/mudler/llama-phase81-bf16-state-source, local fork patch in /home/mudler/_git/llama.cpp branch localai-paged.
Build artifact: /home/mudler/llama-phase81-bf16-state-source/build-cuda.
Gate artifact: /home/mudler/bench/phase81_bf16_s_cache_gates/20260701_161350.
Profile artifacts:
- default F32: /home/mudler/bench/phase81_bf16_s_cache_profile/default_20260701_162117
- BF16 S-cache: /home/mudler/bench/phase81_bf16_s_cache_profile/bf16_20260701_162028
KL smoke artifact: /home/mudler/bench/phase81_bf16_s_cache_kl/20260701_162322.
Result type: source experiment. LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16 stores Qwen35/Qwen35MoE persistent recurrent S cache in BF16 while keeping GDN recurrence math, q/k/v/g/beta, and output in F32. Default remains F32.

Implementation scope:

Added BF16 state support for ggml_gated_delta_net_inplace_ids only.
Added CPU/CUDA BF16 state load/store conversion at the persistent cache boundary.
Added BF16 CPU/CUDA SCALE support because recurrent cache zeroing uses ggml_scale_inplace(..., 0) on the S cache.
Added tests for BF16 GATED_DELTA_NET_INPLACE_IDS and BF16 in-place SCALE.

Local verification:

check	result
RED test before implementation	`ggml_gated_delta_net_inplace_ids` rejected BF16 state at `state->type == GGML_TYPE_F32`
CPU `SCALE -p bf16`	`1/1` passed
CPU `GATED_DELTA_NET_INPLACE_IDS`	`2/2` passed
DGX CUDA build	completed for `llama-completion`, `llama-batched-bench`, `test-backend-ops`, `llama-server`, later `llama-perplexity`

Gates:

mode	MoE md5	dense md5	`MUL_MAT`	`MUL_MAT_ID`
default F32	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`
BF16 S-cache	`07db32c2bcb78d17a43ed18bc22705cd`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`

Profile:

arm	env	active slots	depth start	depth mid	total kernel s	GDN ms	GDN share	`gdn_core` ms	`gdn_core` launches	`gdn_core`/launch	`mmq_nvfp4` ms
default F32	none	`128`	`65`	`87`	`3.6157`	`1480.44`	`40.94%`	`1399.30`	`599`	`2.336 ms`	`1394.28`
BF16 S-cache	`LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16`	`128`	`65`	`91`	`3.5244`	`961.61`	`27.28%`	`863.57`	`720`	`1.199 ms`	`1665.38`

KL smoke against same-source F32 base:

check	result
shape	MoE, `-c 256 -b 256 --chunks 32`, Wikitext-2 raw
F32 floor KLD vs F32 base	`0.000000 +/- 0.000000`, same-top-p `99.975%`
BF16 S-cache KLD vs F32 base	`0.055499 +/- 0.001705`, same-top-p `88.361%`
BF16 PPL ratio vs F32 base	`1.010356 +/- 0.005817`

Decision:

Carry forward as a default-off candidate and run Phase82 full gates.
Do not make it default-on: MoE greedy md5 is not canonical, and the KL smoke is not the full f16-reference acceptance gate.
Required Phase82 before patch-series promotion: full f16-reference KL gate for MoE and dense, same-source serving A/B against F32 default and vLLM, then regenerate LocalAI patches from the fork only if serving and KL both hold.
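
The md5 change in the BF16 arm follows from precision lost at the cache boundary: each F32 state value is stored with 8 bits of significand and widened again on load. A minimal host-side sketch of that round trip, assuming round-to-nearest-even on store (the rounding mode of the actual CPU/CUDA conversion is not stated in this ledger) and finite inputs only; the helper names are illustrative:

```python
import struct


def f32_to_bf16_bits(x):
    """Round a finite value to BF16 (nearest-even) and return its 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even mantissa
    return ((bits + rounding) >> 16) & 0xFFFF


def bf16_bits_to_f32(b):
    """Widen 16 BF16 bits back to an F32 value (exact, no rounding)."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def roundtrip(x):
    """Value seen by the recurrence after one store/load through BF16."""
    return bf16_bits_to_f32(f32_to_bf16_bits(x))


for value in (1.0, 0.1234567, 3.14159, -123.456):
    stored = roundtrip(value)
    rel_err = abs(stored - value) / abs(value)
    print(f"{value!r} -> {stored!r} (relative error {rel_err:.2e})")
```

The recurrence math stays in F32, so the error enters once per store of the persistent S cache rather than once per arithmetic operation.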

Phase80: GDN Identity-Ids Shortcut Source A/B

Date: 2026-07-01.
Artifact root: /home/mudler/bench/phase80_gdn_identity_ids_ab/20260701_153927.
Arms:
- A_baseline: /home/mudler/llama-phase6-source, default source 14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output.
- B_identity: /home/mudler/llama-phase80-gdn-identity-source, one-file default-off source patch in ggml/src/ggml-cuda/gated_delta_net.cu, enabled with GDN_ASSUME_IDENTITY_IDS=1.
Result type: source A/B of an identity-ids shortcut that skips the non-identity scratch gather for one-token final-state decode and reads the in-place state cache directly.
Shape: same as Phase77 decode-only graph-node profile.
Build: candidate CUDA build completed for llama-completion, llama-batched-bench, test-backend-ops, and llama-server.

Gates:

arm	phase	MoE md5	dense md5	`MUL_MAT`	`MUL_MAT_ID`
`A_baseline`	pre	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`
`A_baseline`	post	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`
`B_identity`	pre	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`
`B_identity`	post	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`

Capture:

arm	active slots	depth start	depth mid	`gdn_core` launches
`A_baseline`	`128`	`74`	`96`	`600`
`B_identity`	`128`	`65`	`87`	`600`

Result:

arm	env	total kernel s	GDN ms	GDN share	`gdn_core` ms	`gdn_gather` ms	GDN macro launches
`A_baseline`	none	`3.7132`	`1493.57`	`40.22%`	`1411.65`	`0.79`	`3600`
`B_identity`	`GDN_ASSUME_IDENTITY_IDS=1`	`3.5685`	`1489.96`	`41.75%`	`1409.28`	not present	`3000`

Decision:

Reject carry-forward/default for GDN_ASSUME_IDENTITY_IDS=1.
The shortcut did remove the gdn_gather fine bucket and kept all gates green, but the removed bucket was only 0.79 ms over the capture and gdn_core was effectively unchanged.
The identity assumption is too narrow/risky for the size of the measured win. Do not spend more parity time on gather-only GDN shortcuts unless a future profile shows gather becoming material.
Keep the next real GDN source scope on recurrent-state precision/traffic.

Phase79: GDN Decode BV32 Source A/B

Date: 2026-07-01.
Artifact root: /home/mudler/bench/phase79_gdn_decode_bv32_ab/20260701_152530.
Arms:
- A_baseline: /home/mudler/llama-phase6-source, default source 14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output.
- B_bv32: /home/mudler/llama-phase79-gdn-source, one-file default-off source patch in ggml/src/ggml-cuda/gated_delta_net.cu, enabled with GDN_DECODE_BV32=1.
Result type: source A/B of a decode-only S_v=128, n_tokens=1, scalar-gate smaller-V-tile kernel inspired by vLLM's packed decode topology.
Shape: same as Phase77 decode-only graph-node profile.
Build: candidate CUDA build completed for llama-completion, llama-batched-bench, test-backend-ops, and llama-server.

Gate detail:

Candidate default gates before profiling were green: MoE md5 8cb0ce23777bf55f92f63d0292c756b0, dense md5 5951a5b4d624ce891e22ab5fca9bc439, MUL_MAT 1146/1146, MUL_MAT_ID 806/806.
Candidate opt-in gates before the A/B were green with GDN_DECODE_BV32=1: same md5 values, MUL_MAT 1146/1146, MUL_MAT_ID 806/806.
A/B baseline pre-gates were green. Baseline post-gate first run hit a transient MUL_MAT 1145/1146 failure on MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=1,k=256,...); immediate retry at A_baseline/gate_post_retry was green for md5, MUL_MAT 1146/1146, and MUL_MAT_ID 806/806.
B_bv32 pre/post gates were green with GDN_DECODE_BV32=1.

Capture:

arm	active slots	depth start	depth mid	`gdn_core` launches
`A_baseline`	`128`	`67`	`89`	`600`
`B_bv32`	`128`	`72`	`93`	`570`

Result:

arm	env	total kernel s	GDN ms	GDN share	`gdn_core` ms	`gdn_core`/launch	`mmq_nvfp4` ms
`A_baseline`	none	`3.6274`	`1493.14`	`41.16%`	`1411.46`	`2.352 ms`	`1392.60`
`B_bv32`	`GDN_DECODE_BV32=1`	`3.5739`	`1502.89`	`42.05%`	`1426.17`	`2.502 ms`	`1363.65`

Decision:

Reject the BV32 decode source patch.
Although all safety gates passed, normalized gdn_core worsened by about 6.4% per launch (2.352 -> 2.502 ms) and the GDN macro bucket increased (1493.14 -> 1502.89 ms).
Lower total kernel time in the candidate is not accepted as a win because the capture contains fewer graph-node launches (570 vs 600 gdn_core), while the per-launch GDN core cost is worse.
Do not retry smaller V-tile decode topology without a new profile-level reason. The next GDN source hypothesis should attack recurrent-state precision/traffic or another structural difference from vLLM.

Phase78: GDN Decode Launch-Shape Sweep

Date: 2026-07-01.
Baseline artifact: /home/mudler/bench/phase77_moe_decode_only_profile/20260701_150134.
Sweep artifacts:
- /home/mudler/bench/phase78_gdn_launch_sweep/nw8_cpw8_20260701_150654
- /home/mudler/bench/phase78_gdn_launch_sweep/nw16_cpw4_20260701_150954
Source baseline: 14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output.
Result type: env-gated launch-shape sweep only; no source change.
Shape: same as Phase77 decode-only graph-node profile.

Result:

arm	env	gate status	GDN ms	GDN share	`gdn_core` ms	`gdn_core` share	`mmq_nvfp4` ms
Phase77 default	none	pre/post green	`1489.71`	`41.20%`	`1408.33`	`38.95%`	`1383.50`
sweep `8x8`	`GDN_NW=8 GDN_CPW=8`	pre/post green	`1525.86`	`41.94%`	`1443.55`	`39.68%`	`1366.33`
sweep `16x4`	`GDN_NW=16 GDN_CPW=4`	rejected	not run	not run	not run	not run	not run

Gate detail:

8x8: pre/post MoE md5 8cb0ce23777bf55f92f63d0292c756b0, dense md5 5951a5b4d624ce891e22ab5fca9bc439, MUL_MAT 1146/1146, MUL_MAT_ID 806/806.
16x4: completion md5 and MUL_MAT 1146/1146 passed, but MUL_MAT_ID failed 805/806; rejected before profiling.

Decision:

Keep the current default GDN_NW=16 GDN_CPW=8.
Do not spend more GB10 time on launch-shape retunes without a new hypothesis.
The funded source path remains a structural default-off GDN decode A/B/PoC that reduces the Phase77 gdn_core bucket, not another existing-env sweep.

Phase77: MoE Decode-Only Graph-Node Profile

Date: 2026-07-01.
Artifact: /home/mudler/bench/phase77_moe_decode_only_profile/20260701_150134.
Setup-hiccup artifact: /home/mudler/bench/phase77_moe_decode_only_profile/20260701_145815.
Source baseline: 14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output.
Result type: current-stack llama.cpp decode-only graph-node profile; no source change.
Shape: MoE q36-35b-a3b-nvfp4, N=128, long-running /completion requests, N_PREDICT=2048, capture after active decode.
Capture window: active slots 128; median decoded depth 67 at start and 89 mid-capture; CAPTURE_SECONDS=4.
Profiler: nsys launch --cuda-graph-trace=node, bucketed with /home/mudler/bench/bucket2.py.

Gates:

phase	MoE md5	dense md5	`MUL_MAT`	`MUL_MAT_ID`
pre	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`
post	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`

Macro buckets:

bucket	time ms	share	instances
GDN	`1489.71`	`41.20%`	`3600`
MoE/FFN-GEMM	`1400.77`	`38.74%`	`7220`
bf16/fp8-proj	`352.90`	`9.76%`	`7400`
layout-copy	`69.85`	`1.93%`	`10400`
act-quant	`67.63`	`1.87%`	`4820`
FA	`36.74`	`1.02%`	`600`

Fine buckets:

bucket	macro	time ms	share	instances
`gdn_core`	GDN	`1408.33`	`38.95%`	`600`
`mmq_nvfp4`	MoE/FFN-GEMM	`1383.50`	`38.26%`	`4820`
`gdn_conv`	GDN	`71.76`	`1.98%`	`1200`
`gdn_l2norm`	GDN	`8.81`	`0.24%`	`1200`
`gdn_gather`	GDN	`0.80`	`0.02%`	`600`

Decision:

Phase77 confirms Phase76's GDN bucket is not only prompt/prefill contamination. In an isolated decode window, gdn_core is the largest fine bucket and is slightly larger than mmq_nvfp4.
This supersedes the Phase75 no-GB10-GDN-source stance. The source-funded path is no longer C=64 prefill inverse work; it is a narrow default-off GDN decode A/B or standalone PoC based on the direct recurrent/packed decode structure found in vLLM.
Acceptance gate for the next source attempt: reduce the Phase77 gdn_core bucket materially, keep pre/post md5 and MUL_MAT/MUL_MAT_ID green, and show no serving/decode throughput regression under the same decode-only capture shape.
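
The macro and fine tables above can be cross-checked by rolling the listed fine buckets up into their macro families. A minimal sketch using only the Phase77 rows shown here; it is not the bucket2.py logic, and the tables list only selected fine buckets, so a residual per macro is expected:

```python
from collections import defaultdict

# (fine bucket, macro family, time ms) from the Phase77 fine-bucket table.
PHASE77_FINE = [
    ("gdn_core", "GDN", 1408.33),
    ("mmq_nvfp4", "MoE/FFN-GEMM", 1383.50),
    ("gdn_conv", "GDN", 71.76),
    ("gdn_l2norm", "GDN", 8.81),
    ("gdn_gather", "GDN", 0.80),
]
# Macro totals from the Phase77 macro-bucket table.
PHASE77_MACRO = {"GDN": 1489.71, "MoE/FFN-GEMM": 1400.77}


def rollup(fine_rows):
    """Sum listed fine buckets per macro family."""
    totals = defaultdict(float)
    for _name, macro, ms in fine_rows:
        totals[macro] += ms
    return dict(totals)


def unlisted_residual(fine_rows, macro_totals):
    """Macro time not explained by the fine buckets listed in the ledger."""
    listed = rollup(fine_rows)
    return {m: round(t - listed.get(m, 0.0), 2) for m, t in macro_totals.items()}


def largest_fine_bucket(fine_rows):
    """Name of the fine bucket with the most kernel time."""
    return max(fine_rows, key=lambda row: row[2])[0]


print(unlisted_residual(PHASE77_FINE, PHASE77_MACRO))
print(largest_fine_bucket(PHASE77_FINE))
```

The GDN family closes to within rounding, which supports reading gdn_core as the dominant decode-window cost.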

Phase76: Current MoE Serving Graph-Node Profile

Date: 2026-07-01.
Artifact: /home/mudler/bench/phase76_current_moe_profile/20260701_145116.
Setup-hiccup artifacts: /home/mudler/bench/phase76_current_moe_profile/20260701_144754 and /home/mudler/bench/phase76_current_moe_profile/20260701_144929.
Source baseline: 14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output.
Result type: current-stack llama.cpp graph-node serving profile; no source change.
Shape: MoE q36-35b-a3b-nvfp4, n=128, PTOK=128, GEN=64, PARALLEL=128, CTX=131072, production defaults.
Profiler: nsys launch --cuda-graph-trace=node, bucketed with /home/mudler/bench/bucket2.py.

Gates:

phase	MoE md5	dense md5	`MUL_MAT`	`MUL_MAT_ID`
pre	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`
post	`8cb0ce23777bf55f92f63d0292c756b0`	`5951a5b4d624ce891e22ab5fca9bc439`	`1146/1146`	`806/806`

Serving result under graph-node profiling:

n	agg_tps	decode_agg_tps	decode_perseq_tps	prefill_tps	ttft_mean_ms	wall_s
`128`	`204.1`	`320.7`	`2.06`	`1490.1`	`8365.1`	`40.146`

Macro buckets:

bucket	time ms	share	instances
GDN	`6669.16`	`32.88%`	`25980`
MoE/FFN-GEMM	`6264.88`	`30.88%`	`54406`
bf16/fp8-proj	`2772.38`	`13.67%`	`53880`
layout-copy	`1265.44`	`6.24%`	`81280`
ew-mul(weight/norm/GDN)	`734.61`	`3.62%`	`52464`
act-quant	`678.95`	`3.35%`	`37526`
FA	`264.50`	`1.30%`	`3660`

Fine buckets:

bucket	macro	time ms	share	instances
`gdn_core`	GDN	`5876.94`	`28.97%`	`4680`
`gdn_conv`	GDN	`454.03`	`2.24%`	`7260`
`gdn_gather`	GDN	`237.87`	`1.17%`	`4680`
`gdn_l2norm`	GDN	`100.32`	`0.49%`	`9360`
`mmq_nvfp4`	MoE/FFN-GEMM	`6055.03`	`29.85%`	`34162`

Decision:

Phase76 contradicts the Phase75 assumption that GDN decode is not on the current critical path. In the current serving profile with graph-node tracing, GDN is the largest GPU-kernel macro bucket and gdn_core alone is nearly 29%.
Do not patch gated_delta_net.cu yet. This profile is llama-only and graph-node tracing depresses absolute throughput, so it is a source-funding signal, not a source patch gate.
Fund Phase77 as a narrow proof before backend edits: compare current gdn_core against a vLLM-style direct recurrent/packed decode PoC or an in-backend default-off A/B, with pre/post md5 and op gates, and require a material reduction in the Phase76 gdn_core bucket without regressing serving throughput or canonical md5.

Phase75: Post-PoC GDN/vLLM Audit

Date: 2026-07-01.
Artifact: no new benchmark artifact.
Source baseline: 14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output.
Result type: subagent codebase audit and gate-setting only; no source change.
Inputs: Phase74 artifact /home/mudler/bench/phase74_gdn_blocked_solve_poc/20260701_143711, llama.cpp GDN implementation, vLLM FLA/GDN implementation, and parity docs.

Findings:

llama.cpp already has the M5 tensor-core GDN path default-on under paged KV. It includes KK/QK mma, KS/QS 3xtf32 mma, P*U mma, explicit T=A^-1, U=T*RHS, and state carry Kc^T*DU.
The current backend path is fixed at C=16 for GB10 shared-memory limits. The remaining C=64/register-state class is not a shortcut patch.
Phase74 tested a C=64 shared-memory explicit inverse-plus-apply scaffold and failed its source-work gate: inverse/direct speed was 0.5941x for weak decay and 0.5927x for mixed decay.
vLLM has a structurally different one-token recurrent decode kernel that updates state directly without chunk inverse, and a packed decode path that avoids Q/K/V materialization copies. This is not currently source-funded in llama.cpp because prior parity profiles showed llama.cpp GDN decode faster than vLLM and decode serving dominated by host/MoE synchronization.
vLLM's CuTeDSL GDN prefill path uses SM10x/CUDA-13 Blackwell features including TMA/tcgen05/CUTLASS DSL. Treat it as datacenter-Blackwell reference evidence unless GB10 support is proven in the local toolchain.

Decision:

Do not start GB10 GDN backend source work after Phase74/75.
Do not start a packed/recurrent GDN decode PoC unless a fresh same-session profile shows GDN decode or Q/K/V materialization back on the critical path.
Phase75 acceptance gate for the next real parity attempt is a datacenter Blackwell serving rerun with the Phase72 shape: NPL=8 32 128, PTOK=128, GEN=64, PARALLEL=128, production defaults.
The rerun is valid only if hardware.txt records hardware_class=datacenter_blackwell, pre/post md5 gates are green (8cb0ce23777bf55f92f63d0292c756b0, 5951a5b4d624ce891e22ab5fca9bc439), MUL_MAT 1146/1146 and MUL_MAT_ID 806/806 are green, and decode profiles include nsys --cuda-graph-trace=node.
If datacenter Blackwell materially lifts llama/vLLM decode ratios above the GB10 Phase72 record (0.7561, 0.7158, 0.6935), continue parity work on that surface. If not, record the residual gap as engine/kernel architecture rather than GB10 memory bandwidth and keep GB10 GDN stopped.
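
The validity conditions for the rerun can be expressed as one predicate. A minimal sketch under stated assumptions: `hardware.txt` is assumed to hold `key=value` lines, and the `gates` dictionary layout and all function names are hypothetical rather than the format produced by paged-current-serving-snapshot.sh.

```python
CANONICAL_MD5 = {
    "moe_md5": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense_md5": "5951a5b4d624ce891e22ab5fca9bc439",
}
REQUIRED_OPS = {"MUL_MAT": "1146/1146", "MUL_MAT_ID": "806/806"}


def parse_key_values(text):
    """Parse key=value lines (assumed hardware.txt layout)."""
    pairs = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            pairs[key.strip()] = value.strip()
    return pairs


def gate_is_green(gate):
    """One gate pass: canonical md5 values and full op-test counts."""
    wanted = {**CANONICAL_MD5, **REQUIRED_OPS}
    return all(gate.get(key) == value for key, value in wanted.items())


def rerun_is_valid(hardware_txt, gates, profile_cmd):
    """All Phase75 conditions: hardware class, pre/post gates, node tracing."""
    hardware = parse_key_values(hardware_txt)
    return (
        hardware.get("hardware_class") == "datacenter_blackwell"
        and all(gate_is_green(gates.get(p, {})) for p in ("pre", "post"))
        and "--cuda-graph-trace=node" in profile_cmd
    )


green = {**CANONICAL_MD5, **REQUIRED_OPS}
print(rerun_is_valid("hardware_class=datacenter_blackwell",
                     {"pre": green, "post": green},
                     "nsys launch --cuda-graph-trace=node"))
```

A capture with `hardware_class` set to anything else, a missing post gate, or a profile taken without node-level graph tracing would all make the predicate false.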

Phase74: GDN Blocked-Solve PoC Gate

Date: 2026-07-01.
Plan: docs/superpowers/plans/2026-07-01-gdn-blocked-solve-poc-phase74.md.
Artifact: /home/mudler/bench/phase74_gdn_blocked_solve_poc/20260701_143711.
Source baseline: 14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output.
Result type: standalone CUDA microbenchmark only; no llama.cpp source change.
Toolchain: CUDA 13.0.88, nvcc -O3 -arch=sm_121a.
Hardware: NVIDIA GB10, cc=12.1, 48 SMs, 99 KB dynamic shared memory.
Shape: C=64, DK=128, DV=128, chunks=4096, iters=1000.
Shared memory: direct solve/apply 81920 bytes; inverse-plus-apply 98304 bytes.

Result:

case	direct ms	inverse+apply ms	inverse/direct speed	direct NMSE	inverse NMSE	direct max abs	inverse max abs	max lower row sum
weak decay	`3.263936`	`5.493515`	`0.5941x`	`2.081e-14`	`2.755e-15`	`8.890e-07`	`2.415e-07`	`4.072`
mixed decay	`3.275959`	`5.527584`	`0.5927x`	`1.981e-14`	`7.541e-16`	`8.115e-07`	`7.888e-08`	`1.635`

Decision:

Reject this explicit inverse-plus-apply shape as a backend source candidate on GB10. It is numerically clean but materially slower than direct solve/apply, running at about 0.59x the direct speed in both decay cases.
Do not touch ggml/src/ggml-cuda/gated_delta_net.cu for the larger C=64 path based on this attempt.
A future GDN source-work gate would need a substantially different tensor-core blocked solve/register-state design, not this shared-memory inverse scaffold.
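
The Phase74 numbers can be re-derived from the table: the speed column is the direct time divided by the inverse-plus-apply time, and both shared-memory footprints have to fit the per-block dynamic limit. A minimal sketch, assuming "99 KB" means 99 * 1024 bytes; the helper names are illustrative:

```python
SMEM_LIMIT_BYTES = 99 * 1024  # assumed reading of "99 KB dynamic shared memory"

# case -> (direct ms, inverse+apply ms) from the result table.
CASES = {
    "weak decay": (3.263936, 5.493515),
    "mixed decay": (3.275959, 5.527584),
}
# variant -> dynamic shared memory bytes from the header block.
SMEM_BYTES = {"direct solve/apply": 81920, "inverse-plus-apply": 98304}


def inverse_over_direct_speed(direct_ms, inverse_ms):
    """Speed of the inverse path relative to direct (below 1.0 means slower)."""
    return direct_ms / inverse_ms


def smem_headroom(used_bytes, limit_bytes=SMEM_LIMIT_BYTES):
    """Bytes left under the per-block limit; negative would not launch."""
    return limit_bytes - used_bytes


for case, (direct_ms, inverse_ms) in CASES.items():
    print(f"{case}: {inverse_over_direct_speed(direct_ms, inverse_ms):.4f}x")
for variant, used in SMEM_BYTES.items():
    print(f"{variant}: {smem_headroom(used)} bytes of headroom")
```

Under that assumption the inverse-plus-apply scaffold leaves very little shared memory headroom, so a faster variant would need a different layout rather than more staging buffers.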

Phase73: Datacenter Blackwell Rerun Readiness

Date: 2026-07-01.
Plan: docs/superpowers/plans/2026-07-01-datacenter-blackwell-rerun-readiness-phase73.md.
Artifact: no new benchmark artifact.
Source baseline: 14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output.
Result type: harness/spec audit only.

Evidence:

Phase72 is the current GB10 serving baseline. Default llama decode/vLLM ratios remain 0.7561, 0.7158, and 0.6935 at n=8/32/128.
Grouped-MMQ/W4A16: Phase61 direct activation was the last structurally distinct W4A16 shortcut; it failed its keep gate and stayed far behind default FP4-MMQ. Phase66 quantize plus gather was only 5.10%, below the source-funding threshold.
GDN: Phase71 kept shipped M5 as default. The remaining GDN gap is a larger FLA/CuteDSL-class C=64 blocked-solve/register-state implementation, not another C32/QS/global-Ai/local reorder.
Harness: paged-current-serving-snapshot.sh already records hardware_class=datacenter_blackwell for B200/B100/GB200, supports DRY_RUN=1, SERVED_MODEL_NAME, and vLLM deployment overrides.

Decision:

Do not start more GB10 grouped-MMQ/W4A16 source work.
Do not start GDN backend source work until a standalone C=64 blocked-solve PoC records timing, numerical error, and resource estimates.
The next parity run should be on datacenter Blackwell hardware with the existing same-session serving harness plus graph-node decode profiles.
No parity claim is made by this phase.

Phase72: TTFT Min32 Broader Serving

Date: 2026-07-01.
Plan: docs/superpowers/plans/2026-07-01-ttft-min32-serving-phase72.md.
Artifact: /home/mudler/bench/phase72_ttft_min32_serving/20260701_160730.
Source: 14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output.
Shape: MoE serving, NPL=8 32 128, prompt 128, generation 64, PARALLEL=128, CTX=131072.
Env gate: LLAMA_TTFT_PREFILL_FIRST=1 LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32.

Gates:

| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
| --- | --- | --- | --- | --- |
| pre default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| pre min32 | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | not run | not run |
| post default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | not run | not run |
| post min32 | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | not run | not run |

Result:

Reject default-on for min32 in the broader serving shape.
Keep the scheduler knob opt-in only.
min32 regressed aggregate, decode, TTFT, and wall time for every tested concurrency.

Phase71: GDN Tensor-Core Revalidation

Date: 2026-07-01.
Plan: docs/superpowers/plans/2026-07-01-gdn-tc-revalidation-phase71.md.
Artifact: /home/mudler/bench/phase71_gdn_tc_revalidation/20260701_153425.
Source: 14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output.
Shape: MoE prefill, PP=512,2048, TG=4, B=32, CTX=131072.

Canonical gates:

| gate | env | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` |
| --- | --- | --- | --- | --- | --- | --- |
| default | none | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | `1146/1146` | `806/806` |
| sequential-disabled | `GDN_CHUNK_MIN=2147483647` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | not run | not run |
| serial-chunked | `GDN_TC=0 GDN_CHUNK_MIN=64` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | not run | not run |
| forced M5 | `GDN_TC=4 GDN_CHUNK_MIN=64` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | not run | not run |

MoE prefill:

| arm | npp | S_PP t/s | T_PP s | S_TG t/s | total S t/s |
| --- | --- | --- | --- | --- | --- |
| default | `512` | `2313.57` | `7.082` | `401.82` | `2231.28` |
| sequential-disabled | `512` | `2198.28` | `7.453` | `392.50` | `2122.58` |
| serial-chunked | `512` | `1787.49` | `9.166` | `396.23` | `1740.12` |
| forced M5 | `512` | `2323.18` | `7.052` | `393.62` | `2238.13` |
| default | `2048` | `2422.88` | `27.049` | `389.91` | `2398.50` |
| sequential-disabled | `2048` | `2361.22` | `27.755` | `386.08` | `2337.91` |
| serial-chunked | `2048` | `1699.77` | `38.556` | `389.48` | `1688.69` |
| forced M5 | `2048` | `2420.52` | `27.075` | `388.72` | `2396.11` |

Ratios:

| npp | default/sequential S_PP | default/serial S_PP | forced/default S_PP |
| --- | --- | --- | --- |
| `512` | `1.0524` | `1.2943` | `1.0042` |
| `2048` | `1.0261` | `1.4254` | `0.9990` |
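The ratio rows follow directly from the S_PP column of the MoE prefill results. The sketch below recomputes them; the dictionary layout and the `ratio_row` helper are illustrative, and the throughput values are copied from the Phase71 prefill rows.

```python
# S_PP throughput (t/s) per arm, keyed by prompt length (npp).
S_PP = {
    512: {
        "default": 2313.57,
        "sequential-disabled": 2198.28,
        "serial-chunked": 1787.49,
        "forced M5": 2323.18,
    },
    2048: {
        "default": 2422.88,
        "sequential-disabled": 2361.22,
        "serial-chunked": 1699.77,
        "forced M5": 2420.52,
    },
}


def ratio_row(arms: dict) -> tuple:
    """Return (default/sequential, default/serial, forced/default), 4 decimals."""
    return (
        round(arms["default"] / arms["sequential-disabled"], 4),
        round(arms["default"] / arms["serial-chunked"], 4),
        round(arms["forced M5"] / arms["default"], 4),
    )


ratios = {npp: ratio_row(arms) for npp, arms in S_PP.items()}

for npp, row in ratios.items():
    print(npp, row)
```

A forced/default ratio within about half a percent of 1.0 at both prompt lengths indicates the default path already selects M5-equivalent behavior for these shapes.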

Decision:

Keep shipped GDN M5 default behavior.
Do not reopen smaller GDN C32/QS/global-Ai32/kernel-reorder work on GB10.
The stale "two-Gram PoC before M5 exists" framing is superseded by the existing 0047 M5 implementation and this revalidation.

Phase70: BF16 F32 Output Broader Serving

Date: 2026-07-01.
Plan: docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md.
Artifact: /home/mudler/bench/phase70_bf16_broader_serving/20260701_151500.
Source: 14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output.
Shape: MoE serving, NPL=8 32 128, prompt 128, generation 64, PARALLEL=128, CTX=131072.

Gates:

| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
| --- | --- | --- | --- | --- |
| pre default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| pre opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | not run |
| post default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | not run |

Result:

Default-on rejected.
Opt-in remains correctness-clean, but broad serving is mixed-to-negative.

Phase69: Patch Series Mirror Readiness

Date: 2026-07-01.
Plan: docs/superpowers/plans/2026-07-01-patch-series-mirror-readiness-phase69.md.
Artifact: local dry-run only.
Result: current 0001..0063 series matched Phase37 tree dedb1182910eafe9f6875588dc8285bfb544cce5; projected 0064..0073 matched fork HEAD tree fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4.
Decision: patch regeneration is technically ready but blocked on explicit push approval by policy.

Phase68: BF16 F32 Output Dense Serving

Date: 2026-07-01.
Plan: docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md.
Artifact: /home/mudler/bench/phase68_bf16_dense_serving/20260701_145710.
Serving artifact: /home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249.

Dense prefill:

| npp | default S_PP | opt-in S_PP | change |
| --- | --- | --- | --- |
| `512` | `973.13` | `975.52` | `+0.25%` |
| `2048` | `1019.88` | `1021.39` | `+0.15%` |

MoE serving N=128, prompt 128, generation 128:

| metric | default | opt-in | change |
| --- | --- | --- | --- |
| `agg_tps` | `409.8` | `415.0` | `+1.27%` |
| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` |
| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` |
| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` |
| `wall_s` | `39.978` | `39.480` | `-1.25%` |
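The change column is the relative difference of opt-in against default. A small sketch that recomputes it and flags direction; treating `ttft_mean_ms` and `wall_s` as lower-is-better is an assumption made here, and the helper names are illustrative. The metric values are copied from the Phase68 serving rows.

```python
# (default, opt-in) per metric, copied from the Phase68 MoE serving rows.
ROWS = {
    "agg_tps": (409.8, 415.0),
    "decode_agg_tps": (615.3, 627.2),
    "prefill_tps": (1630.2, 1648.0),
    "ttft_mean_ms": (8574.7, 8085.9),
    "wall_s": (39.978, 39.480),
}

# Assumption: latency and wall-clock metrics improve when they go down.
LOWER_IS_BETTER = {"ttft_mean_ms", "wall_s"}


def pct_change(default: float, opt_in: float) -> float:
    """Relative change of opt-in versus default, in percent, 2 decimals."""
    return round((opt_in - default) / default * 100.0, 2)


def improved(metric: str, default: float, opt_in: float) -> bool:
    """True when the opt-in value moved in the favorable direction."""
    delta = opt_in - default
    return delta < 0 if metric in LOWER_IS_BETTER else delta > 0


changes = {m: pct_change(d, o) for m, (d, o) in ROWS.items()}
wins = {m: improved(m, d, o) for m, (d, o) in ROWS.items()}

for metric in ROWS:
    print(f"{metric}: {changes[metric]:+.2f}% improved={wins[metric]}")
```

Every metric moves in the favorable direction in this single N=128 shape, which is why the decision is to carry the knob as opt-in pending broader serving evidence rather than flip the default.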

Decision:

Carry as default-off opt-in candidate pending broader serving evidence.

Phase67: BF16 cuBLAS F32 Output

Date: 2026-07-01.
Plan: docs/superpowers/plans/2026-07-01-bf16-cublas-f32-output-phase67.md.
Artifact: /home/mudler/bench/phase67_bf16_f32_out/20260701_144909.
Fork commit: ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output.
DGX mirror commit: 14fd69f1e.
Env gate: LLAMA_BF16_CUBLAS_F32_OUT=1.

Gates:

| mode | MoE md5 | dense md5 | `MUL_MAT` |
| --- | --- | --- | --- |
| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |
| opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |

MoE prefill:

| npp | default S_PP | opt-in S_PP | change |
| --- | --- | --- | --- |
| `512` | `2347.41` | `2402.34` | `+2.34%` |
| `2048` | `2440.18` | `2456.54` | `+0.67%` |

Decision:

Keep default-off pending dense and serving A/B.
