diff --git a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md index 1ce3ac064..2cd5b9125 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md @@ -11,16 +11,2254 @@ with artifact path, gates, benchmark rows, and decision. - Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. - Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. - Current tested source: DGX mirror - `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. -- Latest attempt: Phase81. -- Latest decision: default-off `LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16` is a - promising carry-forward candidate, not default-on. Same-source decode-only - profiling cut normalized `gdn_core` from about `2.34 ms/launch` to - `1.20 ms/launch` and total kernel time from `3.6157 s` to `3.5244 s`, but MoE - greedy md5 changed from the paged canonical `8cb0...` to `07db...`. Dense md5 - stayed canonical and `MUL_MAT`/`MUL_MAT_ID` gates stayed green. Phase82 must - run the full f16-reference KL gate and serving A/B before this can be promoted - beyond an opt-in experiment. + `/home/mudler/llama-phase93-qwen3next-gqa-bcast`, local guardrail stack plus + Qwen3Next grouped Q/K broadcast for fused GDN. +- Latest attempt: Phase141 GDN decode-only noise-floor repeat. +- Latest decision: recurrence-level GDN source A/B must normalize by launch + count or control the decode capture window tightly. Phase141 ran five + identical current-binary decode-only captures with pre/post gates green. Raw + `gdn_core_ms` had median `1415.500`, stdev `30.641`, CV `2.146%`, and range + `1410.300..1482.140 ms`, mostly because capture windows recorded `597`, + `598`, `600`, or `630` `gdn_core` launches. Normalized + `gdn_core_ms_per_launch` was much steadier: median `2.359167`, stdev + `0.005399`, CV `0.229%`, range `2.352603..2.366917 ms`. A future + recurrence-level source patch must beat `max(2.0%, 3 * same-binary stdev)` + on repeated A/B medians, using per-launch GDN core when launch counts drift; + for Phase141 that means at least `6.49%` raw `gdn_core` reduction or `2.0%` + launch-normalized reduction. Phase140 still rejects prep-only L2 fusion. The + most defensible small source follow-up is a default-off scalar gate/beta + hoist inside `gated_delta_net_cuda`; the vLLM-style packed decode recurrence + remains a larger redesign, not a shortcut. + Phase137 was rejected with no source changes: `GDN_NW=4 GDN_CPW=1` improved + isolated 1-token GDN rows but regressed real serving versus Phase135 + (`208.0/332.7 -> 206.2/324.9` aggregate/decode t/s, `gdn_core` + `5926.55 -> 6466.27 ms`). Phase135 remains the current best default-off + routed-FFN base without Phase138 finalize, but not parity. Phase135 adds + `LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1` on top of + `LLAMA_MOE_ROUTED_FFN_POC=1`: it computes `silu(gate) * up` directly into + the NVFP4 MMQ activation layout and launches raw down MMQ, skipping both the + sorted F32 buffer and the separate activation-quant kernel. Focused gates and + canonical opt-in gates passed; trace proved six `mmq_moe_quantized_raw` + launches and zero `mmq_moe_sorted_raw` launches. Focused perf was mixed but + better at the larger sentinel: default `805.92/1031.06 us`, Phase135 + `807.92/1024.97 us` for `n=128/257`. The same opt-in serving profile at the + Phase130 shape passed pre/post gates and improved decode aggregate t/s + `326.9 -> 332.7`, while `mmq_nvfp4` dropped `6009.52 -> 5915.24 ms`; total + kernel time still rose slightly (`20.1559 -> 20.2498 s`) because GDN and + projection buckets moved up. Next work should either make this path + default-off-clean enough for broader serving comparisons, or attack the + remaining MoE launch/writeback overhead (`mmq_fixup`, route metadata, and + direct weighted combine) rather than another F32 intermediate. Phase134 is + kept as a default-off fused-SWIGLU structural base, + not as a promoted speedup. Phase134 adds + `LLAMA_MOE_ROUTED_FFN_FUSED_SWIGLU=1` on top of + `LLAMA_MOE_ROUTED_FFN_POC=1`: it executes `gate_up`, computes + `silu(gate) * up` directly into expert-sorted F32 rows, then calls the raw + MMQ down helper. Selected opt-in gates passed `13/13`; trace proved six raw + sorted launches; canonical opt-in gates passed MoE/dense md5, + `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. + Focused perf was mixed: default `804.92/1026.02 us`, Phase134 + `810.61/1025.68 us` for `n=128/257`. It removes the Phase133 standalone + `glu -> get_rows` boundary and recovers n=257, but the extra fused-SWIGLU + kernel is still slower at n=128. Next work should fuse SWIGLU directly into + the down-MMQ quant buffer, or otherwise remove one more launch/buffer. + Phase133 remains only as a default-off structural base for the + next fused routed-FFN slice, not as a speedup. Phase133 adds + `LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1` on top of + `LLAMA_MOE_ROUTED_FFN_POC=1`: it keeps baseline `gate_up` and `SWIGLU`, + gathers the computed SWIGLU output into expert-sorted compact F32 rows, and + calls a raw MMQ down helper without constructing fake tensors. Default and + opt-in canonical gates passed with canonical MoE/dense md5s, + `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`; + selected default/Phase132/Phase133 gates passed `13/13`, and trace proved + six `mmq_moe_sorted_raw` launches. Focused perf was not a win: + default `807.37/1020.76 us`, Phase132 `808.21/1018.87 us`, Phase133 + `808.85/1026.87 us` for `n=128/257`. The next phase must fuse + SWIGLU-to-sorted or SWIGLU-to-quant to remove the added gather/quant boundary; + do not promote sorted-down as-is. Phase132 remains the cleaner default-off + scaffold if Phase133 needs to be bypassed. Phase131 challenged the Phase130 fork with two read-only + source explorers. Both rejected another cheap source patch: MoE/FFN-GEMM work + should not continue unless it funds a real fused routed-FFN kernel/executor, + and GDN work should not continue unless it materially changes the f32 + recurrent-state traffic without BF16/quality drift. The next active line is + therefore a default-off fused routed-FFN PoC scoped from vLLM's real fused MoE + design and llama.cpp's current `gate_up -> SWIGLU -> down` executor hook. + Phase131 is a no-source decision/architecture attempt, not a speedup claim. + Keep carrying the Phase93 Qwen3Next GQA-repeat removal + candidate as a decode-profile positive, but it does not close serving parity. + Phase130 refreshed the current-stack graph-node serving profile after the + Phase129 rejection. Pre/post gates stayed green and the profile confirms the + live serving bottleneck remains split between `mmq_nvfp4` (`6009.52 ms`, + `29.82%`) and `gdn_core` (`5891.40 ms`, `29.23%`), with FA only `1.28%` and + get-rows only `1.39%`. This rejects the paged-mask/F16 get-rows idea as the + next source patch and keeps the next credible work on either a larger + MoE/FFN-GEMM executor/kernel or a larger GDN recurrence redesign. Phase129 + tested a default-off Qwen35/Qwen35MoE grouped Q/K broadcast probe for + fused GDN, reusing the existing Qwen3Next op-param path. The default path was + md5/op clean, but the valid opt-in gate changed the MoE greedy md5 to + `b773e2f032aa0e992626d486b321808e`, so the source was rejected and reverted. + Do not port Qwen3Next grouped-broadcast semantics to Qwen35/Qwen35MoE under + the current bit-exact rule. Phase128 scoped the Qwen3Next BF16 GDN S-cache + idea and rejected/reverted the + source probe for the current target: the active `q36-35b-a3b-nvfp4.gguf` + model loads as `qwen35moe`, no true Qwen3Next GGUF was found on DGX, and the + existing Qwen35/Qwen35MoE BF16 S-cache lever was already rejected by the + Phase82 f16-reference KL gate. Phase127 tested the first whole-MoE + expert-major executor using the Phase126 helper; it passed selected + correctness and emitted expert-major markers, but was rejected and reverted + because focused perf regressed `MOE_SWIGLU_DOWN` at both n=128 and n=257. + Phase126 remains the kept scaffold. + Phase104 measured the combined cleanup stack in the normal same-session + serving harness against vLLM at `N=128`. It is md5/op clean and modestly + improves paged serving versus Phase97 (`agg_tps 329.6 -> 338.6`, + `prefill_tps 1734.5 -> 1813.0`, `TTFT 7415.4 -> 7121.6 ms`), but it is not + parity-closing: paged/vLLM is `0.6574` on decode and `0.5122` on aggregate. + Phase105 refreshed the current-stack grouped-MMQ evidence: ragged MoE and + full `MUL_MAT_ID` gates still pass, serving launch traces still have + `fixup=0` and `stream_k_blocks == ntiles_dst`, and the simple live request + landed in density-10 prefill-like shapes (`mmq_x_best=112`) rather than a new + small-M decode opportunity. Phase106 then tested the C1 high-concurrency + operating-point hypothesis at `N=128/192/256`; vLLM completed all legs and + stayed ahead, so C1 is rejected for the current GB10 stack. Do not add another + MMQ micro-policy patch or scheduler shortcut. Phase107 established the + existing fused-MoE correctness guardrails and found that `test-backend-ops + perf` did not emit timing rows for these custom whole-graph cases. Phase108 + added the missing measurement-only harness by exposing the existing MoE + whole-graph cases to perf mode and expanding CSV output to include timing + fields. Use these timings to rank fused routed-MoE work; do not start a fused + kernel without improving one of these rows and preserving md5/op gates. + Phase109 tested the existing default-off W4A16 and FP4 large-M MoE routes, + plus the cheapest grouped-MMQ density/tile-policy knobs, on the Phase108 rows. + All selected op gates passed, but none of the env-only routes is a useful + parity lever: W4A16 and FP4 large-M are much slower at `n_tokens=257`, while + `LLAMA_MOE_DENSITY_MAX=9` / `LLAMA_MOE_MMQ_X=64` are noise-level on + `MUL_MAT_ID_RAGGED_MOE` and do not help `MOE_SWIGLU_DOWN`. The next credible + implementation target is GPU-side routed-MoE metadata construction for the + host-sync fallback/grouped path, taking the vLLM `moe_align_block_size` / + permute-unpermute design as the reference, not importing vLLM wholesale. + Phase110 implemented that first default-off CUDA metadata branch behind + `LLAMA_MOE_GPU_SORT=1`, reusing `mm_ids_helper` and adding a tiny inverse + permutation kernel for the fallback `get_rows` contract. The initial branch + failed `3/13` selected opt-in rows because `mm_ids_helper`'s `ids_dst` is + sorted-to-original while fallback `get_rows` needs original-to-sorted; the + inversion fix made default, W4A16, and W4A16+GPU-sort selected gates `13/13`, + and canonical md5/op gates stayed green. Keep Phase110 as a default-off + structural base only: it improves W4A16 fallback 257-token rows by `7-8%`, + but remains `~1.5x` slower than default grouped-MMQ, so it is not a parity + win by itself. + Phase111 then tried to remove the remaining W4A16 fallback host descriptor + construction by building `w4a16_tile_desc` on GPU from `expert_bounds_dev`. + The first compile needed a pointer mutability fix, then the first runtime + attempt hit a CUDA pool LIFO assertion because the outer expert-bounds + allocation was freed after an inner later allocation. After fixing that, + selected gates passed for the new `LLAMA_W4A16_GPU_TILES=1` path, but clean + perf was flat-to-negative versus Phase110 (`MUL_MAT_ID_RAGGED_MOE n=257` + regressed about `2.0%`). The Phase111 source was reverted; post-revert + W4A16+GPU-sort selected gates passed `13/13`. Do not carry a GPU tile + descriptor path unless it is part of a larger direct-A or graph-safe W4A16 + redesign that removes more than one host-sync/launch bottleneck. + Phase112 implemented the existing default-off `LLAMA_W4A16_DIRECT_A=1` hook + for W4A16 grouped MoE, staging bf16 activations directly from original `src1` + through `ids_to_sorted` instead of materializing a sorted f32 buffer and then + casting it. Selected gates passed for W4A16+GPU-sort, direct-A alone, and + direct-A+GPU-sort (`13/13` each). The useful arm is direct-A+GPU-sort: + `MUL_MAT_ID_RAGGED_MOE n=257` improved `2278.50 -> 2166.22 us` (`+4.93%`) + and `MOE_SWIGLU_DOWN n=257` improved `1551.08 -> 1477.74 us` (`+4.73%`) + versus Phase112's W4A16+GPU-sort control, while the 128-token rows were + neutral/slightly negative. Canonical README md5 gates are green + (`8cb0ce23`, `5951a5b4`) and compact op gates are green on the supported + rows. Keep Phase112 default-off as the next structural base; do not make it + default-on because W4A16 fallback remains slower than the default grouped-MMQ + path. + Phase113 tried the combined follow-up: + `LLAMA_W4A16_DIRECT_A=1 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1`. + It built W4A16 tile descriptors from GPU expert bounds and launched over a + zero-initialized `max_tiles` grid to avoid even the one-int tile-count + readback. Selected correctness stayed green (`13/13`), but perf did not meet + the keep threshold: `MOE_SWIGLU_DOWN n=257` was effectively flat + (`1478.16 -> 1476.36 us`) and `MUL_MAT_ID_RAGGED_MOE n=257` regressed + (`2148.44 -> 2214.23 us`). The Phase113 source was reverted; post-revert + Phase112 direct-A+GPU-sort selected gates passed `13/13`. + Phase114 then implemented the vLLM-style padded routing contract behind + `LLAMA_W4A16_PADDED_META=1`: separate padded source ids, padded destination + ids, expert ids per M block, a padded W4A16 expert-id consumer mode, and a + direct scatter that skipped the old compact `get_rows_cuda` restore. It was + correctness-clean (`13/13`) but failed the performance gate. Initial artifact: + `/home/mudler/bench/phase114_w4a16_padded_routing/20260701_234634_padded_meta`; + fix1 artifact: + `/home/mudler/bench/phase114_w4a16_padded_routing/20260701_235003_padded_meta_fix1`. + Fix1 added `num_tokens_post_pad` early returns for padded gather/scatter, but + 257-token rows still regressed (`MOE_SWIGLU_DOWN 1477.88 -> 1726.27 us`, + `MUL_MAT_ID_RAGGED_MOE 2163.35 -> 2650.93 us`). The source was reverted and + post-revert Phase112 direct-A+GPU-sort selected gates passed `13/13`. + Phase115 then re-tested the existing default-off MoE small-M MMQ tile knob on + the current Phase108 whole-graph sentinels rather than adding another patch. + Artifact: + `/home/mudler/bench/phase115_moe_small_m_sentinel/20260702_020258`. + Control and `LLAMA_MOE_SMALL_M_TILE=16/32/64` all passed the selected + `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` correctness gate (`13/13` each), but + none met the promotion rule. The best 128-token rows were tiny/noise-level + wins, while every capped env regressed the 257-token ragged row + (`1452.30 us` control vs `1455.02`, `1458.71`, `1456.88 us`). Reject + small-M row shaping as a parity lever; the next phase should scope a true + fused routed-MoE kernel or a graph-level fusion target that removes materialized + activation/output traffic. + Phase116 implemented that graph-level probe as a default-off CUDA-only + detector for the plain `GLU -> down MUL_MAT_ID` pattern: + `LLAMA_MOE_SWIGLU_DOWN_FUSED_QUANT=1`. The candidate computed + `silu(gate) * up` directly into the existing grouped-MMQ NVFP4 activation + buffer, leaving the MMQ kernel and graph API unchanged. Artifact: + `/home/mudler/bench/phase116_moe_swiglu_down_fused_quant/20260702_022611`. + Correctness passed (`13/13`) and the fix1 route emitted the fused trace marker + (`6` hits), but perf failed the promotion gate: `MOE_SWIGLU_DOWN n=257` was + flat (`1024.90 -> 1024.69 us`), `n=128` regressed (`806.33 -> 808.79 us`), + and the non-fused ragged sentinel drifted slower. Source was reverted and the + post-revert selected gate passed `13/13`. Do not retry a standalone fused + SwiGLU-to-MMQ-activation-quant path; the next fused-MoE attempt must remove a + larger boundary than one activation materialization. + Phase117 added default-off boundary tracing/timing around the route-sort, + activation quantization, grouped-MMQ launch, GLU, and whole-graph pattern + detector. Artifact: + `/home/mudler/bench/phase117_moe_route_once_boundary/20260702_024140`. + The first timing run proved inline CUDA events are incompatible with CUDA + graph capture (`cudaEventSynchronize` on a capturing stream), so the trace was + guarded to emit `us=-1` during capture and real timings only with + `GGML_CUDA_DISABLE_GRAPHS=1`. Post-guard selected gates passed (`13/13`), + trace mode passed (`7/7`), and canonical gates passed: MoE md5 `8cb0ce23`, + dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. + No new runtime optimization is promoted from Phase117. The timing attribution + rejects another small route-sort or standalone GLU/quant shortcut; the next + funded MoE source phase needs a larger pipeline boundary: shared route + metadata across gate_up/down and/or an executor that owns + GEMM1->activation->GEMM2 rather than another local micro-fusion. + Phase118 tested a default-off route metadata cache/reuse prototype. Artifact: + `/home/mudler/bench/phase118_moe_route_cache/20260702_030549`. + The first preflight command falsely detected `local-ai-worker` because the + check matched its own shell text; the corrected `pgrep -x local-ai-worker` + preflight was clean. The cache candidate (`LLAMA_MOE_ROUTE_CACHE=1`) was + correctness-clean and did hit (`23` hits, `3` misses on the trace row), but + did not meet the keep rule: `MOE_SWIGLU_DOWN n=257` improved only + `1017.711 -> 1011.915 us` (`+0.57%`) and `n=128` regressed + `799.360 -> 803.738 us` (`-0.55%`). Runtime cache source was reverted; the + post-reject selected gate passed `13/13`. Keep only the local ids metadata + helper refactor if final checks remain clean. This closes route-cache as a + standalone parity lever; next MoE work needs a larger executor boundary than + skipping one metadata build. + Phase119 added a default-off whole-pattern contract trace for + `gate_up MUL_MAT_ID -> views -> SWIGLU -> down MUL_MAT_ID`. Initial artifact: + `/home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_034729`; + fix1 artifact: + `/home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_035126_fix1`. + The initial trace proved coverage but exceeded the trace-overhead rule on + `MOE_SWIGLU_DOWN n=257` (`1015.070 -> 1028.937 us`, `-1.35%`). Fix1 moved + detector work fully off the default path unless a trace env is enabled. It is + correctness-clean (`13/13` selected, `7/7` trace), canonical md5/op clean + (MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT 1146/1146`, + `MUL_MAT_ID 806/806`), and trace overhead is within rule: + `MOE_SWIGLU_DOWN n=128` `805.400 -> 805.584 us` (`-0.02%`) and `n=257` + `1019.715 -> 1021.836 us` (`-0.21%`). Keep Phase119 as default-off + diagnostic/contract scaffolding only. The next source phase is allowed to + implement a guarded executor, but the executor must match at the earlier + `gate_up MUL_MAT_ID` node so it can own `GEMM1->activation->GEMM2` and skip + the remaining nodes; the current GLU hook is validation-only because GEMM1 + has already executed. + Phase120 added that earlier default-off matcher/trace at the + `gate_up MUL_MAT_ID` node. Initial artifact: + `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040153`; + fix2 artifact: + `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040725_fix2`. + The initial/fix1 traces proved `skip_ready=4` but emitted noisy unsupported + candidates from unrelated `MUL_MAT_ID` rows; fix2 gates output on the actual + `gate/up` view pair only. Fix2 is correctness-clean (`13/13` selected, + `7/7` early trace), canonical md5/op clean (MoE `8cb0ce23`, dense + `5951a5b4`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), and early trace + overhead stays within rule: `MOE_SWIGLU_DOWN n=128` `803.937 -> 808.978 us` + (`-0.62%`) and `n=257` `1020.412 -> 1026.073 us` (`-0.55%`). Keep Phase120 + as the executor entry-point scaffold. The next source phase should add a + default-off executor that starts from this early matcher, first proving safe + ownership/skip accounting, then moving route-plan reuse and fused activation + into that helper. + Phase121 added that default-off executor proof behind + `LLAMA_MOE_WHOLE_PATTERN_EXEC=1`. Initial artifact: + `/home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041543`; + fix1 artifact: + `/home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041739_fix1`. + The initial run passed gates but emitted zero exec markers because the exec + path was incorrectly nested under the early-trace env. Fix1 made exec + detection depend on either exec or trace env. It is correctness-clean + (`13/13` selected, `7/7` exec), canonical md5/op clean (MoE `8cb0ce23`, + dense `5951a5b4`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), and emits + `skip=4` markers for the six supported MoE rows. Perf is neutral for the + target sentinel: `MOE_SWIGLU_DOWN n=128` `807.772 -> 806.051 us` (`+0.21%`) + and `n=257` `1021.115 -> 1020.839 us` (`+0.03%`). Keep Phase121 as the + executor ownership/skip-accounting proof only. The next real optimization + phase should replace one internal boundary inside this helper, starting with + route-plan reuse or activation-in-route-order, while preserving this md5/op + contract. + Phase122 tested route-plan reuse inside the Phase121 executor by exposing + `ggml_cuda_mmq_ids_meta` and passing one built route to both `gate_up` and + `down` MMQ calls behind `LLAMA_MOE_WHOLE_PATTERN_SHARED_ROUTE=1`. Artifact: + `/home/mudler/bench/phase122_moe_shared_route_meta/20260702_043212`. + Correctness was clean (`13/13` selected, `7/7` shared-route), but the target + `MOE_SWIGLU_DOWN n=257` row regressed versus the Phase121 executor + (`1020.850 -> 1051.666 us`, `-3.02%`) and `n=128` also missed the keep + threshold (`808.190 -> 811.836 us`, `-0.45%`). The source was reverted, + including the public MMQ metadata API. Post-reject gates on the reverted tree + passed (`13/13` selected, `7/7` executor) with six retained Phase121 exec + markers. Do not retry route-only metadata reuse; the next MoE executor phase + should attack activation/down data layout, direct activation-to-down input, + or a larger fused GEMM1->activation->GEMM2 boundary. + Phase123 tested that direct activation-to-down input boundary inside the + Phase121 executor. Artifact: + `/home/mudler/bench/phase123_moe_executor_fused_down_input/20260702_025811`. + The candidate added an NVFP4-only fused `silu(gate) * up -> down MMQ + activation buffer` path behind + `LLAMA_MOE_WHOLE_PATTERN_FUSED_DOWN=1`. Correctness passed (`13/13` + selected, `7/7` fused-down, six fused markers), but perf was flat and missed + the keep rule: versus Phase121 exec, `MOE_SWIGLU_DOWN n=128` was + `811.153 -> 810.618 us` (`+0.07%`) and `n=257` was + `1023.090 -> 1023.657 us` (`-0.06%`). Source was reverted; post-reject + selected and Phase121 exec gates passed (`13/13`, `7/7`, six exec markers). + Do not retry standalone fused-down quantization. The next MoE source attempt + must either own the full expert-major packed pipeline + `GEMM1->activation->GEMM2` or pivot to another measured bottleneck. + Phase124 refreshed the current-stack graph-node serving profile after the + Phase122/123 rejections. Artifact: + `/home/mudler/bench/phase124_current_moe_profile/20260702_031205`. + Pre/post gates were green (MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, + `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`). Serving under graph-node + profiling at `N=128`, prompt `128`, generation `64` was + `agg_tps 206.2`, `decode_agg_tps 320.3`, `prefill_tps 1536.4`, wall + `39.738s`. The fine buckets explain the Phase122/123 failures: + `mmq_nvfp4` is now the largest fine bucket (`6074.78 ms`, `30.17%`) and + `gdn_core` remains essentially tied (`5888.31 ms`, `29.25%`), while + `act_quant` is only `674.88 ms` (`3.35%`). Next work should target either a + full expert-major MoE pipeline that materially reduces `mmq_nvfp4` or a GDN + source experiment that materially reduces `gdn_core`; one-boundary + activation/route shortcuts are no longer funded. Phase125 scoping used two + independent code explorers plus a local GDN audit. The challenged conclusion + is that another GDN micro-patch is not funded: prior geometry/store/broadcast + and conv-state attempts already exhausted the small safe space, while a + useful GDN change would be a larger recurrence redesign. The next source + attempt should therefore test the first maintainable slice of a vLLM-style + expert-major MoE pipeline: a default-off MMQ sorted-output primitive that + still uses expert bounds but writes sorted rows, then immediately unsorts as + a proof. Only if that primitive is correctness clean and materially improves + `MOE_SWIGLU_DOWN` should the following phase proceed to a full + `gate_up -> SWIGLU -> down` expert-major executor. + +### Phase141: GDN Decode-Only Noise Floor + +- Date: 2026-07-02. +- Spec: + `docs/superpowers/specs/2026-07-02-gdn-decode-noise-floor-phase141-design.md`. +- Plan: + `docs/superpowers/plans/2026-07-02-gdn-decode-noise-floor-phase141.md`. +- Result type: measurement-only; no llama.cpp source changes. +- Artifact: + `/home/mudler/bench/phase141_gdn_decode_noise_floor/20260702_090428`. +- Summary files: + - `/home/mudler/bench/phase141_gdn_decode_noise_floor/20260702_090428/summary.tsv` + - `/home/mudler/bench/phase141_gdn_decode_noise_floor/20260702_090428/runs.tsv` + +Setup: + +- Current patched Phase93 binary: + `/home/mudler/llama-phase93-qwen3next-gqa-bcast/build/bin`. +- Env: + `LLAMA_MOE_ROUTED_FFN_POC=1`, + `LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1`, + `LLAMA_MOE_ROUTED_FFN_FINALIZE_POC=1`. +- Harness: + `/home/mudler/bench/phase77_moe_decode_only_profile.sh`. +- Shape: + `N=128 N_PREDICT=2048 DEPTH_TARGET=64 CAPTURE_SECONDS=4 CTX=131072 PARALLEL=128 BATCH=2048 UBATCH=512`. + +Gates: + +- All five runs passed pre/post canonical gates: + MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 + `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`, and + `MUL_MAT_ID 806/806`. + +Run summary: + +| run | total kernel s | GDN ms | GDN launches | `gdn_core` ms | `gdn_core` launches | `gdn_core` ms/launch | `mmq_nvfp4` ms | `mmq_nvfp4` launches | +|-----|---------------:|-------:|-------------:|--------------:|--------------------:|---------------------:|---------------:|---------------------:| +| 1 | `3.553400` | `1500.210000` | `3000` | `1420.150000` | `600` | `2.366917` | `1315.460000` | `4816` | +| 2 | `3.708300` | `1492.230000` | `2994` | `1410.300000` | `598` | `2.358361` | `1470.550000` | `4801` | +| 3 | `3.678100` | `1566.780000` | `3150` | `1482.140000` | `630` | `2.352603` | `1336.250000` | `5061` | +| 4 | `3.698400` | `1495.970000` | `3000` | `1415.500000` | `600` | `2.359167` | `1458.510000` | `4820` | +| 5 | `3.620900` | `1490.630000` | `2985` | `1410.870000` | `597` | `2.363266` | `1389.990000` | `4784` | + +Variance summary: + +| metric | median | mean | stdev | CV | min | max | +|--------|-------:|-----:|------:|---:|----:|----:| +| `total_kernel_s` | `3.678100` | `3.651820` | `0.064600` | `1.769%` | `3.553400` | `3.708300` | +| `gdn_ms` | `1495.970000` | `1509.164000` | `32.419626` | `2.148%` | `1490.630000` | `1566.780000` | +| `gdn_core_ms` | `1415.500000` | `1427.792000` | `30.641160` | `2.146%` | `1410.300000` | `1482.140000` | +| `mmq_nvfp4_ms` | `1389.990000` | `1394.152000` | `69.894566` | `5.013%` | `1315.460000` | `1470.550000` | +| `gdn_core_ms_per_launch` | `2.359167` | `2.360063` | `0.005399` | `0.229%` | `2.352603` | `2.366917` | + +Decision: + +- Raw decode-only `gdn_core` is not a reliable keep/reject metric by itself + unless capture launch counts are fixed; run 3 recorded `630` core launches + while the other runs recorded `597..600`. +- For future GDN source A/B, require repeated medians and either: + - raw `gdn_core` reduction above `max(2.0%, 3 * 30.641160 / 1415.500000) = + 6.49%`, or + - launch-normalized `gdn_core_ms_per_launch` reduction above `2.0%` + (`3 * 0.005399 / 2.359167 = 0.69%`, so the explicit floor dominates). +- This supports a very small default-off scalar gate/beta hoist probe if it can + be kept bit-exact and measured per launch. It does not support large packed + decode recurrence source work yet; that should wait for a broader spec. + +### Phase140: GDN Decode Prep Trace + +- Date: 2026-07-02. +- Spec: + `docs/superpowers/specs/2026-07-02-gdn-decode-prep-trace-phase140-design.md`. +- Plan: + `docs/superpowers/plans/2026-07-02-gdn-decode-prep-trace-phase140.md`. +- Result type: measurement-only; no llama.cpp source changes. +- Artifact: + `/home/mudler/bench/phase140_gdn_decode_prep_trace/20260702_085348`. +- Summary file: + `/home/mudler/bench/phase140_gdn_decode_prep_trace/20260702_085348/gdn_prep_kernel_summary.tsv`. + +Setup: + +- Current patched Phase93 binary: + `/home/mudler/llama-phase93-qwen3next-gqa-bcast/build/bin`. +- Env: + `LLAMA_MOE_ROUTED_FFN_POC=1`, + `LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1`, + `LLAMA_MOE_ROUTED_FFN_FINALIZE_POC=1`, + plus route/layout trace envs. +- Shape: + `N=128 PTOK=128 GEN=64 CTX=131072 PARALLEL=128 BATCH=2048 UBATCH=512`. + +Gates: + +| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Serving/profile result: + +| metric | value | +|--------|------:| +| `agg_tps` | `207.3` | +| `decode_agg_tps` | `328.9` | +| `decode_perseq_tps` | `2.11` | +| `prefill_tps` | `1490.6` | +| `ttft_mean_ms` | `8325.9` | +| `ttft_max_ms` | `14593.3` | +| `wall_s` | `39.501` | +| total kernel time | `20.2002 s` | + +Key buckets: + +| bucket | ms | +|--------|---:| +| `GDN` | `6673.66` | +| `gdn_core` | `5890.44` | +| `MoE/FFN-GEMM` | `6144.19` | +| `mmq_nvfp4` | `5918.31` | +| `gdn_conv` | `454.99` | +| `gdn_gather` | `227.92` | +| `gdn_l2norm` | `100.30` | +| `gdn_sigmoid` | `22.68` | + +Focused kernel summary: + +| kernel | count | ms | avg us | +|--------|------:|---:|-------:| +| `gated_delta_net_cuda` | `4650` | `5804.7074` | `1248.3242` | +| `k_bin_bcast` | `89426` | `1155.3901` | `12.9201` | +| `convert_unary` | `52060` | `659.7529` | `12.6729` | +| `concat_non_cont` | `2130` | `441.9353` | `207.4814` | +| `ssm_conv_update_ids_f32` | `2610` | `227.8964` | `87.3166` | +| `mul_mat_f` | `3670` | `227.7857` | `62.0669` | +| `ssm_conv_long_token_f32` | `1110` | `190.6664` | `171.7715` | +| `unary_gated_op_kernel` | `14340` | `184.3254` | `12.8539` | +| `rms_norm_gate_mul_f32` | `4740` | `170.0508` | `35.8757` | +| `rms_norm_f32` | `9798` | `114.3863` | `11.6745` | +| `rms_norm_pre_add_mul_f32` | `6160` | `108.2927` | `17.5800` | +| `cpy_scalar` | `5130` | `106.8951` | `20.8373` | +| `l2_norm_f32` | `9480` | `100.3024` | `10.5804` | +| `gated_delta_net_chunked_cuda` | `90` | `85.7367` | `952.6300` | + +Decision: + +- Reject an immediate in-GDN Q/K L2-normalization source patch for this shape. +- `l2_norm_f32` is above the absolute Phase139 noise floor + (`3 * 17.8110 ms = 53.433 ms`) but only about `1.7%` of `gdn_core`, below + the phase's `3%` materiality rule. +- Do not spend another phase on prep-only GDN micro-fusion unless a future + profile shows prep kernels above the materiality gate. +- Next GDN work should be recurrence-level, packed-state, or datacenter + Blackwell-specific, and still default-off with md5/op gates. + +### Phase139: Serving Noise-Floor Repeat + +- Date: 2026-07-02. +- Spec: + `docs/superpowers/specs/2026-07-02-serving-noise-floor-phase139-design.md`. +- Plan: + `docs/superpowers/plans/2026-07-02-serving-noise-floor-phase139.md`. +- Result type: measurement-only; no llama.cpp source changes. +- Artifact: + `/home/mudler/bench/phase139_serving_noise_floor/20260702_081901`. +- Summary files: + - `/home/mudler/bench/phase139_serving_noise_floor/20260702_081901/summary.tsv` + - `/home/mudler/bench/phase139_serving_noise_floor/20260702_081901/runs.tsv` + +Setup: + +- Current patched Phase93 binary: + `/home/mudler/llama-phase93-qwen3next-gqa-bcast/build/bin`. +- Env: + `LLAMA_MOE_ROUTED_FFN_POC=1`, + `LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1`, + `LLAMA_MOE_ROUTED_FFN_FINALIZE_POC=1`. +- Shape: + `N=128 PTOK=128 GEN=64 CTX=131072 PARALLEL=128 BATCH=2048 UBATCH=512`. +- Harness: + `/home/mudler/bench/phase76_current_moe_profile.sh`. + +Gates: + +- All seven runs passed pre/post canonical gates: + MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 + `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`, and + `MUL_MAT_ID 806/806`. + +Run summary: + +| run | agg t/s | decode agg t/s | wall s | kernel s | MoE ms | mmq_nvfp4 ms | gdn_core ms | mmq_fixup ms | ew_add ms | +|-----|--------:|---------------:|-------:|---------:|-------:|-------------:|------------:|-------------:|----------:| +| 1 | `212.3` | `333.6` | `38.586` | `19.5196` | `5642.07` | `5464.17` | `5877.57` | `104.64` | `371.81` | +| 2 | `208.6` | `330.1` | `39.272` | `19.8779` | `5927.18` | `5719.41` | `5886.67` | `104.49` | `353.07` | +| 3 | `206.8` | `327.2` | `39.606` | `20.0228` | `5983.97` | `5756.85` | `5906.11` | `105.76` | `369.31` | +| 4 | `208.5` | `331.4` | `39.284` | `19.8543` | `5921.30` | `5702.74` | `5911.82` | `104.31` | `371.32` | +| 5 | `208.8` | `335.6` | `39.240` | `20.0571` | `5950.46` | `5720.96` | `5913.65` | `104.53` | `371.59` | +| 6 | `203.4` | `319.7` | `40.277` | `20.3933` | `6285.32` | `6049.05` | `5914.11` | `104.98` | `379.23` | +| 7 | `205.7` | `320.4` | `39.818` | `20.1422` | `6173.88` | `5978.03` | `5929.75` | `106.28` | `355.59` | + +Variance summary: + +| metric | median | mean | stdev | CV | min | max | +|--------|-------:|-----:|------:|---:|----:|----:| +| `agg_tps` | `208.5000` | `207.7286` | `2.8022` | `1.349%` | `203.4000` | `212.3000` | +| `decode_agg_tps` | `330.1000` | `328.2857` | `6.2157` | `1.893%` | `319.7000` | `335.6000` | +| `wall_s` | `39.2840` | `39.4404` | `0.5312` | `1.347%` | `38.5860` | `40.2770` | +| `kernel_s` | `20.0228` | `19.9810` | `0.2717` | `1.360%` | `19.5196` | `20.3933` | +| `moe_ms` | `5950.4600` | `5983.4543` | `204.9581` | `3.425%` | `5642.0700` | `6285.3200` | +| `mmq_nvfp4_ms` | `5720.9600` | `5770.1729` | `193.3642` | `3.351%` | `5464.1700` | `6049.0500` | +| `gdn_ms` | `6695.0800` | `6690.3629` | `17.4585` | `0.261%` | `6656.7100` | `6705.9100` | +| `gdn_core_ms` | `5911.8200` | `5905.6686` | `17.8110` | `0.302%` | `5877.5700` | `5929.7500` | +| `mmq_fixup_ms` | `104.6400` | `104.9986` | `0.7420` | `0.707%` | `104.3100` | `106.2800` | +| `ew_add_ms` | `371.3200` | `367.4171` | `9.4938` | `2.584%` | `353.0700` | `379.2300` | + +Decision: + +- Phase138 remains md5/op clean and focused-positive, but its one-off serving + gain (`+0.63%` aggregate, `+0.24%` decode) is inside same-binary noise. +- Do not use Phase138's single serving run as evidence to stack another + finalize/MMQ micro-patch. +- Future serving claims need repeated A/B medians and must exceed + `max(2.0%, 3 * same-binary stdev)` on aggregate throughput. With this + Phase139 stdev, that is materially higher than the Phase138 one-off delta. +- Bucket attribution also needs repeated evidence: the same binary had + `mmq_nvfp4` CV `3.351%`, so a small MMQ movement is not enough. GDN was much + steadier (`gdn_core` CV `0.302%`), making a measured GDN-side source attempt + the more defensible next phase. + +### Phase138 Attempt 2: Down-MMQ Finalize Writeback + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-down-mmq-finalize-phase138.md`. +- Result type: kept source candidate, default-off; narrow serving-positive + result, not parity and not default-on. +- Focused artifact: + `/home/mudler/bench/phase138_moe_down_mmq_finalize/20260702_095927_focused`. +- Canonical gate artifact: + `/home/mudler/bench/phase138_moe_down_mmq_finalize/20260702_100202_canonical`. +- Serving/profile artifact: + `/home/mudler/bench/phase138_moe_down_mmq_finalize_serving/20260702_100330`. +- Source files changed: + - `ggml/src/ggml-cuda/ggml-cuda.cu` + - `ggml/src/ggml-cuda/mmq.cu` + - `ggml/src/ggml-cuda/mmq.cuh` + - `ggml/src/ggml-cuda/moe-ffn.cu` + - `ggml/src/ggml-cuda/moe-ffn.cuh` + - `tests/test-backend-ops.cpp` + +Implementation: + +- Added default-off `LLAMA_MOE_ROUTED_FFN_FINALIZE_POC=1`, requiring both + `LLAMA_MOE_ROUTED_FFN_POC=1` and + `LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1`. +- Added a finalize helper that zeroes the final output, sends router weights + and the final output pointer into the grouped down-MMQ path, and skips the + strict weighted tail only after the helper is selected. +- Added optional finalize metadata to MMQ and stream-k/fixup writeback. The + finalize branch uses the routed destination id to derive `(token, slot)` and + atomically accumulates `sum * weight` into the final token row. +- Left all existing non-finalize MMQ call sites disabled-by-default. + +Focused gates and trace: + +| route | result | +|-------|--------| +| `MOE_SWIGLU_FINALIZE` default | `7/7` | +| `MOE_SWIGLU_FINALIZE` Phase135 opt-in | `7/7` | +| `MOE_SWIGLU_FINALIZE` Phase138 finalize opt-in | `7/7` | +| Phase138 exec trace | `6` records, `FINALIZE_EXEC skip=20 tail_nodes=16` | + +Canonical gates on patched Phase93 binary: + +| route | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| Phase138 via `EXTRA_ENV` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Focused perf: + +| row | default | Phase135 | Phase138 finalize | +|-----|--------:|---------:|------------------:| +| `MOE_SWIGLU_FINALIZE nvfp4 n_tokens=128` | `198.021937 us` | `197.301518 us` | `187.134493 us` | +| `MOE_SWIGLU_FINALIZE nvfp4 n_tokens=257` | `429.235219 us` | `428.697087 us` | `384.673195 us` | + +Serving comparison: + +| metric | Phase135 opt-in | Phase138 finalize opt-in | +|--------|----------------:|--------------------------:| +| aggregate t/s | `208.0` | `209.3` | +| decode aggregate t/s | `332.7` | `333.5` | +| decode per-seq t/s | `2.12` | `2.13` | +| prefill t/s | `1475.1` | `1492.8` | +| TTFT mean | `8468.1 ms` | `8382.5 ms` | +| wall | `39.375 s` | `39.144 s` | +| total kernel time | `20.2498 s` | `20.0489 s` | + +Serving buckets: + +| bucket | Phase135 opt-in | Phase138 finalize opt-in | +|--------|----------------:|--------------------------:| +| `gdn_core` | `5926.55 ms` | `5914.04 ms` | +| `mmq_nvfp4` | `5915.24 ms` | `5802.87 ms` | +| `ew_mul` | `727.04 ms` | `723.65 ms` | +| `act_quant` | `677.59 ms` | `678.17 ms` | +| `get_rows` | `283.62 ms` | `283.80 ms` | +| `mmq_fixup` | `104.81 ms` | `106.06 ms` | +| `ew_add` | not listed in Phase135 top rows | `374.09 ms` | + +Serving pre/post gates: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Decision: + +- Keep Phase138 default-off. It passes md5/op gates and beats Phase135 on the + configured keep thresholds: aggregate/decode throughput, total kernel time, + and `mmq_nvfp4`. +- Do not promote/default-on. The serving delta is small and the weighted + fan-in still appears as `ew_add 374.09 ms`, so this is not a complete tail + removal and not parity. +- Next work should either reduce the remaining fan-in/writeback path more + deeply, or pivot back to the two dominant buckets: `gdn_core` and + `mmq_nvfp4`. + +### Phase138 Attempt 1: MoE Finalize Trace And Full-Tail Sentinel + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-down-mmq-finalize-phase138.md`. +- Result type: kept trace/test scaffold, default-off; no runtime speedup claim. +- Trace-only `MOE_SWIGLU_DOWN` artifact: + `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_092943`. +- Traced canonical gate artifact using the old default gate binary, superseded: + `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093003_gate`. +- Traced canonical gate artifact using patched Phase93 binary: + `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093141_gate_phase93`. +- Traced early-pattern gate artifact using patched Phase93 binary: + `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093243_gate_phase93_early`. +- Full-tail sentinel artifact: + `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093617_full_tail`. +- Canonical gate artifact: + `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093731_canonical`. +- Source files changed: + - `ggml/src/ggml-cuda/ggml-cuda.cu` + - `tests/test-backend-ops.cpp` + +Implementation: + +- Added default-off `LLAMA_MOE_ROUTED_FFN_FINALIZE_TRACE`. +- Added a trace-only strict tail scanner for + `down -> MUL(weights) -> VIEW/ADD rank reduction`. +- Added `MOE_SWIGLU_FINALIZE`, a whole-graph backend-op sentinel that composes + the existing `gate_up -> SWIGLU -> down` graph with the existing + router-weighted rank-add tail. +- No production finalize/writeback kernel was added in this attempt. + +Focused gates: + +| route | result | +|-------|--------| +| `MOE_SWIGLU_DOWN` + Phase135 opt-in + finalize trace | `6` early records, `0` supported tail records | +| `MOE_SWIGLU_FINALIZE` default | `7/7` | +| `MOE_SWIGLU_FINALIZE` + Phase135 opt-in + finalize trace | `7/7`, `6` supported tail records | + +Representative finalize trace row: + +| field | value | +|-------|-------| +| `supported` | `1` | +| `tail_nodes` | `16` | +| `views` | `8` | +| `adds` | `7` | +| `down_ne` | `2048x8x128` on the 128-token row | +| `weights_ne` | `1x8x128` | +| `weights_nb` | `4,4,32` | +| `final_ne` | `2048x128x1` | +| `final_nb` | `4,8192,1048576` | + +Canonical gates on patched Phase93 binary: + +| MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|---------|-----------|-----------|--------------| +| `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Decision: + +- Keep the trace/test scaffold as Phase138 groundwork. +- Proceed next to the default-off down-MMQ finalize/writeback implementation, + but only against `MOE_SWIGLU_FINALIZE` first. +- Do not claim a speedup from this attempt; it only proves graph availability + and preserves md5/op gates. + +### Phase136: Routed-FFN Post-Down Weighted Combine + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-routed-ffn-combine-phase136.md`. +- Result type: rejected source probe; source and sentinel test reverted. +- Focused artifact: + `/home/mudler/bench/phase136_routed_ffn_combine/20260702_083727`. +- Serving/profile artifact: + `/home/mudler/bench/phase136_routed_ffn_combine_serving/20260702_085749`. +- Source files tested and reverted: + - `ggml/src/ggml-cuda/moe-ffn.cuh` + - `ggml/src/ggml-cuda/moe-ffn.cu` + - `ggml/src/ggml-cuda/ggml-cuda.cu` + - `tests/test-backend-ops.cpp` + +Implementation tested: + +- Added `LLAMA_MOE_ROUTED_FFN_COMBINE=1` on top of Phase135. +- Extended the early routed-FFN graph hook to skip the post-down + `MUL(weights) -> VIEW* -> ADD*` tail. +- Added a separate F32 weighted-combine kernel that preserved expert-rank + accumulation order. +- Added a temporary full-tail `MOE_SWIGLU_COMBINE` sentinel for focused + correctness/perf. + +Focused gates: + +| route | result | +|-------|--------| +| default selected + full-tail sentinel | `MOE_SWIGLU_DOWN,MOE_SWIGLU_COMBINE,MUL_MAT_ID_RAGGED_MOE 20/20` | +| Phase135 selected + full-tail sentinel | `20/20` | +| Phase136 selected + full-tail sentinel | `20/20` | +| Phase136 trace | `6` combine markers, `6` `mmq_moe_quantized_raw`, `0` `mmq_moe_sorted_raw` | +| post-reject Phase135 selected | `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13` | + +Canonical focused gates: + +| route | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-------------------|-----------|--------------| +| Phase136 via `EXTRA_ENV` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | `1146/1146` | `806/806` | + +Focused perf: + +| row | default | Phase135 | Phase136 | +|-----|--------:|---------:|---------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `803.97 us` | `805.77 us` | `806.75 us` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1020.15 us` | `1016.53 us` | `1017.11 us` | +| `MOE_SWIGLU_COMBINE n_tokens=128` | `197.98 us` | `197.74 us` | `191.04 us` | +| `MOE_SWIGLU_COMBINE n_tokens=257` | `429.22 us` | `428.53 us` | `401.81 us` | + +Serving/profile gate: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Serving metrics at Phase130 shape: + +| metric | Phase135 opt-in | Phase136 opt-in | +|--------|----------------:|----------------:| +| aggregate t/s | `208.0` | `206.5` | +| decode aggregate t/s | `332.7` | `323.2` | +| decode per-seq t/s | `2.12` | `2.07` | +| prefill t/s | `1475.1` | `1519.5` | +| TTFT mean ms | `8468.1` | `8080.6` | +| wall s | `39.375` | `39.668` | +| total kernel time | `20.2498 s` | `19.9778 s` | + +Serving fine buckets: + +| bucket | Phase135 opt-in | Phase136 opt-in | +|--------|----------------:|----------------:| +| `mmq_nvfp4` | `5915.24 ms` | `5885.05 ms` | +| `gdn_core` | `5926.55 ms` | `5912.65 ms` | +| `cublas_bf16_gemm` | `1782.58 ms` | `1728.15 ms` | +| `cutlass_bf16_gemm` | `756.98 ms` | `767.94 ms` | +| `ew_mul` | `727.04 ms` | `712.97 ms` | +| `ew_add` | not listed in Phase135 top rows | `374.70 ms` | +| `act_quant` | `677.59 ms` | `677.60 ms` | +| `get_rows` | `283.62 ms` | `278.31 ms` | +| `mmq_fixup` | `104.81 ms` | `103.73 ms` | + +Decision: + +- Reject and revert Phase136. The focused synthetic full-tail row improved, but + serving aggregate and decode throughput regressed versus Phase135. +- Keep Phase135 as the current default-off routed-FFN source base. +- Do not retry a separate post-MMQ weighted-combine launch next. A future + combine/finalize attempt needs to remove a larger serving-visible boundary, + likely by integrating finalize/writeback with the down projection or by + changing graph scheduling enough to reduce launches without hurting decode. + +### Phase137: GDN Geometry Sweep + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-gdn-geometry-sweep-phase137.md`. +- Result type: rejected env-only serving probe; no source changes. +- Focused artifact: + `/home/mudler/bench/phase137_gdn_geometry_sweep/20260702_091441`. +- Serving/profile artifact: + `/home/mudler/bench/phase137_gdn_geometry_serving/20260702_091740`. + +Implementation tested: + +- No source edits. +- Swept existing `GDN_NW`/`GDN_CPW` runtime knobs: + default `(16,8)`, `(8,8)`, `(16,4)`, `(8,4)`, and `(4,1)`. +- Ran serving only for the best focused candidate: + `LLAMA_MOE_ROUTED_FFN_POC=1 LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1 + GDN_NW=4 GDN_CPW=1`. + +Focused GDN perf: + +| row | default | `8x8` | `16x4` | `8x4` | `4x1` | +|-----|--------:|------:|-------:|------:|------:| +| `hc=32,hs=128,nt=1,kda=0` | `6.793748 us` | `6.992506 us` | `6.161572 us` | `5.501046 us` | `4.713682 us` | +| `hc=32,hs=128,nt=1,kda=1` | `7.790557 us` | `7.639035 us` | `6.553847 us` | `5.772280 us` | `5.194275 us` | +| `hc=4,hs=128,nt=1,nseq=2,vrep=2,bcast=1` | `5.967364 us` | `4.721621 us` | `3.759859 us` | `3.747508 us` | `3.407998 us` | +| `hc=32,hs=128,nt=64,kda=0` | `153.718880 us` | `152.660797 us` | `119.964294 us` | `94.862477 us` | `125.016141 us` | +| `hc=32,hs=128,nt=256,kda=0` | `491.066095 us` | `678.143207 us` | `495.650551 us` | `454.202876 us` | `489.942166 us` | +| `hc=32,hs=128,nt=512,kda=0` | `1033.510463 us` | `2081.115639 us` | `1197.792952 us` | `1143.683921 us` | `1025.449339 us` | +| `hc=32,hs=128,nt=1024,kda=0` | `2060.529106 us` | `4382.363825 us` | `2403.995842 us` | `2310.580042 us` | `2060.707900 us` | +| `hc=4,hs=128,nt=64,kda=0` | `151.409035 us` | `142.777045 us` | `82.000488 us` | `78.839499 us` | `26.777607 us` | +| `hc=4,hs=128,nt=256,kda=0` | `102.606410 us` | `564.485714 us` | `311.945543 us` | `301.296947 us` | `102.232357 us` | +| `hc=4,hs=128,nt=512,kda=0` | `198.996831 us` | `1127.205870 us` | `620.111479 us` | `600.911809 us` | `198.595701 us` | +| `hc=4,hs=128,nt=1024,kda=0` | `396.210102 us` | `2249.487113 us` | `1240.201770 us` | `1200.476178 us` | `395.850039 us` | + +Serving/profile gate: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Serving metrics at Phase130 shape: + +| metric | Phase135 opt-in | Phase137 `GDN_NW=4 GDN_CPW=1` | +|--------|----------------:|-------------------------------:| +| aggregate t/s | `208.0` | `206.2` | +| decode aggregate t/s | `332.7` | `324.9` | +| decode per-seq t/s | `2.12` | `2.08` | +| prefill t/s | `1475.1` | `1499.4` | +| TTFT mean ms | `8468.1` | `8209.4` | +| TTFT max ms | not recorded | `14511.2` | +| wall s | `39.375` | `39.719` | +| total kernel time | `20.2498 s` | `20.7530 s` | + +Serving fine buckets: + +| bucket | Phase135 opt-in | Phase137 `GDN_NW=4 GDN_CPW=1` | +|--------|----------------:|-------------------------------:| +| `gdn_core` | `5926.55 ms` | `6466.27 ms` | +| `mmq_nvfp4` | `5915.24 ms` | `5978.87 ms` | +| `cublas_bf16_gemm` | `1782.58 ms` | `1726.10 ms` | +| `cutlass_bf16_gemm` | `756.98 ms` | `745.00 ms` | +| `ew_mul` | `727.04 ms` | `711.72 ms` | +| `ew_add` | not listed in Phase135 top rows | `367.85 ms` | +| `act_quant` | `677.59 ms` | `681.32 ms` | +| `get_rows` | `283.62 ms` | `284.31 ms` | +| `mmq_fixup` | `104.81 ms` | `103.26 ms` | + +Decision: + +- Reject Phase137. The isolated 1-token GDN rows improved, but real serving + decode, aggregate throughput, total kernel time, `gdn_core`, and `mmq_nvfp4` + all regressed versus Phase135. +- Do not edit source for a GDN launch-geometry retune. +- Next scoped source line: a default-off MoE finalize/writeback integration in + down-MMQ that removes the serving-visible `MUL(weights) -> VIEW* -> ADD*` + tail without adding a standalone combine launch. + +### Phase135: Routed-FFN Fused SWIGLU-to-NVFP4 Quant + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-routed-ffn-fused-quant-phase135.md`. +- Result type: source structural base, default-off, serving-profile positive on + decode but not parity-closing. +- Focused artifact: + `/home/mudler/bench/phase135_routed_ffn_fused_quant/20260702_081723`. +- Serving/profile artifact: + `/home/mudler/bench/phase135_routed_ffn_fused_quant_serving/20260702_082102`. +- Source files: + - `ggml/src/ggml-cuda/mmq.cuh` + - `ggml/src/ggml-cuda/mmq.cu` + - `ggml/src/ggml-cuda/moe-ffn.cu` + +Implementation: + +- Added `LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1` on top of + `LLAMA_MOE_ROUTED_FFN_POC=1`. +- Added `ggml_cuda_mul_mat_q_moe_quantized(...)`, a raw MMQ launcher that + accepts a caller-owned quantized activation buffer. +- Added a Blackwell/NVFP4-only fused kernel that reads `gate/up` views, uses + the existing ids metadata ordering, computes `silu(gate) * up`, and writes + `block_fp4_mmq` activation layout directly. +- MXFP4 and unsupported shapes fall back to earlier paths. + +Focused gates: + +| route | result | +|-------|--------| +| Phase135 selected | `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13` | +| Phase135 trace | `6` `mmq_moe_quantized_raw` launches, `0` `mmq_moe_sorted_raw` launches | + +Canonical focused gates: + +| route | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-------------------|-----------|--------------| +| Phase135 via `EXTRA_ENV` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | + +Focused perf: + +| row | default | Phase134 | Phase135 | +|-----|--------:|---------:|---------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `805.920354 us` | `807.650845 us` | `807.921963 us` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1031.064815 us` | `1027.513292 us` | `1024.971370 us` | + +Serving/profile gate: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Serving metrics at Phase130 shape: + +| metric | Phase130 default | Phase135 opt-in | +|--------|-----------------:|----------------:| +| aggregate t/s | `208.0` | `208.0` | +| decode aggregate t/s | `326.9` | `332.7` | +| decode per-seq t/s | `2.1` | `2.12` | +| prefill t/s | `1519.6` | `1475.1` | +| TTFT mean ms | `8170.6` | `8468.1` | +| wall s | `39.38` | `39.375` | +| total kernel time | `20.1559 s` | `20.2498 s` | + +Serving fine buckets: + +| bucket | Phase130 default | Phase135 opt-in | +|--------|-----------------:|----------------:| +| `mmq_nvfp4` | `6009.52 ms` | `5915.24 ms` | +| `gdn_core` | `5891.40 ms` | `5926.55 ms` | +| `cublas_bf16_gemm` | `1735.98 ms` | `1782.58 ms` | +| `cutlass_bf16_gemm` | `749.64 ms` | `756.98 ms` | +| `act_quant` | `675.67 ms` | `677.59 ms` | +| `get_rows` | `280.62 ms` | `283.62 ms` | +| `mmq_fixup` | not listed in Phase130 top rows | `104.81 ms` | + +Decision: + +- Keep Phase135 as the best current default-off routed-FFN base. It is + canonical-clean and reduces the dominant `mmq_nvfp4` serving bucket. +- Do not promote it as parity: aggregate serving is unchanged, prefill/TTFT are + worse, and total kernel time is slightly higher due to other buckets. +- Next work should target remaining MoE overhead after fused quant, especially + `mmq_fixup`, route/writeback, and weighted-combine/scatter boundaries, or run + a broader serving comparison to determine whether the decode improvement + persists outside this graph-node profile. + +### Phase134: Routed-FFN Fused SWIGLU-to-Sorted + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-routed-ffn-fused-swiglu-phase134.md`. +- Result type: source structural base, default-off, mixed perf. +- Artifact: + `/home/mudler/bench/phase134_routed_ffn_fused_swiglu/20260702_075828`. +- Source files: + - `ggml/src/ggml-cuda/moe-ffn.cuh` + - `ggml/src/ggml-cuda/moe-ffn.cu` + - `ggml/src/ggml-cuda/ggml-cuda.cu` + +Implementation: + +- Added `LLAMA_MOE_ROUTED_FFN_FUSED_SWIGLU=1` on top of + `LLAMA_MOE_ROUTED_FFN_POC=1`. +- Passes `gate` and `up` views into the Phase132 routed-FFN helper. +- Executes `gate_up`, builds ids metadata, launches a CUDA kernel to write + `silu(gate) * up` directly into expert-sorted F32 rows, then calls Phase133's + raw sorted-F32 down MMQ helper. +- The fused flag now implies the sorted-down machinery; it does not require + `LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1`. + +Selected and trace gates: + +| route | result | +|-------|--------| +| Phase134 selected | `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13` | +| Phase134 trace | `MOE_SWIGLU_DOWN 7/7`, `6` `mmq_moe_sorted_raw` launches | + +Canonical gates: + +| route | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-------------------|-----------|--------------| +| Phase134 via `EXTRA_ENV` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | + +Focused perf sanity: + +| row | default | Phase132 | Phase133 | Phase134 | +|-----|--------:|---------:|---------:|---------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `804.920354 us` | `807.999195 us` | `808.068383 us` | `810.614642 us` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1026.024540 us` | `1028.434560 us` | `1029.015432 us` | `1025.682004 us` | + +Decision: + +- Keep Phase134 only as default-off structural plumbing. It removes the + standalone `glu -> get_rows` boundary and recovers the n=257 regression, but + the extra fused-SWIGLU kernel is still slower at n=128. +- Do not promote `LLAMA_MOE_ROUTED_FFN_FUSED_SWIGLU=1` as a speedup. +- Next work must remove one more boundary, likely by fusing SWIGLU directly + into the down-MMQ quant buffer rather than writing an intermediate sorted F32 + buffer. + +### Phase133: Routed-FFN Sorted-Down Raw MMQ + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-routed-ffn-sorted-down-phase133.md`. +- Result type: source structural base, default-off, not a speedup. +- Artifact: + `/home/mudler/bench/phase133_routed_ffn_sorted_down/20260702_074651`. +- Source files: + - `ggml/src/ggml-cuda/mmq.cuh` + - `ggml/src/ggml-cuda/mmq.cu` + - `ggml/src/ggml-cuda/moe-ffn.cu` + +Implementation: + +- Exposed `ggml_cuda_mmq_ids_meta` from `mmq.cuh` so the routed-FFN helper can + reuse the existing GPU ids metadata (`ids_src1`, `ids_dst`, `expert_bounds`). +- Added `ggml_cuda_mul_mat_q_moe_sorted_f32(...)`, a raw sorted-F32 MMQ entry + that accepts a compact F32 activation pointer plus `ids_dst` and + `expert_bounds` directly. +- Added `LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1` on top of + `LLAMA_MOE_ROUTED_FFN_POC=1`. The opt-in path executes baseline `gate_up` and + `SWIGLU`, gathers `SWIGLU` output into compact expert-sorted F32 rows, then + runs the raw MMQ down helper. It falls back to Phase132 if strict shape/type + checks fail. + +Selected op gates: + +| route | result | marker | +|-------|--------|--------| +| default | `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13` | none | +| Phase132 `LLAMA_MOE_ROUTED_FFN_POC=1` | `13/13` | `6` whole-pattern exec markers | +| Phase133 `LLAMA_MOE_ROUTED_FFN_POC=1 LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1` | `13/13` | `6` whole-pattern exec markers | + +Trace proof: + +- `LLAMA_QUANT_TRACE=32` with Phase133 opt-in passed `MOE_SWIGLU_DOWN 7/7`. +- `grep -c mmq_moe_sorted_raw phase133_quant_trace.log` returned `6`, proving + the raw sorted-down helper engaged for the NVFP4 rows. + +Canonical gates: + +| route | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-------------------|-----------|--------------| +| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | +| Phase133 via `EXTRA_ENV` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | + +Focused perf sanity: + +| row | default | Phase132 | Phase133 | +|-----|--------:|---------:|---------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `807.369268 us` | `808.213194 us` | `808.848753 us` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1020.762195 us` | `1018.870935 us` | `1026.874233 us` | + +Decision: + +- Keep Phase133 only as default-off structural plumbing. It is correctness-clean + and proves the fake-tensor boundary can be replaced with a raw helper, but it + adds a separate gather into sorted F32 rows and is not faster. +- Do not promote `LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1` as a runtime speedup. +- Next work must remove the new overhead by fusing SWIGLU directly into sorted + rows or directly into the down-MMQ quant buffer. A standalone sorted-down + gather is not a parity lever. + +### Phase132: Default-Off Routed-FFN PoC Scaffold + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-routed-ffn-poc-phase132.md`. +- Result type: source scaffold, default-off, no math change intended. +- Artifact: + `/home/mudler/bench/phase132_routed_ffn_poc/20260702_072725`. +- Source files: + - `ggml/src/ggml-cuda/moe-ffn.cuh` + - `ggml/src/ggml-cuda/moe-ffn.cu` + - `ggml/src/ggml-cuda/ggml-cuda.cu` + +Build: + +- First incremental build failed at link because the existing CMake build + directory had not reconfigured its globbed CUDA source list, so the new + `moe-ffn.cu` object was not compiled. +- Re-running `cmake -S . -B build` in the DGX mirror picked up `moe-ffn.cu`; + `cmake --build build --target test-backend-ops -j"$(nproc)"` then passed. +- Symbol/string evidence: + `strings build/bin/libggml-cuda.so | grep -c LLAMA_MOE_ROUTED_FFN_POC` + returned `1`. + +Selected op gates: + +| route | result | trace | +|-------|--------|-------| +| default | `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13` | no opt-in markers | +| `LLAMA_MOE_ROUTED_FFN_POC=1` | `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13` | `6` `LLAMA_MOE_WHOLE_PATTERN_EXEC` markers | + +Canonical gates: + +| route | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-------------------|-----------|--------------| +| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | +| `LLAMA_MOE_ROUTED_FFN_POC=1` via `EXTRA_ENV` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | + +Focused perf sanity: + +| row | default | opt-in | delta | +|-----|--------:|-------:|------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `808.318584 us` | `804.868061 us` | `+0.43%` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1023.355828 us` | `1022.713701 us` | `+0.06%` | + +Decision: + +- Keep the Phase132 scaffold. It is correctness-clean and neutral, and it gives + the next patch a low-conflict helper boundary for a real fused routed-FFN + slice. +- Do not present Phase132 as a speedup. The helper currently executes the same + baseline `gate_up`, `SWIGLU`, and `down` nodes; it only proves default-off + ownership, capability gating, and reachability. +- Next source phase should replace one internal helper boundary with real work, + preferably a routed-FFN packed workspace or direct sorted activation/down + path that removes more traffic than Phase116/123. + +### Phase131: Fused Routed-FFN Scoping Challenge + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-fused-routed-ffn-phase131.md`. +- Result type: source-selection and design-gate phase; no source changes and no + DGX benchmark artifact. +- Inputs: + - Phase130 current-stack serving profile: + `/home/mudler/bench/phase130_current_stack_profile/20260702_070949`. + - MoE explorer: `019f2140-de84-7eb2-8ab5-0c7d7de336bd`. + - GDN explorer: `019f2141-0af2-7480-bf66-4fd7e67716c5`. + +Decision: + +- Reject another incremental MoE/FFN-GEMM shortcut for Phase131. The current + stack already includes default grouped FP4-MMQ, default-off W4A16 fallback + routes, route metadata scaffolding, and whole-pattern executor ownership + proof. Prior route-only, activation-only, tile-policy, W4A16, sorted-output, + and fake-executor attempts either regressed or were noise-level. +- Reject another incremental GDN shortcut for Phase131. The remaining GDN bucket + is dominated by the f32 recurrent-state scan; the safe space around launch + geometry, gather/identity, producer fusion, store fusion, BF16 S-cache, and + grouped Q/K broadcast has already been tested and rejected under canonical + md5/KL gates. +- Continue only with a larger default-off fused routed-FFN PoC if the vLLM and + llama.cpp audits identify a concrete low-conflict hook. Otherwise, require a + standalone CUDA PoC before touching llama.cpp source. + +Gates: + +- No correctness or performance gates were run for this no-source decision + phase. +- Any follow-up source phase must use the canonical MoE md5 + `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 + `5951a5b4d624ce891e22ab5fca9bc439`, `GATED_DELTA_NET`, `MUL_MAT 1146/1146`, + `MUL_MAT_ID 806/806`, and selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` + op gates before claiming a speedup. + +### Phase130: Current-Stack Serving Profile Refresh + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-current-stack-serving-profile-phase130.md`. +- Result type: measurement-only profile; no source changes. +- Artifact: + `/home/mudler/bench/phase130_current_stack_profile/20260702_070949`. +- Shape: MoE `q36-35b-a3b-nvfp4`, `N=128`, prompt `128`, generation `64`, + `PARALLEL=128`, `CTX=131072`, graph-node CUDA tracing. + +Gates: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Serving metrics: + +| metric | value | +|--------|------:| +| aggregate t/s | `208.0` | +| decode aggregate t/s | `326.9` | +| decode per-seq t/s | `2.1` | +| prefill t/s | `1519.6` | +| TTFT mean ms | `8170.6` | +| TTFT max ms | `14315.6` | +| wall s | `39.38` | +| total kernel time | `20.1559 s` | + +Macro buckets: + +| bucket | time | share | +|--------|-----:|------:| +| GDN | `6646.64 ms` | `32.98%` | +| MoE/FFN-GEMM | `6213.70 ms` | `30.83%` | +| bf16/fp8-proj | `2734.06 ms` | `13.56%` | +| layout-copy | `1260.74 ms` | `6.25%` | +| act-quant | `675.67 ms` | `3.35%` | +| gather | `280.62 ms` | `1.39%` | +| FA | `267.02 ms` | `1.32%` | + +Fine buckets: + +| bucket | time | share | +|--------|-----:|------:| +| `mmq_nvfp4` | `6009.52 ms` | `29.82%` | +| `gdn_core` | `5891.40 ms` | `29.23%` | +| `cublas_bf16_gemm` | `1735.98 ms` | `8.61%` | +| `cutlass_bf16_gemm` | `749.64 ms` | `3.72%` | +| `act_quant` | `675.67 ms` | `3.35%` | +| `convert_dtype` | `656.25 ms` | `3.26%` | +| `concat_layout` | `443.94 ms` | `2.20%` | +| `gdn_conv` | `443.80 ms` | `2.20%` | +| `get_rows` | `280.62 ms` | `1.39%` | +| `fa` | `257.38 ms` | `1.28%` | + +Decision: + +- The current serving profile remains a tied two-bucket problem: + `mmq_nvfp4` and `gdn_core` are effectively equal and far larger than every + candidate cleanup bucket. +- Do not spend the next source attempt on paged mask/F16 get-rows or FA cleanup: + `get_rows` and FA are below `1.5%` each in this profile, matching the older + Phase63 no-go. +- The next credible source attempt must either reduce the MoE/FFN-GEMM bucket + with a larger executor/kernel than the rejected route/activation shortcuts, or + reduce GDN with a materially different recurrent-state/packed-decode design + rather than the rejected grouped-broadcast/BF16-cache/geometry/store shapes. + +### Phase129: Qwen35 GDN Q/K Grouped Broadcast Probe + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-qwen35-gdn-qk-grouped-bcast-phase129.md`. +- Result type: source attempted, rejected, and reverted. +- Default gate artifact: + `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/default_20260702_065445`. +- Focused GDN perf artifact: + `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/perf_20260702_065728`. +- Default decode-profile artifact: + `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/decode_default_20260702_065847`. +- Valid opt-in reject artifact: + `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/decode_optin_20260702_070149/gate_pre`. +- Post-reject artifact: + `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/post_reject_20260702_070258`. +- Candidate env: + `LLAMA_QWEN35_GDN_QK_BCAST=1`. + +Candidate implementation: + +- Added a default-off `qk_bcast_grouped` branch to `src/models/qwen35.cpp` and + `src/models/qwen35moe.cpp`. +- When enabled, the branch skipped explicit Q/K repeat and called the + state-taking `build_recurrent_attn(..., state, il, true)` overload so the + existing `ggml_gated_delta_net_set_bcast()` op parameter could use grouped + Q/K indexing. +- Default source behavior remained unchanged when the env was unset. + +Evidence: + +- Default canonical gates passed: + - MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`; + - dense md5 `5951a5b4d624ce891e22ab5fca9bc439`; + - `GATED_DELTA_NET 46/46`; + - `MUL_MAT 1146/1146`; + - `MUL_MAT_ID 806/806`. +- The first standalone opt-in gate artifact + `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/optin_20260702_065604` + was not valid evidence because `paged-inference-gates.sh` only injects model + env through `EXTRA_ENV`. +- The valid opt-in gate from the decode harness used + `PROFILE_ENV="LLAMA_QWEN35_GDN_QK_BCAST=1"` and failed before profiling: + MoE md5 became `b773e2f032aa0e992626d486b321808e` instead of the canonical + `8cb0ce23777bf55f92f63d0292c756b0`. +- Focused `test-backend-ops perf -o GATED_DELTA_NET` was effectively neutral + because it exercises op fixtures, not the Qwen35 model-builder branch. The + representative rows were: + +| row | default us/run | opt-in us/run | +|-----|---------------:|--------------:| +| `head_count=32,head_size=128,n_seq_tokens=1024,qk_bcast_grouped=0` | `2064.48` | `2060.23` | +| `head_count=4,head_size=128,n_seq_tokens=256,qk_bcast_grouped=0` | `101.69` | `101.61` | +| `head_count=4,head_size=128,n_seq_tokens=64,v_repeat=2,qk_bcast_grouped=1` | `151.32` | `151.39` | + +- Default decode-profile baseline, before the valid opt-in reject: + +| metric | default | +|--------|--------:| +| total kernel time | `3.6916 s` | +| GDN macro | `1491.99 ms` (`40.42%`) | +| `gdn_core` | `1411.34 ms` (`38.23%`) | +| MoE/FFN-GEMM macro | `1475.96 ms` (`39.98%`) | +| `mmq_nvfp4` | `1458.54 ms` (`39.51%`) | + +- Post-reject rebuild removed the env string from `libllama.so` + (`strings ... | grep -c LLAMA_QWEN35_GDN_QK_BCAST == 0`) and post-reject + gates passed: MoE md5 canonical, dense md5 canonical, `GATED_DELTA_NET 46/46`, + `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. + +Decision: + +- Reject and revert Phase129 source. The candidate is not bit-exact for the + current `qwen35moe` decision model. +- Do not retry the same Qwen3Next grouped Q/K broadcast port for Qwen35 or + Qwen35MoE unless the quality rule is explicitly changed. The current + bit-exact md5 gate rejects it before any perf profile is meaningful. + +### Phase128: Qwen3Next GDN BF16 S-Cache Scope + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-qwen3next-gdn-bf16-s-cache-phase128.md`. +- Result type: source probe rejected and reverted. +- Default gate artifact: + `/home/mudler/bench/phase128_qwen3next_gdn_bf16_s_cache/default_20260702_043939`. +- Verbose smoke artifact: + `/home/mudler/bench/phase128_qwen3next_gdn_bf16_s_cache/smoke3_20260702_044434`. + +Candidate implementation: + +- Temporarily generalized the Qwen35/Qwen35MoE GDN S-cache selector in + `src/llama-model.cpp` to accept + `LLAMA_QWEN3NEXT_GDN_S_CACHE_TYPE=bf16` for `LLM_ARCH_QWEN3NEXT`. +- Preserved the existing `LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16` behavior. +- Reverted the source probe after validation showed it does not apply to the + current decision model and no true Qwen3Next artifact is available. + +Evidence: + +- Default `GATED_DELTA_NET` op gate passed `48/48`. +- Default canonical gates passed: + - MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`; + - dense md5 `5951a5b4d624ce891e22ab5fca9bc439`; + - `MUL_MAT` passed; + - `MUL_MAT_ID` passed. +- Verbose smoke showed the active model metadata: + `general.architecture = qwen35moe`, `print_info: arch = qwen35moe`. +- With `LLAMA_QWEN3NEXT_GDN_S_CACHE_TYPE=bf16`, recurrent cache logs still + showed `S (f32): 60.00 MiB`, as expected for a `qwen35moe` model. +- DGX search found no true Qwen3Next GGUF under `/home/mudler/bench` or + `/home/mudler`. + +Decision: + +- Reject and revert the Qwen3Next selector change for the current parity run. +- Do not retry the existing Qwen35/Qwen35MoE BF16 S-cache lever under the + current rules: Phase81 showed it reduced `gdn_core`, but Phase82 rejected it + because MoE md5 changed and the full f16-reference KL gate missed the hard + acceptance band. +- A future BF16-S-cache attempt needs either a deliberately re-scoped quality + gate or an actual Qwen3Next model artifact to validate. + +### Phase127: Whole-MoE Expert-Major Executor + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-whole-expert-major-phase127.md`. +- Result type: source attempted, rejected, and reverted. Phase126 helper + remains. +- Red artifact: + `/home/mudler/bench/phase127_moe_whole_expert_major/red_20260702_042125`. +- Green artifact: + `/home/mudler/bench/phase127_moe_whole_expert_major/green2_20260702_042916`. +- Perf artifact: + `/home/mudler/bench/phase127_moe_whole_expert_major/perf_20260702_043104`. +- Post-reject artifact: + `/home/mudler/bench/phase127_moe_whole_expert_major/post_reject_20260702_043318`. +- Candidate env: + `LLAMA_MOE_WHOLE_EXPERT_MAJOR=1 LLAMA_MOE_WHOLE_EXPERT_MAJOR_TRACE=128`. + +Candidate implementation: + +- Added an opt-in executor at the existing early whole-pattern match. +- Built route metadata once with `ggml_cuda_launch_mm_ids_helper()`. +- Wrote `gate_up` to a sorted F32 temporary using identity `ids_dst`. +- Ran SWIGLU on a fake contiguous split-half `[2*n_ff, ne_get_rows]` tensor. +- Ran down MMQ from sorted activations through the Phase126 + `ggml_cuda_mul_mat_q_moe_with_ids(..., src1_sorted=true)` helper. +- Unpermuted once after down into the real graph destination. + +Attempt notes: + +- The red gate passed by fallback and emitted zero + `LLAMA_MOE_WHOLE_EXPERT_MAJOR` markers. +- First green attempt aborted because the executor interpreted `down_w` as + `[n_embd, n_ff, experts]`. Debug trace proved the correct shape is + `[n_ff, n_embd, experts]`; the dimension fix made the selected green gate + pass. + +Gates: + +| gate | result | +|------|--------| +| red `MOE_SWIGLU_DOWN` | `7/7`, zero expert-major markers | +| default selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | +| opt-in `MOE_SWIGLU_DOWN` | `7/7`, six expert-major markers | +| candidate canonical md5/op | skipped because perf rejected source | +| post-reject selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | +| post-reject MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| post-reject dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| post-reject `MUL_MAT` | `1146/1146` | +| post-reject `MUL_MAT_ID` | `806/806` | + +Focused perf: + +| arm | `MOE_SWIGLU_DOWN n=128` | `MUL_MAT_ID_RAGGED_MOE n=128` | `MOE_SWIGLU_DOWN n=257` | `MUL_MAT_ID_RAGGED_MOE n=257` | +|-----|-------------------------:|--------------------------------:|-------------------------:|--------------------------------:| +| default | `802.57 us` | `1236.67 us` | `1023.25 us` | `1455.65 us` | +| expert-major opt-in | `812.14 us` | `1238.50 us` | `1039.36 us` | `1455.06 us` | + +Decision: + +- Reject and revert Phase127 source. The path passed correctness but missed the + keep rule: `MOE_SWIGLU_DOWN n=128` regressed about `1.2%` and `n=257` + regressed about `1.6%`; no row reached the required `>=3%` improvement. +- Do not retry the same fake-tensor whole-executor shape. It removes the early + unsort boundary but adds enough temporary traffic and quant/layout work to + lose on the focused rows. The next MoE attempt must reduce temporary traffic + or move closer to a real fused grouped MMQ/SWIGLU/down path; otherwise pivot + to the scoped GDN BF16 S-cache experiment with non-md5 numerical gates. + +### Phase126: MMQ Presorted Helper Scaffold + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-mmq-presorted-helper-phase126.md`. +- Result type: source scaffold kept; no default behavior change intended. +- Artifact: + `/home/mudler/bench/phase126_mmq_presorted_helper/fix1_20260702_040858`. +- Source scope: + - `ggml/src/ggml-cuda/mmq.cu` + - `ggml/src/ggml-cuda/mmq.cuh` +- Candidate implementation: + - refactored the current MoE `ggml_cuda_mul_mat_q()` id path into an + internal helper that accepts prebuilt `ids_src1`, `ids_dst`, and + `expert_bounds`; + - added the public CUDA-internal wrapper + `ggml_cuda_mul_mat_q_moe_with_ids(..., bool src1_sorted)`; + - preserved current behavior by having the existing path build metadata and + call the helper with `src1_sorted=false`; + - added `src1_sorted=true` support for the future whole-MoE executor without + wiring that executor in this phase. + +Attempt notes: + +- Initial Phase126 build/gate attempt compiled and selected gates passed, but + local review found the helper had widened the default MMQ q-buffer stride from + `n_expert_used` to `ne_get_rows`. The fix1 attempt restored the old stride + for `src1_sorted=false`; that is the accepted artifact below. +- One canonical gate invocation failed because it was nested under an outer + DGX lock while `paged-inference-gates.sh` owns the lock itself. The gate was + rerun cleanly outside the outer lock. + +Gates: + +| gate | result | +|------|--------| +| build `test-backend-ops llama-completion` | passed | +| selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | +| MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MUL_MAT` | `1146/1146` | +| `MUL_MAT_ID` | `806/806` | + +Focused perf: + +| row | runs | us/run | TFLOPS | +|-----|-----:|-------:|-------:| +| `MOE_SWIGLU_DOWN n=128` | `1243` | `805.99` | `11.99` | +| `MUL_MAT_ID_RAGGED_MOE n=128` | `832` | `1243.85` | `2.59` | +| `MOE_SWIGLU_DOWN n=257` | `984` | `1018.74` | `19.05` | +| `MUL_MAT_ID_RAGGED_MOE n=257` | `704` | `1452.84` | `4.45` | + +Decision: + +- Keep the scaffold as Phase127 dependency. This phase is perf-neutral versus + the Phase125 baseline/control band and preserves canonical md5/op gates. +- Do not claim parity progress from Phase126 alone. The useful next step is to + use this helper inside the whole-pattern executor so `gate_up` output, + SWIGLU, and `down` input stay in expert-major order, with one unpermute after + the full FFN. + +### Phase125: Expert-Major Sorted Output Scope + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-expert-major-sorted-output-phase125.md`. +- Result type: source implementation spec and scoped next attempt; no source + change yet. +- Subagent findings: + - llama.cpp audit: the full expert-major executor is credible but too large + for a first patch. The first slice should add a sorted-output grouped MMQ + mode so `expert_bounds` can be used without scattering through `ids_dst`. + - vLLM audit: portable ideas are expert-major layout across both GEMMs, + one permute/unpermute boundary, expert offsets for activation quant/scales, + and whole-layer measurement. CUTLASS/FlashInfer pointer-array, TMA, and + FP4 scale-swizzle contracts should not be copied into GGML/MMQ. + - local GDN challenge: Phase124's `gdn_core` bucket is material, but prior + small GDN attempts already rejected the obvious decode/core knobs. A new + GDN win would need a larger recurrence redesign, not a Phase125 shortcut. +- Decision: + - Phase125 source was tested and rejected. Do not carry + `LLAMA_MOE_EXPERT_MAJOR_SORTED_OUT`, the `mmq_args` identity-destination + flag, the MMQ sorted-output temporary, or the immediate unsort proof path. + - The full expert-major `gate_up -> SWIGLU -> down` executor remains the + right conceptual MoE target, but the first slice proved that sorted-output + plus immediate unsort is too expensive to be a stepping stone by itself. + Any follow-up must avoid adding an extra unsort boundary and must consume + sorted activations directly in the down GEMM. +- Red/baseline attempt: + - Red artifact: + `/home/mudler/bench/phase125_moe_expert_major_sorted_output/red_valid_20260702_032918`. + - Baseline artifact: + `/home/mudler/bench/phase125_moe_expert_major_sorted_output/baseline_valid_20260702_032923`. + - Red env: + `LLAMA_MOE_EXPERT_MAJOR_SORTED_OUT=1 LLAMA_MOE_EXPERT_MAJOR_SORTED_TRACE=32`. + - Red result: `test-backend-ops perf -o MOE_SWIGLU_DOWN` exited `0` and + emitted `0` `LLAMA_MOE_EXPERT_MAJOR_SORTED` markers, as expected before + implementation. + - Baseline selected gate: + `test-backend-ops test -o MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` passed + `13/13`. + +Baseline perf rows: + +| row | runs | us/run | GFLOP/run | TFLOPS | +|-----|-----:|-------:|----------:|-------:| +| `MOE_SWIGLU_DOWN n=128` | `1243` | `809.70` | `9.66` | `11.93` | +| `MUL_MAT_ID_RAGGED_MOE n=128` | `832` | `1244.18` | `3.22` | `2.59` | +| `MOE_SWIGLU_DOWN n=257` | `984` | `1016.44` | `19.40` | `19.09` | +| `MUL_MAT_ID_RAGGED_MOE n=257` | `688` | `1453.65` | `6.47` | `4.45` | + +Source attempt: + +- Artifact: + `/home/mudler/bench/phase125_moe_expert_major_sorted_output/20260702_033931`. +- Candidate env: + `LLAMA_MOE_EXPERT_MAJOR_SORTED_OUT=1 LLAMA_MOE_EXPERT_MAJOR_SORTED_TRACE=32`. +- Candidate implementation: + - added an internal `mmq_args` identity-destination flag; + - wrote NVFP4 grouped MMQ output to a sorted temporary when the env was set; + - inverted `ids_dst` on GPU and immediately used `get_rows_cuda` to restore + the normal destination layout; + - emitted bounded `LLAMA_MOE_EXPERT_MAJOR_SORTED` trace markers. +- Correctness: + - default selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE`: `13/13`; + - opt-in sorted `MOE_SWIGLU_DOWN`: `7/7`; + - opt-in correctness markers: `12` (`gate_up` and `down` for six NVFP4 + rows). + +Perf: + +| arm | `MOE_SWIGLU_DOWN n=128` | `MUL_MAT_ID_RAGGED_MOE n=128` | `MOE_SWIGLU_DOWN n=257` | `MUL_MAT_ID_RAGGED_MOE n=257` | +|-----|-------------------------:|--------------------------------:|-------------------------:|--------------------------------:| +| control | `806.13 us` | `1250.99 us` | `1027.15 us` | `1457.69 us` | +| Phase121 exec | `805.16 us` | `1247.92 us` | `1023.83 us` | `1457.67 us` | +| sorted-output proof | `888.76 us` | `1283.17 us` | `1192.05 us` | `1528.27 us` | + +Rejection: + +- Reject and revert. The proof passed correctness, but it badly missed the keep + rule: versus Phase121 exec, `MOE_SWIGLU_DOWN n=128` regressed by about + `10.4%` and `n=257` regressed by about `16.4%`. The ragged standalone row + also regressed. +- Post-reject artifact: + `/home/mudler/bench/phase125_moe_expert_major_sorted_output/post_reject_20260702_034232`. +- Post-reject gates: + - build: `0`; + - selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE`: `13/13`; + - retained Phase121 exec `MOE_SWIGLU_DOWN`: `7/7`, six exec markers; + - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`; + - dense md5: `5951a5b4d624ce891e22ab5fca9bc439`; + - `MUL_MAT`: `1146/1146`; + - `MUL_MAT_ID`: `806/806`. + +### Phase124: Current MoE Serving Graph-Node Refresh + +- Date: 2026-07-02. +- Artifact: + `/home/mudler/bench/phase124_current_moe_profile/20260702_031205`. +- Result type: current-stack llama.cpp graph-node serving profile; no source + change. +- Shape: MoE `q36-35b-a3b-nvfp4`, `N=128`, `PTOK=128`, `GEN=64`, + `PARALLEL=128`, `CTX=131072`, `BATCH=2048`, `UBATCH=512`. +- Profiler: `nsys launch --cuda-graph-trace=node`, bucketed with + `/home/mudler/bench/bucket2.py`. + +Gates: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Serving result under graph-node profiling: + +| n | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | ttft_mean_ms | wall_s | +|--:|--------:|---------------:|------------------:|------------:|-------------:|-------:| +| `128` | `206.2` | `320.3` | `2.11` | `1536.4` | `8826.7` | `39.738` | + +Macro buckets: + +| bucket | time ms | share | instances | +|--------|--------:|------:|----------:| +| GDN | `6665.04` | `33.10%` | `20790` | +| MoE/FFN-GEMM | `6246.97` | `31.03%` | `52484` | +| bf16/fp8-proj | `2687.28` | `13.35%` | `51960` | +| layout-copy | `1259.59` | `6.26%` | `79100` | +| ew-mul(weight/norm/GDN) | `728.03` | `3.62%` | `50422` | +| act-quant | `674.88` | `3.35%` | `36084` | +| FA | `264.14` | `1.31%` | `3530` | + +Fine buckets: + +| bucket | macro | time ms | share | instances | +|--------|-------|--------:|------:|----------:| +| `mmq_nvfp4` | MoE/FFN-GEMM | `6074.78` | `30.17%` | `33204` | +| `gdn_core` | GDN | `5888.31` | `29.25%` | `4500` | +| `cublas_bf16_gemm` | bf16/fp8-proj | `1722.37` | `8.55%` | `21970` | +| `cutlass_bf16_gemm` | bf16/fp8-proj | `766.57` | `3.81%` | `26380` | +| `ew_mul` | ew-mul(weight/norm/GDN) | `723.07` | `3.59%` | `46494` | +| `act_quant` | act-quant | `674.88` | `3.35%` | `36084` | +| `convert_dtype` | layout-copy | `660.48` | `3.28%` | `51300` | +| `gdn_conv` | GDN | `457.10` | `2.27%` | `6960` | +| `concat_layout` | layout-copy | `440.02` | `2.19%` | `2040` | + +Decision: + +- Phase124 confirms the current serving gap is still a two-bucket problem: + `mmq_nvfp4` and `gdn_core` together account for about `59.4%` of kernel + time. +- The `act_quant` bucket is only `3.35%`, explaining why Phase116/123 + fused-activation shortcuts did not move end-to-end rows. +- Do not fund more route-only, activation-only, or tile-policy MoE shortcuts. + Next source work must either own the full expert-major MoE pipeline to reduce + `mmq_nvfp4`, or attack `gdn_core` with a default-off GDN decode experiment + measured against this Phase124/Phase77 bucket. + +### Phase123: MoE Executor Fused Down Input + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-executor-fused-down-input-phase123.md`. +- Artifact: + `/home/mudler/bench/phase123_moe_executor_fused_down_input/20260702_025811`. +- Red check artifact: + `/home/mudler/bench/phase123_moe_executor_fused_down_input/red_20260702_025031`. +- Candidate env: + `LLAMA_MOE_WHOLE_PATTERN_EXEC=1 LLAMA_MOE_WHOLE_PATTERN_FUSED_DOWN=1`. +- Source decision: reject and revert. Do not carry the + `LLAMA_MOE_WHOLE_PATTERN_FUSED_DOWN` env, NVFP4 fused SwiGLU quant kernel, + or `ggml_cuda_mul_mat_q_moe_swiglu_down()` helper. + +Gates: + +| gate | result | trace markers | +|------|--------|---------------| +| red check fused-down trace before implementation | `7/7` test rows | `0` fused-down markers | +| default selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | n/a | +| fused-down `MOE_SWIGLU_DOWN` | `7/7` | `6` fused-down markers | +| post-reject selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | n/a | +| post-reject Phase121 exec `MOE_SWIGLU_DOWN` | `7/7` | `6` exec markers | + +Perf: + +| arm | `MOE_SWIGLU_DOWN n=128` | `MUL_MAT_ID_RAGGED_MOE n=128` | `MOE_SWIGLU_DOWN n=257` | `MUL_MAT_ID_RAGGED_MOE n=257` | +|-----|-------------------------:|--------------------------------:|-------------------------:|--------------------------------:| +| control | `812.340097 us` | `1242.909856 us` | `1021.592480 us` | `1461.043605 us` | +| Phase121 exec | `811.152856 us` | `1248.876202 us` | `1023.089980 us` | `1455.405523 us` | +| fused-down | `810.617860 us` | `1250.528750 us` | `1023.657464 us` | `1459.239826 us` | + +Decision: + +- Reject the standalone fused-down activation quantization path. It passed + correctness, but the target row was flat-to-negative and far below the `2%` + keep rule. +- Keep Phase121 executor proof only. The next MoE attempt should not be another + one-boundary activation materialization shortcut; it needs a full + expert-major packed pipeline or a different measured bottleneck. + +### Phase122: MoE Shared Route Metadata + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-shared-route-meta-phase122.md`. +- Artifact: + `/home/mudler/bench/phase122_moe_shared_route_meta/20260702_043212`. +- Candidate env: + `LLAMA_MOE_WHOLE_PATTERN_EXEC=1 LLAMA_MOE_WHOLE_PATTERN_SHARED_ROUTE=1`. +- Source decision: reject and revert. Do not carry the public + `ggml_cuda_mmq_ids_meta` API, shared-route executor helper, or + `LLAMA_MOE_WHOLE_PATTERN_SHARED_ROUTE` env. + +Gates: + +| gate | result | trace markers | +|------|--------|---------------| +| default selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | n/a | +| shared-route `MOE_SWIGLU_DOWN` | `7/7` | `6` shared-route markers | +| post-reject selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | n/a | +| post-reject Phase121 exec `MOE_SWIGLU_DOWN` | `7/7` | `6` exec markers | + +Perf: + +| arm | `MOE_SWIGLU_DOWN n=128` | `MUL_MAT_ID_RAGGED_MOE n=128` | `MOE_SWIGLU_DOWN n=257` | `MUL_MAT_ID_RAGGED_MOE n=257` | +|-----|-------------------------:|--------------------------------:|-------------------------:|--------------------------------:| +| control | `808.519710 us` | `1245.913462 us` | `1022.664622 us` | `1457.690407 us` | +| Phase121 exec | `808.189863 us` | `1250.302500 us` | `1020.849593 us` | `1461.318314 us` | +| shared-route | `811.836039 us` | `1246.143029 us` | `1051.665618 us` | `1449.548295 us` | + +Decision: + +- Reject the shared-route metadata API/path: it did not meet the keep rule and + regressed the target `MOE_SWIGLU_DOWN n=257` row by about `3%` versus the + Phase121 executor. +- Keep Phase121 executor proof only. Route-only reuse is closed as a parity + lever; the next executor scope must remove a larger activation/down boundary. + +### Phase121: MoE Whole-Pattern Exec Proof + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-whole-pattern-exec-proof-phase121.md`. +- Initial artifact: + `/home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041543`. +- Fix1 artifact: + `/home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041739_fix1`. +- Source decision: keep fix1 default-off executor proof; it proves ownership + and skip accounting but does not yet fuse work. + +Gates: + +| run | result | +|-----|--------| +| fix1 selected default, `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | +| fix1 exec proof, `LLAMA_MOE_WHOLE_PATTERN_EXEC=1 MOE_SWIGLU_DOWN` | `7/7` | +| fix1 MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| fix1 dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| fix1 `MUL_MAT` gate | `1146/1146` | +| fix1 `MUL_MAT_ID` gate | `806/806` | + +Perf: + +| row | control us | exec us | change | +|-----|-----------:|--------:|-------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `807.772325` | `806.051488` | `+0.21%` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1021.114837` | `1020.839431` | `+0.03%` | +| `MUL_MAT_ID_RAGGED_MOE n=128` | `1243.250000` | `1243.313702` | `-0.01%` | +| `MUL_MAT_ID_RAGGED_MOE n=257` | `1450.889205` | `1456.279070` | `-0.37%` | + +Trace: + +- Initial run passed correctness but emitted `0` exec markers because the exec + branch was accidentally nested under the early trace env condition. +- Fix1 exec gate emitted `6` `skip=4` markers for the supported correctness + rows. +- Fix1 exec perf emitted `6` `skip=4` markers covering `n_tokens=128` and + `n_tokens=257`. + +Decision: + +- Keep the default-off executor proof. +- It changes no default behavior and proves that the early matcher can own + `gate_up`, skip both views, execute `GLU` and `down`, and return `4`. +- Next phase should turn the proof helper into a useful executor by replacing + one internal boundary at a time. The most defensible next slice is route-plan + reuse inside the helper or activation in route-slot order, not another graph + detector. + +### Phase120: MoE Early Whole-Pattern Matcher + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-early-whole-pattern-phase120.md`. +- Initial artifact: + `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040153`. +- Fix1 artifact: + `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040515_fix1`. +- Fix2 artifact: + `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040725_fix2`. +- Source decision: keep fix2 default-off early matcher/trace; no execution is + skipped yet. + +Gates: + +| run | result | +|-----|--------| +| fix2 selected default, `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | +| fix2 early trace, `LLAMA_MOE_WHOLE_PATTERN_EARLY_TRACE=16 MOE_SWIGLU_DOWN` | `7/7` | +| fix2 MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| fix2 dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| fix2 `MUL_MAT` gate | `1146/1146` | +| fix2 `MUL_MAT_ID` gate | `806/806` | + +Perf: + +| row | control us | early trace us | change | +|-----|-----------:|---------------:|-------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `803.937002` | `808.978278` | `-0.62%` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1020.411585` | `1026.072597` | `-0.55%` | +| `MUL_MAT_ID_RAGGED_MOE n=128` | `1246.259615` | `1243.800481` | `+0.20%` | +| `MUL_MAT_ID_RAGGED_MOE n=257` | `1456.428779` | `1456.109012` | `+0.02%` | + +Trace: + +- Initial artifact emitted `96` early markers with only `6` supported rows; + fix1 emitted `104` markers with only `6` supported rows. +- Fix2 emits exactly `6` early markers, all supported, covering + `n_tokens=128` and `n_tokens=257`. +- The fix2 marker proves the executor entry contract before GEMM1 dispatch: + `skip_ready=4`, `ids_match=1`, `swiglu=1`, `n_used=8`, `experts=128`, + `n_embd=2048`, `n_ff=768`. + +Decision: + +- Keep the default-off early matcher/trace. +- This does not improve runtime by itself; it establishes the correct hook for + the next executor attempt. +- Next phase should add a guarded executor at this matcher. First prove that it + can own the five-node sequence and return `4` only after reproducing the + existing outputs, then move useful work into the helper: route-plan reuse + across both expert GEMMs, activation in route-slot order, and later direct + weighted combine. + +### Phase119: MoE Whole-Pattern Contract + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-whole-pattern-contract-phase119.md`. +- Initial artifact: + `/home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_034729`. +- Fix1 artifact: + `/home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_035126_fix1`. +- Source decision: keep default-off contract trace after fix1; no runtime + executor yet. + +Gates: + +| run | result | +|-----|--------| +| fix1 selected default, `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | +| fix1 trace gate, `LLAMA_MOE_WHOLE_PATTERN_TRACE=16 MOE_SWIGLU_DOWN` | `7/7` | +| fix1 MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| fix1 dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| fix1 `MUL_MAT` gate | `1146/1146` | +| fix1 `MUL_MAT_ID` gate | `806/806` | + +Initial perf: + +| row | control us | trace us | change | +|-----|-----------:|---------:|-------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `809.251810` | `811.777597` | `-0.31%` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1015.069697` | `1028.937243` | `-1.35%` | +| `MUL_MAT_ID_RAGGED_MOE n=128` | `1247.114183` | `1247.876202` | `-0.06%` | +| `MUL_MAT_ID_RAGGED_MOE n=257` | `1450.355114` | `1456.109012` | `-0.40%` | + +Fix1 perf: + +| row | control us | trace us | change | +|-----|-----------:|---------:|-------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `805.399839` | `805.584071` | `-0.02%` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1019.715447` | `1021.836382` | `-0.21%` | +| `MUL_MAT_ID_RAGGED_MOE n=128` | `1247.504808` | `1247.542067` | `-0.00%` | +| `MUL_MAT_ID_RAGGED_MOE n=257` | `1458.351744` | `1454.090116` | `+0.29%` | + +Trace: + +- Initial and fix1 trace perf emitted `6` whole-pattern markers. +- Fix1 covered supported NVFP4 contract rows at `n_tokens=128` and + `n_tokens=257`: `view_pair=1`, `ids_match=1`, `swiglu=1`, + `n_used=8`, `experts=128`, `n_embd=2048`, `n_ff=768`. +- The trace gate also covered smaller correctness shapes; the F32 row reports + `supported=0` by design because the executor target is native FP4. + +Decision: + +- Keep the default-off trace/contract scaffold. +- This phase does not promote a runtime optimization. +- The next executor attempt should be matched from the earlier + `gate_up MUL_MAT_ID` node, not from the current `GLU -> down` validation + hook, so it can own route-plan reuse, GEMM1, activation, GEMM2, and later + weighted combine. + +### Phase118: MoE Route Cache + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-route-cache-phase118.md`. +- Artifact: + `/home/mudler/bench/phase118_moe_route_cache/20260702_030549`. +- Source decision: reject and revert runtime cache; keep helper refactor only. + +Preflight note: + +- The initial `pgrep -af "[l]ocal-ai-worker"` preflight was a false positive + because the remote shell contained the literal text `local-ai-worker busy`. + Corrected follow-up used `pgrep -x local-ai-worker`; Docker, worker, and GPU + compute-app checks were clean. + +Gates: + +| run | result | +|-----|--------| +| helper refactor selected gate | `13/13` | +| cache default selected gate | `13/13` | +| cache opt-in selected gate, `LLAMA_MOE_ROUTE_CACHE=1` | `13/13` | +| post-reject selected gate | `13/13` | + +Perf: + +| row | baseline us | cache us | change | +|-----|------------:|---------:|-------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `799.360447` | `803.738437` | `-0.55%` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1017.711382` | `1011.915152` | `+0.57%` | +| `MUL_MAT_ID_RAGGED_MOE n=128` | `1239.332933` | `1239.560096` | `-0.02%` | +| `MUL_MAT_ID_RAGGED_MOE n=257` | `1447.588068` | `1441.795455` | `+0.40%` | + +Trace: + +- `LLAMA_MOE_ROUTE_CACHE=1 LLAMA_MOE_ROUTE_CACHE_TRACE=128` on + `MOE_SWIGLU_DOWN n_tokens=128`: `23` hits, `3` misses. + +Decision: + +- Reject and revert the runtime route cache. It proves reuse is possible, but + the win is too small for the additional context-owned state and graph-capture + lifetime surface. +- Keep only the local `ggml_cuda_mmq_ids_meta` helper refactor as low-conflict + groundwork for a future whole-pattern executor. + +### Phase117: MoE Route-Once Boundary Timing + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-route-once-boundary-phase117.md`. +- Artifact: + `/home/mudler/bench/phase117_moe_route_once_boundary/20260702_024140`. +- Trace env: + `LLAMA_MOE_BOUNDARY_TRACE=1`; optional timings with + `LLAMA_MOE_BOUNDARY_TIMING=1`. +- Source decision: keep default-off diagnostic trace only; no runtime + optimization promoted. + +Gates: + +| run | result | +|-----|--------| +| post-guard selected default, `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | +| post-guard trace/timing, `MOE_SWIGLU_DOWN` | `7/7`, `50` trace lines | +| canonical MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| canonical dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| canonical `MUL_MAT` | `1146/1146` | +| canonical `MUL_MAT_ID` | `806/806` | + +Perf / timing: + +| row | perf us | boundary medians | +|-----|--------:|------------------| +| graph-enabled `MOE_SWIGLU_DOWN n=128`, trace+timing guarded | `806.271923` | capture emits `us=-1` after graph warmup | +| no-graph `MOE_SWIGLU_DOWN n=128` | `821.530713` | gate_up: sort `8.992`, quant `103.840`, mmq `1218.656`; down: sort `8.800`, quant `50.720`, mmq `632.768`; GLU `26.240` | +| no-graph `MOE_SWIGLU_DOWN n=257` | `1079.544086` | gate_up: sort `13.376`, quant `185.632`, mmq `1297.728`; down: sort `13.952`, quant `83.808`, mmq `672.096`; GLU `51.232` | +| no-graph `MUL_MAT_ID_RAGGED_MOE n=128` | `1255.156250` | sort `8.896`, quant `99.232`, mmq `1133.472` | +| no-graph `MUL_MAT_ID_RAGGED_MOE n=257` | `1531.667683` | sort `14.624`, quant `174.464`, mmq `1263.360` | + +Notes: + +- Inline CUDA events cannot be synchronized inside CUDA graph capture. The + guard is required: graph-enabled timing no longer aborts, but captured + sections report `us=-1`; use `GGML_CUDA_DISABLE_GRAPHS=1` only for boundary + attribution. +- The route-sort bucket is small, and standalone GLU/down-quant is not enough + after the Phase116 flat result. Do not fund another small sort/tile/quant + shortcut from this evidence. +- Next source work should be a larger MoE pipeline: route-once metadata shared + by both expert GEMMs and/or whole-pattern GEMM1->activation->GEMM2 ownership. + +### Phase116: MoE SwiGLU Down Fused Quant + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-swiglu-down-fused-quant-phase116.md`. +- Artifact: + `/home/mudler/bench/phase116_moe_swiglu_down_fused_quant/20260702_022611`. +- Env under test: + `LLAMA_MOE_SWIGLU_DOWN_FUSED_QUANT=1`. +- Source decision: rejected and reverted. + +Selected gates: + +| run | selected gate | route marker | +|-----|---------------|--------------| +| control | `13/13` | n/a | +| initial candidate | `13/13` | absent | +| fix1 candidate | `13/13` | present, `6` hits | +| post-revert | `13/13` | n/a | + +Perf: + +| op | shape | control us | fused us | candidate change | +|----|-------|-----------:|---------:|-----------------:| +| `MOE_SWIGLU_DOWN` | `n_tokens=128` | `806.332261` | `808.791633` | `-0.30%` | +| `MUL_MAT_ID_RAGGED_MOE` | `n=128` | `1241.147837` | `1245.063702` | `-0.32%` | +| `MOE_SWIGLU_DOWN` | `n_tokens=257` | `1024.895706` | `1024.685072` | `+0.02%` | +| `MUL_MAT_ID_RAGGED_MOE` | `n=257` | `1454.116279` | `1455.965116` | `-0.13%` | + +Decision: + +- Reject and revert Phase116. +- The route is technically feasible without a new ggml op or MMQ kernel change, + but fusing only `SWIGLU` into MMQ activation quantization is too small to move + GB10 parity. +- Do not retry this exact standalone fused-quant path. The next credible fused + routed-MoE phase needs route-once metadata shared by both expert GEMMs plus a + larger fused GEMM1/activation/GEMM2 or weighted-combine/scatter boundary. + +### Phase115: MoE Small-M Sentinel A/B + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-01-moe-small-m-sentinel-phase115.md`. +- Artifact: + `/home/mudler/bench/phase115_moe_small_m_sentinel/20260702_020258`. +- Env under test: + `LLAMA_MOE_SMALL_M_TILE=16`, `LLAMA_MOE_SMALL_M_TILE=32`, + `LLAMA_MOE_SMALL_M_TILE=64`. +- Source decision: no source change; reject as a parity lever. + +Selected gates: + +| env | selected gate | +|-----|---------------| +| control | `13/13` | +| `LLAMA_MOE_SMALL_M_TILE=16` | `13/13` | +| `LLAMA_MOE_SMALL_M_TILE=32` | `13/13` | +| `LLAMA_MOE_SMALL_M_TILE=64` | `13/13` | + +Perf: + +| env | `MOE_SWIGLU_DOWN` 128 us | `MUL_MAT_ID_RAGGED_MOE` 128 us | `MOE_SWIGLU_DOWN` 257 us | `MUL_MAT_ID_RAGGED_MOE` 257 us | +|-----|-------------------------:|-------------------------------:|-------------------------:|-------------------------------:| +| control | `809.814159` | `1247.719952` | `1021.508130` | `1452.301136` | +| `LLAMA_MOE_SMALL_M_TILE=16` | `804.780370` | `1241.008413` | `1020.710366` | `1455.017442` | +| `LLAMA_MOE_SMALL_M_TILE=32` | `809.751408` | `1242.140625` | `1021.155488` | `1458.712209` | +| `LLAMA_MOE_SMALL_M_TILE=64` | `807.938858` | `1247.765625` | `1021.431911` | `1456.875000` | + +Decision: + +- Reject small-M row shaping for the current stack. +- This confirms the older Phase33 serving-level rejection on the newer + whole-graph sentinels: smaller MoE token tiles are correctness-safe, but the + 257-token ragged down path does not improve. +- Do not add a down-name special case or another tile-policy shortcut. Phase116 + should scope a fused routed-MoE kernel or graph-level fusion that avoids + materializing intermediate activation/output traffic. + +### Phase114: W4A16 Padded Routing + +- Date: 2026-07-01. +- Plan: + `docs/superpowers/plans/2026-07-01-w4a16-padded-routing-phase114.md`. +- Initial artifact: + `/home/mudler/bench/phase114_w4a16_padded_routing/20260701_234634_padded_meta`. +- Fix1 artifact: + `/home/mudler/bench/phase114_w4a16_padded_routing/20260701_235003_padded_meta_fix1`. +- Env under test: + `LLAMA_W4A16_PREFILL_M=128 LLAMA_W4A16_DIRECT_A=1 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_PADDED_META=1`. +- Source decision: rejected and reverted. + +Selected gates: + +| run | control | candidate | +|-----|---------|-----------| +| initial padded metadata | `13/13` | `13/13` | +| fix1 with `num_tokens_post_pad` early returns | `13/13` | `13/13` | +| post-revert Phase112 control | `13/13` | n/a | + +Fix1 perf: + +| op | shape | Phase112 control us | Phase114 fix1 us | candidate change | +|----|-------|--------------------:|-----------------:|-----------------:| +| `MOE_SWIGLU_DOWN` | `n_tokens=128` | `805.094932` | `804.176236` | `+0.11%` | +| `MUL_MAT_ID_RAGGED_MOE` | `n=128` | `1243.722356` | `1245.055288` | `-0.11%` | +| `MOE_SWIGLU_DOWN` | `n_tokens=257` | `1477.876106` | `1726.273196` | `-16.81%` | +| `MUL_MAT_ID_RAGGED_MOE` | `n=257` | `2163.346983` | `2650.932292` | `-22.54%` | + +Decision: + +- Reject and revert Phase114. +- The vLLM-style padded metadata contract is correctness-feasible in llama.cpp, + but a naive padded consumer does too much padded gather/GEMM/scatter work for + sparse expert occupancy on these GB10 test rows. +- Do not retry this exact padded-W4A16 route unless the kernel is changed to + avoid padded activation/output traffic, or the work shifts to a true fused + routed-MoE kernel where padding is part of the native tile scheduler. + +### Phase113: W4A16 Direct-A GPU Tiles + +- Date: 2026-07-01. +- Plan: + `docs/superpowers/plans/2026-07-01-w4a16-direct-a-gpu-tiles-phase113.md`. +- Artifact: + `/home/mudler/bench/phase113_w4a16_direct_a_gpu_tiles/20260701_233345_no_readback`. +- Env under test: + `LLAMA_W4A16_PREFILL_M=128 LLAMA_W4A16_DIRECT_A=1 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1`. +- Source decision: rejected and reverted. + +Selected gates: + +| env | selected gate | +|-----|---------------| +| Phase112 control, `DIRECT_A=1 MOE_GPU_SORT=1` | `13/13` | +| Phase113 candidate, plus `W4A16_GPU_TILES=1` | `13/13` | +| post-revert Phase112 control | `13/13` | + +Perf: + +| op | shape | Phase112 control us | Phase113 candidate us | candidate change | +|----|-------|--------------------:|----------------------:|-----------------:| +| `MOE_SWIGLU_DOWN` | `n_tokens=128` | `808.130330` | `803.574960` | `+0.56%` | +| `MUL_MAT_ID_RAGGED_MOE` | `n=128` | `1242.206731` | `1239.567308` | `+0.21%` | +| `MOE_SWIGLU_DOWN` | `n_tokens=257` | `1478.156342` | `1476.355457` | `+0.12%` | +| `MUL_MAT_ID_RAGGED_MOE` | `n=257` | `2148.437500` | `2214.230603` | `-3.06%` | + +Canonical gates: + +- Skipped for the candidate because the perf gate failed. +- Post-revert selected gate passed `13/13`, restoring the accepted Phase112 + state on DGX. + +Decision: + +- Reject and revert Phase113. +- Do not spend more time on compact GPU tile descriptors for W4A16 unless the + GEMM itself consumes a vLLM-style padded metadata contract directly. +- The next credible MoE phase should move toward padded aligned metadata + (`sorted_token_ids`, expert-per-block ids, and padded row count) rather than + compact descriptors plus a ragged tile map. + +### Phase112: W4A16 Direct Activation Staging + +- Date: 2026-07-01. +- Plan: + `docs/superpowers/plans/2026-07-01-w4a16-direct-a-phase112.md`. +- Artifact: + `/home/mudler/bench/phase112_w4a16_direct_a/20260701_231749_direct_a`. +- Env under test: + `LLAMA_W4A16_PREFILL_M=128 LLAMA_W4A16_DIRECT_A=1 LLAMA_MOE_GPU_SORT=1`. +- Source decision: keep default-off. + +Selected gates: + +| env | selected gate | +|-----|---------------| +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `13/13` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_W4A16_DIRECT_A=1` | `13/13` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_W4A16_DIRECT_A=1 LLAMA_MOE_GPU_SORT=1` | `13/13` | + +Perf: + +| op | shape | W4A16+GPU-sort us | direct-A us | direct-A+GPU-sort us | best change vs control | +|----|-------|------------------:|------------:|---------------------:|-----------------------:| +| `MOE_SWIGLU_DOWN` | `n_tokens=128` | `807.219630` | `805.847949` | `809.409493` | `-0.27%` | +| `MUL_MAT_ID_RAGGED_MOE` | `n=128` | `1242.664663` | `1245.671875` | `1247.674279` | `-0.40%` | +| `MOE_SWIGLU_DOWN` | `n_tokens=257` | `1551.081790` | `1576.045597` | `1477.738938` | `+4.73%` | +| `MUL_MAT_ID_RAGGED_MOE` | `n=257` | `2278.504464` | `2347.164352` | `2166.224138` | `+4.93%` | + +Canonical gates for direct-A+GPU-sort: + +| gate | result | +|------|--------| +| README MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| README dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| `SSM_CONV` | `45/45` | +| `SSM_CONV_SPLIT` | `6/6` | +| `GET_ROWS` | `49/49` supported rows | +| `GATED_DELTA_NET` | `48/48` | +| `MUL_MAT` | `1146/1146` supported rows | +| `MUL_MAT_ID` | `806/806` | + +Note: the older handoff snippet with `-no-cnv -c 4096` produced stable but +non-canonical md5s (`18a4e85031694388bab85e5f5b03effc` and +`0764361176d94719ab94f82da12eed65`) for both the direct-A candidate and the +W4A16+GPU-sort control. Treat that as a harness mismatch, not a sanctioned +gate. The patch-series README gate without `-no-cnv` and without explicit +`-c 4096` is the canonical md5 gate used above. + +Decision: + +- Carry Phase112 as default-off only. +- The improvement is real for the larger Phase108 MoE rows, but it only narrows + the fallback path. W4A16 fallback is still not the default grouped-MMQ parity + path. +- Next target: either remove another W4A16 fallback boundary that remains after + direct-A, or shift to a fused routed-MoE kernel that avoids fallback entirely + while preserving the same md5/op gates. ## Current Serving Record @@ -60,6 +2298,1800 @@ Decision: ## Attempt Log +### Phase111: W4A16 GPU Tile Descriptor Probe + +- Date: 2026-07-01. +- Plan: + `docs/superpowers/plans/2026-07-01-w4a16-gpu-tile-descriptors-phase111.md`. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: rejected and reverted. + - Probe added default-off `LLAMA_W4A16_GPU_TILES=1`. + - It built W4A16 tile descriptors on GPU from Phase110 `expert_bounds_dev` + with an atomic tile counter, then copied back one `n_tiles` integer for the + grouped W4A16 launch dimension. + - The final source returned to the Phase110 `LLAMA_MOE_GPU_SORT=1` state. +- Failed build/runtime artifact: + `/home/mudler/bench/phase111_w4a16_gpu_tiles/20260701_230216`. +- Measured artifact: + `/home/mudler/bench/phase111_w4a16_gpu_tiles/20260701_230400_fix1`. + +Failure/fix notes: + +| attempt | result | cause | +|---------|--------|-------| +| initial DGX compile | failed | `expert_bounds_for_w4a16` was typed `const int32_t *` but `mm_ids_helper` writes expert bounds | +| first runtime artifact `20260701_230216` | aborted | CUDA pool LIFO assert: outer `expert_bounds_dev` was allocated after inner `ids_dst_dev` but freed later | +| fix1 artifact `20260701_230400_fix1` | selected gates passed | allocation order corrected; `LLAMA_W4A16_GPU_TILES=1` branch traced | +| post-revert gate | `13/13` | source restored to Phase110 behavior | + +Selected gates: + +| env | selected gate result | +|-----|----------------------| +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `13/13` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1` | `13/13` | +| post-revert `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `13/13` | + +Clean perf A/B: + +| env | case | `n_tokens` | time_us | n_runs | vs Phase110 GPU-sort | +|-----|------|-----------:|--------:|-------:|---------------------:| +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `MOE_SWIGLU_DOWN` | `128` | `807.037812` | `1243` | `1.000` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `MOE_SWIGLU_DOWN` | `257` | `1531.958716` | `654` | `1.000` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1` | `MOE_SWIGLU_DOWN` | `128` | `802.969697` | `1254` | `0.995` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1` | `MOE_SWIGLU_DOWN` | `257` | `1538.542813` | `654` | `1.004` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1244.568510` | `832` | `1.000` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `MUL_MAT_ID_RAGGED_MOE` | `257` | `2250.435268` | `448` | `1.000` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1` | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1243.544471` | `832` | `0.999` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1` | `MUL_MAT_ID_RAGGED_MOE` | `257` | `2295.743304` | `448` | `1.020` | + +Trace facts: + +- `MOE_SWIGLU_DOWN n=257` built `128` W4A16 tiles for `2056` rows. +- `MUL_MAT_ID_RAGGED_MOE n=257` built `288` W4A16 tiles for `2056` rows. +- The clean perf rerun omitted `LLAMA_W4A16_GPU_TILES_TRACE=1`; the earlier + traced perf leg is preserved in the artifact but should not be used for timing. + +Decision: + +- Reject and revert Phase111 source. Moving only the W4A16 tile descriptor build + to GPU is correctness-clean after fixes, but it does not improve the parity + row and slightly regresses the most relevant 257-token ragged row. +- Do not spend another phase on a one-piece W4A16 host-metadata cleanup. The + next W4A16 attempt must remove a larger boundary, such as direct activation + consumption plus GPU descriptors in one path, or avoid the host-sync fallback + path entirely. + +### Phase110: GPU MoE Routing Metadata for Fallback/W4A16 + +- Date: 2026-07-01. +- Plan: + `docs/superpowers/plans/2026-07-01-gpu-moe-routing-metadata-phase110.md`. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: new default-off CUDA source change in + `ggml/src/ggml-cuda/ggml-cuda.cu`. + - Add `LLAMA_MOE_GPU_SORT=1` to route fallback `ggml_cuda_mul_mat_id` + metadata construction through existing `ggml_cuda_launch_mm_ids_helper()`. + - Add a local inverse-permutation kernel because `mm_ids_helper` returns + sorted-to-original `ids_dst`, while fallback `get_rows_cuda()` needs + original-to-sorted `ids_from_sorted`. + - Leave graph-safe grouped-MMQ untouched. +- Failed first artifact: + `/home/mudler/bench/phase110_gpu_moe_sort/20260701_224103`. +- Accepted artifact: + `/home/mudler/bench/phase110_gpu_moe_sort/20260701_224446_fix1`. + +Initial failure and fix: + +| artifact | env | selected gate result | reason | +|----------|-----|----------------------|--------| +| `20260701_224103` | default | `13/13` | baseline clean | +| `20260701_224103` | `LLAMA_W4A16_PREFILL_M=128` | `13/13` | fallback baseline clean | +| `20260701_224103` | `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `10/13` | wrong permutation direction for fallback `get_rows` | +| `20260701_224446_fix1` | default | `13/13` | accepted fix | +| `20260701_224446_fix1` | `LLAMA_W4A16_PREFILL_M=128` | `13/13` | accepted fix | +| `20260701_224446_fix1` | `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `13/13` | accepted fix; trace showed branch execution | + +Canonical gates: + +| env | MoE md5 | dense md5 | `SSM_CONV` | `SSM_CONV_SPLIT` | `GET_ROWS` | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-----|---------|-----------|------------|------------------|------------|-------------------|-----------|--------------| +| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | + +Perf A/B: + +| env | case | `n_tokens` | time_us | n_runs | vs W4A16 | vs default | +|-----|------|-----------:|--------:|-------:|---------:|-----------:| +| default | `MOE_SWIGLU_DOWN` | `128` | `806.724859` | `1243` | n/a | `1.000` | +| default | `MOE_SWIGLU_DOWN` | `257` | `1022.161585` | `984` | n/a | `1.000` | +| `LLAMA_W4A16_PREFILL_M=128` | `MOE_SWIGLU_DOWN` | `128` | `809.339501` | `1243` | `1.000` | `1.003` | +| `LLAMA_W4A16_PREFILL_M=128` | `MOE_SWIGLU_DOWN` | `257` | `1656.102310` | `606` | `1.000` | `1.620` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `MOE_SWIGLU_DOWN` | `128` | `807.311344` | `1243` | `0.997` | `1.001` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `MOE_SWIGLU_DOWN` | `257` | `1536.868502` | `654` | `0.928` | `1.504` | +| default | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1242.343750` | `832` | n/a | `1.000` | +| default | `MUL_MAT_ID_RAGGED_MOE` | `257` | `1453.979651` | `688` | n/a | `1.000` | +| `LLAMA_W4A16_PREFILL_M=128` | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1248.412260` | `832` | `1.000` | `1.005` | +| `LLAMA_W4A16_PREFILL_M=128` | `MUL_MAT_ID_RAGGED_MOE` | `257` | `2428.586538` | `416` | `1.000` | `1.670` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1247.145433` | `832` | `0.999` | `1.004` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `MUL_MAT_ID_RAGGED_MOE` | `257` | `2237.145089` | `448` | `0.921` | `1.539` | + +Decision: + +- Keep Phase110 as a default-off structural base. It is md5/op clean after the + inverse-permutation fix and confirms vLLM-style GPU route metadata can replace + the CPU id scan for the host-sync fallback path. +- Do not promote it as a speed parity lever by itself. The W4A16 fallback + improves by `7.2%` on `MOE_SWIGLU_DOWN n=257` and `7.9%` on + `MUL_MAT_ID_RAGGED_MOE n=257`, but still remains about `1.5x` slower than + the default grouped-MMQ path. +- Phase111 should only build on this if it removes another fallback bottleneck: + either the remaining `expert_bounds` host copy / host tile descriptor build, + or a grouped W4A16 path that can consume GPU expert bounds directly. + +### Phase109: Existing MoE Prefill and Tile-Policy A/B + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no new source changes. This was an env-only benchmark + attempt using the Phase108 perf CSV harness. +- Artifact: + `/home/mudler/bench/phase109_existing_moe_prefill_ab/20260701_222559`. + +Perf A/B: + +| env | case | `n_tokens` | time_us | n_runs | vs default | +|-----|------|-----------:|--------:|-------:|-----------:| +| default | `MOE_SWIGLU_DOWN` | `128` | `800.802233` | `1254` | `1.000` | +| default | `MOE_SWIGLU_DOWN` | `257` | `1008.593373` | `996` | `1.000` | +| `LLAMA_W4A16_PREFILL_M=128` | `MOE_SWIGLU_DOWN` | `128` | `805.747385` | `1243` | `1.006` | +| `LLAMA_W4A16_PREFILL_M=128` | `MOE_SWIGLU_DOWN` | `257` | `1646.679739` | `612` | `1.633` | +| `LLAMA_FP4_PREFILL_M=128` | `MOE_SWIGLU_DOWN` | `128` | `806.103781` | `1243` | `1.007` | +| `LLAMA_FP4_PREFILL_M=128` | `MOE_SWIGLU_DOWN` | `257` | `4070.191057` | `246` | `4.035` | +| `LLAMA_MOE_DENSITY_MAX=9` | `MOE_SWIGLU_DOWN` | `128` | `810.080451` | `1243` | `1.012` | +| `LLAMA_MOE_DENSITY_MAX=9` | `MOE_SWIGLU_DOWN` | `257` | `1024.869121` | `978` | `1.016` | +| `LLAMA_MOE_MMQ_X=64` | `MOE_SWIGLU_DOWN` | `128` | `806.358005` | `1243` | `1.007` | +| `LLAMA_MOE_MMQ_X=64` | `MOE_SWIGLU_DOWN` | `257` | `1008.191767` | `996` | `1.000` | +| default | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1241.417067` | `832` | `1.000` | +| default | `MUL_MAT_ID_RAGGED_MOE` | `257` | `1445.333807` | `704` | `1.000` | +| `LLAMA_W4A16_PREFILL_M=128` | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1242.049279` | `832` | `1.001` | +| `LLAMA_W4A16_PREFILL_M=128` | `MUL_MAT_ID_RAGGED_MOE` | `257` | `2518.852500` | `400` | `1.743` | +| `LLAMA_FP4_PREFILL_M=128` | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1244.775240` | `832` | `1.003` | +| `LLAMA_FP4_PREFILL_M=128` | `MUL_MAT_ID_RAGGED_MOE` | `257` | `2898.838068` | `352` | `2.006` | +| `LLAMA_MOE_DENSITY_MAX=9` | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1247.564904` | `832` | `1.005` | +| `LLAMA_MOE_DENSITY_MAX=9` | `MUL_MAT_ID_RAGGED_MOE` | `257` | `1438.245739` | `704` | `0.995` | +| `LLAMA_MOE_MMQ_X=64` | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1246.139423` | `832` | `1.004` | +| `LLAMA_MOE_MMQ_X=64` | `MUL_MAT_ID_RAGGED_MOE` | `257` | `1434.058239` | `704` | `0.992` | + +`MOE_WEIGHTED_COMBINE` spot rows: + +| env | `n_tokens=128` | `n_tokens=257` | +|-----|---------------:|---------------:| +| default | `27.695333` | `67.423746` | +| `LLAMA_W4A16_PREFILL_M=128` | `27.502254` | `95.550477` | +| `LLAMA_FP4_PREFILL_M=128` | `27.687500` | `229.421474` | + +Correctness gates: + +| env | selected gate result | +|-----|----------------------| +| default | `13/13` | +| `LLAMA_W4A16_PREFILL_M=128` | `13/13` | +| `LLAMA_FP4_PREFILL_M=128` | `13/13` | +| `LLAMA_MOE_DENSITY_MAX=9` | `13/13` | +| `LLAMA_MOE_MMQ_X=64` | `13/13` | + +Trace notes: + +- The default/density route remained CUDA-graph-safe grouped MMQ: + `route=mmq host_sync=0`. +- For the 257-token ragged row the traced launch uses + `ncols_dst=2056`, `ncols_max=257`, `mmq_x=96`, `stream_k_blocks == ntiles_dst`, + and `fixup=0`. +- For 128-token rows the current default already selects `mmq_x=64`; raising + density or forcing 64 does not open a new path. + +Decision: + +- Reject existing W4A16 and FP4 large-M env routes for these Phase108 MoE + sentinel rows. They are correctness-clean but slower, especially at + `n_tokens=257`. +- Reject `LLAMA_MOE_DENSITY_MAX=9` and `LLAMA_MOE_MMQ_X=64` as parity levers. + The best `MUL_MAT_ID_RAGGED_MOE` improvement is only `0.5-0.8%` and + `MOE_SWIGLU_DOWN` is flat or worse. +- Do not spend Phase110 on another MMQ tile-policy shortcut. +- Next implementation should target the structural gap identified by the vLLM + audit: build routed-MoE sorted token/expert metadata on GPU and remove the + host ID readback/sync path from the grouped fallback/W4A16 path, while keeping + the graph-safe MMQ path untouched. + +### Phase108: MoE Whole-Graph Perf CSV Harness + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: measurement-only source change in + `tests/test-backend-ops.cpp`. + - Add existing `MOE_SWIGLU_DOWN`, `MOE_WEIGHTED_COMBINE`, and + `MUL_MAT_ID_RAGGED_MOE` whole-graph cases to `make_test_cases_perf()` for + `n_tokens=128` and `257`. + - Expand `--output csv` to use `test_result::get_fields()`, which includes + `time_us`, `flops`, `bandwidth_gb_s`, `memory_kb`, and `n_runs`. +- Artifact: + `/home/mudler/bench/phase108_moe_perf_csv/20260701_221559`. + +RED condition from Phase107: + +| command | Phase107 result | +|---------|-----------------| +| `test-backend-ops perf -b CUDA0 -o MOE_SWIGLU_DOWN --output csv` | zero rows | +| `test-backend-ops perf -b CUDA0 -o MOE_WEIGHTED_COMBINE --output csv` | zero rows | +| `test-backend-ops perf -b CUDA0 -o MUL_MAT_ID_RAGGED_MOE --output csv` | zero rows | + +Perf rows after patch: + +| case | params | time_us | n_runs | flops | +|------|--------|--------:|-------:|------:| +| `MOE_SWIGLU_DOWN` | `type_a=nvfp4,n_mats=128,n_used=8,n_ff=768,n_tokens=128,n_embd=2048` | `801.764753` | `1254` | `12053007297164.449219` | +| `MOE_SWIGLU_DOWN` | `type_a=nvfp4,n_mats=128,n_used=8,n_ff=768,n_tokens=257,n_embd=2048` | `1019.953252` | `984` | `19023274120980.359375` | +| `MOE_WEIGHTED_COMBINE` | `type_a=nvfp4,n_mats=128,n_used=8,n_ff=768,n_tokens=128,n_embd=2048` | `27.550055` | `36320` | `117074893979840.453125` | +| `MOE_WEIGHTED_COMBINE` | `type_a=nvfp4,n_mats=128,n_used=8,n_ff=768,n_tokens=257,n_embd=2048` | `67.593041` | `14800` | `95809244446043.828125` | +| `MUL_MAT_ID_RAGGED_MOE` | `type_a=nvfp4,n_mats=256,n_used=8,m=768,n=128,k=2048` | `1239.103365` | `832` | `2599642259062.170898` | +| `MUL_MAT_ID_RAGGED_MOE` | `type_a=nvfp4,n_mats=256,n_used=8,m=768,n=257,k=2048` | `1445.950284` | `704` | `4472917803025.495117` | + +Safety gates: + +| gate | result | +|------|--------| +| MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MOE_SWIGLU_DOWN` | `7/7` | +| `MOE_WEIGHTED_COMBINE` | `7/7` | +| `MUL_MAT_ID_RAGGED_MOE` | `6/6` | +| `SSM_CONV` | `45/45` | +| `SSM_CONV_SPLIT` | `6/6` | +| `GET_ROWS` | `49/49` | +| `GATED_DELTA_NET` | `48/48` | +| `MUL_MAT` | `1146/1146` | +| `MUL_MAT_ID` | `806/806` | + +Notes: + +- The first md5 attempt in `gates/` used `-no-cnv` and intentionally failed + against the canonical chat-template hashes. The corrected historical gate is + in `gates_chat/` and passed. +- CSV output is now a usable perf ledger for these cases; the schema includes + timing columns instead of support metadata only. + +Decision: + +- Phase108 closes the Phase107 measurement gap; it is not a parity-improving + runtime patch by itself. +- The dominant focused row is `MUL_MAT_ID_RAGGED_MOE` (`1239-1446 us/run`) and + `MOE_SWIGLU_DOWN` (`802-1020 us/run`), not `MOE_WEIGHTED_COMBINE` + (`28-68 us/run`). +- Next fused-MoE work should target the routed matmul/SWIGLU/down chain and + must report deltas against these Phase108 rows plus the same md5/op gates. + +### Phase107: Fused-MoE Structural Guardrail + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no new source changes. This was a correctness and + measurement-surface attempt for the next structural fused routed-MoE path. +- Artifact: + `/home/mudler/bench/phase107_moe_fusion_guardrail/20260701_220227`. + +Correctness guardrails: + +| guard | result | +|-------|--------| +| `MOE_SWIGLU_DOWN` | `7/7` | +| `MOE_WEIGHTED_COMBINE` | `7/7` | +| `MUL_MAT_ID_RAGGED_MOE` | `6/6` | + +Perf-output check: + +| command | result | +|---------|--------| +| `test-backend-ops perf -b CUDA0 -o MOE_SWIGLU_DOWN --output csv` | zero rows | +| `test-backend-ops perf -b CUDA0 -o MOE_WEIGHTED_COMBINE --output csv` | zero rows | +| `test-backend-ops perf -b CUDA0 -o MUL_MAT_ID_RAGGED_MOE --output csv` | zero rows | +| `test-backend-ops perf -b CUDA0 -o MUL_MAT_ID --output csv` | `116` support rows, `63` relevant rows, but no timing columns | + +Decision: + +- Existing correctness guardrails are sufficient to protect the three structural + MoE surfaces before a future source change. +- Existing `test-backend-ops perf` output is not sufficient as a performance + guard for these custom whole-graph cases because it emits support metadata, + not timings. +- The next source patch should be measurement-only: a narrow MoE fusion timing + harness that emits `case,iterations,total_ms,mean_ms` for the selected + `MOE_SWIGLU_DOWN`, `MOE_WEIGHTED_COMBINE`, and `MUL_MAT_ID_RAGGED_MOE` + shapes. +- Do not start fused routed-MoE kernel implementation until that timing harness + proves which sub-surface is large enough to move Phase104/106 serving. + +### Phase106: Max-Concurrency Current-Stack Serving + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no new source changes. This was a measurement-only + serving-contract attempt on top of the carried Phase101/102 default-off + cleanup candidates. +- Harness: streamed `paged-current-serving-snapshot.sh` with: + - source-log workaround for the non-git DGX mirror, + - paged env + `LLAMA_SSM_CONV_SPLIT=1 LLAMA_PAGED_KV_GET_ROWS_F16=1`, + - expanded gate ops: + `SSM_CONV,SSM_CONV_SPLIT,GET_ROWS,GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID`, + - `NPL=128 192 256`, `PTOK=128`, `GEN=64`, `PARALLEL=256`, + `CTX=131072`, `BATCH=2048`, `UBATCH=512`, `VLLM_MAX_NUM_SEQS=256`. +- Artifacts: + - dry-run: + `/home/mudler/bench/phase106_max_concurrency_current_stack/20260701_214839_dryrun`, + - full sweep: + `/home/mudler/bench/phase106_max_concurrency_current_stack/20260701_214907`. + +Safety gates: + +| phase | env | MoE md5 | dense md5 | `SSM_CONV` | `SSM_CONV_SPLIT` | `GET_ROWS` | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-------|-----|---------|-----------|------------|------------------|------------|-------------------|-----------|--------------| +| pre | split + F16 K/V rows | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| post | split + F16 K/V rows | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | + +Serving snapshot: + +| arm | n | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | ttft_mean_ms | wall_s | +|-----|--:|--------:|---------------:|------------------:|------------:|-------------:|-------:| +| paged combined | `128` | `331.8` | `678.9` | `3.90` | `1734.1` | `7392.5` | `24.689` | +| paged combined | `192` | `318.4` | `681.8` | `2.50` | `1602.4` | `11058.0` | `38.595` | +| paged combined | `256` | `338.4` | `824.6` | `2.10` | `1542.8` | `14933.5` | `48.410` | +| vLLM | `128` | `663.4` | `1029.8` | `6.78` | `5228.9` | `2514.6` | `11.970` | +| vLLM | `192` | `709.8` | `1202.4` | `4.98` | `4881.5` | `3674.8` | `16.769` | +| vLLM | `256` | `723.8` | `1320.4` | `3.94` | `4520.9` | `4999.0` | `21.931` | + +Ratios: + +| n | paged decode/vLLM | paged perseq/vLLM | paged agg/vLLM | paged TTFT/vLLM | +|--:|------------------:|------------------:|---------------:|----------------:| +| `128` | `0.6593` | `0.5752` | `0.5002` | `2.9398` | +| `192` | `0.5670` | `0.5020` | `0.4486` | `3.0091` | +| `256` | `0.6245` | `0.5330` | `0.4675` | `2.9873` | + +Decision: + +- Reject C1 as a GB10 parity lever for the current stack. +- llama.cpp completed `N=256`, but vLLM also completed `N=256` under the same + harness cap and remained materially faster. +- Higher concurrency did not reveal an aggregate operating point where llama.cpp + catches vLLM: paged aggregate stayed around `318-338 t/s`, while vLLM rose to + `724 t/s`. +- TTFT widened with higher concurrency on llama.cpp (`7392.5 -> 14933.5 ms`) + and stayed much lower on vLLM (`2514.6 -> 4999.0 ms`). +- The next phase should not be another scheduler or MMQ micro-policy. The + remaining plausible source work is structural: persistent batch state, fused + routed-MoE dispatch, or a larger GDN/packed-decode design with new guardrails. + +### Phase105: Current-Stack MoE MMQ Shape Refresh + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no new source changes. This was a measurement-only + attempt on top of the carried Phase101/102 default-off cleanup candidates. +- Env for trace legs: + `LLAMA_SSM_CONV_SPLIT=1 LLAMA_PAGED_KV_GET_ROWS_F16=1`. +- Artifacts: + - gates: + `/home/mudler/bench/phase105_mmq_current_shape/20260701_213927`, + - serving trace retry: + `/home/mudler/bench/phase105_mmq_current_shape/20260701_214129_serving_retry`. + +Safety gates: + +| gate | env | result | +|------|-----|--------| +| `MUL_MAT_ID_RAGGED_MOE` | default | `6/6` | +| `MUL_MAT_ID_RAGGED_MOE` | split + F16 K/V rows + shape traces | `6/6` | +| `MUL_MAT_ID` | split + F16 K/V rows | `806/806` | + +Trace refresh: + +| source | shape lines | launch lines | small-M lines | shape summary | launch summary | +|--------|------------:|-------------:|--------------:|---------------|----------------| +| ragged gate | `3` | `3` | `2` | density `2/4/9`, `mmq_x_best 40/64/96` | `fixup=0`, `stream_k_blocks == ntiles_dst` | +| one live serving request | `120` | `120` | `0` | `ncols_max=317`, density `10`, `mmq_x_best=112`, `stream_k=1` | `fixup=0`, `stream_k_blocks == ntiles_dst` (`120/120`), efficiency `100` | + +Notes: + +- The first live-serving trace leg used the wrong model path and exited before + loading the model. It is preserved in the gate artifact as a harness hiccup, + not an inference failure. +- The serving retry used `~/bench/q36-35b-a3b-nvfp4.gguf`; the request returned + a non-empty response (`3648` bytes), and the wrapper's nonzero exit was from + `grep` under `pipefail` when there were zero `SMALL_M` lines. + +Decision: + +- The current Phase104 stack did not create a new cheap grouped-MMQ lever. +- The trace reconfirms that no-fixup/no-stream-k shortcuts are closed for this + workload, and the live sampled shape is prefill-like rather than a new + small-M decode class. +- Do not pursue another host-side MMQ tile policy. Any next MMQ work must be a + structural kernel or serving-contract change with a clear path to reducing + the dominant `mmq_nvfp4` bucket. +- Given prior GDN micro-kernel rejections, the next high-value phase should be + a larger serving contract or a new structural design, not more isolated + micro-knobs. + +### Phase104: Combined Cleanup Normal Serving Snapshot vs vLLM + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no new source changes beyond the carried Phase101/102 + default-off runtime candidates. +- Harness: streamed `paged-current-serving-snapshot.sh` with: + - source-log workaround for the non-git DGX mirror, + - paged env + `LLAMA_SSM_CONV_SPLIT=1 LLAMA_PAGED_KV_GET_ROWS_F16=1`, + - expanded gate ops: + `SSM_CONV,SSM_CONV_SPLIT,GET_ROWS,GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID`, + - `NPL=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`, `CTX=131072`, + `BATCH=2048`, `UBATCH=512`. +- Artifact: + `/home/mudler/bench/phase104_combined_serving_snapshot/20260701_212551`. + +Safety gates: + +| phase | env | MoE md5 | dense md5 | `SSM_CONV` | `SSM_CONV_SPLIT` | `GET_ROWS` | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-------|-----|---------|-----------|------------|------------------|------------|-------------------|-----------|--------------| +| pre | split + F16 K/V rows | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| post | split + F16 K/V rows | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | + +Serving snapshot, MoE `PTOK=128`, `GEN=64`, `PARALLEL=128`, `N=128`: + +| arm | n | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | ttft_mean_ms | wall_s | +|-----|--:|--------:|---------------:|------------------:|------------:|-------------:|-------:| +| paged combined | `128` | `338.6` | `675.8` | `3.93` | `1813.0` | `7121.6` | `24.196` | +| vLLM | `128` | `661.1` | `1028.0` | `6.80` | `5208.7` | `2572.3` | `11.980` | + +Ratios: + +| n | paged decode/vLLM | paged perseq/vLLM | paged agg/vLLM | paged TTFT/vLLM | +|--:|------------------:|------------------:|---------------:|----------------:| +| `128` | `0.6574` | `0.5779` | `0.5122` | `2.7686` | + +Comparison to Phase97 Phase93-only normal serving: + +| metric | Phase97 | Phase104 combined | change | +|--------|--------:|------------------:|-------:| +| `agg_tps` | `329.6` | `338.6` | `+2.73%` | +| `decode_agg_tps` | `669.8` | `675.8` | `+0.90%` | +| `prefill_tps` | `1734.5` | `1813.0` | `+4.53%` | +| `ttft_mean_ms` | `7415.4` | `7121.6` | `-3.96%` | +| `wall_s` | `24.851` | `24.196` | `-2.64%` | +| `paged_decode_over_vllm` | `0.6507` | `0.6574` | `+0.0067` | +| `paged_agg_over_vllm` | `0.4958` | `0.5122` | `+0.0164` | + +Decision: + +- The combined cleanup stack has a small real serving benefit outside `nsys`. +- It does not change the parity conclusion: vLLM is still about `1.52x` faster + on decode aggregate and `1.95x` faster on aggregate throughput at this shape. +- Carry the combined cleanup env as the best current comparison baseline. +- Next source work should target the remaining high-impact gap, not another + isolated layout cleanup. The current evidence points to larger serving + contracts or the dominant GDN/MMQ buckets. + +### Phase103: Combined Layout Cleanup Stack + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no new source changes beyond the Phase101 and Phase102 + default-off runtime candidates. +- Env: + `LLAMA_SSM_CONV_SPLIT=1 LLAMA_PAGED_KV_GET_ROWS_F16=1`. +- Artifacts: + - standalone combined gates: + `/home/mudler/bench/phase103_combined_layout_cleanups/20260701_211632/gates_combined`, + - combined serving profile: + `/home/mudler/bench/phase103_combined_layout_cleanups/20260701_211821/serving_profile`. + +Safety gates: + +| gate | env | MoE md5 | dense md5 | `SSM_CONV` | `SSM_CONV_SPLIT` | `GET_ROWS` | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|------|-----|---------|-----------|------------|------------------|------------|-------------------|-----------|--------------| +| standalone combined | split + F16 K/V rows | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| serving pre combined | split + F16 K/V rows | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| serving post combined | split + F16 K/V rows | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | + +Serving under combined graph-node profiling: + +| metric | value | +|--------|------:| +| aggregate t/s | `212.3` | +| decode aggregate t/s | `331.5` | +| decode per-seq t/s | `2.13` | +| prefill t/s | `1569.1` | +| TTFT mean ms | `7858.5` | +| wall s | `38.575` | +| total kernel time | `19.5519 s` | + +Fine bucket comparison: + +| bucket | Phase101 opt-in | Phase102 opt-in | Phase103 combined | Phase103 vs Phase102 | +|--------|----------------:|----------------:|------------------:|---------------------:| +| `convert_dtype` | `661.35 ms` | `663.99 ms` | `662.36 ms` | `-1.63 ms` | +| `copy_layout` | `80.32 ms` | `112.53 ms` | `78.22 ms` | `-34.31 ms` | +| `concat_layout` | `433.13 ms` | `4.59 ms` | `12.51 ms` | `+7.92 ms` | +| `layout-copy` macro | `1220.30 ms` | `826.87 ms` | `798.52 ms` | `-28.35 ms` | +| `get_rows` | `277.67 ms` | `278.61 ms` | `278.61 ms` | `0.00 ms` | +| `gdn_conv` | `453.54 ms` | `383.90 ms` | `390.08 ms` | `+6.18 ms` | +| `gdn_core` | `5886.76 ms` | `5940.33 ms` | `5930.47 ms` | `-9.86 ms` | +| `mmq_nvfp4` | `6193.70 ms` | `5987.09 ms` | `6001.77 ms` | `+14.68 ms` | + +Decision: + +- Correctness-clean combined stack. The two cleanup candidates are compatible. +- The combination improves traced serving over Phase102 and recovers the + Phase101 `copy_layout` reduction while preserving the Phase102 concat removal. +- It is still not a parity-closing lever. Dominant buckets remain + `gdn_core 5930.47 ms` and `mmq_nvfp4 6001.77 ms`, far larger than the + residual layout buckets. +- Carry Phase101+Phase102 as a combined default-off cleanup stack for future + comparisons. Next source work should not spend more time on isolated + layout-copy cleanup unless it also changes a serving-critical contract. + +### Phase102: Split-Input `SSM_CONV` Prefill Path + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: default-off runtime candidate: + - adds `ggml_ssm_conv_split(ctx, conv_states, x_cur, conv_kernel)` while + reusing `GGML_OP_SSM_CONV`, + - adds CPU and CUDA split-input implementations plus `SSM_CONV_SPLIT` tests, + - wires Qwen3Next/Qwen35/Qwen35MoE through + `LLAMA_SSM_CONV_SPLIT=1` only for `n_seq_tokens > 1`, + `n_seq_tokens >= K-1`, and `cparams.n_rs_seq == 0`, + - keeps decode fused and rollback/short-prefill cases on the existing path. +- Local build: `cmake --build build --target test-backend-ops -j $(nproc)`. +- DGX build: + `cmake --build /home/mudler/llama-phase93-qwen3next-gqa-bcast/build --target llama-server llama-completion test-backend-ops -j $(nproc)`. +- Debug note: the first split-minus-base test used the default normalized-MSE + metric and failed with `ERR = inf` for `d_conv=4` because the CPU reference is + exactly zero. A direct split CUDA-vs-CPU diagnostic passed `6/6`; the final + semantic test keeps `split - base` and uses absolute max error. +- Artifacts: + - default/opt-in standalone gates: + `/home/mudler/bench/phase102_ssm_conv_split/20260701_210559`, + - opt-in serving profile: + `/home/mudler/bench/phase102_ssm_conv_split/20260701_210907/serving_profile`. + +Safety gates: + +| gate | env | MoE md5 | dense md5 | `SSM_CONV` | `SSM_CONV_SPLIT` | `GET_ROWS` | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|------|-----|---------|-----------|------------|------------------|------------|-------------------|-----------|--------------| +| default | none | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| standalone opt-in | `LLAMA_SSM_CONV_SPLIT=1` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| serving pre opt-in | `LLAMA_SSM_CONV_SPLIT=1` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| serving post opt-in | `LLAMA_SSM_CONV_SPLIT=1` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | + +Serving under opt-in graph-node profiling: + +| metric | value | +|--------|------:| +| aggregate t/s | `206.1` | +| decode aggregate t/s | `320.0` | +| decode per-seq t/s | `2.06` | +| prefill t/s | `1538.0` | +| TTFT mean ms | `7928.4` | +| wall s | `39.743` | +| total kernel time | `19.5482 s` | + +Fine bucket comparison: + +| bucket | Phase100 | Phase101 opt-in | Phase102 opt-in | Phase102 vs Phase101 | +|--------|---------:|----------------:|----------------:|---------------------:| +| `convert_dtype` | `661.73 ms` | `661.35 ms` | `663.99 ms` | `+2.64 ms` | +| `copy_layout` | `116.25 ms` | `80.32 ms` | `112.53 ms` | `+32.21 ms` | +| `concat_layout` | `438.15 ms` | `433.13 ms` | `4.59 ms` | `-428.54 ms` | +| `layout-copy` macro | `1262.58 ms` | `1220.30 ms` | `826.87 ms` | `-393.43 ms` | +| `get_rows` | `283.47 ms` | `277.67 ms` | `278.61 ms` | `+0.94 ms` | +| `gdn_conv` | `458.13 ms` | `453.54 ms` | `383.90 ms` | `-69.64 ms` | +| `gdn_core` | `5919.48 ms` | `5886.76 ms` | `5940.33 ms` | `+53.57 ms` | +| `mmq_nvfp4` | `6127.44 ms` | `6193.70 ms` | `5987.09 ms` | `-206.61 ms` | + +Decision: + +- Correctness-clean and structurally useful: the split op removes the large + concat materialization from the eligible prefill/microbatch path. +- It does not improve live serving throughput in the profiled `N=128`, + `PTOK=128`, `GEN=64`, `PARALLEL=128` window; aggregate and decode are below + Phase100/101 traced profiles despite lower total kernel time. +- Carry as a default-off cleanup candidate pending repeat A/B or a follow-up + that fuses the remaining state update/copy work. Do not promote as a parity + lever by itself. +- Next higher-value work should target the still-dominant buckets: + `gdn_core` and `mmq_nvfp4`, or a larger serving scheduler/packed-decode + contract. + +### Phase101: Paged K/V F16 `GET_ROWS` A/B + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: default-off runtime candidate: + - `ggml_get_rows_type(ctx, a, b, type)` helper added while preserving stock + `ggml_get_rows` widening semantics, + - CPU reference supports F16 source -> F16 output row copy, + - CUDA already supports F16 `GET_ROWS` output through `get_rows_cuda`, + - paged attention K/V gather calls typed F16 `GET_ROWS` only when + `LLAMA_PAGED_KV_GET_ROWS_F16=1` and the K/V cache tensor is F16, + - tests add F16-output `GET_ROWS` cases. +- Local build: `cmake --build build --target test-backend-ops -j $(nproc)`. +- DGX build: + `cmake --build /home/mudler/llama-phase93-qwen3next-gqa-bcast/build --target llama-server llama-completion test-backend-ops -j $(nproc)`. +- Artifacts: + - default gates: + `/home/mudler/bench/phase101_kv_get_rows_f16/20260701_203621/gates_default`, + - opt-in gates: + `/home/mudler/bench/phase101_kv_get_rows_f16/20260701_203754/gates_optin`, + - opt-in serving profile: + `/home/mudler/bench/phase101_kv_get_rows_f16/20260701_203930/serving_profile`. + +Safety gates: + +| gate | env | MoE md5 | dense md5 | `GET_ROWS` | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|------|-----|---------|-----------|------------|-------------------|-----------|--------------| +| default | none | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| standalone opt-in | `LLAMA_PAGED_KV_GET_ROWS_F16=1` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| serving pre opt-in raw log | `LLAMA_PAGED_KV_GET_ROWS_F16=1` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| serving post opt-in raw log | `LLAMA_PAGED_KV_GET_ROWS_F16=1` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `49/49` | `48/48` | `1146/1146` | `806/806` | + +Serving under opt-in graph-node profiling: + +| metric | value | +|--------|------:| +| aggregate t/s | `206.4` | +| decode aggregate t/s | `328.0` | +| decode per-seq t/s | `2.08` | +| prefill t/s | `1479.6` | +| TTFT mean ms | `8211.1` | +| wall s | `39.678` | +| total kernel time | `20.1989 s` | + +Fine bucket comparison against Phase100: + +| bucket | Phase100 | Phase101 opt-in | change | +|--------|---------:|----------------:|-------:| +| `convert_dtype` | `661.73 ms` | `661.35 ms` | `-0.38 ms` | +| `copy_layout` | `116.25 ms` | `80.32 ms` | `-35.93 ms` | +| `concat_layout` | `438.15 ms` | `433.13 ms` | `-5.02 ms` | +| `layout-copy` macro | `1262.58 ms` | `1220.30 ms` | `-42.28 ms` | +| `get_rows` | `283.47 ms` | `277.67 ms` | `-5.80 ms` | +| `gdn_core` | `5919.48 ms` | `5886.76 ms` | `-32.72 ms` | +| `mmq_nvfp4` | `6127.44 ms` | `6193.70 ms` | `+66.26 ms` | + +Decision: + +- Correctness-clean but not parity-closing. +- The hypothesis that K/V F16 typed gather would materially reduce + `convert_dtype` is mostly false for this serving window; `convert_dtype` + stayed flat. +- The patch does remove some `copy_layout` work and keeps md5/op gates green, + so it can remain as a small default-off cleanup candidate, but it should not + be promoted or treated as the main parity path without a repeat serving A/B. +- Next higher-value runtime work remains either the two-source `SSM_CONV` + contract for `conv_input` or a larger GDN/MMQ serving lever. + +### Phase100: Layout Trace View-Source Attribution + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: trace-only source change in + `ggml/src/ggml-cuda/ggml-cuda.cu`; `LLAMA_LAYOUT_TRACE` now prints + `dst_view`, `src0_view`, and `src1_view`. Default execution is unchanged. +- Local build: `cmake --build build --target test-backend-ops -j $(nproc)`. +- DGX build: + `cmake --build /home/mudler/llama-phase93-qwen3next-gqa-bcast/build --target llama-server llama-completion test-backend-ops -j $(nproc)`. +- Harness: + - trace gate: + `EXTRA_ENV=LLAMA_LAYOUT_TRACE=128 OPS=GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID`, + - serving profile: streamed `/home/mudler/bench/phase76_current_moe_profile.sh` + with source logging fixed for the mirror, `GATED_DELTA_NET` gates, and + `LLAMA_LAYOUT_TRACE=30000` on `llama-server`, + - `N=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`, `CTX=131072`. +- Artifacts: + - trace gate: + `/home/mudler/bench/phase100_layout_view_trace/20260701_201635/trace_gates`, + - serving profile: + `/home/mudler/bench/phase100_layout_view_trace/20260701_201800/serving_profile`. + +Safety gates: + +| gate | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|------|---------|-----------|-------------------|-----------|--------------| +| trace-enabled standalone | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | +| serving pre raw log | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | +| serving post raw log | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | + +Serving under graph-node profiling plus view-source layout trace: + +| metric | value | +|--------|------:| +| aggregate t/s | `207.0` | +| decode aggregate t/s | `327.9` | +| decode per-seq t/s | `2.10` | +| prefill t/s | `1490.9` | +| TTFT mean ms | `8302.7` | +| wall s | `39.578` | +| total kernel time | `20.3464 s` | + +Fine buckets: + +| bucket | time | share | launches | +|--------|-----:|------:|---------:| +| `mmq_nvfp4` | `6127.44 ms` | `30.12%` | `33682` | +| `gdn_core` | `5919.48 ms` | `29.09%` | `4680` | +| `convert_dtype` | `661.73 ms` | `3.25%` | `52060` | +| `gdn_conv` | `458.13 ms` | `2.25%` | `7230` | +| `concat_layout` | `438.15 ms` | `2.15%` | `2130` | +| `copy_layout` | `116.25 ms` | `0.57%` | `8090` | +| `ew_repeat` | `46.45 ms` | `0.23%` | `18720` | + +View-source trace findings: + +| finding | evidence | +|---------|----------| +| K/V cache reads feed F32->F16 converts | For attention layers, `GET_ROWS` outputs F32 `node_*` from F16 `cache_k_l*` / `cache_v_l*`, then a `CPY` downcasts a view of that node to F16. Examples: `node_358 <- cache_k_l3` and `node_365 <- cache_v_l3`, followed by `cpy` rows with `src0_view=node_358` / `node_365`, `src0_type=f32`, `src1_type=f16`, and shapes like `256x64x2x8`, `256x128x2x8`, `256x162x2x8`. | +| The pattern repeats across attention layers | The same pair pattern appears for `cache_k_l7/cache_v_l7` (`node_798/node_805`), `cache_k_l11/cache_v_l11` (`node_1238/node_1245`), and later attention layers. | +| Some converts remain anonymous | `959` F32->F16 `CPY` trace rows still had no tensor or view names; do not assume the K/V path accounts for the full `convert_dtype` bucket without a targeted A/B. | +| Phase99 conv attribution is confirmed | `concat` rows show `conv_input-*` from `conv_states_reshaped-*` and `qkv_mixed_transposed-*`; the new view fields map `qkv_mixed_transposed-*` back to layer-local `node_*` producers. | + +Decision: + +- Carry the trace-only Phase100 patch as default-off instrumentation. +- The next runtime source candidate should target the attention K/V cache gather + dtype path: avoid `GET_ROWS` producing F32 only to downcast to F16 when the + consumer wants F16. This is more directly connected to the `convert_dtype` + bucket than a generic copy/layout tweak. +- Keep the two-source `SSM_CONV` contract as a separate later phase for + `concat_layout`; do not mix it with the K/V dtype experiment. + +### Phase99: Serving Layout Trace Attribution + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no source change; the default-off `LLAMA_LAYOUT_TRACE` + hook was already present in the fork and DGX mirror. +- Harness: + - trace gate: + `EXTRA_ENV=LLAMA_LAYOUT_TRACE=128 OPS=GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID`, + - serving profile: streamed `/home/mudler/bench/phase76_current_moe_profile.sh` + with measurement-only edits for source logging, `GATED_DELTA_NET` gates, + and `LLAMA_LAYOUT_TRACE=30000` on `llama-server`, + - `N=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`, `CTX=131072`. +- Artifacts: + - trace gate: + `/home/mudler/bench/phase99_layout_trace/20260701_200637/trace_gates`, + - serving profile: + `/home/mudler/bench/phase99_layout_trace/20260701_200835/serving_profile`. + +Safety gates: + +| gate | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|------|---------|-----------|-------------------|-----------|--------------| +| trace-enabled standalone | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | +| serving pre raw log | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | +| serving post raw log | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | + +Serving under graph-node profiling plus layout trace: + +| metric | value | +|--------|------:| +| aggregate t/s | `208.2` | +| decode aggregate t/s | `332.9` | +| decode per-seq t/s | `2.12` | +| prefill t/s | `1476.8` | +| TTFT mean ms | `8466.3` | +| wall s | `39.341` | +| total kernel time | `20.2408 s` | + +Macro buckets: + +| bucket | time | share | +|--------|-----:|------:| +| GDN | `6709.45 ms` | `33.15%` | +| MoE/FFN-GEMM | `6158.11 ms` | `30.42%` | +| bf16/fp8-proj | `2786.81 ms` | `13.77%` | +| layout-copy | `1269.35 ms` | `6.27%` | +| ew-mul(weight/norm/GDN) | `729.08 ms` | `3.60%` | +| act-quant | `686.52 ms` | `3.39%` | +| FA | `268.04 ms` | `1.32%` | + +Fine buckets: + +| bucket | time | share | launches | +|--------|-----:|------:|---------:| +| `mmq_nvfp4` | `5936.34 ms` | `29.33%` | `34162` | +| `gdn_core` | `5920.40 ms` | `29.25%` | `4710` | +| `convert_dtype` | `662.34 ms` | `3.27%` | `52440` | +| `gdn_conv` | `457.47 ms` | `2.26%` | `7290` | +| `concat_layout` | `440.01 ms` | `2.17%` | `2130` | +| `copy_layout` | `119.16 ms` | `0.59%` | `8110` | +| `ew_repeat` | `47.83 ms` | `0.24%` | `18840` | + +Layout trace summary: + +| route | trace lines | +|-------|------------:| +| `get_rows` | `18779` | +| `cpy` | `4638` | +| `cont` | `4384` | +| `concat` | `2199` | + +Top attribution: + +| finding | evidence | +|---------|----------| +| `concat_layout` is conv input materialization | `conv_input-* = concat(conv_states_reshaped-*, qkv_mixed_transposed-*)`; top shapes include `45x8192x12x1 = 3x8192x12x1 + 42x8192x12x1` (`450` trace lines) and `49x8192x11x1 = 3x8192x11x1 + 46x8192x11x1` (`180` trace lines). | +| `copy_layout` includes conv state writeback | `conv_state_update-* = cpy(conv_state_last-*, conv_state_update-*)`; top grouped shapes include `24576x12x1x1 <- 3x8192x12x1` (`780` trace lines), `24576x11x1x1` (`420`), and `24576x13x1x1` (`270`). | +| `convert_dtype` needs stronger attribution | the trace sees many unnamed `CPY` rows with F32 source and F16 destination, e.g. `256x166x2x11`, `256x166x2x12`, and similar attention/KV-shaped tensors; names are not preserved by the current dispatch trace. | + +Decision: + +- Phase99 is a measurement-only phase; no runtime patch was carried or reverted. +- Do not spend more time on the Phase96-style conv-state identity shortcut. + The serving hot layout path is the prefill/microbatch `conv_input` concat + feeding `SSM_CONV`, not just decode update writeback. +- A conv-side source phase must be a larger two-source `SSM_CONV` contract that + reads `(conv_states, qkv_mixed)` as a logical concatenation, or it is too small + to fund. If not coding that, first extend trace attribution for the larger + unnamed F32->F16 `convert_dtype` bucket. + +### Phase98: Phase93 Serving Graph-Node Profile + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no source change; this measured the carried Phase93 stack + after Phase95 and Phase96 reverts. +- Harness: + - streamed `/home/mudler/bench/phase76_current_moe_profile.sh` with two + measurement-only edits: + - source logging does not call `git` because the DGX Phase93 mirror is a + source copy without `.git`, + - pre/post gate ops include `GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID`, + - `SRC=/home/mudler/llama-phase93-qwen3next-gqa-bcast`, + - `BIN=/home/mudler/llama-phase93-qwen3next-gqa-bcast/build/bin`, + - `N=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`, `CTX=131072`. +- Artifact: + `/home/mudler/bench/phase98_phase93_serving_profile/20260701_215715`. + +Safety gates: + +| phase | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-------------------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | + +Serving under graph-node profiling, MoE `N=128`, `PTOK=128`, `GEN=64`, +`PARALLEL=128`: + +| metric | value | +|--------|------:| +| aggregate t/s | `208.4` | +| decode aggregate t/s | `332.0` | +| decode per-seq t/s | `2.12` | +| prefill t/s | `1488.1` | +| TTFT mean ms | `8315.5` | +| wall s | `39.296` | +| total kernel time | `20.0411 s` | + +Macro buckets: + +| bucket | time | share | +|--------|-----:|------:| +| GDN | `6679.96 ms` | `33.33%` | +| MoE/FFN-GEMM | `6034.52 ms` | `30.11%` | +| bf16/fp8-proj | `2766.06 ms` | `13.80%` | +| layout-copy | `1257.60 ms` | `6.28%` | +| ew-mul(weight/norm/GDN) | `726.03 ms` | `3.62%` | +| act-quant | `686.69 ms` | `3.43%` | +| FA | `265.00 ms` | `1.32%` | + +Fine buckets: + +| bucket | time | share | launches | +|--------|-----:|------:|---------:| +| `gdn_core` | `5892.99 ms` | `29.40%` | `4680` | +| `mmq_nvfp4` | `5809.55 ms` | `28.99%` | `33442` | +| `cublas_bf16_gemm` | `1745.83 ms` | `8.71%` | `22200` | +| `cutlass_bf16_gemm` | `740.22 ms` | `3.69%` | `26190` | +| `ew_mul` | `720.94 ms` | `3.60%` | `48326` | +| `act_quant` | `686.69 ms` | `3.43%` | `37526` | +| `convert_dtype` | `663.45 ms` | `3.31%` | `51300` | +| `gdn_conv` | `457.11 ms` | `2.28%` | `7260` | +| `concat_layout` | `430.25 ms` | `2.15%` | `2100` | +| `get_rows` | `283.56 ms` | `1.41%` | `27978` | +| `gdn_gather` | `231.32 ms` | `1.15%` | `360` | +| `mm_ids` | `119.93 ms` | `0.60%` | `16680` | +| `gdn_l2norm` | `98.54 ms` | `0.49%` | `9360` | +| `gemv_moe_q` | `81.77 ms` | `0.41%` | `1560` | + +Decision: + +- Phase98 confirms the serving hot path is still a two-bucket problem: + `gdn_core` and `mmq_nvfp4` together account for `58.39%` of kernel time. +- The repeated negative GDN micro-tries (Phase91, Phase92, Phase95, Phase96) + argue against more scalar/launch/gather shortcuts. A credible GDN follow-up + needs a larger recurrence design with a measured PoC, not another local tweak. +- `layout-copy` is now large enough (`6.28%`, led by `convert_dtype` and + `concat_layout`) to deserve attribution before code changes, but it is not + parity-closing by itself. +- Next phase should either: + - attribute `convert_dtype`/`concat_layout` to exact graph nodes and remove a + proven material copy, or + - pursue a larger `gdn_core`/`mmq_nvfp4` serving lever with a strict PoC gate. + +### Phase97: Phase93 Serving Snapshot, N=128 + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no source change; this measured the carried Phase93 stack + after Phase95 and Phase96 reverts. +- Harness: + - streamed `paged-current-serving-snapshot.sh` with a one-line source-log + workaround because the DGX Phase93 mirror is a source copy without `.git`, + - `SRC=/home/mudler/llama-phase93-qwen3next-gqa-bcast`, + - `BUILD_DIR=/home/mudler/llama-phase93-qwen3next-gqa-bcast/build`, + - `BIN=/home/mudler/llama-phase93-qwen3next-gqa-bcast/build/bin`, + - `NPL=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`, `CTX=131072`, + - gate ops: `GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID`. +- Artifact: + `/home/mudler/bench/phase97_phase93_serving_snapshot/20260701_214648`. + +Safety gates: + +| phase | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-------------------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | + +Serving snapshot, MoE `PTOK=128`, `GEN=64`, `PARALLEL=128`, `N=128`: + +| arm | n | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | ttft_mean_ms | wall_s | +|-----|--:|--------:|---------------:|------------------:|------------:|-------------:|-------:| +| paged Phase93 | `128` | `329.6` | `669.8` | `3.85` | `1734.5` | `7415.4` | `24.851` | +| vLLM | `128` | `664.8` | `1029.4` | `6.79` | `5271.8` | `2519.5` | `11.929` | + +Ratios: + +| n | paged decode/vLLM | paged perseq/vLLM | paged agg/vLLM | paged TTFT/vLLM | +|--:|------------------:|------------------:|---------------:|----------------:| +| `128` | `0.6507` | `0.5670` | `0.4958` | `2.9432` | + +Decision: + +- Phase93 remains a valid decode-profile improvement, but it is not + serving-parity at `n=128`. +- The Phase97 paged aggregate is slightly above the Phase72 default snapshot + (`329.6` vs `325.8`), and TTFT improves (`7415.4 ms` vs `7822.5 ms`), but + decode aggregate is lower than Phase72 (`669.8` vs `714.0`) while vLLM stays + essentially unchanged (`1029.4` vs `1029.5`). +- Treat Phase93 as worth carrying for source quality and decode-profile gain, + but the next parity phase needs a larger serving-impact lever. More isolated + GDN/conv micro-optimizations are unlikely to close the live serving gap. + +### Phase96: Conv-State Identity Fast Path + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: runtime model-graph change reverted after profiling; + Phase93 is still the current carried source. +- Rationale: + - The Phase93 decode profile showed `ssm_conv_update_ids_f32`/`gdn_conv` + around the 66-72 ms range, larger than the cleanly attributable remaining + GDN producer math. + - The recurrent GDN path already uses a direct in-place op when + `s_copy_main` is identity. This trial added the same shape of branch to + `build_conv_state_fused`: when `inp->s_copy_main_identity` was true, it + viewed the active conv-state cache slots directly and called + `ggml_ssm_conv_update_inplace` instead of the ids variant. + - The existing `build_rs` zero/extra-state maintenance stayed around the + lambda, and the CUDA update kernel loads the conv window before writing the + same slot, so the identity aliasing was expected to be safe. +- Gate and profile artifacts: + - canonical gates: + `/home/mudler/bench/phase96_conv_identity_fastpath/20260701_214023/canonical_gates`, + - decode-only profile: + `/home/mudler/bench/phase96_conv_identity_fastpath/20260701_214141/decode_profile`. + +Safety gates: + +| check | result | +|-------|--------| +| local build | `cmake --build build --target test-backend-ops -j $(nproc)` OK | +| local CPU `SSM_CONV` | `45/45` | +| DGX CUDA `SSM_CONV` | `45/45`, `Backend CUDA0: OK` | +| DGX CUDA `GATED_DELTA_NET_INPLACE_IDS` | `6/6`, `Backend CUDA0: OK` | +| canonical MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| canonical dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| canonical `SSM_CONV` | `45/45`, `Backend CUDA0: OK` | +| canonical `GATED_DELTA_NET` | `48/48`, `Backend CUDA0: OK` | +| canonical `MUL_MAT` | `1146/1146`, `Backend CUDA0: OK` | +| canonical `MUL_MAT_ID` | `806/806`, `Backend CUDA0: OK` | +| profile pre/post md5/op gates | all OK | + +Decode-only profile, MoE `N=128`, `N_PREDICT=2048`, capture after median +depth `74 -> 96`, default env: + +| arm | total kernel s | GDN ms | `gdn_core` ms | `gdn_core` launches | `gdn_conv` ms | `mmq_nvfp4` ms | +|-----|---------------:|-------:|--------------:|--------------------:|--------------:|---------------:| +| Phase93 default | `3.5476` | `1409.19` | `1333.48` | `570` | about `66.40` to `72.26` | `1421.63` | +| Phase96 conv identity | `3.6723` | `1486.12` | `1406.57` | `600` | `70.42` | `1433.84` | + +Decision: + +- Reject the conv-state identity fast path. It is inference-safe, but it did + not improve `gdn_conv` and worsened total kernel time and `gdn_core` versus + Phase93. +- Revert the runtime model-graph change and keep Phase93 as the current carried + candidate. +- Do not retry the conv identity branch as a speed lever unless a same-window + trace shows the ids variant itself is materially slower than the direct + variant independent of launch-count/capture variance. + +### Phase95: GDN Warp Scalar-Gate Broadcast + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: runtime CUDA change reverted after profiling; Phase93 is + still the current carried source. +- Env: + - `GDN_WARP_SCALAR_GATE=1` +- Rationale: + - After Phase93, the remaining GDN producer buckets are small while + `gdn_core` remains the largest target. + - The scalar non-KDA decode path loads one scalar gate value per + `(head, seq, token)`, but every lane computes `expf(*g_t)`. This + default-off trial computed the scalar gate on lane 0 and broadcast it within + the warp for the one-token `S_v=128`, non-KDA, default `16x8` decode path. + - The recurrence order, reductions, state update, and stores were unchanged. +- Gate and profile artifacts: + - canonical gates: + `/home/mudler/bench/phase95_gdn_warp_scalar_gate/20260701_213150/canonical_gates`, + - decode-only profile: + `/home/mudler/bench/phase95_gdn_warp_scalar_gate/20260701_213311/decode_profile`. + +Safety gates: + +| check | result | +|-------|--------| +| local build | `cmake --build build --target test-backend-ops -j $(nproc)` OK | +| local CPU `GATED_DELTA_NET` | `48/48` | +| local CPU `GATED_DELTA_NET_INPLACE_IDS` | `6/6` | +| DGX CUDA `GATED_DELTA_NET`, `GDN_WARP_SCALAR_GATE=1` | `48/48`, `Backend CUDA0: OK` | +| DGX CUDA `GATED_DELTA_NET_INPLACE_IDS`, `GDN_WARP_SCALAR_GATE=1` | `6/6`, `Backend CUDA0: OK` | +| canonical MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| canonical dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| canonical `GATED_DELTA_NET` | `48/48`, `Backend CUDA0: OK` | +| canonical `MUL_MAT` | `1146/1146`, `Backend CUDA0: OK` | +| canonical `MUL_MAT_ID` | `806/806`, `Backend CUDA0: OK` | +| profile pre/post md5/op gates | all OK | + +Decode-only profile, MoE `N=128`, `N_PREDICT=2048`, capture after median +depth `65 -> 87`, `PROFILE_ENV=GDN_WARP_SCALAR_GATE=1`: + +| arm | total kernel s | GDN ms | GDN % | `gdn_core` ms | `gdn_core` launches | `mmq_nvfp4` ms | +|-----|---------------:|-------:|------:|--------------:|--------------------:|---------------:| +| Phase93 default | `3.5476` | `1409.19` | `39.72%` | `1333.48` | `570` | `1421.63` | +| Phase95 warp scalar gate | `3.6317` | `1483.44` | `40.85%` | `1402.40` | `599` | `1402.88` | + +Decision: + +- Reject `GDN_WARP_SCALAR_GATE=1`. It is inference-safe, but worsens the target + `gdn_core` bucket by `+68.92 ms` and total kernel time by `+84.1 ms` versus + Phase93. +- Revert the runtime CUDA change and keep Phase93 as the current carried + candidate. +- Do not retry scalar-gate warp broadcast unless a future profile shows SFU + pressure, rather than recurrent state traffic/reductions, dominating the + decode GDN core. + +### Phase94: Phase93 GDN Geometry Reprobe, 8x8 + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no source change; env-only geometry probe rejected. +- Env: + - `GDN_NW=8` + - `GDN_CPW=8` +- Rationale: + - Phase93 changed the active GDN launch mix and dropped `gdn_core` to the + current best `1333.48 ms`. + - The 8x8 geometry keeps a single S_v=128 column tile (`grid.z=1`) like the + default 16x8 path, but halves threads per block. This tested whether lower + block occupancy pressure helped after grouped Q/K broadcast. +- Gate and profile artifacts: + - canonical gates: + `/home/mudler/bench/phase94_gdn_geometry_phase93/20260701_211730/canonical_gates_8x8`, + - decode-only profile: + `/home/mudler/bench/phase94_gdn_geometry_phase93/20260701_211855/decode_profile_8x8`. + +Safety gates: + +| check | result | +|-------|--------| +| DGX CUDA `GATED_DELTA_NET`, `GDN_NW=8 GDN_CPW=8` | `48/48`, `Backend CUDA0: OK` | +| canonical MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| canonical dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| canonical `GATED_DELTA_NET` | `48/48`, `Backend CUDA0: OK` | +| canonical `MUL_MAT` | `1146/1146`, `Backend CUDA0: OK` | +| canonical `MUL_MAT_ID` | `806/806`, `Backend CUDA0: OK` | +| profile pre/post md5/op gates | all OK | + +Decode-only profile, MoE `N=128`, `N_PREDICT=2048`, capture after median +depth `74 -> 96`, `PROFILE_ENV=GDN_NW=8 GDN_CPW=8`: + +| arm | total kernel s | GDN ms | GDN % | `gdn_core` ms | `gdn_core` launches | `mmq_nvfp4` ms | +|-----|---------------:|-------:|------:|--------------:|--------------------:|---------------:| +| Phase93 default geometry | `3.5476` | `1409.19` | `39.72%` | `1333.48` | `570` | `1421.63` | +| Phase94 8x8 geometry | `3.6223` | `1522.02` | `42.02%` | `1440.79` | `600` | `1352.68` | + +Decision: + +- Reject `GDN_NW=8 GDN_CPW=8` for Phase93. It is inference-safe, but worsens + the target `gdn_core` bucket by `+107.31 ms` and total kernel time by + `+74.7 ms`. +- Keep the Phase93 default `16x8` geometry. +- The profile also shows remaining producer-side GDN work is small compared with + recurrence core: `l2_norm_f32 8.65 ms`, GDN gate/sigmoid kernels about + `12.75 ms`, and remaining repeat `5.34 ms` in the Phase93 default trace. The + next candidate should target recurrence work or a larger packed decode + contract, not another small producer-only fusion. + +### Phase93: Qwen3Next Grouped Q/K Broadcast for Fused GDN + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: carried as a positive candidate. +- Patch scope: + - added `ggml_gated_delta_net_set_bcast(tensor, grouped)` using + `op_params[2]`, + - kept default GDN Q/K head mapping as the existing tiled/modulo behavior, + - added grouped mapping for opt-in GDN calls: + `qk_head = value_head / (H_v / H_k)`, + - threaded the grouped flag through CPU GDN, CUDA sequential decode, and CUDA + chunked prefill kernels, + - changed Qwen3Next to skip the explicit q/k repeat only when the GDN op path + can consume grouped broadcast, + - added grouped broadcast backend-op coverage for one-token and prompt-sized + `GATED_DELTA_NET`. +- Build artifact: + `/home/mudler/llama-phase93-qwen3next-gqa-bcast/build`. +- Gate and profile artifacts: + - canonical gates: + `/home/mudler/bench/phase93_qwen3next_gqa_bcast/20260701_210857/canonical_gates`, + - decode-only profile: + `/home/mudler/bench/phase93_qwen3next_gqa_bcast/20260701_211019/decode_profile`. + +Safety gates: + +| check | result | +|-------|--------| +| local build | `cmake --build build --target test-backend-ops -j $(nproc)` OK | +| local CPU `GATED_DELTA_NET` | `48/48`, includes grouped AR and PP cases | +| local CPU `GATED_DELTA_NET_INPLACE_IDS` | `6/6` | +| DGX CUDA `GATED_DELTA_NET` | `48/48`, includes grouped AR and PP cases | +| DGX CUDA `GATED_DELTA_NET_INPLACE_IDS` | `6/6` | +| canonical MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| canonical dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| canonical `GATED_DELTA_NET` | `48/48`, `Backend CUDA0: OK` | +| canonical `MUL_MAT` | `1146/1146`, `Backend CUDA0: OK` | +| canonical `MUL_MAT_ID` | `806/806`, `Backend CUDA0: OK` | +| profile pre/post md5/op gates | all OK | + +Decode-only profile, MoE `N=128`, `N_PREDICT=2048`, capture after median +depth `73 -> 94`, default env: + +| arm | total kernel s | GDN ms | GDN % | `gdn_core` ms | `gdn_core` launches | `mmq_nvfp4` ms | +|-----|---------------:|-------:|------:|--------------:|--------------------:|---------------:| +| Phase87 same-source default | `3.6310` | `1471.27` | `40.52%` | `1390.56` | `598` | `1416.46` | +| Phase91 pack2 PDL-fix | `3.5813` | `1505.91` | `42.05%` | `1425.44` | `598` | `1333.39` | +| Phase92 store-fused | `3.7419` | `1609.81` | `43.02%` | `1529.72` | `600` | `1383.82` | +| Phase93 Qwen3Next grouped broadcast | `3.5476` | `1409.19` | `39.72%` | `1333.48` | `570` | `1421.63` | + +Decision: + +- Carry Phase93. It is md5/op clean and improves the target `gdn_core` bucket by + `-57.08 ms` vs Phase87 same-source default, `-91.86 ms` vs Phase85 + identity-state (`1400.34 ms`), and `-92.0 ms` vs the rejected Phase91 pack2 + trial. +- The win is consistent with the intended work reduction: Qwen3Next stops + materializing repeated q/k heads for fused GDN and lets the op map value heads + to grouped q/k heads directly. +- Next follow-up should profile/count node-level repeat/layout buckets around + Qwen3Next GDN to confirm whether more vLLM-style packed decode producer work + remains worth porting. + +### Phase92: Scalar Decode Store-Fused GDN Trial + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase92-gdn-store-fused`, default-off CUDA + experiment on top of the Phase90/91 guardrail stack. +- Local patch status: runtime CUDA changes reverted after profiling; guardrail + stack remains. +- Patch scope: + - added a `STORE_FUSED` CUDA kernel instantiation behind + `GDN_SCALAR_DECODE_STORE_FUSED=1`, + - gated it to S_v=128, scalar-gate, final-state, one-token, in-place decode + with default geometry, + - wrote `state_dst` inside the scalar update loop and skipped the final + post-token register-store loop for that instantiation. +- Build artifact: + `/home/mudler/llama-phase92-gdn-store-fused/build`. +- Guardrail and gate artifacts: + - canonical gates: + `/home/mudler/bench/phase92_gdn_scalar_store_fused/20260701_204550/canonical_gates`, + - decode-only profile: + `/home/mudler/bench/phase92_gdn_scalar_store_fused/20260701_204718/decode_profile`. + +Safety gates: + +| check | result | +|-------|--------| +| local build | `cmake --build build --target test-backend-ops -j $(nproc)` OK | +| local CPU guardrail | `GATED_DELTA_NET_INPLACE_IDS` `6/6`, `Backend CPU: OK` | +| DGX CUDA guardrail, `GDN_SCALAR_DECODE_STORE_FUSED=1` | `6/6`, `Backend CUDA0: OK` | +| canonical MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| canonical dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| canonical `GATED_DELTA_NET` | `46/46`, `Backend CUDA0: OK` | +| canonical `MUL_MAT` | `1146/1146`, `Backend CUDA0: OK` | +| canonical `MUL_MAT_ID` | `806/806`, `Backend CUDA0: OK` | +| profile pre/post md5/op gates | all OK | + +Decode-only profile, MoE `N=128`, `N_PREDICT=2048`, capture after median +depth `72 -> 94`, `PROFILE_ENV=GDN_SCALAR_DECODE_STORE_FUSED=1`: + +| arm | total kernel s | GDN ms | GDN % | `gdn_core` ms | `gdn_core` launches | `mmq_nvfp4` ms | +|-----|---------------:|-------:|------:|--------------:|--------------------:|---------------:| +| Phase87 same-source default | `3.6310` | `1471.27` | `40.52%` | `1390.56` | `598` | `1416.46` | +| Phase91 pack2 PDL-fix | `3.5813` | `1505.91` | `42.05%` | `1425.44` | `598` | `1333.39` | +| Phase92 store-fused | `3.7419` | `1609.81` | `43.02%` | `1529.72` | `600` | `1383.82` | + +Decision: + +- Reject and revert the store-fused runtime patch. It is inference-safe under + the current md5/op gates, but it worsens the target `gdn_core` bucket by + `+139.16 ms` vs Phase87 same-source default and `+104.28 ms` vs the already + rejected Phase91 pack2 trial. +- The extra in-loop global stores likely increase pressure/ordering cost enough + to outweigh removing the final register pass. Do not retry this shape unless + a profile shows the final store loop as independently dominant. +- Next higher-value direction from the vLLM code audit is not another + recurrence micro-loop tweak; scope the larger packed decode contract or the + Qwen3Next GQA-repeat removal as separate, guarded phases. + +### Phase91: Default-off PACK=2 Decode Kernel, Guarded Retry + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase91-gdn-pack2-guarded-source`, default-off + CUDA experiment on top of the Phase90 guardrail stack. +- Local patch status: runtime CUDA changes reverted after profiling; Phase90 + test guardrail remains. +- Patch scope: + - reintroduced a `GDN_DECODE_PACK2=1` F32 scalar-gate, one-token, + in-place decode kernel that packs two sequences into one CTA, + - added a PDL-safety fix after the first canonical md5 failure: inactive + odd/single sequence lanes now call `ggml_cuda_pdl_sync()` before returning, + - extended the guardrail with F32 `n_seqs=1` and `n_seqs=3` + output-plus-state cases. +- Build artifact: + `/home/mudler/llama-phase91-gdn-pack2-guarded-source/build`. +- Guardrail artifacts: + - initial `n_seqs=2` guardrail pass: + `/home/mudler/bench/phase91_gdn_pack2_guarded/20260701_201943/guardrail`, + - initial canonical md5 failure: + `/home/mudler/bench/phase91_gdn_pack2_guarded/20260701_202024/canonical_gates`, + - PDL-fix expanded guardrail pass: + `/home/mudler/bench/phase91_gdn_pack2_guarded/20260701_202140/guardrail_pdl_fix`, + - PDL-fix canonical gates with `GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID`: + `/home/mudler/bench/phase91_gdn_pack2_guarded/20260701_202154/canonical_gates_pdl_fix`, + - decode-only profile: + `/home/mudler/bench/phase91_gdn_pack2_guarded/20260701_202425/decode_profile_pdl_fix`. + +Safety gates: + +| check | result | +|-------|--------| +| initial Phase90 guardrail, `GDN_DECODE_PACK2=1` | `4/4`, `Backend CUDA0: OK` | +| initial canonical MoE md5 | failed: `b93724e88460d90379c5009df0e1f2b6` vs `8cb0ce23777bf55f92f63d0292c756b0` | +| expanded guardrail after PDL fix | `6/6`, covers F32 `n_seqs=1,2,3` output-plus-state | +| PDL-fix MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| PDL-fix dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| PDL-fix `GATED_DELTA_NET` | `46/46`, `Backend CUDA0: OK` | +| PDL-fix `MUL_MAT` | `1146/1146`, `Backend CUDA0: OK` | +| PDL-fix `MUL_MAT_ID` | `806/806`, `Backend CUDA0: OK` | + +Decode-only profile, MoE `N=128`, `N_PREDICT=2048`, capture after +median depth `66 -> 88`, `PROFILE_ENV=GDN_DECODE_PACK2=1`: + +| arm | total kernel s | GDN ms | GDN % | `gdn_core` ms | `gdn_core` launches | `mmq_nvfp4` ms | +|-----|---------------:|-------:|------:|--------------:|--------------------:|---------------:| +| Phase87 same-source default | `3.6310` | `1471.27` | `40.52%` | `1390.56` | `598` | `1416.46` | +| Phase85 identity state | `3.6622` | `1480.21` | `40.42%` | `1400.34` | `596` | `1437.53` | +| Phase91 pack2 PDL-fix | `3.5813` | `1505.91` | `42.05%` | `1425.44` | `598` | `1333.39` | + +Decision: + +- Reject and revert the pack2 runtime patch. It is inference-safe after the PDL + fix, but it worsens the target `gdn_core` bucket by `+34.88 ms` vs the + Phase87 same-source default and `+25.10 ms` vs Phase85. +- Keep the expanded Phase90/91 `GATED_DELTA_NET_INPLACE_IDS` guardrail cases + because they caught the missing odd/single sequence coverage. +- Do not retry CTA-level sequence packing without a different per-sequence work + reduction; packing alone raises GDN's share of total kernel time. + +### Phase90: In-place GDN Decode State Guardrail + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase90-gdn-inplace-ids-guardrail-source`, + test-only experiment on top of the current Phase85 carry-forward stack. +- Local patch status: kept as a guardrail candidate in + `tests/test-backend-ops.cpp`. +- Patch scope: + - fixes the in-place ids fixture initialization by mirroring the identity + source cache bytes into `state_dst` after random tensor initialization, + - adds F32 serving-shape cases: `head_count=4`, `head_size=128`, + `n_seqs=2`, scalar gate and KDA, + - makes those F32 cases return `concat(flatten(out), flatten(state_dst))`, + so the normal backend comparator validates both attention output and the + recurrent-state side effect. +- Build artifact: + `/home/mudler/llama-phase90-gdn-inplace-ids-guardrail-source/build`. +- Gate artifacts: + - stale-source assertion: + `/home/mudler/bench/phase90_gdn_inplace_ids_guardrail/20260701_200946/direct`, + - output-only corrected pass: + `/home/mudler/bench/phase90_gdn_inplace_ids_guardrail/20260701_201058/direct`, + - output-plus-state corrected pass: + `/home/mudler/bench/phase90_gdn_inplace_ids_guardrail/20260701_201257/direct`. + +DGX verification: + +| check | result | +|-------|--------| +| local build | `cmake --build build --target test-backend-ops -j $(nproc)` completed | +| local CPU selected op | `4/4`, including F32 `check_state=1` cases | +| DGX CUDA selected op, stale source | failed before comparison on BF16 `state_dst` F32-only assert | +| DGX CUDA selected op, corrected output-only source | `4/4`, `Backend CUDA0: OK` | +| DGX CUDA selected op, output plus state | `4/4`, `Backend CUDA0: OK` | + +Decision: + +- Keep this as the minimum guardrail for the next packed decode attempt. It + covers the Phase88 target shape (`S_v=128`, one-token decode, two sequences) + and observes the side-effect `state_dst` update for F32 scalar-gate and KDA + cases. +- BF16 in-place ids cases remain output-only in this fixture; use canonical md5 + gates for full-model BF16 inference safety. +- Do not profile Phase90: it is a test harness/guardrail attempt, not a runtime + performance candidate. + +### Phase89: In-place GDN Decode Test Guardrail Attempt + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase89-gdn-decode-gate-source`, test-only + experiment on top of the reverted Phase88 source. +- Local patch status: reverted after the targeted test filter failed. +- Patch scope: + - temporarily added two `test_gated_delta_net_inplace_ids` cases in + `tests/test-backend-ops.cpp`: + - F32, `head_count=4`, `head_size=128`, `n_seqs=2`, scalar gate, + - F32, `head_count=4`, `head_size=128`, `n_seqs=2`, KDA. +- Build artifact: + `/home/mudler/llama-phase89-gdn-decode-gate-source/build-cuda`. +- Build logs: + - `/home/mudler/llama-phase89-gdn-decode-gate-source/configure.phase89.log` + - `/home/mudler/llama-phase89-gdn-decode-gate-source/build.phase89.log` +- Gate artifact: + `/home/mudler/bench/phase89_gdn_decode_gate/20260701_175903/direct`. + +DGX verification: + +| check | result | +|-------|--------| +| local build | `cmake --build build --target test-backend-ops -j 8` completed | +| local run | local CPU backend skipped for this op set | +| CUDA `GATED_DELTA_NET` filter | `46/46`, `Backend CUDA0: OK` | +| CUDA `GATED_DELTA_NET_INPLACE_IDS` filter | failed `0/4`, including both newly added F32 cases and the two pre-existing BF16 cases | + +Decision: + +- Reject and revert the test-only change. The direct + `GATED_DELTA_NET_INPLACE_IDS` filter is not currently a reliable green + guardrail, because the existing BF16 cases fail when selected directly. +- Do not add more packed decode source until there is a focused harness for the + serving decode shape that compares both attention output and the side-effect + `state_dst` update against the existing sequential kernel. + +### Phase88: Default-off PACK=2 Decode CTA Kernel + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase88-gdn-pack2-source`, one-file CUDA + experiment on top of Phase85. +- Local patch status: reverted after md5 failure. +- Patch scope: + - added `gated_delta_net_decode_pack2_cuda` in + `ggml/src/ggml-cuda/gated_delta_net.cu`, + - gated it behind `GDN_DECODE_PACK2=1`, + - limited it to F32 state, scalar-gate, `S_v == 128`, `n_tokens == 1`, + in-place decode, with no `GDN_NW/GDN_CPW` override, + - attempted to preserve the existing `(16,8)` per-column math order while + packing two independent sequences into one CTA. +- Build artifact: + `/home/mudler/llama-phase88-gdn-pack2-source/build-cuda`. +- Build logs: + - `/home/mudler/llama-phase88-gdn-pack2-source/configure.phase88.log` + - `/home/mudler/llama-phase88-gdn-pack2-source/build.phase88.log` +- Gate artifact: + `/home/mudler/bench/phase88_gdn_pack2_gates/20260701_175059/direct`. +- Profile artifact: none. Profiling was skipped because the md5 gate failed. + +DGX gates with `GDN_DECODE_PACK2=1`: + +| check | result | +|-------|--------| +| MoE md5 | failed, got `320b5ed679844cbfd6f18d85d7ae32b0`, expected `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | failed, got `6a65e9d9e47321ebce9e461c8abf036c`, expected `5951a5b4d624ce891e22ab5fca9bc439` | +| `GATED_DELTA_NET` | `Backend CUDA0: OK` | +| `MUL_MAT` | `Backend CUDA0: OK` | +| `MUL_MAT_ID` | `Backend CUDA0: OK` | + +Observed output symptom: + +- MoE output duplicated the opening `` marker. +- Dense output degenerated into repeated `/` characters immediately after the + opening `` marker. + +Decision: + +- Reject and revert. The sacred greedy md5 gate failed, so no profile was run. +- The existing `test-backend-ops -o GATED_DELTA_NET` set did not catch this + because it does not cover the exact serving decode shape that triggers the + pack2 path. Before another packed decode attempt, add or script a focused + `n_seq_tokens=1`, `n_seqs > 1`, in-place F32 state equivalence gate against + the existing sequential kernel. +- Do not carry the pack2 kernel in the patch stack. + +### Phase87: Decode Geometry Probe `(GDN_NW=4, GDN_CPW=8)` + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase87-gdn-4x8-source`, one-line CUDA + dispatcher experiment on top of Phase85: + expose `launch_gdn_variant<128, ..., NUM_WARPS=4, COLS_PER_WARP=8>` through + the existing `GDN_NW/GDN_CPW` env sweep. +- Local patch status: reverted after profiling. The attempt was env-gated and + never made default. +- Build artifact: + `/home/mudler/llama-phase87-gdn-4x8-source/build-cuda`. +- Build logs: + - `/home/mudler/llama-phase87-gdn-4x8-source/configure.phase87.log` + - `/home/mudler/llama-phase87-gdn-4x8-source/build.phase87.log` +- Gate artifact: + `/home/mudler/bench/phase87_gdn_4x8_gates/20260701_174014/direct`. +- Profile artifact: + `/home/mudler/bench/phase87_gdn_4x8_profile/20260701_174310`. +- Result type: source geometry probe. The hypothesis was that a `4*8 = 32` + column tile would be closer to vLLM's `BV=32` decode program shape while + preserving the existing per-column reduction order. + +DGX gates with `GDN_NW=4 GDN_CPW=8`: + +| check | result | +|-------|--------| +| MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| `GATED_DELTA_NET` | `Backend CUDA0: OK` | +| `MUL_MAT` | `Backend CUDA0: OK` | +| `MUL_MAT_ID` | `Backend CUDA0: OK` | + +Same-source decode-only profile: + +| arm | source | env | active slots | depth start | depth mid | total kernel s | GDN ms | GDN share | `gdn_core` ms | `gdn_core` launches | `mmq_nvfp4` ms | +|-----|--------|-----|-------------:|------------:|----------:|---------------:|-------:|----------:|--------------:|--------------------:|---------------:| +| default geometry | `/home/mudler/llama-phase87-gdn-4x8-source` | default `(16,8)` | `128` | `74` | `96` | `3.6310` | `1471.27` | `40.52%` | `1390.56` | `598` | `1416.46` | +| Phase87 4x8 | `/home/mudler/llama-phase87-gdn-4x8-source` | `GDN_NW=4 GDN_CPW=8` | `128` | `71` | `92` | `3.5988` | `1493.66` | `41.50%` | `1417.13` | `569` | `1396.11` | + +Decision: + +- Reject. The target bucket regressed by `+26.57 ms` (`+1.91%`) despite lower + total kernel time from unrelated `mmq_nvfp4` variance. +- Reverted the one-line dispatcher addition. Do not carry this in the patch + stack. +- The subagent/code audit points to a different Phase88 shape: keep the current + `(16,8)` per-column math order and pack two independent sequences per CTA, or + implement a fuller vLLM-style packed decode kernel that fuses producer math + and recurrence. + +### Phase86: Producer-fusion Scope Audit + +- Date: 2026-07-01. +- Source: no source patch. This is a profile-backed scope rejection using the + Phase85 node-traced DGX artifact before spending code on a small-ceiling + fusion. +- Input profile artifact: + `/home/mudler/bench/phase85_gdn_identity_state_profile/20260701_171856`. +- Source audit: + - `ggml/src/ggml-cuda/ggml-cuda.cu` already fuses + `{ GGML_OP_UNARY, GGML_OP_MUL }` for `SILU`, `SIGMOID`, and `SOFTPLUS`, + covering the expensive part of `alpha_softplus * ssm_a`. + - Qwen35 and Qwen35MoE still compute beta sigmoid and the alpha bias/softplus + producer as separate graph pieces, but those pieces are small in the + decode-only trace. + - vLLM's Triton producer fusion remains a useful design reference, but its + isolated producer scope is not the main GB10 bottleneck in this llama.cpp + profile. +- Gate artifact: not applicable, no binary changed. +- Result type: no-code benchmark/scope attempt. The benchmark record below is + copied from the Phase85 candidate profile because Phase86 deliberately asks + whether a source patch is worth writing. + +Same-window profile evidence: + +| bucket | time | share | launches | interpretation | +|--------|-----:|------:|---------:|----------------| +| total kernel time | `3.6622 s` | `100.00%` | - | Phase85 identity-state candidate capture | +| `GDN` macro | `1480.21 ms` | `40.42%` | `2980` | target family remains dominant | +| `gdn_core` | `1400.34 ms` | `38.24%` | `596` | real parity lever must reduce this bucket | +| `act/GDN-gate(shared)` macro | `13.57 ms` | `0.37%` | `3771` | entire producer/gate-side ceiling is tiny | +| `gated_act_silu_sigmoid` | `10.84 ms` | `0.30%` | `1786` | already includes fused unary-gated kernels | +| `gdn_sigmoid` | `2.73 ms` | `0.07%` | `1985` | beta sigmoid ceiling | +| `unary_op_kernel<&op_softplus>` | about `1.08 ms` | about `0.03%` | `596` | alpha softplus standalone signal from `nsys stats` | + +Decision: + +- Reject a narrow Phase86 producer-only implementation. Even deleting the whole + `act/GDN-gate(shared)` macro would improve the captured total by only + `0.37%`, and deleting only the still-unfused beta sigmoid would be about + `0.07%`. +- Do not modify or gate source for this phase. It would add upstream conflict + surface without meaningful parity upside. +- Phase87 should target a packed decode GDN kernel, inspired by vLLM's decode + path, that reduces launches and memory traffic inside `gdn_core` itself while + preserving the default F32 recurrent S-cache and md5/op gates. + +### Phase85: Identity-contiguous GDN State Fast Path + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase85-gdn-identity-state-source`, local + eight-file experiment on top of fork commit + `237ad9b96 feat(cuda): add BF16 Qwen GDN state cache`. +- Local patch scope: + - carry forward Phase84 attention-only in-place GDN output cleanup, + - add a side-effect-free `llama_memory_recurrent_context::s_copy_main_is_identity`, + - store that identity bit in `llm_graph_input_rs`, + - include it in base and hybrid graph reuse checks, + - call `ggml_gated_delta_net_inplace` on a direct state view when active + recurrent rows are identity-contiguous, otherwise keep the ids path. +- Build artifact: + `/home/mudler/llama-phase85-gdn-identity-state-source/build-cuda`. +- Build logs: + - `/home/mudler/llama-phase85-gdn-identity-state-source/configure.phase85.log` + - `/home/mudler/llama-phase85-gdn-identity-state-source/build.phase85.log` +- Gate artifact: + `/home/mudler/bench/phase85_gdn_identity_state_gates/20260701_171733/direct`. +- Profile artifact: + `/home/mudler/bench/phase85_gdn_identity_state_profile/20260701_171856`. +- Result type: source cleanup / small performance experiment. This reuses the + existing F32 recurrent-state CUDA kernel and changes only the source-state + view used for identity-contiguous decode windows. It avoids the ids scratch + allocation and no-op `gdn_gather_nonident_kernel` launch in that graph shape. + +Local verification: + +| check | result | +|-------|--------| +| local build | `cmake --build build --target test-backend-ops llama-server -j 8` completed | +| local note | `llama-server` build used the UI archive fallback after local npm engine warning; target completed | + +DGX gates: + +| check | result | +|-------|--------| +| MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| `GATED_DELTA_NET` | `46/46`, `Backend CUDA0: OK` | +| `MUL_MAT` | `1146/1146`, `Backend CUDA0: OK` | +| `MUL_MAT_ID` | `806/806`, `Backend CUDA0: OK` | + +Same-window decode-only profile: + +| arm | source | active slots | depth start | depth mid | total kernel s | GDN ms | GDN share | `gdn_core` ms | `gdn_core` launches | `gdn_gather` ms | GDN macro launches | `mmq_nvfp4` ms | +|-----|--------|-------------:|------------:|----------:|---------------:|-------:|----------:|--------------:|--------------------:|----------------:|------------------:|---------------:| +| baseline F32 | `/home/mudler/llama-phase81-bf16-state-source` | `128` | `73` | `95` | `3.7081` | `1493.78` | `40.28%` | `1412.33` | `600` | `0.89` | `3600` | `1473.60` | +| Phase85 identity state | `/home/mudler/llama-phase85-gdn-identity-state-source` | `128` | `72` | `94` | `3.6622` | `1480.21` | `40.42%` | `1400.34` | `596` | not present | `2980` | `1437.53` | + +Server log signal: + +| arm | CUDA free memory at startup | graph reuse | +|-----|----------------------------:|------------:| +| baseline F32 | `116418 MiB` | `105/122 = 86.1%` | +| Phase85 identity state | `117857 MiB` | `105/123 = 85.4%` | + +Decision: + +- Carry forward only as a small cleanup candidate. The patch is md5/op green, + removes the explicit `gdn_gather` bucket, and reduces GDN macro launches. +- Do not treat it as a parity-closing speed lever: direct removed work was only + `0.89 ms` over the capture, and `gdn_core` improved by only `0.85%` + (`1412.33 -> 1400.34 ms`) in a noisy same-window run. +- Keep the next speed-focused scope on either producer fusion + (`alpha softplus * A`, beta sigmoid) or a larger packed decode kernel. The + remaining GDN gap is not explained by ids gather overhead. + +### Phase84: Attention-only Outputs for In-place GDN + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase84-attn-only-source`, local three-file + experiment on top of fork commit + `237ad9b96 feat(cuda): add BF16 Qwen GDN state cache`. +- Local patch files: + - `ggml/src/ggml.c` + - `ggml/src/ggml-cpu/ggml-cpu.c` + - `ggml/src/ggml-cpu/ops.cpp` +- Build artifact: `/home/mudler/llama-phase84-attn-only-source/build-cuda`. +- Build logs: + - `/home/mudler/llama-phase84-attn-only-source/configure.phase84.log` + - `/home/mudler/llama-phase84-attn-only-source/build.phase84.log` +- Gate artifact: + `/home/mudler/bench/phase84_attn_only_gates/20260701_165952/direct`. +- Profile artifact: + `/home/mudler/bench/phase84_attn_only_profile/20260701_170131`. +- Result type: source cleanup / memory experiment. `ggml_gated_delta_net_inplace` + and `ggml_gated_delta_net_inplace_ids` now allocate only the attention-score + output tensor because final recurrent state is written as a side effect into + `state_dst`. The CPU `inplace_ids` non-identity fallback was moved from the + old unused output tail to explicit workspace so CPU/CUDA semantics remain + aligned. + +Local verification: + +| check | result | +|-------|--------| +| local build | `cmake --build build --target test-backend-ops -j 8` completed | +| local GDN subset | no non-CPU backend locally, so CPU was skipped by `test-backend-ops` | + +DGX gates: + +| check | result | +|-------|--------| +| MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| `GATED_DELTA_NET` | `46/46`, `Backend CUDA0: OK` | +| `MUL_MAT` | `1146/1146`, `Backend CUDA0: OK` | +| `MUL_MAT_ID` | `806/806`, `Backend CUDA0: OK` | + +Same-window decode-only profile: + +| arm | source | active slots | depth start | depth mid | total kernel s | GDN ms | GDN share | `gdn_core` ms | `gdn_core` launches | `gdn_core`/launch | `mmq_nvfp4` ms | +|-----|--------|-------------:|------------:|----------:|---------------:|-------:|----------:|--------------:|--------------------:|------------------:|---------------:| +| baseline F32 | `/home/mudler/llama-phase81-bf16-state-source` | `128` | `74` | `96` | `3.6464` | `1481.59` | `40.63%` | `1399.72` | `599` | `2.337 ms` | `1418.47` | +| Phase84 attention-only | `/home/mudler/llama-phase84-attn-only-source` | `128` | `65` | `87` | `3.5814` | `1489.33` | `41.59%` | `1407.38` | `598` | `2.354 ms` | `1349.11` | + +Server log memory signal: + +| arm | CUDA free memory at startup | graph reuse | +|-----|----------------------------:|------------:| +| baseline F32 | `117472 MiB` | `107/124 = 86.3%` | +| Phase84 attention-only | `117855 MiB` | `98/115 = 85.2%` | + +Decision: + +- Do not count Phase84 as a speed parity win. The target GDN bucket moved + `1399.72 -> 1407.38 ms` (`+0.55%`), and the lower total kernel time is again + explained by unrelated `mmq_nvfp4` variance (`1418.47 -> 1349.11 ms`). +- Keep as a possible memory-footprint cleanup only if upstream maintainability + is acceptable: gates are green and the server startup memory signal improved + by about `383 MiB` in the same profile window. +- Do not regenerate the LocalAI patch series until a follow-up decides whether + this memory-only cleanup belongs in the fork commit stack. + +### Phase83: KDA GDN exp-cache Decode Shortcut + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase83-kda-gexp-source`, local one-file CUDA + experiment on top of fork commit + `237ad9b96 feat(cuda): add BF16 Qwen GDN state cache`. +- Build artifact: `/home/mudler/llama-phase83-kda-gexp-source/build-cuda`. +- Build log: + `/home/mudler/llama-phase83-kda-gexp-source/build.phase83.log`. +- Gate artifact: + `/home/mudler/bench/phase83_kda_gexp_gates/20260701_184237/direct_retry`. +- Profile artifact: + `/home/mudler/bench/phase83_kda_gexp_profile/20260701_164731`. +- Result type: source micro-optimization. Cache the KDA per-row + `expf(g_t[i])` value in a register once per token/thread in + `ggml/src/ggml-cuda/gated_delta_net.cu`, then reuse it in both the KDA + `kv` and S-update loops. This preserves the same recurrence storage, + operation order at the algorithm level, and F32 state path. + +Gate harness notes: + +- First copied-harness attempt used a LocalAI worktree path that was not present + on DGX and failed before running gates. +- Second harness attempt refused to run because this job already owned the GPU + lock. +- First direct gate script had an `awk` quoting bug after producing partial + output. +- Corrected direct retry completed and is the valid gate artifact. + +Gates: + +| check | result | +|-------|--------| +| MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| `GATED_DELTA_NET` | `46/46`, `Backend CUDA0: OK` | +| `MUL_MAT` | `1146/1146`, `Backend CUDA0: OK` | +| `MUL_MAT_ID` | `806/806`, `Backend CUDA0: OK` | + +Same-window decode-only profile: + +| arm | source | active slots | depth start | depth mid | total kernel s | GDN ms | GDN share | `gdn_core` ms | `gdn_core` launches | `gdn_core`/launch | `mmq_nvfp4` ms | +|-----|--------|-------------:|------------:|----------:|---------------:|-------:|----------:|--------------:|--------------------:|------------------:|---------------:| +| baseline F32 | `/home/mudler/llama-phase81-bf16-state-source` | `128` | `73` | `95` | `3.6487` | `1481.06` | `40.59%` | `1399.46` | `597` | `2.344 ms` | `1424.65` | +| Phase83 exp-cache | `/home/mudler/llama-phase83-kda-gexp-source` | `128` | `66` | `88` | `3.5501` | `1487.71` | `41.91%` | `1405.62` | `600` | `2.343 ms` | `1317.98` | + +Decision: + +- Reject carry-forward. The target GDN bucket was flat-to-slightly worse: + `gdn_core` changed `1399.46 -> 1405.62 ms` (`+0.44%`), while per-launch cost + stayed effectively unchanged (`2.344 -> 2.343 ms`). +- The lower total kernel time is not credited to the shortcut because the + unrelated `mmq_nvfp4` bucket dropped by `106.67 ms` in the candidate sample. +- Do not regenerate LocalAI patch-series output for this experiment. Next GDN + work should target a structural traffic or launch-shape change, not + single-expression reuse inside the current core loop. + +### Phase82: BF16 Persistent GDN S-Cache f16 KL Gate + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase81-bf16-state-source`, fork commit + `237ad9b96 feat(cuda): add BF16 Qwen GDN state cache`. +- Build artifact: `/home/mudler/llama-phase81-bf16-state-source/build-cuda`. +- KL artifact: + `/home/mudler/bench/phase82_bf16_s_cache_f16_kl/20260701_183016`. +- Result type: full MoE f16-reference KL gate for the Phase81 default-off + BF16 persistent GDN S-cache candidate. +- Reference base: `/home/mudler/bench/l4gate/klbase_moe.dat`, generated from + `/home/mudler/work/darwin_36b_opus/f16.gguf` at `-c 512 -b 2048 --chunks 16` + with f16 PPL `7.3760 +/- 0.29100`. +- Acceptance reference from `PAGED_BITEXACT_NOTE.md`: paged FP4-MMQ vs f16 + KLD `0.136000 +/- 0.003285`, PPL `7.4009`; non-paged FP4-MMQ vs f16 KLD + `0.136597 +/- 0.003157`. +- Run note: the script metadata hash lines hit an `awk` quoting issue, so + `BASE_SHA256` and `MODEL_SHA256_HEAD` are blank in `meta.txt`; both KL passes + completed and produced full logs. Treat the blank hashes as harness metadata + noise, not a model-output failure. + +Result: + +| arm | env | KLD vs f16 | PPL(Q) | PPL ratio vs f16 | same-top-p | max KLD | +|-----|-----|-----------:|-------:|-----------------:|-----------:|--------:| +| same-source F32 | `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1` | `0.136563 +/- 0.003242` | `7.418401 +/- 0.296694` | `1.006105 +/- 0.008899` | `83.725 +/- 0.578%` | `3.602697` | +| BF16 S-cache | `LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16` plus same env | `0.137162 +/- 0.003456` | `7.321044 +/- 0.290693` | `0.992902 +/- 0.008714` | `84.240 +/- 0.571%` | `5.973692` | + +Decision: + +- Reject promotion of the BF16 persistent GDN S-cache patch. +- Do not run serving A/B for this candidate under the current rules: the hard + lossy-path gate requires `KLD(new||f16) <= KLD(FP4-MMQ||f16)`, and the BF16 + S-cache mean KLD is above both the documented paged reference (`0.136000`) and + the same-source F32 measurement (`0.136563`). +- Keep the Phase81 source only as a local experimental branch unless the gate is + deliberately re-scoped. The next source attempt should preserve F32 recurrent + S-cache quality or reduce traffic without changing the MoE f16 KL band. + ### Phase81: Qwen35 BF16 Persistent GDN S-Cache - Date: 2026-07-01. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index 923343fcb..5c47041e0 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -5,6 +5,575 @@ > and `GB10_PARITY_PHASE0_RESULTS.md`, with Phase 6 serving-nsys evidence and > the active follow-up plans under `docs/superpowers/plans/`. Use those files for > the current state before relying on the older "closed" conclusion below. +> +> 2026-07-01 Phase112 update: keep the new default-off +> `LLAMA_W4A16_DIRECT_A=1` direct activation staging hook, especially combined +> with Phase110 `LLAMA_MOE_GPU_SORT=1`. Artifact: +> `/home/mudler/bench/phase112_w4a16_direct_a/20260701_231749_direct_a`. +> Selected gates passed `13/13` for W4A16+GPU-sort, direct-A, and +> direct-A+GPU-sort. Direct-A+GPU-sort improved the 257-token W4A16 fallback +> rows versus W4A16+GPU-sort control (`MOE_SWIGLU_DOWN 1551.08 -> 1477.74 us`, +> `MUL_MAT_ID_RAGGED_MOE 2278.50 -> 2166.22 us`) but was neutral/slightly +> slower on 128-token rows. Canonical README md5 gates are green: MoE +> `8cb0ce23777bf55f92f63d0292c756b0`, dense +> `5951a5b4d624ce891e22ab5fca9bc439`; compact supported op gates are green +> (`SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`, `GET_ROWS 49/49`, +> `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`). +> This is still default-off structural groundwork, not parity: W4A16 fallback +> remains slower than the default grouped-MMQ path. Use the patch-series README +> md5 command as canonical; the handoff `-no-cnv -c 4096` snippet produced +> stable but non-canonical md5s for both candidate and control. +> +> 2026-07-01 Phase113 update: reject the combined direct-A GPU-tile descriptor +> attempt. Artifact: +> `/home/mudler/bench/phase113_w4a16_direct_a_gpu_tiles/20260701_233345_no_readback`. +> The candidate (`LLAMA_W4A16_GPU_TILES=1` on top of Phase112 direct-A+GPU-sort) +> avoided the `n_tiles` readback by launching over zero-initialized `max_tiles` +> and returning early on `rows <= 0`. Selected correctness passed `13/13`, but +> perf failed the keep gate: `MOE_SWIGLU_DOWN n=257` was flat +> (`1478.16 -> 1476.36 us`) and `MUL_MAT_ID_RAGGED_MOE n=257` regressed +> (`2148.44 -> 2214.23 us`). The source was reverted and post-revert +> Phase112 direct-A+GPU-sort selected gates passed `13/13`. Next W4A16/MoE work +> should not revisit compact GPU tile descriptors; use vLLM-style padded routing +> metadata (`sorted_token_ids`, expert ids per M block, padded row count) if +> continuing this line. +> +> 2026-07-01 Phase114 update: reject the naive padded routing implementation. +> It implemented the vLLM-style metadata contract with separate padded source +> ids and destination ids for llama.cpp, plus an expert-id W4A16 consumer mode +> and a direct scatter that skipped compact `get_rows_cuda`. Correctness passed +> (`13/13`) but perf failed: after a fix using `num_tokens_post_pad` early +> returns, `MOE_SWIGLU_DOWN n=257` regressed `1477.88 -> 1726.27 us` and +> `MUL_MAT_ID_RAGGED_MOE n=257` regressed `2163.35 -> 2650.93 us`. Artifacts: +> `/home/mudler/bench/phase114_w4a16_padded_routing/20260701_234634_padded_meta` +> and +> `/home/mudler/bench/phase114_w4a16_padded_routing/20260701_235003_padded_meta_fix1`. +> Source was reverted; post-revert Phase112 direct-A+GPU-sort selected gate +> passed `13/13`. Padded metadata is not enough by itself on GB10 because sparse +> expert occupancy makes padded activation/output traffic too expensive. +> +> 2026-07-02 Phase115 update: reject another small-M/tile-policy shortcut. +> Phase115 re-tested the existing default-off `LLAMA_MOE_SMALL_M_TILE=16/32/64` +> knob on the newer Phase108 whole-graph MoE sentinels. Artifact: +> `/home/mudler/bench/phase115_moe_small_m_sentinel/20260702_020258`. +> Control and all three tile caps passed selected correctness (`13/13` each), +> but no candidate met the promotion rule. The 257-token ragged down row +> regressed for every cap (`1452.30 us` control vs `1455.02`, `1458.71`, and +> `1456.88 us`). Do not add name-based down special cases or another MMQ +> tile-policy patch. The next credible target is a true fused routed-MoE kernel +> or a graph-level fusion that removes materialized activation/output traffic. +> +> 2026-07-02 Phase116 update: reject the standalone graph-level +> SwiGLU-to-MMQ-activation-quant fusion. The default-off candidate +> `LLAMA_MOE_SWIGLU_DOWN_FUSED_QUANT=1` detected the plain +> `GLU -> down MUL_MAT_ID` pattern and computed `silu(gate) * up` directly into +> the grouped-MMQ NVFP4 activation buffer. Artifact: +> `/home/mudler/bench/phase116_moe_swiglu_down_fused_quant/20260702_022611`. +> Correctness passed (`13/13`) and the fix1 route emitted the fused marker +> (`6` hits), but perf was not useful: `MOE_SWIGLU_DOWN n=257` was flat +> (`1024.90 -> 1024.69 us`), `n=128` regressed (`806.33 -> 808.79 us`), and the +> ragged sentinel drifted slower. Source was reverted and post-revert selected +> gate passed `13/13`. Do not retry this narrow fused-quant route; the next +> fused-MoE attempt must remove a larger boundary, such as route-once metadata +> shared by both expert GEMMs plus fused GEMM1/activation/GEMM2 or +> weighted-combine/scatter. +> +> 2026-07-02 Phase117 update: keep the default-off MoE boundary trace as +> diagnostic instrumentation only. Artifact: +> `/home/mudler/bench/phase117_moe_route_once_boundary/20260702_024140`. +> The trace decomposes `MOE_SWIGLU_DOWN` into route-sort, activation +> quantization, grouped-MMQ launch, GLU, and graph-pattern records under +> `LLAMA_MOE_BOUNDARY_TRACE=1`; optional timing is gated by +> `LLAMA_MOE_BOUNDARY_TIMING=1`. Inline CUDA event timing initially aborted +> under CUDA graph capture, so the guarded trace emits `us=-1` while capturing +> and only produces real event timings with `GGML_CUDA_DISABLE_GRAPHS=1`. +> Post-guard selected gates passed (`13/13`), trace mode passed (`7/7`), and +> canonical gates passed: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, +> `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. The timing attribution does not +> fund another local route-sort, tile, GLU, or activation-quant shortcut. The +> next MoE source phase should own a larger pipeline boundary: shared +> route-once metadata across gate_up/down and/or whole-pattern +> GEMM1->activation->GEMM2 execution. +> +> 2026-07-02 Phase118 update: reject standalone route-metadata caching. +> Artifact: +> `/home/mudler/bench/phase118_moe_route_cache/20260702_030549`. The +> default-off candidate `LLAMA_MOE_ROUTE_CACHE=1` stored ids-derived grouped-MMQ +> route metadata in context-owned buffers and reused it within a graph +> evaluation. It was correctness-clean (`13/13` default, opt-in, and +> post-reject) and the trace showed reuse (`23` hits, `3` misses on +> `MOE_SWIGLU_DOWN n=128`), but perf was too small: `MOE_SWIGLU_DOWN n=257` +> improved only `1017.711 -> 1011.915 us` (`+0.57%`) and `n=128` regressed +> `799.360 -> 803.738 us` (`-0.55%`). Runtime cache source was reverted; only a +> local `ggml_cuda_mmq_ids_meta` helper refactor remains as low-conflict +> groundwork. Do not retry metadata-cache-only work. The next attempt must own +> more of the vLLM-style pipeline: GEMM1->activation->GEMM2 and/or +> scatter/combine, not just skipping one `mm_ids_helper` launch. +> +> 2026-07-02 Phase119 update: keep the default-off whole-pattern contract trace +> after fix1. Initial artifact: +> `/home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_034729`; +> fix1 artifact: +> `/home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_035126_fix1`. +> The initial trace proved coverage but missed the overhead rule on +> `MOE_SWIGLU_DOWN n=257` (`1015.070 -> 1028.937 us`, `-1.35%`). Fix1 moved +> detector work off the default path unless `LLAMA_MOE_WHOLE_PATTERN_TRACE` or +> the existing boundary trace is enabled. Fix1 gates are green: selected +> `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` `13/13`, trace `MOE_SWIGLU_DOWN` +> `7/7`, canonical MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, +> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Trace overhead is now within +> rule (`MOE_SWIGLU_DOWN n=128` `805.400 -> 805.584 us`, `-0.02%`; +> `n=257` `1019.715 -> 1021.836 us`, `-0.21%`) and emits supported NVFP4 +> markers for both `n_tokens=128` and `257`. This is diagnostic scaffolding, +> not a runtime optimization. The next executor attempt should match at the +> earlier `gate_up MUL_MAT_ID` node and skip through `VIEW, VIEW, GLU, down +> MUL_MAT_ID`; the current `GLU -> down` hook is validation-only because GEMM1 +> has already executed. +> +> 2026-07-02 Phase120 update: keep the default-off early whole-pattern matcher +> after fix2. Initial artifact: +> `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040153`; +> fix2 artifact: +> `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040725_fix2`. +> The initial/fix1 versions proved `skip_ready=4` but emitted noisy unsupported +> markers from unrelated `MUL_MAT_ID` candidates. Fix2 emits only the actual +> early pattern and is clean: selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` +> `13/13`, early trace `MOE_SWIGLU_DOWN` `7/7`, canonical MoE md5 +> `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and +> `MUL_MAT_ID 806/806`. It emits exactly six supported early markers for the +> perf sentinels, covering `n_tokens=128` and `257`, with `skip_ready=4`, +> `ids_match=1`, and `swiglu=1`. Trace overhead is within rule +> (`MOE_SWIGLU_DOWN n=128` `803.937 -> 808.978 us`, `-0.62%`; +> `n=257` `1020.412 -> 1026.073 us`, `-0.55%`). The next source phase can now +> implement a guarded executor at this early matcher. First prove safe +> ownership/skip accounting for the five-node sequence, then move route-plan +> reuse and activation/down execution into the helper. +> +> 2026-07-02 Phase121 update: keep the default-off executor proof after fix1. +> Initial artifact: +> `/home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041543`; +> fix1 artifact: +> `/home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041739_fix1`. +> The initial run passed correctness but emitted zero exec markers because the +> exec branch was accidentally nested under the early-trace env condition. +> Fix1 makes `LLAMA_MOE_WHOLE_PATTERN_EXEC=1` engage independently. Gates are +> green: selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` `13/13`, exec +> `MOE_SWIGLU_DOWN` `7/7`, canonical MoE md5 `8cb0ce23`, dense md5 +> `5951a5b4`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Exec perf emits +> six `skip=4` markers covering `n_tokens=128` and `257`, and target perf is +> neutral (`MOE_SWIGLU_DOWN n=128` `807.772 -> 806.051 us`, `+0.21%`; +> `n=257` `1021.115 -> 1020.839 us`, `+0.03%`). This proves ownership and skip +> accounting only; it is not a fused-MoE speedup. The next source phase should +> replace one internal boundary inside this helper, preferably route-plan reuse +> or activation in route-slot order, with the same md5/op gates. +> +> 2026-07-02 Phase122 update: reject route-only metadata reuse inside the +> Phase121 executor. Artifact: +> `/home/mudler/bench/phase122_moe_shared_route_meta/20260702_043212`. +> The candidate exposed `ggml_cuda_mmq_ids_meta` as a public MMQ helper and +> used `LLAMA_MOE_WHOLE_PATTERN_SHARED_ROUTE=1` to build route metadata once +> for both `gate_up` and `down`. Correctness passed (`13/13` selected and +> `7/7` shared-route), but perf missed the keep gate: +> `MOE_SWIGLU_DOWN n=128` regressed `808.190 -> 811.836 us` and `n=257` +> regressed `1020.850 -> 1051.666 us` versus the Phase121 executor. Source was +> reverted, including the public metadata API and shared-route env. Post-reject +> gates on the reverted tree passed (`13/13` selected and `7/7` Phase121 exec) +> with six retained exec markers. Do not retry route-only metadata reuse. The +> next MoE executor scope should target activation/down data layout, direct +> activation-to-down input, or a larger GEMM1->activation->GEMM2 fused boundary. +> +> 2026-07-02 Phase123 update: reject standalone fused-down activation +> quantization inside the Phase121 executor. Artifact: +> `/home/mudler/bench/phase123_moe_executor_fused_down_input/20260702_025811`; +> red check: +> `/home/mudler/bench/phase123_moe_executor_fused_down_input/red_20260702_025031`. +> The candidate used `LLAMA_MOE_WHOLE_PATTERN_FUSED_DOWN=1` to run `gate_up`, +> compute `silu(gate) * up` directly into the sorted NVFP4 down MMQ activation +> buffer, and launch the existing down MMQ kernel. Correctness passed +> (`13/13` selected, `7/7` fused-down, six fused markers), but perf was flat: +> versus Phase121 exec, `MOE_SWIGLU_DOWN n=128` was +> `811.153 -> 810.618 us` and `n=257` was `1023.090 -> 1023.657 us`. +> Source was reverted, including the fused-down env, MMQ helper, and NVFP4 +> fused quant kernel. Post-reject gates passed (`13/13` selected, `7/7` +> Phase121 exec, six exec markers). Do not retry a single-boundary +> SwiGLU-to-down-quant shortcut; if continuing MoE source work, scope a full +> expert-major packed pipeline that owns `GEMM1->activation->GEMM2`, or pivot to +> another measured bottleneck. +> +> 2026-07-02 Phase124 update: current-stack graph-node serving was refreshed +> after the Phase122/123 rejections. Artifact: +> `/home/mudler/bench/phase124_current_moe_profile/20260702_031205`. +> Pre/post gates are green: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, +> `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. At `N=128`, prompt `128`, +> generation `64`, serving under graph-node profiling was +> `agg_tps 206.2`, `decode_agg_tps 320.3`, `prefill_tps 1536.4`, wall +> `39.738s`. Fine buckets are now `mmq_nvfp4 6074.78 ms` (`30.17%`) and +> `gdn_core 5888.31 ms` (`29.25%`), with `act_quant` only `674.88 ms` +> (`3.35%`). This explains why single-boundary activation/quant attempts were +> flat. The next source work must reduce one of the two dominant buckets: +> either a full expert-major MoE pipeline for `mmq_nvfp4`, or a default-off GDN +> decode/core experiment for `gdn_core`. Do not spend more GB10 time on +> route-only metadata reuse, fused-down quantization, or MoE tile-policy knobs +> unless a new profile makes those buckets material. +> +> 2026-07-02 Phase125 scoping update: two independent code explorers and a +> local GDN audit challenged the Phase124 fork in the road. The chosen next +> source attempt is the MoE side, but only as a first maintainable slice: +> implement a default-off MMQ sorted-output primitive behind +> `LLAMA_MOE_EXPERT_MAJOR_SORTED_OUT=1`, immediately unsort as a proof, and +> measure `MOE_SWIGLU_DOWN` before attempting the full +> `gate_up -> SWIGLU -> down` expert-major executor. Rationale: vLLM's portable +> advantage is keeping activations expert-major across both GEMMs and +> unpermuting once; Phase122/123 failed because they only touched route metadata +> or one activation boundary. Do not copy CUTLASS/FlashInfer pointer-array, TMA, +> or FP4 scale-swizzle internals. A small GDN patch is not funded by current +> evidence because previous decode/core micro-attempts already rejected the +> obvious geometry/store/broadcast/conv-state shortcuts. Plan: +> `docs/superpowers/plans/2026-07-02-moe-expert-major-sorted-output-phase125.md`. +> +> 2026-07-02 Phase125 result: reject the MMQ sorted-output plus immediate +> unsort proof. Artifact: +> `/home/mudler/bench/phase125_moe_expert_major_sorted_output/20260702_033931`; +> post-reject: +> `/home/mudler/bench/phase125_moe_expert_major_sorted_output/post_reject_20260702_034232`. +> The candidate was default-off and correctness-clean (`13/13` default +> selected, `7/7` opt-in `MOE_SWIGLU_DOWN`, 12 sorted markers), but perf failed +> decisively: versus Phase121 exec, `MOE_SWIGLU_DOWN n=128` regressed +> `805.16 -> 888.76 us` and `n=257` regressed `1023.83 -> 1192.05 us`. +> Source was reverted. Post-reject gates are green: selected `13/13`, Phase121 +> exec `7/7` with six markers, MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, +> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Do not retry a path that adds +> a sorted-output temporary and immediately unsorts. A future expert-major MoE +> attempt must keep sorted activations through the down GEMM and unpermute only +> once after the full FFN, or pivot to a larger GDN recurrence design. + +> 2026-07-02 Phase126 result: keep the grouped-MMQ presorted helper scaffold. +> The patch only touches `mmq.cu`/`mmq.cuh`, refactors the current MoE id path +> behind explicit `ids_src1`/`ids_dst`/`expert_bounds` metadata, and exposes a +> `src1_sorted` entry point for the future whole-MoE executor. Fixed artifact: +> `/home/mudler/bench/phase126_mmq_presorted_helper/fix1_20260702_040858`. +> Gates were green: selected `13/13`, MoE md5 +> `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +> `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`, +> `MUL_MAT_ID 806/806`. Focused perf was neutral: +> `MOE_SWIGLU_DOWN n=128 805.99 us`, `MUL_MAT_ID_RAGGED_MOE n=128 +> 1243.85 us`, `MOE_SWIGLU_DOWN n=257 1018.74 us`, +> `MUL_MAT_ID_RAGGED_MOE n=257 1452.84 us`. This is not a parity win by +> itself; it is the dependency for Phase127 to keep `gate_up -> SWIGLU -> down` +> in expert-major order and unpermute only once after the full FFN. + +> 2026-07-02 Phase127 result: reject and revert the whole-MoE expert-major +> executor built on the Phase126 helper. Red: +> `/home/mudler/bench/phase127_moe_whole_expert_major/red_20260702_042125` +> passed by fallback with zero markers. Candidate green: +> `/home/mudler/bench/phase127_moe_whole_expert_major/green2_20260702_042916` +> passed default selected `13/13` and opt-in `MOE_SWIGLU_DOWN 7/7`, emitting +> six `LLAMA_MOE_WHOLE_EXPERT_MAJOR` markers after fixing the down-weight shape +> interpretation (`down_w` is `[n_ff, n_embd, experts]`). Perf artifact: +> `/home/mudler/bench/phase127_moe_whole_expert_major/perf_20260702_043104`. +> It failed the keep rule: `MOE_SWIGLU_DOWN n=128` regressed +> `802.57 -> 812.14 us`; `n=257` regressed `1023.25 -> 1039.36 us`; +> ragged standalone was essentially flat. Source was reverted. Post-reject: +> `/home/mudler/bench/phase127_moe_whole_expert_major/post_reject_20260702_043318` +> passed selected `13/13`, MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, +> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Do not retry the same +> fake-tensor whole-executor shape; the next MoE attempt must remove more +> temporary traffic or become a real fused grouped MMQ/SWIGLU/down path. A +> separate alternative is the previously scoped Qwen3Next BF16 GDN S-cache +> experiment, but that needs non-md5 numerical gates. + +> 2026-07-02 Phase128 result: reject/revert the Qwen3Next BF16 GDN S-cache +> selector probe for the current target. Artifact: +> `/home/mudler/bench/phase128_qwen3next_gdn_bf16_s_cache/default_20260702_043939` +> built and passed default gates (`GATED_DELTA_NET 48/48`, canonical MoE md5 +> `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT`, `MUL_MAT_ID`). Verbose smoke +> artifact: +> `/home/mudler/bench/phase128_qwen3next_gdn_bf16_s_cache/smoke3_20260702_044434` +> showed the active decision model is `qwen35moe`, not Qwen3Next, and S cache +> remained `f32` under `LLAMA_QWEN3NEXT_GDN_S_CACHE_TYPE=bf16`. No true +> Qwen3Next GGUF was found on DGX. The relevant Qwen35/Qwen35MoE BF16 S-cache +> lever was already Phase81/82: it cut `gdn_core` but changed MoE md5 and +> missed the full f16-reference KL acceptance band. Do not retry this exact +> lever unless the quality gate is explicitly re-scoped or a real Qwen3Next +> model artifact is available. + +> 2026-07-02 Phase129 result: reject/revert the Qwen35/Qwen35MoE grouped Q/K +> broadcast probe for fused GDN. Plan: +> `docs/superpowers/plans/2026-07-02-qwen35-gdn-qk-grouped-bcast-phase129.md`. +> The candidate added a default-off `LLAMA_QWEN35_GDN_QK_BCAST=1` branch in +> `src/models/qwen35.cpp` and `src/models/qwen35moe.cpp`, reusing the existing +> Qwen3Next `ggml_gated_delta_net_set_bcast()` path. Default gates were green: +> `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/default_20260702_065445` +> passed MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 46/46`, +> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. A standalone opt-in gate +> artifact at `optin_20260702_065604` was invalid because +> `paged-inference-gates.sh` only passes completion env through `EXTRA_ENV`. +> The valid opt-in pre-gate from +> `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/decode_optin_20260702_070149/gate_pre` +> changed MoE md5 to `b773e2f032aa0e992626d486b321808e`, so profiling was +> stopped and the source was reverted. Post-reject: +> `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/post_reject_20260702_070258` +> passed canonical MoE/dense md5, `GATED_DELTA_NET 46/46`, `MUL_MAT 1146/1146`, +> and `MUL_MAT_ID 806/806`; rebuilt `libllama.so` has zero +> `LLAMA_QWEN35_GDN_QK_BCAST` strings. Do not retry this Qwen3Next +> grouped-broadcast port for Qwen35/Qwen35MoE under the current bit-exact md5 +> rule. + +> 2026-07-02 Phase130 result: current-stack graph-node serving profile refresh, +> measurement-only. Artifact: +> `/home/mudler/bench/phase130_current_stack_profile/20260702_070949`. Shape: +> MoE `q36-35b-a3b-nvfp4`, `N=128`, prompt `128`, generation `64`, +> `PARALLEL=128`, `CTX=131072`. Pre/post gates passed canonical MoE md5 +> `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and +> `MUL_MAT_ID 806/806`. Serving metrics: `agg_tps 208.0`, +> `decode_agg_tps 326.9`, `prefill_tps 1519.6`, `TTFT mean 8170.6 ms`, wall +> `39.38 s`, total kernel time `20.1559 s`. The profile confirms the live +> bottleneck remains split between `mmq_nvfp4 6009.52 ms` (`29.82%`) and +> `gdn_core 5891.40 ms` (`29.23%`). FA/mask cleanup is not funded: +> `get_rows 280.62 ms` (`1.39%`) and `fa 257.38 ms` (`1.28%`). The next source +> attempt must target a larger MoE/FFN-GEMM executor/kernel or a materially +> different GDN recurrent-state/packed-decode design, not another paged-mask, +> route-only, activation-only, grouped-broadcast, BF16-cache, or launch-geometry +> shortcut. + +> 2026-07-02 Phase131 result: source-selection challenge, no source changes. +> Plan: +> `docs/superpowers/plans/2026-07-02-fused-routed-ffn-phase131.md`. Two +> read-only explorers challenged the Phase130 fork. MoE/FFN-GEMM source work is +> not funded unless it becomes a real fused routed-FFN kernel/executor; another +> route-only, activation-only, W4A16, tile-policy, sorted-output, or fake +> executor patch is expected to repeat Phases 110-127. GDN source work is not +> funded unless it materially reduces f32 recurrent-state traffic without +> BF16/quality drift; launch geometry, gather/identity, producer/store fusion, +> BF16 S-cache, and grouped Q/K broadcast have already failed or changed md5s. +> The next active line is to audit vLLM's fused MoE design and llama.cpp's +> current whole-pattern executor hook for a default-off fused routed-FFN PoC. +> If that audit does not produce a concrete low-conflict hook, require a +> standalone CUDA PoC before touching llama.cpp source. +> +> 2026-07-02 Phase132 result: keep the new default-off routed-FFN PoC scaffold. +> Plan: +> `docs/superpowers/plans/2026-07-02-routed-ffn-poc-phase132.md`. Artifact: +> `/home/mudler/bench/phase132_routed_ffn_poc/20260702_072725`. Source adds +> `ggml/src/ggml-cuda/moe-ffn.cu/.cuh` and a narrow hook in +> `ggml/src/ggml-cuda/ggml-cuda.cu` behind `LLAMA_MOE_ROUTED_FFN_POC=1`. +> The helper currently executes the baseline `gate_up -> SWIGLU -> down` +> sequence through the existing whole-pattern hook, so it is a scaffold, not a +> parity speedup. Initial incremental build failed until CMake was reconfigured +> to pick up the new globbed CUDA source; after `cmake -S . -B build`, build +> passed. Selected default and opt-in gates passed +> `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13`; opt-in emitted six exec +> markers and `libggml-cuda.so` contains one `LLAMA_MOE_ROUTED_FFN_POC` string. +> Default and opt-in canonical gates passed MoE md5 `8cb0ce23`, dense md5 +> `5951a5b4`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, and +> `MUL_MAT_ID 806/806`. Focused perf was neutral (`808.32 -> 804.87 us` at +> n=128, `1023.36 -> 1022.71 us` at n=257). Next phase may replace the helper +> internals with a real fused routed-FFN slice; do not claim Phase132 itself as +> a speedup. +> +> 2026-07-02 Phase133 result: keep only as a default-off structural base, not a +> speedup. Plan: +> `docs/superpowers/plans/2026-07-02-routed-ffn-sorted-down-phase133.md`. +> Artifact: +> `/home/mudler/bench/phase133_routed_ffn_sorted_down/20260702_074651`. +> Source exposes `ggml_cuda_mmq_ids_meta`, adds raw +> `ggml_cuda_mul_mat_q_moe_sorted_f32(...)`, and adds +> `LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1` on top of +> `LLAMA_MOE_ROUTED_FFN_POC=1`. The path executes baseline `gate_up` and +> `SWIGLU`, gathers the SWIGLU output into compact expert-sorted F32 rows, then +> calls raw MMQ down without fake tensors. Selected default, Phase132, and +> Phase133 gates passed `13/13`; Phase133 trace proved six +> `mmq_moe_sorted_raw` launches. Default and Phase133 canonical gates passed +> MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 48/48`, +> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Perf was not a win: +> default `807.37/1020.76 us`, Phase132 `808.21/1018.87 us`, Phase133 +> `808.85/1026.87 us` for `n=128/257`. Next phase must fuse SWIGLU-to-sorted +> or SWIGLU-to-quant to remove this added gather/quant boundary; do not promote +> sorted-down as-is. +> +> 2026-07-02 Phase134 result: keep only as default-off fused-SWIGLU structural +> base, not a speedup. Plan: +> `docs/superpowers/plans/2026-07-02-routed-ffn-fused-swiglu-phase134.md`. +> Artifact: +> `/home/mudler/bench/phase134_routed_ffn_fused_swiglu/20260702_075828`. +> Source adds `LLAMA_MOE_ROUTED_FFN_FUSED_SWIGLU=1` on top of +> `LLAMA_MOE_ROUTED_FFN_POC=1`, passes `gate/up` views into the routed-FFN +> helper, computes `silu(gate) * up` directly into expert-sorted F32 rows, and +> calls the raw sorted-F32 down MMQ helper. The fused flag now implies the +> sorted-down path; `LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1` is not required. +> Selected opt-in gates passed `13/13`; trace proved six `mmq_moe_sorted_raw` +> launches; canonical opt-in gates passed MoE md5 `8cb0ce23`, dense md5 +> `5951a5b4`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, and +> `MUL_MAT_ID 806/806`. Perf is mixed: default `804.92/1026.02 us`, Phase132 +> `808.00/1028.43 us`, Phase133 `808.07/1029.02 us`, Phase134 +> `810.61/1025.68 us` for `n=128/257`. It recovers n=257 but regresses n=128; +> next work must fuse SWIGLU directly into down-MMQ quant or remove another +> launch/buffer before this becomes a parity lever. +> +> 2026-07-02 Phase135 result: keep as current best default-off routed-FFN base, +> but not parity. Plan: +> `docs/superpowers/plans/2026-07-02-routed-ffn-fused-quant-phase135.md`. +> Focused artifact: +> `/home/mudler/bench/phase135_routed_ffn_fused_quant/20260702_081723`. +> Serving artifact: +> `/home/mudler/bench/phase135_routed_ffn_fused_quant_serving/20260702_082102`. +> Source adds `LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1` on top of +> `LLAMA_MOE_ROUTED_FFN_POC=1`, computes `silu(gate) * up` directly into the +> NVFP4 MMQ activation layout, and launches raw down MMQ via +> `ggml_cuda_mul_mat_q_moe_quantized(...)`. Focused selected gates passed +> `13/13`; trace proved six `mmq_moe_quantized_raw` launches and zero +> `mmq_moe_sorted_raw` launches; canonical focused gates passed MoE md5 +> `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 48/48`, +> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Focused perf: +> default `805.92/1031.06 us`, Phase134 `807.65/1027.51 us`, Phase135 +> `807.92/1024.97 us` for `n=128/257`. Serving at the Phase130 shape passed +> pre/post gates and improved decode aggregate t/s `326.9 -> 332.7`, while +> `mmq_nvfp4` dropped `6009.52 -> 5915.24 ms`; aggregate stayed `208.0`, prefill +> worsened `1519.6 -> 1475.1`, and total kernel time rose slightly +> `20.1559 -> 20.2498 s`. Next work should target remaining MoE overhead after +> fused quant (`mmq_fixup`, route/writeback, weighted combine), not another F32 +> intermediate. +> +> 2026-07-02 Phase136 result: reject and revert the separate post-down +> weighted-combine fuse. Plan: +> `docs/superpowers/plans/2026-07-02-routed-ffn-combine-phase136.md`. +> Focused artifact: +> `/home/mudler/bench/phase136_routed_ffn_combine/20260702_083727`. +> Serving artifact: +> `/home/mudler/bench/phase136_routed_ffn_combine_serving/20260702_085749`. +> The candidate added `LLAMA_MOE_ROUTED_FFN_COMBINE=1` on top of Phase135 and +> skipped the post-down `MUL(weights) -> VIEW* -> ADD*` tail with a separate +> F32 weighted-combine kernel. It was correctness-clean: expanded selected +> gates passed `20/20`, trace proved six combine markers plus six +> `mmq_moe_quantized_raw` launches and zero sorted launches, canonical gates +> passed MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 46/46`, +> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Focused full-tail perf +> improved (`MOE_SWIGLU_COMBINE n=257` `428.53 -> 401.81 us` versus Phase135), +> but serving regressed versus Phase135: aggregate/decode t/s +> `208.0/332.7 -> 206.5/323.2`. Source and the sentinel test were reverted; +> post-reject Phase135 selected gates passed `13/13`. Do not retry a standalone +> post-MMQ combine launch as the next parity lever; any combine/finalize work +> needs a larger serving-visible fused writeback/finalize design. +> +> 2026-07-02 Phase137 result: reject the GDN launch-geometry retune with no +> source changes. Plan: +> `docs/superpowers/plans/2026-07-02-gdn-geometry-sweep-phase137.md`. +> Focused artifact: +> `/home/mudler/bench/phase137_gdn_geometry_sweep/20260702_091441`. +> Serving artifact: +> `/home/mudler/bench/phase137_gdn_geometry_serving/20260702_091740`. +> The env-only sweep tested existing `GDN_NW`/`GDN_CPW` knobs. The best focused +> candidate, `GDN_NW=4 GDN_CPW=1`, improved 1-token GDN rows +> (`hc=32,hs=128,kda=0` `6.793748 -> 4.713682 us`, KDA +> `7.790557 -> 5.194275 us`, grouped broadcast `5.967364 -> 3.407998 us`), +> but real serving regressed versus Phase135 despite clean pre/post gates: +> MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and +> `MUL_MAT_ID 806/806`. Aggregate/decode t/s moved +> `208.0/332.7 -> 206.2/324.9`, total kernel time rose +> `20.2498 -> 20.7530 s`, and `gdn_core` worsened +> `5926.55 -> 6466.27 ms`. Do not promote or source-code a GDN geometry retune +> for this target. The next scoped source line is default-off MoE +> finalize/writeback inside the existing down-MMQ path, not a standalone +> post-MMQ combine launch. +> +> 2026-07-02 Phase138 attempt 1 update: keep the default-off finalize trace and +> full-tail sentinel scaffold; no runtime speedup claim yet. Plan: +> `docs/superpowers/plans/2026-07-02-moe-down-mmq-finalize-phase138.md`. +> Artifacts: +> `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_092943` +> (`MOE_SWIGLU_DOWN` trace-only), +> `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093617_full_tail` +> (new full-tail sentinel), and +> `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093731_canonical` +> (canonical gates). The old `MOE_SWIGLU_DOWN` sentinel emitted six early +> routed-FFN records but no weighted tail. The new `MOE_SWIGLU_FINALIZE` +> sentinel passed default and Phase135-opt-in correctness (`7/7` each) and +> emitted six supported tail records with `tail_nodes=16`, `views=8`, and +> `adds=7`. Canonical patched-Phase93 gates passed MoE md5 `8cb0ce23`, dense +> md5 `5951a5b4`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Next work may +> implement default-off down-MMQ finalize/writeback against this sentinel first; +> keep serving promotion gated by Phase135 decode/aggregate/kernel-time +> thresholds. +> +> 2026-07-02 Phase138 attempt 2 update: keep the default-off down-MMQ +> finalize/writeback candidate as a narrow positive, but do not promote it or +> call parity. Plan: +> `docs/superpowers/plans/2026-07-02-moe-down-mmq-finalize-phase138.md`. +> Focused artifact: +> `/home/mudler/bench/phase138_moe_down_mmq_finalize/20260702_095927_focused`; +> canonical gates: +> `/home/mudler/bench/phase138_moe_down_mmq_finalize/20260702_100202_canonical`; +> serving: +> `/home/mudler/bench/phase138_moe_down_mmq_finalize_serving/20260702_100330`. +> The candidate adds `LLAMA_MOE_ROUTED_FFN_FINALIZE_POC=1` on top of Phase135, +> zeroes the final output, atomically accumulates `down_sum * router_weight` +> from the down-MMQ path, and skips the strict weighted tail only after the +> finalize helper is selected. Focused `MOE_SWIGLU_FINALIZE` correctness passed +> for default, Phase135, and Phase138 (`7/7` each); canonical and serving +> pre/post gates passed MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, +> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Serving versus Phase135 moved +> aggregate/decode t/s `208.0/332.7 -> 209.3/333.5`, total kernel time +> `20.2498 -> 20.0489 s`, and `mmq_nvfp4 5915.24 -> 5802.87 ms`; however +> `ew_add` remains visible at `374.09 ms`, so this is only an incremental +> default-off improvement. Next work should reduce the remaining fan-in/writeback +> path more deeply or return to the dominant `gdn_core`/`mmq_nvfp4` buckets. +> +> 2026-07-02 Phase139 result: serving noise-floor repeat rejects treating the +> Phase138 one-off serving gain as source-funding evidence. Spec: +> `docs/superpowers/specs/2026-07-02-serving-noise-floor-phase139-design.md`. +> Plan: +> `docs/superpowers/plans/2026-07-02-serving-noise-floor-phase139.md`. +> Artifact: +> `/home/mudler/bench/phase139_serving_noise_floor/20260702_081901`. +> Seven identical current-binary Phase138 serving/profile runs all passed +> pre/post gates: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, +> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. The runtime variance was much +> larger than Phase138's one-off delta: aggregate throughput median +> `208.5 t/s`, stdev `2.8022`, CV `1.349%`, range `203.4..212.3`; wall CV +> `1.347%`; `mmq_nvfp4` CV `3.351%`. Keep Phase138 default-off as +> correctness-clean and focused-positive, but do not stack another +> finalize/MMQ micro-patch from serving evidence alone. Future serving claims +> need repeated A/B medians and must exceed `max(2.0%, 3 * same-binary stdev)`. +> The next source phase should pivot to a larger measured bucket, with GDN +> packed decode/prep now more defensible than another MoE finalize shortcut. +> +> 2026-07-02 Phase140 result: reject an immediate in-GDN Q/K +> L2-normalization patch. Spec: +> `docs/superpowers/specs/2026-07-02-gdn-decode-prep-trace-phase140-design.md`. +> Plan: +> `docs/superpowers/plans/2026-07-02-gdn-decode-prep-trace-phase140.md`. +> Artifact: +> `/home/mudler/bench/phase140_gdn_decode_prep_trace/20260702_085348`. +> The current Phase138 opt-in serving/profile shape passed pre/post gates: +> MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and +> `MUL_MAT_ID 806/806`. Serving/profile reported aggregate/decode throughput +> `207.3/328.9 t/s`, wall `39.501 s`, total kernel `20.2002 s`, `GDN +> 6673.66 ms`, `gdn_core 5890.44 ms`, and `gdn_l2norm 100.30 ms`. The focused +> SQLite summary had `l2_norm_f32 100.3024 ms` versus +> `gated_delta_net_cuda 5804.7074 ms`. This is above the absolute +> three-sigma floor from Phase139 (`53.433 ms`) but below the planned `3%` of +> GDN-core materiality threshold at about `1.7%`, so prep-only L2 fusion is not +> source-funded. Next GDN work should be recurrence-level, packed-state, or +> datacenter-Blackwell-specific, not another prep micro-fusion. +> +> 2026-07-02 Phase141 result: decode-only GDN source claims must normalize by +> launch count or tightly control the capture window. Spec: +> `docs/superpowers/specs/2026-07-02-gdn-decode-noise-floor-phase141-design.md`. +> Plan: +> `docs/superpowers/plans/2026-07-02-gdn-decode-noise-floor-phase141.md`. +> Artifact: +> `/home/mudler/bench/phase141_gdn_decode_noise_floor/20260702_090428`. +> Five identical current-binary decode-only captures all passed pre/post gates: +> MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and +> `MUL_MAT_ID 806/806`. Raw `gdn_core_ms` median/stdev/CV was +> `1415.500/30.641/2.146%`, range `1410.300..1482.140 ms`, but launch counts +> drifted (`597`, `598`, `600`, `630`). Normalized `gdn_core_ms_per_launch` +> was stable: median/stdev/CV `2.359167/0.005399/0.229%`. Future GDN A/B +> source claims need repeated medians and must beat either `6.49%` raw +> `gdn_core` reduction or `2.0%` launch-normalized reduction. The small +> default-off source follow-up now worth testing is scalar gate/beta hoisting +> inside `gated_delta_net_cuda`; vLLM-style packed decode recurrence remains a +> larger redesign. Audience: an agent with **zero prior context** who has been told to "continue the GB10 vLLM-parity investigation" on the `llama-cpp-localai-paged` backend. @@ -20,6 +589,69 @@ Read order for a cold start: ## 1. TL;DR STATE +> 2026-07-01 Phase104-108 update: the current carried source line is still the +> Phase93 Qwen3Next grouped Q/K broadcast plus the Phase101/102 default-off +> cleanup candidates. Phase104/106 same-session serving showed the stack is +> md5/op clean but still far from vLLM: at `N=128`, paged/vLLM was about +> `0.66` on decode and `0.50-0.51` on aggregate; at `N=192/256`, vLLM remained +> faster and TTFT stayed about `3x` lower. Phase105 refreshed the grouped-MMQ +> trace and found no new host-side tile-policy lever. Phase107 proved the MoE +> structural correctness gates exist (`MOE_SWIGLU_DOWN 7/7`, +> `MOE_WEIGHTED_COMBINE 7/7`, `MUL_MAT_ID_RAGGED_MOE 6/6`) but also proved +> `test-backend-ops perf` did not time those custom whole-graph cases. Phase108 +> fixed that measurement gap in `tests/test-backend-ops.cpp`: perf mode now +> includes those MoE cases at `n_tokens=128,257`, and CSV output includes +> `time_us`, `flops`, `memory_kb`, and `n_runs`. The Phase108 artifact is +> `/home/mudler/bench/phase108_moe_perf_csv/20260701_221559`; md5s and compact +> op gates are green. Use Phase108 rows as the baseline for any fused routed-MoE +> implementation. Current ranking: `MUL_MAT_ID_RAGGED_MOE` is `1239-1446 us/run`, +> `MOE_SWIGLU_DOWN` is `802-1020 us/run`, and `MOE_WEIGHTED_COMBINE` is only +> `28-68 us/run`, so do not spend the next patch on weighted-combine fusion +> alone. +> Phase109 then tested existing env-gated routes on the Phase108 rows: +> `LLAMA_W4A16_PREFILL_M=128`, `LLAMA_FP4_PREFILL_M=128`, +> `LLAMA_MOE_DENSITY_MAX=9`, and `LLAMA_MOE_MMQ_X=64` +> (`/home/mudler/bench/phase109_existing_moe_prefill_ab/20260701_222559`). +> All selected correctness gates passed (`13/13` per env), but W4A16 and FP4 +> large-M regressed the 257-token rows badly, and density/tile retuning was +> noise-level on `MUL_MAT_ID_RAGGED_MOE` while not helping `MOE_SWIGLU_DOWN`. +> Do not spend another phase on MMQ tile-policy shortcuts. The next credible +> implementation is structural: port the vLLM-style idea of GPU-side +> token/expert routing metadata (`sorted_token_ids`, expert offsets/bounds, +> inverse permutation) into llama.cpp's `mul_mat_id` host-sync fallback/grouped +> W4A16 path, while leaving the graph-safe grouped-MMQ path untouched. +> Phase110 implemented the first slice of that structural path as default-off +> `LLAMA_MOE_GPU_SORT=1` in `ggml_cuda_mul_mat_id`, reusing the existing +> `mm_ids_helper` GPU sort for fallback metadata. The initial branch failed +> `3/13` selected opt-in rows because `mm_ids_helper` returns sorted-to-original +> `ids_dst`, while fallback `get_rows_cuda()` needs original-to-sorted +> `ids_from_sorted`; adding a tiny inverse-permutation kernel fixed correctness. +> Accepted artifact: +> `/home/mudler/bench/phase110_gpu_moe_sort/20260701_224446_fix1`. Gates are +> green: canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +> `5951a5b4d624ce891e22ab5fca9bc439`, and supported compact ops +> `SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`, `GET_ROWS 49/49`, +> `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806` for both +> default and `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`. Perf decision: +> keep as a default-off structural base only. It improves W4A16 fallback +> 257-token rows by `7.2%` (`MOE_SWIGLU_DOWN`) and `7.9%` +> (`MUL_MAT_ID_RAGGED_MOE`), but the opt-in fallback is still about `1.5x` +> slower than default grouped-MMQ. Phase111 must remove another fallback +> bottleneck, such as the remaining `expert_bounds` host copy / host tile +> descriptor build, before this line can matter for parity. +> Phase111 tested that narrow follow-up as default-off `LLAMA_W4A16_GPU_TILES=1`: +> W4A16 tile descriptors were built on GPU from `expert_bounds_dev` with an +> atomic tile counter. It was correctness-clean after fixing a pointer mutability +> compile error and a CUDA pool LIFO allocation bug, but clean perf was +> flat-to-negative (`MUL_MAT_ID_RAGGED_MOE n=257` regressed about `2.0%` versus +> Phase110 GPU-sort). Artifact: +> `/home/mudler/bench/phase111_w4a16_gpu_tiles/20260701_230400_fix1`. The +> Phase111 source was reverted, and post-revert W4A16+GPU-sort selected gates +> passed `13/13`. Do not reopen a standalone GPU tile descriptor cleanup; the +> next W4A16 attempt must remove a larger boundary, such as direct activation +> consumption plus GPU descriptors together, or bypass the host-sync fallback +> path entirely. +> > 2026-07-01 active update: Phase50-59 reopened the dense and MoE serving > scheduler question. > True dense decode is much closer to vLLM (`383.66` vs `435.00` t/s, `88.2%`) @@ -46,7 +678,7 @@ Read order for a cold start: > are local and DGX-gated but not pushed, so the LocalAI patch series has not > been regenerated. > -> 2026-07-01 Phase81 update: the next viable GDN lever is no longer launch +> 2026-07-01 Phase81-85 update: the next viable GDN lever is no longer launch > shape or gather removal. A default-off Qwen35/Qwen35MoE BF16 persistent > recurrent S-cache experiment (`LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16`) cut > same-source decode-only `gdn_core` from `1399.30 ms / 599 launches` @@ -54,9 +686,257 @@ Read order for a cold start: > F32 md5 gates and op gates stayed green, and BF16 dense md5 stayed canonical, > but BF16 MoE md5 changed to `07db32c2bcb78d17a43ed18bc22705cd`. A quick > MoE KL smoke vs the same-source F32 base showed KLD `0.055499 +/- 0.001705`, -> same-top-p `88.361%`, and PPL ratio `1.010356`. Treat this as a promising -> default-off candidate only. Phase82 must run the full f16-reference KL gate -> plus serving A/B before regenerating LocalAI patches or considering promotion. +> same-top-p `88.361%`, and PPL ratio `1.010356`. Phase82 then ran the full MoE +> f16-reference gate at +> `/home/mudler/bench/phase82_bf16_s_cache_f16_kl/20260701_183016`: same-source +> F32 measured KLD `0.136563 +/- 0.003242`, while BF16 S-cache measured +> `0.137162 +/- 0.003456` against the documented paged acceptance reference +> `0.136000 +/- 0.003285`. Reject promotion and do not run serving A/B for this +> candidate under the current hard KL rule. Phase83 then tested a bit-exact +> KDA `expf(g)` register-cache shortcut in the GDN CUDA core. Md5 and op gates +> stayed green, but same-window decode-only `gdn_core` moved +> `1399.46 -> 1405.62 ms`, so reject that micro-optimization too. Phase84 +> reduced in-place GDN op outputs to attention-only tensors and moved the CPU +> ids fallback scratch to workspace; md5/op gates stayed green and startup free +> CUDA memory improved `117472 -> 117855 MiB`, but same-window decode-only +> `gdn_core` moved `1399.72 -> 1407.38 ms`. Treat Phase84 as a possible +> memory-footprint cleanup only, not a speed parity lever. Phase85 added a +> graph-reuse-safe identity-contiguous recurrent-state fast path: it calls +> `ggml_gated_delta_net_inplace` on a direct state view when `s_copy_main` is +> identity, otherwise keeps the ids path. Md5/op gates stayed green, the +> `gdn_gather` fine bucket disappeared, GDN macro launches dropped +> `3600 -> 2980`, and same-window `gdn_core` moved `1412.33 -> 1400.34 ms`. +> Carry Phase85 only as a small cleanup candidate. Phase86 audited the producer +> fusion idea against the Phase85 node-traced profile before coding it: the +> whole `act/GDN-gate(shared)` macro is only `13.57 ms` of `3.6622 s`, beta +> sigmoid is `2.73 ms`, and CUDA already fuses `UNARY + MUL` for softplus, +> sigmoid, and SILU. Reject producer-only fusion as too small. Phase87 then +> exposed an env-gated `GDN_NW=4 GDN_CPW=8` decode geometry probe to test a +> vLLM-like `BV=32` tile shape. It was md5/op green, but same-source +> decode-only `gdn_core` regressed `1390.56 -> 1417.13 ms`, so the source line +> was reverted. Phase88 tried a first default-off `GDN_DECODE_PACK2=1` packed +> decode CTA kernel. It built and CUDA op tests stayed green, but canonical md5 +> failed for both MoE (`320b5ed...` vs `8cb0ce...`) and dense (`6a65e9...` vs +> `5951a5...`), with visible output corruption, so it was reverted without +> profiling. Phase89 tried to add that focused guardrail through +> `test_gated_delta_net_inplace_ids`, but selecting that test class directly +> already fails the pre-existing BF16 cases on CUDA, so the naive test addition +> was also reverted. Phase90 fixed the fixture root cause for identity ids by +> mirroring `state` into `state_dst` during initialization and added F32 +> `S_v=128`, `n_seqs=2` cases that return `concat(out,state_dst)`, so the +> backend comparator now checks both attention output and the side-effect state +> write. DGX CUDA selected-op gate is green (`4/4`). Use this Phase90 guardrail +> before any new packed-decode kernel, then still run canonical md5/op gates. +> Phase91 retried the default-off `GDN_DECODE_PACK2=1` CTA sequence-packing +> kernel under that guardrail. The first `n_seqs=2` guardrail passed but MoE md5 +> failed for the single-sequence completion gate, exposing an uncovered odd/single +> sequence PDL hazard. Moving inactive lanes past `ggml_cuda_pdl_sync()` and +> adding `n_seqs=1,3` guardrail cases made the candidate md5/op clean +> (`GATED_DELTA_NET 46/46`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), but +> decode-only `gdn_core` regressed to `1425.44 ms`, so the runtime patch was +> reverted. Keep the expanded guardrail; do not retry CTA-level sequence packing +> unless it also reduces per-sequence GDN work. ids gather, producer overhead, +> simple geometry changes, and ungated packed kernels are not acceptable parity +> paths. Phase92 tried the next smallest scalar one-token recurrence +> micro-optimization: a default-off `GDN_SCALAR_DECODE_STORE_FUSED=1` CUDA path +> that stores final state inside the scalar update loop and skips the final +> post-token register-store loop. It passed local CPU guardrail, DGX CUDA +> guardrail, canonical md5s, `GATED_DELTA_NET 46/46`, `MUL_MAT 1146/1146`, and +> `MUL_MAT_ID 806/806`, but decode-only `gdn_core` regressed further to +> `1529.72 ms` (`/home/mudler/bench/phase92_gdn_scalar_store_fused/20260701_204718/decode_profile`), +> so the runtime patch was reverted. Do not retry store-fusing without evidence +> that the final state store loop is independently dominant. The next credible +> scoped ideas from the vLLM audit are the larger packed decode contract and the +> Qwen3Next GQA-repeat removal, each as a separate guarded phase. Phase93 +> implemented the Qwen3Next GQA-repeat removal as an explicit grouped Q/K +> broadcast mode on `GGML_OP_GATED_DELTA_NET` (`op_params[2]`), preserving the +> existing modulo/tiled broadcast for Qwen35 while allowing Qwen3Next to map +> `qk_head = value_head / (H_v / H_k)` and skip materializing repeated q/k heads +> when the GDN op path is active. Local CPU `GATED_DELTA_NET` passed `48/48`, +> local CPU in-place ids passed `6/6`, DGX CUDA `GATED_DELTA_NET` passed `48/48`, +> DGX CUDA in-place ids passed `6/6`, canonical md5/op gates passed +> (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), and +> decode-only `gdn_core` improved to `1333.48 ms` +> (`/home/mudler/bench/phase93_qwen3next_gqa_bcast/20260701_211019/decode_profile`). +> Carry Phase93 as the current positive candidate. Phase94 then retested +> decode geometry on top of Phase93 with env-only `GDN_NW=8 GDN_CPW=8`. It +> stayed md5/op clean (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, +> `MUL_MAT_ID 806/806`) but decode-only `gdn_core` regressed to `1440.79 ms` +> (`/home/mudler/bench/phase94_gdn_geometry_phase93/20260701_211855/decode_profile_8x8`), +> so reject 8x8 and keep Phase93's default 16x8 geometry. Phase93 trace evidence +> also shows remaining producer-side GDN work is small (`l2_norm_f32 8.65 ms`, +> GDN gate/sigmoid about `12.75 ms`, remaining repeat `5.34 ms`), so the next +> useful lead should target recurrence work or a larger packed decode contract, +> not another small producer-only fusion. Phase95 tested a default-off +> `GDN_WARP_SCALAR_GATE=1` CUDA decode specialization on top of Phase93: lane 0 +> computed the scalar non-KDA gate and broadcast it within the warp for the +> one-token `S_v=128`, default `16x8` path. Local CPU guardrails passed +> (`GATED_DELTA_NET 48/48`, in-place ids `6/6`), DGX CUDA guardrails passed +> (`GATED_DELTA_NET 48/48`, in-place ids `6/6`), canonical md5/op gates passed +> (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), but +> decode-only `gdn_core` regressed to `1402.40 ms` +> (`/home/mudler/bench/phase95_gdn_warp_scalar_gate/20260701_213311/decode_profile`). +> The runtime patch was reverted. Do not retry scalar-gate warp broadcast unless +> a future profile shows SFU pressure, rather than recurrent state traffic or +> reductions, dominating the GDN core. Phase96 then tested the narrow +> conv-state identity fast path suggested by the trace audit: when +> `s_copy_main` was identity, `build_conv_state_fused` viewed the active +> conv-cache slots directly and called `ggml_ssm_conv_update_inplace` instead of +> the ids variant. Local CPU `SSM_CONV` passed `45/45`; DGX CUDA `SSM_CONV` +> passed `45/45`; canonical gates passed (`SSM_CONV 45/45`, +> `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`, md5s +> canonical). Decode-only profile regressed to total kernel `3.6723 s`, +> `gdn_core 1406.57 ms`, and `gdn_conv 70.42 ms` +> (`/home/mudler/bench/phase96_conv_identity_fastpath/20260701_214141/decode_profile`). +> The runtime model-graph patch was reverted. Do not retry the conv identity +> branch as a speed lever unless a same-window trace proves the ids variant is +> independently dominant. Phase97 then measured the carried Phase93 stack in an +> end-to-end `n=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128` serving snapshot +> against vLLM. Pre/post canonical gates stayed green. Paged Phase93 measured +> `agg_tps 329.6`, `decode_agg_tps 669.8`, `prefill_tps 1734.5`, +> `ttft_mean_ms 7415.4`, `wall_s 24.851`; vLLM measured `agg_tps 664.8`, +> `decode_agg_tps 1029.4`, `prefill_tps 5271.8`, `ttft_mean_ms 2519.5`, +> `wall_s 11.929` +> (`/home/mudler/bench/phase97_phase93_serving_snapshot/20260701_214648`). +> Phase93 therefore remains a decode-profile positive candidate, but it does not +> close serving parity (`paged_decode_over_vllm=0.6507`). The next useful phase +> needs a larger serving-impact lever; isolated GDN/conv micro-optimizations +> have now repeatedly failed to move live serving enough. Phase98 profiled that +> carried Phase93 serving window with graph-node CUDA tracing. Pre/post gates +> stayed green. Total kernel time was `20.0411 s`; macro buckets were GDN +> `6679.96 ms` (`33.33%`), MoE/FFN-GEMM `6034.52 ms` (`30.11%`), +> bf16/fp8-proj `2766.06 ms` (`13.80%`), and layout-copy `1257.60 ms` +> (`6.28%`). Fine buckets were led by `gdn_core 5892.99 ms` (`29.40%`) and +> `mmq_nvfp4 5809.55 ms` (`28.99%`), followed by `convert_dtype 663.45 ms`, +> `gdn_conv 457.11 ms`, and `concat_layout 430.25 ms` +> (`/home/mudler/bench/phase98_phase93_serving_profile/20260701_215715`). +> This re-ranks the next work: do not spend more time on scalar GDN, conv +> identity, or gather-only shortcuts. Either attribute and remove a proven +> material layout-copy node, or pursue a larger GDN-core/MMQ serving lever with a +> standalone PoC gate. Phase99 then used the existing default-off +> `LLAMA_LAYOUT_TRACE` hook on the same Phase93 serving profile shape +> (`N=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`). Trace-enabled gates stayed +> green (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`, +> canonical MoE/dense md5s). Serving remained comparable (`total kernel +> 20.2408 s`, `layout-copy 1269.35 ms`). The trace attributed +> `concat_layout 440.01 ms` almost entirely to +> `conv_input-* = concat(conv_states_reshaped-*, qkv_mixed_transposed-*)` before +> `SSM_CONV`; `copy_layout 119.16 ms` includes `conv_state_update-*` writeback. +> The larger `convert_dtype 662.34 ms` bucket is mostly unnamed F32-to-F16 `CPY` +> rows and needs stronger attribution before coding. Decision: Phase99 is +> measurement-only; do not retry the Phase96-style conv-state identity branch. +> The only conv-side patch worth funding is a larger two-source `SSM_CONV` +> contract that reads `(conv_states, qkv_mixed)` as a logical concat, or else +> extend trace attribution for the unnamed `convert_dtype` bucket first +> (`/home/mudler/bench/phase99_layout_trace/20260701_200835/serving_profile`). +> Phase100 extended that trace with `dst_view`, `src0_view`, and `src1_view` +> names. The trace-only patch built locally and on DGX, and trace-enabled gates +> stayed green (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, +> `MUL_MAT_ID 806/806`, canonical MoE/dense md5s). Serving stayed comparable +> (`total kernel 20.3464 s`, `convert_dtype 661.73 ms`, `concat_layout +> 438.15 ms`). The new fields identify a concrete `convert_dtype` source: +> `GET_ROWS` reads F16 `cache_k_l*` / `cache_v_l*` into F32 `node_*`, then +> `CPY` downcasts views such as `src0_view=node_358` / `node_365` to F16 +> attention-shaped tensors. This repeats across attention layers +> (`cache_k_l3/v_l3`, `cache_k_l7/v_l7`, `cache_k_l11/v_l11`, ...). Some F32->F16 +> rows remain unnamed, so the next runtime phase should be a narrow K/V cache +> get_rows dtype A/B, not a broad layout rewrite +> (`/home/mudler/bench/phase100_layout_view_trace/20260701_201800/serving_profile`). +> Phase101 implemented that narrow A/B as default-off +> `LLAMA_PAGED_KV_GET_ROWS_F16=1`: add `ggml_get_rows_type`, support CPU F16 +> source -> F16 destination row copy, and use typed F16 `GET_ROWS` only for +> paged K/V gather when the cache tensor is F16. Local and DGX builds completed; +> CUDA `GET_ROWS` passed `49/49` including the new F16-output cases; default and +> opt-in md5/op gates stayed green (`GET_ROWS 49/49`, `GATED_DELTA_NET 48/48`, +> `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`, canonical MoE/dense md5s). +> Serving profile under opt-in measured `total kernel 20.1989 s`, `agg_tps +> 206.4`, `decode_agg_tps 328.0`, and `ttft_mean_ms 8211.1`. It reduced +> `copy_layout 116.25 -> 80.32 ms` and macro `layout-copy 1262.58 -> 1220.30 ms` +> versus Phase100, but `convert_dtype` stayed flat (`661.73 -> 661.35 ms`) and +> serving throughput did not improve. Carry Phase101 only as a small default-off +> cleanup candidate pending repeat A/B; do not promote it as a parity lever +> (`/home/mudler/bench/phase101_kv_get_rows_f16/20260701_203930/serving_profile`). +> Phase102 then implemented the funded two-source `SSM_CONV` contract as +> default-off `LLAMA_SSM_CONV_SPLIT=1`: `ggml_ssm_conv_split(ctx, conv_states, +> x_cur, conv_kernel)` reuses `GGML_OP_SSM_CONV`, reads +> `[K-1,channels,n_seqs]` cached taps plus native `[channels,n_tokens,n_seqs]` +> qkv tokens as a logical concat, and is wired into Qwen3Next/Qwen35/Qwen35MoE +> only for multi-token, non-rollback batches with `n_seq_tokens >= K-1`. The +> initial semantic test exposed a harness issue (`split-base` has an exactly +> zero CPU reference, so normalized MSE reported `ERR=inf`); direct split +> CUDA-vs-CPU passed `6/6`, and the final test keeps `split-base` with absolute +> max error. Local and DGX builds passed; default, standalone opt-in, and +> serving pre/post gates stayed green (`SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`, +> `GET_ROWS 49/49`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, +> `MUL_MAT_ID 806/806`, canonical MoE/dense md5s). Opt-in serving measured +> `total kernel 19.5482 s`, `agg_tps 206.1`, `decode_agg_tps 320.0`, +> `prefill_tps 1538.0`, and `ttft_mean_ms 7928.4`. It removed the traced concat +> materialization (`concat_layout 433.13 -> 4.59 ms` versus Phase101 and +> `layout-copy 1220.30 -> 826.87 ms`), but live serving throughput still did not +> improve. Carry Phase102 as a default-off cleanup/follow-up base only; do not +> promote it as parity-closing without a repeat A/B or an additional state-update +> fusion. The remaining high-value targets are still `gdn_core`, `mmq_nvfp4`, or +> a larger serving scheduler/packed-decode contract +> (`/home/mudler/bench/phase102_ssm_conv_split/20260701_210907/serving_profile`). +> Phase103 measured Phase101+Phase102 together, with no new source changes: +> `LLAMA_SSM_CONV_SPLIT=1 LLAMA_PAGED_KV_GET_ROWS_F16=1`. Standalone and +> serving pre/post gates stayed green (`SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`, +> `GET_ROWS 49/49`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, +> `MUL_MAT_ID 806/806`, canonical MoE/dense md5s). Combined serving improved +> over Phase102 (`agg_tps 206.1 -> 212.3`, `decode_agg_tps 320.0 -> 331.5`, +> `prefill_tps 1538.0 -> 1569.1`, `wall_s 39.743 -> 38.575`) and reduced +> `layout-copy 826.87 -> 798.52 ms`; it also preserved most of the split +> SSM_CONV concat removal and recovered the F16 K/V `copy_layout` reduction +> (`copy_layout 112.53 -> 78.22 ms`). This proves the two cleanup candidates are +> compatible, but not parity-closing: `gdn_core 5930.47 ms` and `mmq_nvfp4 +> 6001.77 ms` still dominate. Carry the combined env as the cleanup comparison +> baseline; do not rerun isolated layout cleanup unless it changes a larger +> serving contract +> (`/home/mudler/bench/phase103_combined_layout_cleanups/20260701_211821/serving_profile`). +> Phase104 then measured that combined cleanup stack in the normal same-session +> serving harness against vLLM at `N=128`, `PTOK=128`, `GEN=64`, +> `PARALLEL=128`. Pre/post gates stayed green with the same expanded op set and +> canonical md5s. Paged combined measured `agg_tps 338.6`, +> `decode_agg_tps 675.8`, `prefill_tps 1813.0`, `ttft_mean_ms 7121.6`, and +> `wall_s 24.196`; vLLM measured `agg_tps 661.1`, `decode_agg_tps 1028.0`, +> `prefill_tps 5208.7`, `ttft_mean_ms 2572.3`, and `wall_s 11.980`. This is a +> small serving improvement over Phase97 (`agg_tps +2.73%`, `prefill_tps +> +4.53%`, `TTFT -3.96%`), but still not parity: `paged_decode_over_vllm=0.6574` +> and `paged_agg_over_vllm=0.5122`. Carry the combined cleanup stack as the best +> current comparison baseline. The next useful phase must attack a larger +> serving-impact contract or the dominant GDN/MMQ buckets, not more isolated +> layout-copy cleanup +> (`/home/mudler/bench/phase104_combined_serving_snapshot/20260701_212551`). +> Phase105 refreshed grouped-MMQ evidence on that current stack without source +> changes. `MUL_MAT_ID_RAGGED_MOE` stayed green both default and trace-enabled +> (`6/6`), full `MUL_MAT_ID` stayed green (`806/806`), and the live serving +> retry returned a non-empty response while recording `120` shape and launch +> lines. The live sample was prefill-like (`ncols_max=317`, density `10`, +> `mmq_x_best=112`, `stream_k=1`) with no small-M lines; all launches had +> `fixup=0`, `stream_k_blocks == ntiles_dst`, and efficiency `100`. This +> confirms the current cleanup stack did not open a new cheap MMQ shortcut. +> Do not add another host-side MMQ tile policy; only revisit MMQ for a +> genuinely structural kernel or serving-contract change +> (`/home/mudler/bench/phase105_mmq_current_shape/20260701_214129_serving_retry`). +> Phase106 tested the remaining low-conflict C1 operating-point hypothesis on +> the current stack: same-session `N=128/192/256` with `PARALLEL=256`, +> `VLLM_MAX_NUM_SEQS=256`, and the combined cleanup env. Pre/post gates stayed +> green with canonical md5s and the expanded op set. vLLM completed all legs and +> stayed ahead: at `N=256`, paged measured `agg_tps 338.4`, +> `decode_agg_tps 824.6`, `ttft_mean_ms 14933.5`, while vLLM measured +> `agg_tps 723.8`, `decode_agg_tps 1320.4`, `ttft_mean_ms 4999.0`. Reject C1 +> for the current GB10 stack. The next source phase should be structural +> persistent-batch/fused-MoE/GDN work, not another scheduler shortcut +> (`/home/mudler/bench/phase106_max_concurrency_current_stack/20260701_214907`). +> Phase107 established the fused-MoE structural guardrail surface before coding: +> `MOE_SWIGLU_DOWN 7/7`, `MOE_WEIGHTED_COMBINE 7/7`, and +> `MUL_MAT_ID_RAGGED_MOE 6/6` passed on CUDA0. However, +> `test-backend-ops perf` did not provide usable timing rows for these custom +> whole-graph cases; the broad `MUL_MAT_ID` perf CSV reported support metadata +> only. The next source patch should be measurement-only: add a narrow MoE +> fusion timing harness with explicit GPU synchronization and CSV timing before +> funding any fused routed-MoE kernel +> (`/home/mudler/bench/phase107_moe_fusion_guardrail/20260701_220227`). - Historical verdict: the older investigation marked GB10 parity **CLOSED** and unreachable. Treat that as superseded where Phase50-54 provide newer dense @@ -1416,3 +2296,44 @@ assumption is too narrow for a sub-millisecond capture-level win. Do not spend more parity time on gather-only GDN shortcuts unless a future profile makes gather material. The next serious GDN scope remains recurrent-state precision/traffic. + +## Series trim (phases 110-140 review, 2026-07-02) + +The campaign's on-disk patches `0048-0063` were added without matching fork +commits (a fork-first policy violation). After a keep/drop review of the +phase 110-140 work, the series was trimmed to a single kept line plus the +gate harness, and re-mirrored to the fork: + +- KEEP - test sentinels (the MoE gate harness): `MOE_SWIGLU_DOWN`, + `MOE_SWIGLU_COMBINE`, `MUL_MAT_ID_RAGGED_MOE` (old `0051-0053`). +- KEEP - the MTP-draft correctness fix (old `0054`): forces target-side + sampler acceptance for MTP drafts (backend draft sampling can request + multiple output rows per sequence); the backend ships `-mtp` gallery models. +- KEEP - the Phase135 routed-FFN fused-quant line: whole-pattern MoE matcher + + routed-FFN executor hook (Phase120/121), the routed-FFN PoC scaffold + `moe-ffn.{cu,cuh}` (Phase132), and the fused SwiGLU-to-NVFP4-quant + raw down + MMQ (`ggml_cuda_mul_mat_q_moe_quantized` + local `ggml_cuda_mmq_ids_meta` + refactor, Phase135). All default-off, md5-clean opt-in, six + `mmq_moe_quantized_raw` markers with zero sorted launches on the sentinel. + +- DROP - W4A16 grouped-tile pack/tune/pad (old `0048-0050`): dead line, W4A16 + is ~1.5x slower than grouped-MMQ. +- DROP - speculative/trace/cublas-route/mmid-route/mul-mat-route traces + the + rejected small-M tile-policy knob (old `0055-0063`). +- DROP - all other campaign keep-markers not needed by Phase135: GPU-sort + (Phase110), W4A16-direct-A (Phase112), boundary trace/timing (Phase117), + Phase133 sorted-F32 down, Phase134 fused-SWIGLU-only, Phase138 + finalize/weighted-combine. The final fork tree carries zero of these markers. + +Fork branch `mudler/llama.cpp:localai-paged` re-mirrored on top of +`51168c5ee` (LocalAI series `0001-0047`): + +- `fd920cf8a` test(paged): cover MoE swiglu down chain +- `a85c1e098` test(paged): cover MoE weighted combine chain +- `2fed6aacf` test(paged): cover ragged MoE dispatch +- `f1d976f06` fix(speculative): disable backend sampling for MTP drafts +- `1edddc8fe` feat(paged): whole-pattern MoE matcher + routed-FFN fused + NVFP4-quant down MMQ + +New fork HEAD `1edddc8fe`, tree `097c862c`. The rejected/neutral levers of +the 110-140 campaign are recorded above and in the per-phase bench artifacts.