docs(paged): scope GDN M5 state-boundary phase

Add the Phase 11 design and implementation plan for a default-off C16 M5 QS-early GDN experiment after rejecting C32 slabs. Assisted-by: Codex:gpt-5
2026-07-02 20:37:03 -04:00 · 2026-07-01 01:21:05 +00:00
parent 3da3b169fb
commit 24e778de47
2 changed files with 407 additions and 0 deletions
--- a/docs/superpowers/plans/2026-07-01-gdn-m5-state-boundary-phase11.md
+++ b/docs/superpowers/plans/2026-07-01-gdn-m5-state-boundary-phase11.md
@@ -0,0 +1,329 @@
+# GDN M5 State-Boundary Phase 11 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Test a low-conflict C=16 M5 GDN prefill variant that moves/reuses the QS state-boundary product earlier without changing chunk size or decode routing.
+
+**Architecture:** Phase 11 is fork-first and default-off. It keeps the shipped C=16 M5 path as the baseline, adds one explicit env-selected candidate in `gated_delta_net.cu`, and accepts it only if focused op gates, canonical md5 gates, and same-session GB10 A/B benchmarks all pass.
+
+**Tech Stack:** llama.cpp CUDA GDN kernel, DGX GB10 CUDA build, Qwen3.6 NVFP4 MoE/dense GGUF, LocalAI paged patch stack.
+
+---
+
+## Guardrails
+
+- Do not reintroduce C32 slab code.
+- Do not change the default `GDN_TC=5` path until the candidate wins.
+- Keep `GDN_CHUNK_MIN > 1`; decode must stay on the sequential recurrence.
+- Prefer one env-gated shortcut over new helper files or global scratch.
+- Gate every change with `GATED_DELTA_NET` and canonical MoE/dense md5 before
+  any performance claim.
+
+## Task 1: Preflight and Baseline
+
+**Files:**
+- Read-only: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
+- Artifact: `/home/mudler/bench/phase11_gdn_m5_state_boundary/`
+
+- [ ] **Step 1: Check DGX is free**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -e
+echo docker=$(docker ps -q | wc -l)
+echo local_ai_worker=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true)
+echo compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l)
+if [ -f ~/gpu_bench_lock/owner ]; then cat ~/gpu_bench_lock/owner; else echo FREE-no-lock-file; fi'
+```
+
+Expected:
+
+```text
+docker=0
+local_ai_worker=0
+compute=0
+FREE...
+```
+
+- [ ] **Step 2: Record source provenance**
+
+Run:
+
+```bash
+ssh dgx.casa 'cd /home/mudler/llama-phase6-source && git status --short && git rev-parse HEAD'
+git -C /home/mudler/_git/llama.cpp status --short
+git -C /home/mudler/_git/llama.cpp rev-parse HEAD
+```
+
+Expected: clean llama.cpp fork and DGX mirror before source edits.
+
+- [ ] **Step 3: Create artifact directory**
+
+Run:
+
+```bash
+ssh dgx.casa 'mkdir -p /home/mudler/bench/phase11_gdn_m5_state_boundary/{gates,ab,rejected}'
+```
+
+Expected: command exits 0.
+
+## Task 2: Add Default-Off Candidate
+
+**Files:**
+- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
+- Mirror: `/home/mudler/llama-phase6-source/ggml/src/ggml-cuda/gated_delta_net.cu`
+
+- [ ] **Step 1: Add an env selector**
+
+Add a static env flag near the existing `gdn_tc` selector:
+
+```cpp
+static const bool gdn_m5_qs_early = []{
+    const char * e = getenv("GDN_M5_QS_EARLY");
+    return e && atoi(e) != 0;
+}();
+```
+
+Route it only for `S_v == 128 && n_tokens >= gdn_chunk_min && gdn_tc >= 4`.
+
+- [ ] **Step 2: Add a template boolean for the candidate**
+
+Extend the chunked launch templates with a defaulted boolean, keeping existing
+call sites source-compatible:
+
+```cpp
+template <int S_v, int C, int TC = 0, bool QS_EARLY = false>
+__global__ void gated_delta_net_chunked_cuda(...)
+```
+
+and:
+
+```cpp
+template <int S_v, int C, int TC = 0, bool QS_EARLY = false>
+static void launch_gdn_chunked(...)
+```
+
+Use the boolean only inside the M3/M5 code path. Existing launches must remain
+`launch_gdn_chunked<128, 16, TC_>(...)`.
+
+- [ ] **Step 3: Move QS deposition earlier for candidate only**
+
+In `gated_delta_net_chunked_cuda`, after the KS/RHS section and before the
+`solve A U = RHS` section, add a candidate-only QS pass:
+
+```cpp
+if constexpr (QS_EARLY && TC >= 2) {
+    const int w    = threadIdx.x >> 5;
+    const int lane = threadIdx.x & 31;
+    const int lg   = lane >> 2;
+    const int lt   = lane & 3;
+    constexpr int NWARP = S_v / 32;
+    constexpr int NT    = dv / 8;
+    constexpr int NTPW  = (NT + NWARP - 1) / NWARP;
+    #pragma unroll
+    for (int mt = 0; mt < (C + 15) / 16; mt++) {
+        const int rowbase = mt * 16;
+        #pragma unroll
+        for (int nn = 0; nn < NTPW; nn++) {
+            const int nt = w * NTPW + nn;
+            if (nt >= NT) break;
+            const int colbase = nt * 8;
+            float cc[4];
+            gdn_gram_tile_mma_3x<dk>(cc, Qc, Sd, rowbase, colbase, lg, lt);
+            const int tt[4] = {rowbase + lg, rowbase + lg, rowbase + lg + 8, rowbase + lg + 8};
+            const int jj[4] = {colbase + 2*lt, colbase + 2*lt + 1, colbase + 2*lt, colbase + 2*lt + 1};
+            #pragma unroll
+            for (int l = 0; l < 4; l++) {
+                const int t = tt[l], jc = jj[l];
+                if (t < Cc && jc < dv) {
+                    attn_base[(c0 + t) * S_v * H + jc] = gam[t] * cc[l];
+                }
+            }
+        }
+    }
+    __syncthreads();
+}
+```
+
+Then change the existing QS deposition block to:
+
+```cpp
+if constexpr (TC >= 2 && !QS_EARLY) {
+    ...
+}
+```
+
+This is intentionally conservative. It should not change math order for the
+deposited QS values, only their scheduling relative to the solve/P build.
+
+- [ ] **Step 4: Add a candidate launch arm**
+
+In `launch_gated_delta_net`, when `gdn_m5_qs_early && gdn_tc >= 4`, call:
+
+```cpp
+launch_gdn_chunked<128, 16, 4, true>(...);
+return;
+```
+
+Default M5 must continue to call `launch_gdn_chunked<128, 16, 4>(...)`.
+
+- [ ] **Step 5: Mirror to DGX and build**
+
+Run:
+
+```bash
+rsync -a /home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu \
+  dgx.casa:/home/mudler/llama-phase6-source/ggml/src/ggml-cuda/gated_delta_net.cu
+ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda && cmake --build . --target test-backend-ops llama-completion llama-batched-bench -j 8'
+```
+
+Expected: build exits 0.
+
+## Task 3: Correctness Gates
+
+**Files:**
+- Artifact: `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/`
+
+- [ ] **Step 1: Run focused op gates**
+
+Run:
+
+```bash
+ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin
+ART=$HOME/bench/phase11_gdn_m5_state_boundary/gates
+./test-backend-ops test -b CUDA0 -o GATED_DELTA_NET -j 1 > "$ART/gated_delta_net_default.txt" 2>&1
+GDN_M5_QS_EARLY=1 GDN_TC=5 GDN_CHUNK_MIN=2 ./test-backend-ops test -b CUDA0 -o GATED_DELTA_NET -j 1 > "$ART/gated_delta_net_qs_early.txt" 2>&1'
+```
+
+Expected: both logs show CUDA0 `OK` for all `GATED_DELTA_NET` cases.
+
+- [ ] **Step 2: Run canonical md5 gates**
+
+Run:
+
+```bash
+ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin
+ART=$HOME/bench/phase11_gdn_m5_state_boundary/gates
+LBASE="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GGML_NO_BACKTRACE=1"
+LCAND="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=2 GDN_M5_QS_EARLY=1 GGML_NO_BACKTRACE=1"
+env $LBASE ./llama-completion -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" </dev/null > "$ART/gate_moe_default.txt" 2> "$ART/gate_moe_default.err"
+md5sum "$ART/gate_moe_default.txt" | tee "$ART/gate_moe_default.md5"
+env $LBASE ./llama-completion -m /home/mudler/bench/q36-27b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" </dev/null > "$ART/gate_dense_default.txt" 2> "$ART/gate_dense_default.err"
+md5sum "$ART/gate_dense_default.txt" | tee "$ART/gate_dense_default.md5"
+env $LCAND ./llama-completion -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" </dev/null > "$ART/gate_moe_qs_early.txt" 2> "$ART/gate_moe_qs_early.err"
+md5sum "$ART/gate_moe_qs_early.txt" | tee "$ART/gate_moe_qs_early.md5"
+env $LCAND ./llama-completion -m /home/mudler/bench/q36-27b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" </dev/null > "$ART/gate_dense_qs_early.txt" 2> "$ART/gate_dense_qs_early.err"
+md5sum "$ART/gate_dense_qs_early.txt" | tee "$ART/gate_dense_qs_early.md5"'
+```
+
+Expected:
+
+```text
+MoE   8cb0ce23777bf55f92f63d0292c756b0
+Dense 5951a5b4d624ce891e22ab5fca9bc439
+```
+
+- [ ] **Step 3: Stop if md5 changes**
+
+If either candidate md5 differs, do not benchmark yet. Run the KL gate from
+`backend/cpp/llama-cpp-localai-paged/docs/PAGED_BITEXACT_NOTE.md` and accept
+only if KL is benign and the transcript is sane.
+
+## Task 4: Performance A/B
+
+**Files:**
+- Artifact: `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/`
+
+- [ ] **Step 1: Run same-session MoE and dense A/B**
+
+Run:
+
+```bash
+ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin
+ART=$HOME/bench/phase11_gdn_m5_state_boundary/ab
+mkdir -p "$ART"
+LBASE="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GGML_NO_BACKTRACE=1"
+LCAND="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GDN_M5_QS_EARLY=1 GGML_NO_BACKTRACE=1"
+env $LBASE ./llama-batched-bench -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 > "$ART/moe_base.txt" 2>&1
+env $LCAND ./llama-batched-bench -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 > "$ART/moe_qs_early.txt" 2>&1
+env $LBASE ./llama-batched-bench -m /home/mudler/bench/q36-27b-nvfp4.gguf -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 > "$ART/dense_base.txt" 2>&1
+env $LCAND ./llama-batched-bench -m /home/mudler/bench/q36-27b-nvfp4.gguf -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 > "$ART/dense_qs_early.txt" 2>&1'
+```
+
+Expected: candidate improves S_PP for at least the target MoE prefill cases and
+does not regress dense outside noise.
+
+- [ ] **Step 2: Decide accept/reject**
+
+Accept only if:
+
+- op gates pass,
+- md5 is canonical or KL-benign,
+- MoE S_PP improves,
+- dense S_PP does not regress,
+- decode routing remains untouched by `GDN_CHUNK_MIN > 1`.
+
+Reject if the candidate is flat/slower. Save:
+
+```bash
+git -C /home/mudler/_git/llama.cpp diff -- ggml/src/ggml-cuda/gated_delta_net.cu \
+  > /home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff
+```
+
+Then restore fork and DGX mirror.
+
+## Task 5: Mirror Accepted Patch or Record Rejection
+
+**Files:**
+- Create if accepted: `backend/cpp/llama-cpp-localai-paged/patches/paged/0055-...patch`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md`
+
+- [ ] **Step 1: If accepted, commit fork patch**
+
+Commit in `/home/mudler/_git/llama.cpp` only after gates pass:
+
+```bash
+git add ggml/src/ggml-cuda/gated_delta_net.cu
+git commit -m "feat(cuda): add gated delta net M5 QS-early path"
+```
+
+- [ ] **Step 2: Generate LocalAI patch**
+
+Run:
+
+```bash
+git -C /home/mudler/_git/llama.cpp format-patch -1 HEAD \
+  --stdout > backend/cpp/llama-cpp-localai-paged/patches/paged/0055-feat-cuda-add-gated-delta-net-M5-QS-early-path.patch
+```
+
+Do not hand-edit the generated patch.
+
+- [ ] **Step 3: Update docs and commit LocalAI**
+
+Record artifacts, md5/KL results, A/B numbers, and the decision. Commit with:
+
+```bash
+git add backend/cpp/llama-cpp-localai-paged/patches/paged/0055-feat-cuda-add-gated-delta-net-M5-QS-early-path.patch \
+  backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
+  backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \
+  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
+git commit -m "feat(paged): add GDN M5 QS-early path" \
+  -m "Assisted-by: Codex:gpt-5"
+```
+
+If rejected, commit docs only:
+
+```bash
+git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
+  backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \
+  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md \
+  docs/superpowers/plans/2026-07-01-gdn-m5-state-boundary-phase11.md
+git commit -m "docs(paged): record GDN M5 QS-early result" \
+  -m "Assisted-by: Codex:gpt-5"
+```
--- a/docs/superpowers/specs/2026-07-01-gdn-m5-state-boundary-design.md
+++ b/docs/superpowers/specs/2026-07-01-gdn-m5-state-boundary-design.md
@@ -0,0 +1,78 @@
+# GDN M5 State-Boundary Design
+
+## Context
+
+Phase 10 tested a default-off C32 slabbed M5 path for
+`ggml/src/ggml-cuda/gated_delta_net.cu`. It was correctness-clean only after
+zeroing staged tail rows, then failed the performance gate:
+
+| Model | PP | M5 S_PP t/s | C32 slab S_PP t/s |
+|-------|----|-------------|-------------------|
+| MoE | 2048 | 2430.32 | 2054.86 |
+| Dense | 2048 | 1019.25 | 903.73 |
+
+The likely root cause is duplicated A/T work per value slab. vLLM/FLA computes
+the per-chunk triangular object once and reuses it through the WY transform; the
+two-slab M5 shortcut could not do that without a larger scratch/precompute
+design.
+
+## Design Choice
+
+Phase 11 stays at the shipped C=16 M5 geometry and tests a smaller C=16
+state-boundary variant before reopening chunk-size changes. The candidate
+targets the two tensor-core state-boundary products that both multiply a chunk
+matrix by the same pre-update state `Sd`:
+
+- `KS = Kc * S0`, currently used to form `Ud`.
+- `QS = Qc * S0`, currently deposited later as the cross-chunk output term.
+
+The first implementation should be default-off and selected by an explicit env
+var such as `GDN_M5_QS_EARLY=1`. It should not change the default `GDN_TC=5`
+path until it clears correctness and performance gates.
+
+## Candidate Shape
+
+The low-conflict version moves the QS state-boundary pass earlier in the C=16
+M5 chunk loop and stores `gamma_t * QS[t][j]` in `attn_base` before the solve.
+The later output section then reuses the predeposited cross-chunk term exactly
+as it does today.
+
+This does not yet fuse K and Q in a single MMA instruction. It tests whether
+moving QS earlier and tightening the state-boundary scheduling helps without
+new global scratch, changed state ownership, or C32 slab duplication. If it is
+flat, Phase 11 should be rejected quickly and the next GDN work should be a
+larger shared-A/Ai design rather than more local scheduling.
+
+## Non-Goals
+
+- Do not reintroduce C32 slabs.
+- Do not add global A/Ai scratch in this phase.
+- Do not import vLLM CuteDSL/TMA kernels.
+- Do not route decode into chunked GDN; keep `GDN_CHUNK_MIN > 1`.
+- Do not change default inference behavior unless gates prove a win.
+
+## Gates
+
+Correctness:
+
+- Build `test-backend-ops` and `llama-completion` on DGX.
+- Run default and forced-candidate `GATED_DELTA_NET` CUDA0 gates.
+- Run canonical MoE and dense greedy md5 gates:
+  - MoE: `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Dense: `5951a5b4d624ce891e22ab5fca9bc439`.
+- If md5 changes, stop and run the existing KL gate before any performance
+  claim.
+
+Performance:
+
+- Compare against current M5 with `GDN_TC=5 GDN_CHUNK_MIN=64`.
+- Run MoE and dense `llama-batched-bench` at `npp=512,2048`, `ntg=4`,
+  `npl=32`.
+- Reject if either model regresses outside noise or if the GDN bucket does not
+  improve in a profile-gated follow-up.
+
+## Decision Rule
+
+Accept only if the candidate is md5/KL-safe and improves end-to-end S_PP. If it
+is flat or slower, record it as rejected and move to a larger shared-A/Ai
+blocked-solve design, likely requiring a separate scratch/precompute phase.