docs(paged): scope GDN M5 state-boundary phase

Add the Phase 11 design and implementation plan for a default-off C16 M5 QS-early GDN experiment after rejecting C32 slabs.

Assisted-by: Codex:gpt-5
This commit is contained in:
Ettore Di Giacinto
2026-07-01 01:21:05 +00:00
parent 3da3b169fb
commit 24e778de47
2 changed files with 407 additions and 0 deletions

View File

@@ -0,0 +1,329 @@
# GDN M5 State-Boundary Phase 11 Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Test a low-conflict C=16 M5 GDN prefill variant that moves/reuses the QS state-boundary product earlier without changing chunk size or decode routing.
**Architecture:** Phase 11 is fork-first and default-off. It keeps the shipped C=16 M5 path as the baseline, adds one explicit env-selected candidate in `gated_delta_net.cu`, and accepts it only if focused op gates, canonical md5 gates, and same-session GB10 A/B benchmarks all pass.
**Tech Stack:** llama.cpp CUDA GDN kernel, DGX GB10 CUDA build, Qwen3.6 NVFP4 MoE/dense GGUF, LocalAI paged patch stack.
---
## Guardrails
- Do not reintroduce C32 slab code.
- Do not change the default `GDN_TC=5` path until the candidate wins.
- Keep `GDN_CHUNK_MIN > 1`; decode must stay on the sequential recurrence.
- Prefer one env-gated shortcut over new helper files or global scratch.
- Gate every change with `GATED_DELTA_NET` and canonical MoE/dense md5 before
any performance claim.
## Task 1: Preflight and Baseline
**Files:**
- Read-only: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
- Artifact: `/home/mudler/bench/phase11_gdn_m5_state_boundary/`
- [ ] **Step 1: Check DGX is free**
Run:
```bash
ssh dgx.casa 'set -e
echo docker=$(docker ps -q | wc -l)
echo local_ai_worker=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true)
echo compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l)
if [ -f ~/gpu_bench_lock/owner ]; then cat ~/gpu_bench_lock/owner; else echo FREE-no-lock-file; fi'
```
Expected:
```text
docker=0
local_ai_worker=0
compute=0
FREE...
```
- [ ] **Step 2: Record source provenance**
Run:
```bash
ssh dgx.casa 'cd /home/mudler/llama-phase6-source && git status --short && git rev-parse HEAD'
git -C /home/mudler/_git/llama.cpp status --short
git -C /home/mudler/_git/llama.cpp rev-parse HEAD
```
Expected: clean llama.cpp fork and DGX mirror before source edits.
- [ ] **Step 3: Create artifact directory**
Run:
```bash
ssh dgx.casa 'mkdir -p /home/mudler/bench/phase11_gdn_m5_state_boundary/{gates,ab,rejected}'
```
Expected: command exits 0.
## Task 2: Add Default-Off Candidate
**Files:**
- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
- Mirror: `/home/mudler/llama-phase6-source/ggml/src/ggml-cuda/gated_delta_net.cu`
- [ ] **Step 1: Add an env selector**
Add a static env flag near the existing `gdn_tc` selector:
```cpp
static const bool gdn_m5_qs_early = []{
const char * e = getenv("GDN_M5_QS_EARLY");
return e && atoi(e) != 0;
}();
```
Route it only for `S_v == 128 && n_tokens >= gdn_chunk_min && gdn_tc >= 4`.
- [ ] **Step 2: Add a template boolean for the candidate**
Extend the chunked launch templates with a defaulted boolean, keeping existing
call sites source-compatible:
```cpp
template <int S_v, int C, int TC = 0, bool QS_EARLY = false>
__global__ void gated_delta_net_chunked_cuda(...)
```
and:
```cpp
template <int S_v, int C, int TC = 0, bool QS_EARLY = false>
static void launch_gdn_chunked(...)
```
Use the boolean only inside the M3/M5 code path. Existing launches must remain
`launch_gdn_chunked<128, 16, TC_>(...)`.
- [ ] **Step 3: Move QS deposition earlier for candidate only**
In `gated_delta_net_chunked_cuda`, after the KS/RHS section and before the
`solve A U = RHS` section, add a candidate-only QS pass:
```cpp
if constexpr (QS_EARLY && TC >= 2) {
const int w = threadIdx.x >> 5;
const int lane = threadIdx.x & 31;
const int lg = lane >> 2;
const int lt = lane & 3;
constexpr int NWARP = S_v / 32;
constexpr int NT = dv / 8;
constexpr int NTPW = (NT + NWARP - 1) / NWARP;
#pragma unroll
for (int mt = 0; mt < (C + 15) / 16; mt++) {
const int rowbase = mt * 16;
#pragma unroll
for (int nn = 0; nn < NTPW; nn++) {
const int nt = w * NTPW + nn;
if (nt >= NT) break;
const int colbase = nt * 8;
float cc[4];
gdn_gram_tile_mma_3x<dk>(cc, Qc, Sd, rowbase, colbase, lg, lt);
const int tt[4] = {rowbase + lg, rowbase + lg, rowbase + lg + 8, rowbase + lg + 8};
const int jj[4] = {colbase + 2*lt, colbase + 2*lt + 1, colbase + 2*lt, colbase + 2*lt + 1};
#pragma unroll
for (int l = 0; l < 4; l++) {
const int t = tt[l], jc = jj[l];
if (t < Cc && jc < dv) {
attn_base[(c0 + t) * S_v * H + jc] = gam[t] * cc[l];
}
}
}
}
__syncthreads();
}
```
Then change the existing QS deposition block to:
```cpp
if constexpr (TC >= 2 && !QS_EARLY) {
...
}
```
This is intentionally conservative. It should not change math order for the
deposited QS values, only their scheduling relative to the solve/P build.
- [ ] **Step 4: Add a candidate launch arm**
In `launch_gated_delta_net`, when `gdn_m5_qs_early && gdn_tc >= 4`, call:
```cpp
launch_gdn_chunked<128, 16, 4, true>(...);
return;
```
Default M5 must continue to call `launch_gdn_chunked<128, 16, 4>(...)`.
- [ ] **Step 5: Mirror to DGX and build**
Run:
```bash
rsync -a /home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu \
dgx.casa:/home/mudler/llama-phase6-source/ggml/src/ggml-cuda/gated_delta_net.cu
ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda && cmake --build . --target test-backend-ops llama-completion llama-batched-bench -j 8'
```
Expected: build exits 0.
## Task 3: Correctness Gates
**Files:**
- Artifact: `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/`
- [ ] **Step 1: Run focused op gates**
Run:
```bash
ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin
ART=$HOME/bench/phase11_gdn_m5_state_boundary/gates
./test-backend-ops test -b CUDA0 -o GATED_DELTA_NET -j 1 > "$ART/gated_delta_net_default.txt" 2>&1
GDN_M5_QS_EARLY=1 GDN_TC=5 GDN_CHUNK_MIN=2 ./test-backend-ops test -b CUDA0 -o GATED_DELTA_NET -j 1 > "$ART/gated_delta_net_qs_early.txt" 2>&1'
```
Expected: both logs show CUDA0 `OK` for all `GATED_DELTA_NET` cases.
- [ ] **Step 2: Run canonical md5 gates**
Run:
```bash
ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin
ART=$HOME/bench/phase11_gdn_m5_state_boundary/gates
LBASE="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GGML_NO_BACKTRACE=1"
LCAND="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=2 GDN_M5_QS_EARLY=1 GGML_NO_BACKTRACE=1"
env $LBASE ./llama-completion -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" </dev/null > "$ART/gate_moe_default.txt" 2> "$ART/gate_moe_default.err"
md5sum "$ART/gate_moe_default.txt" | tee "$ART/gate_moe_default.md5"
env $LBASE ./llama-completion -m /home/mudler/bench/q36-27b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" </dev/null > "$ART/gate_dense_default.txt" 2> "$ART/gate_dense_default.err"
md5sum "$ART/gate_dense_default.txt" | tee "$ART/gate_dense_default.md5"
env $LCAND ./llama-completion -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" </dev/null > "$ART/gate_moe_qs_early.txt" 2> "$ART/gate_moe_qs_early.err"
md5sum "$ART/gate_moe_qs_early.txt" | tee "$ART/gate_moe_qs_early.md5"
env $LCAND ./llama-completion -m /home/mudler/bench/q36-27b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" </dev/null > "$ART/gate_dense_qs_early.txt" 2> "$ART/gate_dense_qs_early.err"
md5sum "$ART/gate_dense_qs_early.txt" | tee "$ART/gate_dense_qs_early.md5"'
```
Expected:
```text
MoE 8cb0ce23777bf55f92f63d0292c756b0
Dense 5951a5b4d624ce891e22ab5fca9bc439
```
- [ ] **Step 3: Stop if md5 changes**
If either candidate md5 differs, do not benchmark yet. Run the KL gate from
`backend/cpp/llama-cpp-localai-paged/docs/PAGED_BITEXACT_NOTE.md` and accept
only if KL is benign and the transcript is sane.
## Task 4: Performance A/B
**Files:**
- Artifact: `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/`
- [ ] **Step 1: Run same-session MoE and dense A/B**
Run:
```bash
ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin
ART=$HOME/bench/phase11_gdn_m5_state_boundary/ab
mkdir -p "$ART"
LBASE="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GGML_NO_BACKTRACE=1"
LCAND="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GDN_M5_QS_EARLY=1 GGML_NO_BACKTRACE=1"
env $LBASE ./llama-batched-bench -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 > "$ART/moe_base.txt" 2>&1
env $LCAND ./llama-batched-bench -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 > "$ART/moe_qs_early.txt" 2>&1
env $LBASE ./llama-batched-bench -m /home/mudler/bench/q36-27b-nvfp4.gguf -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 > "$ART/dense_base.txt" 2>&1
env $LCAND ./llama-batched-bench -m /home/mudler/bench/q36-27b-nvfp4.gguf -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 > "$ART/dense_qs_early.txt" 2>&1'
```
Expected: candidate improves S_PP for at least the target MoE prefill cases and
does not regress dense outside noise.
- [ ] **Step 2: Decide accept/reject**
Accept only if:
- op gates pass,
- md5 is canonical or KL-benign,
- MoE S_PP improves,
- dense S_PP does not regress,
- decode routing remains untouched by `GDN_CHUNK_MIN > 1`.
Reject if the candidate is flat/slower. Save:
```bash
git -C /home/mudler/_git/llama.cpp diff -- ggml/src/ggml-cuda/gated_delta_net.cu \
> /home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff
```
Then restore fork and DGX mirror.
## Task 5: Mirror Accepted Patch or Record Rejection
**Files:**
- Create if accepted: `backend/cpp/llama-cpp-localai-paged/patches/paged/0055-...patch`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md`
- [ ] **Step 1: If accepted, commit fork patch**
Commit in `/home/mudler/_git/llama.cpp` only after gates pass:
```bash
git add ggml/src/ggml-cuda/gated_delta_net.cu
git commit -m "feat(cuda): add gated delta net M5 QS-early path"
```
- [ ] **Step 2: Generate LocalAI patch**
Run:
```bash
git -C /home/mudler/_git/llama.cpp format-patch -1 HEAD \
--stdout > backend/cpp/llama-cpp-localai-paged/patches/paged/0055-feat-cuda-add-gated-delta-net-M5-QS-early-path.patch
```
Do not hand-edit the generated patch.
- [ ] **Step 3: Update docs and commit LocalAI**
Record artifacts, md5/KL results, A/B numbers, and the decision. Commit with:
```bash
git add backend/cpp/llama-cpp-localai-paged/patches/paged/0055-feat-cuda-add-gated-delta-net-M5-QS-early-path.patch \
backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \
backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
git commit -m "feat(paged): add GDN M5 QS-early path" \
-m "Assisted-by: Codex:gpt-5"
```
If rejected, commit docs only:
```bash
git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \
backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md \
docs/superpowers/plans/2026-07-01-gdn-m5-state-boundary-phase11.md
git commit -m "docs(paged): record GDN M5 QS-early result" \
-m "Assisted-by: Codex:gpt-5"
```

View File

@@ -0,0 +1,78 @@
# GDN M5 State-Boundary Design
## Context
Phase 10 tested a default-off C32 slabbed M5 path for
`ggml/src/ggml-cuda/gated_delta_net.cu`. It was correctness-clean only after
zeroing staged tail rows, then failed the performance gate:
| Model | PP | M5 S_PP t/s | C32 slab S_PP t/s |
|-------|----|-------------|-------------------|
| MoE | 2048 | 2430.32 | 2054.86 |
| Dense | 2048 | 1019.25 | 903.73 |
The likely root cause is duplicated A/T work per value slab. vLLM/FLA computes
the per-chunk triangular object once and reuses it through the WY transform; the
two-slab M5 shortcut could not do that without a larger scratch/precompute
design.
## Design Choice
Phase 11 stays at the shipped C=16 M5 geometry and tests a smaller C=16
state-boundary variant before reopening chunk-size changes. The candidate
targets the two tensor-core state-boundary products that both multiply a chunk
matrix by the same pre-update state `Sd`:
- `KS = Kc * S0`, currently used to form `Ud`.
- `QS = Qc * S0`, currently deposited later as the cross-chunk output term.
The first implementation should be default-off and selected by an explicit env
var such as `GDN_M5_QS_EARLY=1`. It should not change the default `GDN_TC=5`
path until it clears correctness and performance gates.
## Candidate Shape
The low-conflict version moves the QS state-boundary pass earlier in the C=16
M5 chunk loop and stores `gamma_t * QS[t][j]` in `attn_base` before the solve.
The later output section then reuses the predeposited cross-chunk term exactly
as it does today.
This does not yet fuse K and Q in a single MMA instruction. It tests whether
moving QS earlier and tightening the state-boundary scheduling helps without
new global scratch, changed state ownership, or C32 slab duplication. If it is
flat, Phase 11 should be rejected quickly and the next GDN work should be a
larger shared-A/Ai design rather than more local scheduling.
## Non-Goals
- Do not reintroduce C32 slabs.
- Do not add global A/Ai scratch in this phase.
- Do not import vLLM CuteDSL/TMA kernels.
- Do not route decode into chunked GDN; keep `GDN_CHUNK_MIN > 1`.
- Do not change default inference behavior unless gates prove a win.
## Gates
Correctness:
- Build `test-backend-ops` and `llama-completion` on DGX.
- Run default and forced-candidate `GATED_DELTA_NET` CUDA0 gates.
- Run canonical MoE and dense greedy md5 gates:
- MoE: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense: `5951a5b4d624ce891e22ab5fca9bc439`.
- If md5 changes, stop and run the existing KL gate before any performance
claim.
Performance:
- Compare against current M5 with `GDN_TC=5 GDN_CHUNK_MIN=64`.
- Run MoE and dense `llama-batched-bench` at `npp=512,2048`, `ntg=4`,
`npl=32`.
- Reject if either model regresses outside noise or if the GDN bucket does not
improve in a profile-gated follow-up.
## Decision Rule
Accept only if the candidate is md5/KL-safe and improves end-to-end S_PP. If it
is flat or slower, record it as rejected and move to a larger shared-A/Ai
blocked-solve design, likely requiring a separate scratch/precompute phase.