docs(paged): reject GDN C32 slab phase

Record the default-off C32 slab experiment, its md5 gates, the dense tail-row fix, and the performance regression that rejects the source patch.

Assisted-by: Codex:gpt-5
2026-07-02 20:37:03 -04:00 · 2026-07-01 01:15:00 +00:00
parent ff3ad84191
commit 3da3b169fb
5 changed files with 170 additions and 15 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -874,6 +874,57 @@ Source check:

 Decision:

- Do not ship a Phase 10 source patch yet.
- Keep the baseline and source check as the entry gate for the next C32 slab
-  implementation task.
+- A default-off C32 slab candidate was implemented and rejected by the
+  performance gate.
+- The candidate was correctness-clean only after fixing a tail-chunk staging
+  bug: rows `t >= Cc` in the staged `U=T*RHS` copy-back must be zeroed before
+  state/output math. Before that fix, the dense gate produced a degenerate
+  transcript even though the focused op gate passed.
+- After the tail fix, both default and forced-C32 modes matched the canonical
+  md5 gates exactly:
+  - MoE: `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Dense: `5951a5b4d624ce891e22ab5fca9bc439`.
+- KL was not needed because md5 stayed stable after the tail fix.
+
+Correctness artifacts:
+
+- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_default_after_tailfix.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_c32_slab_after_tailfix.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_moe_default_after_tailfix.md5`
+- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_dense_default_after_tailfix.md5`
+- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_moe_c32_after_tailfix.md5`
+- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_dense_c32_after_tailfix.md5`
+
+Performance A/B artifacts:
+
+- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_base.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_c32.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_base.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_c32.txt`
+
+Performance A/B:
+
+| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s |
+|-------|------|----|----|---|----------|----------|-------|
+| MoE | M5 base | 512 | 4 | 32 | 2323.48 | 397.57 | 2239.39 |
+| MoE | C32 slab | 512 | 4 | 32 | 2069.12 | 357.43 | 1995.06 |
+| MoE | M5 base | 2048 | 4 | 32 | 2430.32 | 388.29 | 2405.66 |
+| MoE | C32 slab | 2048 | 4 | 32 | 2054.86 | 388.01 | 2037.79 |
+| Dense | M5 base | 512 | 4 | 32 | 975.10 | 140.53 | 932.19 |
+| Dense | C32 slab | 512 | 4 | 32 | 866.29 | 144.03 | 833.87 |
+| Dense | M5 base | 2048 | 4 | 32 | 1019.25 | 183.25 | 1010.26 |
+| Dense | C32 slab | 2048 | 4 | 32 | 903.73 | 183.47 | 896.86 |
+
+Rejected diff:
+
+- `/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff`
+
+Conclusion:
+
+- Do not ship Phase 10 C32 slab as implemented.
+- C32 slab is not a viable shortcut toward parity because the duplicated
+  `A/T` recomputation per value slab outweighs the intended state-traffic
+  reduction.
+- A future GDN prefill attempt should either share the `A/T` work across value
+  slabs or switch to a different FLA-style chunk design; it should not repeat
+  this env-gated two-slab M5 variant.
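
Reviewer note (not part of the patch): the size of the S_PP regression can be read off the Performance A/B table in this hunk. The sketch below only re-derives percent changes from the recorded S_PP values; the dictionary layout and the `relative_change` helper are illustrative names, not anything in the benchmark tooling.

```python
# S_PP (prefill tokens/s) copied from the Performance A/B table:
# (model, PP) -> (M5 base, C32 slab)
S_PP = {
    ("MoE", 512): (2323.48, 2069.12),
    ("MoE", 2048): (2430.32, 2054.86),
    ("Dense", 512): (975.10, 866.29),
    ("Dense", 2048): (1019.25, 903.73),
}


def relative_change(base, candidate):
    """Percent change of candidate vs base; negative means slower."""
    return (candidate - base) / base * 100.0


deltas = {key: relative_change(*pair) for key, pair in S_PP.items()}

for (model, pp), pct in deltas.items():
    print(f"{model:5s} PP={pp:4d}  S_PP change: {pct:+.2f}%")
```

On these four rows the C32 slab candidate is slower in every case, by roughly 11% to 15% S_PP, which is consistent with the rejection recorded in the Conclusion.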
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -173,6 +173,7 @@ GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_ga
 | bf16-C16 | bf16 Gram at C=16 | REJECTED | no win; bf16 mantissa unsafe on state-coupled products |
 | BV block-occupancy A/B (tf32) | raise blocks/SM | REJECTED (occupancy NOT the bound) | 1844 vs 1814 S_PP (-1.04%, within noise) |
 | bf16-C64 | bf16 Gram at C=64 | REJECTED | -18.75%; O(C^2) intra-chunk + serial recurrence dominates |
+| Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 |

 Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). M5 captures the tractable TC part; it does not fully close 2.62x because vLLM's FLA blocked-solve is a more complete TC implementation.
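
Reviewer note (not part of the patch): the shared-memory arithmetic behind the "64 KB" figure, and why a `dv_tile=64` slab was tempting, can be checked in a few lines. This is a rough sketch that assumes each slab holds a 128 x `dv_tile` f32 state tile; it does not model the other buffers the kernel keeps in shared memory, so it is not a full budget.

```python
F32_BYTES = 4
KIB = 1024
SMEM_CAP_KIB = 99  # GB10 dynamic-smem cap quoted in the paragraph above


def state_kib(s_k, dv_tile):
    """KiB needed for an s_k x dv_tile f32 state tile (assumed layout)."""
    return s_k * dv_tile * F32_BYTES // KIB


full_state = state_kib(128, 128)  # current M5: whole 128x128 state
slab_state = state_kib(128, 64)   # hypothetical per-slab state at dv_tile=64

print(f"full state: {full_state} KiB, left under cap: {SMEM_CAP_KIB - full_state} KiB")
print(f"slab state: {slab_state} KiB, left under cap: {SMEM_CAP_KIB - slab_state} KiB")
```

Halving the value tile frees about 32 KiB under the cap, which is the headroom the C=32 variant tried to spend on the larger chunk.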

--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
@@ -172,6 +172,7 @@ products through tensor cores. The series chased that headroom.
 | bf16-C16 | bf16 Gram at C=16 | rejected | no win over tf32-M5; bf16 mantissa unsafe on the state-coupled products | GDN build-plan s4 |
 | BV block-occupancy A/B (tf32) | raise blocks/SM to test if occupancy is the bound | **REJECTED** (occupancy is NOT the bound; latency is wave-hidden) | two arms statistically equal: **1844 vs 1814 S_PP (-1.04%, within noise)** | GDNAB armA/armB |
 | bf16-C64 | bf16 Gram at the larger C=64 chunk | **REJECTED** | **-18.75%** - the O(C^2) intra-chunk triangular-solve + serial recurrence dominates, so growing C hurts | recorded verdict / GDN build-plan |
+| Phase 10 C32 slab M5 | C=32 with two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | **REJECTED** | md5-clean after tail-row zeroing, but S_PP regressed: MoE 2048 **2430.32 -> 2054.86**, dense 2048 **1019.25 -> 903.73** | phase10 gates/ab |

 **Why the bottleneck is not occupancy/dtype:** the cost is the **O(C^2)
 intra-chunk triangular solve + the serial inter-chunk recurrence dependency**, not
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -459,6 +459,40 @@ scope until a serving phase proves target-verification cost and rollback safety.

 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

+### Phase 10 GDN C32 slab update
+
+Phase 10 tested the tempting low-conflict shortcut for #101: keep the current
+M5 tensor-core GDN form, raise the chunk to `C=32`, and split the value
+dimension into two `dv_tile=64` slabs to stay within shared memory.
+
+Result:
+
+- The shortcut cannot be a launcher-only change. C32 requires staging
+  `U=T*RHS` because the existing M5 apply path relies on one 16-row tile being
+  held in registers before overwriting `Ud`.
+- A default-off `GDN_C32_SLAB=1` candidate was built and md5-gated.
+- The first candidate exposed a dense-only transcript failure on tail chunks;
+  root cause was copying uninitialized staged rows for `t >= Cc` back into
+  `Ud`. Zeroing those rows restored both canonical md5 gates:
+  MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+  `5951a5b4d624ce891e22ab5fca9bc439`.
+- Performance regressed after correctness was fixed:
+  MoE 2048 S_PP `2430.32 -> 2054.86`; dense 2048 S_PP `1019.25 -> 903.73`.
+
+Decision:
+
+- **REJECT** the two-slab C32 M5 variant.
+- Do not add it to the LocalAI patch stack.
+- The likely blocker is duplicated A/T recomputation per value slab; future GDN
+  work must share that work across slabs or move to a different FLA-style
+  chunked design rather than repeating this env-gated shortcut.
+
+Artifacts:
+
+- `/home/mudler/bench/phase10_gdn_c32_slab/gates/`
+- `/home/mudler/bench/phase10_gdn_c32_slab/ab/`
+- `/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff`
+
 ---

 # PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026)
--- a/docs/superpowers/plans/2026-07-01-gdn-c32-slab-phase10.md
+++ b/docs/superpowers/plans/2026-07-01-gdn-c32-slab-phase10.md
@@ -121,7 +121,7 @@ Implication:
 - Because the candidate changes the solve/apply mechanics, it requires a
  focused `GATED_DELTA_NET` op gate before any prefill A/B.

- [ ] **Step 1: Add an explicit env selector**
+- [x] **Step 1: Add an explicit env selector**

 Use an env var such as:

@@ -131,7 +131,7 @@ GDN_C32_SLAB=1

 The default path must stay current M5.

- [ ] **Step 2: Introduce a C=32, dv_tile=64 launch**
+- [x] **Step 2: Introduce a C=32, dv_tile=64 launch**

 Target shape:

@@ -147,7 +147,7 @@ Initial prototype rules:
 - no decode routing,
 - no D2H synchronization.

- [ ] **Step 3: Build on DGX**
+- [x] **Step 3: Build on DGX**

 Run:

@@ -157,12 +157,25 @@ ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda && cmake --build .

 Expected: build succeeds.

+Result:
+
+- Candidate implemented in the llama.cpp fork as a default-off
+  `GDN_C32_SLAB=1` path.
+- Kernel generalized to `DV_TILE=64`, with two value slabs for `S_v=128`.
+- C32 `U=T*RHS` writes were staged through a slab-local `Utmp` buffer to avoid
+  read/write aliasing against the RHS in `Ud`.
+- Initial md5 failed on dense because tail chunks copied uninitialized staged
+  rows back into `Ud`; the root-cause fix zeroed `t >= Cc` rows during the
+  staged copy-back.
+- DGX build succeeded after the tail fix:
+  `cmake --build . --target test-backend-ops llama-completion -j 8`.
+
 ## Task 3: Correctness Gates

 **Files:**
 - Artifact: `/home/mudler/bench/phase10_gdn_c32_slab/gates/`

- [ ] **Step 1: Run `GATED_DELTA_NET` op gate**
+- [x] **Step 1: Run `GATED_DELTA_NET` op gate**

 Run default and forced C32 slab modes:

@@ -180,7 +193,7 @@ Required coverage to inspect in logs:
 - permuted layout,
 - adversarial decay.

- [ ] **Step 2: Run canonical md5 gates**
+- [x] **Step 2: Run canonical md5 gates**

 Run MoE and dense greedy gates with and without `GDN_C32_SLAB=1`.

@@ -191,18 +204,34 @@ MoE   8cb0ce23777bf55f92f63d0292c756b0
 Dense 5951a5b4d624ce891e22ab5fca9bc439
 ```

- [ ] **Step 3: Run KL gate if md5 changes**
+- [x] **Step 3: Run KL gate if md5 changes**

 If the C32 slab path changes reduction order and therefore md5, stop and run the
 existing KL procedure from `PAGED_BITEXACT_NOTE.md`. Keep the patch only if the
 new path is KL-benign and no worse than current M5.

+Result:
+
+- Default op gate:
+  `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_default_after_tailfix.txt`
+- Forced C32 op gate:
+  `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_c32_slab_after_tailfix.txt`
+- Both `GATED_DELTA_NET` CUDA0 gates passed.
+- Canonical default md5 after tail fix:
+  - MoE: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense: `5951a5b4d624ce891e22ab5fca9bc439`
+- Forced C32 md5 after tail fix:
+  - MoE: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense: `5951a5b4d624ce891e22ab5fca9bc439`
+- KL gate was not needed because the md5 gates matched the canonical outputs
+  exactly after the tail-row fix.
+
 ## Task 4: Performance A/B

 **Files:**
 - Artifact: `/home/mudler/bench/phase10_gdn_c32_slab/ab/`

- [ ] **Step 1: Run C32 slab prefill at `npp=512`**
+- [x] **Step 1: Run C32 slab prefill at `npp=512`**

 Compare:

@@ -213,30 +242,69 @@ candidate: GDN_TC=5 GDN_CHUNK_MIN=64 GDN_C32_SLAB=1

 Pass: candidate beats current M5 S_PP outside noise.

- [ ] **Step 2: Run C32 slab prefill at `npp=2048`**
+- [x] **Step 2: Run C32 slab prefill at `npp=2048`**

 Use the same A/B. Pass requires a net S_PP improvement or a clear GDN bucket
 reduction without a larger regression elsewhere.

- [ ] **Step 3: Reject if duplicated A/T work cancels the state-traffic win**
+- [x] **Step 3: Reject if duplicated A/T work cancels the state-traffic win**

 If the candidate only shifts time between A/T recomputation and state traffic
 without a net win, save the diff as a rejected artifact and update this plan.

+Result:
+
+Artifacts:
+
+- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_base.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_c32.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_base.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_c32.txt`
+
+| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s |
+|-------|------|----|----|---|----------|----------|-------|
+| MoE | M5 base | 512 | 4 | 32 | 2323.48 | 397.57 | 2239.39 |
+| MoE | C32 slab | 512 | 4 | 32 | 2069.12 | 357.43 | 1995.06 |
+| MoE | M5 base | 2048 | 4 | 32 | 2430.32 | 388.29 | 2405.66 |
+| MoE | C32 slab | 2048 | 4 | 32 | 2054.86 | 388.01 | 2037.79 |
+| Dense | M5 base | 512 | 4 | 32 | 975.10 | 140.53 | 932.19 |
+| Dense | C32 slab | 512 | 4 | 32 | 866.29 | 144.03 | 833.87 |
+| Dense | M5 base | 2048 | 4 | 32 | 1019.25 | 183.25 | 1010.26 |
+| Dense | C32 slab | 2048 | 4 | 32 | 903.73 | 183.47 | 896.86 |
+
+Decision:
+
+- Reject the C32 slab source patch.
+- The candidate is correctness-clean after tail-row zeroing, but it regresses
+  S_PP in both model families.
+- The likely mechanism is that recomputing `A/T` once per value slab cancels
+  the intended state-traffic win; optimizing this would require a broader
+  shared-work design rather than a small, low-conflict shortcut patch.
+- Rejected diff saved at:
+  `/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff`.
+
 ## Task 5: Mirror or Reject

 **Files:**
 - Create if accepted: `backend/cpp/llama-cpp-localai-paged/patches/paged/0055-...patch`
 - Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`

- [ ] **Step 1: Commit accepted fork patch**
+- [x] **Step 1: Commit accepted fork patch**

 Commit only after correctness and performance gates pass.

- [ ] **Step 2: Generate LocalAI patch**
+- [x] **Step 2: Generate LocalAI patch**

 Use `git format-patch`; do not hand-edit the generated patch.

- [ ] **Step 3: Update docs**
+- [x] **Step 3: Update docs**

 Record exact artifacts, md5/KL results, and performance decision.
+
+Result:
+
+- No fork commit and no LocalAI patch were generated because Phase 10 was
+  rejected by the performance gate.
+- The llama.cpp fork and DGX mirror were restored to the prior accepted state.
+- This plan and the parity docs record the rejected source candidate so it is
+  not repeated as an accidental "obvious" follow-up.
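
Reviewer note (not part of the patch): the tail-row fix recorded under Task 2 can be illustrated with a toy host-side model. The real change lives in a CUDA kernel that is not shown in this commit; the function name `copy_back_staged` and the tiny shapes are hypothetical, and only the names `Utmp`, `Ud`, and `Cc` come from the plan text.

```python
def copy_back_staged(utmp, ud, cc):
    """Copy staged U rows from utmp into ud, zeroing tail rows t >= cc.

    utmp: staged buffer with C rows; only rows [0, cc) were written.
    ud:   destination with C rows, later read by state/output math.
    cc:   number of valid rows in this (possibly partial) chunk.
    """
    n_cols = len(ud[0])
    for t in range(len(ud)):
        if t < cc:
            ud[t] = list(utmp[t])
        else:
            # Rows past the chunk tail were never staged: write zeros
            # instead of copying whatever happens to be in utmp.
            ud[t] = [0.0] * n_cols
    return ud


C = 4    # toy chunk size (the candidate used C=32)
Cc = 3   # valid rows in a tail chunk
GARBAGE = float("nan")  # stands in for uninitialized staged memory

utmp = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [GARBAGE, GARBAGE]]
ud = [[9.0, 9.0] for _ in range(C)]
copy_back_staged(utmp, ud, Cc)
print(ud)
```

Without the `else` branch the last row of `ud` would carry the staged garbage into later math, which matches the dense-only transcript failure described in the plan; a focused op gate may not hit a partial tail chunk, so it can pass while the end-to-end md5 gate fails.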