From 04ed7fe52fca20b96216cc6d8b55e7a749c03120 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Wed, 1 Jul 2026 15:13:45 +0000
Subject: [PATCH] docs(paged): record GDN launch sweep phase

Assisted-by: Codex:gpt-5
---
 .../llama-cpp-localai-paged/docs/BENCHMARK.md | 48 ++++++++++++++++---
 .../docs/PARITY_HANDOFF.md                    | 16 +++++++
 2 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
index c0b77c980..5d92f0876 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
@@ -12,12 +12,13 @@ with artifact path, gates, benchmark rows, and decision.
 - Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
 - Current tested source: DGX mirror
   `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
-- Latest attempt: Phase77.
-- Latest decision: decode-only GB10 graph-node profile confirms GDN recurrence
-  is a real current decode bucket. In an isolated n=128 decode window, GDN was
-  `41.20%` of GPU kernel time and `gdn_core` alone was `38.95%`, slightly above
-  `mmq_nvfp4` (`38.26%`). This funds a default-off GDN decode A/B/PoC, with
-  md5/op gates and bucket reduction required before any merge/default change.
+- Latest attempt: Phase78.
+- Latest decision: GDN decode launch-shape env sweep did not beat the current
+  default. `GDN_NW=8 GDN_CPW=8` was correctness-clean but slower
+  (`gdn_core 1443.55 ms` vs Phase77 default `1408.33 ms`), and
+  `GDN_NW=16 GDN_CPW=4` failed the `MUL_MAT_ID` op gate. Keep the current
+  default launch shape; the next source path must be a real decode-kernel
+  structural A/B, not another `GDN_NW`/`GDN_CPW` retune.
 
 ## Current Serving Record
 
@@ -57,6 +58,41 @@ Decision:
 
 ## Attempt Log
 
+### Phase78: GDN Decode Launch-Shape Sweep
+
+- Date: 2026-07-01.
+- Baseline artifact:
+  `/home/mudler/bench/phase77_moe_decode_only_profile/20260701_150134`.
+- Sweep artifacts:
+  - `/home/mudler/bench/phase78_gdn_launch_sweep/nw8_cpw8_20260701_150654`
+  - `/home/mudler/bench/phase78_gdn_launch_sweep/nw16_cpw4_20260701_150954`
+- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Result type: env-gated launch-shape sweep only; no source change.
+- Shape: same as Phase77 decode-only graph-node profile.
+
+Result:
+
+| arm | env | gate status | GDN ms | GDN share | `gdn_core` ms | `gdn_core` share | `mmq_nvfp4` ms |
+|-----|-----|-------------|-------:|----------:|--------------:|-----------------:|---------------:|
+| Phase77 default | none (default `GDN_NW=16 GDN_CPW=8`) | pre/post green | `1489.71` | `41.20%` | `1408.33` | `38.95%` | `1383.50` |
+| sweep `8x8` | `GDN_NW=8 GDN_CPW=8` | pre/post green | `1525.86` | `41.94%` | `1443.55` | `39.68%` | `1366.33` |
+| sweep `16x4` | `GDN_NW=16 GDN_CPW=4` | rejected | not run | not run | not run | not run | not run |
+
+Gate detail:
+
+- `8x8`: pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+  `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`,
+  `MUL_MAT_ID 806/806`.
+- `16x4`: completion md5 and `MUL_MAT 1146/1146` passed, but
+  `MUL_MAT_ID` failed `805/806`; rejected before profiling.
+
+Decision:
+
+- Keep the current default `GDN_NW=16 GDN_CPW=8`.
+- Do not spend more GB10 time on launch-shape retunes without a new hypothesis.
+- The funded source path remains a structural, default-off GDN decode A/B/PoC
+  that reduces the Phase77 `gdn_core` bucket, not another sweep of existing env knobs.
+
 ### Phase77: MoE Decode-Only Graph-Node Profile
 
 - Date: 2026-07-01.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 85a5be39d..6a21c05e0 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -1326,3 +1326,19 @@ based on vLLM's direct recurrent/packed decode structure. The next patch must
 prove a material reduction in the Phase77 `gdn_core` bucket, keep canonical md5
 and op gates green, and avoid serving/decode throughput regression under the
 same decode-only capture shape before it can be considered for merge or default.
+
+Phase78 launch-shape sweep:
+
+- Baseline: Phase77 default launch shape (`GDN_NW=16 GDN_CPW=8`) had
+  `gdn_core 1408.33 ms` (`38.95%`) in the decode-only window.
+- `GDN_NW=8 GDN_CPW=8` artifact:
+  `/home/mudler/bench/phase78_gdn_launch_sweep/nw8_cpw8_20260701_150654`.
+  Gates were green, but `gdn_core` worsened to `1443.55 ms` (`39.68%`).
+- `GDN_NW=16 GDN_CPW=4` artifact:
+  `/home/mudler/bench/phase78_gdn_launch_sweep/nw16_cpw4_20260701_150954`.
+  Rejected before profiling: `MUL_MAT_ID` failed `805/806`.
+
+Decision: keep the default `GDN_NW=16 GDN_CPW=8`. Do not retry
+`GDN_NW`/`GDN_CPW` launch-shape retunes unless a new profile gives a specific
+reason. The next funded GB10 source work must be structural, default-off, and
+measured against the Phase77 decode-only `gdn_core` bucket.