From 5b8b33a302b1d1b9f20b8bcbb71c13e7c8bdf416 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 2 Jul 2026 21:04:37 +0000
Subject: [PATCH] docs(paged): record P5 FLA GDN NO-GO - GDN prefill bucket is
 a confirmed shared-hardware floor

P5 ported the full six-kernel vLLM-FLA chunk_gated_delta_rule_fwd pipeline
(cumsum, chunk_scaled_dot_kkt, blocked merge_16x16_to_64x64 solve_tril,
recompute_w_u, register/smem-resident chunk_gated_delta_rule_fwd_h with the
chunk loop in-kernel, chunk_fwd_o) to CUDA tf32 mma, per-kernel validated vs
host fp64 (o NMSE 2.2e-7, final-state 1.2e-7), integrated behind
LLAMA_GDN_FLA_CHUNK=1 (default-off), and A/B'd in-backend vs the shipped M5
chunked scan.

P0 perf kill-gate FAILED decisively: nsys --cuda-graph-trace=node, MoE
q36-35b-a3b, npp2048 M5 56.31 vs FLA 119.46 us/tok (FLA 2.12x SLOWER,
gdn_delta_pct -112.1); npp512 2.29x slower; end-to-end S_PP -13.33%/-13.12%.
GO required FLA >10% faster at npp2048.

Novel decomposition: the blocked solve_tril is only ~2.8% of the FLA bucket
(55.6 ms); fwd_h 46.2% (903 ms) + fwd_o 31.5% (617 ms) dominate. The cost is
the state-recurrence GEMMs plus per-chunk h-state materialization to global
LPDDR5x that FLA's split-kernel structure forces; the fused M5 single kernel
keeps the 128x128 state resident in smem and never materializes per-chunk h,
so it is 2.1x faster on GB10's low-bandwidth memory. So the GDN prefill bucket
(+59.2, the single largest prefill lever) is a confirmed shared-hardware /
memory-bandwidth floor, NOT recoverable by the blocked-solve algorithm.
Extends Phase74 (standalone blocked-inverse 0.59x) and bf16-C64 (-18.75%).

Gates: SMEM PASS (max 96KB < 99KB cap). KL band GREEN (FLA KLD 0.137028 vs
control 0.136563, delta +0.000465 < 0.01; same-top-p 84.61%). DEFAULT path
untouched (canonical md5 GREEN both models default-off AND FLA-on; MoE
8cb0ce23, dense 5951a5b4; GATED_DELTA_NET DEFAULT 46/46). Nothing landed;
series stays at 46 patches. WIP on DGX fork branch p5-fla-gdn (2d64c37f0),
NOT pushed.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 .../docs/EXECUTION_REARCH_SCOPE.md            | 106 +++++++++++++++++-
 .../docs/PARITY_HANDOFF.md                    |  84 ++++++++++++++
 2 files changed, 184 insertions(+), 6 deletions(-)

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md b/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md
index 993c68268..f6d64e0be 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md
@@ -714,6 +714,96 @@ expected result and any throughput payoff lives on non-GB10 silicon (out of scop
   (6 patches) and upstream may add its own GDN paths; keep the new pipeline in a
   separate `.cu` and gate the dispatch narrowly.
 
+#### P5 RESULT (NO-GO at the P0 perf kill-gate, recorded 2026-07-02, `LLAMA_GDN_FLA_CHUNK`, default-off) - the GDN prefill bucket is now a CONFIRMED SHARED-HARDWARE FLOOR
+
+The full six-kernel vLLM-FLA `chunk_gated_delta_rule_fwd` pipeline was **ported to
+CUDA tf32 mma, per-kernel validated against a host fp64 reference, integrated behind
+`LLAMA_GDN_FLA_CHUNK=1` (default-off), and A/B'd in-backend** against the shipped M5
+f32 chunked scan. It **lost decisively** and by the wrong sign, so `go=false` was the
+kill-gate default, **nothing was built beyond P0, and nothing landed.** This is the
+**scope-anticipated "expected null"** (the P5 section framed this as the program's
+strictest kill-gate given Phase74's standalone blocked-inverse 0.59x), but the phase
+delivered the one thing the prior evidence lacked: **the whole FLA pipeline run
+in-backend with the register/smem-resident inter-chunk state and the chunk loop
+in-kernel** - the exact form that "was never actually tested in-backend." It was tested
+here, and the result **settles the GDN prefill bucket (bucket 1, +59.2, the single
+largest prefill lever) as a shared-hardware / memory-bandwidth floor on GB10.**
+
+- **PERF GO GATE FAILED DECISIVELY (the decisive result).** GO required the in-pipeline
+  blocked `solve_tril` to beat the M5 f32 chunked scan by **> 10% at npp2048**.
+  Measured (nsys `--cuda-graph-trace=node`, MoE `q36-35b-a3b-nvfp4`, per distinct token
+  over the 30 GDN layers): **npp2048 M5 56.31 vs FLA 119.46 us/tok = FLA 2.12x SLOWER**
+  (`gdn_delta_pct_2048 = -112.1`); **npp512 M5 51.23 vs FLA 117.35 = 2.29x slower**.
+  End-to-end **S_PP regressed MoE -13.33% @npp2048 / -13.12% @npp512** (3-rep medians;
+  clears `max(2%, 3 sigma)` by a wide margin, and it is the wrong sign, so there is no
+  3-sigma question). The shipped M5 remains `gdn_core` at **56.31 us/tok = 64.82% of
+  vLLM's FLA chunk-64 36.5 us/tok on this GB10**; the rejected FLA port was only **30.55%
+  of vLLM** (36.5/119.46) - a regression, not a recovery. This reproduces Phase74's
+  standalone blocked-inverse 0.59x and extends bf16-C64 (-18.75%), now **confirmed
+  in-backend** with the register-resident state + in-kernel chunk loop.
+- **WHERE THE TIME WENT (the novel, valuable decomposition - the reason this NO-GO
+  matters beyond a rejection).** Per-kernel nsys share of the FLA bucket: the **blocked
+  `solve_tril` is only ~2.8% (55.6 ms)** - the algorithm the whole phase was about is
+  *cheap*. The bucket is dominated by **`chunk_gated_delta_rule_fwd_h` 46.2% (903 ms) +
+  `chunk_fwd_o` 31.5% (617 ms)**: the inter-chunk state-recurrence GEMMs plus the
+  **per-chunk h-state materialization to global LPDDR5x** that FLA's split-kernel
+  structure forces (`fwd_h` writes `h_pre` per chunk, `fwd_o` re-reads it). The fused M5
+  single kernel keeps the 128x128 state **resident in smem and never materializes
+  per-chunk h**, so it is **2.1x faster on GB10's low-bandwidth memory.** So the novel
+  finding vs all prior evidence: **the blocked solve itself is not the floor - the floor
+  is the state-GEMM + h-materialization region, which the FLA structure makes WORSE than
+  M5, not better.** This is exactly the "materialize-everything tax" the scope warns of.
+  The binding silicon property is **memory bandwidth** (per-chunk h round-trips to
+  LPDDR5x), compounded by the **99 KB smem cap** that forces the FLA split (`fwd_h` and
+  `fwd_o` cannot co-reside), not the mma shapes or wave count.
+- **SMEM GATE PASSES (all six kernels under the 99 KB opt-in cap at C=64;
+  `cudaOccupancyMaxActiveBlocksPerMultiprocessor`):** `k_kkt` 48 KB / 2 blk, `k_solve`
+  38 KB / 2 blk, `k_wu` 48 KB / 2 blk, `k_fwdh` 80 KB / 1 blk, `k_fwdo` 96 KB / 1 blk -
+  **max 96 KB < 99 KB.** The kernels fit; they are simply bandwidth-floored above M5.
+- **KL BAND GREEN / IN-BAND (model numerics sound):** FLA `KLD 0.137028` vs control
+  `0.136563` = **delta +0.000465 < 0.01**; same-top-p **84.61% vs 83.73%** control
+  (>= 84% baseline; FLA marginally better). Per-kernel bring-up validation vs host fp64
+  on synthetic shapes: **o NMSE 2.2e-7, final-state 1.2e-7** (done BEFORE integration,
+  per the "do not debug six kernels blind" rule).
+- **DEFAULT PATH UNTOUCHED (canonical md5 GREEN with the code present):** paged-MoE
+  `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, both
+  **default-off AND `LLAMA_GDN_FLA_CHUNK`-on** (the small-M greedy path bails to M5).
+  `test-backend-ops GATED_DELTA_NET` **DEFAULT 46/46 OK.** Decode untouched
+  (`GDN_CHUNK_MIN` untouched; decode stays on the sequential recurrence).
+- **`test-backend-ops` env-on = 43-44/46 (`gdn_op_tests_env_on_green=false`; explicit
+  tolerance judgment).** The FLA-engaged `head_size=128, n_seq_tokens>=64` cases
+  marginally exceed the test's `1e-7` threshold (**ERR 1.03-1.06e-7**, fluctuating
+  across the boundary run-to-run) because this port uses **plain tf32** where the shipped
+  M5 uses **3xtf32 (CUTLASS fp32-emulation)** for the decay-coupled compounding state
+  products; M5-chunked (`LLAMA_KV_PAGED=1`, no FLA) passes the SAME cases at `< 1e-7`.
+  Judgment: a marginal tf32-vs-3xtf32 accuracy gap, **benign at the model level (KL
+  green)**; tightening the port to 3xtf32 would only add mma count and **deepen** the
+  perf NO-GO, so it was not pursued.
+- **Engagement PROVEN:** `LLAMA_GDN_FLA_TRACE` fired `[gdn-fla] engage H=32 n_seqs=N
+  n_tokens=128 NT=2` in `batched-bench`; nsys shows all six `gdn_fla::` kernels executing
+  under `LLAMA_GDN_FLA_CHUNK=1` and none under default. Protocols honored: GPU lock held
+  throughout and released; `LLAMA_MAX_BATCH_TOKENS` unset; sm_121a; nsys
+  `--cuda-graph-trace=node`; 3+ iter S_PP medians; no external contention.
+- **Provenance.** WIP on the DGX fork topic branch `p5-fla-gdn` at
+  `2d64c37f08ad323038a44a89ab32189527c6ba29` (base `localai-paged` `653bb2f3d`, **NOT
+  pushed, NOT landed**): new `ggml/src/ggml-cuda/gdn-blocked-solve.cu` + narrow dispatch
+  in `gated_delta_net.cu` / `gated_delta_net.cuh`. Fork `localai-paged` HEAD **untouched
+  at `653bb2f3d`**; the LocalAI series **stays at 46 patches (`0001-0055`)**; topic
+  branches `p1-bf16-stream` / `p2-moe-region` / `p4-cbv2` left intact. Artifacts on the
+  DGX `~/bench/p5_fla_gdn/`: `killgate_20260702_204225/` (RESULTS.md, spp_control.txt,
+  spp_fla.txt, `nsys_{ctrl,fla}{2048,512}.{nsys-rep,kern.csv}`, GATES.txt,
+  `kl_moe_{ctrl,fla}.log`, occupancy.txt, gdn-blocked-solve.cu, p5_fla_test.cu) and
+  `standalone_20260702_203434/` (RESULTS.txt + p5_fla_test.cu, p5_m5_time.cu,
+  m5_kernel_body.cuh).
+- **Honest delta vs the +59.2 expectation.** The scope's conservative expected recovery
+  was **~0-10 of the +59.2, "likely a shared-hardware floor."** Delivered: **0 recovery,
+  a -63 us/tok regression on the FLA arm**; the floor is **confirmed**. The shipped M5
+  fused smem-resident chunked scan (56.31 us/tok) is the winner and is **at or near the
+  GB10 memory-bandwidth floor for this op.** This closes the last speculative prefill
+  lever in the program. What binds is silicon (LPDDR5x bandwidth on the per-chunk h
+  round-trip + the 99 KB smem cap forcing the split), not the algorithm; it lifts only on
+  datacenter Blackwell (HBM + larger smem + TMEM), consistent with section 4's framing.
+
 ### P6: FP8 KV cache + smaller dtype/bandwidth items
 
 - **Goal:** halve decode-time KV cache traffic (K/V stored fp8-e4m3 with a scale) and
@@ -749,7 +839,7 @@ Conservative, showing the math. Baselines from section 2.
 | 3 dtype boundary tax | +36.6 | P1 | ~30 |
 | 4 norms/glue (part) | +37.2 | P1 (norms) + P6 (FP8 proj) | ~18 |
 | 2 GEMM tiling | +56.5 | P2 + P3 | ~40 |
-| 1 GDN scan | +59.2 | P5 (speculative) | ~0-10 |
+| 1 GDN scan | +59.2 | P5 (NO-GO, CONFIRMED FLOOR) | 0 |
 | 5 dispatch | +5.9 | P2/P4 | ~3 |
 
 Recovered ~91-101 us/tok of 198.9. New paged wall ~295-305 us/tok. **Prefill S_PP goes
@@ -819,11 +909,15 @@ honest ceiling on GB10 is roughly **prefill ~55-65%, serving-agg ~80%, decode-GP
 
 ## 6. Risks and open questions
 
-- **P5 is likely a shared-hardware floor.** Phase74's standalone blocked-inverse ran at
-  0.59x the direct solve, and the 99 KB smem cap forces C=16. Open question: does the
-  full FLA pipeline *in-backend* (register-resident inter-chunk state, chunk loop
-  in-kernel) behave differently from the standalone bench? If not, P5 recovers ~0 and is
-  recorded as a confirmed floor. Rank it last-but-one and gate it hardest.
+- **P5 is a shared-hardware floor - RESOLVED / CONFIRMED (2026-07-02, see the P5 RESULT
+  above).** Phase74's standalone blocked-inverse ran at 0.59x the direct solve. The open
+  question was whether the full FLA pipeline *in-backend* (register-resident inter-chunk
+  state, chunk loop in-kernel) behaves differently from the standalone bench. **Answer:
+  no - it is 2.12x SLOWER than M5 at npp2048 (119.46 vs 56.31 us/tok), S_PP -13.3%.** The
+  per-kernel decomposition showed the blocked solve is only 2.8% of the bucket; the floor
+  is the state-GEMM + per-chunk h-materialization to LPDDR5x that FLA's split-kernel
+  structure forces (and the 99 KB smem cap that forces that split). P5 recovers 0 and is a
+  **confirmed shared-hardware / memory-bandwidth floor.**
 - **P1 segment-boundary converts.** Option A keeps f32 at segment edges; if the q36
   residual stream has many short segments, the boundary converts could eat the win.
   Open: how many bf16 segments survive across a q36 layer, and does the shared-expert
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 4abaf8583..fd112ea95 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -2564,3 +2564,87 @@ determinism tsvs), `det2_20260702_193123/` + `det3_20260702_193649/` +
 `det4_20260702_194040/` (determinism diff-matrix), `perf_20260702_194359/`
 (raw_*.json + auto-written RESULTS.md). Environment: `LLAMA_KV_PAGED=1
 LLAMA_MOE_FORCE_GRAPHS=1`, `LLAMA_MAX_BATCH_TOKENS` unset, sm_121a, GPU lock held.
+
+## P5 FLA-faithful GDN prefill scan (blocked solve_tril port) - NO-GO at the perf kill-gate; the GDN prefill bucket is a CONFIRMED SHARED-HARDWARE FLOOR (recorded 2026-07-02)
+
+Fifth phase of the `EXECUTION_REARCH_SCOPE.md` additive program, and its **strictest
+kill-gate**. The full six-kernel vLLM-FLA `chunk_gated_delta_rule_fwd` pipeline was
+**ported to CUDA tf32 mma, per-kernel validated vs a host fp64 reference, integrated
+behind `LLAMA_GDN_FLA_CHUNK=1` (default-off), and A/B'd in-backend** against the shipped
+M5 f32 chunked scan. It lost decisively and by the wrong sign, so `go=false` was the
+kill-gate default, nothing was built beyond P0, and nothing landed. See the "P5 RESULT"
+subsection in `EXECUTION_REARCH_SCOPE.md` for the full record; this closes the last
+speculative prefill lever in the program. Summary and provenance:
+
+- **Verdict: NO-GO / DO-NOT-SHIP at the perf gate (the scope-anticipated "expected
+  null").** The P5 section framed this as the strictest kill-gate given Phase74's
+  standalone blocked-inverse 0.59x. P5 delivered the one thing prior evidence lacked:
+  **the whole FLA pipeline in-backend with the register/smem-resident inter-chunk state
+  and the chunk loop in-kernel** - the form that "was never actually tested in-backend."
+  It was tested here and **settles the GDN prefill bucket (bucket 1, +59.2, the single
+  largest prefill lever) as a shared-hardware / memory-bandwidth floor on GB10.**
+- **PERF GO GATE FAILED DECISIVELY.** GO required the in-pipeline blocked `solve_tril`
+  to beat M5 by **> 10% at npp2048**. Measured (nsys `--cuda-graph-trace=node`, MoE
+  `q36-35b-a3b-nvfp4`, per distinct token over 30 GDN layers): **npp2048 M5 56.31 vs FLA
+  119.46 us/tok = FLA 2.12x SLOWER** (`gdn_delta_pct_2048 = -112.1`); **npp512 M5 51.23
+  vs FLA 117.35 = 2.29x slower.** End-to-end **S_PP regressed MoE -13.33% @npp2048 /
+  -13.12% @npp512** (3-rep medians; wrong sign, no 3-sigma question). The shipped M5
+  stays `gdn_core` at **56.31 us/tok = 64.82% of vLLM's FLA chunk-64 36.5 us/tok**; the
+  rejected FLA port was only **30.55% of vLLM** (36.5/119.46). Reproduces Phase74's
+  standalone 0.59x and extends bf16-C64 (-18.75%), now **confirmed in-backend.**
+- **WHERE THE TIME WENT (the novel, valuable decomposition - why this NO-GO matters).**
+  Per-kernel nsys share of the FLA bucket: **blocked `solve_tril` only ~2.8% (55.6 ms)** -
+  the algorithm the phase was about is *cheap*. Dominated by **`fwd_h` 46.2% (903 ms) +
+  `fwd_o` 31.5% (617 ms)**: the inter-chunk state-recurrence GEMMs + the **per-chunk
+  h-state materialization to global LPDDR5x** that FLA's split-kernel structure forces
+  (`fwd_h` writes `h_pre` per chunk, `fwd_o` re-reads it). The fused M5 single kernel
+  keeps the 128x128 state **resident in smem, never materializes per-chunk h**, so it is
+  **2.1x faster on GB10's low-bandwidth memory.** Novel finding vs all prior evidence:
+  **the blocked solve is not the floor - the floor is the state-GEMM + h-materialization
+  region, which the FLA structure makes WORSE than M5.** The binding silicon property is
+  **LPDDR5x memory bandwidth** (per-chunk h round-trip), compounded by the **99 KB smem
+  cap** that forces the `fwd_h`/`fwd_o` split - not mma shapes or wave count.
+- **Correctness / cap gates (recorded, not decisive):**
+  - **SMEM GATE PASSES** (all six kernels under the 99 KB opt-in cap at C=64;
+    `cudaOccupancyMaxActiveBlocksPerMultiprocessor`): `k_kkt` 48 KB / 2 blk, `k_solve`
+    38 KB / 2 blk, `k_wu` 48 KB / 2 blk, `k_fwdh` 80 KB / 1 blk, `k_fwdo` 96 KB / 1 blk -
+    **max 96 KB < 99 KB.** The kernels fit; they are bandwidth-floored above M5.
+  - **KL BAND GREEN / in-band:** FLA `KLD 0.137028` vs control `0.136563` = **delta
+    +0.000465 < 0.01**; same-top-p **84.61% vs 83.73%** control (>= 84% baseline).
+    Per-kernel bring-up vs host fp64: **o NMSE 2.2e-7, final-state 1.2e-7** (before
+    integration, per the "do not debug six kernels blind" rule).
+  - **DEFAULT PATH UNTOUCHED:** canonical md5 GREEN both models, **default-off AND
+    `LLAMA_GDN_FLA_CHUNK`-on** (paged-MoE `8cb0ce23`, dense `5951a5b4`; small-M greedy
+    bails to M5). `test-backend-ops GATED_DELTA_NET` **DEFAULT 46/46 OK.** Decode
+    untouched (`GDN_CHUNK_MIN` untouched; decode stays on the sequential recurrence).
+  - **`test-backend-ops` env-on = 43-44/46** (`gdn_op_tests_env_on_green=false`): the
+    FLA-engaged `head_size=128, n_seq_tokens>=64` cases marginally exceed `1e-7`
+    (**ERR 1.03-1.06e-7**, fluctuating across the boundary) because the port uses plain
+    **tf32** where M5 uses **3xtf32 (CUTLASS fp32-emulation)** for the decay-coupled
+    compounding state products; M5-chunked passes the SAME cases at `< 1e-7`. Judgment:
+    marginal tf32-vs-3xtf32 gap, benign at model level (KL green); 3xtf32 would add mma
+    count and deepen the perf NO-GO, so not pursued.
+  - **Engagement PROVEN:** `LLAMA_GDN_FLA_TRACE` fired `[gdn-fla] engage H=32 ...` in
+    `batched-bench`; nsys shows all six `gdn_fla::` kernels under
+    `LLAMA_GDN_FLA_CHUNK=1` and none under default.
+- **Honest delta vs the +59.2 expectation.** Scope expected `~0-10 of the +59.2, "likely
+  a shared-hardware floor."` Delivered: **0 recovery, a -63 us/tok regression on the FLA
+  arm; the floor is confirmed.** M5's fused smem-resident chunked scan (56.31 us/tok) is
+  the winner and is at/near the GB10 memory-bandwidth floor for this op. What binds is
+  silicon (LPDDR5x bandwidth on the per-chunk h round-trip + the 99 KB smem cap forcing
+  the split), not the algorithm; it lifts only on datacenter Blackwell (HBM + larger
+  smem + TMEM), consistent with the scope's section-4 framing.
+
+Protocols honored: GPU lock held throughout and released; `LLAMA_MAX_BATCH_TOKENS`
+unset; sm_121a; nsys `--cuda-graph-trace=node`; 3+ iter S_PP medians; no external
+contention. WIP on the DGX fork topic branch `p5-fla-gdn` at
+`2d64c37f08ad323038a44a89ab32189527c6ba29` (base `localai-paged` `653bb2f3d`, **NOT
+pushed, NOT landed**): new `ggml/src/ggml-cuda/gdn-blocked-solve.cu` + narrow dispatch in
+`gated_delta_net.cu` / `gated_delta_net.cuh`. Fork `localai-paged` HEAD **untouched at
+`653bb2f3d`**; the LocalAI series **stays at 46 patches (`0001-0055`)**; topic branches
+`p1-bf16-stream` / `p2-moe-region` / `p4-cbv2` left intact. Artifacts on the DGX
+`~/bench/p5_fla_gdn/`: `killgate_20260702_204225/` (RESULTS.md, spp_control.txt,
+spp_fla.txt, `nsys_{ctrl,fla}{2048,512}.{nsys-rep,kern.csv}`, GATES.txt,
+`kl_moe_{ctrl,fla}.log`, occupancy.txt, gdn-blocked-solve.cu, p5_fla_test.cu) and
+`standalone_20260702_203434/` (RESULTS.txt + p5_fla_test.cu, p5_m5_time.cu,
+m5_kernel_body.cuh).