docs(paged): record post-PoC GDN audit phase

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 14:44:17 +00:00
parent 5369219729
commit 26a41fad1a
2 changed files with 78 additions and 6 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
@@ -12,12 +12,13 @@ with artifact path, gates, benchmark rows, and decision.
 - Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
 - Current tested source: DGX mirror
  `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase74.
- Latest decision: standalone C=64 GDN solve/apply PoC did not fund backend
-  source work. Explicit inverse-plus-apply was only `0.594x`/`0.593x` the
-  direct solve/apply baseline for weak/mixed decay, so the next parity evidence
-  should be a datacenter Blackwell rerun or a substantially different TC solve
-  PoC.
+- Latest attempt: Phase75.
+- Latest decision: subagent codebase audit found no source-funded GB10 GDN
+  backend change. Phase74 rejects the C=64 inverse scaffold; vLLM's distinct
+  one-token recurrent decode path is not on the current llama.cpp critical path
+  because prior profiles showed GDN decode already faster than vLLM and serving
+  decode host/MoE-sync bound. The next parity evidence should be a datacenter
+  Blackwell rerun, or a fresh profile proving a different GB10 bottleneck.

 ## Current Serving Record

@@ -57,6 +58,54 @@ Decision:

 ## Attempt Log

+### Phase75: Post-PoC GDN/vLLM Audit
+
+- Date: 2026-07-01.
+- Artifact: no new benchmark artifact.
+- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Result type: subagent codebase audit and gate-setting only; no source change.
+- Inputs: Phase74 artifact
+  `/home/mudler/bench/phase74_gdn_blocked_solve_poc/20260701_143711`,
+  llama.cpp GDN implementation, vLLM FLA/GDN implementation, and parity docs.
+
+Findings:
+
+- llama.cpp already has the M5 tensor-core GDN path default-on under paged KV.
+  It includes `KK/QK` mma, `KS/QS` 3xtf32 mma, `P*U` mma, explicit
+  `T=A^-1`, `U=T*RHS`, and state carry `Kc^T*DU`.
+- The current backend path is fixed at `C=16` for GB10 shared-memory limits.
+  The remaining C=64/register-state class is not a shortcut patch.
+- Phase74 tested a C=64 shared-memory explicit inverse-plus-apply scaffold and
+  failed its source-work gate: inverse/direct speed was `0.5941x` weak decay
+  and `0.5927x` mixed decay.
+- vLLM has a structurally different one-token recurrent decode kernel that
+  updates state directly without chunk inverse, and a packed decode path that
+  avoids Q/K/V materialization copies. This is not currently source-funded in
+  llama.cpp because prior parity profiles showed llama.cpp GDN decode faster
+  than vLLM and decode serving dominated by host/MoE synchronization.
+- vLLM's CuTeDSL GDN prefill path uses SM10x/CUDA-13 Blackwell features
+  including TMA/tcgen05/CUTLASS DSL. Treat it as datacenter-Blackwell reference
+  evidence unless GB10 support is proven in the local toolchain.
+
+Decision:
+
+- Do not start GB10 GDN backend source work after Phase74/75.
+- Do not start a packed/recurrent GDN decode PoC unless a fresh same-session
+  profile shows GDN decode or Q/K/V materialization back on the critical path.
+- Phase75 acceptance gate for the next real parity attempt is a datacenter
+  Blackwell serving rerun with the Phase72 shape:
+  `NPL=8 32 128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`, production defaults.
+- The rerun is valid only if `hardware.txt` records
+  `hardware_class=datacenter_blackwell`, pre/post md5 gates are green
+  (`8cb0ce23777bf55f92f63d0292c756b0`,
+  `5951a5b4d624ce891e22ab5fca9bc439`), `MUL_MAT 1146/1146` and
+  `MUL_MAT_ID 806/806` are green, and decode profiles include
+  `nsys --cuda-graph-trace=node`.
+- If datacenter Blackwell materially lifts llama/vLLM decode ratios above the
+  GB10 Phase72 record (`0.7561`, `0.7158`, `0.6935`), continue parity work on
+  that surface. If not, record the residual gap as engine/kernel architecture
+  rather than GB10 memory bandwidth and keep GB10 GDN stopped.
+
 ### Phase74: GDN Blocked-Solve PoC Gate

 - Date: 2026-07-01.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -1256,3 +1256,26 @@ Decision: do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for this C=64
 inverse scaffold on GB10. A future GDN source-work gate must be a substantially
 different tensor-core blocked-solve/register-state design that shows a material
 timing win before backend changes.
+
+Phase75 follow-up audit:
+
+- llama.cpp already ships the M5 tensor-core GDN path default-on under paged KV:
+  `KK/QK`, `KS/QS`, `P*U`, explicit `T=A^-1`, `U=T*RHS`, and
+  `Kc^T*DU` state carry are covered in the current `C=16` GB10 path.
+- vLLM has a distinct one-token recurrent decode path that updates state
+  directly and a packed decode path that avoids Q/K/V materialization copies,
+  but this is not source-funded in llama.cpp without a fresh profile: prior
+  parity evidence showed llama.cpp GDN decode already faster than vLLM and
+  decode serving dominated by host/MoE synchronization.
+- vLLM's CuTeDSL GDN prefill path is useful reference material for datacenter
+  Blackwell, but depends on SM10x/CUDA-13 features such as TMA/tcgen05/CUTLASS
+  DSL and should not be treated as a portable GB10 patch base until the local
+  toolchain proves support.
+
+Current next gate: run the Phase72 same-session serving shape on B200/B100/GB200
+or equivalent datacenter Blackwell hardware. Require `hardware_class` to be
+`datacenter_blackwell`, pre/post md5 and op gates to be green, and graph-node
+decode profiles for both llama.cpp and vLLM. If the rerun does not materially
+raise the GB10 Phase72 decode ratios (`0.7561`, `0.7158`, `0.6935`), keep GB10
+GDN source work stopped and classify the residual gap as engine/kernel
+architecture rather than GB10 memory bandwidth.