docs(paged): record GDN tensor-core revalidation phase

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 14:05:20 +00:00
parent 6cf8b782d1
commit e5c5746c0a
4 changed files with 136 additions and 9 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
@@ -10,10 +10,13 @@ with artifact path, gates, benchmark rows, and decision.
 - Current decision model: MoE `q36-35b-a3b-nvfp4`.
 - Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
 - Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Current tested source: DGX mirror `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase70.
- Latest decision: keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off. It is
-  correctness-clean but not serving-safe enough to default on.
+- Current tested source: DGX mirror
+  `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Latest attempt: Phase71.
+- Latest decision: keep shipped GDN M5 default as-is. It still beats
+  sequential-disabled and serial-chunked GDN on S_PP, and forced `GDN_TC=4` is
+  within noise of the current default (`+0.42%`/`-0.10%`). Do not reopen
+  smaller GDN kernel reorders on GB10.

 ## Current Serving Record

@@ -52,6 +55,51 @@ Decision:

 ## Attempt Log

+### Phase71: GDN Tensor-Core Revalidation
+
+- Date: 2026-07-01.
+- Plan: `docs/superpowers/plans/2026-07-01-gdn-tc-revalidation-phase71.md`.
+- Artifact:
+  `/home/mudler/bench/phase71_gdn_tc_revalidation/20260701_153425`.
+- Source: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Shape: MoE prefill, `PP=512,2048`, `TG=4`, `B=32`, `CTX=131072`.
+
+Canonical gates:
+
+| gate | env | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` |
+|------|-----|---------|-----------|-------------------|-----------|--------------|
+| default | none | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | `1146/1146` | `806/806` |
+| sequential-disabled | `GDN_CHUNK_MIN=2147483647` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | not run | not run |
+| serial-chunked | `GDN_TC=0 GDN_CHUNK_MIN=64` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | not run | not run |
+| forced M5 | `GDN_TC=4 GDN_CHUNK_MIN=64` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | not run | not run |
+
+MoE prefill:
+
+| arm | npp | S_PP t/s | T_PP s | S_TG t/s | total S t/s |
+|-----|----:|---------:|-------:|---------:|------------:|
+| default | `512` | `2313.57` | `7.082` | `401.82` | `2231.28` |
+| sequential-disabled | `512` | `2198.28` | `7.453` | `392.50` | `2122.58` |
+| serial-chunked | `512` | `1787.49` | `9.166` | `396.23` | `1740.12` |
+| forced M5 | `512` | `2323.18` | `7.052` | `393.62` | `2238.13` |
+| default | `2048` | `2422.88` | `27.049` | `389.91` | `2398.50` |
+| sequential-disabled | `2048` | `2361.22` | `27.755` | `386.08` | `2337.91` |
+| serial-chunked | `2048` | `1699.77` | `38.556` | `389.48` | `1688.69` |
+| forced M5 | `2048` | `2420.52` | `27.075` | `388.72` | `2396.11` |
+
+Ratios:
+
+| npp | default / sequential-disabled S_PP | default / serial-chunked S_PP | forced M5 / default S_PP |
+|----:|-----------------------------------:|------------------------------:|-------------------------:|
+| `512` | `1.0524` | `1.2943` | `1.0042` |
+| `2048` | `1.0261` | `1.4254` | `0.9990` |
+
+Decision:
+
+- Keep shipped GDN M5 default behavior.
+- Do not reopen smaller GDN C32/QS/global-Ai32/kernel-reorder work on GB10.
+- The stale "two-Gram PoC before M5 exists" framing is superseded by the
+  existing `0047` M5 implementation and this revalidation.
+
 ### Phase70: BF16 F32 Output Broader Serving

 - Date: 2026-07-01.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -1107,3 +1107,46 @@ Decision: reject default-on for `LLAMA_BF16_CUBLAS_F32_OUT=1`. The shortcut is
 correctness-clean, but it materially regressed low-concurrency serving and
 slightly widened the vLLM decode gap at `n=32` and `n=128`. Keep it
 default-off only and move the next parity effort to a different lever.
+
+## 16. PHASE71 RESULT: GDN TENSOR-CORE REVALIDATION
+
+Phase71 challenged the stale GDN planning docs before starting more source work.
+Plan:
+`docs/superpowers/plans/2026-07-01-gdn-tc-revalidation-phase71.md`.
+Benchmark ledger:
+`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`.
+DGX artifact:
+`/home/mudler/bench/phase71_gdn_tc_revalidation/20260701_153425`.
+
+Source under test stayed at DGX mirror commit
+`14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. No llama.cpp source was
+changed.
+
+Canonical gates matched for all four GDN modes: MoE md5 prefix `8cb0ce23`,
+dense md5 prefix `5951a5b4`, and `GATED_DELTA_NET 46/46`. Default also passed
+`MUL_MAT 1146/1146` and `MUL_MAT_ID 806/806`.
+
+MoE prefill, `PP=512,2048`, `TG=4`, `B=32`, `CTX=131072`:
+
+| arm | npp512 S_PP | npp2048 S_PP |
+|-----|------------:|-------------:|
+| default | `2313.57` | `2422.88` |
+| sequential-disabled (`GDN_CHUNK_MIN=2147483647`) | `2198.28` | `2361.22` |
+| serial-chunked (`GDN_TC=0 GDN_CHUNK_MIN=64`) | `1787.49` | `1699.77` |
+| forced M5 (`GDN_TC=4 GDN_CHUNK_MIN=64`) | `2323.18` | `2420.52` |
+
+Decision: keep shipped GDN M5 default behavior. On S_PP at `npp=512/2048` it
+beats sequential-disabled by `+5.24%`/`+2.61%` and serial-chunked by
+`+29.43%`/`+42.54%`; forced M5 is within noise of default (`+0.42%`/`-0.10%`).
+Do not reopen smaller GDN C32/QS/global-Ai32/kernel-reorder work on GB10.
+
+Post-Phase71 do-not-reopen list for GB10:
+
+- Smaller W4A16/MoE GEMM body, metadata, direct-activation, or quant/gather
+  shortcuts.
+- GDN C32 slab, QS-early, global-Ai32, or another low-conflict M5 reorder.
+- BF16 cuBLAS F32 output as a default-on policy.
+
+The only GDN work that should be reconsidered is a larger FLA/CuteDSL-class
+blocked-solve implementation or a hardware pivot where the GB10 constraints no
+longer apply.
--- a/backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_BUILD_PLAN.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_BUILD_PLAN.md
@@ -512,7 +512,20 @@ Files: design lives in `backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_

 ## 5. Synthesized build plan + milestones + gate

-All anchors verified. 0031's kernel body, the 7-step structure, the `GDN_CHUNK_MIN`/`GDN_CHUNK_OFF` gating at the `if constexpr (!KDA && !keep_rs_t)` site, the `launch_gdn_chunked<128,16>` template, the smem formula, and the test-backend-ops shapes are all confirmed. The scope doc's KL gate, 3xtf32 ladder, risk register, and Phase 0-3 plan are confirmed. Here is the build-ready synthesis.
+Historical note: this plan predates the shipped f32-only M5 tensor-core GDN
+path in patch `0047`. Current code parses `GDN_CHUNK_MIN` and `GDN_TC`; the
+older `GDN_CHUNK_OFF` and `GDN_CHUNK_TC` names in this section are obsolete.
+Phase71 revalidated the current default against sequential-disabled and
+serial-chunked modes on DGX and kept M5 as shipped. Use this document as
+background for any larger FLA/CuteDSL-class redesign, not as the active next
+patch queue.
+
+All anchors were verified at the time of writing. 0031's kernel body, the
+7-step structure, the `GDN_CHUNK_MIN` gating at the `if constexpr (!KDA &&
+!keep_rs_t)` site, the `launch_gdn_chunked<128,16>` template, the smem formula,
+and the test-backend-ops shapes were confirmed. The scope doc's KL gate,
+3xtf32 ladder, risk register, and Phase 0-3 plan were confirmed. Here is the
+historical build-ready synthesis.

 ---

@@ -589,12 +602,14 @@ Each milestone is a **separate patch** stacked on 0031, **green on `test-backend

 ---

-## (4) Slot into 0031's existing framework (opt-in, default-OFF)
+## (4) Slot into 0031's existing framework (historical, superseded by 0047)

 Same dispatch site - the `if constexpr (!KDA && !keep_rs_t)` block inside `launch_gated_delta_net` (0031 patch, after `init_fastdiv_values`). Extend, don't replace:

- Keep `GDN_CHUNK_MIN` (token threshold, default `INT_MAX` = off) and `GDN_CHUNK_OFF` (kill switch).
- Add **`GDN_CHUNK_TC`** selector: `0` = 0031 serial-solve chunked (fallback, retained), `1` = tensor-core. Add **`GDN_CHUNK_C` ∈ {16,32,64}** and **`GDN_DV_TILE` ∈ {32,64,128}** for A/B; defaults `C=32, DV_TILE=64` (CONFIG C) for serving, `DV_TILE=32` saturator for n_seqs=1.
+- Current code keeps `GDN_CHUNK_MIN` as the token threshold and uses `GDN_TC`
+  as the tensor-core level selector. It does not parse `GDN_CHUNK_OFF` or
+  `GDN_CHUNK_TC`.
+- Historical plan: add **`GDN_CHUNK_TC`** selector: `0` = 0031 serial-solve chunked (fallback, retained), `1` = tensor-core. Add **`GDN_CHUNK_C` ∈ {16,32,64}** and **`GDN_DV_TILE` ∈ {32,64,128}** for A/B; defaults `C=32, DV_TILE=64` (CONFIG C) for serving, `DV_TILE=32` saturator for n_seqs=1.
 - New launcher `launch_gdn_chunked_tc<128, C, DV_TILE>` mirrors `launch_gdn_chunked`: `cudaFuncSetAttribute(...MaxDynamicSharedMemorySize...)` **return-checked** (0031 precedent), `grid = dim3(H, n_seqs, n_slabs)`, `block = dim3(256,1,1)`. Per-slab the kernel recomputes A/A⁻¹/gates (dv-independent), dv-slices S/Ud/O.
 - **Default OFF** (`gdn_chunk_min=INT_MAX`) exactly as 0031 ships. Flip the default to on **only when** the M8 A/B shows an S_PP win over the tuned sequential recurrence at the serving regime (n_seqs≥2) **and** the KL gate + adversarial op case hold - recorded in README s5 (dev notes / rejected-flat levers) and `PAGED_BITEXACT_NOTE.md`. Until then it ships like 0031: opt-in, regression-free default.
 - Extend the test-backend-ops block 0031 added (the `S_v==128` shapes at lines after :9398) so the tc path is exercised at C=64 and C=32 in CI.
@@ -610,4 +625,4 @@ Same dispatch site - the `if constexpr (!KDA && !keep_rs_t)` block inside `launc

 **Risk 3 - Grid starvation at n_seqs=1 is structural (H=32 < the ~96 blocks needed for 2 blk/SM × 48 SM).** Only `DV_TILE=32` (4 slabs) fully saturates a single stream, and it pays ~1.5-2× redundant-A flops (A/A⁻¹/gates recomputed per slab) plus the per-chunk restage. **Kill criterion:** if the M8 bench shows single-stream (n_seqs=1) S_PP is slower than sequential even at full saturation (dv32×4) due to redundant-A + restage overhead, **and** the batched regime (n_seqs≥2) gain also fails to materialize → the lever only helps a regime the target workload doesn't hit → keep default-OFF, ship as opt-in experiment only, record. (If n_seqs≥2 *does* win, ship enabled for the serving regime and gate single-stream back to sequential via `GDN_CHUNK_MIN` + an n_seqs check - a partial, honest win.)

-**Overarching kill gate:** the disposition is the bench, not the theory. The kernel flips to default-on only when it beats the tuned sequential recurrence at the serving regime AND clears the KL + adversarial gates. Any milestone that regresses test-backend-ops or md5-stability halts the stack until fixed; M1 and M0 are the cheap fail-fast exits before the expensive kernel work.
+**Overarching kill gate:** the disposition is the bench, not the theory. The kernel flips to default-on only when it beats the tuned sequential recurrence at the serving regime AND clears the KL + adversarial gates. Any milestone that regresses test-backend-ops or md5-stability halts the stack until fixed; M1 and M0 are the cheap fail-fast exits before the expensive kernel work.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -140,6 +140,17 @@ At `N=32` and `N=128`, opt-in slightly widened the vLLM decode gap
 (`0.6864x` vs `0.6882x`, and `0.6839x` vs `0.6921x`). Keep
 `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off only and move to another lever.

+Phase71 revalidated the current shipped GDN tensor-core default before adding
+more GDN source work. Artifact:
+`/home/mudler/bench/phase71_gdn_tc_revalidation/20260701_153425`. Canonical
+MoE/dense md5 gates matched for default, sequential-disabled, serial-chunked,
+and forced M5 modes; `GATED_DELTA_NET` passed `46/46` for each mode, and
+default passed `MUL_MAT 1146/1146` plus `MUL_MAT_ID 806/806`. Current default
+beat sequential-disabled by `+5.24%`/`+2.61%` S_PP at `npp=512/2048`, beat
+serial-chunked by `+29.43%`/`+42.54%`, and forced `GDN_TC=4 GDN_CHUNK_MIN=64`
+was within noise of default (`+0.42%`/`-0.10%`). Decision: keep shipped M5 and
+do not reopen smaller GDN C32/QS/global-Ai32/kernel-reorder work on GB10.
+
 Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).

 ## 2. Decode-serving compute hypotheses (ranked)
@@ -1836,6 +1847,16 @@ D2/D3/D4 for DECODE are all REJECTED by the methodology's "a faster kernel off t
 Honest scope on D1's payoff: at HIGH-concurrency serving the paged GPU is already 83.5% busy because overlapping request streams hide the host stalls, so D1's win concentrates at LOW-concurrency / latency / batch-1 (GPU 4-16% busy), where it is large. The complementary serving-throughput lever is FIXING PREFILL (GDN #101 + MoE GEMM D2/#105): paged's 2x-slower prefill steals serving cycles under continuous batching (~25-55% of the serving step is prefill work) - so the prefill levers ARE also serving-decode levers. GATE: separately-built in-backend A/B (compiled-in, so a runtime flag does NOT isolate it) showing higher static/low-concurrency decode t/s with no high-concurrency-serving regression; bit-exact greedy md5 (graph replay re-issues identical kernels).

 ### next_3_levers
+
+Post-Phase71 supersession: this ranked list is historical. `0047` already
+ships the M5 tensor-core GDN path default-on under paged KV, Phase71
+revalidated it against sequential-disabled and serial-chunked baselines, and
+Phase10/11/13 rejected the smaller follow-up GDN reorders. Phase41/43 closed
+D1 on the current GB10 path unless a fresh route trace proves a host-sync
+fallback returned. Phase60/61/66 rejected another small W4A16/direct-A or
+quant/gather pass. Treat the list below as pre-Phase60 planning context, not an
+active queue.
+
 Ranked, each with its pass-gate:

 1) #101 TENSOR-CORE mma CHUNKED GDN PREFILL KERNEL (prefill, GO). #1 prefill-gap contributor (+59 us/tok, ~30%), ~3/4 math (tensor cores help) with 2.62x measured headroom on identical silicon, 1/4 layout folds in; also helps serving decode. GATE: Phase-0 regime already satisfied by this profile; Phase-1 two-Gram-product PoC must move S_PP in a SEPARATELY-BUILT in-backend A/B vs sequential (flat => NO-GO the multi-week build); then KL-gate (tf32/3xtf32) + greedy md5 + adversarial-decay op test; ship opt-in default-off until A/B beats sequential.