docs(paged): record GDN tensor-core revalidation phase

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 14:05:20 +00:00
parent 6cf8b782d1
commit e5c5746c0a
4 changed files with 136 additions and 9 deletions


@@ -10,10 +10,13 @@ with artifact path, gates, benchmark rows, and decision.
- Current decision model: MoE `q36-35b-a3b-nvfp4`.
- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Current tested source: DGX mirror `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase70.
- Latest decision: keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off. It is
correctness-clean but not serving-safe enough to default on.
- Current tested source: DGX mirror
`14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase71.
- Latest decision: keep shipped GDN M5 default as-is. It still beats
sequential-disabled and serial-chunked GDN, and forced `GDN_TC=4` is within
noise of the current default (`1.0042x`/`0.9990x` S_PP at `npp=512/2048`).
Do not reopen smaller GDN kernel reorders on GB10.
## Current Serving Record
@@ -52,6 +55,51 @@ Decision:
## Attempt Log
### Phase71: GDN Tensor-Core Revalidation
- Date: 2026-07-01.
- Plan: `docs/superpowers/plans/2026-07-01-gdn-tc-revalidation-phase71.md`.
- Artifact:
`/home/mudler/bench/phase71_gdn_tc_revalidation/20260701_153425`.
- Source: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Shape: MoE prefill, `PP=512,2048`, `TG=4`, `B=32`, `CTX=131072`.
Canonical gates:
| gate | env | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` |
|------|-----|---------|-----------|-------------------|-----------|--------------|
| default | none | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | `1146/1146` | `806/806` |
| sequential-disabled | `GDN_CHUNK_MIN=2147483647` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | not run | not run |
| serial-chunked | `GDN_TC=0 GDN_CHUNK_MIN=64` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | not run | not run |
| forced M5 | `GDN_TC=4 GDN_CHUNK_MIN=64` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | not run | not run |
MoE prefill:
| arm | npp | S_PP t/s | T_PP s | S_TG t/s | total S t/s |
|-----|----:|---------:|-------:|---------:|------------:|
| default | `512` | `2313.57` | `7.082` | `401.82` | `2231.28` |
| sequential-disabled | `512` | `2198.28` | `7.453` | `392.50` | `2122.58` |
| serial-chunked | `512` | `1787.49` | `9.166` | `396.23` | `1740.12` |
| forced M5 | `512` | `2323.18` | `7.052` | `393.62` | `2238.13` |
| default | `2048` | `2422.88` | `27.049` | `389.91` | `2398.50` |
| sequential-disabled | `2048` | `2361.22` | `27.755` | `386.08` | `2337.91` |
| serial-chunked | `2048` | `1699.77` | `38.556` | `389.48` | `1688.69` |
| forced M5 | `2048` | `2420.52` | `27.075` | `388.72` | `2396.11` |
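The `T_PP s` column appears consistent with `B * npp / S_PP`, that is, the time to prefill all 32 sequences of `npp` tokens at the reported aggregate rate. That relation is inferred from the numbers rather than stated by the benchmark tool, so the Python sketch below is only a sanity check under that assumption. The helper names are hypothetical.

```python
B = 32  # batch size from the Phase71 shape line

# (arm, npp, S_PP t/s, T_PP s) rows copied from the MoE prefill table.
ROWS = [
    ("default", 512, 2313.57, 7.082),
    ("sequential-disabled", 512, 2198.28, 7.453),
    ("serial-chunked", 512, 1787.49, 9.166),
    ("forced M5", 512, 2323.18, 7.052),
    ("default", 2048, 2422.88, 27.049),
    ("sequential-disabled", 2048, 2361.22, 27.755),
    ("serial-chunked", 2048, 1699.77, 38.556),
    ("forced M5", 2048, 2420.52, 27.075),
]


def implied_t_pp(npp, s_pp, batch=B):
    """Prefill time implied by the assumed relation T_PP = B * npp / S_PP."""
    return batch * npp / s_pp


def max_abs_error(rows):
    """Largest gap between the implied and the recorded T_PP, in seconds."""
    return max(abs(implied_t_pp(npp, s_pp) - t_pp) for _, npp, s_pp, t_pp in rows)


if __name__ == "__main__":
    for arm, npp, s_pp, t_pp in ROWS:
        print(f"{arm:>20} npp={npp:<5} implied={implied_t_pp(npp, s_pp):.3f} recorded={t_pp}")
```

If a future row breaks this relation by more than rounding error, the `S_PP` and `T_PP` columns were likely transcribed from different runs.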
Ratios:
| npp | default/sequential S_PP | default/serial S_PP | forced/default S_PP |
|-----|------------------------:|---------------------:|--------------------:|
| `512` | `1.0524` | `1.2943` | `1.0042` |
| `2048` | `1.0261` | `1.4254` | `0.9990` |
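The ratios above can be recomputed directly from the `S_PP` column of the MoE prefill table. The following Python sketch does that; the `S_PP` dictionary and the `ratios` helper are illustrative names, and the throughput values are copied from the table.

```python
# S_PP (tokens/s) per arm, keyed by npp, copied from the MoE prefill table.
S_PP = {
    512: {
        "default": 2313.57,
        "sequential-disabled": 2198.28,
        "serial-chunked": 1787.49,
        "forced M5": 2323.18,
    },
    2048: {
        "default": 2422.88,
        "sequential-disabled": 2361.22,
        "serial-chunked": 1699.77,
        "forced M5": 2420.52,
    },
}


def ratios(row):
    """Return (default/sequential, default/serial, forced/default), 4 decimals."""
    return (
        round(row["default"] / row["sequential-disabled"], 4),
        round(row["default"] / row["serial-chunked"], 4),
        round(row["forced M5"] / row["default"], 4),
    )


if __name__ == "__main__":
    for npp, row in S_PP.items():
        print(npp, ratios(row))
```

Subtracting 1 from each ratio gives the percentage deltas quoted in the decision text, for example `1.0524` corresponds to `+5.24%`.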
Decision:
- Keep shipped GDN M5 default behavior.
- Do not reopen smaller GDN C32/QS/global-Ai32/kernel-reorder work on GB10.
- The stale "two-Gram PoC before M5 exists" framing is superseded by the
existing `0047` M5 implementation and this revalidation.
### Phase70: BF16 F32 Output Broader Serving
- Date: 2026-07-01.


@@ -1107,3 +1107,46 @@ Decision: reject default-on for `LLAMA_BF16_CUBLAS_F32_OUT=1`. The shortcut is
correctness-clean, but it materially regressed low-concurrency serving and
slightly widened the vLLM decode gap at `n=32` and `n=128`. Keep it
default-off only and move the next parity effort to a different lever.
## 16. PHASE71 RESULT: GDN TENSOR-CORE REVALIDATION
Phase71 challenged the stale GDN planning docs before starting more source work.
Plan:
`docs/superpowers/plans/2026-07-01-gdn-tc-revalidation-phase71.md`.
Benchmark ledger:
`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`.
DGX artifact:
`/home/mudler/bench/phase71_gdn_tc_revalidation/20260701_153425`.
Source under test stayed at DGX mirror commit
`14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. No llama.cpp source was
changed.
Canonical gates matched for all four GDN modes: MoE md5 `8cb0ce23`, dense md5
`5951a5b4`, and `GATED_DELTA_NET 46/46`. Default also passed `MUL_MAT
1146/1146` and `MUL_MAT_ID 806/806`.
MoE prefill, `PP=512,2048`, `TG=4`, `B=32`, `CTX=131072`:
| arm | npp512 S_PP | npp2048 S_PP |
|-----|------------:|-------------:|
| default | `2313.57` | `2422.88` |
| sequential-disabled (`GDN_CHUNK_MIN=2147483647`) | `2198.28` | `2361.22` |
| serial-chunked (`GDN_TC=0 GDN_CHUNK_MIN=64`) | `1787.49` | `1699.77` |
| forced M5 (`GDN_TC=4 GDN_CHUNK_MIN=64`) | `2323.18` | `2420.52` |
Decision: keep shipped GDN M5 default behavior. It still beats
sequential-disabled by `+5.24%`/`+2.61%`, beats serial-chunked by
`+29.43%`/`+42.54%`, and forced M5 is within noise of the current default
(`+0.42%`/`-0.10%`), all in S_PP at `npp=512/2048`. Do not reopen smaller GDN
C32/QS/global-Ai32/kernel-reorder work on GB10.
Post-Phase71 do-not-reopen list for GB10:
- Smaller W4A16/MoE GEMM body, metadata, direct-activation, or quant/gather
shortcuts.
- GDN C32 slab, QS-early, Global-Ai32, or another low-conflict M5 reorder.
- BF16 cuBLAS F32 output as a default-on policy.
The only GDN work that should be reconsidered is a larger FLA/CuteDSL-class
blocked-solve implementation or a hardware pivot where the GB10 constraints no
longer apply.


@@ -512,7 +512,20 @@ Files: design lives in `backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_
## 5. Synthesized build plan + milestones + gate
All anchors verified. 0031's kernel body, the 7-step structure, the `GDN_CHUNK_MIN`/`GDN_CHUNK_OFF` gating at the `if constexpr (!KDA && !keep_rs_t)` site, the `launch_gdn_chunked<128,16>` template, the smem formula, and the test-backend-ops shapes are all confirmed. The scope doc's KL gate, 3xtf32 ladder, risk register, and Phase 0-3 plan are confirmed. Here is the build-ready synthesis.
Historical note: this plan predates the shipped f32-only M5 tensor-core GDN
path in patch `0047`. Current code parses `GDN_CHUNK_MIN` and `GDN_TC`; the
older `GDN_CHUNK_OFF` and `GDN_CHUNK_TC` names in this section are obsolete.
Phase71 revalidated the current default against sequential-disabled and
serial-chunked modes on DGX and kept M5 as shipped. Use this document as
background for any larger FLA/CuteDSL-class redesign, not as the active next
patch queue.
All anchors were verified at the time of writing. 0031's kernel body, the
7-step structure, the `GDN_CHUNK_MIN` gating at the `if constexpr (!KDA &&
!keep_rs_t)` site, the `launch_gdn_chunked<128,16>` template, the smem formula,
and the test-backend-ops shapes were confirmed. The scope doc's KL gate,
3xtf32 ladder, risk register, and Phase 0-3 plan were confirmed. Here is the
historical build-ready synthesis.
---
@@ -589,12 +602,14 @@ Each milestone is a **separate patch** stacked on 0031, **green on `test-backend
---
## (4) Slot into 0031's existing framework (opt-in, default-OFF)
## (4) Slot into 0031's existing framework (historical, superseded by 0047)
Same dispatch site - the `if constexpr (!KDA && !keep_rs_t)` block inside `launch_gated_delta_net` (0031 patch, after `init_fastdiv_values`). Extend, don't replace:
- Keep `GDN_CHUNK_MIN` (token threshold, default `INT_MAX` = off) and `GDN_CHUNK_OFF` (kill switch).
- Add **`GDN_CHUNK_TC`** selector: `0` = 0031 serial-solve chunked (fallback, retained), `1` = tensor-core. Add **`GDN_CHUNK_C` ∈ {16,32,64}** and **`GDN_DV_TILE` ∈ {32,64,128}** for A/B; defaults `C=32, DV_TILE=64` (CONFIG C) for serving, `DV_TILE=32` saturator for n_seqs=1.
- Current code keeps `GDN_CHUNK_MIN` as the token threshold and uses `GDN_TC`
as the tensor-core level selector. It does not parse `GDN_CHUNK_OFF` or
`GDN_CHUNK_TC`.
- Historical plan: add **`GDN_CHUNK_TC`** selector: `0` = 0031 serial-solve chunked (fallback, retained), `1` = tensor-core. Add **`GDN_CHUNK_C` ∈ {16,32,64}** and **`GDN_DV_TILE` ∈ {32,64,128}** for A/B; defaults `C=32, DV_TILE=64` (CONFIG C) for serving, `DV_TILE=32` saturator for n_seqs=1.
- New launcher `launch_gdn_chunked_tc<128, C, DV_TILE>` mirrors `launch_gdn_chunked`: `cudaFuncSetAttribute(...MaxDynamicSharedMemorySize...)` **return-checked** (0031 precedent), `grid = dim3(H, n_seqs, n_slabs)`, `block = dim3(256,1,1)`. Per-slab the kernel recomputes A/A⁻¹/gates (dv-independent), dv-slices S/Ud/O.
- **Default OFF** (`gdn_chunk_min=INT_MAX`) exactly as 0031 ships. Flip the default to on **only when** the M8 A/B shows an S_PP win over the tuned sequential recurrence at the serving regime (n_seqs≥2) **and** the KL gate + adversarial op case hold - recorded in README s5 (dev notes / rejected-flat levers) and `PAGED_BITEXACT_NOTE.md`. Until then it ships like 0031: opt-in, regression-free default.
- Extend the test-backend-ops block 0031 added (the `S_v==128` shapes at lines after :9398) so the tc path is exercised at C=64 and C=32 in CI.
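To make the relationship between the two current knobs concrete, here is a minimal Python sketch of how `GDN_CHUNK_MIN` and `GDN_TC` could map onto the three Phase71 arms. It is a model of the behavior described in these notes, not the code in patch `0047`: the `select_gdn_mode` function, the `<` comparison against the threshold, and the caller-supplied defaults are all assumptions.

```python
INT_MAX = 2**31 - 1  # 2147483647, the value used by the sequential-disabled arm


def select_gdn_mode(env, n_tokens, default_chunk_min, default_tc):
    """Pick a GDN path from environment overrides (hypothetical model).

    env               -- mapping of environment variables (e.g. os.environ)
    n_tokens          -- tokens in the batch being prefetched through GDN
    default_chunk_min -- threshold used when GDN_CHUNK_MIN is unset (assumed)
    default_tc        -- tensor-core level used when GDN_TC is unset (assumed)
    """
    chunk_min = int(env.get("GDN_CHUNK_MIN", default_chunk_min))
    tc_level = int(env.get("GDN_TC", default_tc))
    if n_tokens < chunk_min:
        # Below the token threshold the chunked kernels are never selected.
        return "sequential"
    if tc_level == 0:
        return "serial-chunked"
    return f"tensor-core(level={tc_level})"


if __name__ == "__main__":
    arms = {
        "sequential-disabled": {"GDN_CHUNK_MIN": "2147483647"},
        "serial-chunked": {"GDN_TC": "0", "GDN_CHUNK_MIN": "64"},
        "forced M5": {"GDN_TC": "4", "GDN_CHUNK_MIN": "64"},
    }
    for name, env in arms.items():
        print(name, "->", select_gdn_mode(env, 512, default_chunk_min=64, default_tc=4))
```

Under this model, setting the threshold to `INT_MAX` disables chunking for any realistic prompt length, which matches how the sequential-disabled arm is configured in the Phase71 gate table.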
@@ -610,4 +625,4 @@ Same dispatch site - the `if constexpr (!KDA && !keep_rs_t)` block inside `launc
**Risk 3 - Grid starvation at n_seqs=1 is structural (H=32 < the ~96 blocks needed for 2 blk/SM × 48 SM).** Only `DV_TILE=32` (4 slabs) fully saturates a single stream, and it pays ~1.5-2× redundant-A flops (A/A⁻¹/gates recomputed per slab) plus the per-chunk restage. **Kill criterion:** if the M8 bench shows single-stream (n_seqs=1) S_PP is slower than sequential even at full saturation (dv32×4) due to redundant-A + restage overhead, **and** the batched regime (n_seqs≥2) gain also fails to materialize → the lever only helps a regime the target workload doesn't hit → keep default-OFF, ship as opt-in experiment only, record. (If n_seqs≥2 *does* win, ship enabled for the serving regime and gate single-stream back to sequential via `GDN_CHUNK_MIN` + an n_seqs check - a partial, honest win.)
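The grid-starvation arithmetic in Risk 3 can be checked with a few lines of Python. The sketch assumes `n_slabs = S_v / DV_TILE` with `S_v = 128` (taken from the `S_v==128` test shapes and the "dv32×4" remark) and uses the `grid = dim3(H, n_seqs, n_slabs)` launch shape from the historical plan. The function names are hypothetical.

```python
H = 32                  # heads, from "H=32" in Risk 3
S_V = 128               # value head size, from the S_v==128 test shapes
TARGET_BLOCKS = 2 * 48  # 2 blocks/SM x 48 SMs = 96 blocks to fill the GPU


def grid_blocks(dv_tile, n_seqs=1, heads=H, s_v=S_V):
    """Total thread blocks for grid = (H, n_seqs, n_slabs)."""
    n_slabs = s_v // dv_tile  # assumption: one slab per DV_TILE-wide slice
    return heads * n_seqs * n_slabs


def saturates(dv_tile, n_seqs=1):
    """True when the launch provides at least TARGET_BLOCKS blocks."""
    return grid_blocks(dv_tile, n_seqs) >= TARGET_BLOCKS


if __name__ == "__main__":
    for dv_tile in (32, 64, 128):
        for n_seqs in (1, 2):
            blocks = grid_blocks(dv_tile, n_seqs)
            print(f"DV_TILE={dv_tile:<3} n_seqs={n_seqs} blocks={blocks:<3} "
                  f"saturated={saturates(dv_tile, n_seqs)}")
```

At `n_seqs=1` only `DV_TILE=32` reaches the 96-block target (128 blocks), while `DV_TILE=64` gets there once `n_seqs≥2`, which is consistent with the plan using `DV_TILE=64` for serving and `DV_TILE=32` as the single-stream saturator.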
**Overarching kill gate:** the disposition is the bench, not the theory. The kernel flips to default-on only when it beats the tuned sequential recurrence at the serving regime AND clears the KL + adversarial gates. Any milestone that regresses test-backend-ops or md5-stability halts the stack until fixed; M1 and M0 are the cheap fail-fast exits before the expensive kernel work.


@@ -140,6 +140,17 @@ At `N=32` and `N=128`, opt-in slightly widened the vLLM decode gap
(`0.6864x` vs `0.6882x`, and `0.6839x` vs `0.6921x`). Keep
`LLAMA_BF16_CUBLAS_F32_OUT=1` default-off only and move to another lever.
Phase71 revalidated the current shipped GDN tensor-core default before adding
more GDN source work. Artifact:
`/home/mudler/bench/phase71_gdn_tc_revalidation/20260701_153425`. Canonical
MoE/dense md5 gates matched for default, sequential-disabled, serial-chunked,
and forced M5 modes; `GATED_DELTA_NET` passed `46/46` for each mode, and
default passed `MUL_MAT 1146/1146` plus `MUL_MAT_ID 806/806`. Current default
beat sequential-disabled by `+5.24%`/`+2.61%` S_PP at `npp=512/2048`, beat
serial-chunked by `+29.43%`/`+42.54%`, and forced `GDN_TC=4 GDN_CHUNK_MIN=64`
was within noise of default (`+0.42%`/`-0.10%`). Decision: keep shipped M5 and
do not reopen smaller GDN C32/QS/global-Ai32/kernel-reorder work on GB10.
Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).
## 2. Decode-serving compute hypotheses (ranked)
@@ -1836,6 +1847,16 @@ D2/D3/D4 for DECODE are all REJECTED by the methodology's "a faster kernel off t
Honest scope on D1's payoff: at HIGH-concurrency serving the paged GPU is already 83.5% busy because overlapping request streams hide the host stalls, so D1's win concentrates at LOW-concurrency / latency / batch-1 (GPU 4-16% busy), where it is large. The complementary serving-throughput lever is FIXING PREFILL (GDN #101 + MoE GEMM D2/#105): paged's 2x-slower prefill steals serving cycles under continuous batching (~25-55% of the serving step is prefill work) - so the prefill levers ARE also serving-decode levers. GATE: separately-built in-backend A/B (compiled-in, so a runtime flag does NOT isolate it) showing higher static/low-concurrency decode t/s with no high-concurrency-serving regression; bit-exact greedy md5 (graph replay re-issues identical kernels).
### next_3_levers
Post-Phase71 supersession: this ranked list is historical. `0047` already
ships the M5 tensor-core GDN path default-on under paged KV, Phase71
revalidated it against sequential-disabled and serial-chunked baselines, and
Phase10/11/13 rejected the smaller follow-up GDN reorders. Phase41/43 closed
D1 on the current GB10 path unless a fresh route trace proves a host-sync
fallback returned. Phase60/61/66 rejected another small W4A16/direct-A or
quant/gather pass. Treat the list below as pre-Phase60 planning context, not an
active queue.
Ranked, each with its pass-gate:
1) #101 TENSOR-CORE mma CHUNKED GDN PREFILL KERNEL (prefill, GO). #1 prefill-gap contributor (+59 us/tok, ~30%), ~3/4 math (tensor cores help) with 2.62x measured headroom on identical silicon, 1/4 layout folds in; also helps serving decode. GATE: Phase-0 regime already satisfied by this profile; Phase-1 two-Gram-product PoC must move S_PP in a SEPARATELY-BUILT in-backend A/B vs sequential (flat => NO-GO the multi-week build); then KL-gate (tf32/3xtf32) + greedy md5 + adversarial-decay op test; ship opt-in default-off until A/B beats sequential.