docs(paged): vLLM-parity lever map + tensor-core GDN build plan (both-engine profile-validated)

Lever map records the full prefill/decode gap decomposition vs vLLM, the ranked levers, and the rejected dead ends. GDN build plan is the per-product mma mapping + A-inverse + occupancy design.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-06-29 06:15:10 +00:00
parent c4058eb4da
commit 042deab40e
2 changed files with 1106 additions and 0 deletions

View File

@@ -0,0 +1,613 @@
# Tensor-Core GDN Build Plan
> Auto-generated from the GDN build-design workflow. Build-ready spec for the full tensor-core chunked-scan kernel (2nd prefill lever).
## 1. Remaining intra-chunk products -> mma mapping
I have everything needed: the exact chunked-scan math from patch 0031, the sm_121a constraints from the scope doc, and the concrete ggml tf32 fragment (`mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32`, `tile<16,8,float> D, tile<16,8,float> A, tile<8,8,float> B`) at `mma.cuh:976-984`. Here is the design.
---
# Tensor-core mapping of the REMAINING intra-chunk GDN products (patch 0031 steps 3-7)
## 0. Building block + what the PoC already covered
**Grounding.** Math: `backend/cpp/llama-cpp-localai-paged/patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch` (steps reproduced inline below). Scope/constraints: `backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_SCOPE.md`. Fragment API: `ggml/src/ggml-cuda/mma.cuh:976-984` (the only f32-accumulate tf32 overload on sm_121a).
The single warp-level primitive on sm_121a is **`m16n8k8` tf32 / f32-accumulate**:
- `A` fragment = `tile<16,8,float>` (M=16, K=8; 4 floats/thread, `Axi[0..3]`)
- `B` fragment = `tile<8,8,float>` (K=8, N=8, `.col` operand; 2 floats/thread, `Bxi[0..1]`)
- `D` accumulator = `tile<16,8,float>` (M=16, N=8; 4 floats/thread)
- A GEMM `[M×K]·[K×N]` tiles to `ceil(M/16) × ceil(N/8) × ceil(K/8)` mma calls, f32-accumulating over the K-subtiles.
- bf16 alternative `m16n8k16` (`mma.cuh:1064`, K=16/mma, 7-bit mantissa) exists but is **only** admissible for the tf32-safe Gram class — never the state/decay-coupled class.
- 3xtf32 ladder = split each f32 operand into 3 tf32 limbs, run 3 limb-products per K-subtile (hi·hi, hi·lo, lo·hi), accumulate high→low. ~3x the mma count, ~f32 accuracy.
**PoC covered products 1 + 2** (the two `C×C` Gram products, both tf32-safe, NMSE ~3e-9): `KK[t,t']=k_t·k_t'``A`, and `QK[t,t']=q_t·k_t'``P`. Both are `(C×dk)·(dk×C)`, M=C N=C K=dk=128, decay+beta applied in f32 after. They already share the `Kc^T` B-fragments.
The remaining families are **steps 3,4,5,6,7**. Notation: `C` = chunk (default 64; PoC 16), `dk=dv=128`, per `(head,seq)` block. Tile counts below are for **C=64**.
---
## 1. Per-product mma mapping table (the deliverable)
| # | Product (0031 step) | Result = matmul | M | N | K | mma tiles `(M/16)·(N/8)·(K/8)` @C=64 | Accumulation order | Precision class | Shares staged operand with |
|---|---|---|---|---|---|---|---|---|---|
| 1 | `KK→A` (PoC) | `Kc · Kcᵀ` | C | C | dk=128 | 4·8·16 = **512** (~½ tri) | over 16 k-subtiles | **tf32-safe** (proven) | `Kcᵀ` B-frag ↔ P2; `Kc` LHS ↔ P3 |
| 2 | `QK→P` (PoC) | `Qc · Kcᵀ` | C | C | dk=128 | 4·8·16 = **512** (~½ tri) | over 16 k-subtiles | **tf32-safe** (proven) | `Kcᵀ` B-frag ↔ P1; `Qc` LHS ↔ P4 |
| 3 | `KS = S0ᵀk_t` | `Kc · S0` | C | dv=128 | dk=128 | 4·16·16 = **1024** | 16 k-subtiles, limbs hi→lo | **3xtf32 / f32** (state-boundary, feeds solve) | `S0` B-frag ↔ P4; `Kc` LHS ↔ P1 |
| 4 | `QS = S0ᵀq_t` | `Qc · S0` | C | dv=128 | dk=128 | 4·16·16 = **1024** | 16 k-subtiles, limbs hi→lo | **3xtf32 → demote-first** (×γ_t≤1 attenuated) | `S0` B-frag ↔ P3; `Qc` LHS ↔ P2 |
| 5 | `O += P·U` | `P · U` | C | dv=128 | C=64 | 4·16·8 = **512** (~½ tri over K) | C/8 k-subtiles, triangular | **tf32-safe** (P decay-masked & bounded in f32 first) | `P`(=Amat) ↔ P2; `U` B-frag ↔ P6 |
| 6 | `S_C += Kᵀ(D·U)` | `Kcᵀ · DU` | dk=128 | dv=128 | C=64 | 8·16·8 = **1024** | scale state by γ_last (f32) **first**, then C/8 k-subtiles, limbs hi→lo | **3xtf32 / f32** (THE cross-chunk carry, compounds over n_tok/C) | `U` B-frag ↔ P5; `Kc` (transposed) ↔ P1/3 |
| 7 | `U = A⁻¹·RHS` off-diag coupling `A_ij·U_j` | `A_ij · U_j` | b=16 | dv=128 | b=16 | 1·16·2 = **32**/pair → **~192** (6 pairs) +~128 diag | forward sweep i=0..C/b; off-diag subtractions before diagonal solve | **tf32-safe off-diag + f32 in-register `16×16` diagonal** | `A`(=Amat) ↔ P1; `U` blocks ↔ P5/P6 |
3xtf32 inflation if the ladder is taken: P3 1024→**3072**, P4→**3072**, P6 1024→**3072**.
---
## 2. Per-product detail (the 5 remaining families)
### Product 3 - `KS = S0ᵀ k_t` (RHS state-boundary term)
0031: `ks = Σ_i Sd[j·dk+i]·Kc[t·dk+i]`; feeds `RHS[t][j] = β_t(v_t[j] γ_t·ks)`.
- **As a GEMM:** `KS[t][j] = Σ_i Kc[t][i]·S0[i][j]``KS = Kc[C×dk] · S0[dk×dv]`. **M=C, N=dv=128, K=dk=128.** Contraction over the state-row index `i`.
- **Schedule:** `Kc` is the LHS (M-major over `t`, K over `i`) — already staged for P1. `S0` is the B operand, K-major over `i`, N over `j`. The patch's `Sd[j·dk+i]` layout (i contiguous for fixed j) **is already a K-major B layout**`ldmatrix`-friendly as `tile<8,8>` B fragments. Accumulate 16 k-subtiles into f32 D.
- **Precision: 3xtf32/f32.** This is a state-boundary product: `S0` carries the full sequence history (wide dynamic range), and the result is *differenced* against `v_t` then fed into the solve, so error here propagates through `U` into both `O` and `S_C`. Default to the 3xtf32 ladder; A/B a plain-tf32 demote only after P4.
### Product 4 - `QS = S0ᵀ q_t` (γ cross-chunk `O` term)
0031: `qs = Σ_i Sd[j·dk+i]·Qc[t·dk+i]`; `o = γ_t·qs + Σ P·U`.
- **As a GEMM:** `QS = Qc[C×dk] · S0[dk×dv]`. **M=C, N=dv=128, K=dk=128** — identical shape to P3.
- **Schedule:** identical to P3 but LHS=`Qc` (shared with P2). **Fuse with P3 on the shared `S0` B-fragments:** stage `S0` once as B, run `Kc·S0` then `Qc·S0` back-to-back — `S0` is the heavy operand (128×128) and is loaded once for both.
- **Precision: 3xtf32 but the demote-first candidate.** The term is scaled by `γ_t ≤ 1` in f32 after the mma, so when the chunk has decayed (`γ_t→0`) the absolute error is attenuated. Least sensitive of the three state-boundary products; it is the first to try at plain tf32 in the precision A/B.
### Product 5 - `O += P · U` (attention-weighted output)
0031: `o += Amat[t·Cc+tp]·Ud[j·C+tp]` for `tp≤t`.
- **As a GEMM:** `O[C×dv] += P[C×C] · U[C×dv]`. **M=C, N=dv=128, K=C=64.** Contraction over the chunk index `t'`.
- **Schedule:** `P` (=Amat scratch from P2, with `d(t',t)` applied in f32) is LHS (M over t, K over t'); `U` (solved, in `Ud`) is the B operand, K-major over t'. `P` is **lower-triangular** ⇒ for M-tile `m` only K-subtiles `≤ m` are non-zero → ~½ the mma. Accumulate `C/8` k-subtiles. Add the `γ_t·QS` term (P4) into the same f32 D accumulator before write-out.
- **Precision: tf32-safe.** `P = d·QK` with `d≤1` is formed and bounded **in f32 first** (strong-decay rows already underflowed to ~0), so down-casting the bounded `P` to tf32 for this mma is benign. The decay is never inside the accumulation — it is pre-baked in f32, preserving the bounded de-gating invariant.
### Product 6 - `S_C += Kᵀ(D·U)` (the state update)
0031: `s = γ_last·Sd[j·dk+i] + Σ_t d(t,last)·Kc[t·dk+i]·Ud[j·C+t]`.
- **As a GEMM:** let `DU[t][j] = d(t,last)·U[t][j]` (D=diag applied in f32). `S_C[i][j] += Σ_t Kc[t][i]·DU[t][j]``S_C[dk×dv] += Kcᵀ[dk×C] · DU[C×dv]`. **M=dk=128, N=dv=128, K=C=64.** Contraction over the chunk index `t`.
- **Schedule:** the accumulator D fragments **are the register-resident state** that persists across the chunk loop. Order is strict: (i) scale the state fragments by `γ_last` in f32 in-register, **then** (ii) mma-accumulate `Kcᵀ·DU` over `C/8` k-subtiles into them. LHS = `Kc` read **transposed** (i as M-row, t as K) — a different fragment view of the same `Kc` smem buffer (use the `ldmatrix` transpose / J-major tile view). B = `DU` = `U` scaled by `d(t,last)` in f32, K-major over t — **same `U` B-layout as P5**.
- **Precision: 3xtf32 / f32 — the strongest ladder candidate.** This is the only product whose error *compounds across all `n_tokens/C` chunk steps*; it defines the state trajectory. Keep at 3xtf32 longest; this is the last product to ever consider demoting, and the place where a full-f32 accumulate (3xtf32) is most justified even if everything else passes plain tf32.
### Product 7 - the A-inverse (blocked forward substitution, FLA UT-transform)
0031 does a serial per-thread fwd-subst. Tensor-core form (block `b=16` = one mma M-tile, `C/b=4` blocks at C=64):
- For block `i`: `U_i = Ainv_ii·(RHS_i Σ_{j<i} A_ij·U_j)`.
- **The A-inverse-adjacent matmul = the off-diagonal coupling `A_ij·U_j`:** **M=b=16, N=dv=128, K=b=16**`1·16·2 = 32` mma/pair; 6 lower pairs at C=64 → **192** mma. Optional materialized-`Ainv_ii` apply is the same shape (~128 more).
- **Schedule:** forward sweep `i=0..3`; for each `i` accumulate all `j<i` couplings into a `b×dv` register tile (subtract from `RHS_i`), then apply the `b×b` diagonal inverse. `A`=Amat (from P1, β·d applied in f32) is the LHS; `U_j` blocks are read from `Ud` and updated in place as the sweep advances.
- **Precision: split.** Off-diagonal coupling = **tf32-safe** (`A_ij`=β·d·kk is bounded, `d≤1`; well-conditioned for the stable de-gating). The `16×16` **diagonal block inverse stays f32/in-register** (Neumann series on the b-nilpotent, ≤b1 terms, or a short serial solve) — exact, sensitive, but tiny. This is exactly the scope's recommended structure.
---
## 3. Staged-operand sharing graph (load amortization)
Five smem/register operands, and which products read them — the fusion that makes the added flops nearly free:
- **`Kc` (C×dk)** — the most-shared buffer. M-major-over-t LHS: P1, P3. K-major-over-i B (`Kcᵀ`): P1, P2. Transposed (i-major, contract t): P6. ⇒ stage once per chunk, feeds 1,2,3,6.
- **`Qc` (C×dk)** — LHS for P2 and P4.
- **`S0` B-fragments (dk×dv, register-resident state)** — P3 and P4. **Stage once as B, run KS then QS** (heaviest operand, amortized 2×).
- **`Amat` (C×C)** — P1 writes `A` → P7 reads `A` → P2 overwrites with `P` → P5 reads `P`. One buffer, lifecycle-reused (as 0031 already does).
- **`Ud` (C×dv)** — P7 writes `U` → P5 reads `U` (B, contract t) → P6 reads `U` scaled to `DU` (B, contract t). **P5 and P6 share the identical `U` B-layout** (both contract the chunk dim) → fully shared B-fragments.
Three concrete fusions worth coding as fused passes:
1. **P1+P2** share `Kcᵀ` B (PoC already does this).
2. **P3+P4** share `S0` B (stage the 128×128 state-as-B once).
3. **P5+P6** share `U` as B (both K=C contractions); compute `P·U` and `Kcᵀ·DU` from one `U` staging, P6 accumulating straight into the persistent state fragments.
---
## 4. tf32-safe vs 3xtf32 ladder - summary + recommended A/B order
**Plain-tf32-safe (well-conditioned, bounded, intra-chunk; bf16 `m16n8k16` is even an option if more throughput is needed):**
- P1 `KK`, P2 `QK` (PoC-proven), P5 `P·U` (P bounded/f32-pre-masked), P7 off-diagonal coupling.
**3xtf32 / f32 ladder (state-boundary, cross-chunk carry, error compounds):**
- P6 `Kᵀ(D·U)` — keep at 3xtf32 longest (compounds over every chunk).
- P3 `KS` — feeds the solve; 3xtf32 by default.
- P4 `QS` — 3xtf32 by default but γ_t-attenuated → **first to demote** to plain tf32 in the precision A/B.
- P7 `16×16` diagonal block inverse — stays **f32/in-register** (not a tensor-core op).
**Recommended precision A/B ladder (drives the KL-gate from `PAGED_BITEXACT_NOTE.md`):** start P3/P4/P6 at 3xtf32 and P1/P2/P5/P7-offdiag at plain tf32. If the KL-gate has margin, demote in order **P4 → P3**, holding **P6 at 3xtf32**. If even all-3xtf32 misses the KL-gate, the residual is the `16×16` diagonal solve precision, not the mma — that already stays f32.
---
## 5. Two honest implementation gotchas (not in the scope doc, surface in the mapping)
1. **Accumulator→B relayout of the state at each chunk boundary.** The register-resident state lives as P6's **D/accumulator** fragments (`tile<16,8>`), but P3/P4 need it as a **B operand** (`tile<8,8>`, K-major over `i`). These fragment layouts differ, so at chunk entry the state must be re-laid-out accumulator→B. Cheapest correct path: bounce the 128×128 state through a transient smem tile (write D fragments, `ldmatrix` back as B fragments) once per chunk — `n_tokens/C` times total, negligible vs the `C`x state-BW saved, but it means the "freed 64KB" needs a *transient* state-shaped smem tile at boundaries (not resident across the intra-chunk work). With dv-slabbing it's per-slab (`dk×dv_tile`).
2. **`Kc` needs two fragment views.** P1/P2/P3 read `Kc` contracting over `i` (dk); P6 reads `Kc` contracting over `t` (transposed, i as M-row). One smem buffer, but P6 must use the `ldmatrix` transpose / J-major `tile` view — budget for the transposed load, don't assume one staging serves both.
Ordering within a chunk is already correct in 0031 and must be preserved by the tensor-core version: P3,P4 read **pre-update** `S0` → P7 solve → P5 → **P6 overwrites** `S0``S_C`. Single accumulator, read-first/write-last, no state double-buffer needed.
## 2. A-inverse solve (form-T then apply, FLA UT transform)
I have everything I need: the chunked GDN math (patch 0031), the scope doc's recommendation, the ggml tf32 `mma.sync` API (`mma(tile<16,8,float>&D, tile<16,8,float>&A, tile<8,8,float>&B)` = m16n8k8), and the proven Gram PoC layout (`g=lane>>2, t=lane&3`; tf32 NMSE ~3e-9). Here is the design.
---
# A-inverse solve on `mma.sync` tensor cores (C=64, sm_121a) - design
Notation: `C=64`, head dim `dk=dv=128`, block size `b=16` (= one `m16n8k8` m-tile), `n_b=C/b=4`. `A = I + N`, `N = tril(beta_t·d(t',t)·(k_t·k_t'), -1)` strictly-lower (nilpotent, `N^C=0`); `RHS[t][j] = beta_t(v_t[j] - gamma_t(S0^T k_t)[j])` is `C×dv`; we want `U = A^{-1}·RHS`.
## 0. Core decision: form `T=A^{-1}` explicitly, then one wide apply (not direct back-subst)
Two routes were on the table. **Form `T = A^{-1}` in the `C×C` domain (FLA "UT transform"), then `U = T·RHS` as a single tf32 GEMM** - rather than blocked forward-substitution applied directly to the `C×dv` RHS. Reasons, all decisive on this part:
1. **Confines the only triangular dependency to the cheap `C×C` domain.** The expensive `dv=128`-wide work (`U=T·RHS`) becomes a dependency-free dense GEMM. The serial part is just the tiny `T`-formation. This is the single most important move for "don't serialize the warps."
2. **Fewer serial passes vs `dv`.** Inverting the `16×16` diagonal block once = a 16-column solve. Direct-solving against `RHS` re-solves against all `dv=128` columns per block. Form-`T`-once + reuse via mma is far cheaper in serial work.
3. **dv-slab reuse (the occupancy lever).** `T` depends only on `K`, not on `dv`. Form once, reuse for every `dv`-slab's `T·RHS_slab` apply. (Improvement over the scope's conservative "recompute per slab": when single-block, `T` lives in 16KB shared and is broadcast; only when dv-slabbing across separate blocks for occupancy do we recompute - which is cheap anyway, ~12% of the apply's mma count.)
4. **Isolates the error amplifier.** All recursion (the part that "amplifies error") lives in the small `T`-formation where 3xtf32 is nearly free; the bulk apply is a single well-conditioned GEMM.
This still **is** the scope's "blocked forward substitution: in-register diagonal solves + mma off-diagonal coupling" - just organized to produce `T` explicitly so the wide apply is dependency-free.
## 1. Solve algorithm
Block-partition `A` into a `4×4` lower-triangular grid of `16×16` blocks. `A_ii = I_b + N_ii` (unit-lower-tri, `N_ii` strictly-lower nilpotent); `A_ij` (i>j) full `16×16`. `T=A^{-1}` is block-lower-tri with:
```
T_ii = A_ii^{-1} (diagonal block inverse)
T_ij = -A_ii^{-1} · ( Σ_{m=j}^{i-1} A_im · T_mj ) for i > j (block fwd subst)
```
Then `U = T·RHS`, with `U_i = Σ_{j≤i} T_ij·RHS_j`.
**Phase D - diagonal inverses (4 blocks, fully parallel).** Each `A_ii` is `16×16` unit-lower-tri. Invert **exactly in f32** via shared-memory column-parallel forward substitution: stage `A_ii` to shared; thread `c` (c=0..15) solves `A_ii x = e_c` (`x_c=1`, `x_r = -Σ_{m=c}^{r-1} A_ii[r][m]·x_m`), writes column `c` of `T_ii`. 16 columns in parallel, ≤16 serial MACs each, all 4 blocks on 4 warps simultaneously. **No tensor cores here, and no reduced precision** - this is where the strongest coupling lives (see §4).
**Phase O - off-diagonal, mma.** For each i>j: accumulate `P_ij = Σ_m A_im·T_mj` (δ block-products), then `T_ij = -T_ii·P_ij`. All on `mma.sync` (`16×16×16` = `2 n-tiles × 2 k-steps` = 4 m16n8k8 per block-product).
**Apply.** `U = T·RHS`: warp `w` owns output rows `16w..16w+15`, sweeps all `dv=128` (16 n-tiles) × `C=64` (8 k-steps) = 128 m16n8k8/warp. This is the bulk and is embarrassingly parallel.
`A`, `P` (the QK term), `RHS`, and `T` are all assembled from tf32 Gram mma's (`KK`,`QK`,`KS`,`QS` - the PoC-proven step-1/2 plus step-3/4) with **all decay/`gamma`/`beta` applied in f32 outside the mma** (preserves bounded de-gating).
## 2. Tile schedule - keeping the triangular dependency off the warps
Block = 128 threads = 4 warps; **"warp == 16-row m-tile" throughout** (same mapping as the PoC's C=64 kernel, `rowbase = warp*16`, `g=lane>>2`, `t=lane&3`). Three layered mechanisms keep the warps busy despite the triangular dependency:
**(a) Wavefront (anti-diagonal) parallelism in `T`-formation.** The 6 off-diagonal blocks have a critical path of only `n_b-1=3`, not 6. Group by distance `δ=i-j`:
| Wave | Blocks (δ) | count | depends on | mapped to |
|---|---|---|---|---|
| D | (0,0)(1,1)(2,2)(3,3) | 4 | - | 4 warps ‖ |
| 1 | (1,0)(2,1)(3,2) | 3 | D | 3 warps ‖ |
| 2 | (2,0)(3,1) | 2 | D,1 | 2 warps ‖ |
| 3 | (3,0) | 1 | D,1,2 | 1 warp |
Within each wave all blocks are independent → one block per warp, no intra-wave serialization. Critical path = 4 dependency levels. Total `T`-formation mma: ~10 accumulation block-products + 6 inverse-applies = ~16 block-products × 4 = **~64 m16n8k8**, vs the apply's **512** (128/warp × 4) - so `T`-formation is ~12% of apply width and carries the only dependency.
**(b) Confinement.** Because we form `T` then apply, the dependency-laden work is the ~64-mma `C×C` formation; the 512-mma `dv`-wide apply has zero triangular dependency. The serial chain never touches the throughput-critical width.
**(c) Latency hiding via RHS overlap.** `T` depends only on `K` (→ `A` ← KK Gram). `RHS` depends on `V` and `S0^T k` (KS Gram, `dv`-wide, the expensive RHS term) and is **independent of the solve**. Schedule the wavefront `T`-formation (cheap, short critical path) concurrently with the `dv`-wide KS/QS Grams that build `RHS` and the `O` cross-term. The Phase-D shared scalar inverse (~16 shared round-trips × 4 warps) hides entirely under the KS Gram (thousands of cycles). By the time `T` is ready, `RHS` is staged and the apply fires immediately.
**Shared/register budget (C=64, state register-resident per the scope):**
| Buffer | bytes | note |
|---|---|---|
| `Kc`,`Qc` (bf16/tf32-staged) | 16KB+16KB | Gram operands |
| `A``T` scratch (`C×C` f32, in place) | 16KB | `A` consumed into `T`; reuses scope's A/P slot |
| `RHS`/`U` (`C×dv`) | 16-32KB | bf16 for the P·U and KᵀU mma's |
| diag-inverse scratch | ~1KB | `16×16` per warp, transient |
| gates `cs/gam/beta` | <1KB | f32 |
| state `S` | 0 (registers) | frees the 64KB that forced 0031's C=16 |
Total ~65-80KB, under the 99KB opt-in - the solve adds **no** net shared pressure (T overwrites A; diag scratch is transient). Per-thread diag-inverse needs ~16 regs (one column of `x`), released before the apply - does not compound the already-heavy state-accumulator register budget.
## 3. Precision risk assessment
**Error model.** `‖ΔU‖/‖U‖ ≲ κ(A)·(‖ΔA‖/‖A‖ + ‖ΔRHS‖/‖RHS‖) + ‖Δ_apply‖/‖U‖`. The inverse is the amplifier; `κ(A)` is data-dependent. For DeltaNet, keys are L2-normalized so `|k_t·k_t'|≤1``|N[t][t']|≤beta_t≤1`; in the decaying regime `‖N‖<1` and `κ` is modest, but in the weak-decay/aligned-keys corner `κ` grows and the `δ=3` column path (`T_30`) compounds 3 multiplies. tf32 input rounding is ~`2⁻¹¹``5e-4` relative (f32-accumulate; PoC measured Gram NMSE ~`3e-9`). 3xtf32 (3-limb split, the CUTLASS fp32-emulation trick) buys ~f32 (~`1e-7`) at ~3× that step's mma cost.
**Where the strong coupling actually sits (the key structural fact):** the *inverse* `T_ii` is computed f32-exact, **but the dominant near-diagonal mixing is applied in the tf32 apply GEMM** (`U_i ⊃ T_ii·RHS_i`), and block-boundary adjacent pairs (e.g. tokens 15↔16) live in the `δ=1` off-diagonal `T_10`. So "f32 protects the strong coupling" is only true for the inverse *computation*; its *application* is tf32 unless promoted. This drives the ladder.
**Precision config + 3xtf32 ladder (mandatory vs optional):**
| Step | Default | Mandatory? | 3xtf32 cost |
|---|---|---|---|
| Diagonal inverse `T_ii` | **f32 (shared scalar)** | **Mandatory-and-free** (it's already f32) | n/a |
| Off-diag coupling `A_im·T_mj`, `T_ii·P_ij` | **3xtf32 (default-on)** | Effectively mandatory; ~3× of ~64 tiny mma = **negligible** | free insurance |
| KK/QK Gram → A,P | tf32 | optional (rung 1) | 3× of C×C Grams (cheap) |
| Apply `U=T·RHS` | tf32 | optional (rungs 2-4) | up to 3× the bulk |
| KS/QS Gram → RHS, O | tf32 | optional (rung 5) | vLLM keeps these bf16 (L4-rejected precedent) |
Decays/`gamma`/`beta` **always f32, outside the mma** - invariant, not a rung.
**Ladder ordering if the default config misses the KL-gate (cheapest → most expensive):**
1. KK Gram (feeds `A`) → 3xtf32 [cheap, C×C].
2. Apply **block-diagonal terms only** `T_ii·RHS_i` → 3xtf32 [≈+0.8× apply; protects within-window strong coupling - mixed-precision-by-distance].
3. + apply `δ=1` off-diagonal terms → 3xtf32 [covers block-boundary adjacent pairs].
4. Full apply → 3xtf32 [≈+2× apply; expensive escape hatch].
5. KS/QS Gram → 3xtf32.
6. Fall back to direct blocked back-substitution against RHS in 3xtf32 (the alternative route, slightly more accurate than form-`T`-then-multiply at the cost of the parallelism), else keep 0031's serial path.
**Adversarial `g∈[-20,-1e-4]`:** strong decay ⇒ `d=exp(big-negative)→0` ⇒ off-diagonal `N→0``A≈I`, `T≈I`, apply≈identity, tf32 error vanishes; bounded de-gating (f32) guarantees underflow-to-zero, never inf. Weak decay (`g→0`) ⇒ `d≈1`, `A` well-conditioned, tf32's 8-bit exponent (vs f16's 5) holds the `gamma` dynamic range. The dangerous middle is the only KL-empirical risk - re-run this op case explicitly per the scope.
**KL impact / gating.** Same gate as the backend's new-FP-path precedents: NMSE is expected to *fail* at reduced precision (this is a new path on a new path) - **the binding gate is KL** (`KLD(tc‖f16) ≤ KLD(seq‖f16)` + PPL band) plus greedy-md5 stability (md5 will not match 0031's serial path - per-path, validated benign). Expectation: the **default config (f32 diagonal + 3xtf32 off-diagonal-coupling + tf32 everything-else)** clears the KL-gate, because (i) the dominant apply matches the PoC Gram's `~3e-9`/tf32-input grade and (ii) the recursion-amplified `C×C` work is f32-grade for free. The expensive apply-3xtf32 rungs are reserved escapes. Worst case all-3xtf32 ≈ 3× the mma cost - still an order of magnitude under 0031's serial-f32 reductions and still net-positive given the `~C×` state-BW cut.
## 4. Integration + validation
- Build on `ggml/src/ggml-cuda/mma.cuh`: the tf32 path is `mma(tile<16,8,float>&D, tile<16,8,float>&A, tile<8,8,float>&B)``mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32` (line ~1089), gated by `AMPERE_MMA_AVAILABLE` (sm_121-correct). tf32 operands stage to shared and load via `load_generic` (or the PoC's `cvt.rna.tf32.f32` register packing); `ldmatrix` is `.b16`-only so it is **not** usable for tf32 fragments - use `load_generic`. 3xtf32 = 3 limb-split tf32 mma's + f32 recombine, reusing the same tile type.
- Slots into patch 0031's `gated_delta_net_chunked_cuda`: replace the serial `KK`/`QK` Grams (lines 183-198, 222-235) with the PoC tiles; replace the serial forward-subst (lines 211-218) with this Phase-D + Phase-O `T`-formation; replace the serial `O`/`S_C` reductions (steps 5/6) with the apply + `KᵀU` mma. `C=16``C=64`, state register-resident.
- Validation step before the full kernel: extend `~/scratch_tc_gdn_poc/gdn_gram_bench.cu` (DGX) to microbench Phase-D+O `T`-formation and the `T·RHS` apply in isolation, with the f64 host oracle measuring `κ(A)` and the tf32-vs-3xtf32 NMSE per rung - mirrors the Gram PoC that proved 6.7×→9.3×. This tells you, cheaply, whether the default precision config will clear the KL-gate before committing to the multi-week kernel (Phase 2 of the scope's plan).
**Net:** the solve adds zero shared pressure, ~12% mma overhead over the apply, a 4-level (not 64-step) critical path fully hidden under the RHS Grams, an f32-exact inverse, and a cheap-by-default 3xtf32 placement aimed exactly at the recursion-amplified term - so the A-inverse is **not** the bottleneck and **not** the precision wall; the residual risk is the data-dependent `κ(A)` in the weak-decay corner, arbitrated by the KL-gate with the apply-3xtf32 rungs as the escape.
Files referenced (absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_SCOPE.md`; `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch`; ggml tf32 mma at `ggml/src/ggml-cuda/mma.cuh` (line ~1089); DGX PoC `~/scratch_tc_gdn_poc/gdn_gram_bench.cu`.
## 3. Register-resident state + C=64 budget
I have everything I need: the scope doc, the full 0031 kernel (smem layout, the 7 product families, the chunk-loop sequencing), and the sibling FP4-MMA GEMM kernel's concrete patterns (PAD=4 u32 rows, STAGES=2 cp.async, WM=4/WN=2 8-warp grid, ldmatrix, register-resident accumulators). Here is the concrete data-layout design.
---
# Register-resident state + C=64 design (tensor-core chunked GDN, sm_121a / mma.sync)
## 0. Block/warp configuration (the frame everything hangs on)
0031 used `block_dims(128,1,1)` = 128 threads, "thread j owns v-column j" - a **column-parallel scalar** model. The tensor-core kernel must abandon that and adopt a **warp-tiled** model (same as the sibling GEMM kernel):
- **256 threads = 8 warps**, arranged as a **WARPS_M x WARPS_N = 4 x 2** warp grid (the GEMM kernel's proven `WM=4, WN=2`).
- Threads no longer own columns; warp `(wm,wn)` owns a rectangular sub-tile of each matrix and drives `mma.sync` on it.
- Precision: **tf32 m16n8k8** for the S-coupled / decay-coupled products, **bf16 m16n8k16** allowed only for the well-conditioned intra-chunk Gram terms (KK, QK). f32 accumulate throughout. Decays/`gamma`/`beta` stay f32, applied outside the mma (preserve bounded de-gating).
This 4x2 warp grid is the denominator for every ownership calc below.
---
## 1. The one hard problem: S is an *accumulator* for step 6 but an *operand* for steps 3/4
This is the crux the scope hand-waves ("read S as the stationary operand; step 6 accumulates into it"). The register fragment layouts are **not** interchangeable:
| Use | Role | mma shape | S indexing | Fragment layout |
|---|---|---|---|---|
| Step 6 `S += Kᵀ(D·U)` | **accumulator (D/C)** | m=dk, n=dv, k=C | `S[i][j]`, m=i, n=j | `tile<16,8,float>` acc grid |
| Step 3 `KS = K·S` | **B operand** | m=C, n=dv, k=dk | `S[i][j]`, k=i, n=j | `tile<8,8,float>` B frag |
| Step 4 `QS = Q·S` | **B operand** | m=C, n=dv, k=dk | same as step 3 | `tile<8,8,float>` B frag |
An accumulator fragment's thread→element map differs from a B-operand fragment's, so **you cannot feed the persistent S registers directly into the step-3/4 mma.** A bridge is mandatory. The design decision:
> **S lives register-resident in the step-6 ACCUMULATOR layout** (it is written every chunk; that is the hot path). Steps 3/4 reach it via a **once-per-chunk restage to a small smem tile, re-read with `ldmatrix`** as B-operand fragments.
The restage cost is paid `n_tokens/C` times (not per token) - it is *inside* the BW saving the whole lever buys. And critically, the restage smem **time-multiplexes onto the Uc/Amat region**: at chunk entry (when KS/QS are needed) U and A for this chunk are not yet computed, so their buffers are free to hold the S restage. **Net additional persistent smem for the state: 0KB** - the scope's "0KB shared state" holds, with this scheduling caveat made explicit.
KS and QS read the **same** pre-update S0, so restage once → do both → then overwrite with U.
---
## 2. Register allocation map (per thread, 256-thread block)
State `S` is `dk x dv = 128 x 128` f32 = 16384 elems. Distributed over 256 threads = **64 f32/thread** at full dv.
| Register class | Lifetime | Full dv=128 | dv-slab=64 | dv-slab=32 | Layout / ownership |
|---|---|---|---|---|---|
| **Persistent S accumulator** | whole chunk loop | **64 regs** | **32 regs** | **16 regs** | warp `(wm,wn)` owns dk-rows `[wm·32, +32)` x dv-cols `[wn·(dv/2), +dv/2)`; = 2 m-tiles x (dv/2/8) n-tiles of `tile<16,8,float>`, 4 f32 each |
| Transient A-operand frag | per product | 4 regs/tile | same | same | `tile<16,8,float>` (tf32 packs k8) reused across KK/QK/KS/QS/O/Supd |
| Transient B-operand frag | per product | 2 regs/tile | same | same | `tile<8,8,float>` |
| Transient product accumulator (KK/QK/KS/QS/O) | per product, then spilled to smem | ≤8 tiles·4 = 32 regs | ≤16 | ≤8 | these outputs go to smem; acc is transient, reused |
| A⁻¹ diagonal-block solve (16x16, in-registers) | step 7 only | ~8-12 regs | same | same | one `b=16` unit-lower-tri block per warp-row, scalar Neumann/`<b-1` terms |
| loop/index/gate scalars | always | ~12 regs | same | same | c0, Cc, cs/gam/beta locals |
**Per-thread totals (256 threads):**
- Full dv: 64 (S) + ~50 (transients, non-overlapping with S) + ~12 ≈ **~130 regs/thread** → fits **1 block/SM** (256 regs/thread budget at 65536/SM÷256). 2 blocks/SM (128 regs/thread cap) would spill - hence dv-slab for occupancy.
- dv-slab 64: 32 (S) + ~50 + 12 ≈ **~94 regs/thread** → fits **2 blocks/SM** (128-reg cap). ✓
- dv-slab 32: 16 (S) → ~78 regs/thread → headroom; grid x4.
Persistent-state register pressure is the occupancy gate; everything else is transient and reused across the 7 products.
---
## 3. Shared-memory allocation map (PAD-padded, conflict-free)
Apply the GEMM kernel's lesson verbatim: **PAD = 4 (in the row's element width)** so a 128-wide row (a multiple of the 32 banks → 8-way conflict for the 8-row `ldmatrix`) becomes stride `132`, and the 8 rows of an `ldmatrix.m8n8` land in 8 distinct banks: `(r·132 + c) mod 32 = (4r + c) mod 32`, distinct for `r=0..7`. ✓
| Buffer | Logical shape | Row stride (padded) | Element | Notes |
|---|---|---|---|---|
| `Kc` (chunk K) | `[C][dk]` | `dk + 8` bf16 (= +4 u32) | bf16 | A-operand for KK/QK; A/Bᵀ for KS; transposed-A for Supd. bf16 default |
| `Qc` (chunk Q) | `[C][dk]` | `dk + 8` bf16 | bf16 | A-operand for QK/QS/O |
| `Uc` (solved U) | `[dv][C]` | `C + 4` f32 | f32 | f32 for the triangular solve accuracy; B-operand (down-cast tf32) for O & Supd |
| `Amat` (A then P) | `[C][C]` | `C + 4` f32 | f32 | KK→A→solve, then reused for QK→P; decays applied in f32 here |
| `gates` cs/gam/beta | `[3·C]` | (1-D, no pad) | f32 | prefix-sum + `expf`, f32 always |
| **S-restage tile** | `[dk][dv_strip]` | `dv_strip + 4` f32 | f32 | **overlays UcAmat** at chunk entry; not additive at peak |
| `cp.async` stage dup | (STAGES=2 on Kc/Qc) | as Kc/Qc | bf16 | Phase-3 latency hiding only |
PAD widths: f32 tiles +4 elems; bf16 tiles +8 elems (= +4 u32, identical bank offset as the GEMM kernel). `Uc` and `Amat` are padded on the C dimension (their `ldmatrix` access dimension).
---
## 4. C=64 shared budget table (under the 99KB opt-in)
Byte math with PAD included (`KB = bytes/1024`):
| Buffer | CONFIG A — **default**: C=64, dv=128, K/Q **bf16** | CONFIG B: C=64, dv=128, K/Q **tf32** | CONFIG C — **2 blk/SM**: C=32, dv-slab=64, K/Q bf16 |
|---|---|---|---|
| `Kc` | 64·136·2 = **17.0KB** | 64·132·4 = 33.0KB | 32·136·2 = 8.5KB |
| `Qc` | **17.0KB** | 33.0KB | 8.5KB |
| `Uc` (f32) | 128·68·4 = **34.0KB** | 34.0KB | 64·36·4 = 9.0KB |
| `Amat` (f32) | 64·68·4 = **17.0KB** | 17.0KB | 32·36·4 = 4.5KB |
| gates | **0.75KB** | 0.75KB | 0.4KB |
| S-restage | overlay (0 net) | overlay (0 net) | overlay (0 net) |
| **Per-block total** | **≈ 85.8KB** ✅ < 99 | **≈ 117.8KB** ❌ | **≈ 30.9KB** |
| Blocks/SM (≈100KB/SM) | **1** | n/a | **2** (61.8KB) ✅ |
Read-out:
- **CONFIG A is the recommended default**: C=64 (4x the 0031 chunk), full dv, fits at ~86KB with margin, 1 block/SM. Peak is the O/Supd phase (all of Kc+Qc+Uc+Amat live).
- **CONFIG B (tf32 K/Q) is budget-hostile** (117KB) - tf32 K/Q tiles don't shrink with dv-slab (they're `C x dk`), so even dv-slab=64 lands ~101KB. **Conclusion: stage K/Q as bf16; reserve tf32/3xtf32 for the S-coupled and decay-coupled terms** (which arrive via the small streamed S-restage and the f32 gate scaling), exactly per the scope's "bf16 only for well-conditioned Gram terms."
- **CONFIG C is the 2-block/SM lever**: C=32 + dv-slab=64 → 31KB/block, two resident blocks under the ~100KB/SM total, and the grid grows to `H x n_seqs x 2`.
---
## 5. dv-slab strategy (the 2nd block/SM + grid-starvation fix)
Split the `dv=128` value dimension into `n_slabs` blocks; each block computes a `dv_tile`-wide vertical strip of O and of the state.
- **Grid**: `dim3(H, n_seqs, n_slabs)` (was `(H, n_seqs)`). `n_slabs ∈ {1,2,4}` for `dv_tile ∈ {128,64,32}`. This **multiplies the grid by `n_slabs`**, directly attacking 0031's low-`n_seqs` grid starvation.
- **What is dv-independent and must be recomputed per slab**: `A` (KK Gram), the `A⁻¹` solve, the gate prefix - all depend only on K and the gates, *not* dv. Each slab recomputes them. Cheap once they are on tensor cores (the whole point); this is the FLA per-slab pattern.
- **What is dv-sliced**: the S accumulator (`128 x dv_tile`), `Uc` (`dv_tile x C`), KS/QS/O outputs, step-6 update. Halving/quartering dv halves/quarters both the **S register footprint** (64→32→16 regs/thread, §2) and the dv-scaled smem (`Uc`, restage).
- **Restage budget bonus**: at `dv_tile=64` the per-block S is `128 x 64` = 32KB, so the once-per-chunk restage fits the UcAmat overlay window in a single pass (no strip loop). At full dv=128 the restage is done as **2 dv-strips of 32KB** reusing the same overlay (or 16 k-strips of 8x128 if registers are tighter than smem).
`b`-block forward substitution (step 7) is independent of dv too, so the in-register `16x16` diagonal solves are computed once and the off-diagonal mma coupling `Uᵢ -= Aᵢⱼ Uⱼ` runs per slab as a `(16x16)·(16 x dv_tile)` mma.
---
## 6. Bank-conflict-free layout - the GEMM PAD lesson applied
Concretely, per the sibling kernel's `ARS = KBLK·SAW + PAD` with `PAD=4` (which gave +19%):
- Every smem matrix read by `ldmatrix` (or its tf32 equivalent in `ggml/src/ggml-cuda/mma.cuh`) is stored with **row stride = logical_width + PAD**, PAD chosen so `stride mod 32 ≠ 0`: f32 width-128 → 132 (`132 mod 32 = 4`); bf16 width-128 (packed 64 u32) → 68 u32 (`68 mod 32 = 4`).
- This guarantees the 8 rows an `m8n8` `ldmatrix` touches map to 8 distinct banks for any fixed column → no replays on the operand loads, which are the kernel's inner-loop smem traffic.
- `cp.async` (CONFIG, Phase 3): `STAGES=2` double-buffer on `Kc`/`Qc` only (the GEMM kernel found multistage saturates BW past depth 2). 16B `cp.async.cg` copies into the padded rows; `commit_group`/`wait_group` Ampere-style (no TMA on sm_121). The pad keeps the staged writes coalesced and the mma reads conflict-free simultaneously.
---
## 7. Summary of the allocation decisions
| Decision | Value |
|---|---|
| Threads / warp grid | 256 / 4x2 (WM=4, WN=2) |
| **State residency** | register-resident in **step-6 accumulator layout** (`tile<16,8,float>` grid), 64/32/16 f32-regs/thread at dv 128/64/32 |
| **Accumulator↔operand bridge** | once-per-chunk `ldmatrix` restage of S to a small smem tile that **overlays UcAmat** (0 net persistent smem); KS+QS share one restage |
| K/Q precision | **bf16** staged (tf32 K/Q breaks the 99KB budget); tf32/f32 reserved for S-coupled + decay-coupled terms |
| Uc / Amat | f32, padded on C (+4) |
| **PAD** | +4 f32 (+8 bf16 = +4 u32) row stride → `ldmatrix` 8-row conflict-free (GEMM-proven) |
| **C=64 default budget** | **≈86KB**, 1 block/SM (CONFIG A) ✅ |
| 2 blocks/SM | C=32 + dv-slab=64 → ≈31KB/block, grid x2 (CONFIG C) |
| dv-slab | grid `(H, n_seqs, n_slabs)`; A/A⁻¹/gates recomputed per slab; S/Uc/O dv-sliced |
| cp.async | STAGES=2 on Kc/Qc (Phase 3 only) |
One honest caveat surfaced beyond the scope doc: the scope's "~64-80KB / 0KB shared state" budget only holds with **bf16 K/Q staging** and the **overlay restage**; tf32 K/Q (CONFIG B) does not fit even with dv-slab, which is why bf16-Gram + tf32-only-for-S/decay is the forced (and scope-consistent) precision split.
Files referenced: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_SCOPE.md`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch`, and the sibling GEMM patterns in `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch`.
## 4. Occupancy + launch config
I have everything I need. Key facts confirmed: tf32 tile is `mma.sync.aligned.m16n8k8.f32.tf32.tf32.f32` (mma.cuh:1089, accumulator `tile<16,8,float>` = 4 f32/thread/tile); grid.x = `H` = `n_v_heads` = `ssm_dt_rank` (llama-model.cpp:504; Qwen3-Next family = 32 GDN value heads); the sequential kernel already uses a 3D grid `(H, n_seqs, ceil(S_v/num_warps))` (gated_delta_net.cu:184) — the chunked 0031 collapsed that z-axis to 1, which is exactly what starves the grid.
Here is the occupancy + launch design.
---
# Occupancy + launch design — tensor-core chunked GDN prefill (sm_121a)
## 0. The two independent caps 0031 hit (must relieve BOTH for ≥2 blocks/SM)
0031's -22% is not one wall, it's two stacked walls, and they are relieved by *different* levers:
| Cap | 0031 value | Binding resource | Lever |
|---|---|---|---|
| **Shared-memory cap** | 89 KB (64 KB all-shared state) | 100 KB/SM, 99 KB dyn opt-in | state→registers **+ smaller C** |
| **Register cap** | n/a (was scalar) | 65536 regs/SM | **dv-slab** the register-resident state |
| **Grid cap** | `(H, n_seqs, 1)` = 32·n_seqs blocks | 48 SMs | **dv-slab multiplies grid** by n_slabs |
sm_120/121-class per-SM limits used throughout: **1536 threads/SM, 65536 32-bit regs/SM, 100 KB shared/SM (99 KB dynamic opt-in), 255 regs/thread, ≤24-32 blocks/SM (hw, never the binding limit here).** The binding limits are **shared and registers.**
Critical correction to the scope-doc budget table: it assumes **bf16** K/Q staging (2 B). The precision default is **tf32**, which is a 32-bit container in shared — tf32 K/Q would *double* Kc/Qc and blow C=64 past 99 KB. So the occupancy config **stages K/Q as bf16** (the well-conditioned KK/QK Gram products per the scope's "bf16 only for Gram terms"), keeps gates/decays/beta/the solve in f32. This is a real precision↔occupancy coupling, flagged in §5.
## 1. Grid mapping — three parallel axes, the chunk axis is serial
The inter-chunk recurrence carries state `S` across chunks, so **the chunk axis cannot be a grid axis** (it's the sequential dependency — that's the whole algorithm). The only legitimate grid axes that don't break the recurrence are:
```
dim3 grid(H, n_seqs, n_slabs); // H = n_v_heads = 32 (ssm_dt_rank)
// n_slabs = dv / dv_tile (the new lever)
```
- `blockIdx.x = head` (0..31), `blockIdx.y = seq`, `blockIdx.z = dv-slab`.
- A block owns v-columns `[z·dv_tile, (z+1)·dv_tile)`, walks the chunk loop serially, and keeps **only its `dk × dv_tile` state slab** register-resident.
- This reuses the **same 3D grid shape the sequential kernel already has** (gated_delta_net.cu:184 uses z for S_v-splitting); the chunked kernel repurposes z from S_v-split to dv-slab. The dispatcher change is minimal.
**Saturation math (the core of the task).** Target ≥2 blocks/SM on 48 SMs ⇒ **≥96 concurrent blocks**. With H=32:
| n_seqs | dv_tile=128 (n_slabs=1) | dv_tile=64 (2) | dv_tile=32 (4) |
|---|---|---|---|
| 1 | 32 (starved, 0031) | 64 (48/48 SMs busy, 67% warp-occ) | **128 (100%)** |
| 2 | 64 | **128 (100%)** | 256 |
| 4 | 128 | 256 | 512 |
So **dv-slabbing is simultaneously the register-relief lever and the grid-multiplier** — it's the single most important move. Rejected grid alternatives: split-K over dk (needs cross-block atomic reduction + fights the state carry); batching heads/seqs per block (reduces grid, wrong direction).
## 2. Block dim / warp count — 8 warps / 256 threads
```
constexpr int WARPS = 8;
dim3 block(32 * WARPS, 1, 1); // 256 threads
```
Why 8 warps:
- **Clean mma tile partition at C=32:** KK/QK output is `C×C = 32×32` = (32/16)·(32/8) = **8 m16n8 tiles → exactly 1 tile/warp**, dk=128 = 16 k8-steps. Steps 3/4 (KS/QS) and 5 (P·U) → 2 tiles/warp. Step 6 state update `dk×dv_tile`=128×64 → 64 tiles → **8 tiles/warp** (these are the persistent register-resident accumulators).
- **Register dilution:** the register-resident state accumulator is spread across all 256 threads (see §3) — more warps = fewer state-regs/thread.
- **Threads are not the cap:** 256 threads ⇒ up to 6 blocks/SM by the 1536 thread limit, so registers/shared decide.
Fallback if register-capped (§5): **12 warps (384 threads)** dilutes the state accumulator further (dv_tile=64: 32→21 state-regs/thread) at the cost of thinner per-warp tiles and ≤4 blocks/SM by threads.
## 3. Register-resident state ↔ dv-slab ↔ occupancy interaction
The state slab is held as **tf32 mma accumulator fragments** (`tile<16,8,float>`, 4 f32/thread/tile) persisting across the chunk loop. Per-thread state-register cost = `dk·dv_tile / 256`:
| dv_tile | state f32/block | state regs/thread (256 thr) | + working (est.) | regs/thread | regs/block | reg-allowed blocks/SM |
|---|---|---|---|---|---|---|
| 128 (no slab) | 16384 | 64 | ~50 | ~114 | ~29 K | 2 (tight) |
| 64 | 8192 | 32 | ~50 | ~82 | ~21 K | 3 |
| 32 | 4096 | 16 | ~50 | ~66 | ~17 K | 3 |
So on registers alone, dv_tile≤64 admits ≥2 blocks/SM. **Shared memory is then the binding cap**, and it's governed by **C**, not dv_tile (Kc/Qc/A all scale with C, only U scales with dv_tile):
| Config | Kc+Qc (bf16) | A/P (f32) | U (f32) | single | +cp.async dbl-buf K/Q | blocks/SM (shared) |
|---|---|---|---|---|---|---|
| C=64, dv_tile=128 | 32 KB | 16 KB | 32 KB | 80 KB | 112 KB ✗(no room!) | **1** |
| C=64, dv_tile=64 | 32 KB | 16 KB | 16 KB | 64 KB | 96 KB ✓ | **1** |
| **C=32, dv_tile=64** | 16 KB | 4 KB | 8 KB | **28 KB** | **44 KB ✓** | **2** |
| C=32, dv_tile=32 | 16 KB | 4 KB | 4 KB | 24 KB | 40 KB ✓ | **2** |
**Finding the scope doc missed:** C=64-no-slab is shared-saturated at 80 KB — there is **no room for cp.async double-buffering**, so the 1-block/SM kernel would have *no latency hiding* and likely still lose. C=64 needs dv_tile≤64 *just to make room for cp.async*, and is still 1 block/SM. **Genuine 2 blocks/SM requires C=32** (to drop Kc/Qc/A under the 49.5 KB/block budget).
## 4. cp.async double-buffering (depth 2, no TMA)
At 1 block/SM (C=64 path) cp.async is the *only* latency-hiding mechanism, so it's mandatory, not optional. Plain Ampere `cp.async` (`cp.async.commit_group` / `cp.async.wait_group`) — **no `cp.async.bulk`/TMA on sm_121.** Stage the **next chunk's Kc, Qc** (and Vc if the KL-gate doesn't force V from global) into a second shared buffer while the current chunk's mma runs. Depth **2 only** — the sibling GEMM kernel proved multistage saturates BW past depth 2. The double-buffer cost is already in the "+cp.async" column above (44 KB at C=32 keeps 2 blocks/SM).
## 5. Launch config (concrete) + honest occupancy estimate
**Recommended default (batched-prefill serving regime, n_seqs≥2):**
```
C = 32 ; dv_tile = 64 ; n_slabs = 2 ; WARPS = 8
grid = dim3(H=32, n_seqs, 2)
block = dim3(256, 1, 1)
smem = 44 KB (Kc/Qc bf16 ×2 dbl-buf + A/P f32 + U f32) // cudaFuncSetAttribute return CHECKED (0031 precedent)
→ 2 blocks/SM. n_seqs≥2 ⇒ ≥128 blocks ⇒ 48/48 SMs at full 2-block occupancy (100%), 1.33 waves.
A/Gram/solve recomputed 2× across slabs (state-update per slab is 2× the A work ⇒ ~25% redundant-flop overhead).
```
**Single-stream prefill (n_seqs=1) saturator:**
```
C = 32 ; dv_tile = 32 ; n_slabs = 4 ; WARPS = 8
grid = dim3(32, 1, 4) = 128 blocks ⇒ 2 blocks/SM on all 48 SMs (100%) even at n_seqs=1.
Cost: A recomputed 4×, and at dv_tile=32 the A bucket ≈ the per-slab state bucket ⇒ ~1.5-2× total-flop overhead.
```
**BW-max alternative (1 block/SM, bench against the above):**
```
C = 64 ; dv_tile = 64 ; n_slabs = 2 ; WARPS = 8 ; smem = 96 KB (dbl-buf, fits 99 KB)
→ 1 block/SM, but 4× state-BW cut (vs 2× at C=32) + grid ×2. n_seqs=1 ⇒ 64 blocks ⇒ 48/48 SMs busy (67% warp-occ).
```
**Occupancy summary:**
| Config | blocks/SM | regs/thread | smem/block | SM util @ n_seqs=1 | SM util @ n_seqs≥2 | state-BW cut | redundant-A |
|---|---|---|---|---|---|---|---|
| 0031 | 1 | scalar | 89 KB | 32/48 busy (starved) | 1024 blk, no overlap | 1× (C=16) | none |
| C=32 dv64 (default) | **2** | ~82 | 44 KB | 48 busy, 67% occ | **100%** | 2× | 2× (~25%) |
| C=32 dv32 (1-seq) | **2** | ~66 | 40 KB | **100%** | 100% | 2× | 4× (~1.5-2×) |
| C=64 dv64 (BW-max) | 1 | ~114 | 96 KB | 48 busy, 67% occ | 100%, multi-wave | **4×** | 2× |
The C=32 (occupancy) vs C=64 (BW) choice is the empirical fork the scope doc defers to Phase-3 bench: 2 blocks/SM at half the BW saving, vs 1 block/SM at full BW saving + cp.async. **Wire both behind the existing `GDN_CHUNK_MIN` gate plus a `GDN_CHUNK_C` / `GDN_DV_TILE` selector and A/B them; do not assume.**
## 6. Residual risk — register pressure likely caps it at 1 block/SM (honest)
The ≥2-blocks/SM result rests on the **~50 working-regs/thread estimate**, which is optimistic:
- **The blocked-forward-subst A⁻¹ (step 7) is the swing factor.** The in-register 16×16 unit-lower-triangular diagonal inverse + the off-diagonal mma coupling + mma operand fragments + the **accumulator→operand fragment transpose** for reusing the register-resident S as a step-3/4 operand (a `movmatrix`/shared round-trip, since S lives in C-fragment layout but steps 3/4 need it as an A/B operand) can push working regs to **80-120**. At 256 threads, regs/thread > 128 ⇒ > 32 K regs/block ⇒ **silently drops to 1 block/SM** regardless of the 44 KB shared headroom. The scope doc names this exactly: "blocked-forward-subst register pressure trades against state-register pressure; both compete for the same budget."
- **Mitigation ladder, in order:** (i) 12 warps to dilute (dv_tile=64: state 32→21 regs/thread); (ii) `__launch_bounds__(256, 2)` to force the compiler under 128 regs/thread (risks spills to local → BW back); (iii) smaller dv_tile (more grid, more redundant A). If all fail, accept **1 block/SM and lean on cp.async double-buffering + the 4× BW cut + mma throughput** — which is still very likely a win over 0031's serial-f32/-22%, just not the 2-block target.
- **Grid-starvation at n_seqs=1 is structural** (H=32 < 96): only dv_tile=32 fully saturates a single stream, and it pays ~1.5-2× redundant-A flops. This is an inherent floor of the chunked formulation on a 32-head layer; batched serving (n_seqs≥2) is where the design cleanly saturates.
- **Precision↔occupancy coupling:** the 2-block budget assumes **bf16 K/Q** staging. If the KL-gate demands tf32 for the KS/QS S0-products (decay-coupled), that needs a second 32-bit K/Q copy or 3×tf32 — both inflate shared/registers and can knock C=32 back toward 1 block/SM. The occupancy win is contingent on bf16 Gram clearing the gate.
**Bottom line:** 2 blocks/SM and full 48-SM saturation are *reachable* — via **C=32 + dv-slab (64 for serving, 32 for single-stream) + 8 warps + depth-2 cp.async** — on the **shared-memory** budget. Whether the **register** budget also permits it is the one genuinely open risk, and the A⁻¹ solve's footprint is what decides it; the realistic outcome remains the scope doc's honest **1-2 blocks/SM**, with the win carried by mma + cp.async + the C× BW cut rather than by high occupancy.
Files: design lives in `backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_SCOPE.md` (§3 occupancy plan — this analysis sharpens it); kernel/dispatcher to modify `backend/cpp/llama-cpp-paged-dev/ggml/src/ggml-cuda/gated_delta_net.cu` (grid at :184, chunked launch added by patch `patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch`); tf32 tile at `ggml/src/ggml-cuda/mma.cuh:1089`; H source `src/llama-model.cpp:504`.
## 5. Synthesized build plan + milestones + gate
All anchors verified. 0031's kernel body, the 7-step structure, the `GDN_CHUNK_MIN`/`GDN_CHUNK_OFF` gating at the `if constexpr (!KDA && !keep_rs_t)` site, the `launch_gdn_chunked<128,16>` template, the smem formula, and the test-backend-ops shapes are all confirmed. The scope doc's KL gate, 3xtf32 ladder, risk register, and Phase 0-3 plan are confirmed. Here is the build-ready synthesis.
---
# BUILD-READY PLAN: tensor-core chunked-GDN prefill kernel (sm_121a)
Anchors (absolute): kernel + launch + gate live in `ggml/src/ggml-cuda/gated_delta_net.cu` (grid at :184), added by `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch`. tf32 tile `mma(tile<16,8,float>&D, tile<16,8,float>&A, tile<8,8,float>&B)` = `mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32` in `ggml/src/ggml-cuda/mma.cuh` (m16n8k8 overload ~976-984, dispatch ~1089), gated by `AMPERE_MMA_AVAILABLE`. PAD/cp.async patterns from `patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch`. Gate/precedent docs: `docs/TENSORCORE_GDN_SCOPE.md`, `docs/PAGED_BITEXACT_NOTE.md`, `README.md` s5. Microbench: `~/scratch_tc_gdn_poc/gdn_gram_bench.cu` (DGX). Last patch in series is 0042 → this work is patches 0043+.
The new kernel is `gated_delta_net_chunked_tc_cuda<S_v, C, DV_TILE>`, a sibling to 0031's `gated_delta_net_chunked_cuda`. Symbols below reuse 0031's smem names (`Sd, Kc, Qc, Ud, Amat, csh, gam, bet`).
---
## (1) Phase-by-phase kernel structure
Block = **256 threads / 8 warps** in a **4×2 (WM×WN)** warp grid. State `S` (`dk×dv_tile`) is **register-resident in the step-6 accumulator layout** (`tile<16,8,float>` grid). Grid = `dim3(H, n_seqs, n_slabs)`, `blockIdx.z` = dv-slab. Chunk axis is the serial recurrence (NOT a grid axis). Invariant preserved from 0031: read pre-update `S0` (P3/P4) → solve → output (P5) → **overwrite S last** (P6). Single accumulator, no state double-buffer.
Per chunk `c0` (the loop body):
**Phase A - chunk load + gate prefix (f32, cooperative).** Load `Kc[C][dk]`, `Qc[C][dk]` **as bf16** (tf32 K/Q blows the 99KB budget - see §5 of the state design), load `V` chunk. Compute `csh = cumsum(g)` (≤0), `gam = exp(csh)` (≤1), `bet` - all f32, identical to 0031 lines (the `j==0` prefix scan, kept scalar; it is <1KB and hides under the Grams). cp.async depth-2 prefetch of the *next* chunk's `Kc/Qc` starts here.
**Phase B - state restage (accumulator→B bridge).** The crux. `S0` lives as P6's D/accumulator fragments but P3/P4 need it as a **B operand** (`tile<8,8>`, K-major over `i`). Bounce the `dk×dv_tile` state through a transient smem tile that **overlays the `UdAmat` region** (free at chunk entry - U/A not yet computed) → `load_generic` back as B fragments (NOT `ldmatrix`: it is `.b16`-only, unusable for tf32; use `load_generic`). Paid `n_tokens/C` times, **0KB net persistent smem**. KS and QS share this one restage.
**Phase C - Gram + state-boundary products (the matmuls that read pre-update S0).**
- **P1 `KK→A`** = `Kc·Kcᵀ`, M=C N=C K=dk, lower-tri (~½ tiles). **tf32-safe** (PoC-proven NMSE ~3e-9). Apply `A = I + tril(βₜ·d(t',t)·KK, -1)` in **f32** after the mma.
- **P3 `KS`** = `Kc·S0`, M=C N=dv K=dk. **3xtf32** (state-boundary, feeds the solve). Output → `Ud` region (becomes RHS).
- **P4 `QS`** = `Qc·S0`, M=C N=dv K=dk, **fused with P3 on the shared S0 B-fragments**. **3xtf32** (γ-attenuated → first demote candidate). Seed the **O accumulator fragments register-resident with `γₜ·QS`** immediately (avoids parking QS in smem through to Phase F). Restage overlay is now free; `Ud`/`Amat` reclaim it.
**Phase D - A-inverse (form T = A⁻¹ explicitly, then wide apply).**
- **Phase-D inverses:** 4 diagonal `16×16` unit-lower-tri blocks, **f32 in shared-memory column-parallel forward substitution** (thread `c` solves `A_ii x = e_c`). No tensor cores, no reduced precision (this is the strong-coupling amplifier). 4 blocks on 4 warps in parallel, hides entirely under the Phase-C/RHS Grams.
- **Phase-O off-diagonal:** wavefront (anti-diagonal) schedule, critical path `n_b-1=3` not 6. For each i>j: `P_ij = Σ_m A_im·T_mj` then `T_ij = -T_ii·P_ij`, on `m16n8k8`. **3xtf32 default-on** (~64 tiny mma total, negligible). `T` overwrites the `A` scratch in place.
**Phase E - RHS + apply.** `RHS = βₜ(vₜ - γₜ·KS)` in **f32** (uses P3 result + V) → `Ud`. **`U = T·RHS`** as one dependency-free wide **tf32** GEMM, M=C N=dv K=C (the bulk, 128 mma/warp at full dv), in place → `Ud`.
**Phase F - intra-chunk output.**
- **P2 `QK→P`** = `Qc·Kcᵀ`, reuse `Amat` (now free after T consumed). **tf32-safe**. Apply `P = d(t',t)·QK` in **f32** (bounded, decay pre-baked - preserves the bounded de-gating invariant).
- **P5 `O += P·U`**, M=C N=dv K=C, P lower-tri (~½ tiles). **tf32-safe** (P f32-bounded first). Accumulate into the O fragments already seeded with `γₜ·QS`. Write `O*scale` to `dst`.
**Phase G - state carry (overwrites S0 last).** `DU = d(t,last)·U` in f32. **Scale the persistent S accumulator fragments by `γ_last` in f32 in-register first**, then **P6 `S_C += Kcᵀ·DU`** = `Kcᵀ·DU`, M=dk N=dv K=C, **3xtf32 (the strongest ladder candidate - compounds over every chunk)**, accumulated straight into the persistent registers. `Kc` is read **transposed** here (second fragment view, `load_generic` transpose). No restage-out: S stays resident for the next chunk.
After the loop: final-state write-back (M-layout), identical to 0031's tail.
Buffer lifecycle (single `Amat`, single `Ud`, as 0031): `Amat`: A(P1) → T(Phase-D/O, in place) → consumed by apply → P(P2) → consumed by P5. `Ud`: KS(P3) → RHS(Phase-E) → U(apply, in place) → read by P5 (B) and P6 (B, scaled to DU). Restage tile overlays `UdAmat` only at chunk entry (Phase B), before either is written.
---
## (2) Build sequence - incremental, each independently GPU-verifiable vs 0031
Each milestone is a **separate patch** stacked on 0031, **green on `test-backend-ops GATED_DELTA_NET` + greedy-md5 stable before the next is started**. Reference for every step = the `test_gated_delta_net` op's f64/CPU oracle (already in-tree) and 0031's serial-chunked output. **No milestone integrates on top of an unverified one.**
| M | Scope | Patch | GPU verification gate (vs 0031 / op oracle) |
|---|---|---|---|
| **M0** | Re-confirm regime, NO code (scope Phase 0) | - | Profile 0031 (`GDN_CHUNK_MIN` low): confirm GDN prefill bucket dominates + grid-starved at low n_seqs. If not, kill the lever now. |
| **M1** | **DGX microbench (NO kernel yet)** - extend `gdn_gram_bench.cu` with KS/QS/PU/KᵀU microkernels + Phase-D/O T-formation + T·RHS apply, each with f64 host oracle measuring **κ(A)** and tf32-vs-3xtf32 NMSE per rung, incl. adversarial `g∈[-20,-1e-4]` | - | **The cheap go/no-go before multi-week kernel work.** Pass = default precision config (f32 diag + 3xtf32 off-diag + tf32 bulk) reaches ~PoC `3e-9`-grade on benign data and survives the κ(A) weak-decay corner within the ladder. Mirrors the PoC that proved 6.7×→9.3×. |
| **M2** | In-kernel: replace **only** step-1/2 serial Grams (KK/QK) with tensor-core tiles. **C=16, scalar everything else, same occupancy** (scope Phase 1 / PoC integration) | 0043 | test-backend-ops 128-shapes green via KL gate (NMSE if it passes); greedy-md5 stable. |
| **M3** | Add **P3/P4 (KS/QS)** tensor-core (3xtf32) + S restage bridge. Still C=16, scalar solve + scalar O/state | 0044 | Same gate. Isolates the accumulator→B bridge correctness. |
| **M4** | **A-inverse** Phase-D (f32 diag) + Phase-O (3xtf32 off-diag), form T; replace 0031's serial fwd-subst. Still C=16 | 0045 | Same gate + the adversarial-decay op case (this is the amplifier). |
| **M5** | **Apply `U=T·RHS`** + **P5 `P·U`** tensor-core. Still C=16 | 0046 | Same gate. |
| **M6** | **P6 `Kᵀ(D·U)`** tensor-core + **register-resident state** (step-6 accumulator layout) + accumulator→B restage in steady state. State leaves smem here | 0047 | Same gate. Frees the 64KB that forced C=16. |
| **M7** | **Flip C=16→C=64, full dv (CONFIG A ~86KB, 1 blk/SM)**, 8-warp 4×2 grid, PAD=4 smem | 0048 | Gate + **first A/B bench vs sequential** (S_PP at n_seqs≥2). |
| **M8** | **Occupancy:** C=32 + dv-slab grid `(H,n_seqs,n_slabs)` (CONFIG C, 2 blk/SM) + cp.async depth-2; selectors `GDN_CHUNK_C`/`GDN_DV_TILE` | 0049 | Gate + A/B bench across {C=32/dv64, C=32/dv32, C=64/dv64-BW-max}; pick winner per regime. |
---
## (3) Bit-exact / KL gate plan
**md5 is per-path and will NOT match** 0031-serial or the sequential recurrence (different FP reduction order). This is the established `-paged` precedent (`PAGED_BITEXACT_NOTE.md`): per-path md5, validated benign. So:
- **Binding gate = KL** (not strict NMSE): `KLD(tensorcore ‖ f16) ≤ KLD(sequential ‖ f16)` plus a PPL band, on the README s5 harness. NMSE is *expected to fail* at reduced precision (new path on a new path); NMSE-pass is a bonus, KL-pass is the bar.
- **Stability gate:** greedy-md5 **stable across runs** (deterministic), not equal to the serial path.
- **Adversarial op case mandatory:** `g∈[-20,-1e-4]` (the dangerous middle-decay regime where κ(A) grows); strong-decay underflows to 0 (safe), weak-decay is well-conditioned (tf32's 8-bit exponent holds γ range), the middle is the only empirical risk.
**Precision default config (the bet that clears the gate):** f32 diagonal inverse (mandatory, already f32) · **3xtf32 off-diagonal coupling** (default-on, negligible ~64-mma cost) · **tf32** Grams + apply · **bf16** K/Q staging (well-conditioned KK/QK only) · decays/γ**always f32 outside the mma** (invariant, not a rung). Hold **P6 state carry at 3xtf32 longest** (it compounds over every chunk).
**3xtf32 ladder (cheapest→dearest) if default misses the gate:** (1) KK Gram→3xtf32; (2) apply **block-diagonal `T_ii·RHS_i`**→3xtf32 (within-window strong coupling, mixed-precision-by-distance); (3) +**δ=1 off-diagonal** apply→3xtf32 (block-boundary adjacent pairs e.g. tokens 15↔16); (4) **full apply**→3xtf32 (≈+2× apply, expensive escape); (5) KS/QS→3xtf32; (6) fall back to direct blocked back-substitution in 3xtf32, else keep 0031's serial path. **Demote order if the gate has margin:** P4→P3, holding P6 at 3xtf32. If even all-3xtf32 misses, the residual is the f32 diagonal solve (already f32) → not fixable by more mma precision → fall to (6). Record the final rung in `PAGED_BITEXACT_NOTE.md` + README s5.
---
## (4) Slot into 0031's existing framework (opt-in, default-OFF)
Same dispatch site - the `if constexpr (!KDA && !keep_rs_t)` block inside `launch_gated_delta_net` (0031 patch, after `init_fastdiv_values`). Extend, don't replace:
- Keep `GDN_CHUNK_MIN` (token threshold, default `INT_MAX` = off) and `GDN_CHUNK_OFF` (kill switch).
- Add **`GDN_CHUNK_TC`** selector: `0` = 0031 serial-solve chunked (fallback, retained), `1` = tensor-core. Add **`GDN_CHUNK_C` ∈ {16,32,64}** and **`GDN_DV_TILE` ∈ {32,64,128}** for A/B; defaults `C=32, DV_TILE=64` (CONFIG C) for serving, `DV_TILE=32` saturator for n_seqs=1.
- New launcher `launch_gdn_chunked_tc<128, C, DV_TILE>` mirrors `launch_gdn_chunked`: `cudaFuncSetAttribute(...MaxDynamicSharedMemorySize...)` **return-checked** (0031 precedent), `grid = dim3(H, n_seqs, n_slabs)`, `block = dim3(256,1,1)`. Per-slab the kernel recomputes A/A⁻¹/gates (dv-independent), dv-slices S/Ud/O.
- **Default OFF** (`gdn_chunk_min=INT_MAX`) exactly as 0031 ships. Flip the default to on **only when** the M8 A/B shows an S_PP win over the tuned sequential recurrence at the serving regime (n_seqs≥2) **and** the KL gate + adversarial op case hold - recorded in README s5 (dev notes / rejected-flat levers) and `PAGED_BITEXACT_NOTE.md`. Until then it ships like 0031: opt-in, regression-free default.
- Extend the test-backend-ops block 0031 added (the `S_v==128` shapes at lines after :9398) so the tc path is exercised at C=64 and C=32 in CI.
- New per-path md5 acknowledged in the dispatch comment (tc-md5 ≠ serial-chunked-md5 ≠ sequential-md5; all benign, KL-validated).
---
## (5) Top 3 risks that could make it NOT beat sequential + kill-criteria
**Risk 1 - Register pressure forces 1 block/SM (the swing factor).** The ~50 working-regs/thread estimate is optimistic; the A⁻¹ blocked solve (in-register `16×16` diag inverse), the accumulator→B restage transpose, and the O+state transient accumulators can push working regs to 80-120. At 256 threads, >128 regs/thread → >32K regs/block → **silently 1 block/SM regardless of the 44KB shared headroom**, and local-memory spills push BW back. *Mitigation ladder:* (i) 12 warps (dilute state 32→21 regs/thread); (ii) `__launch_bounds__(256,2)`; (iii) smaller `DV_TILE`. **Kill criterion:** if after the full ladder the M8 occupancy build still spills to local OR stays 1 block/SM, **and** the CONFIG-A BW-max 1-block path (C=64, dv64, 96KB, cp.async, 4× state-BW cut) **also** fails to beat sequential S_PP at n_seqs≥2 in the A/B bench → the occupancy lever is dead; keep 0031 serial-chunked behind `GDN_CHUNK_TC=0`, record rejected in README s5.
**Risk 2 - Precision: tf32 (and even all-3xtf32) misses the KL gate in the weak-decay/aligned-keys κ(A) corner.** The inverse amplifies error; κ(A) is data-dependent and grows where keys align and decay is weak. **Detected cheaply at M1** (microbench measures κ(A) + per-rung NMSE on the adversarial case *before* the kernel exists). **Kill criterion:** if at M1 the **top of the ladder (all-3xtf32 + f32 diagonal)** cannot reach f32-grade on `g∈[-20,-1e-4]`, OR at M4+ `KLD(tc‖f16) ≤ KLD(seq‖f16)` fails on that op case at the top rung → the tensor-core solve is not numerically viable as a default; fall to ladder rung (6) (direct back-subst 3xtf32); if that also misses, abandon the tc solve and keep 0031 serial. **Fail-fast:** M1 gates this before any multi-week kernel commitment.
**Risk 3 - Grid starvation at n_seqs=1 is structural (H=32 < the ~96 blocks needed for 2 blk/SM × 48 SM).** Only `DV_TILE=32` (4 slabs) fully saturates a single stream, and it pays ~1.5-2× redundant-A flops (A/A⁻¹/gates recomputed per slab) plus the per-chunk restage. **Kill criterion:** if the M8 bench shows single-stream (n_seqs=1) S_PP is slower than sequential even at full saturation (dv32×4) due to redundant-A + restage overhead, **and** the batched regime (n_seqs≥2) gain also fails to materialize → the lever only helps a regime the target workload doesn't hit → keep default-OFF, ship as opt-in experiment only, record. (If n_seqs≥2 *does* win, ship enabled for the serving regime and gate single-stream back to sequential via `GDN_CHUNK_MIN` + an n_seqs check - a partial, honest win.)
**Overarching kill gate:** the disposition is the bench, not the theory. The kernel flips to default-on only when it beats the tuned sequential recurrence at the serving regime AND clears the KL + adversarial gates. Any milestone that regresses test-backend-ops or md5-stability halts the stack until fixed; M1 and M0 are the cheap fail-fast exits before the expensive kernel work.

View File

@@ -0,0 +1,493 @@
# vLLM Parity Lever Map
> Auto-generated from the parity-exploration workflow. Working artifact (the multi-week path to vLLM parity on prefill + decode, Qwen3.6 NVFP4 / GB10).
## 1. Prefill gap re-audit
I have walked the full prefill forward pass against the committed numbers (final_benchmark.csv, PREFILL_GEMM_SCOPE/RESULTS, the 0042 dense nsys profile, the qwen35moe/delta-net graph source). Here is the re-audit.
---
# PREFILL gap re-audit - Qwen3.6 NVFP4 on GB10
## Grounding (what the gap actually is)
From `docs/final_benchmark.csv`, prefill (S_PP, t/s; patched vs vLLM):
- **Dense 27B**: ~922 vs ~1929-2182 → patched is **44-48%** of vLLM.
- **MoE 35B-A3B** (the decision model): ~1510-2177 vs ~5186-6223 → patched is **29-41%** of vLLM. In us/tok at npl64: llama ~471, vLLM ~169 → **gap ~302 us/tok**.
The GEMM scope's bucket (~232 us/tok llama vs ~68 vLLM) = a **164 us/tok** GEMM difference = **~51-54% of the gap**, and GEMM is **~49% of the llama prefill wall** (232/471). GDN is cited at **~17% of the gap** (vLLM chunked scan ~2.5x cheaper). So GEMM+GDN ≈ **~68% of the gap** by the existing framing - leaving ~30% that the two levers' headline numbers do not name. This audit walks every op to place that residual.
Important structural facts confirmed from source (`models/qwen35moe.cpp`, `delta-net-base.cpp`, `llama-graph.cpp`):
- MoE = 40 layers (interval-4 → **30 GDN + 10 full-attention**), 256 experts top-8, **plus a dense shared expert on every layer**. Dense = 64 layers (48 GDN + 16 attn).
- **Default prefill GDN is NOT a single kernel.** `fused_gdn_ch`/patch-0031 is default-OFF, so prefill runs `build_delta_net_chunking` - a long graph of `ggml_mul`/`mul_mat`/`solve_tri`/`cumsum`/`tri`/`exp` + many `ggml_cont`/`transpose`/`pad`/`repeat` layout copies + a host-side per-chunk loop. The GDN lever (tensor-core fused kernel) is scoped to replace this **entire** decomposition, so the "11% k_bin_bcast op_mul gating muls" the 0042 patch calls "a separate lever" are in fact **inside the GDN bucket** (a fused GDN kernel subsumes them).
## Prefill op-share table (MoE decision model; % of the patched/llama prefill wall)
Estimates triangulated from the committed numbers (232/68 GEMM, 11%/5% from the 0042 dense nsys, the gap arithmetic), not a fresh nsys run.
| Op (prefill) | ~% of llama wall | vLLM faster? why | Covered by GEMM lever | Covered by GDN lever |
|---|---:|---|:---:|:---:|
| Token embed (`get_rows`) | <1% | tie | - | - |
| **NVFP4 weight GEMMs** total | **~49%** | **Yes** - vLLM W4A16-Marlin/cutlass large-M tiles + async pipeline vs MMQ small-tile / new FP4-MMA at 57.7% of peak | **YES** | - |
| ┝ routed-expert grouped GEMM (gate_up+down, `mul_mat_id`) | ~28% | yes (biggest single bucket) | yes | - |
| ┝ shared-expert dense GEMMs (all tokens, ×40) | ~9% | yes | yes | - |
| ┝ GDN in/out projections (wqkv, wqkv_gate, ssm_out) | ~7% | yes | yes | - |
| ┝ attention QKV/O projections (×10) | ~5% | yes | yes | - |
| **GDN chunked decomposition** (30 layers) | **~22%** | **Yes** - vLLM chunked scan ~2.5x cheaper (tensor-core intra-chunk vs llama's f32 graph ops + layout copies + host loop) | - | **YES** |
| ┝ gating/decay muls (`k_bin_bcast op_mul`) | ~11%* | yes | - | yes (fused kernel absorbs) |
| ┝ small f32 mul_mats + `solve_tri` + cumsum/tri/exp | ~7% | yes | - | yes |
| ┝ layout `cont`/`transpose`/`pad`/`repeat` copies | ~4% | yes | - | yes |
| **FlashAttention prefill** (QK^T·softmax·PV, 10 layers) | **~3-6%**† | maybe - L²-growing; bounded at npp=128, larger at serving context | **NO** | **NO** |
| **MoE router + combine/scatter** | **~5-8%** | **Yes** - vLLM fuses gather/weight/scatter into the grouped-GEMM epilogue | **NO** | **NO** |
| ┝ `argsort_top_k`(256→8) + softmax + weight-norm | ~2-3% | yes | no | no |
| ┝ combine: 7× fp32 `add` + weight `mul` (×40) | ~3-5% | yes | no | no |
| **Activation quantization** (W4A4 e4m3 pass per GEMM) | **~3-6%** | **Yes - structurally**: vLLM W4A16-Marlin on GB10 has **no** activation-quant step | **NO**‡ | partial |
| Norm + residual tail (attn/post/q/k/ssm/l2/out + adds) | ~4% | small (0042 fused the main one) | - | - |
| RoPE + sigmoid/silu gates + scale | ~2-3% | small | - | - |
| LM head (last-token only in prefill) | <1% | tie | - | - |
\* 0042 dense profile; in MoE the relative share is a bit lower (MoE FFN is heavier). † grows quadratically - under-weighted at the benchmark's npp=128; re-measure at real serving lengths. ‡ the quant pass feeds the GEMM but is a *separate kernel*, not inside the GEMM-lever's mul_mat bucket.
## Verdict: GEMM + GDN are the two dominant buckets but NOT the whole gap
They cover ~71% of the prefill wall and the bulk of the gap. Three contributors are **materially uncovered** by either lever:
### Newly-identified lever 1 - MoE router + combine/scatter (the strongest miss on the decision model)
llama runs the expert routing and recombination as **separate memory-bound ggml ops**: `argsort_top_k` over 256 experts, softmax/normalize, then a fan-in of **7 fp32 `ggml_add` + a weight `ggml_mul`** per MoE layer (`llama-graph.cpp` ~1797-1824), every one of 40 layers. vLLM's fused-MoE (and Marlin grouped) path folds the gather, the router-weight multiply, and the scatter-accumulate into the **GEMM epilogue/prologue** - so this is overhead vLLM essentially does not pay. Est. ~5-8% of the MoE prefill wall, entirely outside GEMM (the `mul_mat_id` is covered; the surrounding argsort/adds/mul are not) and outside GDN. **Lever: a fused top-k-weighted expert-output accumulation (or a fused-MoE epilogue), removing the 7-add fan-in and the separate weight mul.** Bit-exact-gateable (it is an fp32 reduction-order change, same precedent as the paged-MoE `8cb0ce23`).
### Newly-identified lever 2 - the W4A4 activation-quant pass (a vLLM-asymmetry, not just a kernel-speed gap)
Every NVFP4 GEMM (MMQ today, and the new 0034 FP4-MMA) **quantizes activations to e4m3 (amax/6 + code search) before the matmul** - a distinct, M-proportional kernel. vLLM on **sm_121 falls back to W4A16-Marlin** (the TENSORCORE_GDN_SCOPE confirms this: no tcgen05/cutlass-FP4 on GB10), i.e. **f16 activations, zero activation-quant**. So this pass (~3-6% of prefill) is a structural cost vLLM avoids, and it explains part of why even a peak FP4-MMA GEMM will not fully reach vLLM's prefill. The README's "act-quant FLAT" and "W4A16 rejected" verdicts are **decode/BW-bound findings**; in compute-bound prefill the trade is different and unaudited. **Lever: measure this quant bucket as its own nsys row; consider fusing the activation-quant into the GEMM prologue (cp.async + in-register quant) so it is not a separate global-memory pass.**
### Flag 3 - FlashAttention prefill (context-dependent, currently under-measured)
The 10-16 full-attention layers' QK^T·softmax·PV is a separate kernel covered by neither lever. It is small at the benchmark's npp=128 but **grows as L²**; at the long contexts the decode-serving work targets it can become a real bucket. The whole prefill ground-truth (232/68) was taken at one ubatch size - **re-profile FA share at the real serving prefill lengths** before assuming it is negligible.
### Confirmed inside the existing levers (not new)
- The 0042 "11% gating muls" and all the GDN small-matmuls/`solve_tri`/cumsum/layout-conts are **inside the GDN bucket** - the tensor-core GDN kernel subsumes them; they are only "live and uncovered" *today* because patch 0031 is default-off and losing at C=16.
- Shared-expert dense GEMMs, GDN/attention projections = **GEMM lever** (the FP4-MMA 0034 path already routes them).
## Bottom line
Two prefill levers (GEMM, GDN) are correctly the top-2 and own ~the gap's majority, but they are **not** the whole gap. The op-walk surfaces **MoE router+combine/scatter** and the **W4A4 activation-quant pass** as genuine, currently-untracked prefill contributors on the MoE decision model (~8-14% combined), plus **FA prefill** as a context-dependent risk the npp=128 bench hides. Per the methodology, step 0 is an nsys prefill-only window that explicitly breaks out `argsort/add(combine)`, `quantize_mmq_nvfp4`, and `flash_attn` as separate rows to size these three before funding a kernel.
Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).
## 2. Decode-serving compute hypotheses (ranked)
RANKED DECODE-SERVING GPU-COMPUTE HYPOTHESES (paged llama.cpp vs vLLM, MoE Qwen3.6-35B-A3B-NVFP4 on GB10)
Grounding facts that constrain the ranking:
- The gap is empirically MoE-specific: dense static is parity-to-ahead, MoE static is 89-93% of vLLM, but MoE *burst* serving is ~66% (n=128: paged 4.53 vs vLLM 6.87 tok/s/seq). So whatever degrades is on a path that hurts MoE far more than dense.
- It is GPU-compute-bound, NOT host/reuse-bound: padded-shape lever rejected, baseline reuse 0% statistically equal to S1+S3 reuse 72% on aggregate tok/s, hostproc only 4-8% of wall. So the host loop (0040/0041/S2) is closed; the residual lives in per-step kernel time.
- The decode KERNELS tie vLLM at a fixed WIDE lockstep shape (static batched-bench). The serving loss is therefore about how a RAGGED/NARROW/fluctuating live batch (varying decoder count D, ragged KV lengths, ragged token->expert assignment) feeds those same kernels, vs how gracefully vLLM's kernels degrade at the same concurrency. This is exactly the Phase-0 "re-scope" branch in DECODE_SERVING_SCOPE.md ("serving runs a worse effective batch shape into the kernels").
Decisive measurement that arbitrates all of these (run first): nsys a clean steady-state serving window (serve_bench staggered ~128 clients through llama-server, LLAMA_KV_PAGED=1 + LLAMA_MOE_FORCE_GRAPHS=1, -fa on -ngl 99) AND the same nsys on vLLM at the same concurrency (both-engine rule). Decompose per-step GPU-kernel-time into buckets {MoE-expert-GEMM (MUL_MAT_ID), full-attn FA, GDN recurrence, bf16 projections, activation-quant, sampling/logits} and compare serving-narrow vs static-wide vs vLLM. The bucket whose per-useful-token time grows MOST going static->serving (relative to vLLM's same bucket) is the gap. Avoid the known window artifact; measure a steady span. Reference doc: backend/cpp/llama-cpp-localai-paged/docs/DECODE_SERVING_SCOPE.md.
---
H1 (TOP) - MoE expert GEMM collapses to per-expert GEMV at ragged/narrow serving width, plus risk of the host-sync sorted per-expert fallback.
- Mechanism: top-8 of 256 experts. Tokens/expert ~= D*8/256. Static npl128 -> ~4 tok/expert; serving burst-tail D->8 -> ~0.25 tok/expert, so most active experts get 0-1 tokens. The grouped MMQ id-GEMM's per-expert M collapses to 1 -> pure GEMV that reads the full FP4 expert weight (memory-bound, weight bytes unamortized) and re-loads per-expert scales. This is the "256 tiny-expert weight bandwidth" README s5 names as the residual. Separately, patch 0025 only keeps CUDA graphs on for the should_use_mmq grouped path; any serving step where MUL_MAT_ID ne[2]>8 (mmvq_mmid_max) AND should_use_mmq returns false falls to the per-expert host-loop fallback that cudaStreamSynchronizes per expert (the [TAG_MUL_MAT_ID_CUDA_GRAPHS] disable) - catastrophic, and serving's varying per-step shapes can trip it unevenly.
- Why slower than vLLM: vLLM runs ONE fused MoE GEMM with sorted_token_ids/expert_ids computed on-GPU (fused_moe / Marlin-MoE), a single persistent launch that keeps the grouped GEMM dense and amortizes launch + scale loads; it degrades gracefully at small M. llama issues a grouped MMQ that, at ragged narrow width, is many near-empty expert tiles each re-reading scales, and can drop to a host-synced loop.
- nsys metric to confirm: (a) MUL_MAT_ID kernel-time as % of per-step GPU wall, static-wide vs serving-narrow vs vLLM; (b) the tokens-per-expert (M) distribution per step - look for M->1 GEMV collapse and achieved FLOP/s vs M; (c) count cudaStreamSynchronize / per-expert cudaMemcpy *between* MUL_MAT_ID launches per step (host-sync fallback firing); (d) vLLM single fused-MoE kernel duration at same concurrency.
- Candidate fix: a fused grouped-NVFP4 MoE decode GEMM with on-GPU token sorting (device-computed sorted token offsets + expert ids) so all active experts share one persistent launch and scales amortize - i.e. port vLLM's fused-MoE dispatch shape onto the FP4-MMA MMQ id-path; as a floor, extend 0025 to GUARANTEE the grouped should_use_mmq path for every serving shape so the host-sync loop never fires. Bit-exact-gateable (graph-replay/grouped path re-issues identical kernels).
H2 - Paged full-attention decode kernel: ragged-KV load imbalance, no tensor cores, indirect block-table reads.
- Mechanism: the 16 full-attn layers run the paged block-table FA decode, pinned by the 0010/0011 dispatch guard to vec/tile and NEVER the mma/wmma tensor-core FA (a present block table routes only to vec/tile; tile loads half2, F16 cache only). Static bench: all sequences one KV length -> balanced. Serving: KV lengths are ragged (each request at a different position), so per-sequence attention work is imbalanced across the grid and the step waits on the longest-context tail; there is no KV-dimension split. Every K/V access is an indirect physical-cell load via the block table (gather-like), less coalesced than a contiguous read.
- Why slower than vLLM: vLLM PagedAttention v2 uses a split-K / partitioned reduction designed for ragged long contexts (flash-decoding style) that balances work and lifts occupancy on the tail, and keeps the contiguous-within-block layout. llama's vec/tile paged read has no KV split and leaves tensor cores idle on the full-attn layers.
- nsys metric to confirm: FA-decode (vec/tile) kernel duration vs KV-length VARIANCE across the live batch (does it scale with max-KV/tail rather than mean-KV?); tensor-core-active-% during FA layers (expect ~0); achieved memory-BW of the FA kernel under ragged KV; vLLM paged-attn kernel time + util at same concurrency.
- Candidate fix: a KV-split (flash-decoding / split-K) paged FA decode so long sequences are partitioned across blocks for balance + occupancy; longer term a tensor-core paged FA for the full-attn layers (mma.sync down-translation, same approach as the GDN tensor-core scope). At minimum a per-sequence work-balanced launch.
H3 - GDN/SSM recurrence decode kernel under-occupied at narrow/variable serving width.
- Mechanism: patch 0022 tuned the recurrence (NUM_WARPS=16, COLS_PER_WARP=8, grid.z = S_v/(NW*CPW)) for the WIDE B=128 lockstep batch; its DRAM-latency coverage / MLP needs ~128 independent sequence-states in flight, and it is bandwidth-bound (re-streams the 128x128 f32 state per sequence per step at 84.6% of peak BW *at B=128*). In serving D fluctuates and collapses in the burst tail; at low D the kernel is grid-starved (few independent states), achieved-BW falls below the tuned point and per-token state traffic rises - the same grid-starvation failure mode the chunked-prefill kernel hit at low n_seqs. Plus the serial-SSM host loop (README s2d/s5 structural floor) is amortized over fewer tokens.
- Why slower than vLLM: vLLM's fused_recurrent_gated_delta_rule + its scheduler keep the recurrence fed at small batch; llama's fixed B=128-tuned launch params under-saturate when D is small.
- nsys metric to confirm: gated_delta_net kernel achieved-BW (GB/s) and occupancy as a function of live D in serving vs the static 84.6%@B128 baseline; recurrence kernel time/token vs D; grid occupancy at the burst tail.
- Candidate fix: width-adaptive recurrence launch params - auto-select NUM_WARPS/COLS_PER_WARP (already env GDN_NW/GDN_CPW) by live D so the grid stays saturated at narrow width; bit-exact-safe (0022's column assignment is provably independent of visit order). Longer term the chunked/register-resident state scan cuts state traffic.
H4 - Continuous-batch ragged-shape overhead: every kernel sized to the batch union/max; bf16 projections become GEMV at narrow D (umbrella + the "bf16-projection bandwidth" half of README's stated residual).
- Mechanism: ragged positions/lengths/expert-assignments mean each per-step kernel is launched for the max/union over the live batch, so useful-token efficiency < lockstep. This is the shared root of H1-H3 but is worth isolating because it also covers the q/k/v/gate/o projections (deliberately kept bf16, per README s5) which at narrow D become GEMV-like memory-bound weight reads - the "bf16-projection bandwidth" residual vLLM also pays but amortizes over a steadier batch.
- Why slower than vLLM: vLLM's scheduler holds a steadier/denser decode batch (padded bucketed decode + chunked-prefill interleave) so its projection/attn GEMMs run at higher effective M; llama's batch width fluctuates more.
- nsys metric to confirm: GPU-busy% in a steady serving window vs static (expect lower in serving) and (sum useful-token FLOPs)/(kernel-time) serving vs static; bf16 projection GEMM achieved FLOP/s vs M (GEMV collapse at small D).
- Candidate fix: largely subsumed by fixing H1-H3 at the kernel level. Note: holding D high via admission was effectively probed by the padded-shape lever and REJECTED for throughput (the completion-driven shrink is itself a per-survivor win); so do NOT re-pursue width-padding - the payoff is in the per-kernel fixes.
H5 - Per-step sampling + logits handling across D independent sequences (low, cheap to exclude).
- Mechanism: each live sequence has its own sampler chain run after logits land; at narrow D this fixed per-step cost (+ any D2H logits copy) is amortized over fewer tokens. vLLM batches sampling on-GPU across the whole decode batch.
- nsys metric to confirm: sampling/logits-copy time as % of per-step wall serving vs static; D2H logits cudaMemcpy size+time; count of per-sequence sampler launches.
- Candidate fix: single on-GPU batched sampler over [D, vocab], no per-sequence D2H. Likely small on the greedy/temp0 path; profile mainly to exclude.
H6 - NVFP4 activation (re)quantize fixed per-step cost at small M (lowest; likely a wash vs vLLM, no bit-exact lever).
- Mechanism: W4A4 quantizes activations to FP4 each step; 0023 de-dups the per-expert re-quant but the base quantize + scale compute is a per-row pass amortized over fewer tokens at narrow D. README lever-3 records the act-quant tax is removable only by W4A16 (rejected) or a kernel rewrite.
- nsys metric to confirm: quantize-kernel time as % of step vs M, serving vs static, vs vLLM (which also FP4-quantizes activations -> expected wash).
- Candidate fix: none bit-exact; include only so the profile can rule it out.
---
Priority for the profiling workflow: profile the MoE model (the gap model) first; bucket per-step GPU-kernel-time and A/B serving-narrow vs static-wide vs vLLM. Expect the ranking to be dominated by H1 (MoE expert GEMM at collapsed M + any host-sync fallback) and H2 (ragged-KV FA on vec/tile), with H3 (recurrence occupancy at narrow D) third; H4 is the shared root, H5/H6 are exclusion checks. The single most decisive metric is the per-step time share of MUL_MAT_ID vs FA vs gated_delta_net in serving-narrow relative to vLLM's equivalents at the same concurrency.
## 3. vLLM pipeline deep-dive
I have what I need. Synthesizing the vLLM pipeline against our patch series.
---
# vLLM vs `llama-cpp-localai-paged`: kernel + scheduler deep-dive (Qwen3-style hybrid SSM+attn MoE, NVFP4, consumer Blackwell sm_121)
## The single most important grounding fact (changes how every GEMM row reads)
On **consumer Blackwell (sm_120/sm_121: DGX Spark/GB10, RTX 5090, RTX PRO 6000)** the native CUTLASS NVFP4 grouped-GEMM path is **broken** (TMA-WS grouped-GEMM init failure, CUTLASS #3096) and there is **no `tcgen05`/TMEM**. So vLLM on *this exact hardware* does **not** run a native FP4-MMA grouped GEMM - it **falls back to the Marlin BF16 kernel that dequantizes FP4->BF16 in-register**, capped at bf16-tensor-core peak (~half FP4 peak). Native FP4 (W4A4/tcgen05) and the best FlashInfer/TRT-LLM kernels are gated to **data-center Blackwell sm_100a**. This means several "vLLM advantages" assumed for B200 do **not** hold on GB10, and our native FP4-MMA path (the just-verified 103 TFLOP/s = 57.7% of FP4 peak GEMM) is potentially *ahead of* vLLM's Marlin-bf16 fallback on this part - the opposite of the usual framing.
## Comparison table
| # | Component | vLLM (this model class, sm_121 reality) | Ours (`llama-cpp-localai-paged`) | Regime | Verdict / gap |
|---|---|---|---|---|---|
| 1 | **Dense weight GEMM - decode** (M≤128, BW-bound) | Marlin FP4→bf16 in-register dequant (W4A4 broken→fallback); reads 4-bit weights | Native FP4-MMA MMQ (FP4 wt × Q8_1 int8 act), M≤128 tile | decode | **Parity** - both at FP4 weight-BW floor. Ours ~96-97% of vLLM, ahead at low concurrency |
| 2 | **Dense weight GEMM - prefill** (large-M, compute-bound) | Marlin grouped/dense, async cp.async pipeline, big tiles, ~bf16 peak | MMQ small-tile, 1 CTA/SM. **New native FP4-MMA large-M kernel @103 TFLOP/s being integrated** (beats cuBLAS-bf16, bit-exact) | prefill | dequant→bf16-cuBLAS lever (0033) was **rejected** (MMQ beat it 29-49%); the native FP4-MMA kernel is the real fix and could **beat** vLLM's bf16-Marlin here |
| 3 | **MoE expert GEMM - decode** | Marlin FP4→bf16 grouped, indirect addressing | Grouped MMQ (`mul_mat_id`), sorted expert layout, native FP4-MMA | decode | **Parity** - both BW-floor. Recurrence/GEMM are *our wins*; residual = bf16-projection BW + host loop |
| 4 | **MoE expert GEMM - prefill** | Marlin grouped GEMM, fused, big tiles | MMQ small-tile grouped (1 CTA/SM) | prefill | **GAP (#1 prefill bottleneck per docs).** Native FP4-MMA grouped kernel is the planned fix; today MMQ is small-tile-bound |
| 5 | **MoE routing / gather / scatter / epilogue** | Triton persistent fused-MoE: indirect token addressing, **fused gate+up + SwiGLU epilogue**, once-quantize, scatter+weighted-combine fused | Sorted per-expert layout; **NVFP4 act-quant de-dup (0023)** mirrors once-quantize; SwiGLU is **separate ops** (no fused epilogue) | both | Partial parity. **No fused gate+up+SwiGLU epilogue** (extra IO passes); matters at prefill, minor at decode |
| 6 | **GDN / linear-attn - decode** | FLA Triton `fused_recurrent_gated_delta_rule` + `fused_sigmoid_gating_delta_rule_update` (sequential, per-step state) | Fused sequential recurrence: in-place state write-back (0018), fused state gather (0019), o_proj MMVQ→MMQ (0020), occupancy retune (0022), conv-tap gather fusion (0028) | decode | **Parity-to-win** - recurrence runs at **102.6% of vLLM bandwidth**, 84.6% of GB10 peak BW. Our strongest area |
| 7 | **GDN / linear-attn - prefill** | FLA `chunk_gated_delta_rule`: intra-chunk products on **tensor cores** (UT-transform), ~2.5× cheaper | Tuned **sequential** scan (default); chunked parallel-scan (0031) is **opt-in + ~22% slower** (serial f32 reductions, no TC, C=16 forced by 99KB smem) | prefill | **GAP (#2 prefill bottleneck).** No tensor-core chunked GDN. Scoped (TENSORCORE_GDN_SCOPE, mma.sync only); **Gram products de-risked at 6.7-9.3× over sequential**, kernel not yet built |
| 8 | **Causal conv1d (short conv)** | FLA `causal_conv1d_fn`/`_update` Triton | `ggml_ssm_conv_update_inplace` (0021): 5-op chain → 1 op, in-place ring | both | Parity |
| 9 | **Full-attention - decode** (16 of 64 layers) | FlashInfer / TRT-LLM paged decode (tensor-core, cascade wrapper, FP8-KV capable) | llama.cpp FA `ggml_flash_attn_ext` with **block-table paged read** (src[5]); routed to **vec/tile** kernels | decode | Parity at decode width (vec/tile is right for small batch) |
| 10 | **Full-attention - prefill** (large-M) | FlashInfer/TRT-LLM tensor-core prefill FA | **Forced to vec/tile** (block-table only grafted into vec/tile; mma/wmma FA ignores it, dispatch-guarded off) | prefill | **GAP (secondary).** Paged prefill full-attn gets **no tensor-core FA**. Docs rank it below MoE-GEMM/GDN, so not the dominant prefill term |
| 11 | **Paged KV manager (full-attn)** | vLLM block manager + hybrid KV cache manager (co-sizes attn/linear blocks to equal physical bytes, anti-fragmentation) + auto prefix caching | `PagedKVManager` (FreeBlockQueue/BlockPool/COW), cross-request prefix sharing, burst-reclaim (0024) | both | **Parity** on the attn side; we lack vLLM's *unified* hybrid co-sizing (we manage SSM state separately - see #12) |
| 12 | **Hybrid SSM-state cache mgmt** | Unified hybrid manager pages linear-attn state alongside attn KV | SSM recurrent + conv state in fixed per-seq slots, updated **in-place** (not paged; O(1)/seq) | both | Different approach, not a perf gap (recurrent state doesn't need paging); we lack unified fragmentation accounting |
| 13 | **Sampler** | **GPU FlashInfer sorting-free sampler** (Dual-Pivot rejection sampling, single kernel, no logits sort, ~0 overhead); RejectionSampler for spec-decode | llama.cpp **host-side** sampler chain (CPU partial-sort for top-k/p) | serving | **GAP - NO EQUIVALENT.** Host sampler + D2H logits adds to the per-step host loop at high concurrency (greedy md5 bench hides it) |
| 14 | **Scheduler / continuous batching / chunked prefill** | V1: mixed prefill+decode step, **chunked prefill default-on**, decode-prioritized `max_num_batched_tokens` budget, auto-chunk | `update_slots()` unified step, **decode-first dynamic budget** (0016, `max(n_ubatch,TD)`), prefill budget (0013), prefix-share (0008) | serving | **Parity** - we match the chunked-prefill + decode-first token-budget design |
| 15 | **CUDA graphs - decode** | **FULL cudagraph**: padded/bucketed decode shapes → 1 persistent captured graph per bucket → steady decode = single `cudaGraphLaunch`, zero host rebuild | S1+S3 (0040/0041) graph **reuse** keyed on bucketed block-table dims + decode-shape-stable scheduling → serving reuse 0%→**72.2%** | serving | **Partial.** We reuse, not full-capture. **Padded/fixed-slot decode (→~100% like vLLM) was built + GPU-tested + REJECTED** - serving decode here is GPU-compute-bound, so dummy-row compute > reuse recovered |
| 16 | **CUDA graphs - prefill** | PIECEWISE cudagraph (default FULL_AND_PIECEWISE) | ggml graph rebuild per prefill step (paged data-ptr churn) | prefill | Gap, low value (prefill is compute-bound; launch overhead amortized over large M) |
| 17 | **Speculative decoding / MTP** | **MTP head + EAGLE-style spec-decode** supported for this model class (Qwen3-Next ships an MTP module) | **None** | decode | **GAP - NO EQUIVALENT.** Biggest *unexploited* decode-throughput lever vLLM has and we don't (potential ~1.5-2× at low-medium concurrency) |
| 18 | **KV-cache dtype** | FP8 KV cache + FP8 attention (halves KV BW) | F16 paged KV | both | Minor gap; partly offset by our overall 1.5-3× lower memory (NVFP4 weights). FP8-KV would cut KV BW further |
## Gaps where we have NO equivalent (ranked by value)
1. **Speculative decoding via the MTP head (#17).** Qwen3-Next/3.6 ships a Multi-Token-Prediction module; vLLM exploits it for spec-decode. We have nothing. This is the single largest *structural* decode-throughput lever vLLM has that is **completely absent** from our series - and unlike the kernel gaps it is not BW-floored. Highest-value greenfield item.
2. **Tensor-core chunked GDN prefill (#7).** vLLM's FLA `chunk_gated_delta_rule` pushes intra-chunk Gram products through tensor cores (~2.5× cheaper prefill). Our 0031 chunked kernel is opt-in and 22% *slower* (serial f32 reductions). Scoped (mma.sync-only on sm_121, no wgmma/tcgen05), Gram products de-risked at 6.7-9.3×, kernel not built. One of the two named prefill bottlenecks.
3. **Large-M native FP4-MMA grouped MoE GEMM (#4).** The #1 prefill bottleneck. vLLM uses Marlin-bf16 grouped (capped at bf16 peak on sm_121); our MMQ is small-tile/1-CTA-bound. The new native FP4-MMA GEMM (103 TFLOP/s, beats cuBLAS-bf16) is the integration that closes this - and because vLLM is bf16-Marlin here, a working native FP4 grouped kernel could *exceed* vLLM on this exact hardware.
4. **GPU fused sorting-free sampler (#13).** vLLM samples on-device (FlashInfer Dual-Pivot rejection, no logits sort); llama.cpp samples on host. Adds to the serving host loop at 128-way concurrency for top-k/p workloads. No GPU-sampler equivalent in the series.
5. **Fused MoE SwiGLU epilogue (#5).** vLLM fuses gate+up+SwiGLU into the grouped-GEMM epilogue (fewer IO passes). We have the act-quant de-dup (0023) but run SwiGLU as separate ops. Prefill-relevant, decode-minor.
6. **Tensor-core FA for the paged prefill full-attn path (#10).** Paged forces vec/tile (mma FA ignores the block table). Secondary - docs rank it below #2/#3 in the prefill budget.
7. **FP8 KV cache / FP8 attention (#18).** Minor; partly offset by our NVFP4 memory lead.
## Where we are at or ahead of vLLM (not gaps)
- **GDN decode recurrence (#6):** 102.6% of vLLM bandwidth - our fusion series (0018-0022, 0028) is the strongest area.
- **Decode weight GEMMs dense+MoE (#1, #3):** at the FP4 weight-BW floor = parity; dense ahead at low concurrency. The residual MoE serving gap (~66% at n=128 burst) is a **GPU-compute** gap (vLLM's MoE decode kernel+scheduler ~1.3× on aggregate), **not** a host-loop gap that a graph-reuse/padding lever can close (padded-shape lever proved this, rejected 2026-06-28).
- **Memory:** 1.5-3× lower than vLLM (NVFP4-resident, no persistent bf16 dequant copies).
- **Scheduler design (#14):** chunked-prefill + decode-first budget matches vLLM's V1 model.
## Net assessment
Our **decode kernels are at parity-to-ahead** (GDN recurrence, both FP4 GEMMs at BW floor) - confirmed in the kernel regime. The two real, *named-in-docs* **prefill** gaps (MoE grouped GEMM #4, tensor-core chunked GDN #7) are being actively closed with the native FP4-MMA GEMM + the de-risked tensor-core Gram products; on consumer Blackwell specifically these can match-or-beat vLLM because vLLM is itself on a **bf16-Marlin fallback**, not native FP4. The two gaps with **no equivalent in the series at all** are **MTP speculative decoding** (highest-value, structural, decode) and the **GPU fused sampler** (serving host-loop, secondary). The serving-decode residual is GPU-compute-bound (not host/graph-reuse), so vLLM's edge there is its faster MoE decode kernel + scheduler, not something a host-side lever recovers.
---
Files read (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/README.md`, `docs/DECODE_SERVING_SCOPE.md`, `docs/PREFILL_GEMM_SCOPE.md`, `docs/PREFILL_GEMM_RESULTS.md`, `docs/TENSORCORE_GDN_SCOPE.md` (same dir).
Sources:
- [vLLM Now Supports Qwen3-Next (FLA Triton kernels, hybrid KV manager, MTP)](https://blog.vllm.ai/2025/09/11/qwen3-next.html)
- [CUTLASS #3096 - SM120 NVFP4 MoE grouped GEMM broken, FlashInfer/Marlin fallback](https://github.com/NVIDIA/cutlass/issues/3096)
- [vLLM Quantization Kernels (NVFP4 W4A16/W4A4, Marlin, Machete)](https://deepwiki.com/bytedance-iaas/vllm/11.4-quantization-kernels)
- [SM120 NVFP4 MoE perf report - Marlin bf16 fallback on consumer Blackwell](https://discuss.vllm.ai/t/sm120-rtx-pro-6000-nvfp4-moe-performance-report-qwen3-5-397b/2536)
- [vLLM Attention Backends - FlashInfer/TRT-LLM default on Blackwell](https://docs.vllm.ai/en/latest/design/attention_backends/)
- [vLLM FLA fused_recurrent_gated_delta_rule](https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fla/ops/fused_recurrent/)
- [vLLM Fused MoE Kernel Features](https://docs.vllm.ai/en/latest/design/moe_kernel_features/)
- [vLLM scheduling - chunked prefill, decode-first budget, FULL_AND_PIECEWISE cudagraph](https://docs.vllm.ai/en/stable/configuration/optimization/)
- [FlashInfer sorting-free GPU sampling (Dual-Pivot rejection)](https://flashinfer.ai/2025/03/10/sampling.html)
- [vLLM #11394 - FlashInfer sampling kernel in V1](https://github.com/vllm-project/vllm/pull/11394)
- [vLLM #42960 - batch-invariant GDN_ATTN for Qwen3-Next/Qwen3.6](https://github.com/vllm-project/vllm/issues/42960)
## 4. Novel levers
I've grounded myself in the four scope docs, the README patch table + benchmarks (final_benchmark.csv), the methodology doc, and the 0034 FP4-MMA / 0042 fused-residual patch headers. Verified state: prefill is the biggest gap (dense ~920 vs vLLM ~2000 t/s ≈ 44-46%; MoE ~2177 vs ~5300-6223 ≈ 35-41%); decode kernel at parity; serving decode ~65% and measured GPU-compute-bound (host/graph-reuse + padded-shape proved neutral-or-worse). Already-explored/rejected: dequant→bf16 cuBLAS (0033, rejected), bf16-tau (dropped), NVFP4 projections (KL-fail), W4A16-Marlin (rejected), graph coverage (flat), act-quant fusion on decode (flat), padded-shape decode (rejected). Below are levers that go beyond those.
---
# Candidate-lever brainstorm: closing the vLLM gap (paged Qwen3.6 NVFP4, GB10 sm_121a)
Organized by where the verified gap actually is. For each: mechanism / expected gain / gate (bit-exact vs KL) / risk / effort-reward. "Profile-gated" = run Phase-0 nsys before building, per the methodology.
## A. PREFILL (the largest gap, 35-46% of vLLM) — highest reward bucket
### A1. Graph-safe ragged grouped FP4-MMA MoE kernel (remove the per-expert host-sync loop)
- **Mechanism:** 0034 lands the native FP4-MMA dense kernel but routes MoE prefill through the *per-expert host-sync loop* (a `cudaStreamSynchronize` per expert per layer — e.g. dozens-to-hundreds of syncs/layer). Replace it with ONE ragged/grouped FP4-MMA launch over the existing `expert_bounds`/`ids_dst` sorted layout (variable M per expert, single kernel). This is the follow-up 0034 itself flags.
- **Gain:** HIGH. MoE expert GEMM is named the #1 prefill cost; this both removes the serial host syncs and unlocks kernel overlap + graph capture. The single biggest remaining prefill lever after 0034.
- **Gate:** bit-exact by construction (same FP4 math, same K-order as the per-expert path) → greedy md5.
- **Risk:** medium-high (ragged tiling + boundary handling, graph-safety).
- **Effort/reward: HIGH effort / HIGH reward.** The flagged 0034 follow-up; rank #1 for prefill.
### A2. Multi-stream expert dispatch (cheap stepping-stone to A1)
- **Mechanism:** before writing the full ragged kernel, run the independent per-expert FP4-MMA GEMMs on N CUDA streams instead of the serial host-sync loop, overlapping their LPDDR5x weight reads + tensor-core work.
- **Gain:** medium (partial overlap; recovers some of the serial-sync stall without the kernel rewrite).
- **Gate:** bit-exact (same kernel, reordered launches) → greedy md5.
- **Risk:** medium (stream/event mgmt, not graph-safe — prefill isn't graph-replayed so OK).
- **Effort/reward: LOW-MED effort / MED reward.** Bank this before A1.
### A3. Fuse MoE router → token-gather/scatter → GEMM (permutation fusion)
- **Mechanism:** vLLM/SGLang fuse routing→permute→grouped-GEMM→unpermute. Here the activation gather (into the sorted-expert layout) and the scatter-back are separate memory passes. Read activations through `ids_dst` in the GEMM prologue and write through the inverse permutation in the epilogue → removes two full activation memory passes per MoE layer.
- **Gain:** medium for prefill (large activation tensor); smaller for decode (0019/0028 already fuse the decode gather).
- **Gate:** bit-exact (index indirection only, same values) → greedy md5.
- **Risk:** medium (epilogue indexing correctness).
- **Effort/reward: MED / MED.** Pairs naturally with A1's kernel.
### A4. Fused MoE FFN (up_proj → SiLU → down_proj, intermediate register/shared-resident)
- **Mechanism:** keep the per-expert intermediate activation in shared/registers across up→act→down instead of round-tripping it to global. For large-M prefill the intermediate is big → a real BW save; also helps decode.
- **Gain:** medium-high (removes one full intermediate read+write per expert per layer).
- **Gate:** bit-exact if SiLU + accumulation order preserved → greedy md5 (else KL-gate).
- **Risk:** HIGH (fused FP4 FFN kernel is complex; register pressure on sm_121a).
- **Effort/reward: HIGH / MED-HIGH.** Strong but expensive; sequence after A1.
### A5. Activation-quant fusion into the 0042 residual/RMSNorm epilogue (prefill)
- **Mechanism:** the README's "act-quant fusion FLAT" verdict was *decode-only*. For prefill the W4A4 activation-quantize pass is a bigger tensor. 0042 already fuses residual-add+RMSNorm+mul; extend its epilogue to emit the FP4-quantized activation the next GEMM consumes, removing a dedicated act-quant read+write.
- **Gain:** low-medium for prefill.
- **Gate:** bit-exact (same `quantize_mmq_nvfp4` math, just fused) → greedy md5.
- **Risk:** medium (epilogue + the FP4 codepath coupling).
- **Effort/reward: MED / LOW-MED.** Cheap-ish add-on once 0034/A1 are in.
### A6. Stream-K / split-K for the FP4 prefill GEMM (SM occupancy on few-SM GB10)
- **Mechanism:** GB10 has relatively few SMs. For layers whose output grid (⌈M/128⌉×⌈N/128⌉) is smaller than the SM count, SMs idle. Stream-K splits the K dimension across CTAs with a reduction, keeping all SMs busy.
- **Gain:** medium for small-output-grid layers (profile-gated — only if 0034's grid under-fills the GPU).
- **Gate:** bit-exact if the f32-accumulate reduction order is fixed/deterministic; otherwise KL-gate.
- **Risk:** medium (reduction correctness, workspace).
- **Effort/reward: MED / MED.** Complements 0034; profile first.
### A7. Prefill CUDA-graph capture (follow-on to A1)
- **Mechanism:** with fixed prefill chunk size (0013/0016 budgets already exist) and A1 removing the host-sync MoE loop, the whole prefill chunk becomes graph-capturable.
- **Gain:** LOW marginal — prefill kernels are large so launch overhead is amortized; the value is mostly *enabling* it (which A1 already does). Record as low-reward, not a standalone lever.
- **Gate:** bit-exact.
- **Effort/reward: LOW / LOW.** Note, don't prioritize.
## B. DECODE-SERVING (~65% of vLLM aggregate, measured GPU-compute-bound)
### B1. Speculative decoding, greedy = bit-exact (SSM-state rollback is the crux) ⭐ novel
- **Mechanism:** draft γ tokens (small draft model, or prompt-lookup/n-gram for zero extra weights), verify in one target forward. At **temp=0 the accepted tokens are argmax-identical to non-spec → the greedy md5 gate PASSES by construction** (lossless). This is the rare throughput-multiplier that's bit-exact-compatible. Especially powerful at low concurrency where paged is farthest below vLLM (n=8 burst: paged 28 vs vLLM 45) and the GPU is underutilized.
- **The non-obvious crux:** hybrid-SSM rollback. KV rollback under paged is easy (truncate blocks). But the gated-DeltaNet recurrent state is updated **in-place** (patch 0018), so a rejected draft requires restoring the 128×128 f32 state per layer to the last accepted position — snapshot-before-speculate (memory+BW cost) or recompute. This SSM-state checkpoint/restore is the real engineering risk and is why naive llama.cpp spec-decode plumbing won't transfer.
- **Gain:** HIGH (2-3x at favorable acceptance/low concurrency).
- **Gate:** **bit-exact for greedy** (md5 holds); distribution-preserving (KL-gate) for temp>0.
- **Risk:** HIGH (SSM snapshot/rollback, draft integration with paged KV + recurrent state, acceptance tuning).
- **Effort/reward: HIGH / HIGH.** Biggest novel decode lever; start with zero-draft prompt-lookup to de-risk the rollback plumbing before adding a draft model.
### B2. FP8 / quantized paged KV cache
- **Mechanism:** decode is BW-bound; quantizing the paged KV (llama.cpp already has q8_0/q4_0 `--cache-type-k/v`) halves the KV-gather BW and **doubles effective KV capacity → higher max concurrency**. Wire the existing quantized-KV FA-vec path through the paged block-table read (0009/0010). Matches a vLLM feature (fp8 KV).
- **Gain:** medium-high for long-context / high-concurrency decode.
- **Gate:** KL-gate (KV quant changes attention numerics; watch long-context recall), per the `8cb0ce23` precedent.
- **Risk:** medium (paged FA-read FP8 path; precision on long context).
- **Effort/reward: MED / MED-HIGH.**
### B3. Coalesced paged-KV block layout for the in-kernel decode gather
- **Mechanism:** decode is at the LPDDR5x floor, so *effective* BW depends on coalescing. vLLM lays K as `[blocks, kv_heads, head_size/x, block_size, x]` precisely to coalesce the FA read. Re-lay-out the paged blocks so 0009/0010's in-kernel gather issues fully-coalesced vectorized loads matching the FA kernel's access pattern.
- **Gain:** medium (profile-gated: measure the FA-read achieved-BW / sector efficiency first).
- **Gate:** bit-exact (pure memory layout, identical values) → greedy md5.
- **Risk:** medium (touches paged KV manager + FA read).
- **Effort/reward: MED / MED.** Profile before building.
### B4. Megakernel / persistent decode (single-launch fused decode step)
- **Mechanism:** fuse the per-layer decode ops into one persistent kernel that loops layers internally (à la Mirage/MPK persistent megakernel), eliminating inter-op launch overhead, inter-op global round-trips, and the host loop for the decode step; keep the recurrent state resident across the step.
- **Gain:** potentially high for the GPU-compute-bound serving regime (kills launch/scheduling bubbles vLLM avoids). Honest caveat: at 27-35B the activations don't fit SMEM across layers, so the win is mostly launch-overhead + scheduling, less data-residency.
- **Gate:** in principle bit-exact (same ops/order) but extremely hard to guarantee → realistically KL-gate.
- **Risk:** VERY HIGH (essentially re-implements the decode forward as one kernel).
- **Effort/reward: VERY HIGH / HIGH.** The swing-for-the-fences lever; only after cheaper decode levers are exhausted.
### B5. Pipeline sampling off the decode critical path
- **Mechanism:** the doc names the "serial-SSM host loop / sampling can't start until logits land" as a floor. S2 (double-buffer set_inputs) was dropped because set_inputs is cheap — but the *sampling stall* between steps is different. Overlap step N's sampling + step N+1's input build with the GPU launch, so the GPU never idles waiting on host sampling.
- **Gain:** medium (recovers the inter-step sampling bubble; this is the precise residual S2 didn't target).
- **Gate:** bit-exact (host reordering only) → greedy md5.
- **Risk:** medium (ordering correctness vs the recurrent in-place state).
- **Effort/reward: MED / MED.**
### B6. Co-batch chunked prefill INTO decode steps (vLLM-style GPU saturation — flips S3) ⭐ reframe
- **Mechanism:** S3 deliberately keeps prefill *out* of decode steps (for graph reuse). But the later measurement proved serving decode is **GPU-compute-bound, not host-bound** — which *removes S3's rationale*. vLLM does the opposite: mixes small prefill chunks into decode steps to fill otherwise-idle GPU at low decode width. Test co-batching a sized prefill chunk with decode to use spare SMs.
- **Gain:** medium at low-to-mid decode width (better GPU utilization).
- **Gate:** bit-exact (same math, scheduling only) → greedy md5.
- **Risk:** low-medium (it partially contradicts S3 — A/B them; the GPU-compute-bound finding says S3's reuse benefit is ~nil here, so co-batching likely wins).
- **Effort/reward: LOW-MED / MED.** Cheap A/B with high information value (directly tests the regime conclusion).
### B7. Adaptive-width bucketed decode graph (doc-sanctioned revisit)
- **Mechanism:** the rejected padded-shape lever used fixed pad-to-`--parallel`; the doc explicitly leaves the door open for *adaptive* width (round up to next small bucket 8/16/32/64).
- **Gain:** LOW on GB10 — the same doc measured serving decode GPU-compute-bound, so graph reuse buys ~nothing here. Record as: revisit ONLY if the host loop is re-confirmed dominant on other hardware.
- **Gate:** bit-exact.
- **Effort/reward: MED / LOW (on GB10).** Note, don't build for GB10.
## C. CROSS-CUTTING / aggregate-throughput reframes
### C1. Exploit the 1.5-3x memory advantage for higher max concurrency ⭐ reframe
- **Mechanism:** the benchmark stops at npl=128 where both engines fit. With 1.5-3x lower memory (and synergistic with B2 FP8-KV), the paged backend can serve npl=256+ in the same VRAM where vLLM OOMs. Per-stream tok/s gap is irrelevant if paged sustains 2x the concurrent streams per GPU — aggregate tok/s/GPU can match or beat vLLM.
- **Gain:** HIGH for aggregate throughput-per-GPU at the memory ceiling (a legitimate, honestly-labeled "different operating point," not a per-stream parity claim).
- **Gate:** bit-exact (no numeric change) → greedy md5.
- **Risk:** low (scheduler/admission tuning to actually pack the streams).
- **Effort/reward: LOW / HIGH.** Cheapest high-reward lever — measure aggregate at max-concurrency, pair with B2.
---
## Ranked summary (effort vs reward)
| # | Lever | Regime | Gate | Effort | Reward |
|---|-------|--------|------|--------|--------|
| C1 | Higher max-concurrency via memory advantage (+B2) | aggregate | bit-exact | LOW | **HIGH** |
| A1 | Graph-safe ragged grouped FP4-MMA MoE kernel | prefill | bit-exact | HIGH | **HIGH** |
| B1 | Speculative decode (greedy=bit-exact; SSM rollback crux) | decode | bit-exact (greedy) | HIGH | **HIGH** |
| A2 | Multi-stream expert dispatch (→A1) | prefill | bit-exact | LOW-MED | MED |
| B6 | Co-batch chunked-prefill into decode (flips S3) | serving | bit-exact | LOW-MED | MED |
| B2 | FP8/quantized paged KV cache | decode | KL-gate | MED | MED-HIGH |
| A3 | MoE router+gather+GEMM permutation fusion | prefill | bit-exact | MED | MED |
| B3 | Coalesced paged-KV layout for decode gather | decode | bit-exact | MED | MED |
| B5 | Pipeline sampling off decode critical path | serving | bit-exact | MED | MED |
| A4 | Fused MoE FFN (up+SiLU+down resident) | prefill+decode | bit-exact | HIGH | MED-HIGH |
| A6 | Stream-K/split-K FP4 prefill GEMM | prefill | bit-exact/KL | MED | MED |
| A5 | Act-quant fusion into 0042 epilogue (prefill) | prefill | bit-exact | MED | LOW-MED |
| B4 | Megakernel/persistent decode | decode | KL-gate | VERY HIGH | HIGH |
| A7 | Prefill CUDA-graph capture (→ enabled by A1) | prefill | bit-exact | LOW | LOW |
| B7 | Adaptive-width bucketed decode graph | serving | bit-exact | MED | LOW (GB10) |
**Suggested attack order:** (1) **C1** — near-free aggregate win exploiting the memory advantage, immediately defensible. (2) **A2→A1** — the prefill MoE GEMM is the biggest single gap and 0034 already flags A1. (3) **B6** — cheap A/B that directly tests/exploits the "serving is GPU-compute-bound" conclusion. (4) **B1** — the highest-ceiling decode lever, but gate the SSM-state rollback plumbing first via zero-draft prompt-lookup. (5) **B2/B3/B5** as the BW + bubble cleanup. (6) **A4 / B4** as the high-effort structural swings only if the cheaper levers leave a funded gap.
**Two highest-value non-obvious insights:** (a) speculative decoding is *bit-exact under greedy* (md5 passes by construction) — the only throughput-multiplier compatible with the sacred gate — but its hybrid-SSM in-place-state rollback (patch 0018) is the unsolved crux. (b) the measured "serving decode is GPU-compute-bound" finding **invalidates S3's keep-prefill-out rationale** and argues for the *opposite* (B6 co-batching, vLLM-style), plus reframes the win toward aggregate-per-GPU concurrency (C1) rather than per-stream parity.
Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE,PREFILL_GEMM_SCOPE,PREFILL_GEMM_RESULTS,TENSORCORE_GDN_SCOPE}.md`, `.../README.md` (s4 benchmarks, s5 rejected levers), `.../docs/final_benchmark.csv`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (A1 is its flagged follow-up), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (A5 extends it).
## 5. Synthesized prioritized lever map
# Prioritized Lever Map - vLLM Parity, Qwen3.6 NVFP4 on GB10 (sm_121a)
## Bottom line (where the gap actually is)
- **Prefill is the largest absolute gap**: dense ~44-48% of vLLM, MoE (decision model) ~29-41%. Two buckets own ~71% of the wall (NVFP4 GEMM ~49%, chunked GDN ~22%); the op-walk surfaces **three uncovered residuals** (MoE router/combine, prefill act-quant, FA-at-length).
- **Decode kernels are at parity-to-ahead** (GDN recurrence 102.6% of vLLM BW; both FP4 GEMMs at the BW floor). **Decode-*serving* is the still-open gap** (~66% at n=128 burst), is **MoE-specific** and **GPU-compute-bound** (host-loop/graph-reuse/padded-shape all proved neutral-or-worse, so they are closed).
- The two structural levers vLLM has that the series has **no equivalent for**: **MTP speculative decode** and **GPU fused sampler**. On *this* hardware vLLM is itself on a **bf16-Marlin FP4 fallback** (no tcgen05/CUTLASS-grouped), so a working native FP4 path can **match-or-beat** it, not just chase it.
## Single highest-leverage NEXT action for the still-open decode-serving gap
**Run the both-engine steady-state serving nsys window FIRST (it is the gate before any decode kernel is funded).** Stagger ~128 clients through `llama-server` (`LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 -fa -ngl 99`) and the identical concurrency on vLLM; bucket per-step GPU-kernel time into `{MUL_MAT_ID, FA-vec/tile, gated_delta_net, bf16-projections, act-quant, sampling}` and compare **serving-narrow vs static-wide vs vLLM**. The decisive single metric: the per-useful-token time share of `MUL_MAT_ID` vs `FA` vs `gated_delta_net` in serving relative to vLLM. **Primary hypothesis to confirm/refute: H1** - MoE grouped GEMM collapsing to per-expert GEMV at ragged width, **and** count `cudaStreamSynchronize` *between* `MUL_MAT_ID` launches to catch the per-expert host-sync fallback firing. This one A/B arbitrates D2 vs D3 vs D4 (all HIGH-effort) at once, and the methodology forbids building a kernel before it. **Bank D1 (grouped-path guarantee) immediately as near-free insurance against the host-sync cliff regardless of outcome.**
## Master ranked lever table (pursue list)
| # | Lever | Gap | Gain → parity | Effort | Risk | Gate | Dependency / sequence | Status |
|---|-------|-----|--------------|--------|------|------|----------------------|--------|
| 0 | **Phase-0 serving nsys (both-engine bucket A/B)** | decode | enabling - sizes/arbitrates H1-H4 | LOW | low | n/a | none - **do first** | NOT DONE |
| 1 | **X1 (C1) Exploit 1.5-3× memory → serve npl=256+ where vLLM OOMs** | aggregate | **HIGH** (different operating point: aggregate tok/s/GPU) | LOW | low | BE | pairs w/ D6; admission tuning | NOT STARTED |
| 2 | **P1 Native FP4-MMA large-M dense GEMM (patch 0034)** | prefill | **HIGH** - GEMM ~49% of wall; can *beat* vLLM bf16-Marlin | HIGH | med | BE (md5) | foundation for P2/P8 | **IN PROGRESS (0034 scaffold landed)** |
| 3 | **D1 Guarantee grouped MMQ path - never host-sync per-expert fallback (extend 0025)** | decode | **HIGH if firing** (removes catastrophic cliff) | LOW | low | BE | gated by #0; bank regardless | NOT STARTED |
| 4 | **P3 Multi-stream expert dispatch (→P2)** | prefill | MED (partial overlap of serial syncs) | LOW-MED | med | BE | stepping-stone, bank before P2 | NOT STARTED |
| 5 | **P2 (A1) Graph-safe ragged grouped FP4-MMA MoE GEMM** | prefill | **HIGH** - the #1 prefill bucket (~28% of wall) | HIGH | med-high | BE (md5) | after P1/P3; **shares kernel arch w/ D2** | **FLAGGED 0034 follow-up** |
| 6 | **D10 (B6) Co-batch chunked-prefill into decode (flips S3)** | serving | MED (fills idle SMs at low D) | LOW-MED | low-med | BE | cheap A/B; tests "GPU-compute-bound" conclusion | NOT STARTED |
| 7 | **P4 Tensor-core chunked GDN prefill kernel (rewrite 0031)** | prefill | **HIGH** - #2 prefill bucket (~22% of wall, ~17% of gap) | HIGH | med-high | BE→KL | Gram products de-risked 6.7-9.3× | **DESIGN SCOPED, kernel NOT built** |
| 8 | **D2 (H1) Fused grouped-NVFP4 MoE decode GEMM + on-GPU token sort** | decode | **HIGH** - top decode hypothesis (MoE-specific) | HIGH | high | BE | gated by #0; **co-develop kernel w/ P2** | NOT STARTED |
| 9 | **D5 (B1) Speculative decode via MTP head** | decode | **HIGH** (2-3× at low/mid concurrency) | HIGH | high | BE (greedy) / KL (temp>0) | crux=SSM in-place state rollback (0018); de-risk w/ zero-draft prompt-lookup | NOT STARTED |
| 10 | **D6 (B2) FP8 / quantized paged KV cache** | decode | MED-HIGH (halves KV BW; doubles capacity → enables X1) | MED | med | KL (8cb0ce23 precedent) | wire quantized-KV FA-vec through paged read (0009/0010) | NOT STARTED |
| 11 | **D3 (H2) KV-split / flash-decoding paged FA decode** | decode | MED-HIGH (ragged-KV balance + occupancy) | MED-HIGH | med | BE→KL | gated by #0 (build only if FA bucket grows) | NOT STARTED |
| 12 | **P5 (A3+PREFILL-L1) Fused MoE router+gather+scatter+combine** | prefill | MED (~5-8% MoE wall, uncovered by P2/P4) | MED | med | BE (fp32 reorder; 8cb0ce23) | pairs w/ P2 kernel | NOT STARTED |
| 13 | **D4 (H3) Width-adaptive GDN recurrence launch params** | decode | MED (saturate grid at narrow D) | LOW-MED | low | BE (0022 col-independence) | env GDN_NW/GDN_CPW already exists | NOT STARTED |
| 14 | **D7 (B3) Coalesced paged-KV block layout for decode gather** | decode | MED (effective BW / sector efficiency) | MED | med | BE | profile-gated (#0 FA-read BW) | NOT STARTED |
| 15 | **P6 (A4) Fused MoE FFN (up→SiLU→down resident)** | prefill+decode | MED-HIGH (removes intermediate round-trip) | HIGH | high | BE→KL | after P2 | NOT STARTED |
| 16 | **D9 (B5) Pipeline host sampling off decode critical path** | serving | MED (recovers inter-step sampling bubble) | MED | med | BE | ordering vs in-place recurrent state | NOT STARTED |
| 17 | **D8 (H5/#13) GPU fused sorting-free sampler** | serving | MED (small on greedy; matters at 128-way top-k/p) | MED | med | BE-ish | alt to D9; profile to size | NOT STARTED |
| 18 | **P8 (A6) Stream-K / split-K FP4 prefill GEMM** | prefill | MED (small-output-grid layers on few-SM GB10) | MED | med | BE if det. else KL | profile-gated; complements P1 | NOT STARTED |
| 19 | **P7 (A5/PREFILL-L2) Act-quant fusion into 0042 epilogue (prefill)** | prefill | LOW-MED (~3-6% prefill; vLLM avoids it entirely) | MED | med | BE (md5) | extends landed 0042; after P1 | NOT STARTED |
| 20 | **P9 (#10/flag-3) Tensor-core paged prefill FA** | prefill | LOW-MED, **context-dependent (grows L²)** | MED-HIGH | med | BE→KL | re-profile FA share at real serving lengths first | NOT STARTED |
| 21 | **D11 (B4) Megakernel / persistent decode** | decode | HIGH (kills launch/scheduling bubbles) | VERY HIGH | very high | KL | last resort, only if funded gap remains | NOT STARTED |
Gate key: BE = bit-exact (greedy md5); KL = KL-divergence gate; BE→KL = bit-exact preferred, KL fallback.
## Drop / closed (do NOT pursue)
| Lever | Why dropped |
|-------|-------------|
| Padded / fixed-slot decode (pad-to-`--parallel`) | Built, GPU-tested, **REJECTED** - serving decode is GPU-compute-bound; dummy-row compute > reuse recovered |
| B7 Adaptive-width bucketed decode graph | LOW value on GB10 (same GPU-compute-bound finding); revisit only if host-loop re-confirmed dominant on other HW |
| dequant→bf16 cuBLAS prefill (0033) | **REJECTED** - MMQ beat it 29-49%; superseded by native FP4-MMA (P1) |
| W4A16-Marlin / NVFP4 projections (bf16→FP4) | **REJECTED** - KL-fail; vLLM keeps SAME bf16 projections, no advantage to chase |
| bf16-tau | Dropped |
| Act-quant fusion on **decode** (lever-3) | **FLAT** - decode is BW-bound; the prefill variant (P7) is the live one |
| S2 double-buffer set_inputs | Dropped - set_inputs is cheap (host loop closed by 0040/0041) |
| H6 NVFP4 act-quant decode tax | No bit-exact lever; **exclusion check only** (expected wash vs vLLM, which also FP4-quantizes) |
| P10 (A7) Prefill CUDA-graph capture | LOW/LOW - prefill launch overhead amortized over large M; merely *enabled* by P2, not a standalone item |
| H4 ragged-shape umbrella | Not a lever - it is the shared *root* of H1-H3; fixed by D2/D3/D4 at the kernel level |
| H5 (as exclusion) / H6 | profile-only rule-outs, not builds (D8 is the actual sampler lever) |
## Critical-path sequence (two parallel tracks per the multi-agent GPU methodology)
**Decode-serving track (gated):** #0 serving nsys → bank #3 (D1) → branch on the dominant bucket: if MUL_MAT_ID-GEMV → #8 (D2); if FA → #11 (D3); if recurrence → #13 (D4). In parallel, cheap A/Bs #6 (D10) and #1 (X1). Highest-ceiling greenfield #9 (D5) once SSM-rollback de-risked via zero-draft prompt-lookup. BW cleanup #10 (D6, synergistic with X1).
**Prefill track (already moving):** #2 (P1, in progress) → #4 (P3) → #5 (P2) - and **co-develop the P2 ragged-grouped kernel with the D2 decode kernel** (one fused-MoE dispatch that degrades gracefully across M = vLLM's single fused_moe shape). In parallel #7 (P4, design ready). Then the residual-coverage adds #12 (P5), #15 (P6), #19 (P7). Profile-gated #18 (P8), #20 (P9).
**Two highest non-obvious insights to act on:** (a) the P2 prefill kernel and the D2 decode kernel are the **same kernel** (on-GPU token sort + single persistent grouped FP4-MMA launch) at different M - fund them as one effort. (b) the "serving decode is GPU-compute-bound" finding **invalidates S3's keep-prefill-out rationale** - #6 (D10 co-batching, vLLM-style) and #1 (X1 aggregate concurrency) are the cheap wins that follow from it, and are higher-reward-per-effort than any further host-side or graph-reuse work.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
---
# PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026)
## Prefill gap decomposition (paged 396 vs vLLM 197 us/tok)
All 4 runs ran on DGX (GB10) via ssh dgx.casa; GPU lock held+released, GPU restored idle. Model = decision MoE Qwen3.6-35B-A3B-NVFP4 (paged GGUF vs q36-35b-a3b-nvfp4-vllm). Buckets = % of GPU-kernel wall (nsys cuda_gpu_kern_sum), and per-prefill-token us.
PAGED MoE PREFILL (npp512 ntg4 npl32, LLAMA_KV_PAGED=1 +LLAMA_MOE_FORCE_GRAPHS=1): S_PP=2417.8 t/s; kernel 6.485s/16384 tok = 395.9 us/tok. MoE-expert-GEMM(MMQ nvfp4) 26.5% | GDN 24.2% (gdn_core 17.2, gdn_gather 3.3, gdn_conv 2.7, l2norm 1.0) | layout-copy 9.8 (convert_dtype 6.3, concat 2.9) | ew-mul 8.7 | bf16-proj 8.6 | act-quant(quantize_mmq_nvfp4) 4.7 | ew-add 4.6 | silu/sigmoid-gate 4.3 | norms 3.6 | MoE-DISPATCH(argsort 0.4+mm_ids 1.1+gather_mmq 0.7) 2.2 | get_rows 1.0 | FA 0.6 | softmax 0.05 | scatter 0.06.
vLLM MoE PREFILL (32x512, 5 reps): S_PP=4925.8 t/s; kernel 16.138s/81920 tok = 197.0 us/tok. SURPRISE: on sm_121 vLLM runs experts as Marlin W4A16 (FP4->bf16 dequant + bf16 GEMM), NOT fused-FP4 cutlass; projections are FP8 (sm89_xmma_e4m3). ew-glue(torch elementwise) 31.7% | MoE-expert-GEMM(Marlin) 24.6 | GDN(FLA chunk_* + causal_conv) 18.5 | bf16/fp8-proj 10.4 | reduce(cumsum/softmax) 5.2 | gate 2.3 | act-quant(scaled_fp8) 1.7 | layernorm 1.7 | MoE-DISPATCH(gather/align/count_sort/argsort) 1.4 | FA 1.1.
Per-token gap decomposition (paged-vLLM, of 198.9 us/tok total): GDN +59.2 (~30%), MoE-GEMM +56.5 (~28%), ew/layout/glue net +21.4 (~11%), act-quant +15.2 (~8%), bf16-proj +13.7 (~7%), gate +12.4 (~6%), norms +11.1 (~6%), dispatch +5.9 (~3%).
## Decode picture (host-bound, not kernel/graph-reuse)
3 decode profiles. KEY: paged decode KERNELS are 5.4x more GPU-efficient than vLLM's, but paged static decode is HOST-BOUND (GPU ~16% busy); vLLM is GPU-bound (99% busy) on a slow recurrent GDN. They tie at static-wide-128 (paged 782 vs vLLM ~819 t/s pure decode) via opposite regimes.
PAGED DECODE-SERVING (staggered 128 clients, llama-server, steady 22s window, 83.5% GPU-busy): MoE/FFN-GEMM 40.7% (mmq 34.2 + gemv_moe 4.6 + gemv 1.4) | bf16-proj 22.8 (mul_mat_f 11.1 + nvjet 9.1 + cutlass 2.5) | GDN 21.2 (gdn_core 19.9) | act-quant 2.8 | layout 2.1 | get_rows 2.0 | ew-mul 2.0 | FA 1.6 | norms 1.2 | MoE-DISPATCH 1.1 | scatter 0.2 | softmax 0.1.
PAGED STATIC npl=128 lockstep (PP128+TG256, ~16% GPU-busy, HOST-BOUND): kernel 7.83s/49152 tok=159 us/tok, S_TG=782 t/s. MoE-GEMM 37.5 | GDN 21.6 | layout 9.6 | bf16-proj 9.2 | ew-mul 5.5 | act-quant 4.1 | ew-add 3.4 | norms 2.5 | dispatch 1.8 | FA 0.55. cudaStreamSynchronize=43.4s (84% of API/87% of wall) vs 7.83s GPU kernel => GPU idle ~84%.
PAGED STATIC npl=1 (batch-1): kernel 0.20s, MEMOPS 0.44s (68% of kern+mem), cudaStreamSync 66.7% => latency/BW-bound, GPU ~4% busy.
vLLM 128-wide offline (PT128 GEN256, 99% GPU-busy): kernel 42.56s/49152 tok=866 us/tok. GDN 45.2% (fused_recurrent_gated_delta decode 42.8!) | MoE-GEMM(Marlin) 36.2 | bf16/fp8-proj 6.6 | ew-glue 6.3 | FA 2.1 | reduce 1.4 | dispatch 0.7.
Per-token decode (paged static-128 | vLLM | ratio): MoE-GEMM 59.7|313.5 paged 5.3x faster; GDN 34.3|391.7 paged 11.4x faster; bf16-proj 14.7|57.2; total 159|866 paged 5.4x less GPU.
H1 verdict (false): the stated mechanism - 'MUL_MAT_ID per-useful-token time growing static->serving from grouped-GEMV collapse' - is REFUTED at the kernel level. The grouped path engages correctly: at width-1 the MoE expert path is GEMV (mul_mat_vec_q), and at width>=~16 it switches to grouped MMQ (mul_mat_q nvfp4) - npl=128 is 37% MMQ/~0 GEMV, serving is 34% MMQ + 6% gemv_moe. It does NOT collapse to per-token GEMV. What IS confirmed (the real H1 mechanism) is HOST-SIDE SERIALIZATION: cudaStreamSynchronize dominates the static-decode wall - npl=1 66.7% of API time (~89% of wall), npl=128 84.3% of API time (43.4s sync vs 7.83s GPU kernel => GPU ~84% idle); the serving window logged 40,902 cudaStreamSynchronize. The grouped MMQ also runs at ragged small-M tiles (mmq_x = 16/24/32/40/48/64/80/96) because tokens-per-expert is tiny -> low tensor-core utilization (small-M MMQ, not a GEMV collapse). Mechanistically the device->host sync to read MoE routing before launching per-expert GEMMs is the serializer (task D1/#104 'no host-sync MoE path').
THE BIG DECODE PICTURE (most important finding): paged and vLLM have OPPOSITE decode profiles. Paged decode kernels are 5.4x more GPU-efficient (159 vs 866 us/tok) but paged static decode is host-bound (GPU ~16% busy, serial SSM+sampling+MoE-dispatch host loop); vLLM is GPU-bound (99% busy) on a recurrent GDN kernel that is 11x slower per token, but it saturates the GPU via CUDA graphs. They tie at static-wide-128 (782 vs ~819 t/s). At SERVING the paged GPU rises to 83.5% busy because overlapping request streams hide the host stalls - so the serving lever for paged is NOT faster decode kernels (they're already fast/idle) but (a) removing host serialization / graphing the whole step incl MoE dispatch, and (b) chunked-prefill: paged's 2x-slower prefill steals serving cycles during continuous batching (the gen-80-128 serving config was ~55% prefill work; the nsys'd run2 gen-256-512 ~25%). vLLM bf16/fp8 projections are a bigger paged decode bucket than expected (22.8% serving) because batch-1/small-batch bf16 proj uses mul_mat_f (11.1%) + nvjet (9.1%).
Methodology/scope: profiled with nsys --trace=cuda + cuda_gpu_kern_sum; no NVTX in either engine so buckets are by kernel-name regex (bucketer at dgx:/home/mudler/bench/bucket2.py; reports at dgx:/home/mudler/bench/profgap/). Shared elementwise (k_bin_bcast add/mul, torch elementwise) straddle resid/MoE-fanin/GDN-glue and are bucketed by dominant use with that caveat; vLLM's torch_ew (31.7% prefill) is GDN-glue+MoE-combine+resid and is genuinely ambiguous. The dense Qwen3.6-27B-NVFP4 was NOT separately profiled (time budget; the MoE decision-model contains both MoE experts AND the same GDN/attention stack, fully answering A/B/C); GDN findings generalize to dense. vLLM decode here is offline 128-wide (continuous-batched), not staggered-server, so the cross-engine serving ratio is taken from prior h2h benches (~55-80% of vLLM at npl 64-128), not a fresh staggered vLLM run. Cross-engine 'gap' numbers are GPU-kernel-time per token (apples for GPU-bound prefill; for decode the host-bound vs GPU-bound asymmetry means wall-throughput parity hides a 5.4x GPU-efficiency paged advantage).
## Decision
### moe_prefill_lever
BETTER GROUPED GEMM KERNEL (D2/#105), NOT P5 dispatch fusion. The profile settles this empirically: explicit MoE dispatch (argsort+softmax+get_rows+set_rows+mm_ids+gather_mmq) is only 8.6 us/tok (~2-3% of the paged prefill wall; +5.9 us/tok = ~3% of the gap). P5 is REJECTED as a standalone lever - and the premise it rests on ("vLLM fuses dispatch into the GEMM epilogue") is FALSE on GB10: vLLM runs Marlin W4A16 with its OWN separate dispatch kernels (count_and_sort_expert_tokens/moe_align/vectorized_gather/moe_sum, 2.7 us/tok). Dispatch is cheap in both engines; epilogue-fusing it buys ~3% at most.
The real lever is the grouped GEMM: paged grouped-MMQ MUL_MAT_ID is 105 us/tok vs vLLM Marlin 48.5 us/tok = 2.16x slower, ~28% of the prefill gap (+56.5 us/tok). It does NOT collapse to GEMV - the grouped path engages correctly; it loses because ragged small-M-per-expert tiles (mmq_x 16-96) under-utilize tensor cores.
Is it winnable given MMQ already beat our native kernel? YES in principle, but ONLY via a kernel approach we have NOT yet tried correctly. Both prior attempts failed for identifiable reasons: 0033 did dequant as a SEPARATE global-memory pass then cuBLAS (lost to fused FP4 MMQ 29-49%); 0034 native FP4-MMA W4A4 PoC did NOT hold in-backend. vLLM proves the winning shape on THIS EXACT silicon (sm_121, Marlin bf16 fallback - no native FP4) is IN-REGISTER FP4->bf16 dequant feeding bf16 mma.sync with cp.async pipelining + large/grouped tiles, and W4A16 means ZERO activation-quant. That second point is load-bearing: act-quant (quantize_mmq_nvfp4) is +15.2 us/tok = ~8% of the gap that vLLM STRUCTURALLY does not pay because it is W4A16. So a Marlin-style W4A16 grouped MoE-prefill GEMM is a combined ~36% prefill lever (GEMM 28% + act-quant 8%), and it is a DIFFERENT kernel from both rejects (not a separate-pass dequant, not native FP4-MMA). The README's "W4A16 rejected" verdict was DECODE-only (BW-bound, wash); prefill is compute-bound and the act-quant pass is M-proportional, so W4A16 for prefill is unaudited and the most promising structural fix. GATE: must beat MMQ in a SEPARATELY-BUILT in-backend A/B at the real ragged-small-M MoE-prefill shapes (NOT a standalone PoC - the exact lesson from rejecting native FP4-MMA); bit-exact via KL-gate for the bf16-dequant reduction-order change (paged-MoE 8cb0ce23 precedent).
### gdn_build_go
True
### gdn_rationale
GO on #101, with a Phase-1 in-backend kill-gate. The profile makes the regime check the scope doc demanded (TENSORCORE_GDN_SCOPE Phase 0) pass cleanly: (1) GDN is the #1 SINGLE contributor to the prefill gap at +59.2 us/tok (~30% of the gap), edging out MoE-GEMM (+56.5). (2) The cost is MATH-predominant, not layout/host: gdn_core (the hand-written FP32 chunked-scan, NOT tensor-core) is 17.2% of the wall; GDN-attributable layout (gdn_gather 3.3 + head-concat 2.9 + a convert_dtype slice) is only ~6-7% (~1/4). So tensor cores attack the dominant 3/4, and the 1/4 layout folds into the same fused kernel. (3) The headroom is MEASURED on identical silicon: vLLM's FLA chunked GDN runs the SAME math at 36.5 us/tok vs paged 95.7 = 2.62x, confirming the scope's "mma absorbs the O(C^2) intra-chunk flops so the Cx state-BW cut becomes a net win" mechanism. (4) Bonus dual payoff: it also chips the decode serial-SSM residual and, via continuous batching, the serving-decode lever (prefill steals ~25-55% of serving cycles).
CONDITION (empirical guard, not PoC-optimism): 0031's chunking math was correct yet came back 22% SLOWER in-backend, and we JUST rejected native FP4-MMA because its standalone PoC win did not hold in-backend. So GO funds Phase 1 ONLY (two Gram products on mma.cuh tf32 tiles at fixed C=16/1-block-SM); it must move S_PP in a SEPARATELY-BUILT in-backend A/B vs the sequential scan. If Phase 1 is flat, the occupancy/register wall is the blocker, not the reductions - NO-GO the multi-week Phase 2/3 build. Precision gate is the KL-gate (tf32 default, 3xtf32 ladder), greedy md5 stability, plus the adversarial g in [-20,-1e-4] decay op case; ship opt-in default-off until a separately-built A/B beats sequential.
### top_decode_lever
D1/#104 - the no-host-sync MoE decode path + full-step CUDA-graph capture (graph the WHOLE decode step INCLUDING MoE dispatch), targeting the device->host MoE-routing readback. Ranked decisively by the profile, NOT by raw GPU-bucket size: the dominant decode cost is not a GPU kernel at all - it is cudaStreamSynchronize, 84% of the static-decode wall (43.4s sync vs 7.83s GPU kernel; npl=1 66.7%, npl=128 84.3% of API time; 40,902 syncs in the serving window). Root cause = the device->host sync to read MoE routing before launching per-expert GEMMs. Paged decode KERNELS are already 5.4x more GPU-efficient than vLLM's and the GPU sits 84% idle in static decode, so D1 is the only decode lever that attacks the actual bottleneck.
D2/D3/D4 for DECODE are all REJECTED by the methodology's "a faster kernel off the critical path benches flat" rule: D2 fused MoE decode GEMM - paged MoE-GEMM is already 5.3x faster/token than vLLM (59.7 vs 313.5 us/tok); making it faster just adds idle. D3 FA-split - FA is 1.6% of decode-serving wall / 0.55% static (H2 refuted; the hybrid is mostly GDN with few full-attn layers); not a lever. D4 GDN-width-adaptive - paged GDN decode is already 11.4x faster/token than vLLM (34 vs 392); H3 confirmed (flat across width, no amortization) but the recurrence is NOT the bottleneck, host serialization is - an occupancy retune yields ~nothing until the host loop is gone.
Honest scope on D1's payoff: at HIGH-concurrency serving the paged GPU is already 83.5% busy because overlapping request streams hide the host stalls, so D1's win concentrates at LOW-concurrency / latency / batch-1 (GPU 4-16% busy), where it is large. The complementary serving-throughput lever is FIXING PREFILL (GDN #101 + MoE GEMM D2/#105): paged's 2x-slower prefill steals serving cycles under continuous batching (~25-55% of the serving step is prefill work) - so the prefill levers ARE also serving-decode levers. GATE: separately-built in-backend A/B (compiled-in, so a runtime flag does NOT isolate it) showing higher static/low-concurrency decode t/s with no high-concurrency-serving regression; bit-exact greedy md5 (graph replay re-issues identical kernels).
### next_3_levers
Ranked, each with its pass-gate:
1) #101 TENSOR-CORE mma CHUNKED GDN PREFILL KERNEL (prefill, GO). #1 prefill-gap contributor (+59 us/tok, ~30%), ~3/4 math (tensor cores help) with 2.62x measured headroom on identical silicon, 1/4 layout folds in; also helps serving decode. GATE: Phase-0 regime already satisfied by this profile; Phase-1 two-Gram-product PoC must move S_PP in a SEPARATELY-BUILT in-backend A/B vs sequential (flat => NO-GO the multi-week build); then KL-gate (tf32/3xtf32) + greedy md5 + adversarial-decay op test; ship opt-in default-off until A/B beats sequential.
2) D1/#104 NO-HOST-SYNC MoE DECODE PATH + FULL-STEP CUDA-GRAPH CAPTURE (decode). Attacks the cudaStreamSynchronize that is 84% of the static-decode wall (the MoE-routing device->host readback). Lowest effort, bit-exact, highest-confidence decode win (concentrated at low-concurrency/latency). GATE: separately-built in-backend A/B (not a runtime-flag toggle) - higher static/low-concurrency decode t/s, no high-concurrency-serving regression; bit-exact greedy md5.
3) D2/#105 MARLIN-STYLE W4A16 GROUPED MoE PREFILL GEMM (prefill). In-register FP4->bf16 dequant + bf16 mma.sync, cp.async, large grouped tiles - captures the 28% MoE-GEMM gap AND the 8% act-quant gap (W4A16 has no activation-quant), = ~36% combined; this is exactly what vLLM does on sm_121. Ranked #3 because of HIGH risk: two prior in-backend GEMM attempts failed (0033 separate-pass dequant, 0034 native FP4-MMA PoC didn't hold). GATE: must beat MMQ in a SEPARATELY-BUILT in-backend A/B at ragged-small-M MoE-prefill shapes (NOT a standalone PoC); bit-exact via KL-gate (bf16-dequant reduction order).
Explicitly REJECTED/deprioritized (record so they aren't re-run): P5 dispatch fusion (~3%, and the "vLLM fuses dispatch" premise is false on GB10); D2-for-decode, D3 FA-split, D4 GDN-width-adaptive (their kernels are already 5-11x faster than vLLM and GPU-idle -> bench flat); padded/fixed-slot decode (already tested+rejected, commit b028c81e).
### notes
Empirical discipline applied throughout (per the just-rejected native FP4-MMA): every funded lever is gated on a SEPARATELY-BUILT in-backend A/B, never a standalone PoC - 0031 (chunking math correct, -22% in-backend) and 0034 (PoC win, didn't hold) are the two cautionary precedents. Two compiled-in levers (#101, D1) cannot be isolated by a runtime flag, so they need build-vs-build A/B (methodology hard rule).
Two profile surprises that reshape the directions: (a) vLLM on sm_121 is NOT native FP4 - it runs Marlin W4A16 (FP4->bf16 in-register dequant + bf16 GEMM) for experts and FP8 projections. So the winnable MoE-prefill GEMM is a W4A16-Marlin-style kernel (which also erases our 8% act-quant tax), not another native-FP4 attempt. (b) Decode is a regime asymmetry, not a kernel gap: paged decode kernels are 5.4x more GPU-efficient than vLLM's but paged static decode is HOST-BOUND (GPU 84% idle on cudaStreamSynchronize); vLLM is GPU-bound at 99% on a recurrence 11x slower/token. They tie at static-wide-128. Hence "make decode kernels faster" is the wrong instinct (benches flat); "remove host serialization / graph the full step" (D1) and "fix prefill so it stops stealing serving cycles" (#101, D2) are the decode-serving levers.
Cross-cutting: the prefill levers (#101 GDN, D2 MoE GEMM) double as serving-decode levers because continuous batching interleaves ~25-55% prefill work into the serving step. GDN edges MoE-GEMM as the top prefill pick (bigger gap, cleaner math mechanism, 2.6x proven headroom, lower in-backend risk, dual payoff).
All numbers from the both-engine nsys profile (cuda_gpu_kern_sum buckets, bucketer dgx:/home/mudler/bench/bucket2.py, reports dgx:/home/mudler/bench/profgap/); caveats: no NVTX (kernel-name regex buckets); shared elementwise straddles resid/MoE-fanin/GDN-glue; vLLM decode is offline 128-wide, not staggered-server. Relevant repo paths (absolute): /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{TENSORCORE_GDN_SCOPE.md,TENSORCORE_GDN_BUILD_PLAN.md,VLLM_PARITY_LEVER_MAP.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,DECODE_SERVING_SCOPE.md,PAGED_BITEXACT_NOTE.md,final_benchmark.csv}; patches dir .../patches/paged/ (existing 0031 chunked-GDN serial, 0033 dequant->cuBLAS rejected, 0034 native FP4-MMA, 0040/0041 S1/S3 decode-graph, 0042 fused residual+RMSNorm); methodology /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/.agents/vllm-parity-methodology.md.