Files
LocalAI/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
Ettore Di Giacinto 042deab40e docs(paged): vLLM-parity lever map + tensor-core GDN build plan (both-engine profile-validated)
Lever map records the full prefill/decode gap decomposition vs vLLM, the ranked levers, and the rejected dead ends. GDN build plan is the per-product mma mapping + A-inverse + occupancy design.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-29 06:15:10 +00:00

75 KiB
Raw Blame History

vLLM Parity Lever Map

Auto-generated from the parity-exploration workflow. Working artifact (the multi-week path to vLLM parity on prefill + decode, Qwen3.6 NVFP4 / GB10).

1. Prefill gap re-audit

I have walked the full prefill forward pass against the committed numbers (final_benchmark.csv, PREFILL_GEMM_SCOPE/RESULTS, the 0042 dense nsys profile, the qwen35moe/delta-net graph source). Here is the re-audit.


PREFILL gap re-audit - Qwen3.6 NVFP4 on GB10

Grounding (what the gap actually is)

From docs/final_benchmark.csv, prefill (S_PP, t/s; patched vs vLLM):

  • Dense 27B: ~922 vs ~1929-2182 → patched is 44-48% of vLLM.
  • MoE 35B-A3B (the decision model): ~1510-2177 vs ~5186-6223 → patched is 29-41% of vLLM. In us/tok at npl64: llama ~471, vLLM ~169 → gap ~302 us/tok.

The GEMM scope's bucket (~232 us/tok llama vs ~68 vLLM) = a 164 us/tok GEMM difference = ~51-54% of the gap, and GEMM is ~49% of the llama prefill wall (232/471). GDN is cited at ~17% of the gap (vLLM chunked scan ~2.5x cheaper). So GEMM+GDN ≈ ~68% of the gap by the existing framing - leaving ~30% that the two levers' headline numbers do not name. This audit walks every op to place that residual.

Important structural facts confirmed from source (models/qwen35moe.cpp, delta-net-base.cpp, llama-graph.cpp):

  • MoE = 40 layers (interval-4 → 30 GDN + 10 full-attention), 256 experts top-8, plus a dense shared expert on every layer. Dense = 64 layers (48 GDN + 16 attn).
  • Default prefill GDN is NOT a single kernel. fused_gdn_ch/patch-0031 is default-OFF, so prefill runs build_delta_net_chunking - a long graph of ggml_mul/mul_mat/solve_tri/cumsum/tri/exp + many ggml_cont/transpose/pad/repeat layout copies + a host-side per-chunk loop. The GDN lever (tensor-core fused kernel) is scoped to replace this entire decomposition, so the "11% k_bin_bcast op_mul gating muls" the 0042 patch calls "a separate lever" are in fact inside the GDN bucket (a fused GDN kernel subsumes them).

Prefill op-share table (MoE decision model; % of the patched/llama prefill wall)

Estimates triangulated from the committed numbers (232/68 GEMM, 11%/5% from the 0042 dense nsys, the gap arithmetic), not a fresh nsys run.

Op (prefill) ~% of llama wall vLLM faster? why Covered by GEMM lever Covered by GDN lever
Token embed (get_rows) <1% tie - -
NVFP4 weight GEMMs total ~49% Yes - vLLM W4A16-Marlin/cutlass large-M tiles + async pipeline vs MMQ small-tile / new FP4-MMA at 57.7% of peak YES -
┝ routed-expert grouped GEMM (gate_up+down, mul_mat_id) ~28% yes (biggest single bucket) yes -
┝ shared-expert dense GEMMs (all tokens, ×40) ~9% yes yes -
┝ GDN in/out projections (wqkv, wqkv_gate, ssm_out) ~7% yes yes -
┝ attention QKV/O projections (×10) ~5% yes yes -
GDN chunked decomposition (30 layers) ~22% Yes - vLLM chunked scan ~2.5x cheaper (tensor-core intra-chunk vs llama's f32 graph ops + layout copies + host loop) - YES
┝ gating/decay muls (k_bin_bcast op_mul) ~11%* yes - yes (fused kernel absorbs)
┝ small f32 mul_mats + solve_tri + cumsum/tri/exp ~7% yes - yes
┝ layout cont/transpose/pad/repeat copies ~4% yes - yes
FlashAttention prefill (QK^T·softmax·PV, 10 layers) ~3-6% maybe - L²-growing; bounded at npp=128, larger at serving context NO NO
MoE router + combine/scatter ~5-8% Yes - vLLM fuses gather/weight/scatter into the grouped-GEMM epilogue NO NO
argsort_top_k(256→8) + softmax + weight-norm ~2-3% yes no no
┝ combine: 7× fp32 add + weight mul (×40) ~3-5% yes no no
Activation quantization (W4A4 e4m3 pass per GEMM) ~3-6% Yes - structurally: vLLM W4A16-Marlin on GB10 has no activation-quant step NO partial
Norm + residual tail (attn/post/q/k/ssm/l2/out + adds) ~4% small (0042 fused the main one) - -
RoPE + sigmoid/silu gates + scale ~2-3% small - -
LM head (last-token only in prefill) <1% tie - -

* 0042 dense profile; in MoE the relative share is a bit lower (MoE FFN is heavier). † grows quadratically - under-weighted at the benchmark's npp=128; re-measure at real serving lengths. ‡ the quant pass feeds the GEMM but is a separate kernel, not inside the GEMM-lever's mul_mat bucket.

Verdict: GEMM + GDN are the two dominant buckets but NOT the whole gap

They cover ~71% of the prefill wall and the bulk of the gap. Three contributors are materially uncovered by either lever:

Newly-identified lever 1 - MoE router + combine/scatter (the strongest miss on the decision model)

llama runs the expert routing and recombination as separate memory-bound ggml ops: argsort_top_k over 256 experts, softmax/normalize, then a fan-in of 7 fp32 ggml_add + a weight ggml_mul per MoE layer (llama-graph.cpp ~1797-1824), every one of 40 layers. vLLM's fused-MoE (and Marlin grouped) path folds the gather, the router-weight multiply, and the scatter-accumulate into the GEMM epilogue/prologue - so this is overhead vLLM essentially does not pay. Est. ~5-8% of the MoE prefill wall, entirely outside GEMM (the mul_mat_id is covered; the surrounding argsort/adds/mul are not) and outside GDN. Lever: a fused top-k-weighted expert-output accumulation (or a fused-MoE epilogue), removing the 7-add fan-in and the separate weight mul. Bit-exact-gateable (it is an fp32 reduction-order change, same precedent as the paged-MoE 8cb0ce23).

Newly-identified lever 2 - the W4A4 activation-quant pass (a vLLM-asymmetry, not just a kernel-speed gap)

Every NVFP4 GEMM (MMQ today, and the new 0034 FP4-MMA) quantizes activations to e4m3 (amax/6 + code search) before the matmul - a distinct, M-proportional kernel. vLLM on sm_121 falls back to W4A16-Marlin (the TENSORCORE_GDN_SCOPE confirms this: no tcgen05/cutlass-FP4 on GB10), i.e. f16 activations, zero activation-quant. So this pass (~3-6% of prefill) is a structural cost vLLM avoids, and it explains part of why even a peak FP4-MMA GEMM will not fully reach vLLM's prefill. The README's "act-quant FLAT" and "W4A16 rejected" verdicts are decode/BW-bound findings; in compute-bound prefill the trade is different and unaudited. Lever: measure this quant bucket as its own nsys row; consider fusing the activation-quant into the GEMM prologue (cp.async + in-register quant) so it is not a separate global-memory pass.

Flag 3 - FlashAttention prefill (context-dependent, currently under-measured)

The 10-16 full-attention layers' QK^T·softmax·PV is a separate kernel covered by neither lever. It is small at the benchmark's npp=128 but grows as L²; at the long contexts the decode-serving work targets it can become a real bucket. The whole prefill ground-truth (232/68) was taken at one ubatch size - re-profile FA share at the real serving prefill lengths before assuming it is negligible.

Confirmed inside the existing levers (not new)

  • The 0042 "11% gating muls" and all the GDN small-matmuls/solve_tri/cumsum/layout-conts are inside the GDN bucket - the tensor-core GDN kernel subsumes them; they are only "live and uncovered" today because patch 0031 is default-off and losing at C=16.
  • Shared-expert dense GEMMs, GDN/attention projections = GEMM lever (the FP4-MMA 0034 path already routes them).

Bottom line

Two prefill levers (GEMM, GDN) are correctly the top-2 and own ~the gap's majority, but they are not the whole gap. The op-walk surfaces MoE router+combine/scatter and the W4A4 activation-quant pass as genuine, currently-untracked prefill contributors on the MoE decision model (~8-14% combined), plus FA prefill as a context-dependent risk the npp=128 bench hides. Per the methodology, step 0 is an nsys prefill-only window that explicitly breaks out argsort/add(combine), quantize_mmq_nvfp4, and flash_attn as separate rows to size these three before funding a kernel.

Relevant files: /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}, /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch, and the graph source /home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp} + /home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp (build_moe_ffn ~1500-1834, build_attn ~2136-2189).

2. Decode-serving compute hypotheses (ranked)

RANKED DECODE-SERVING GPU-COMPUTE HYPOTHESES (paged llama.cpp vs vLLM, MoE Qwen3.6-35B-A3B-NVFP4 on GB10)

Grounding facts that constrain the ranking:

  • The gap is empirically MoE-specific: dense static is parity-to-ahead, MoE static is 89-93% of vLLM, but MoE burst serving is ~66% (n=128: paged 4.53 vs vLLM 6.87 tok/s/seq). So whatever degrades is on a path that hurts MoE far more than dense.
  • It is GPU-compute-bound, NOT host/reuse-bound: padded-shape lever rejected, baseline reuse 0% statistically equal to S1+S3 reuse 72% on aggregate tok/s, hostproc only 4-8% of wall. So the host loop (0040/0041/S2) is closed; the residual lives in per-step kernel time.
  • The decode KERNELS tie vLLM at a fixed WIDE lockstep shape (static batched-bench). The serving loss is therefore about how a RAGGED/NARROW/fluctuating live batch (varying decoder count D, ragged KV lengths, ragged token->expert assignment) feeds those same kernels, vs how gracefully vLLM's kernels degrade at the same concurrency. This is exactly the Phase-0 "re-scope" branch in DECODE_SERVING_SCOPE.md ("serving runs a worse effective batch shape into the kernels").

Decisive measurement that arbitrates all of these (run first): nsys a clean steady-state serving window (serve_bench staggered ~128 clients through llama-server, LLAMA_KV_PAGED=1 + LLAMA_MOE_FORCE_GRAPHS=1, -fa on -ngl 99) AND the same nsys on vLLM at the same concurrency (both-engine rule). Decompose per-step GPU-kernel-time into buckets {MoE-expert-GEMM (MUL_MAT_ID), full-attn FA, GDN recurrence, bf16 projections, activation-quant, sampling/logits} and compare serving-narrow vs static-wide vs vLLM. The bucket whose per-useful-token time grows MOST going static->serving (relative to vLLM's same bucket) is the gap. Avoid the known window artifact; measure a steady span. Reference doc: backend/cpp/llama-cpp-localai-paged/docs/DECODE_SERVING_SCOPE.md.


H1 (TOP) - MoE expert GEMM collapses to per-expert GEMV at ragged/narrow serving width, plus risk of the host-sync sorted per-expert fallback.

  • Mechanism: top-8 of 256 experts. Tokens/expert ~= D*8/256. Static npl128 -> ~4 tok/expert; serving burst-tail D->8 -> ~0.25 tok/expert, so most active experts get 0-1 tokens. The grouped MMQ id-GEMM's per-expert M collapses to 1 -> pure GEMV that reads the full FP4 expert weight (memory-bound, weight bytes unamortized) and re-loads per-expert scales. This is the "256 tiny-expert weight bandwidth" README s5 names as the residual. Separately, patch 0025 only keeps CUDA graphs on for the should_use_mmq grouped path; any serving step where MUL_MAT_ID ne[2]>8 (mmvq_mmid_max) AND should_use_mmq returns false falls to the per-expert host-loop fallback that cudaStreamSynchronizes per expert (the [TAG_MUL_MAT_ID_CUDA_GRAPHS] disable) - catastrophic, and serving's varying per-step shapes can trip it unevenly.
  • Why slower than vLLM: vLLM runs ONE fused MoE GEMM with sorted_token_ids/expert_ids computed on-GPU (fused_moe / Marlin-MoE), a single persistent launch that keeps the grouped GEMM dense and amortizes launch + scale loads; it degrades gracefully at small M. llama issues a grouped MMQ that, at ragged narrow width, is many near-empty expert tiles each re-reading scales, and can drop to a host-synced loop.
  • nsys metric to confirm: (a) MUL_MAT_ID kernel-time as % of per-step GPU wall, static-wide vs serving-narrow vs vLLM; (b) the tokens-per-expert (M) distribution per step - look for M->1 GEMV collapse and achieved FLOP/s vs M; (c) count cudaStreamSynchronize / per-expert cudaMemcpy between MUL_MAT_ID launches per step (host-sync fallback firing); (d) vLLM single fused-MoE kernel duration at same concurrency.
  • Candidate fix: a fused grouped-NVFP4 MoE decode GEMM with on-GPU token sorting (device-computed sorted token offsets + expert ids) so all active experts share one persistent launch and scales amortize - i.e. port vLLM's fused-MoE dispatch shape onto the FP4-MMA MMQ id-path; as a floor, extend 0025 to GUARANTEE the grouped should_use_mmq path for every serving shape so the host-sync loop never fires. Bit-exact-gateable (graph-replay/grouped path re-issues identical kernels).

H2 - Paged full-attention decode kernel: ragged-KV load imbalance, no tensor cores, indirect block-table reads.

  • Mechanism: the 16 full-attn layers run the paged block-table FA decode, pinned by the 0010/0011 dispatch guard to vec/tile and NEVER the mma/wmma tensor-core FA (a present block table routes only to vec/tile; tile loads half2, F16 cache only). Static bench: all sequences one KV length -> balanced. Serving: KV lengths are ragged (each request at a different position), so per-sequence attention work is imbalanced across the grid and the step waits on the longest-context tail; there is no KV-dimension split. Every K/V access is an indirect physical-cell load via the block table (gather-like), less coalesced than a contiguous read.
  • Why slower than vLLM: vLLM PagedAttention v2 uses a split-K / partitioned reduction designed for ragged long contexts (flash-decoding style) that balances work and lifts occupancy on the tail, and keeps the contiguous-within-block layout. llama's vec/tile paged read has no KV split and leaves tensor cores idle on the full-attn layers.
  • nsys metric to confirm: FA-decode (vec/tile) kernel duration vs KV-length VARIANCE across the live batch (does it scale with max-KV/tail rather than mean-KV?); tensor-core-active-% during FA layers (expect ~0); achieved memory-BW of the FA kernel under ragged KV; vLLM paged-attn kernel time + util at same concurrency.
  • Candidate fix: a KV-split (flash-decoding / split-K) paged FA decode so long sequences are partitioned across blocks for balance + occupancy; longer term a tensor-core paged FA for the full-attn layers (mma.sync down-translation, same approach as the GDN tensor-core scope). At minimum a per-sequence work-balanced launch.

H3 - GDN/SSM recurrence decode kernel under-occupied at narrow/variable serving width.

  • Mechanism: patch 0022 tuned the recurrence (NUM_WARPS=16, COLS_PER_WARP=8, grid.z = S_v/(NW*CPW)) for the WIDE B=128 lockstep batch; its DRAM-latency coverage / MLP needs ~128 independent sequence-states in flight, and it is bandwidth-bound (re-streams the 128x128 f32 state per sequence per step at 84.6% of peak BW at B=128). In serving D fluctuates and collapses in the burst tail; at low D the kernel is grid-starved (few independent states), achieved-BW falls below the tuned point and per-token state traffic rises - the same grid-starvation failure mode the chunked-prefill kernel hit at low n_seqs. Plus the serial-SSM host loop (README s2d/s5 structural floor) is amortized over fewer tokens.
  • Why slower than vLLM: vLLM's fused_recurrent_gated_delta_rule + its scheduler keep the recurrence fed at small batch; llama's fixed B=128-tuned launch params under-saturate when D is small.
  • nsys metric to confirm: gated_delta_net kernel achieved-BW (GB/s) and occupancy as a function of live D in serving vs the static 84.6%@B128 baseline; recurrence kernel time/token vs D; grid occupancy at the burst tail.
  • Candidate fix: width-adaptive recurrence launch params - auto-select NUM_WARPS/COLS_PER_WARP (already env GDN_NW/GDN_CPW) by live D so the grid stays saturated at narrow width; bit-exact-safe (0022's column assignment is provably independent of visit order). Longer term the chunked/register-resident state scan cuts state traffic.

H4 - Continuous-batch ragged-shape overhead: every kernel sized to the batch union/max; bf16 projections become GEMV at narrow D (umbrella + the "bf16-projection bandwidth" half of README's stated residual).

  • Mechanism: ragged positions/lengths/expert-assignments mean each per-step kernel is launched for the max/union over the live batch, so useful-token efficiency < lockstep. This is the shared root of H1-H3 but is worth isolating because it also covers the q/k/v/gate/o projections (deliberately kept bf16, per README s5) which at narrow D become GEMV-like memory-bound weight reads - the "bf16-projection bandwidth" residual vLLM also pays but amortizes over a steadier batch.
  • Why slower than vLLM: vLLM's scheduler holds a steadier/denser decode batch (padded bucketed decode + chunked-prefill interleave) so its projection/attn GEMMs run at higher effective M; llama's batch width fluctuates more.
  • nsys metric to confirm: GPU-busy% in a steady serving window vs static (expect lower in serving) and (sum useful-token FLOPs)/(kernel-time) serving vs static; bf16 projection GEMM achieved FLOP/s vs M (GEMV collapse at small D).
  • Candidate fix: largely subsumed by fixing H1-H3 at the kernel level. Note: holding D high via admission was effectively probed by the padded-shape lever and REJECTED for throughput (the completion-driven shrink is itself a per-survivor win); so do NOT re-pursue width-padding - the payoff is in the per-kernel fixes.

H5 - Per-step sampling + logits handling across D independent sequences (low, cheap to exclude).

  • Mechanism: each live sequence has its own sampler chain run after logits land; at narrow D this fixed per-step cost (+ any D2H logits copy) is amortized over fewer tokens. vLLM batches sampling on-GPU across the whole decode batch.
  • nsys metric to confirm: sampling/logits-copy time as % of per-step wall serving vs static; D2H logits cudaMemcpy size+time; count of per-sequence sampler launches.
  • Candidate fix: single on-GPU batched sampler over [D, vocab], no per-sequence D2H. Likely small on the greedy/temp0 path; profile mainly to exclude.

H6 - NVFP4 activation (re)quantize fixed per-step cost at small M (lowest; likely a wash vs vLLM, no bit-exact lever).

  • Mechanism: W4A4 quantizes activations to FP4 each step; 0023 de-dups the per-expert re-quant but the base quantize + scale compute is a per-row pass amortized over fewer tokens at narrow D. README lever-3 records the act-quant tax is removable only by W4A16 (rejected) or a kernel rewrite.
  • nsys metric to confirm: quantize-kernel time as % of step vs M, serving vs static, vs vLLM (which also FP4-quantizes activations -> expected wash).
  • Candidate fix: none bit-exact; include only so the profile can rule it out.

Priority for the profiling workflow: profile the MoE model (the gap model) first; bucket per-step GPU-kernel-time and A/B serving-narrow vs static-wide vs vLLM. Expect the ranking to be dominated by H1 (MoE expert GEMM at collapsed M + any host-sync fallback) and H2 (ragged-KV FA on vec/tile), with H3 (recurrence occupancy at narrow D) third; H4 is the shared root, H5/H6 are exclusion checks. The single most decisive metric is the per-step time share of MUL_MAT_ID vs FA vs gated_delta_net in serving-narrow relative to vLLM's equivalents at the same concurrency.

3. vLLM pipeline deep-dive

I have what I need. Synthesizing the vLLM pipeline against our patch series.


vLLM vs llama-cpp-localai-paged: kernel + scheduler deep-dive (Qwen3-style hybrid SSM+attn MoE, NVFP4, consumer Blackwell sm_121)

The single most important grounding fact (changes how every GEMM row reads)

On consumer Blackwell (sm_120/sm_121: DGX Spark/GB10, RTX 5090, RTX PRO 6000) the native CUTLASS NVFP4 grouped-GEMM path is broken (TMA-WS grouped-GEMM init failure, CUTLASS #3096) and there is no tcgen05/TMEM. So vLLM on this exact hardware does not run a native FP4-MMA grouped GEMM - it falls back to the Marlin BF16 kernel that dequantizes FP4->BF16 in-register, capped at bf16-tensor-core peak (~half FP4 peak). Native FP4 (W4A4/tcgen05) and the best FlashInfer/TRT-LLM kernels are gated to data-center Blackwell sm_100a. This means several "vLLM advantages" assumed for B200 do not hold on GB10, and our native FP4-MMA path (the just-verified 103 TFLOP/s = 57.7% of FP4 peak GEMM) is potentially ahead of vLLM's Marlin-bf16 fallback on this part - the opposite of the usual framing.

Comparison table

# Component vLLM (this model class, sm_121 reality) Ours (llama-cpp-localai-paged) Regime Verdict / gap
1 Dense weight GEMM - decode (M≤128, BW-bound) Marlin FP4→bf16 in-register dequant (W4A4 broken→fallback); reads 4-bit weights Native FP4-MMA MMQ (FP4 wt × Q8_1 int8 act), M≤128 tile decode Parity - both at FP4 weight-BW floor. Ours ~96-97% of vLLM, ahead at low concurrency
2 Dense weight GEMM - prefill (large-M, compute-bound) Marlin grouped/dense, async cp.async pipeline, big tiles, ~bf16 peak MMQ small-tile, 1 CTA/SM. New native FP4-MMA large-M kernel @103 TFLOP/s being integrated (beats cuBLAS-bf16, bit-exact) prefill dequant→bf16-cuBLAS lever (0033) was rejected (MMQ beat it 29-49%); the native FP4-MMA kernel is the real fix and could beat vLLM's bf16-Marlin here
3 MoE expert GEMM - decode Marlin FP4→bf16 grouped, indirect addressing Grouped MMQ (mul_mat_id), sorted expert layout, native FP4-MMA decode Parity - both BW-floor. Recurrence/GEMM are our wins; residual = bf16-projection BW + host loop
4 MoE expert GEMM - prefill Marlin grouped GEMM, fused, big tiles MMQ small-tile grouped (1 CTA/SM) prefill GAP (#1 prefill bottleneck per docs). Native FP4-MMA grouped kernel is the planned fix; today MMQ is small-tile-bound
5 MoE routing / gather / scatter / epilogue Triton persistent fused-MoE: indirect token addressing, fused gate+up + SwiGLU epilogue, once-quantize, scatter+weighted-combine fused Sorted per-expert layout; NVFP4 act-quant de-dup (0023) mirrors once-quantize; SwiGLU is separate ops (no fused epilogue) both Partial parity. No fused gate+up+SwiGLU epilogue (extra IO passes); matters at prefill, minor at decode
6 GDN / linear-attn - decode FLA Triton fused_recurrent_gated_delta_rule + fused_sigmoid_gating_delta_rule_update (sequential, per-step state) Fused sequential recurrence: in-place state write-back (0018), fused state gather (0019), o_proj MMVQ→MMQ (0020), occupancy retune (0022), conv-tap gather fusion (0028) decode Parity-to-win - recurrence runs at 102.6% of vLLM bandwidth, 84.6% of GB10 peak BW. Our strongest area
7 GDN / linear-attn - prefill FLA chunk_gated_delta_rule: intra-chunk products on tensor cores (UT-transform), ~2.5× cheaper Tuned sequential scan (default); chunked parallel-scan (0031) is opt-in + ~22% slower (serial f32 reductions, no TC, C=16 forced by 99KB smem) prefill GAP (#2 prefill bottleneck). No tensor-core chunked GDN. Scoped (TENSORCORE_GDN_SCOPE, mma.sync only); Gram products de-risked at 6.7-9.3× over sequential, kernel not yet built
8 Causal conv1d (short conv) FLA causal_conv1d_fn/_update Triton ggml_ssm_conv_update_inplace (0021): 5-op chain → 1 op, in-place ring both Parity
9 Full-attention - decode (16 of 64 layers) FlashInfer / TRT-LLM paged decode (tensor-core, cascade wrapper, FP8-KV capable) llama.cpp FA ggml_flash_attn_ext with block-table paged read (src[5]); routed to vec/tile kernels decode Parity at decode width (vec/tile is right for small batch)
10 Full-attention - prefill (large-M) FlashInfer/TRT-LLM tensor-core prefill FA Forced to vec/tile (block-table only grafted into vec/tile; mma/wmma FA ignores it, dispatch-guarded off) prefill GAP (secondary). Paged prefill full-attn gets no tensor-core FA. Docs rank it below MoE-GEMM/GDN, so not the dominant prefill term
11 Paged KV manager (full-attn) vLLM block manager + hybrid KV cache manager (co-sizes attn/linear blocks to equal physical bytes, anti-fragmentation) + auto prefix caching PagedKVManager (FreeBlockQueue/BlockPool/COW), cross-request prefix sharing, burst-reclaim (0024) both Parity on the attn side; we lack vLLM's unified hybrid co-sizing (we manage SSM state separately - see #12)
12 Hybrid SSM-state cache mgmt Unified hybrid manager pages linear-attn state alongside attn KV SSM recurrent + conv state in fixed per-seq slots, updated in-place (not paged; O(1)/seq) both Different approach, not a perf gap (recurrent state doesn't need paging); we lack unified fragmentation accounting
13 Sampler GPU FlashInfer sorting-free sampler (Dual-Pivot rejection sampling, single kernel, no logits sort, ~0 overhead); RejectionSampler for spec-decode llama.cpp host-side sampler chain (CPU partial-sort for top-k/p) serving GAP - NO EQUIVALENT. Host sampler + D2H logits adds to the per-step host loop at high concurrency (greedy md5 bench hides it)
14 Scheduler / continuous batching / chunked prefill V1: mixed prefill+decode step, chunked prefill default-on, decode-prioritized max_num_batched_tokens budget, auto-chunk update_slots() unified step, decode-first dynamic budget (0016, max(n_ubatch,TD)), prefill budget (0013), prefix-share (0008) serving Parity - we match the chunked-prefill + decode-first token-budget design
15 CUDA graphs - decode FULL cudagraph: padded/bucketed decode shapes → 1 persistent captured graph per bucket → steady decode = single cudaGraphLaunch, zero host rebuild S1+S3 (0040/0041) graph reuse keyed on bucketed block-table dims + decode-shape-stable scheduling → serving reuse 0%→72.2% serving Partial. We reuse, not full-capture. Padded/fixed-slot decode (→~100% like vLLM) was built + GPU-tested + REJECTED - serving decode here is GPU-compute-bound, so dummy-row compute > reuse recovered
16 CUDA graphs - prefill PIECEWISE cudagraph (default FULL_AND_PIECEWISE) ggml graph rebuild per prefill step (paged data-ptr churn) prefill Gap, low value (prefill is compute-bound; launch overhead amortized over large M)
17 Speculative decoding / MTP MTP head + EAGLE-style spec-decode supported for this model class (Qwen3-Next ships an MTP module) None decode GAP - NO EQUIVALENT. Biggest unexploited decode-throughput lever vLLM has and we don't (potential ~1.5-2× at low-medium concurrency)
18 KV-cache dtype FP8 KV cache + FP8 attention (halves KV BW) F16 paged KV both Minor gap; partly offset by our overall 1.5-3× lower memory (NVFP4 weights). FP8-KV would cut KV BW further

Gaps where we have NO equivalent (ranked by value)

  1. Speculative decoding via the MTP head (#17). Qwen3-Next/3.6 ships a Multi-Token-Prediction module; vLLM exploits it for spec-decode. We have nothing. This is the single largest structural decode-throughput lever vLLM has that is completely absent from our series - and unlike the kernel gaps it is not BW-floored. Highest-value greenfield item.

  2. Tensor-core chunked GDN prefill (#7). vLLM's FLA chunk_gated_delta_rule pushes intra-chunk Gram products through tensor cores (~2.5× cheaper prefill). Our 0031 chunked kernel is opt-in and 22% slower (serial f32 reductions). Scoped (mma.sync-only on sm_121, no wgmma/tcgen05), Gram products de-risked at 6.7-9.3×, kernel not built. One of the two named prefill bottlenecks.

  3. Large-M native FP4-MMA grouped MoE GEMM (#4). The #1 prefill bottleneck. vLLM uses Marlin-bf16 grouped (capped at bf16 peak on sm_121); our MMQ is small-tile/1-CTA-bound. The new native FP4-MMA GEMM (103 TFLOP/s, beats cuBLAS-bf16) is the integration that closes this - and because vLLM is bf16-Marlin here, a working native FP4 grouped kernel could exceed vLLM on this exact hardware.

  4. GPU fused sorting-free sampler (#13). vLLM samples on-device (FlashInfer Dual-Pivot rejection, no logits sort); llama.cpp samples on host. Adds to the serving host loop at 128-way concurrency for top-k/p workloads. No GPU-sampler equivalent in the series.

  5. Fused MoE SwiGLU epilogue (#5). vLLM fuses gate+up+SwiGLU into the grouped-GEMM epilogue (fewer IO passes). We have the act-quant de-dup (0023) but run SwiGLU as separate ops. Prefill-relevant, decode-minor.

  6. Tensor-core FA for the paged prefill full-attn path (#10). Paged forces vec/tile (mma FA ignores the block table). Secondary - docs rank it below #2/#3 in the prefill budget.

  7. FP8 KV cache / FP8 attention (#18). Minor; partly offset by our NVFP4 memory lead.

Where we are at or ahead of vLLM (not gaps)

  • GDN decode recurrence (#6): 102.6% of vLLM bandwidth - our fusion series (0018-0022, 0028) is the strongest area.
  • Decode weight GEMMs dense+MoE (#1, #3): at the FP4 weight-BW floor = parity; dense ahead at low concurrency. The residual MoE serving gap (~66% at n=128 burst) is a GPU-compute gap (vLLM's MoE decode kernel+scheduler ~1.3× on aggregate), not a host-loop gap that a graph-reuse/padding lever can close (padded-shape lever proved this, rejected 2026-06-28).
  • Memory: 1.5-3× lower than vLLM (NVFP4-resident, no persistent bf16 dequant copies).
  • Scheduler design (#14): chunked-prefill + decode-first budget matches vLLM's V1 model.

Net assessment

Our decode kernels are at parity-to-ahead (GDN recurrence, both FP4 GEMMs at BW floor) - confirmed in the kernel regime. The two real, named-in-docs prefill gaps (MoE grouped GEMM #4, tensor-core chunked GDN #7) are being actively closed with the native FP4-MMA GEMM + the de-risked tensor-core Gram products; on consumer Blackwell specifically these can match-or-beat vLLM because vLLM is itself on a bf16-Marlin fallback, not native FP4. The two gaps with no equivalent in the series at all are MTP speculative decoding (highest-value, structural, decode) and the GPU fused sampler (serving host-loop, secondary). The serving-decode residual is GPU-compute-bound (not host/graph-reuse), so vLLM's edge there is its faster MoE decode kernel + scheduler, not something a host-side lever recovers.


Files read (all absolute): /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/README.md, docs/DECODE_SERVING_SCOPE.md, docs/PREFILL_GEMM_SCOPE.md, docs/PREFILL_GEMM_RESULTS.md, docs/TENSORCORE_GDN_SCOPE.md (same dir).

Sources:

4. Novel levers

I've grounded myself in the four scope docs, the README patch table + benchmarks (final_benchmark.csv), the methodology doc, and the 0034 FP4-MMA / 0042 fused-residual patch headers. Verified state: prefill is the biggest gap (dense ~920 vs vLLM ~2000 t/s ≈ 44-46%; MoE ~2177 vs ~5300-6223 ≈ 35-41%); decode kernel at parity; serving decode ~65% and measured GPU-compute-bound (host/graph-reuse + padded-shape proved neutral-or-worse). Already-explored/rejected: dequant→bf16 cuBLAS (0033, rejected), bf16-tau (dropped), NVFP4 projections (KL-fail), W4A16-Marlin (rejected), graph coverage (flat), act-quant fusion on decode (flat), padded-shape decode (rejected). Below are levers that go beyond those.


Candidate-lever brainstorm: closing the vLLM gap (paged Qwen3.6 NVFP4, GB10 sm_121a)

Organized by where the verified gap actually is. For each: mechanism / expected gain / gate (bit-exact vs KL) / risk / effort-reward. "Profile-gated" = run Phase-0 nsys before building, per the methodology.

A. PREFILL (the largest gap, 35-46% of vLLM) — highest reward bucket

A1. Graph-safe ragged grouped FP4-MMA MoE kernel (remove the per-expert host-sync loop)

  • Mechanism: 0034 lands the native FP4-MMA dense kernel but routes MoE prefill through the per-expert host-sync loop (a cudaStreamSynchronize per expert per layer — e.g. dozens-to-hundreds of syncs/layer). Replace it with ONE ragged/grouped FP4-MMA launch over the existing expert_bounds/ids_dst sorted layout (variable M per expert, single kernel). This is the follow-up 0034 itself flags.
  • Gain: HIGH. MoE expert GEMM is named the #1 prefill cost; this both removes the serial host syncs and unlocks kernel overlap + graph capture. The single biggest remaining prefill lever after 0034.
  • Gate: bit-exact by construction (same FP4 math, same K-order as the per-expert path) → greedy md5.
  • Risk: medium-high (ragged tiling + boundary handling, graph-safety).
  • Effort/reward: HIGH effort / HIGH reward. The flagged 0034 follow-up; rank #1 for prefill.

A2. Multi-stream expert dispatch (cheap stepping-stone to A1)

  • Mechanism: before writing the full ragged kernel, run the independent per-expert FP4-MMA GEMMs on N CUDA streams instead of the serial host-sync loop, overlapping their LPDDR5x weight reads + tensor-core work.
  • Gain: medium (partial overlap; recovers some of the serial-sync stall without the kernel rewrite).
  • Gate: bit-exact (same kernel, reordered launches) → greedy md5.
  • Risk: medium (stream/event mgmt, not graph-safe — prefill isn't graph-replayed so OK).
  • Effort/reward: LOW-MED effort / MED reward. Bank this before A1.

A3. Fuse MoE router → token-gather/scatter → GEMM (permutation fusion)

  • Mechanism: vLLM/SGLang fuse routing→permute→grouped-GEMM→unpermute. Here the activation gather (into the sorted-expert layout) and the scatter-back are separate memory passes. Read activations through ids_dst in the GEMM prologue and write through the inverse permutation in the epilogue → removes two full activation memory passes per MoE layer.
  • Gain: medium for prefill (large activation tensor); smaller for decode (0019/0028 already fuse the decode gather).
  • Gate: bit-exact (index indirection only, same values) → greedy md5.
  • Risk: medium (epilogue indexing correctness).
  • Effort/reward: MED / MED. Pairs naturally with A1's kernel.

A4. Fused MoE FFN (up_proj → SiLU → down_proj, intermediate register/shared-resident)

  • Mechanism: keep the per-expert intermediate activation in shared/registers across up→act→down instead of round-tripping it to global. For large-M prefill the intermediate is big → a real BW save; also helps decode.
  • Gain: medium-high (removes one full intermediate read+write per expert per layer).
  • Gate: bit-exact if SiLU + accumulation order preserved → greedy md5 (else KL-gate).
  • Risk: HIGH (fused FP4 FFN kernel is complex; register pressure on sm_121a).
  • Effort/reward: HIGH / MED-HIGH. Strong but expensive; sequence after A1.

A5. Activation-quant fusion into the 0042 residual/RMSNorm epilogue (prefill)

  • Mechanism: the README's "act-quant fusion FLAT" verdict was decode-only. For prefill the W4A4 activation-quantize pass is a bigger tensor. 0042 already fuses residual-add+RMSNorm+mul; extend its epilogue to emit the FP4-quantized activation the next GEMM consumes, removing a dedicated act-quant read+write.
  • Gain: low-medium for prefill.
  • Gate: bit-exact (same quantize_mmq_nvfp4 math, just fused) → greedy md5.
  • Risk: medium (epilogue + the FP4 codepath coupling).
  • Effort/reward: MED / LOW-MED. Cheap-ish add-on once 0034/A1 are in.

A6. Stream-K / split-K for the FP4 prefill GEMM (SM occupancy on few-SM GB10)

  • Mechanism: GB10 has relatively few SMs. For layers whose output grid (⌈M/128⌉×⌈N/128⌉) is smaller than the SM count, SMs idle. Stream-K splits the K dimension across CTAs with a reduction, keeping all SMs busy.
  • Gain: medium for small-output-grid layers (profile-gated — only if 0034's grid under-fills the GPU).
  • Gate: bit-exact if the f32-accumulate reduction order is fixed/deterministic; otherwise KL-gate.
  • Risk: medium (reduction correctness, workspace).
  • Effort/reward: MED / MED. Complements 0034; profile first.

A7. Prefill CUDA-graph capture (follow-on to A1)

  • Mechanism: with fixed prefill chunk size (0013/0016 budgets already exist) and A1 removing the host-sync MoE loop, the whole prefill chunk becomes graph-capturable.
  • Gain: LOW marginal — prefill kernels are large so launch overhead is amortized; the value is mostly enabling it (which A1 already does). Record as low-reward, not a standalone lever.
  • Gate: bit-exact.
  • Effort/reward: LOW / LOW. Note, don't prioritize.

B. DECODE-SERVING (~65% of vLLM aggregate, measured GPU-compute-bound)

B1. Speculative decoding, greedy = bit-exact (SSM-state rollback is the crux) novel

  • Mechanism: draft γ tokens (small draft model, or prompt-lookup/n-gram for zero extra weights), verify in one target forward. At temp=0 the accepted tokens are argmax-identical to non-spec → the greedy md5 gate PASSES by construction (lossless). This is the rare throughput-multiplier that's bit-exact-compatible. Especially powerful at low concurrency where paged is farthest below vLLM (n=8 burst: paged 28 vs vLLM 45) and the GPU is underutilized.
  • The non-obvious crux: hybrid-SSM rollback. KV rollback under paged is easy (truncate blocks). But the gated-DeltaNet recurrent state is updated in-place (patch 0018), so a rejected draft requires restoring the 128×128 f32 state per layer to the last accepted position — snapshot-before-speculate (memory+BW cost) or recompute. This SSM-state checkpoint/restore is the real engineering risk and is why naive llama.cpp spec-decode plumbing won't transfer.
  • Gain: HIGH (2-3x at favorable acceptance/low concurrency).
  • Gate: bit-exact for greedy (md5 holds); distribution-preserving (KL-gate) for temp>0.
  • Risk: HIGH (SSM snapshot/rollback, draft integration with paged KV + recurrent state, acceptance tuning).
  • Effort/reward: HIGH / HIGH. Biggest novel decode lever; start with zero-draft prompt-lookup to de-risk the rollback plumbing before adding a draft model.

B2. FP8 / quantized paged KV cache

  • Mechanism: decode is BW-bound; quantizing the paged KV (llama.cpp already has q8_0/q4_0 --cache-type-k/v) halves the KV-gather BW and doubles effective KV capacity → higher max concurrency. Wire the existing quantized-KV FA-vec path through the paged block-table read (0009/0010). Matches a vLLM feature (fp8 KV).
  • Gain: medium-high for long-context / high-concurrency decode.
  • Gate: KL-gate (KV quant changes attention numerics; watch long-context recall), per the 8cb0ce23 precedent.
  • Risk: medium (paged FA-read FP8 path; precision on long context).
  • Effort/reward: MED / MED-HIGH.

B3. Coalesced paged-KV block layout for the in-kernel decode gather

  • Mechanism: decode is at the LPDDR5x floor, so effective BW depends on coalescing. vLLM lays K as [blocks, kv_heads, head_size/x, block_size, x] precisely to coalesce the FA read. Re-lay-out the paged blocks so 0009/0010's in-kernel gather issues fully-coalesced vectorized loads matching the FA kernel's access pattern.
  • Gain: medium (profile-gated: measure the FA-read achieved-BW / sector efficiency first).
  • Gate: bit-exact (pure memory layout, identical values) → greedy md5.
  • Risk: medium (touches paged KV manager + FA read).
  • Effort/reward: MED / MED. Profile before building.

B4. Megakernel / persistent decode (single-launch fused decode step)

  • Mechanism: fuse the per-layer decode ops into one persistent kernel that loops layers internally (à la Mirage/MPK persistent megakernel), eliminating inter-op launch overhead, inter-op global round-trips, and the host loop for the decode step; keep the recurrent state resident across the step.
  • Gain: potentially high for the GPU-compute-bound serving regime (kills launch/scheduling bubbles vLLM avoids). Honest caveat: at 27-35B the activations don't fit SMEM across layers, so the win is mostly launch-overhead + scheduling, less data-residency.
  • Gate: in principle bit-exact (same ops/order) but extremely hard to guarantee → realistically KL-gate.
  • Risk: VERY HIGH (essentially re-implements the decode forward as one kernel).
  • Effort/reward: VERY HIGH / HIGH. The swing-for-the-fences lever; only after cheaper decode levers are exhausted.

B5. Pipeline sampling off the decode critical path

  • Mechanism: the doc names the "serial-SSM host loop / sampling can't start until logits land" as a floor. S2 (double-buffer set_inputs) was dropped because set_inputs is cheap — but the sampling stall between steps is different. Overlap step N's sampling + step N+1's input build with the GPU launch, so the GPU never idles waiting on host sampling.
  • Gain: medium (recovers the inter-step sampling bubble; this is the precise residual S2 didn't target).
  • Gate: bit-exact (host reordering only) → greedy md5.
  • Risk: medium (ordering correctness vs the recurrent in-place state).
  • Effort/reward: MED / MED.

B6. Co-batch chunked prefill INTO decode steps (vLLM-style GPU saturation — flips S3) reframe

  • Mechanism: S3 deliberately keeps prefill out of decode steps (for graph reuse). But the later measurement proved serving decode is GPU-compute-bound, not host-bound — which removes S3's rationale. vLLM does the opposite: mixes small prefill chunks into decode steps to fill otherwise-idle GPU at low decode width. Test co-batching a sized prefill chunk with decode to use spare SMs.
  • Gain: medium at low-to-mid decode width (better GPU utilization).
  • Gate: bit-exact (same math, scheduling only) → greedy md5.
  • Risk: low-medium (it partially contradicts S3 — A/B them; the GPU-compute-bound finding says S3's reuse benefit is ~nil here, so co-batching likely wins).
  • Effort/reward: LOW-MED / MED. Cheap A/B with high information value (directly tests the regime conclusion).

B7. Adaptive-width bucketed decode graph (doc-sanctioned revisit)

  • Mechanism: the rejected padded-shape lever used fixed pad-to---parallel; the doc explicitly leaves the door open for adaptive width (round up to next small bucket 8/16/32/64).
  • Gain: LOW on GB10 — the same doc measured serving decode GPU-compute-bound, so graph reuse buys ~nothing here. Record as: revisit ONLY if the host loop is re-confirmed dominant on other hardware.
  • Gate: bit-exact.
  • Effort/reward: MED / LOW (on GB10). Note, don't build for GB10.

C. CROSS-CUTTING / aggregate-throughput reframes

C1. Exploit the 1.5-3x memory advantage for higher max concurrency reframe

  • Mechanism: the benchmark stops at npl=128 where both engines fit. With 1.5-3x lower memory (and synergistic with B2 FP8-KV), the paged backend can serve npl=256+ in the same VRAM where vLLM OOMs. Per-stream tok/s gap is irrelevant if paged sustains 2x the concurrent streams per GPU — aggregate tok/s/GPU can match or beat vLLM.
  • Gain: HIGH for aggregate throughput-per-GPU at the memory ceiling (a legitimate, honestly-labeled "different operating point," not a per-stream parity claim).
  • Gate: bit-exact (no numeric change) → greedy md5.
  • Risk: low (scheduler/admission tuning to actually pack the streams).
  • Effort/reward: LOW / HIGH. Cheapest high-reward lever — measure aggregate at max-concurrency, pair with B2.

Ranked summary (effort vs reward)

# Lever Regime Gate Effort Reward
C1 Higher max-concurrency via memory advantage (+B2) aggregate bit-exact LOW HIGH
A1 Graph-safe ragged grouped FP4-MMA MoE kernel prefill bit-exact HIGH HIGH
B1 Speculative decode (greedy=bit-exact; SSM rollback crux) decode bit-exact (greedy) HIGH HIGH
A2 Multi-stream expert dispatch (→A1) prefill bit-exact LOW-MED MED
B6 Co-batch chunked-prefill into decode (flips S3) serving bit-exact LOW-MED MED
B2 FP8/quantized paged KV cache decode KL-gate MED MED-HIGH
A3 MoE router+gather+GEMM permutation fusion prefill bit-exact MED MED
B3 Coalesced paged-KV layout for decode gather decode bit-exact MED MED
B5 Pipeline sampling off decode critical path serving bit-exact MED MED
A4 Fused MoE FFN (up+SiLU+down resident) prefill+decode bit-exact HIGH MED-HIGH
A6 Stream-K/split-K FP4 prefill GEMM prefill bit-exact/KL MED MED
A5 Act-quant fusion into 0042 epilogue (prefill) prefill bit-exact MED LOW-MED
B4 Megakernel/persistent decode decode KL-gate VERY HIGH HIGH
A7 Prefill CUDA-graph capture (→ enabled by A1) prefill bit-exact LOW LOW
B7 Adaptive-width bucketed decode graph serving bit-exact MED LOW (GB10)

Suggested attack order: (1) C1 — near-free aggregate win exploiting the memory advantage, immediately defensible. (2) A2→A1 — the prefill MoE GEMM is the biggest single gap and 0034 already flags A1. (3) B6 — cheap A/B that directly tests/exploits the "serving is GPU-compute-bound" conclusion. (4) B1 — the highest-ceiling decode lever, but gate the SSM-state rollback plumbing first via zero-draft prompt-lookup. (5) B2/B3/B5 as the BW + bubble cleanup. (6) A4 / B4 as the high-effort structural swings only if the cheaper levers leave a funded gap.

Two highest-value non-obvious insights: (a) speculative decoding is bit-exact under greedy (md5 passes by construction) — the only throughput-multiplier compatible with the sacred gate — but its hybrid-SSM in-place-state rollback (patch 0018) is the unsolved crux. (b) the measured "serving decode is GPU-compute-bound" finding invalidates S3's keep-prefill-out rationale and argues for the opposite (B6 co-batching, vLLM-style), plus reframes the win toward aggregate-per-GPU concurrency (C1) rather than per-stream parity.

Relevant files: /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE,PREFILL_GEMM_SCOPE,PREFILL_GEMM_RESULTS,TENSORCORE_GDN_SCOPE}.md, .../README.md (s4 benchmarks, s5 rejected levers), .../docs/final_benchmark.csv, .../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch (A1 is its flagged follow-up), .../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch (A5 extends it).

5. Synthesized prioritized lever map

Prioritized Lever Map - vLLM Parity, Qwen3.6 NVFP4 on GB10 (sm_121a)

Bottom line (where the gap actually is)

  • Prefill is the largest absolute gap: dense ~44-48% of vLLM, MoE (decision model) ~29-41%. Two buckets own ~71% of the wall (NVFP4 GEMM ~49%, chunked GDN ~22%); the op-walk surfaces three uncovered residuals (MoE router/combine, prefill act-quant, FA-at-length).
  • Decode kernels are at parity-to-ahead (GDN recurrence 102.6% of vLLM BW; both FP4 GEMMs at the BW floor). Decode-serving is the still-open gap (~66% at n=128 burst), is MoE-specific and GPU-compute-bound (host-loop/graph-reuse/padded-shape all proved neutral-or-worse, so they are closed).
  • The two structural levers vLLM has that the series has no equivalent for: MTP speculative decode and GPU fused sampler. On this hardware vLLM is itself on a bf16-Marlin FP4 fallback (no tcgen05/CUTLASS-grouped), so a working native FP4 path can match-or-beat it, not just chase it.

Single highest-leverage NEXT action for the still-open decode-serving gap

Run the both-engine steady-state serving nsys window FIRST (it is the gate before any decode kernel is funded). Stagger ~128 clients through llama-server (LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 -fa -ngl 99) and the identical concurrency on vLLM; bucket per-step GPU-kernel time into {MUL_MAT_ID, FA-vec/tile, gated_delta_net, bf16-projections, act-quant, sampling} and compare serving-narrow vs static-wide vs vLLM. The decisive single metric: the per-useful-token time share of MUL_MAT_ID vs FA vs gated_delta_net in serving relative to vLLM. Primary hypothesis to confirm/refute: H1 - MoE grouped GEMM collapsing to per-expert GEMV at ragged width, and count cudaStreamSynchronize between MUL_MAT_ID launches to catch the per-expert host-sync fallback firing. This one A/B arbitrates D2 vs D3 vs D4 (all HIGH-effort) at once, and the methodology forbids building a kernel before it. Bank D1 (grouped-path guarantee) immediately as near-free insurance against the host-sync cliff regardless of outcome.

Master ranked lever table (pursue list)

# Lever Gap Gain → parity Effort Risk Gate Dependency / sequence Status
0 Phase-0 serving nsys (both-engine bucket A/B) decode enabling - sizes/arbitrates H1-H4 LOW low n/a none - do first NOT DONE
1 X1 (C1) Exploit 1.5-3× memory → serve npl=256+ where vLLM OOMs aggregate HIGH (different operating point: aggregate tok/s/GPU) LOW low BE pairs w/ D6; admission tuning NOT STARTED
2 P1 Native FP4-MMA large-M dense GEMM (patch 0034) prefill HIGH - GEMM ~49% of wall; can beat vLLM bf16-Marlin HIGH med BE (md5) foundation for P2/P8 IN PROGRESS (0034 scaffold landed)
3 D1 Guarantee grouped MMQ path - never host-sync per-expert fallback (extend 0025) decode HIGH if firing (removes catastrophic cliff) LOW low BE gated by #0; bank regardless NOT STARTED
4 P3 Multi-stream expert dispatch (→P2) prefill MED (partial overlap of serial syncs) LOW-MED med BE stepping-stone, bank before P2 NOT STARTED
5 P2 (A1) Graph-safe ragged grouped FP4-MMA MoE GEMM prefill HIGH - the #1 prefill bucket (~28% of wall) HIGH med-high BE (md5) after P1/P3; shares kernel arch w/ D2 FLAGGED 0034 follow-up
6 D10 (B6) Co-batch chunked-prefill into decode (flips S3) serving MED (fills idle SMs at low D) LOW-MED low-med BE cheap A/B; tests "GPU-compute-bound" conclusion NOT STARTED
7 P4 Tensor-core chunked GDN prefill kernel (rewrite 0031) prefill HIGH - #2 prefill bucket (~22% of wall, ~17% of gap) HIGH med-high BE→KL Gram products de-risked 6.7-9.3× DESIGN SCOPED, kernel NOT built
8 D2 (H1) Fused grouped-NVFP4 MoE decode GEMM + on-GPU token sort decode HIGH - top decode hypothesis (MoE-specific) HIGH high BE gated by #0; co-develop kernel w/ P2 NOT STARTED
9 D5 (B1) Speculative decode via MTP head decode HIGH (2-3× at low/mid concurrency) HIGH high BE (greedy) / KL (temp>0) crux=SSM in-place state rollback (0018); de-risk w/ zero-draft prompt-lookup NOT STARTED
10 D6 (B2) FP8 / quantized paged KV cache decode MED-HIGH (halves KV BW; doubles capacity → enables X1) MED med KL (8cb0ce23 precedent) wire quantized-KV FA-vec through paged read (0009/0010) NOT STARTED
11 D3 (H2) KV-split / flash-decoding paged FA decode decode MED-HIGH (ragged-KV balance + occupancy) MED-HIGH med BE→KL gated by #0 (build only if FA bucket grows) NOT STARTED
12 P5 (A3+PREFILL-L1) Fused MoE router+gather+scatter+combine prefill MED (~5-8% MoE wall, uncovered by P2/P4) MED med BE (fp32 reorder; 8cb0ce23) pairs w/ P2 kernel NOT STARTED
13 D4 (H3) Width-adaptive GDN recurrence launch params decode MED (saturate grid at narrow D) LOW-MED low BE (0022 col-independence) env GDN_NW/GDN_CPW already exists NOT STARTED
14 D7 (B3) Coalesced paged-KV block layout for decode gather decode MED (effective BW / sector efficiency) MED med BE profile-gated (#0 FA-read BW) NOT STARTED
15 P6 (A4) Fused MoE FFN (up→SiLU→down resident) prefill+decode MED-HIGH (removes intermediate round-trip) HIGH high BE→KL after P2 NOT STARTED
16 D9 (B5) Pipeline host sampling off decode critical path serving MED (recovers inter-step sampling bubble) MED med BE ordering vs in-place recurrent state NOT STARTED
17 D8 (H5/#13) GPU fused sorting-free sampler serving MED (small on greedy; matters at 128-way top-k/p) MED med BE-ish alt to D9; profile to size NOT STARTED
18 P8 (A6) Stream-K / split-K FP4 prefill GEMM prefill MED (small-output-grid layers on few-SM GB10) MED med BE if det. else KL profile-gated; complements P1 NOT STARTED
19 P7 (A5/PREFILL-L2) Act-quant fusion into 0042 epilogue (prefill) prefill LOW-MED (~3-6% prefill; vLLM avoids it entirely) MED med BE (md5) extends landed 0042; after P1 NOT STARTED
20 P9 (#10/flag-3) Tensor-core paged prefill FA prefill LOW-MED, context-dependent (grows L²) MED-HIGH med BE→KL re-profile FA share at real serving lengths first NOT STARTED
21 D11 (B4) Megakernel / persistent decode decode HIGH (kills launch/scheduling bubbles) VERY HIGH very high KL last resort, only if funded gap remains NOT STARTED

Gate key: BE = bit-exact (greedy md5); KL = KL-divergence gate; BE→KL = bit-exact preferred, KL fallback.

Drop / closed (do NOT pursue)

Lever Why dropped
Padded / fixed-slot decode (pad-to---parallel) Built, GPU-tested, REJECTED - serving decode is GPU-compute-bound; dummy-row compute > reuse recovered
B7 Adaptive-width bucketed decode graph LOW value on GB10 (same GPU-compute-bound finding); revisit only if host-loop re-confirmed dominant on other HW
dequant→bf16 cuBLAS prefill (0033) REJECTED - MMQ beat it 29-49%; superseded by native FP4-MMA (P1)
W4A16-Marlin / NVFP4 projections (bf16→FP4) REJECTED - KL-fail; vLLM keeps SAME bf16 projections, no advantage to chase
bf16-tau Dropped
Act-quant fusion on decode (lever-3) FLAT - decode is BW-bound; the prefill variant (P7) is the live one
S2 double-buffer set_inputs Dropped - set_inputs is cheap (host loop closed by 0040/0041)
H6 NVFP4 act-quant decode tax No bit-exact lever; exclusion check only (expected wash vs vLLM, which also FP4-quantizes)
P10 (A7) Prefill CUDA-graph capture LOW/LOW - prefill launch overhead amortized over large M; merely enabled by P2, not a standalone item
H4 ragged-shape umbrella Not a lever - it is the shared root of H1-H3; fixed by D2/D3/D4 at the kernel level
H5 (as exclusion) / H6 profile-only rule-outs, not builds (D8 is the actual sampler lever)

Critical-path sequence (two parallel tracks per the multi-agent GPU methodology)

Decode-serving track (gated): #0 serving nsys → bank #3 (D1) → branch on the dominant bucket: if MUL_MAT_ID-GEMV → #8 (D2); if FA → #11 (D3); if recurrence → #13 (D4). In parallel, cheap A/Bs #6 (D10) and #1 (X1). Highest-ceiling greenfield #9 (D5) once SSM-rollback de-risked via zero-draft prompt-lookup. BW cleanup #10 (D6, synergistic with X1).

Prefill track (already moving): #2 (P1, in progress) → #4 (P3) → #5 (P2) - and co-develop the P2 ragged-grouped kernel with the D2 decode kernel (one fused-MoE dispatch that degrades gracefully across M = vLLM's single fused_moe shape). In parallel #7 (P4, design ready). Then the residual-coverage adds #12 (P5), #15 (P6), #19 (P7). Profile-gated #18 (P8), #20 (P9).

Two highest non-obvious insights to act on: (a) the P2 prefill kernel and the D2 decode kernel are the same kernel (on-GPU token sort + single persistent grouped FP4-MMA launch) at different M - fund them as one effort. (b) the "serving decode is GPU-compute-bound" finding invalidates S3's keep-prefill-out rationale - #6 (D10 co-batching, vLLM-style) and #1 (X1 aggregate concurrency) are the cheap wins that follow from it, and are higher-reward-per-effort than any further host-side or graph-reuse work.

Relevant files (all absolute): /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}, .../README.md, .../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch (P1/P2), .../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch (P7), .../patches/paged/0031 (P4), 0025 (D1), 0018/0022 (D4/D5), 0009/0010 (D3/D6/D7); graph source /home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}.


PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026)

Prefill gap decomposition (paged 396 vs vLLM 197 us/tok)

All 4 runs ran on DGX (GB10) via ssh dgx.casa; GPU lock held+released, GPU restored idle. Model = decision MoE Qwen3.6-35B-A3B-NVFP4 (paged GGUF vs q36-35b-a3b-nvfp4-vllm). Buckets = % of GPU-kernel wall (nsys cuda_gpu_kern_sum), and per-prefill-token us.

PAGED MoE PREFILL (npp512 ntg4 npl32, LLAMA_KV_PAGED=1 +LLAMA_MOE_FORCE_GRAPHS=1): S_PP=2417.8 t/s; kernel 6.485s/16384 tok = 395.9 us/tok. MoE-expert-GEMM(MMQ nvfp4) 26.5% | GDN 24.2% (gdn_core 17.2, gdn_gather 3.3, gdn_conv 2.7, l2norm 1.0) | layout-copy 9.8 (convert_dtype 6.3, concat 2.9) | ew-mul 8.7 | bf16-proj 8.6 | act-quant(quantize_mmq_nvfp4) 4.7 | ew-add 4.6 | silu/sigmoid-gate 4.3 | norms 3.6 | MoE-DISPATCH(argsort 0.4+mm_ids 1.1+gather_mmq 0.7) 2.2 | get_rows 1.0 | FA 0.6 | softmax 0.05 | scatter 0.06.

vLLM MoE PREFILL (32x512, 5 reps): S_PP=4925.8 t/s; kernel 16.138s/81920 tok = 197.0 us/tok. SURPRISE: on sm_121 vLLM runs experts as Marlin W4A16 (FP4->bf16 dequant + bf16 GEMM), NOT fused-FP4 cutlass; projections are FP8 (sm89_xmma_e4m3). ew-glue(torch elementwise) 31.7% | MoE-expert-GEMM(Marlin) 24.6 | GDN(FLA chunk_* + causal_conv) 18.5 | bf16/fp8-proj 10.4 | reduce(cumsum/softmax) 5.2 | gate 2.3 | act-quant(scaled_fp8) 1.7 | layernorm 1.7 | MoE-DISPATCH(gather/align/count_sort/argsort) 1.4 | FA 1.1.

Per-token gap decomposition (paged-vLLM, of 198.9 us/tok total): GDN +59.2 (~30%), MoE-GEMM +56.5 (~28%), ew/layout/glue net +21.4 (~11%), act-quant +15.2 (~8%), bf16-proj +13.7 (~7%), gate +12.4 (~6%), norms +11.1 (~6%), dispatch +5.9 (~3%).

Decode picture (host-bound, not kernel/graph-reuse)

3 decode profiles. KEY: paged decode KERNELS are 5.4x more GPU-efficient than vLLM's, but paged static decode is HOST-BOUND (GPU ~16% busy); vLLM is GPU-bound (99% busy) on a slow recurrent GDN. They tie at static-wide-128 (paged 782 vs vLLM ~819 t/s pure decode) via opposite regimes.

PAGED DECODE-SERVING (staggered 128 clients, llama-server, steady 22s window, 83.5% GPU-busy): MoE/FFN-GEMM 40.7% (mmq 34.2 + gemv_moe 4.6 + gemv 1.4) | bf16-proj 22.8 (mul_mat_f 11.1 + nvjet 9.1 + cutlass 2.5) | GDN 21.2 (gdn_core 19.9) | act-quant 2.8 | layout 2.1 | get_rows 2.0 | ew-mul 2.0 | FA 1.6 | norms 1.2 | MoE-DISPATCH 1.1 | scatter 0.2 | softmax 0.1.

PAGED STATIC npl=128 lockstep (PP128+TG256, ~16% GPU-busy, HOST-BOUND): kernel 7.83s/49152 tok=159 us/tok, S_TG=782 t/s. MoE-GEMM 37.5 | GDN 21.6 | layout 9.6 | bf16-proj 9.2 | ew-mul 5.5 | act-quant 4.1 | ew-add 3.4 | norms 2.5 | dispatch 1.8 | FA 0.55. cudaStreamSynchronize=43.4s (84% of API/87% of wall) vs 7.83s GPU kernel => GPU idle ~84%.

PAGED STATIC npl=1 (batch-1): kernel 0.20s, MEMOPS 0.44s (68% of kern+mem), cudaStreamSync 66.7% => latency/BW-bound, GPU ~4% busy.

vLLM 128-wide offline (PT128 GEN256, 99% GPU-busy): kernel 42.56s/49152 tok=866 us/tok. GDN 45.2% (fused_recurrent_gated_delta decode 42.8!) | MoE-GEMM(Marlin) 36.2 | bf16/fp8-proj 6.6 | ew-glue 6.3 | FA 2.1 | reduce 1.4 | dispatch 0.7.

Per-token decode (paged static-128 | vLLM | ratio): MoE-GEMM 59.7|313.5 paged 5.3x faster; GDN 34.3|391.7 paged 11.4x faster; bf16-proj 14.7|57.2; total 159|866 paged 5.4x less GPU.

H1 verdict (false): the stated mechanism - 'MUL_MAT_ID per-useful-token time growing static->serving from grouped-GEMV collapse' - is REFUTED at the kernel level. The grouped path engages correctly: at width-1 the MoE expert path is GEMV (mul_mat_vec_q), and at width>=~16 it switches to grouped MMQ (mul_mat_q nvfp4) - npl=128 is 37% MMQ/~0 GEMV, serving is 34% MMQ + 6% gemv_moe. It does NOT collapse to per-token GEMV. What IS confirmed (the real H1 mechanism) is HOST-SIDE SERIALIZATION: cudaStreamSynchronize dominates the static-decode wall - npl=1 66.7% of API time (~89% of wall), npl=128 84.3% of API time (43.4s sync vs 7.83s GPU kernel => GPU ~84% idle); the serving window logged 40,902 cudaStreamSynchronize. The grouped MMQ also runs at ragged small-M tiles (mmq_x = 16/24/32/40/48/64/80/96) because tokens-per-expert is tiny -> low tensor-core utilization (small-M MMQ, not a GEMV collapse). Mechanistically the device->host sync to read MoE routing before launching per-expert GEMMs is the serializer (task D1/#104 'no host-sync MoE path').

THE BIG DECODE PICTURE (most important finding): paged and vLLM have OPPOSITE decode profiles. Paged decode kernels are 5.4x more GPU-efficient (159 vs 866 us/tok) but paged static decode is host-bound (GPU ~16% busy, serial SSM+sampling+MoE-dispatch host loop); vLLM is GPU-bound (99% busy) on a recurrent GDN kernel that is 11x slower per token, but it saturates the GPU via CUDA graphs. They tie at static-wide-128 (782 vs ~819 t/s). At SERVING the paged GPU rises to 83.5% busy because overlapping request streams hide the host stalls - so the serving lever for paged is NOT faster decode kernels (they're already fast/idle) but (a) removing host serialization / graphing the whole step incl MoE dispatch, and (b) chunked-prefill: paged's 2x-slower prefill steals serving cycles during continuous batching (the gen-80-128 serving config was ~55% prefill work; the nsys'd run2 gen-256-512 ~25%). vLLM bf16/fp8 projections are a bigger paged decode bucket than expected (22.8% serving) because batch-1/small-batch bf16 proj uses mul_mat_f (11.1%) + nvjet (9.1%).

Methodology/scope: profiled with nsys --trace=cuda + cuda_gpu_kern_sum; no NVTX in either engine so buckets are by kernel-name regex (bucketer at dgx:/home/mudler/bench/bucket2.py; reports at dgx:/home/mudler/bench/profgap/). Shared elementwise (k_bin_bcast add/mul, torch elementwise) straddle resid/MoE-fanin/GDN-glue and are bucketed by dominant use with that caveat; vLLM's torch_ew (31.7% prefill) is GDN-glue+MoE-combine+resid and is genuinely ambiguous. The dense Qwen3.6-27B-NVFP4 was NOT separately profiled (time budget; the MoE decision-model contains both MoE experts AND the same GDN/attention stack, fully answering A/B/C); GDN findings generalize to dense. vLLM decode here is offline 128-wide (continuous-batched), not staggered-server, so the cross-engine serving ratio is taken from prior h2h benches (~55-80% of vLLM at npl 64-128), not a fresh staggered vLLM run. Cross-engine 'gap' numbers are GPU-kernel-time per token (apples for GPU-bound prefill; for decode the host-bound vs GPU-bound asymmetry means wall-throughput parity hides a 5.4x GPU-efficiency paged advantage).

Decision

moe_prefill_lever

BETTER GROUPED GEMM KERNEL (D2/#105), NOT P5 dispatch fusion. The profile settles this empirically: explicit MoE dispatch (argsort+softmax+get_rows+set_rows+mm_ids+gather_mmq) is only 8.6 us/tok (~2-3% of the paged prefill wall; +5.9 us/tok = ~3% of the gap). P5 is REJECTED as a standalone lever - and the premise it rests on ("vLLM fuses dispatch into the GEMM epilogue") is FALSE on GB10: vLLM runs Marlin W4A16 with its OWN separate dispatch kernels (count_and_sort_expert_tokens/moe_align/vectorized_gather/moe_sum, 2.7 us/tok). Dispatch is cheap in both engines; epilogue-fusing it buys ~3% at most.

The real lever is the grouped GEMM: paged grouped-MMQ MUL_MAT_ID is 105 us/tok vs vLLM Marlin 48.5 us/tok = 2.16x slower, ~28% of the prefill gap (+56.5 us/tok). It does NOT collapse to GEMV - the grouped path engages correctly; it loses because ragged small-M-per-expert tiles (mmq_x 16-96) under-utilize tensor cores.

Is it winnable given MMQ already beat our native kernel? YES in principle, but ONLY via a kernel approach we have NOT yet tried correctly. Both prior attempts failed for identifiable reasons: 0033 did dequant as a SEPARATE global-memory pass then cuBLAS (lost to fused FP4 MMQ 29-49%); 0034 native FP4-MMA W4A4 PoC did NOT hold in-backend. vLLM proves the winning shape on THIS EXACT silicon (sm_121, Marlin bf16 fallback - no native FP4) is IN-REGISTER FP4->bf16 dequant feeding bf16 mma.sync with cp.async pipelining + large/grouped tiles, and W4A16 means ZERO activation-quant. That second point is load-bearing: act-quant (quantize_mmq_nvfp4) is +15.2 us/tok = ~8% of the gap that vLLM STRUCTURALLY does not pay because it is W4A16. So a Marlin-style W4A16 grouped MoE-prefill GEMM is a combined ~36% prefill lever (GEMM 28% + act-quant 8%), and it is a DIFFERENT kernel from both rejects (not a separate-pass dequant, not native FP4-MMA). The README's "W4A16 rejected" verdict was DECODE-only (BW-bound, wash); prefill is compute-bound and the act-quant pass is M-proportional, so W4A16 for prefill is unaudited and the most promising structural fix. GATE: must beat MMQ in a SEPARATELY-BUILT in-backend A/B at the real ragged-small-M MoE-prefill shapes (NOT a standalone PoC - the exact lesson from rejecting native FP4-MMA); bit-exact via KL-gate for the bf16-dequant reduction-order change (paged-MoE 8cb0ce23 precedent).

gdn_build_go

True

gdn_rationale

GO on #101, with a Phase-1 in-backend kill-gate. The profile makes the regime check the scope doc demanded (TENSORCORE_GDN_SCOPE Phase 0) pass cleanly: (1) GDN is the #1 SINGLE contributor to the prefill gap at +59.2 us/tok (~30% of the gap), edging out MoE-GEMM (+56.5). (2) The cost is MATH-predominant, not layout/host: gdn_core (the hand-written FP32 chunked-scan, NOT tensor-core) is 17.2% of the wall; GDN-attributable layout (gdn_gather 3.3 + head-concat 2.9 + a convert_dtype slice) is only ~6-7% (~1/4). So tensor cores attack the dominant 3/4, and the 1/4 layout folds into the same fused kernel. (3) The headroom is MEASURED on identical silicon: vLLM's FLA chunked GDN runs the SAME math at 36.5 us/tok vs paged 95.7 = 2.62x, confirming the scope's "mma absorbs the O(C^2) intra-chunk flops so the Cx state-BW cut becomes a net win" mechanism. (4) Bonus dual payoff: it also chips the decode serial-SSM residual and, via continuous batching, the serving-decode lever (prefill steals ~25-55% of serving cycles).

CONDITION (empirical guard, not PoC-optimism): 0031's chunking math was correct yet came back 22% SLOWER in-backend, and we JUST rejected native FP4-MMA because its standalone PoC win did not hold in-backend. So GO funds Phase 1 ONLY (two Gram products on mma.cuh tf32 tiles at fixed C=16/1-block-SM); it must move S_PP in a SEPARATELY-BUILT in-backend A/B vs the sequential scan. If Phase 1 is flat, the occupancy/register wall is the blocker, not the reductions - NO-GO the multi-week Phase 2/3 build. Precision gate is the KL-gate (tf32 default, 3xtf32 ladder), greedy md5 stability, plus the adversarial g in [-20,-1e-4] decay op case; ship opt-in default-off until a separately-built A/B beats sequential.

top_decode_lever

D1/#104 - the no-host-sync MoE decode path + full-step CUDA-graph capture (graph the WHOLE decode step INCLUDING MoE dispatch), targeting the device->host MoE-routing readback. Ranked decisively by the profile, NOT by raw GPU-bucket size: the dominant decode cost is not a GPU kernel at all - it is cudaStreamSynchronize, 84% of the static-decode wall (43.4s sync vs 7.83s GPU kernel; npl=1 66.7%, npl=128 84.3% of API time; 40,902 syncs in the serving window). Root cause = the device->host sync to read MoE routing before launching per-expert GEMMs. Paged decode KERNELS are already 5.4x more GPU-efficient than vLLM's and the GPU sits 84% idle in static decode, so D1 is the only decode lever that attacks the actual bottleneck.

D2/D3/D4 for DECODE are all REJECTED by the methodology's "a faster kernel off the critical path benches flat" rule: D2 fused MoE decode GEMM - paged MoE-GEMM is already 5.3x faster/token than vLLM (59.7 vs 313.5 us/tok); making it faster just adds idle. D3 FA-split - FA is 1.6% of decode-serving wall / 0.55% static (H2 refuted; the hybrid is mostly GDN with few full-attn layers); not a lever. D4 GDN-width-adaptive - paged GDN decode is already 11.4x faster/token than vLLM (34 vs 392); H3 confirmed (flat across width, no amortization) but the recurrence is NOT the bottleneck, host serialization is - an occupancy retune yields ~nothing until the host loop is gone.

Honest scope on D1's payoff: at HIGH-concurrency serving the paged GPU is already 83.5% busy because overlapping request streams hide the host stalls, so D1's win concentrates at LOW-concurrency / latency / batch-1 (GPU 4-16% busy), where it is large. The complementary serving-throughput lever is FIXING PREFILL (GDN #101 + MoE GEMM D2/#105): paged's 2x-slower prefill steals serving cycles under continuous batching (~25-55% of the serving step is prefill work) - so the prefill levers ARE also serving-decode levers. GATE: separately-built in-backend A/B (compiled-in, so a runtime flag does NOT isolate it) showing higher static/low-concurrency decode t/s with no high-concurrency-serving regression; bit-exact greedy md5 (graph replay re-issues identical kernels).

next_3_levers

Ranked, each with its pass-gate:

  1. #101 TENSOR-CORE mma CHUNKED GDN PREFILL KERNEL (prefill, GO). #1 prefill-gap contributor (+59 us/tok, ~30%), ~3/4 math (tensor cores help) with 2.62x measured headroom on identical silicon, 1/4 layout folds in; also helps serving decode. GATE: Phase-0 regime already satisfied by this profile; Phase-1 two-Gram-product PoC must move S_PP in a SEPARATELY-BUILT in-backend A/B vs sequential (flat => NO-GO the multi-week build); then KL-gate (tf32/3xtf32) + greedy md5 + adversarial-decay op test; ship opt-in default-off until A/B beats sequential.

  2. D1/#104 NO-HOST-SYNC MoE DECODE PATH + FULL-STEP CUDA-GRAPH CAPTURE (decode). Attacks the cudaStreamSynchronize that is 84% of the static-decode wall (the MoE-routing device->host readback). Lowest effort, bit-exact, highest-confidence decode win (concentrated at low-concurrency/latency). GATE: separately-built in-backend A/B (not a runtime-flag toggle) - higher static/low-concurrency decode t/s, no high-concurrency-serving regression; bit-exact greedy md5.

  3. D2/#105 MARLIN-STYLE W4A16 GROUPED MoE PREFILL GEMM (prefill). In-register FP4->bf16 dequant + bf16 mma.sync, cp.async, large grouped tiles - captures the 28% MoE-GEMM gap AND the 8% act-quant gap (W4A16 has no activation-quant), = ~36% combined; this is exactly what vLLM does on sm_121. Ranked #3 because of HIGH risk: two prior in-backend GEMM attempts failed (0033 separate-pass dequant, 0034 native FP4-MMA PoC didn't hold). GATE: must beat MMQ in a SEPARATELY-BUILT in-backend A/B at ragged-small-M MoE-prefill shapes (NOT a standalone PoC); bit-exact via KL-gate (bf16-dequant reduction order).

Explicitly REJECTED/deprioritized (record so they aren't re-run): P5 dispatch fusion (~3%, and the "vLLM fuses dispatch" premise is false on GB10); D2-for-decode, D3 FA-split, D4 GDN-width-adaptive (their kernels are already 5-11x faster than vLLM and GPU-idle -> bench flat); padded/fixed-slot decode (already tested+rejected, commit b028c81e).

notes

Empirical discipline applied throughout (per the just-rejected native FP4-MMA): every funded lever is gated on a SEPARATELY-BUILT in-backend A/B, never a standalone PoC - 0031 (chunking math correct, -22% in-backend) and 0034 (PoC win, didn't hold) are the two cautionary precedents. Two compiled-in levers (#101, D1) cannot be isolated by a runtime flag, so they need build-vs-build A/B (methodology hard rule).

Two profile surprises that reshape the directions: (a) vLLM on sm_121 is NOT native FP4 - it runs Marlin W4A16 (FP4->bf16 in-register dequant + bf16 GEMM) for experts and FP8 projections. So the winnable MoE-prefill GEMM is a W4A16-Marlin-style kernel (which also erases our 8% act-quant tax), not another native-FP4 attempt. (b) Decode is a regime asymmetry, not a kernel gap: paged decode kernels are 5.4x more GPU-efficient than vLLM's but paged static decode is HOST-BOUND (GPU 84% idle on cudaStreamSynchronize); vLLM is GPU-bound at 99% on a recurrence 11x slower/token. They tie at static-wide-128. Hence "make decode kernels faster" is the wrong instinct (benches flat); "remove host serialization / graph the full step" (D1) and "fix prefill so it stops stealing serving cycles" (#101, D2) are the decode-serving levers.

Cross-cutting: the prefill levers (#101 GDN, D2 MoE GEMM) double as serving-decode levers because continuous batching interleaves ~25-55% prefill work into the serving step. GDN edges MoE-GEMM as the top prefill pick (bigger gap, cleaner math mechanism, 2.6x proven headroom, lower in-backend risk, dual payoff).

All numbers from the both-engine nsys profile (cuda_gpu_kern_sum buckets, bucketer dgx:/home/mudler/bench/bucket2.py, reports dgx:/home/mudler/bench/profgap/); caveats: no NVTX (kernel-name regex buckets); shared elementwise straddles resid/MoE-fanin/GDN-glue; vLLM decode is offline 128-wide, not staggered-server. Relevant repo paths (absolute): /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{TENSORCORE_GDN_SCOPE.md,TENSORCORE_GDN_BUILD_PLAN.md,VLLM_PARITY_LEVER_MAP.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,DECODE_SERVING_SCOPE.md,PAGED_BITEXACT_NOTE.md,final_benchmark.csv}; patches dir .../patches/paged/ (existing 0031 chunked-GDN serial, 0033 dequant->cuBLAS rejected, 0034 native FP4-MMA, 0040/0041 S1/S3 decode-graph, 0042 fused residual+RMSNorm); methodology /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/.agents/vllm-parity-methodology.md.