From 634c0e5a0f82d5d2213840eeb62bb82c9166122b Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Thu, 25 Jun 2026 22:42:08 +0000
Subject: [PATCH] docs(paged): rms_norm->fp4 fold analysis - bit-exact decode ceiling at 95% of vLLM

The standalone quantize fold is empirically flat (Lever-2 precedent) with
the worst gain/plumbing ratio; no bit-exact lever remains. Dense 371.81 t/s
@npl128 = 95.0% of vLLM 391, recurrence past vLLM at the LPDDR5x DRAM floor,
all byte-identical to llama f32. Only bf16 state (shelved) goes further.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 .../patches/paged/RMSNORM_FP4_FOLD.md | 400 ++++++++++++++++++
 1 file changed, 400 insertions(+)
 create mode 100644 backend/cpp/llama-cpp/patches/paged/RMSNORM_FP4_FOLD.md

diff --git a/backend/cpp/llama-cpp/patches/paged/RMSNORM_FP4_FOLD.md b/backend/cpp/llama-cpp/patches/paged/RMSNORM_FP4_FOLD.md
new file mode 100644
index 000000000..1a5d06dde
--- /dev/null
+++ b/backend/cpp/llama-cpp/patches/paged/RMSNORM_FP4_FOLD.md
@@ -0,0 +1,400 @@
+# RMSNORM_FP4_FOLD.md - ceiling-critic verdict (label ceiling-critic, READ-ONLY, no GPU)
+
+Completeness audit of the post-0022/0023 bit-exact decode surface: is the rms_norm -> fp4
+producer-fold the BEST remaining bit-exact decode lever, or is something better being missed?
+Source: all paged/*.md verdicts + the 0019/0021/0023 patch diffs (local, read-only). No GPU touched.
+
+## Starting line (post-0023)
+- Dense q36-27b-nvfp4: 373.2 t/s @ npl128 = 95.4% of vLLM 391. Dense is UNTOUCHED by 0023.
+- MoE q36-35b-a3b: 758 t/s @ npl128 (0023 +1.73%).
+- Decode = ONE replayed CUDA graph, single stream, 99.94% GPU-busy, 0.06% idle. Removed/folded
+  kernel GPU-time cuts wall 1:1, and DISJOINT folds STACK 1:1 (each removes a distinct kernel).
+- gated_delta_net recurrence = ~50% of the step, at 84.6% peak BW (past vLLM's 82.4%), PLATEAUED.
+ +## TIER 0 - confirmed NO bit-exact lever (dead, do not pursue) + +(a) GDN recurrence past 84.6% - NO. The 0022 sweep is MONOTONIC toward grid.z=1: 8x4 (grid.z=4, + 32 cols/block) = 79.9%, 16x4/8x8 (grid.z=2) = 82.3%, 16x8/32x4 (grid.z=1, all 128 cols in one + block = max in-flight independent state-loads per warp) = 84.6%. grid.z>1 is the WRONG direction + (fewer cols/block = less memory-level parallelism = lower BW), already measured worse. The only + thing past 84.6% is the float4/vectorized load or a different row-partition, BOTH of which + repartition which rows a lane sums into the warp-butterfly = a different reduction grouping = + breaks md5 (the exact f32x4 trap that was explicitly avoided). 84.6% (230.9 of 273 GB/s) is at + the practical LPDDR5x DRAM ceiling AND past vLLM. No bit-exact decomposition exists. FLOOR. +(b) flash_attn_ext_f16 (3.1%) - NO. 48 CTAs = exactly one full wave, no occupancy headroom, no tail. + Every grid knob (split-KV / parallel_blocks / ncols / cols_per_block / KV-retile) changes the + online-softmax running-max/sum RESCALE ORDER across KV blocks = forbidden. FLOOR. +(c) lm_head (nvjet/cublas, 3.1%) - NO. cublas-internal; any algo/kernel swap changes the K-accum + order vs the current f32 reference = breaks md5. Already tuned. No knob. NO lever. +(d) mul_mat_q FP4 GEMM (~24-27%, the biggest bucket) - NO decode lever. P2a (mmq_y=64 / minblocks=2) + is bit-exact (1115/805, md5-identical) but MEASURED FLAT on decode (decode mmq -1.1%, stream_k + fixup +1.7ms = net worse). The -24.7% is a PREFILL large-N asymptotic number; the m=128 decode + GEMM is LPDDR5x-bandwidth-bound and mmq_y is deliberately bandwidth-neutral. FLOOR. + +=> Of the four largest buckets (recurrence 50% + GEMM 25% + lm_head 3% + attn 3% = ~81% of the + step), NONE has any bit-exact lever left. All remaining headroom lives in the ~12% of small, + foldable glue/quantize/gather buckets below. 
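The recurring "different reduction grouping = breaks md5" argument rests on f32 addition not being associative. A minimal host-side sketch of that effect follows, assuming a simplified model in which each lane sums a strided slice and the lane partials are then added in lane order. `strided_sum` and the two demo helpers are hypothetical names for illustration, not functions from the tree, and the real kernels use a warp/block tree reduce rather than this sequential combine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum x the way a block of `lanes` threads would in this simplified model:
// lane t accumulates x[t], x[t+lanes], ... in f32, then the lane partials
// are added in lane order. Hypothetical stand-in for a strided block reduce.
float strided_sum(const std::vector<float>& x, std::size_t lanes) {
    assert(lanes > 0);
    std::vector<float> partial(lanes, 0.0f);
    for (std::size_t i = 0; i < x.size(); ++i) {
        partial[i % lanes] += x[i];   // f32 accumulate, rounded every step
    }
    float total = 0.0f;
    for (float p : partial) {
        total += p;
    }
    return total;
}

// The same four values, summed under two different groupings.
float demo_one_lane()  { return strided_sum({1e8f, 1.0f, -1e8f, 1.0f}, 1); }
float demo_two_lanes() { return strided_sum({1e8f, 1.0f, -1e8f, 1.0f}, 2); }
```

With default IEEE f32 semantics (no fast-math), `demo_one_lane()` returns 1.0f because the first `+ 1.0f` is absorbed by `1e8f`, while `demo_two_lanes()` returns 2.0f because the large values cancel inside one lane before the small ones are added. The inputs are identical and only the grouping differs, which is why any change of grid, block size, or vector width in a reduction is treated as a byte-gate break.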
+ +## TIER 1 - the bit-exact-feasible folds, RANKED by ROI (gain / plumbing+risk) + +Confirmed bit-exact-foldable buckets from the post-0021/0022 node trace: +- quantize_mmq_nvfp4 ........ 4.5% (dense-foldable ~2.7% ceiling; fold captures ~2-2.5%) +- k_get_rows_float .......... 1.9-2.1% (STILL LIVE post-0021; pure gather) +- pointwise glue ............ ~3.1% (k_bin_bcast 1.7% + silu/sigmoid output-gate 1.4%; ~1.5-2.5% net) + +Rank 1 - POINTWISE ACTIVATION FOLD (~1.5-2.5%, MEDIUM plumbing, NO new ABI). Best ROI/risk of the + three. Fold k_bin_bcast residual-adds + gate-muls and the silu/sigmoid output gate into adjacent + kernel epilogues/prologues. Pure elementwise f32, same formula+order standalone or folded = + byte-identical. STRICT EXCLUSION: do NOT re-fold the rms_norm/l2_norm REDUCTIONS (reduction-tree / + eps-placement trap). No frozen ABI, no GEMM surgery. Well-scoped already (NONRECURRENCE Lever #2). + +Rank 2 - rms_norm -> fp4 PRODUCER-FOLD (the proposed lever) (~2-2.5% realistic dense, HIGHEST + plumbing). LARGEST single clean dense bucket and HIGHEST-confidence ROI (skip-B measured dense + +2.7% for the whole quantize; the fold removes the f32 round-trip, keeps the quant compute, so + ~2-2.5%). BIT-EXACT VERDICT: SOUND, and NOT the f32x4-trap class. The trap changed a REDUCTION + grouping; this fold touches only (i) the sumsq block-reduce, kept BYTE-IDENTICAL, and (ii) the + writeback, where the post-norm normalize-MUL is pointwise (order-independent, identical out_i for + any thread partition) and the NVFP4 quant is per-16-consecutive PER-THREAD with NO cross-thread + shfl (verified in quantize.cu; 0023 already shipped on exactly this property and held the byte + gate). Re-partitioning the writeback to 16-consecutive-per-thread therefore changes only WHO + writes/quantizes each element, not the VALUES or the reduction. md5-safe. 
BUT it carries the worst + plumbing-to-ROI ratio: 3-op {RMS_NORM,MUL,MUL_MAT(NVFP4)} graph fusion + a mul_mat_q + prequantized-src1 path + the frozen block_fp4_mmq ABI + a per-call scratch pool. This is the + LAST-MILE lever, not the first. + +Rank 3 - GET_ROWS / STATE-GATHER FOLD (~up to 2%, LOW-MEDIUM plumbing, ZERO reduction risk - + but UNDER-SCOPED). k_get_rows_float is STILL 7.29-7.32 ms = ~2.1% of the step post-0021/0022; the + 0021 author KEPT the build_rs conv-tap + recurrent-state gathers, explicitly deferring them + ("tiny; not one of the eliminated buckets"), NOT proving them unfoldable. A gather is a pure copy + with NO reduction = the SAFEST possible bit-exact fold (the exact property the 0023 dedup + exploited). Folding the residual build_rs gathers into their consuming kernel (read from cache via + ids/block-table instead of a pre-gathered f32 scratch, mirroring 0019's gather-free recurrence) is + bit-exact by construction. Ranked 3 only because the FOLDABLE FRACTION needs a one-pass source + scoping (some of the 2% may be the "tiny" conv-tap part already); the ROI is lower-confidence than + Rank 1/2, but the RISK is the lowest of all. THIS IS THE "SOMETHING BEING MISSED": it is a live + ~2% bit-exact bucket that the current plan does not address. + +## IS THE fp4 FOLD THE RIGHT NEXT BUILD? + +DEFENSIBLE, but NOT unambiguously the best by ROI. It is the largest single well-understood +bit-exact dense bucket and the verdict is sound (no trap). HOWEVER its plumbing is the highest of +the three folds, and the POINTWISE fold matches its realistic gain (~1.5-2.5%) at MEDIUM plumbing +with no new ABI, while the GET_ROWS fold offers ~2% at the lowest risk (pure copy). The fp4 fold has +the worst gain/plumbing ratio of the candidates. + +Recommended build order (all bit-exact, all stack 1:1 on the serial single stream): + 1. POINTWISE activation fold first (cheapest, no ABI, ~1.5-2.5%). + 2. 
GET_ROWS gather fold second, after a short source-scoping pass (~up to 2%, lowest risk). + 3. rms_norm -> fp4 producer-fold LAST (the high-plumbing last mile, ~2-2.5% dense), built only if + the remaining gap to the chosen target still justifies the ABI/graph-fusion surgery. +If the workflow insists on a SINGLE decisive lever and accepts the plumbing, the fp4 fold is the +biggest one and a legitimate choice - but it should be sequenced after the cheap folds, not before. + +## HONEST BIT-EXACT CEILING + +The three folds remove DISJOINT kernels on a 99.94%-busy serial stream, so they STACK: + ~2-2.5% (fp4) + ~1.5-2.5% (pointwise) + ~2% (get_rows) = ~5.5-7% gross on dense. + 373 t/s + ~6% = ~393-399 t/s = ~100-102% of vLLM 391. +=> The bit-exact dense ceiling is vLLM PARITY-to-slightly-ahead (~100%), NOT 95%. Declaring the + ceiling at ~95% would leave ~4-5% of identified, bit-exact-FEASIBLE fold headroom unbuilt. + Realistic SHIPPABLE ceiling (fold inefficiency + the realistic-vs-ceiling haircut + some buckets + resisting clean folding): ~98-100% of vLLM dense. The recurrence (50%) is already past vLLM and + at the DRAM floor; attention/lm_head/mul_mat_q have no bit-exact lever; everything left is the + ~6% of small folds above. There is no fourth large bit-exact lever hiding anywhere. + +Caveat that frames the whole result: vLLM 391 is a LOWER-precision reference (w4a4/w4a16 acts vs +llama's q8_1; the recurrence is algebraically reassociated). Bit-exact-vs-vLLM is IMPOSSIBLE; the +only meaningful cross-engine bar is throughput + top-1/KL, and llama at 373 (95%) bit-exact f32 is +already doing strictly MORE precise arithmetic at near-equal throughput. Closing the last ~5% with +the folds reaches throughput parity at higher precision - a strong result, but each fold is a +diminishing 1.5-2.5% at rising plumbing. The bf16-state over-clock (shelved) is the only thing that +goes materially AHEAD, and it is non-bit-exact (KL-gated), out of scope for this gate. 
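The stacking arithmetic above can be sketched as follows, assuming the folds remove disjoint fractions of a fully serial step. `tps_additive` mirrors the rounding used in the text (base plus a percentage), while `tps_serial` applies the stricter "removed time" model. Both function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Sum the per-fold fractions of step time that are removed (disjoint kernels).
double total_removed(const std::vector<double>& fractions) {
    double removed = 0.0;
    for (double f : fractions) {
        removed += f;
    }
    assert(removed >= 0.0 && removed < 1.0);
    return removed;
}

// Additive approximation, as used in the text: base * (1 + sum of fractions).
double tps_additive(double base_tps, const std::vector<double>& fractions) {
    return base_tps * (1.0 + total_removed(fractions));
}

// Serial-stream model: step time shrinks by the removed fraction,
// so throughput scales by 1 / (1 - sum of fractions).
double tps_serial(double base_tps, const std::vector<double>& fractions) {
    return base_tps / (1.0 - total_removed(fractions));
}
```

For the low case `{0.02, 0.015, 0.02}` on 373 t/s, the additive form gives about 393.5 and the serial form about 394.7. For the high case `{0.025, 0.025, 0.02}`, they give about 399.1 and 401.1. The two models agree to within roughly 0.5%, well inside the uncertainty of the per-fold estimates themselves.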
+ +Assisted-by: Claude:opus-4.8 [Claude Code] + +==================================================================================================== + +# RMS_NORM -> NVFP4 PRODUCER-FOLD - PRECISE IMPLEMENTATION DESIGN (label fold-design, READ-ONLY, no GPU) + +Design-only, no GPU. Reads: DGX `~/llama-paged-dev` HEAD f7409c2 (patch 0023) + `git stash@{0}` +(trackA1-prequant-nvfp4-fused-rmsnorm) + norm.cu/quantize.cu/mmq.cu/mmq.cuh/ggml-cuda.cu/qwen35.cpp. + +## 0. One-line verdict +The fold is bit-exact-FEASIBLE, BUT the Lever-2 stash that exists as the starting point is +(a) almost certainly bit-INEXACT and (b) was measured FLAT. The single mandatory fix is the +reduction block_size dispatch; the single thing that makes it not-flat is de-dup-across-siblings ++ skipping the dead f32 write at the FFN boundary. Build the FFN boundary first, gate on a measured +per-call producer-vs-removed-quantize win before extending. Honest expectation: ~1.5-2.5% dense +best case, real risk of flat (Lever-2 precedent). Lower-risk alternative in Section 7. + +## 1. Which graph nodes fuse +Both boundaries already collapse rms_norm+gain into ONE `rms_norm_f32` kernel +(existing fuse, ggml-cuda.cu:3675). That kernel's f32 output is the byte-exact target. + +- FFN (STRONGEST), qwen35.cpp:188-192 + build_layer_ffn:478-487: + `attn_post_norm = build_norm(cur, RMS)` feeds EXACTLY `ffn_up` + `ffn_gate` (both NVFP4 MMQ at + m=128). NO non-NVFP4 consumer (residual = pre-norm `cur`; ffn_down eats silu(gate)*up). => the + f32 normed tensor is DEAD once both GEMMs read fp4 -> producer can skip the f32 write. An existing + `{MUL_MAT, MUL_MAT, GLU}` fuse (ggml-cuda.cu:3631) already groups up+gate+GLU -> the natural seam. +- GDN/attn (weaker), qwen35.cpp:161 + build_qkvz:228-243: + `attn_norm = build_norm(inpL, RMS)` feeds `wqkv` + `wqkv_gate` (NVFP4 MMQ, share src1) AND + `ssm_beta` + `ssm_alpha` (small N=n_v_heads -> MMVQ, READ THE f32). => f32 still live, producer + MUST write f32 -> smaller win. 
+- MoE FFN (qwen35moe.cpp) goes via mul_mat_id, already 0023-deduped -> out of scope. Fold = dense only.
+
+## 2. Byte-exact target (norm.cu rms_norm_f32)
+Dispatch (norm.cu:304-380): `bs = (ncols < 1024) ? 256 : 1024`, shmem 32*float.
+```
+for col=tid; col<ncols; col+=bs: tmp += x[col]*x[col];              // (R1) scalar sumsq, strided by bs
+tmp = block_reduce(tmp, s_sum);                                      // (R2) tree width depends on bs
+mean = tmp/ncols; scale = rsqrtf(mean+eps);                          // (R3) exact eps/div
+for col=tid; col<ncols; col+=bs: dst[col] = scale*x[col]*mul[col];   // pointwise writeback
+```
+=> The pointwise writeback may be re-partitioned. (R1/R2/R3)
+are the ONLY order-sensitive parts and must stay byte-identical.
+
+## 3. Fused producer kernel (quantize.cu) - deltas vs the stash
+Start from stash `rms_norm_mul_quantize_nvfp4_kernel` + the factored `quantize_nvfp4_write_subblock`
+(verbatim per-thread NVFP4 quant). Required changes:
+1. TEMPLATE on block_size + launch `bs=(ncols<1024)?256:1024` (NOT the stash's hardcoded 256). MANDATORY.
+2. Reduction pass VERBATIM (R1/R2/R3): scalar strided sumsq, `block_reduce`, `mean=tmp/ncols`,
+   `scale=rsqrtf(mean+eps)`. Byte-identical once bs matches.
+3. Writeback re-partitioned to 16-consecutive-per-thread: each thread takes whole 16-element
+   sub-blocks (strided by `tid`) and quantizes each with `quantize_nvfp4_write_subblock`.
+4. `write_f32` template flag: FALSE at FFN (skip `dr[col]=v` -> drop the producer's f32 store),
+   TRUE at GDN (beta/alpha read it). THIS is what turns re-bucketing into a real traffic cut.
+Buffer ABI frozen: block_fp4_mmq = {uint32_t d4[4]; int8_t qs[128]} = 144B = 9 uint4 = 4*block_q8_1
+(mmq.cuh:53). Same layout quantize_mmq_fp4_cuda emits; GEMM stride
+s12=ne11*ne10_padded*sizeof(block_fp4_mmq)/(QK_K*sizeof(int)).
+
+## 4. mul_mat_q prequantized-src1 plumbing (mmq.cu/mmq.cuh)
+Re-add the stash hook on top of 0023: `ggml_cuda_mul_mat_q(..., const char* src1_prequantized=nullptr)`.
+In the NON-ids branch: if non-null, skip quantize_mmq_fp4_cuda + the local pool alloc, point mmq_args
+src1_q8_1 at it. GEMM byte-UNTOUCHED (the bit-exactness firewall). 0023 ids-branch untouched (orthogonal).
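Any buffer handed in through `src1_prequantized` has to match the frozen layout quoted in Section 3. A small compile-time sketch of those size identities and of the stride formula follows, assuming `QK_K` is 256 and a 4-byte `int`. The `*_mirror` structs are local stand-ins written for this illustration, not the real definitions from mmq.cuh, and the field comments only repeat what the text states.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical local mirror of the layout quoted in the text.
struct block_fp4_mmq_mirror {
    uint32_t d4[4];    // 16 bytes, as quoted: uint32_t d4[4]
    int8_t   qs[128];  // 128 bytes, as quoted: int8_t qs[128]
};

// Stand-in for CUDA's 16-byte uint4 vector type.
struct uint4_mirror { uint32_t x, y, z, w; };

constexpr std::size_t QK_K_ASSUMED     = 256;  // assumption: ggml's QK_K
constexpr std::size_t BLOCK_Q8_1_BYTES = 36;   // implied by 144B = 4*block_q8_1

static_assert(sizeof(block_fp4_mmq_mirror) == 144, "144B block");
static_assert(sizeof(block_fp4_mmq_mirror) == 9 * sizeof(uint4_mirror), "9 uint4");
static_assert(sizeof(block_fp4_mmq_mirror) == 4 * BLOCK_Q8_1_BYTES, "4 block_q8_1");

// src1 stride in units of int, following the s12 formula in the text.
constexpr std::size_t src1_stride_ints(std::size_t ne11, std::size_t ne10_padded) {
    return ne11 * ne10_padded * sizeof(block_fp4_mmq_mirror)
           / (QK_K_ASSUMED * sizeof(int));
}
```

For the dense decode shape discussed in this document (ne11 = 128 rows, K = 5120, taken here as already padded), `src1_stride_ints(128, 5120)` evaluates to 92160. A fused producer that emits any other block size or stride would break the GEMM's addressing without touching the GEMM itself, which is why the layout is treated as frozen.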
+Sharing across non-adjacent siblings: +- FFN (preferred): extend `{MUL_MAT,MUL_MAT,GLU}` to `{RMS_NORM,MUL,MUL_MAT,MUL_MAT,GLU}` super-fuse; + one producer (write_f32=false) + one pool buf spanning both GEMMs + GLU, all in one handler. Clean. +- GDN/general: a scratch cache keyed by the normed tensor ptr (graph-eval lifetime); defer until FFN wins. +The stash folds only ONE consumer with a stack-scoped qbuf -> the sibling still standalone-quantizes +(a key reason it was flat; nsys showed quantize 12896->10816, not ->0). + +## 5. Bit-exactness argument +(1) NVFP4 quant of each 16-elem sub-block = PURE per-thread function, NO cross-thread shfl/reduction + (quantize.cu; the exact property 0023 shipped on). => writeback re-partition cannot change a byte. +(2) v=scale*x[col]*mul[col] byte-identical iff scale identical (R1/R2/R3 preserved via bs dispatch) + AND expression verbatim (left-assoc, scalar). Per-column independent -> partition-invariant. +=> produced block_fp4_mmq bytes == standalone == 0022/0023 baseline; GEMM untouched -> md5 held. +Gate: BATCHED (ne[1]>8) md5 == 5951a5b4 dense + 1115/1115 - NOT just batch=1 (the gate Lever-2 skipped). + +## 6. THE TRAP +- block_size trap (the stash's latent bug): canonical = `ncols<1024?256:1024`; qwen35 n_embd is + 1024/2560/4096 (qwen35.cpp:30-31) -> canonical is rms_norm_f32<1024> (LEVER2 nsys confirms). Stash + hardcodes 256 -> different strided grouping {tid,tid+256,...} vs {tid,tid+1024,...} AND 8-warp vs + 32-warp reduce -> different f32 order -> md5 break. FIX = template+dispatch matching bs. +- f32x4 vectorize trap (recurrence class): do NOT vectorize the sumsq load or align the reduction + partition to the 16-consecutive writeback. Keep sumsq scalar + strided-by-bs. +- eps/assoc: `rsqrtf(mean+eps)`, `mean=tmp/ncols`, `(scale*x)*mul`. Never reassociate. +- GEMM K-reduction / stream-k / tile loads: forbidden (NONRECURRENCE FORBIDDEN list). Fold only + changes WHO writes src1. + +## 7. 
Contrast with Lever-2 + lower-risk alternative +Lever-2 (stash) was FLAT (+0.3% dense) and NET-ADDED GPU-time (+2.3% fused vs -1.1% quantize -0.9% +rms_norm) because it (a) folded only 1 of 2 siblings, (b) always wrote f32, (c) bs=256 (wrong AND +non-canonical). It md5'd only batch=1 (fuse off) -> bit-inexactness never caught. The new fold beats +it ONLY with de-dup-both-siblings + skip-dead-f32-at-FFN; without BOTH, expect flat again. +LOWER-RISK alt (recommend evaluating first): dense quantize DE-DUP, no fold - keep the efficient +standalone quantize, quantize the shared normed activation ONCE, reuse for wqkv+wqkv_gate / +ffn_up+ffn_gate (CSE keyed by src1 ptr, the dense analog of 0023). ZERO reduction risk (rms_norm +untouched), much less plumbing; ceiling ~<=1% (redundant half only), which the fold's de-dup half +captures anyway. The fold's only incremental value is the f32 round-trip read, which Lever-2 showed +is easily eaten by the fused kernel's added work. + +## 8. Scope + build order (the gate) +Scope dense qwen35: quantize.cu/.cuh (templated kernel + bs dispatch), mmq.cu/.cuh (src1_prequantized +on non-ids path), ggml-cuda.cu (FFN super-fuse, gate: NVFP4 src0 + Blackwell + ne[1]>MMVQ_MAX_BATCH_SIZE ++ ne2==ne3==1 + per-channel gain; flag LLAMA_FUSE_NVFP4_QUANT). +Build order: (1) FFN super-fuse only (write_f32=false + de-dup); measure per-call producer GPU-time +vs the two removed quantizes (nsys node trace, same-build flag toggle); SHIP only if decode_agg +actually lifts AND batched md5==0022/1115. (2) Only if (1) lifts, add the GDN boundary (write_f32=true, +keyed scratch). Realistic: ~1.5-2.5% dense FFN best case; ceiling +2.7% (skip-ALL) is unreachable +(fold keeps quant compute+write). If step 1 is flat, dense quantize is at its bit-exact floor -> stop. 
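The ship gate in the build order reduces to two conditions that must both hold. A minimal sketch follows, assuming the throughput lift is compared against a caller-supplied noise fraction and the batched md5 digests are compared as strings. `FoldResult` and `should_ship` are hypothetical names, and the noise threshold is an input parameter, not a value fixed by this section.

```cpp
#include <cassert>
#include <string>

// One same-build A/B measurement (flag off vs flag on). Hypothetical record.
struct FoldResult {
    double      base_tps;   // decode t/s with the fold disabled
    double      fused_tps;  // decode t/s with the fold enabled
    std::string base_md5;   // batched (ne[1] > 8) reference digest
    std::string fused_md5;  // batched digest with the fold enabled
};

// Ship only if the batched output is byte-identical AND the lift clears noise.
bool should_ship(const FoldResult& r, double noise_fraction) {
    assert(r.base_tps > 0.0 && noise_fraction >= 0.0);
    const bool   byte_exact = (r.base_md5 == r.fused_md5);
    const double lift       = (r.fused_tps - r.base_tps) / r.base_tps;
    return byte_exact && lift > noise_fraction;
}
```

Applied to the Lever-2 dense figures (333.32 -> 334.44 t/s, a lift of about 0.34%), the gate rejects the change at a 0.5% noise threshold even when the digests match. Keeping the byte check and the throughput check in one predicate makes it harder to repeat the Lever-2 mistake of validating only one of them.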
+ +Assisted-by: Claude:opus-4.8 [Claude Code] + +==================================================================================================== + +# RE-PROFILE TARGET MEASUREMENT (label reprofile-target, THE GPU agent) - post-0023, HEAD f7409c2 + +Fresh node-level nsys re-profile of the DENSE decode to confirm the fold target size, foldable +fraction, critical-path status, and the realistic recoverable ceiling, BEFORE BuildFold commits. + +## Build-dir correction (acted on) +The orchestrator framed `build-cuda-base` as the clean 0023 build. It is NOT: empirically +`build-cuda-base` = stale pre-0021 (336.71 t/s), the real post-0023 build is `build-cuda` (371.81 t/s, +git-clean tree, no mmq.cuh P2a remap). All numbers below are from `build-cuda`. (Dense profiling is +unaffected by the 0023 MoE de-dup knob - dense has no MoE.) + +## Confirmed baseline +- llama-batched-bench dense q36-27b-nvfp4 npl128 ntg128: 371.81 t/s, 344 ms/decode-step. CONFIRMS the + ~343 ms / ~373 t/s target. (build-cuda-base stale = 336.71 t/s.) +- nsys --cuda-graph-trace=node, 103 steady windowed steps: step span 345.0 ms, mean kernel-busy 99.0%, + sum-of-kernels/span = 98.9% (< 100% => no NET overlap; serial single stream, ~1.1% idle). + +## Dense decode decomposition (ms/step) +gated_delta_net 168.06 (49.2%) BINDING | mul_mat_q 93.57 (27.4%) | +**quantize_mmq_nvfp4 17.55 (5.1%)** | nvjet 12.02 (3.5%) | flash_attn_ext 11.64 (3.4%) | +ssm_conv 8.56 (2.5%) | k_get_rows_float 7.32 (2.1%) | silu 5.36 | k_bin_bcast(mul) 4.64 | +stream_k_fixup 3.95 | rms_norm 3.53 (1.0%). TOTAL kernel 341.25. + +## quantize_mmq_nvfp4 at the dense decode shape (the answer) +- TOTAL: 17.55 ms/step = 5.1% of kernel time = 5.08% of the 345 ms wall. 496 quant calls/step (1 per + NVFP4 GEMM src1). CONFIRMS the verdict's 17.66 ms / ~4.5-5% (the stray "3.7%" reading was wrong). 
+- Decomposes EXACTLY by input dim K (graph-verified in qwen35.cpp; 64 layers = 48 GDN + 16 attn): + - K=5120 (368/step) FOLDABLE: GDN {wqkv, wqkv_gate, beta, alpha} + attn {wq,wk,wv} + both {ffn_up, + ffn_gate}. All fed by a plain rms_norm+mul (attn_norm or attn_post_norm). beta/alpha CONFIRMED + foldable: they read the same `cur` as wqkv (qwen35.cpp:359/366). + - K=6144 (64/step) UNAVOIDABLE: ssm_out (gated-norm = rms_norm + mul(ssm_norm) + mul(silu(gate)), + two muls break the chain) + wo (attn-gated producer). + - K=17408 (64/step) UNAVOIDABLE: ffn_down (silu(gate)*up producer). + +## Foldable portion (measured) - LARGER than the byte-model 2.7% +The quant kernel is NOT byte-proportional: ffn_down (K=17408) measures 3.62 ms but a byte-model +predicts 5.75 ms. Small-K quants are launch/overhead-bound (flat ~21.7 us floor, K=5120 vs 6144 +indistinguishable), so the byte model UNDER-counts the numerous small-K (foldable) calls. +- byte-model FOLDABLE = 9.73 ms = 2.82% of step +- flat-split FOLDABLE = 11.90 ms = 3.45% of step (368 small-K quants, the physically correct one) +- => true FOLDABLE raw GPU-time = 9.7 - 11.9 ms = 2.8% - 3.4% of step. UNAVOIDABLE = ssm_out+wo + ~2.1 ms + ffn_down 3.62 ms = ~5.7 ms (1.6%). +- Sub-split for the build order: the FFN boundary alone (ffn_up+ffn_gate, f32 DEAD -> cleanest fold) + = 128 quants/step ~4.1 ms; the input-norm boundary (wqkv/wqkv_gate/wq/wk/wv, +beta/alpha keep f32) + = ~7.8 ms raw but lower net efficiency. + +## Critical path: YES (1:1) +98.9% kern/span, 99.0% busy, single serial stream, no net overlap. The quant kernels are inline on the +serial decode chain; removing their GPU-time cuts the wall ~1:1. Not a gap-fill (there are no gaps). + +## Realistic recoverable - and the honest haircut +RAW foldable removed = 9.7-11.9 ms. 
NET recoverable is LESS, for reasons the fold-design + ceiling-critic +already flagged and this profile does not overturn: +- the fused producer KEEPS the quant compute + the fp4 write (only the f32 round-trip read is saved, + and the f32 write is droppable ONLY at the FFN boundary where it is dead); +- Lever-2 precedent: the existing stash fold measured FLAT (+0.3% dense) because it folded 1 of 2 + siblings, always wrote f32, and used a non-canonical bs=256 reduction; +- TENSION TO FLAG: the critic cites a skip-B probe of only ~+2.7% for the WHOLE quantize, yet the whole + quantize is 5.1% on a 98.9%-serial stream (which predicts ~5.1% if cleanly 1:1). Either these small + kernels are not perfectly 1:1, or the skip probe is unreliable (same class as the NONREC + garbage-routing skip artifact). This caps the realistic NET nearer the conservative end. +=> Realistic NET recoverable: ~1.5 - 2.5% dense (consistent with fold-design Section 8), real risk of + FLAT. Optimistic ceiling if the f32 round-trips fully convert: up to ~3% (371.8 -> ~383 t/s); do not + bank above ~2.5%. + +## VERDICT (GPU-measurement view) +- The target is REAL: foldable raw GPU-time 9.7-11.9 ms (2.8-3.4%, slightly LARGER than the prior 2.7% + byte-model floor), squarely on the single-stream critical path (1:1), bit-exact-FEASIBLE (no precision + change), and the largest single clean dense bucket left after the plateaued recurrence. +- BUT the NET recoverable is the contested ~1.5-2.5% with a documented FLAT risk, and this fold has the + HIGHEST plumbing of the three identified folds. Worst gain/plumbing ratio of the candidates. +- RECOMMENDATION: build is DEFENSIBLE but should be SEQUENCED AFTER the cheaper pointwise + get_rows + folds (per ceiling-critic). 
If built as the single decisive lever, do the FFN boundary FIRST (cleanest + ~4.1 ms, f32 dead), gate per-call producer-GPU-time vs the two removed quantizes, and SHIP only if + decode_agg actually lifts AND batched md5 == 5951a5b4 (1115/1115). Kill-switch: if the only bit-exact + construction forces re-partitioning the sumsq reduction (changing accumulation order), abort - not + bit-exact. + +Assisted-by: Claude:opus-4.8 [Claude Code] + +==================================================================================================== + +# BUILD VERDICT (label fold-build, THE GPU agent) - post-0023, HEAD f7409c2 = patch 0023 + +DECISION: NO BUILD. The bit-exact decode ceiling is effectively reached for any lever that justifies +its plumbing. The proposed rms_norm -> fp4 producer-fold is NOT built (it was already built once and +measured FLAT), and the recommended lower-risk alternative (dense quantize de-dup) does NOT have a +clean, contained construction for the portion that matters. Tree left clean at 0023; nothing committed +to the code; this verdict appended only. + +I extended the read-only agents' analysis with the two things they could not verify from the .md +verdicts alone: (1) the prior EMPIRICAL fold attempt, and (2) the actual graph/dispatch structure in +the source. Both kill the build. + +## 1. The fp4 producer-fold was ALREADY BUILT and measured FLAT (decisive) +LEVER2_PROGRESS.md + stash@{0} (trackA1-prequant-nvfp4-fused-rmsnorm) is exactly this fold. Measured: + - dense q36-27b-nvfp4 npl128: 333.32 -> 334.44 t/s (+0.3%), npl32 -0.5% + - MoE q36-35b-a3b npl128: 690.23 -> 690.89 (+0.1%), npl32 -0.3% +nsys A/B (fusion fires): quantize_mmq_nvfp4 -2080 inst (-1.1%), rms_norm_f32<1024> -2080 (-0.9%), +NEW rms_norm_mul_quantize_nvfp4 +2080 (+2.3%). NET GPU-time = +0.3%. The fused producer ADDS BACK +the GPU-time it removes - it RELOCATES work, it does not remove it. 
The +0.3% wall is exactly +consistent with strict 1:1 wall scaling on the single serial stream (reprofile's own model). So the +fold is not a victim of a bad implementation that a rewrite fixes - it is structurally flat: the +producer must still read x, compute sumsq, normalize, quantize and WRITE the fp4 blocks; the only +recoverable traffic is the f32 round-trip, which the fused kernel's extra work eats (Lever-2 proved +this empirically; fold-design Section 7 and reprofile both predicted it). The design's two "fixes" +(de-dup both siblings + skip dead f32 at FFN) do not change this: the skip-f32 saves one f32 write at +the FFN boundary only (~0.5% of step), and the de-dup-both-siblings is item 2 below. + +## 2. The dense quantize de-dup is NOT a clean analog of 0023 (the meaningful part is infeasible) +This is the critical finding the read-only agents missed. 0023's MoE de-dup lifted +1.73% because the +redundancy is INTRA-CALL: inside ONE mul_mat_id, the broadcast (ne11==1) up/gate quantize repeats the +SAME token n_expert_used times, all within a single ggml_cuda_mul_mat_q call, so de-dup is a contained +quantize-once + gather with a stack-scoped buffer. NO precedent issue, NO cross-node lifetime. +The DENSE redundancy is INTER-NODE and that is a different, much harder problem: + - The shared-src1 GEMMs are SEPARATE graph nodes. build_qkvz (qwen35.cpp:228-243) emits wqkv MM, + reshape, wqkv_gate MM; then ssm_beta MM, reshape, sigmoid; ssm_alpha MM, reshape, add, softplus, + mul. The four src1-sharing MMs are INTERSPERSED with reshape/sigmoid/softplus/add/mul - they are + NOT consecutive graph nodes, so ggml's consecutive-op fusion framework cannot match them. A + contained, single-handler de-dup (the only kind with safe buffer lifetime, like 0023) is impossible + for the qkvz bucket. 
+ - De-duping them therefore requires graph-level CSE: recognize 2-4 non-adjacent MUL_MAT nodes share + src1, quantize once, and keep that pool buffer alive across the intervening nodes until the last + sibling GEMM consumes it - under CUDA-graph CAPTURE (buffer addresses baked at capture, the pool + must not recycle the buffer between siblings). This is the SAME high-plumbing scratch-pool + + src1_prequantized path the fold needs, with real implementation risk (graph-capture + non-determinism / crashes), and NO precedent in the tree. fold-design's "much less plumbing" + framing for this alternative is optimistic - the hard part (inter-node buffer sharing under graphs) + is common to both. + - The qkvz bucket (the big one, ~192 redundant quants/step ~= 1.4%) is exactly the inter-node case. + - The ONLY contained, tractable dense de-dup is the FFN {MUL_MAT,MUL_MAT,GLU} (consecutive; build_ffn + LLM_FFN_PAR). But that existing fusion executes ONLY via ggml_cuda_mul_mat_vec (gated on batch<=8; + ggml_cuda_should_fuse_mul_mat_vec_q). At npl128 (m=128) it falls through to two separate MMQ nodes. + Adding an MMQ-path branch to quantize src1 once captures only the FFN redundancy = ~64 quants/step + ~= 0.5% of the step - below the +-0.3-0.5% bench noise the runs already show, not worth a new + fusion code path + the risk to the byte gate. + +## 3. The pointwise + get_rows folds are not clean wins either +- Pointwise: the cheap ops are ALREADY fused in the tree - {RMS_NORM,MUL(,ADD)} -> rms_norm_fused + (ggml-cuda.cu:4194/4199), {SSM_CONV,(ADD),SILU} -> ssm_conv (4204/4209), {UNARY(silu/sigmoid/ + softplus),MUL} -> unary_mul (4216). The residual silu 5.36 + k_bin_bcast 4.64 ms is the un-fusable + remainder inside the GDN gating chain feeding the 50% binding gated_delta_net kernel; GAP_PROGRESS + measured the whole gating-glue ceiling at only 3.35% and folding further means surgery on the binding + kernel. Lower-confidence, needs a GPU node-scoping pass - not a clean lever. 
+- get_rows: 0019 already folded the main recurrent-state gathers; the residual ~2% is an unquantified + mix of the conv-tap (already deferred as "tiny") and leftovers - under-scoped, not a confirmed win. + +## 4. Tree state / gates +- Dev tree clean at HEAD f7409c2 (git diff empty; mmq.cuh/mmq.cu/quantize.cu no uncommitted diff - + no P2a remap to revert). build-cuda = the clean 0023 build (371.81 t/s dense @npl128, per reprofile). +- No code change made -> no md5 gate needed (baseline 27b = 5951a5b4, 35b = 07db32c2 unchanged). +- No GPU build/bench launched (no buildable candidate clears the ROI bar; re-confirming the baseline + the reprofile already measured would waste the GPU window). + +## 5. FINAL BIT-EXACT CEILING +Dense q36-27b-nvfp4: 371.81 t/s @npl128 = 95.0% of vLLM 391. MoE q36-35b-a3b: 758.1 @npl128 (0023). +This is the bit-exact f32 decode plateau and there is no single decisive bit-exact lever left: + - gated_delta_net recurrence (~50%) is at 84.6% peak LPDDR5x BW, PAST vLLM (82.4%) - DRAM floor. + - mul_mat_q NVFP4 GEMM (~27%), flash_attn (~3.4%), lm_head nvjet (~3.5%) have NO bit-exact lever + (any knob changes a K-/softmax-reduction order vs the f32 reference). + - The remaining ~5% of small foldable buckets is real GPU-time on the critical path, but the largest + piece (the fp4 fold, ~1.5-2.5%) is EMPIRICALLY FLAT, the next (dense qkvz quant de-dup, ~1.4%) has + no clean inter-node construction and shares the fold's flat-risk, and the contained remainder is + each <=0.5% (FFN de-dup) or entangled in the binding kernel (pointwise) - none clears the + plumbing/risk bar for a 1:1 single-stream gain that the bench noise floor (~0.3-0.5%) can swallow. +FRAME: vLLM 391 is a LOWER-precision (w4a4) reference; bit-exact-vs-vLLM is impossible. llama at 371.81 +bit-exact f32 is doing strictly MORE precise arithmetic at ~95% of vLLM's throughput. The only thing +that goes materially further is bf16 state (precision change, KL-gated, out of scope, shelved). 
+RECOMMENDATION: ship the 0023 plateau as the bit-exact decode result. Do not build the fp4 fold (flat). +If a future agent insists on the dense qkvz de-dup, it must first build the inter-node graph-CSE +scratch-pool/CUDA-graph-lifetime plumbing and prove on a same-build flag toggle that decode_agg lifts +above the +-0.5% noise AND batched md5 == 5951a5b4 - and should expect the Lever-2 flat outcome. + +Assisted-by: Claude:opus-4.8 [Claude Code]