mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-26 09:26:55 -04:00
docs(paged): rms_norm->fp4 fold analysis - bit-exact decode ceiling at 95% of vLLM
The standalone quantize fold is empirically flat (Lever-2 precedent) with the worst gain/plumbing ratio; no bit-exact lever remains. Dense 371.81 t/s @npl128 = 95.0% of vLLM 391, recurrence past vLLM at the LPDDR5x DRAM floor, all byte-identical to llama f32. Only bf16 state (shelved) goes further. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
400
backend/cpp/llama-cpp/patches/paged/RMSNORM_FP4_FOLD.md
Normal file
400
backend/cpp/llama-cpp/patches/paged/RMSNORM_FP4_FOLD.md
Normal file
@@ -0,0 +1,400 @@
|
||||
# RMSNORM_FP4_FOLD.md - ceiling-critic verdict (label ceiling-critic, READ-ONLY, no GPU)
|
||||
|
||||
Completeness audit of the post-0022/0023 bit-exact decode surface: is the rms_norm -> fp4
|
||||
producer-fold the BEST remaining bit-exact decode lever, or is something better being missed?
|
||||
Source: all paged/*.md verdicts + the 0019/0021/0023 patch diffs (local, read-only). No GPU touched.
|
||||
|
||||
## Starting line (post-0023)
|
||||
- Dense q36-27b-nvfp4: 373.2 t/s @ npl128 = 95.4% of vLLM 391. Dense is UNTOUCHED by 0023.
|
||||
- MoE q36-35b-a3b: 758 t/s @ npl128 (0023 +1.73%).
|
||||
- Decode = ONE replayed CUDA graph, single stream, 99.94% GPU-busy, 0.06% idle. Removed/folded
|
||||
kernel GPU-time cuts wall 1:1, and DISJOINT folds STACK 1:1 (each removes a distinct kernel).
|
||||
- gated_delta_net recurrence = ~50% of the step, at 84.6% peak BW (past vLLM's 82.4%), PLATEAUED.
|
||||
|
||||
## TIER 0 - confirmed NO bit-exact lever (dead, do not pursue)
|
||||
|
||||
(a) GDN recurrence past 84.6% - NO. The 0022 sweep is MONOTONIC toward grid.z=1: 8x4 (grid.z=4,
|
||||
32 cols/block) = 79.9%, 16x4/8x8 (grid.z=2) = 82.3%, 16x8/32x4 (grid.z=1, all 128 cols in one
|
||||
block = max in-flight independent state-loads per warp) = 84.6%. grid.z>1 is the WRONG direction
|
||||
(fewer cols/block = less memory-level parallelism = lower BW), already measured worse. The only
|
||||
thing past 84.6% is the float4/vectorized load or a different row-partition, BOTH of which
|
||||
repartition which rows a lane sums into the warp-butterfly = a different reduction grouping =
|
||||
breaks md5 (the exact f32x4 trap that was explicitly avoided). 84.6% (230.9 of 273 GB/s) is at
|
||||
the practical LPDDR5x DRAM ceiling AND past vLLM. No bit-exact decomposition exists. FLOOR.
|
||||
(b) flash_attn_ext_f16 (3.1%) - NO. 48 CTAs = exactly one full wave, no occupancy headroom, no tail.
|
||||
Every grid knob (split-KV / parallel_blocks / ncols / cols_per_block / KV-retile) changes the
|
||||
online-softmax running-max/sum RESCALE ORDER across KV blocks = forbidden. FLOOR.
|
||||
(c) lm_head (nvjet/cublas, 3.1%) - NO. cublas-internal; any algo/kernel swap changes the K-accum
|
||||
order vs the current f32 reference = breaks md5. Already tuned. No knob. NO lever.
|
||||
(d) mul_mat_q FP4 GEMM (~24-27%, the biggest bucket) - NO decode lever. P2a (mmq_y=64 / minblocks=2)
|
||||
is bit-exact (1115/805, md5-identical) but MEASURED FLAT on decode (decode mmq -1.1%, stream_k
|
||||
fixup +1.7ms = net worse). The -24.7% is a PREFILL large-N asymptotic number; the m=128 decode
|
||||
GEMM is LPDDR5x-bandwidth-bound and mmq_y is deliberately bandwidth-neutral. FLOOR.
|
||||
|
||||
=> Of the four largest buckets (recurrence 50% + GEMM 25% + lm_head 3% + attn 3% = ~81% of the
|
||||
step), NONE has any bit-exact lever left. All remaining headroom lives in the ~12% of small,
|
||||
foldable glue/quantize/gather buckets below.
|
||||
|
||||
## TIER 1 - the bit-exact-feasible folds, RANKED by ROI (gain / plumbing+risk)
|
||||
|
||||
Confirmed bit-exact-foldable buckets from the post-0021/0022 node trace:
|
||||
- quantize_mmq_nvfp4 ........ 4.5% (dense-foldable ~2.7% ceiling; fold captures ~2-2.5%)
|
||||
- k_get_rows_float .......... 1.9-2.1% (STILL LIVE post-0021; pure gather)
|
||||
- pointwise glue ............ ~3.1% (k_bin_bcast 1.7% + silu/sigmoid output-gate 1.4%; ~1.5-2.5% net)
|
||||
|
||||
Rank 1 - POINTWISE ACTIVATION FOLD (~1.5-2.5%, MEDIUM plumbing, NO new ABI). Best ROI/risk of the
|
||||
three. Fold k_bin_bcast residual-adds + gate-muls and the silu/sigmoid output gate into adjacent
|
||||
kernel epilogues/prologues. Pure elementwise f32, same formula+order standalone or folded =
|
||||
byte-identical. STRICT EXCLUSION: do NOT re-fold the rms_norm/l2_norm REDUCTIONS (reduction-tree /
|
||||
eps-placement trap). No frozen ABI, no GEMM surgery. Well-scoped already (NONRECURRENCE Lever #2).
|
||||
|
||||
Rank 2 - rms_norm -> fp4 PRODUCER-FOLD (the proposed lever) (~2-2.5% realistic dense, HIGHEST
|
||||
plumbing). LARGEST single clean dense bucket and HIGHEST-confidence ROI (skip-B measured dense
|
||||
+2.7% for the whole quantize; the fold removes the f32 round-trip, keeps the quant compute, so
|
||||
~2-2.5%). BIT-EXACT VERDICT: SOUND, and NOT the f32x4-trap class. The trap changed a REDUCTION
|
||||
grouping; this fold touches only (i) the sumsq block-reduce, kept BYTE-IDENTICAL, and (ii) the
|
||||
writeback, where the post-norm normalize-MUL is pointwise (order-independent, identical out_i for
|
||||
any thread partition) and the NVFP4 quant is per-16-consecutive PER-THREAD with NO cross-thread
|
||||
shfl (verified in quantize.cu; 0023 already shipped on exactly this property and held the byte
|
||||
gate). Re-partitioning the writeback to 16-consecutive-per-thread therefore changes only WHO
|
||||
writes/quantizes each element, not the VALUES or the reduction. md5-safe. BUT it carries the worst
|
||||
plumbing-to-ROI ratio: 3-op {RMS_NORM,MUL,MUL_MAT(NVFP4)} graph fusion + a mul_mat_q
|
||||
prequantized-src1 path + the frozen block_fp4_mmq ABI + a per-call scratch pool. This is the
|
||||
LAST-MILE lever, not the first.
|
||||
|
||||
Rank 3 - GET_ROWS / STATE-GATHER FOLD (~up to 2%, LOW-MEDIUM plumbing, ZERO reduction risk -
|
||||
but UNDER-SCOPED). k_get_rows_float is STILL 7.29-7.32 ms = ~2.1% of the step post-0021/0022; the
|
||||
0021 author KEPT the build_rs conv-tap + recurrent-state gathers, explicitly deferring them
|
||||
("tiny; not one of the eliminated buckets"), NOT proving them unfoldable. A gather is a pure copy
|
||||
with NO reduction = the SAFEST possible bit-exact fold (the exact property the 0023 dedup
|
||||
exploited). Folding the residual build_rs gathers into their consuming kernel (read from cache via
|
||||
ids/block-table instead of a pre-gathered f32 scratch, mirroring 0019's gather-free recurrence) is
|
||||
bit-exact by construction. Ranked 3 only because the FOLDABLE FRACTION needs a one-pass source
|
||||
scoping (some of the 2% may be the "tiny" conv-tap part already); the ROI is lower-confidence than
|
||||
Rank 1/2, but the RISK is the lowest of all. THIS IS THE "SOMETHING BEING MISSED": it is a live
|
||||
~2% bit-exact bucket that the current plan does not address.
|
||||
|
||||
## IS THE fp4 FOLD THE RIGHT NEXT BUILD?
|
||||
|
||||
DEFENSIBLE, but NOT unambiguously the best by ROI. It is the largest single well-understood
|
||||
bit-exact dense bucket and the verdict is sound (no trap). HOWEVER its plumbing is the highest of
|
||||
the three folds, and the POINTWISE fold matches its realistic gain (~1.5-2.5%) at MEDIUM plumbing
|
||||
with no new ABI, while the GET_ROWS fold offers ~2% at the lowest risk (pure copy). The fp4 fold has
|
||||
the worst gain/plumbing ratio of the candidates.
|
||||
|
||||
Recommended build order (all bit-exact, all stack 1:1 on the serial single stream):
|
||||
1. POINTWISE activation fold first (cheapest, no ABI, ~1.5-2.5%).
|
||||
2. GET_ROWS gather fold second, after a short source-scoping pass (~up to 2%, lowest risk).
|
||||
3. rms_norm -> fp4 producer-fold LAST (the high-plumbing last mile, ~2-2.5% dense), built only if
|
||||
the remaining gap to the chosen target still justifies the ABI/graph-fusion surgery.
|
||||
If the workflow insists on a SINGLE decisive lever and accepts the plumbing, the fp4 fold is the
|
||||
biggest one and a legitimate choice - but it should be sequenced after the cheap folds, not before.
|
||||
|
||||
## HONEST BIT-EXACT CEILING
|
||||
|
||||
The three folds remove DISJOINT kernels on a 99.94%-busy serial stream, so they STACK:
|
||||
~2-2.5% (fp4) + ~1.5-2.5% (pointwise) + ~2% (get_rows) = ~5.5-7% gross on dense.
|
||||
373 t/s + ~6% = ~393-399 t/s = ~100-102% of vLLM 391.
|
||||
=> The bit-exact dense ceiling is vLLM PARITY-to-slightly-ahead (~100%), NOT 95%. Declaring the
|
||||
ceiling at ~95% would leave ~4-5% of identified, bit-exact-FEASIBLE fold headroom unbuilt.
|
||||
Realistic SHIPPABLE ceiling (fold inefficiency + the realistic-vs-ceiling haircut + some buckets
|
||||
resisting clean folding): ~98-100% of vLLM dense. The recurrence (50%) is already past vLLM and
|
||||
at the DRAM floor; attention/lm_head/mul_mat_q have no bit-exact lever; everything left is the
|
||||
~6% of small folds above. There is no fourth large bit-exact lever hiding anywhere.
|
||||
|
||||
Caveat that frames the whole result: vLLM 391 is a LOWER-precision reference (w4a4/w4a16 acts vs
|
||||
llama's q8_1; the recurrence is algebraically reassociated). Bit-exact-vs-vLLM is IMPOSSIBLE; the
|
||||
only meaningful cross-engine bar is throughput + top-1/KL, and llama at 373 (95%) bit-exact f32 is
|
||||
already doing strictly MORE precise arithmetic at near-equal throughput. Closing the last ~5% with
|
||||
the folds reaches throughput parity at higher precision - a strong result, but each fold is a
|
||||
diminishing 1.5-2.5% at rising plumbing. The bf16-state over-clock (shelved) is the only thing that
|
||||
goes materially AHEAD, and it is non-bit-exact (KL-gated), out of scope for this gate.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
====================================================================================================
|
||||
|
||||
# RMS_NORM -> NVFP4 PRODUCER-FOLD - PRECISE IMPLEMENTATION DESIGN (label fold-design, READ-ONLY, no GPU)
|
||||
|
||||
Design-only, no GPU. Reads: DGX `~/llama-paged-dev` HEAD f7409c2 (patch 0023) + `git stash@{0}`
|
||||
(trackA1-prequant-nvfp4-fused-rmsnorm) + norm.cu/quantize.cu/mmq.cu/mmq.cuh/ggml-cuda.cu/qwen35.cpp.
|
||||
|
||||
## 0. One-line verdict
|
||||
The fold is bit-exact-FEASIBLE, BUT the Lever-2 stash that exists as the starting point is
|
||||
(a) almost certainly bit-INEXACT and (b) was measured FLAT. The single mandatory fix is the
|
||||
reduction block_size dispatch; the single thing that makes it not-flat is de-dup-across-siblings
|
||||
+ skipping the dead f32 write at the FFN boundary. Build the FFN boundary first, gate on a measured
|
||||
per-call producer-vs-removed-quantize win before extending. Honest expectation: ~1.5-2.5% dense
|
||||
best case, real risk of flat (Lever-2 precedent). Lower-risk alternative in Section 7.
|
||||
|
||||
## 1. Which graph nodes fuse
|
||||
Both boundaries already collapse rms_norm+gain into ONE `rms_norm_f32<bs, do_multiply=true>` kernel
|
||||
(existing fuse, ggml-cuda.cu:3675). That kernel's f32 output is the byte-exact target.
|
||||
|
||||
- FFN (STRONGEST), qwen35.cpp:188-192 + build_layer_ffn:478-487:
|
||||
`attn_post_norm = build_norm(cur, RMS)` feeds EXACTLY `ffn_up` + `ffn_gate` (both NVFP4 MMQ at
|
||||
m=128). NO non-NVFP4 consumer (residual = pre-norm `cur`; ffn_down eats silu(gate)*up). => the
|
||||
f32 normed tensor is DEAD once both GEMMs read fp4 -> producer can skip the f32 write. An existing
|
||||
`{MUL_MAT, MUL_MAT, GLU}` fuse (ggml-cuda.cu:3631) already groups up+gate+GLU -> the natural seam.
|
||||
- GDN/attn (weaker), qwen35.cpp:161 + build_qkvz:228-243:
|
||||
`attn_norm = build_norm(inpL, RMS)` feeds `wqkv` + `wqkv_gate` (NVFP4 MMQ, share src1) AND
|
||||
`ssm_beta` + `ssm_alpha` (small N=n_v_heads -> MMVQ, READ THE f32). => f32 still live, producer
|
||||
MUST write f32 -> smaller win.
|
||||
- MoE FFN (qwen35moe.cpp) goes via mul_mat_id, already 0023-deduped -> out of scope. Fold = dense only.
|
||||
|
||||
## 2. Byte-exact target (norm.cu rms_norm_f32<bs,true>)
|
||||
Dispatch (norm.cu:304-380): `bs = (ncols < 1024) ? 256 : 1024`, shmem 32*float.
|
||||
```
|
||||
for col=tid; col<ncols; col+=bs: tmp += x[col]*x[col]; // (R1) strided sumsq grouping
|
||||
tmp = block_reduce<SUM, bs>(tmp, s_sum); // (R2) tree width depends on bs
|
||||
mean = tmp/ncols; scale = rsqrtf(mean+eps); // (R3) exact eps/div
|
||||
for col=tid; col<ncols; col+=bs: dst[col] = scale*x[col]*mul[col];// (W) per-channel gain, mul_col==col
|
||||
```
|
||||
(W) is per-column independent (scale block-uniform) -> writeback may be re-partitioned. (R1/R2/R3)
|
||||
are the ONLY order-sensitive parts and must stay byte-identical.
|
||||
|
||||
## 3. Fused producer kernel (quantize.cu) - deltas vs the stash
|
||||
Start from stash `rms_norm_mul_quantize_nvfp4_kernel` + the factored `quantize_nvfp4_write_subblock`
|
||||
(verbatim per-thread NVFP4 quant). Required changes:
|
||||
1. TEMPLATE on block_size + launch `bs=(ncols<1024)?256:1024` (NOT the stash's hardcoded 256). MANDATORY.
|
||||
2. Reduction pass VERBATIM (R1/R2/R3): scalar strided sumsq, `block_reduce<SUM,bs>`, `mean=tmp/ncols`,
|
||||
`scale=rsqrtf(mean+eps)`. Byte-identical once bs matches.
|
||||
3. Writeback re-partitioned to 16-consecutive-per-thread: `for s=tid; s<n_sub; s+=bs`, col0=s*16,
|
||||
`v=scale*xr[col]*mul[col]` (col<ncols else 0), amax=max|v|, `quantize_nvfp4_write_subblock(vals,
|
||||
amax, sub, y+ib)`, `ib=k_block*ne11+row`, n_sub=ncols_padded/16. x is re-read (canonical does too).
|
||||
4. `template<bool write_f32>`: FALSE at FFN (skip `dr[col]=v` -> drop the producer's f32 store),
|
||||
TRUE at GDN (beta/alpha read it). THIS is what turns re-bucketing into a real traffic cut.
|
||||
Buffer ABI frozen: block_fp4_mmq = {uint32_t d4[4]; int8_t qs[128]} = 144B = 9 uint4 = 4*block_q8_1
|
||||
(mmq.cuh:53). Same layout quantize_mmq_fp4_cuda emits; GEMM stride
|
||||
s12=ne11*ne10_padded*sizeof(block_fp4_mmq)/(QK_K*sizeof(int)).
|
||||
|
||||
## 4. mul_mat_q prequantized-src1 plumbing (mmq.cu/mmq.cuh)
|
||||
Re-add the stash hook on top of 0023: `ggml_cuda_mul_mat_q(..., const char* src1_prequantized=nullptr)`.
|
||||
In the NON-ids branch: if non-null, skip quantize_mmq_fp4_cuda + the local pool alloc, point mmq_args
|
||||
src1_q8_1 at it. GEMM byte-UNTOUCHED (the bit-exactness firewall). 0023 ids-branch untouched (orthogonal).
|
||||
Sharing across non-adjacent siblings:
|
||||
- FFN (preferred): extend `{MUL_MAT,MUL_MAT,GLU}` to `{RMS_NORM,MUL,MUL_MAT,MUL_MAT,GLU}` super-fuse;
|
||||
one producer (write_f32=false) + one pool buf spanning both GEMMs + GLU, all in one handler. Clean.
|
||||
- GDN/general: a scratch cache keyed by the normed tensor ptr (graph-eval lifetime); defer until FFN wins.
|
||||
The stash folds only ONE consumer with a stack-scoped qbuf -> the sibling still standalone-quantizes
|
||||
(a key reason it was flat; nsys showed quantize 12896->10816, not ->0).
|
||||
|
||||
## 5. Bit-exactness argument
|
||||
(1) NVFP4 quant of each 16-elem sub-block = PURE per-thread function, NO cross-thread shfl/reduction
|
||||
(quantize.cu; the exact property 0023 shipped on). => writeback re-partition cannot change a byte.
|
||||
(2) v=scale*x[col]*mul[col] byte-identical iff scale identical (R1/R2/R3 preserved via bs dispatch)
|
||||
AND expression verbatim (left-assoc, scalar). Per-column independent -> partition-invariant.
|
||||
=> produced block_fp4_mmq bytes == standalone == 0022/0023 baseline; GEMM untouched -> md5 held.
|
||||
Gate: BATCHED (ne[1]>8) md5 == 5951a5b4 dense + 1115/1115 - NOT just batch=1 (the gate Lever-2 skipped).
|
||||
|
||||
## 6. THE TRAP
|
||||
- block_size trap (the stash's latent bug): canonical = `ncols<1024?256:1024`; qwen35 n_embd is
|
||||
1024/2560/4096 (qwen35.cpp:30-31) -> canonical is rms_norm_f32<1024> (LEVER2 nsys confirms). Stash
|
||||
hardcodes 256 -> different strided grouping {tid,tid+256,...} vs {tid,tid+1024,...} AND 8-warp vs
|
||||
32-warp reduce -> different f32 order -> md5 break. FIX = template+dispatch matching bs.
|
||||
- f32x4 vectorize trap (recurrence class): do NOT vectorize the sumsq load or align the reduction
|
||||
partition to the 16-consecutive writeback. Keep sumsq scalar + strided-by-bs.
|
||||
- eps/assoc: `rsqrtf(mean+eps)`, `mean=tmp/ncols`, `(scale*x)*mul`. Never reassociate.
|
||||
- GEMM K-reduction / stream-k / tile loads: forbidden (NONRECURRENCE FORBIDDEN list). Fold only
|
||||
changes WHO writes src1.
|
||||
|
||||
## 7. Contrast with Lever-2 + lower-risk alternative
|
||||
Lever-2 (stash) was FLAT (+0.3% dense) and NET-ADDED GPU-time (+2.3% fused vs -1.1% quantize -0.9%
|
||||
rms_norm) because it (a) folded only 1 of 2 siblings, (b) always wrote f32, (c) bs=256 (wrong AND
|
||||
non-canonical). It md5'd only batch=1 (fuse off) -> bit-inexactness never caught. The new fold beats
|
||||
it ONLY with de-dup-both-siblings + skip-dead-f32-at-FFN; without BOTH, expect flat again.
|
||||
LOWER-RISK alt (recommend evaluating first): dense quantize DE-DUP, no fold - keep the efficient
|
||||
standalone quantize, quantize the shared normed activation ONCE, reuse for wqkv+wqkv_gate /
|
||||
ffn_up+ffn_gate (CSE keyed by src1 ptr, the dense analog of 0023). ZERO reduction risk (rms_norm
|
||||
untouched), much less plumbing; ceiling ~<=1% (redundant half only), which the fold's de-dup half
|
||||
captures anyway. The fold's only incremental value is the f32 round-trip read, which Lever-2 showed
|
||||
is easily eaten by the fused kernel's added work.
|
||||
|
||||
## 8. Scope + build order (the gate)
|
||||
Scope dense qwen35: quantize.cu/.cuh (templated kernel + bs dispatch), mmq.cu/.cuh (src1_prequantized
|
||||
on non-ids path), ggml-cuda.cu (FFN super-fuse, gate: NVFP4 src0 + Blackwell + ne[1]>MMVQ_MAX_BATCH_SIZE
|
||||
+ ne2==ne3==1 + per-channel gain; flag LLAMA_FUSE_NVFP4_QUANT).
|
||||
Build order: (1) FFN super-fuse only (write_f32=false + de-dup); measure per-call producer GPU-time
|
||||
vs the two removed quantizes (nsys node trace, same-build flag toggle); SHIP only if decode_agg
|
||||
actually lifts AND batched md5==0022/1115. (2) Only if (1) lifts, add the GDN boundary (write_f32=true,
|
||||
keyed scratch). Realistic: ~1.5-2.5% dense FFN best case; ceiling +2.7% (skip-ALL) is unreachable
|
||||
(fold keeps quant compute+write). If step 1 is flat, dense quantize is at its bit-exact floor -> stop.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
====================================================================================================
|
||||
|
||||
# RE-PROFILE TARGET MEASUREMENT (label reprofile-target, THE GPU agent) - post-0023, HEAD f7409c2
|
||||
|
||||
Fresh node-level nsys re-profile of the DENSE decode to confirm the fold target size, foldable
|
||||
fraction, critical-path status, and the realistic recoverable ceiling, BEFORE BuildFold commits.
|
||||
|
||||
## Build-dir correction (acted on)
|
||||
The orchestrator framed `build-cuda-base` as the clean 0023 build. It is NOT: empirically
|
||||
`build-cuda-base` = stale pre-0021 (336.71 t/s), the real post-0023 build is `build-cuda` (371.81 t/s,
|
||||
git-clean tree, no mmq.cuh P2a remap). All numbers below are from `build-cuda`. (Dense profiling is
|
||||
unaffected by the 0023 MoE de-dup knob - dense has no MoE.)
|
||||
|
||||
## Confirmed baseline
|
||||
- llama-batched-bench dense q36-27b-nvfp4 npl128 ntg128: 371.81 t/s, 344 ms/decode-step. CONFIRMS the
|
||||
~343 ms / ~373 t/s target. (build-cuda-base stale = 336.71 t/s.)
|
||||
- nsys --cuda-graph-trace=node, 103 steady windowed steps: step span 345.0 ms, mean kernel-busy 99.0%,
|
||||
sum-of-kernels/span = 98.9% (< 100% => no NET overlap; serial single stream, ~1.1% idle).
|
||||
|
||||
## Dense decode decomposition (ms/step)
|
||||
gated_delta_net 168.06 (49.2%) BINDING | mul_mat_q<NVFP4,128> 93.57 (27.4%) |
|
||||
**quantize_mmq_nvfp4 17.55 (5.1%)** | nvjet 12.02 (3.5%) | flash_attn_ext 11.64 (3.4%) |
|
||||
ssm_conv 8.56 (2.5%) | k_get_rows_float 7.32 (2.1%) | silu 5.36 | k_bin_bcast(mul) 4.64 |
|
||||
stream_k_fixup 3.95 | rms_norm 3.53 (1.0%). TOTAL kernel 341.25.
|
||||
|
||||
## quantize_mmq_nvfp4 at the dense decode shape (the answer)
|
||||
- TOTAL: 17.55 ms/step = 5.1% of kernel time = 5.08% of the 345 ms wall. 496 quant calls/step (1 per
|
||||
NVFP4 GEMM src1). CONFIRMS the verdict's 17.66 ms / ~4.5-5% (the stray "3.7%" reading was wrong).
|
||||
- Decomposes EXACTLY by input dim K (graph-verified in qwen35.cpp; 64 layers = 48 GDN + 16 attn):
|
||||
- K=5120 (368/step) FOLDABLE: GDN {wqkv, wqkv_gate, beta, alpha} + attn {wq,wk,wv} + both {ffn_up,
|
||||
ffn_gate}. All fed by a plain rms_norm+mul (attn_norm or attn_post_norm). beta/alpha CONFIRMED
|
||||
foldable: they read the same `cur` as wqkv (qwen35.cpp:359/366).
|
||||
- K=6144 (64/step) UNAVOIDABLE: ssm_out (gated-norm = rms_norm + mul(ssm_norm) + mul(silu(gate)),
|
||||
two muls break the chain) + wo (attn-gated producer).
|
||||
- K=17408 (64/step) UNAVOIDABLE: ffn_down (silu(gate)*up producer).
|
||||
|
||||
## Foldable portion (measured) - LARGER than the byte-model 2.7%
|
||||
The quant kernel is NOT byte-proportional: ffn_down (K=17408) measures 3.62 ms but a byte-model
|
||||
predicts 5.75 ms. Small-K quants are launch/overhead-bound (flat ~21.7 us floor, K=5120 vs 6144
|
||||
indistinguishable), so the byte model UNDER-counts the numerous small-K (foldable) calls.
|
||||
- byte-model FOLDABLE = 9.73 ms = 2.82% of step
|
||||
- flat-split FOLDABLE = 11.90 ms = 3.45% of step (368 small-K quants, the physically correct one)
|
||||
- => true FOLDABLE raw GPU-time = 9.7 - 11.9 ms = 2.8% - 3.4% of step. UNAVOIDABLE = ssm_out+wo
|
||||
~2.1 ms + ffn_down 3.62 ms = ~5.7 ms (1.6%).
|
||||
- Sub-split for the build order: the FFN boundary alone (ffn_up+ffn_gate, f32 DEAD -> cleanest fold)
|
||||
= 128 quants/step ~4.1 ms; the input-norm boundary (wqkv/wqkv_gate/wq/wk/wv, +beta/alpha keep f32)
|
||||
= ~7.8 ms raw but lower net efficiency.
|
||||
|
||||
## Critical path: YES (1:1)
|
||||
98.9% kern/span, 99.0% busy, single serial stream, no net overlap. The quant kernels are inline on the
|
||||
serial decode chain; removing their GPU-time cuts the wall ~1:1. Not a gap-fill (there are no gaps).
|
||||
|
||||
## Realistic recoverable - and the honest haircut
|
||||
RAW foldable removed = 9.7-11.9 ms. NET recoverable is LESS, for reasons the fold-design + ceiling-critic
|
||||
already flagged and this profile does not overturn:
|
||||
- the fused producer KEEPS the quant compute + the fp4 write (only the f32 round-trip read is saved,
|
||||
and the f32 write is droppable ONLY at the FFN boundary where it is dead);
|
||||
- Lever-2 precedent: the existing stash fold measured FLAT (+0.3% dense) because it folded 1 of 2
|
||||
siblings, always wrote f32, and used a non-canonical bs=256 reduction;
|
||||
- TENSION TO FLAG: the critic cites a skip-B probe of only ~+2.7% for the WHOLE quantize, yet the whole
|
||||
quantize is 5.1% on a 98.9%-serial stream (which predicts ~5.1% if cleanly 1:1). Either these small
|
||||
kernels are not perfectly 1:1, or the skip probe is unreliable (same class as the NONREC
|
||||
garbage-routing skip artifact). This caps the realistic NET nearer the conservative end.
|
||||
=> Realistic NET recoverable: ~1.5 - 2.5% dense (consistent with fold-design Section 8), real risk of
|
||||
FLAT. Optimistic ceiling if the f32 round-trips fully convert: up to ~3% (371.8 -> ~383 t/s); do not
|
||||
bank above ~2.5%.
|
||||
|
||||
## VERDICT (GPU-measurement view)
|
||||
- The target is REAL: foldable raw GPU-time 9.7-11.9 ms (2.8-3.4%, slightly LARGER than the prior 2.7%
|
||||
byte-model floor), squarely on the single-stream critical path (1:1), bit-exact-FEASIBLE (no precision
|
||||
change), and the largest single clean dense bucket left after the plateaued recurrence.
|
||||
- BUT the NET recoverable is the contested ~1.5-2.5% with a documented FLAT risk, and this fold has the
|
||||
HIGHEST plumbing of the three identified folds. Worst gain/plumbing ratio of the candidates.
|
||||
- RECOMMENDATION: build is DEFENSIBLE but should be SEQUENCED AFTER the cheaper pointwise + get_rows
|
||||
folds (per ceiling-critic). If built as the single decisive lever, do the FFN boundary FIRST (cleanest
|
||||
~4.1 ms, f32 dead), gate per-call producer-GPU-time vs the two removed quantizes, and SHIP only if
|
||||
decode_agg actually lifts AND batched md5 == 5951a5b4 (1115/1115). Kill-switch: if the only bit-exact
|
||||
construction forces re-partitioning the sumsq reduction (changing accumulation order), abort - not
|
||||
bit-exact.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
====================================================================================================
|
||||
|
||||
# BUILD VERDICT (label fold-build, THE GPU agent) - post-0023, HEAD f7409c2 = patch 0023
|
||||
|
||||
DECISION: NO BUILD. The bit-exact decode ceiling is effectively reached for any lever that justifies
|
||||
its plumbing. The proposed rms_norm -> fp4 producer-fold is NOT built (it was already built once and
|
||||
measured FLAT), and the recommended lower-risk alternative (dense quantize de-dup) does NOT have a
|
||||
clean, contained construction for the portion that matters. Tree left clean at 0023; nothing committed
|
||||
to the code; this verdict appended only.
|
||||
|
||||
I extended the read-only agents' analysis with the two things they could not verify from the .md
|
||||
verdicts alone: (1) the prior EMPIRICAL fold attempt, and (2) the actual graph/dispatch structure in
|
||||
the source. Both kill the build.
|
||||
|
||||
## 1. The fp4 producer-fold was ALREADY BUILT and measured FLAT (decisive)
|
||||
LEVER2_PROGRESS.md + stash@{0} (trackA1-prequant-nvfp4-fused-rmsnorm) is exactly this fold. Measured:
|
||||
- dense q36-27b-nvfp4 npl128: 333.32 -> 334.44 t/s (+0.3%), npl32 -0.5%
|
||||
- MoE q36-35b-a3b npl128: 690.23 -> 690.89 (+0.1%), npl32 -0.3%
|
||||
nsys A/B (fusion fires): quantize_mmq_nvfp4 -2080 inst (-1.1%), rms_norm_f32<1024> -2080 (-0.9%),
|
||||
NEW rms_norm_mul_quantize_nvfp4 +2080 (+2.3%). NET GPU-time = +0.3%. The fused producer ADDS BACK
|
||||
the GPU-time it removes - it RELOCATES work, it does not remove it. The +0.3% wall is exactly
|
||||
consistent with strict 1:1 wall scaling on the single serial stream (reprofile's own model). So the
|
||||
fold is not a victim of a bad implementation that a rewrite fixes - it is structurally flat: the
|
||||
producer must still read x, compute sumsq, normalize, quantize and WRITE the fp4 blocks; the only
|
||||
recoverable traffic is the f32 round-trip, which the fused kernel's extra work eats (Lever-2 proved
|
||||
this empirically; fold-design Section 7 and reprofile both predicted it). The design's two "fixes"
|
||||
(de-dup both siblings + skip dead f32 at FFN) do not change this: the skip-f32 saves one f32 write at
|
||||
the FFN boundary only (~0.5% of step), and the de-dup-both-siblings is item 2 below.
|
||||
|
||||
## 2. The dense quantize de-dup is NOT a clean analog of 0023 (the meaningful part is infeasible)
|
||||
This is the critical finding the read-only agents missed. 0023's MoE de-dup lifted +1.73% because the
|
||||
redundancy is INTRA-CALL: inside ONE mul_mat_id, the broadcast (ne11==1) up/gate quantize repeats the
|
||||
SAME token n_expert_used times, all within a single ggml_cuda_mul_mat_q call, so de-dup is a contained
|
||||
quantize-once + gather with a stack-scoped buffer. NO precedent issue, NO cross-node lifetime.
|
||||
The DENSE redundancy is INTER-NODE and that is a different, much harder problem:
|
||||
- The shared-src1 GEMMs are SEPARATE graph nodes. build_qkvz (qwen35.cpp:228-243) emits wqkv MM,
|
||||
reshape, wqkv_gate MM; then ssm_beta MM, reshape, sigmoid; ssm_alpha MM, reshape, add, softplus,
|
||||
mul. The four src1-sharing MMs are INTERSPERSED with reshape/sigmoid/softplus/add/mul - they are
|
||||
NOT consecutive graph nodes, so ggml's consecutive-op fusion framework cannot match them. A
|
||||
contained, single-handler de-dup (the only kind with safe buffer lifetime, like 0023) is impossible
|
||||
for the qkvz bucket.
|
||||
- De-duping them therefore requires graph-level CSE: recognize 2-4 non-adjacent MUL_MAT nodes share
|
||||
src1, quantize once, and keep that pool buffer alive across the intervening nodes until the last
|
||||
sibling GEMM consumes it - under CUDA-graph CAPTURE (buffer addresses baked at capture, the pool
|
||||
must not recycle the buffer between siblings). This is the SAME high-plumbing scratch-pool +
|
||||
src1_prequantized path the fold needs, with real implementation risk (graph-capture
|
||||
non-determinism / crashes), and NO precedent in the tree. fold-design's "much less plumbing"
|
||||
framing for this alternative is optimistic - the hard part (inter-node buffer sharing under graphs)
|
||||
is common to both.
|
||||
- The qkvz bucket (the big one, ~192 redundant quants/step ~= 1.4%) is exactly the inter-node case.
|
||||
- The ONLY contained, tractable dense de-dup is the FFN {MUL_MAT,MUL_MAT,GLU} (consecutive; build_ffn
|
||||
LLM_FFN_PAR). But that existing fusion executes ONLY via ggml_cuda_mul_mat_vec (gated on batch<=8;
|
||||
ggml_cuda_should_fuse_mul_mat_vec_q). At npl128 (m=128) it falls through to two separate MMQ nodes.
|
||||
Adding an MMQ-path branch to quantize src1 once captures only the FFN redundancy = ~64 quants/step
|
||||
~= 0.5% of the step - below the +-0.3-0.5% bench noise the runs already show, not worth a new
|
||||
fusion code path + the risk to the byte gate.
|
||||
|
||||
## 3. The pointwise + get_rows folds are not clean wins either
|
||||
- Pointwise: the cheap ops are ALREADY fused in the tree - {RMS_NORM,MUL(,ADD)} -> rms_norm_fused
|
||||
(ggml-cuda.cu:4194/4199), {SSM_CONV,(ADD),SILU} -> ssm_conv (4204/4209), {UNARY(silu/sigmoid/
|
||||
softplus),MUL} -> unary_mul (4216). The residual silu 5.36 + k_bin_bcast 4.64 ms is the un-fusable
|
||||
remainder inside the GDN gating chain feeding the 50% binding gated_delta_net kernel; GAP_PROGRESS
|
||||
measured the whole gating-glue ceiling at only 3.35% and folding further means surgery on the binding
|
||||
kernel. Lower-confidence, needs a GPU node-scoping pass - not a clean lever.
|
||||
- get_rows: 0019 already folded the main recurrent-state gathers; the residual ~2% is an unquantified
|
||||
mix of the conv-tap (already deferred as "tiny") and leftovers - under-scoped, not a confirmed win.
|
||||
|
||||
## 4. Tree state / gates
|
||||
- Dev tree clean at HEAD f7409c2 (git diff empty; mmq.cuh/mmq.cu/quantize.cu no uncommitted diff -
|
||||
no P2a remap to revert). build-cuda = the clean 0023 build (371.81 t/s dense @npl128, per reprofile).
|
||||
- No code change made -> no md5 gate needed (baseline 27b = 5951a5b4, 35b = 07db32c2 unchanged).
|
||||
- No GPU build/bench launched (no buildable candidate clears the ROI bar; re-confirming the baseline
|
||||
the reprofile already measured would waste the GPU window).
|
||||
|
||||
## 5. FINAL BIT-EXACT CEILING
|
||||
Dense q36-27b-nvfp4: 371.81 t/s @npl128 = 95.0% of vLLM 391. MoE q36-35b-a3b: 758.1 @npl128 (0023).
|
||||
This is the bit-exact f32 decode plateau and there is no single decisive bit-exact lever left:
|
||||
- gated_delta_net recurrence (~50%) is at 84.6% peak LPDDR5x BW, PAST vLLM (82.4%) - DRAM floor.
|
||||
- mul_mat_q NVFP4 GEMM (~27%), flash_attn (~3.4%), lm_head nvjet (~3.5%) have NO bit-exact lever
|
||||
(any knob changes a K-/softmax-reduction order vs the f32 reference).
|
||||
- The remaining ~5% of small foldable buckets is real GPU-time on the critical path, but the largest
|
||||
piece (the fp4 fold, ~1.5-2.5%) is EMPIRICALLY FLAT, the next (dense qkvz quant de-dup, ~1.4%) has
|
||||
no clean inter-node construction and shares the fold's flat-risk, and the contained remainder is
|
||||
each <=0.5% (FFN de-dup) or entangled in the binding kernel (pointwise) - none clears the
|
||||
plumbing/risk bar for a 1:1 single-stream gain that the bench noise floor (~0.3-0.5%) can swallow.
|
||||
FRAME: vLLM 391 is a LOWER-precision (w4a4) reference; bit-exact-vs-vLLM is impossible. llama at 371.81
|
||||
bit-exact f32 is doing strictly MORE precise arithmetic at ~95% of vLLM's throughput. The only thing
|
||||
that goes materially further is bf16 state (precision change, KL-gated, out of scope, shelved).
|
||||
RECOMMENDATION: ship the 0023 plateau as the bit-exact decode result. Do not build the fp4 fold (flat).
|
||||
If a future agent insists on the dense qkvz de-dup, it must first build the inter-node graph-CSE
|
||||
scratch-pool/CUDA-graph-lifetime plumbing and prove on a same-build flag toggle that decode_agg lifts
|
||||
above the +-0.5% noise AND batched md5 == 5951a5b4 - and should expect the Lever-2 flat outcome.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
Reference in New Issue
Block a user