docs(paged): speedup-hunt C section + final RANK + PLAN synthesis

Append lever C (structural dense residual: lm_head + scheduling) findings
and the master RANK + PLAN section to SPEEDUP_HUNT.md. Per-lever scorecard
(gain x tractability x gate), ranked build order, the concrete A build plan
for the hybrid per-head f32/bf16 SSM state cache, and the ordered B/C/D queue
with each one's build trigger.

Verdict: ship the MoE re-graph (patch 0025, measured +1.9-4.4%, both gates
PASSED) now; build A as the lead (only lever ABOVE vLLM on dense, KL-gated,
~430-454 t/s = 103-108% of vLLM); bank B-2/B-3 on MoE; C last (<1% bit-exact,
dead-end); D opt-in-only and dense-only behind the same KL gate bf16-SSM failed.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-06-26 14:56:53 +00:00
parent 4d3fecd524
commit 6bfca146d6

View File

@@ -312,3 +312,256 @@ ships purpose-built Marlin-NvFp4) + the bf16 projections (~10.5%). The recurrenc
PAST vLLM. The single highest-ROI, ship-now item is the re-graph patch (0025).
Assisted-by: Claude:opus-4.8 [Claude Code]
---
## C - STRUCTURAL DENSE RESIDUAL: lm_head + scheduling (label: C-structural-design, READ-ONLY no GPU)
Source-confirmed on DGX `~/llama-paged-dev` @ HEAD `2ee65c2` plus committed traces
(`CRITICALPATH_GAP_ANALYSIS.md`, `A2_CUDAGRAPH_DECODE.md`, `F16_DENSE_RESIDUAL_PROBE.md`,
`OTHER_PATHS_INVESTIGATION.md` sec B). Numbers are dense q36-27b-nvfp4 @npl128: step ~333 ms
(384 t/s), gap to vLLM (419 t/s = 305 ms) is ~27-28 ms/step. **Verdict: lever C is a near
dead-end for a bit-exact dense win; rank it LAST of A/B/C/D for the bit-exact default.**
### How the lm_head is stored, and why it routes to cublas/nvjet (not the tuned FP4 MMQ)
`output.weight` is **GGML_TYPE_BF16** (NOT quantized): the `--tensor-type attn/ffn=nvfp4`
recipe converts only attn+ffn, leaving the logit-sensitive final projection (and tok_embd)
at base BF16. Confirmed: `llama-model.cpp:1460` creates the NVFP4 scale `output_s` ONLY
`if (output->type == GGML_TYPE_NVFP4)`, so for the BF16 head `model.output_s` is null, and
`build_lora_mm` (`llama-graph.cpp:1087`) collapses to a plain `ggml_mul_mat`. In
`ggml_cuda_mul_mat` dispatch (`ggml-cuda.cu:2599-2629`): `use_mul_mat_q`/`use_mul_mat_vec_q`
both require `ggml_is_quantized(src0)` (BF16 fails => the tuned FP4 path is INELIGIBLE);
MMF is gated off for the wide `vocab x 128` shape; `use_batched_cublas_bf16` is true but the
batched branch additionally needs `src1->ne[2]*ne[3] > 1` (the 2D decode lm_head fails it).
Falls through to `ggml_cuda_op_mul_mat_cublas` BF16 branch (`:1662`): downcast F32 act ->
BF16, `cublasGemmEx(16BF x 16BF -> COMPUTE_32F)` = **nvjet_sm121**, output rounded BF16 ->
upcast F32. Shape M=vocab(151936) x N=128 x K=5120: a tall-skinny output GEMM reading the
ENTIRE BF16 head weight for 128 columns = inherently **memory-bound**. On the dense model
this is the ONLY non-FP4 cublas GEMM in decode. Cost: nvjet = 11.91 ms = 3.1-3.6% of step.
**CRITICAL CORRECTION the team must carry:** the baseline is NOT "f32 lm_head". The cublas
BF16 branch downcasts the activation F32->BF16 AND rounds the output to BF16. Today's
"bit-exact reference" logits are ALREADY BF16-precision on both input and output. So
"bit-exact" for lever C only protects BF16-rounded logits, which is exactly why option (c)
is "essentially bit-exact" and why any meaningful lm_head speedup requires changing the dtype.
### lm_head bit-exact lever + gain - bandwidth math kills it
nvjet moves the full BF16 head weight in 11.9-12.2 ms = ~195-199 GB/s = ~72% of GB10's
273 GB/s peak: it is ALREADY one of the most bandwidth-efficient kernels in the step (the
overall decode step runs at only ~40% util / ~110 GB/s). The bit-exact ceiling is the
remaining bandwidth headroom only:
- **(c) keep BF16 weight, swap the kernel** (custom skinny wide-vocab streaming GEMM, or a
hand-picked cublasLt algo/workspace heuristic for the thin-N/huge-M shape). The ONLY
essentially-bit-exact option. Perfect HBM saturation 199 -> 273 GB/s = 11.9 -> ~8.7 ms =
**save ~3 ms = ~0.9-1.0% of step = ~11% of the 27 ms gap.** REALISTIC gain: 0 to 3 ms,
leaning toward 0 - cublasLt already selected nvjet as its best algo, so beating it on a
pure weight-stream is not guaranteed, and it is high kernel-writing effort. (F16 probe
independently estimates the same nvjet recovery as "~5 ms, uncertain - may already run TF32".)
Structural reason it is near-zero: the head must read the entire BF16 weight for 128 columns;
you CANNOT cut those weight bytes without changing the dtype. Bit-exactness and the only real
speedup (fewer weight bytes) are mutually exclusive here.
### lm_head NON-bit-exact options (excluded from any vLLM-parity claim)
- **(a) NVFP4-quantize the head -> tuned FP4 MMQ.** Biggest win, BREAKS bit-exactness.
Weight ~4x fewer bytes (BF16 ~1.5-2.4 GB -> NVFP4 ~0.4-0.6 GB) AND rides the already-tuned
`mul_mat_q<NVFP4>` (patch 0017): memory floor drops ~4x = **save ~8-9 ms = ~2.5% of step**.
BUT NVFP4 < BF16 precision => different logit bits, can flip greedy argmax, AND it is
**UNFAIR vs vLLM** (which keeps its LM head BF16). Same opt-in non-bit-exact bucket as the
shelved bf16-SSM / f16-glue; exclude from parity claims.
- (b) FP8 / Q8_0 head: smaller error than NVFP4 but still != BF16 bits AND not on the tuned
FP4 MMQ path, so it buys less speed than (a). No reason to prefer.
- (existing knob) `GGML_CUDA_FORCE_CUBLAS_COMPUTE_16F` (`ggml-cuda.cu:1610`): 16-bit accumulate
on this exact GEMM, faster but NON-bit-exact (16F vs 32F accumulate). Non-bit-exact track only.
### Scheduling / launch bit-exact lever + gain - ~0.05%
The decode step is GPU-bound at 99.94% (node-level trace, single stream, graphId replayed).
CUDA graphs ALREADY collapse within-step launch latency: exposed idle = 0.225 ms/step = 0.06%,
zero gaps > 5 us, graph ON vs OFF = +0.13% @npl128 (noise). Graphs are NOT a pending dense
lever - they are already in effect. The ONLY graph-non-covered overhead is the BETWEEN-step
host gap: ggml rebuilds the cgraph each step with a NEW `cgraph->uid`, so the uid fast-path in
`ggml_cuda_graph_update_required` never fires and the host re-dispatches ~3100 launches between
graph launches. MEASURED exposed cost: ~0.2 ms/step = ~0.05% (most of the ~2 ms host loop
overlaps GPU compute). **Bit-exact lever:** make the cgraph PERSISTENT/reused across decode
steps so the uid fast-path fires (replay-only => bit-exact). GAIN ~0.2 ms/step = ~0.05%, medium
effort (touches ggml graph lifetime), second-order. No other per-step host overhead is exposed
(the host loop is HIDDEN under GPU compute until the kernels get fast enough to drop GPU-busy
below host time).
### Quantified realistic bit-exact total for lever C
lm_head kernel swap 0 to ~3 ms (upper ~0.9%, realistically ~0) + persistent cgraph ~0.2 ms
(~0.05%) = **combined bit-exact ceiling ~3.2 ms = ~0.95% of the 333 ms step = ~12% of the
27 ms gap.** Moves dense parity 91.8% -> at most ~92.7%, realistically <0.5% net (<1.5 ms).
The "~3-4%" in the brief is the lm_head's TOTAL cost, NOT what is bit-exactly recoverable: only
the bandwidth headroom (~3 ms) and host gap (~0.2 ms) are recoverable; the other ~9 ms is the
irreducible BF16 weight stream BOTH engines pay (vLLM keeps a BF16 head too). **Rank C LAST for
the bit-exact default.** Its one durable note for the team: the lm_head logits are ALREADY
BF16-rounded (not f32), which both narrows what option (c) must preserve and is exactly why the
only meaningful lm_head speedup requires a dtype change (= non-bit-exact + unfair vs vLLM).
Source (DGX @2ee65c2): `llama-model.cpp:1460`, `llama-graph.cpp:1087`, `qwen35.cpp:222` /
`qwen35moe.cpp:246`, `ggml-cuda.cu:2599-2629` / `:1662-1690` / `:1610`.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
# RANK + PLAN - the final synthesis (build order, A handoff, B/C/D queue)
This is the decision section: all four levers measured/designed, ranked by gain x tractability
x gate, the concrete A build plan, and the ordered B/C/D queue with each one's trigger. Base:
clean pin-synced llama.cpp 9d5d882d, bit-exact md5 == 0023. Dense gap to vLLM ~27 ms/step (384
vs 419 t/s @npl128); MoE ~82% (726 vs 882). Recurrence already PAST vLLM (84.6% vs 82.4% peak BW).
## (1) Per-lever scorecard: gain (dense + MoE), tractability, gate
| Lever | Dense decode gain | MoE decode gain | Tractability | Quality gate | Bit-exact? |
|-------|-------------------|-----------------|--------------|--------------|------------|
| **B re-graph (patch 0025)** | ~0 (dense already graphed) | **MEASURED +4.4% npl32 / +2.9% npl64 / +1.9% npl128** (MoE 84%->86% .. 90% of vLLM) | **VERY HIGH - already built+measured**, 1 fn / 1 TU / 9 s build | md5 byte-identical: **PASSED** (MUL_MAT_ID 806/806 + parallel-greedy md5 identical) | YES |
| **A hybrid per-head SSM** | **+25% to +35%/call recurrence -> ~430-454 t/s = 103-108% of vLLM** (ABOVE vLLM) | keeps the +13-25% recurrence share KL-passing; does NOT alone close the MoE GEMM floor | MEDIUM-HIGH - builds on `BF16_SSM_STATE.diff`; biggest new piece = split-dtype cache layout (~150-250 LOC) | **KL<1e-3 + Same-top-p>=99.5% + drift sweep 256/1024/2048/4096 both models**; md5 that T_thresh=inf == f32 baseline | f32 default YES; hybrid is at-or-above vLLM precision, KL-gated |
| **B M1 down_proj retune** | ~0 | bit-exact, bounded (act-quant is ~2% of MoE step) - low single-% | HIGH - block/grid retune of `quantize_mmq_nvfp4`, byte-identical output | md5 byte-identical | YES |
| **B mmq_y-down warp-remap** | small (shared FP4 GEMM) | bit-exact, BW-neutral, predicted BOUNDED on this BW-bound model | LOW-MEDIUM - real kernel change (nwarps x tile_C coupling) | test-backend-ops MUL_MAT_ID + md5 | YES |
| **C lm_head kernel swap** | 0 to ~3 ms (~0.9%, realistically ~0; uncertain it beats nvjet) | ~0 | LOW payoff - high kernel-writing effort, not guaranteed to beat cublasLt | md5 (BF16-rounded logits) | YES (essentially) |
| **C persistent cgraph** | ~0.2 ms (~0.05%) | ~0 (B's re-graph already covers MoE host gap) | MEDIUM - touches ggml graph lifetime, for 0.05% | replay-only = bit-exact, md5 | YES |
| **D f16 glue (Option 2)** | ~11-16 ms = 40-60% of residual -> 91.8% -> ~95-96% (NOT a close) | ~0 (dense-only lever) | LOW-MEDIUM - new norm.cu f16 kernels, multi-file | **NON-bit-exact, must pass the SAME KL<1e-3 that plain bf16-SSM FAILED** | NO - opt-in only |
Notes that decide the ranking:
- **B's re-graph helps ONLY MoE** (dense decode is already graphed; the disable is the MoE
MUL_MAT_ID `ne[2]>8` over-guard). It is the single highest-ROI item because it is already
built, measured, and gated - zero remaining build risk, just a default flip.
- **A is the only lever that moves dense ABOVE vLLM** (103-108%) and it does it at-or-above
vLLM precision (vLLM keeps ALL temporal state f32; A keeps f32 on exactly the unsafe heads).
It reaches the largest mass (recurrence = 49.3% dense / ~48% MoE = ~6x what D can touch).
- **C and D are dead-or-tiny for the bit-exact default.** C's bit-exact ceiling is <1% with
real risk; D is non-bit-exact, dense-only, and tops out at ~96% parity (not a close).
## (2) Ranked build order (gain x tractability x gate) - A confirmed as the build lead
1. **B re-graph (patch 0025) - LAND NOW.** Already built + measured + both gates PASSED. The
only remaining decision is flipping the default from env-gated (`LLAMA_MOE_FORCE_GRAPHS`) to
`should_use_mmq`-gated default-ON. Zero new build, measured +1.9-4.4% MoE, bit-exact. This
is not a "build" so much as a "ship"; it precedes A because it is free and de-risked.
2. **A hybrid per-head SSM - THE BUILD LEAD (user-greenlit, CONFIRMED by evidence).** The only
lever that takes dense ABOVE vLLM and the only principled fix for the bf16-SSM KL failure.
Largest reachable mass, bounded build on an existing diff, KL-gated. Build plan in (3).
3. **B M1 down_proj act-quant retune** - cheap bit-exact bank-shot, run after A while the GPU
is warm. Bounded (~2% act-quant tax), byte-identical-output retune.
4. **B mmq_y-down warp-remap** - only if 1+2+3 leave MoE short of target; real kernel work,
predicted bounded on this BW-bound model.
5. **C persistent cgraph** - a bit-exact ~0.05% micro-win for the default; build only if a
broad graph-lifetime refactor is happening anyway (not worth a standalone effort).
6. **C lm_head BF16 kernel swap** - near-zero, uncertain, high effort. Effectively shelved.
7. **D f16 glue (Option 2 norm.cu kernels)** - LAST, opt-in only, non-bit-exact, dense-only,
gated by the same KL threshold bf16-SSM failed. Build only if the last ~4% dense is chased
AFTER A lands and is shown insufficient. Skip Option 1 entirely (cast overhead eats the win).
**Why A over B as the lead, despite B's re-graph being measured:** B's re-graph is already
DONE - it is a ship, not a build. For the NEW build effort, A is correctly the lead: it is the
only lever with a path ABOVE vLLM on dense, it attacks the largest mass (recurrence, shared by
both models), and it converts the already-proven whole-bf16 win (490 t/s = 125% vLLM, but KL
FAIL) into a KL-passing form. B's remaining items (M1, mmq_y) are bounded single-% bank-shots
that cannot reach parity on their own (the residual MoE gap is the FP4 grouped GEMM at the
LPDDR5x BW floor + bf16 projections, both structural). So: ship 0025, then build A, then bank B.
## (3) CONCRETE A BUILD PLAN (hand to the build agent)
**Objective:** a per-head mixed-dtype SSM state cache - f32 on long-memory heads, bf16 on
fast-decaying heads - that captures 50-70% of the whole-bf16 recurrence win (-25% to -35%/call)
while PASSING KL<1e-3. Builds directly on the existing `BF16_SSM_STATE.diff` (untracked backup
on DGX `~/llama-paged-dev`). Target dense ~430-454 t/s (103-108% of vLLM 419), MoE +13-25%
recurrence share KL-passing. f32 default stays bit-exact (md5 == 0023 baseline).
**Reuse VERBATIM from BF16_SSM_STATE.diff** (do NOT rewrite): `gdn_state_t<STATE_BF16>` alias,
templated `__bfloat162float` load / `__float2bfloat16` store, the gather template, the dtype-
detect dispatcher, `type_s`/`type_r` cparam wiring, the CPU mirror, the back-compat row convert,
the bf16 fill path, and the test-backend-ops bf16 cases.
**NEW work items (in build order):**
1. **Head classifier (~80-150 LOC, do first, no GPU).** Host function over `ssm_a` (tensor
`SSM_A_NOSCAN`, `[n_v_heads]`, = `-exp(A_log)`) and `ssm_dt` (tensor `SSM_DT`, `[n_v_heads]`):
for each (layer il, head h) compute `tau_h = 1 / (|ssm_a[il][h]| * softplus(ssm_dt[il][h]))`;
set `head_is_bf16[il][h] = (tau_h <= T_thresh)`. Emit per-layer `n_f32`/`n_bf16` counts +
the `head_slot[il][h] = {is_bf16, local_idx}` map. Add cparam `ssm_hybrid_tau_thresh` / CLI
`--ssm-bf16-tau` (inf => all-f32 bit-exact default; 0 => all-bf16; hybrid band in between).
Runs in microseconds at load, no data, no GPU. (Optional Tier-2: a short calibration pass
measuring per-head time-mean of actual `exp(g[h,t])` -> model-hash sidecar; only if Tier 1
lands just above the gate.)
2. **Split-dtype cache layout (~150-250 LOC - THE BIGGEST piece).** In
`llama-memory-recurrent.cpp`: replace the single `s_l` ([S_v,S_v,H,slots] f32) with two
dtype-homogeneous sub-caches sized by per-layer head COUNT (this is what saves the bytes):
`s_l_f32 [S_v*S_v*n_f32, slots]` f32 + `s_l_bf16 [S_v*S_v*n_bf16, slots]` bf16. In
`build_rs` (`delta-net-base.cpp`): build the two views + pass the `head_slot` map; split the
`n_embd_s` accessors. q/k/v/g/beta KEEP natural head order (no activation permute - they come
from the projection GEMMs). Coarser per-LAYER fallback is REJECTED (long-memory heads span
most layers => too coarse; per-head is the right granularity).
3. **Recurrence kernel: single launch, runtime per-head branch (~120-200 LOC).** Pass BOTH
bases (`const float* s_f32_base`, `const nv_bfloat16* s_bf16_base`) + the two `state_dst`
partition views + the device `head_slot[]` map. Branch on `head_slot[h_idx].is_bf16` at the
load site, the in-place store site, the gather, and the dispatcher. The branch is UNIFORM
within a block (all threads share `h_idx` = `blockIdx.x`) => **NO warp divergence**. The
recurrence math (the ~140-260 region) stays byte-for-byte f32-register, untouched. `keep_rs_t`
snapshots stay f32 (op-output scratch). The `STATE_BF16` template stays as the all-bf16
special case.
4. **ids / in-place per-head.** `state_dst` becomes two partition views; `gdn_gather_nonident`
becomes per-head dtype-aware (copies each head's `S_v*S_v` block from the right partition of
`cache[ids[s]]`; still disjoint-scratch race-free). Each head writes its own partition slot
(read==write slot, loaded to registers before store) => the identity / in-place property is
preserved.
5. **CPU mirror (ops.cpp)** per-head dtype branch for CI / CPU-offload parity.
6. **test-backend-ops: a MIXED-dtype-state GATED_DELTA_NET case** (some heads f32, some bf16)
vs the CPU ref, covering decode + multi-token prefill + `keep_rs_t` (this is the R2
silent-corruption net - do NOT skip it).
7. **Gate (GPU, GateBench harness, already built).** Sweep `T_thresh` to find the MINIMUM f32
fraction that passes: noise floor first, then the 256-tok KL gate, then the long-context
drift sweep 256/1024/2048/4096, BOTH models (dense q36-27b + MoE q36-35b-a3b). Pass bar =
**KL<1e-3 AND Same-top-p>=99.5% AND drift bounded**. nsys per-call confirms `f_bytes` =
`(n_f32 + n_bf16/2)/H` dropped. md5 that `T_thresh=inf` reproduces the f32 baseline (the
bit-exact opt-out MUST be preserved).
**Expected result (from the physics + the whole-bf16 measurement):** KLD contribution per head
~ `(eps*tau_h)^2` (eps~2^-8~3.9e-3) is dominated by the top-tau heads, so removing the top
~25-40% by tau cuts MeanKLD by 1-2 orders. Design band **f32 fraction f in [0.30, 0.50]**:
- f=0.30 (n_bf16/H=0.70): `f_bytes`=0.65 -> ~2.20 ms/call (-35%), captures ~70% of the bf16
win -> dense **~454 t/s = ~108% of vLLM** (gate-likely, MeanKLD ~1e-3..1e-2).
- f=0.50: `f_bytes`=0.75 -> ~2.54 ms/call (-25%), captures ~50% -> dense **~430 t/s = ~103% of
vLLM** (most robust pass; strict KL<1e-3 may need this fraction).
The exact f is found by the T_thresh sweep. **MoE:** A keeps the +13-25% recurrence share
KL-passing but does NOT by itself close the MoE GEMM gap (that is B). Joint ship gate = nsys
per-call bytes down AND KL<1e-3 for BOTH models; neither alone ships. Hybrid is STRICTLY safer
than vLLM (we keep f32 exactly where bf16 is unsafe; vLLM keeps all-f32 everywhere).
## (4) Ordered B / C / D queue with build triggers
- **B-1 re-graph default flip (patch 0025): trigger = NOW / immediate.** Already built, measured
(+1.9-4.4% MoE), both gates PASSED. Flip env-gated -> `should_use_mmq`-gated default-ON. No
dependency on A. Ship first.
- **B-2 down_proj act-quant retune (M1): trigger = after A's kernel work lands** (reuse the warm
GPU window). Bit-exact block/grid retune of `quantize_mmq_nvfp4`, byte-identical output.
Bounded ~1% (act-quant is ~2% of the MoE step). Run it; it is cheap.
- **B-3 mmq_y-down warp-remap: trigger = ONLY if B-1 + B-2 + A leave MoE below the target.**
Real kernel change, BW-neutral, predicted bounded on this BW-bound model. Speculative; gate by
test-backend-ops MUL_MAT_ID + md5.
- **C-1 persistent cgraph: trigger = ONLY if a broader ggml graph-lifetime refactor is already
in flight.** Standalone it is ~0.05%, not worth the graph-lifetime touch. Bit-exact (replay).
- **C-2 lm_head BF16 kernel swap: trigger = effectively NEVER for the default** (0 to ~3 ms,
uncertain it beats nvjet, high effort). Documented; not queued.
- **D Option 2 f16-glue norm.cu kernels: trigger = ONLY if dense parity is still wanted AFTER A
lands AND A is shown insufficient, AND an opt-in non-bit-exact mode is acceptable.** Multi-file,
recovers ~11 ms (norm/elementwise band), gated by the SAME KL<1e-3 that plain bf16-SSM failed.
Skip Option 1 (net-zero cast overhead). Lowest priority of all.
**Bottom line:** ship 0025 now (free, measured MoE +1.9-4.4%), then build A (the only path
ABOVE vLLM on dense, KL-gated, ~430-454 t/s = 103-108% of vLLM), then bank B-2/B-3 on MoE. C is
last for the bit-exact default (<1%, dead-end); D is opt-in-only and dense-only, behind the KL
gate, only if the last ~4% is ever chased. The recurrence is already PAST vLLM; A converts that
proven win into a KL-passing form, and the MoE GEMM floor (the structural residual) is the one
piece no bit-exact lever fully closes - vLLM ships purpose-built Marlin-NvFp4 there.
Assisted-by: Claude:opus-4.8 [Claude Code]