docs(paged): speedup-hunt C section + final RANK + PLAN synthesis

Append lever C (structural dense residual: lm_head + scheduling) findings and the master RANK + PLAN section to SPEEDUP_HUNT.md. Per-lever scorecard (gain x tractability x gate), ranked build order, the concrete A build plan for the hybrid per-head f32/bf16 SSM state cache, and the ordered B/C/D queue with each one's build trigger. Verdict: ship the MoE re-graph (patch 0025, measured +1.9-4.4%, both gates PASSED) now; build A as the lead (only lever ABOVE vLLM on dense, KL-gated, ~430-454 t/s = 103-108% of vLLM); bank B-2/B-3 on MoE; C last (<1% bit-exact, dead-end); D opt-in-only and dense-only behind the same KL gate bf16-SSM failed.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 17:37:07 -04:00 · 2026-06-26 14:56:53 +00:00
parent 4d3fecd524
commit 6bfca146d6
1 changed files with 253 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/SPEEDUP_HUNT.md
+++ b/backend/cpp/llama-cpp/patches/paged/SPEEDUP_HUNT.md
@@ -312,3 +312,256 @@ ships purpose-built Marlin-NvFp4) + the bf16 projections (~10.5%). The recurrenc
 PAST vLLM. The single highest-ROI, ship-now item is the re-graph patch (0025).

 Assisted-by: Claude:opus-4.8 [Claude Code]
+
+---
+
+## C - STRUCTURAL DENSE RESIDUAL: lm_head + scheduling (label: C-structural-design, READ-ONLY no GPU)
+
+Source-confirmed on DGX `~/llama-paged-dev` @ HEAD `2ee65c2` plus committed traces
+(`CRITICALPATH_GAP_ANALYSIS.md`, `A2_CUDAGRAPH_DECODE.md`, `F16_DENSE_RESIDUAL_PROBE.md`,
+`OTHER_PATHS_INVESTIGATION.md` sec B). Numbers are dense q36-27b-nvfp4 @npl128: step ~333 ms
+(384 t/s), gap to vLLM (419 t/s = 305 ms) is ~27-28 ms/step. **Verdict: lever C is a near
+dead-end for a bit-exact dense win; rank it LAST of A/B/C/D for the bit-exact default.**
+
+### How the lm_head is stored, and why it routes to cublas/nvjet (not the tuned FP4 MMQ)
+
+`output.weight` is **GGML_TYPE_BF16** (NOT quantized): the `--tensor-type attn/ffn=nvfp4`
+recipe converts only attn+ffn, leaving the logit-sensitive final projection (and tok_embd)
+at base BF16. Confirmed: `llama-model.cpp:1460` creates the NVFP4 scale `output_s` ONLY
+`if (output->type == GGML_TYPE_NVFP4)`, so for the BF16 head `model.output_s` is null, and
+`build_lora_mm` (`llama-graph.cpp:1087`) collapses to a plain `ggml_mul_mat`. In
+`ggml_cuda_mul_mat` dispatch (`ggml-cuda.cu:2599-2629`): `use_mul_mat_q`/`use_mul_mat_vec_q`
+both require `ggml_is_quantized(src0)` (BF16 fails => the tuned FP4 path is INELIGIBLE);
+MMF is gated off for the wide `vocab x 128` shape; `use_batched_cublas_bf16` is true but the
+batched branch additionally needs `src1->ne[2]*ne[3] > 1` (the 2D decode lm_head fails it).
+Falls through to `ggml_cuda_op_mul_mat_cublas` BF16 branch (`:1662`): downcast F32 act ->
+BF16, `cublasGemmEx(16BF x 16BF -> COMPUTE_32F)` = **nvjet_sm121**, output rounded BF16 ->
+upcast F32. Shape M=vocab(151936) x N=128 x K=5120: a tall-skinny output GEMM reading the
+ENTIRE BF16 head weight for 128 columns = inherently **memory-bound**. On the dense model
+this is the ONLY non-FP4 cublas GEMM in decode. Cost: nvjet = 11.91 ms = 3.1-3.6% of step.
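The fall-through described above can be summarised as a small decision function. This is an illustrative Python sketch, not the ggml source: `pick_mul_mat_path` and its parameters are hypothetical names standing in for the checks named in the text (`ggml_is_quantized(src0)`, the MMF shape gate, and the `src1->ne[2]*ne[3] > 1` batched condition), and the order of the checks is an assumption.

```python
def pick_mul_mat_path(src0_is_quantized, mmf_shape_ok, src0_is_bf16, src1_ne2, src1_ne3):
    """Return the name of the path a mat-mul would take under the rules described above."""
    # Tuned quantized kernels (MMQ / MMVQ) require a quantized weight.
    if src0_is_quantized:
        return "mmq"
    # MMF is gated off for the wide vocab x 128 shape.
    if mmf_shape_ok:
        return "mmf"
    # Batched cuBLAS additionally needs more than one matrix in the batch dims.
    if src0_is_bf16 and src1_ne2 * src1_ne3 > 1:
        return "batched_cublas"
    # Everything else falls through to the plain cuBLAS branch.
    return "cublas"


# The dense decode lm_head: BF16 weight, wide shape, 2D activation.
lm_head_path = pick_mul_mat_path(False, False, True, 1, 1)
```

Under these assumptions the BF16 head fails every earlier gate and lands on the plain cuBLAS branch, while a quantized weight would have been caught by the first check.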
+
+**CRITICAL CORRECTION the team must carry:** the baseline is NOT "f32 lm_head". The cublas
+BF16 branch downcasts the activation F32->BF16 AND rounds the output to BF16. Today's
+"bit-exact reference" logits are ALREADY BF16-precision on both input and output. So
+"bit-exact" for lever C only protects BF16-rounded logits, which is exactly why option (c)
+is "essentially bit-exact" and why any meaningful lm_head speedup requires changing the dtype.
+
+### lm_head bit-exact lever + gain - bandwidth math kills it
+
+nvjet moves the full BF16 head weight in 11.9-12.2 ms = ~195-199 GB/s = ~72% of GB10's
+273 GB/s peak: it is ALREADY one of the most bandwidth-efficient kernels in the step (the
+overall decode step runs at only ~40% util / ~110 GB/s). The bit-exact ceiling is the
+remaining bandwidth headroom only:
+- **(c) keep BF16 weight, swap the kernel** (custom skinny wide-vocab streaming GEMM, or a
+  hand-picked cublasLt algo/workspace heuristic for the thin-N/huge-M shape). The ONLY
+  essentially-bit-exact option. Perfect memory-bandwidth saturation 199 -> 273 GB/s = 11.9 -> ~8.7 ms =
+  **save ~3 ms = ~0.9-1.0% of step = ~11% of the 27 ms gap.** REALISTIC gain: 0 to 3 ms,
+  leaning toward 0 - cublasLt already selected nvjet as its best algo, so beating it on a
+  pure weight-stream is not guaranteed, and it is high kernel-writing effort. (F16 probe
+  independently estimates the same nvjet recovery as "~5 ms, uncertain - may already run TF32".)
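The ceiling for option (c) follows from the figures quoted above. A minimal sketch of that arithmetic, assuming the kernel is purely bandwidth-bound so that time scales inversely with achieved bandwidth; `kernel_swap_ceiling` is a hypothetical helper and the inputs are the rounded numbers from the text.

```python
def kernel_swap_ceiling(t_now_ms, bw_now_gbs, bw_peak_gbs):
    """Best-case time if the same bytes streamed at peak bandwidth."""
    t_best_ms = t_now_ms * bw_now_gbs / bw_peak_gbs
    return t_best_ms, t_now_ms - t_best_ms


# nvjet today: 11.9 ms at ~199 GB/s; GB10 peak: 273 GB/s.
t_best, saved = kernel_swap_ceiling(11.9, 199.0, 273.0)
share_of_step = saved / 333.0   # fraction of the ~333 ms dense step
share_of_gap = saved / 27.0     # fraction of the ~27 ms gap to vLLM
```

The result is an upper bound only: it assumes a replacement kernel reaches 100% of peak bandwidth, which the text already flags as unlikely.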
+
+Structural reason it is near-zero: the head must read the entire BF16 weight for 128 columns;
+you CANNOT cut those weight bytes without changing the dtype. Bit-exactness and the only real
+speedup (fewer weight bytes) are mutually exclusive here.
+
+### lm_head NON-bit-exact options (excluded from any vLLM-parity claim)
+
+- **(a) NVFP4-quantize the head -> tuned FP4 MMQ.** Biggest win, BREAKS bit-exactness.
+  Weight ~4x fewer bytes (BF16 ~1.5-2.4 GB -> NVFP4 ~0.4-0.6 GB) AND rides the already-tuned
+  `mul_mat_q<NVFP4>` (patch 0017): memory floor drops ~4x = **save ~8-9 ms = ~2.5% of step**.
+  BUT NVFP4 < BF16 precision => different logit bits, can flip greedy argmax, AND it is
+  **UNFAIR vs vLLM** (which keeps its LM head BF16). Same opt-in non-bit-exact bucket as the
+  shelved bf16-SSM / f16-glue; exclude from parity claims.
+- (b) FP8 / Q8_0 head: smaller error than NVFP4 but still != BF16 bits AND not on the tuned
+  FP4 MMQ path, so it buys less speed than (a). No reason to prefer.
+- (existing knob) `GGML_CUDA_FORCE_CUBLAS_COMPUTE_16F` (`ggml-cuda.cu:1610`): 16-bit accumulate
+  on this exact GEMM, faster but NON-bit-exact (16F vs 32F accumulate). Non-bit-exact track only.
+
+### Scheduling / launch bit-exact lever + gain - ~0.05%
+
+The decode step is GPU-bound at 99.94% (node-level trace, single stream, graphId replayed).
+CUDA graphs ALREADY collapse within-step launch latency: exposed idle = 0.225 ms/step = 0.06%,
+zero gaps > 5 us, graph ON vs OFF = +0.13% @npl128 (noise). Graphs are NOT a pending dense
+lever - they are already in effect. The ONLY graph-non-covered overhead is the BETWEEN-step
+host gap: ggml rebuilds the cgraph each step with a NEW `cgraph->uid`, so the uid fast-path in
+`ggml_cuda_graph_update_required` never fires and the host re-dispatches ~3100 launches between
+graph launches. MEASURED exposed cost: ~0.2 ms/step = ~0.05% (most of the ~2 ms host loop
+overlaps GPU compute). **Bit-exact lever:** make the cgraph PERSISTENT/reused across decode
+steps so the uid fast-path fires (replay-only => bit-exact). GAIN ~0.2 ms/step = ~0.05%, medium
+effort (touches ggml graph lifetime), second-order. No other per-step host overhead is exposed
+(the host loop is HIDDEN under GPU compute until the kernels get fast enough to drop GPU-busy
+below host time).
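The difference between a rebuilt and a persistent cgraph can be illustrated with a toy model. This sketch does not reproduce `ggml_cuda_graph_update_required`; `GraphCache` and `run` are hypothetical names, and the only behaviour modelled is "replay when the uid matches the previous step, re-dispatch otherwise".

```python
import itertools


class GraphCache:
    """Toy model: re-dispatch only when the graph uid changes."""

    def __init__(self):
        self.last_uid = None
        self.redispatches = 0
        self.replays = 0

    def step(self, uid):
        if uid == self.last_uid:
            self.replays += 1        # fast path: replay the captured graph
        else:
            self.redispatches += 1   # slow path: host walks every node again
            self.last_uid = uid


def run(steps, persistent):
    cache = GraphCache()
    fresh = itertools.count(1)
    for _ in range(steps):
        # A rebuilt graph gets a new uid every step; a persistent one keeps uid 0.
        cache.step(0 if persistent else next(fresh))
    return cache.redispatches, cache.replays


rebuilt = run(100, persistent=False)
reused = run(100, persistent=True)
```

In this model the rebuilt graph never hits the fast path, while the persistent graph pays the slow path once and replays thereafter. Because replay executes the same captured work, the change stays bit-exact.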
+
+### Quantified realistic bit-exact total for lever C
+
+lm_head kernel swap 0 to ~3 ms (upper ~0.9%, realistically ~0) + persistent cgraph ~0.2 ms
+(~0.05%) = **combined bit-exact ceiling ~3.2 ms = ~0.95% of the 333 ms step = ~12% of the
+27 ms gap.** Moves dense parity 91.8% -> at most ~92.7%, realistically <0.5% net (<1.5 ms).
+The "~3-4%" in the brief is the lm_head's TOTAL cost, NOT what is bit-exactly recoverable: only
+the bandwidth headroom (~3 ms) and host gap (~0.2 ms) are recoverable; the other ~9 ms is the
+irreducible BF16 weight stream BOTH engines pay (vLLM keeps a BF16 head too). **Rank C LAST for
+the bit-exact default.** Its one durable note for the team: the lm_head logits are ALREADY
+BF16-rounded (not f32), which both narrows what option (c) must preserve and is exactly why the
+only meaningful lm_head speedup requires a dtype change (= non-bit-exact + unfair vs vLLM).
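The parity movement quoted above can be checked from the step time. A minimal sketch, assuming throughput is `npl / step_time` (128 sequences, one token each per step) and using the rounded inputs from this section; `parity` is a hypothetical helper.

```python
def parity(step_ms, saved_ms, npl=128, vllm_tps=419.0):
    """Throughput relative to vLLM after removing saved_ms from each step."""
    tps = npl * 1000.0 / (step_ms - saved_ms)
    return tps / vllm_tps


before = parity(333.0, 0.0)
ceiling = parity(333.0, 3.0 + 0.2)   # kernel-swap upper bound + persistent cgraph
```

With rounded inputs both values land within about 0.1 percentage point of the quoted 91.8% and ~92.7%, and the difference between them stays below one point.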
+
+Source (DGX @2ee65c2): `llama-model.cpp:1460`, `llama-graph.cpp:1087`, `qwen35.cpp:222` /
+`qwen35moe.cpp:246`, `ggml-cuda.cu:2599-2629` / `:1662-1690` / `:1610`.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+
+---
+
+# RANK + PLAN - the final synthesis (build order, A handoff, B/C/D queue)
+
+This is the decision section: all four levers measured/designed, ranked by gain x tractability
+x gate, the concrete A build plan, and the ordered B/C/D queue with each one's trigger. Base:
+clean pin-synced llama.cpp 9d5d882d, bit-exact md5 == 0023. Dense gap to vLLM ~27 ms/step (384
+vs 419 t/s @npl128); MoE ~82% (726 vs 882). Recurrence already PAST vLLM (84.6% vs 82.4% peak BW).
+
+## (1) Per-lever scorecard: gain (dense + MoE), tractability, gate
+
+| Lever | Dense decode gain | MoE decode gain | Tractability | Quality gate | Bit-exact? |
+|-------|-------------------|-----------------|--------------|--------------|------------|
+| **B re-graph (patch 0025)** | ~0 (dense already graphed) | **MEASURED +4.4% npl32 / +2.9% npl64 / +1.9% npl128** (MoE 84%->86% .. 90% of vLLM) | **VERY HIGH - already built+measured**, 1 fn / 1 TU / 9 s build | md5 byte-identical: **PASSED** (MUL_MAT_ID 806/806 + parallel-greedy md5 identical) | YES |
+| **A hybrid per-head SSM** | **+25% to +35%/call recurrence -> ~430-454 t/s = 103-108% of vLLM** (ABOVE vLLM) | keeps the +13-25% recurrence share KL-passing; does NOT alone close the MoE GEMM floor | MEDIUM-HIGH - builds on `BF16_SSM_STATE.diff`; biggest new piece = split-dtype cache layout (~150-250 LOC) | **KL<1e-3 + Same-top-p>=99.5% + drift sweep 256/1024/2048/4096 both models**; md5 check that T_thresh=0 (all-f32) == f32 baseline | f32 default YES; hybrid is at-or-above vLLM precision, KL-gated |
+| **B M1 down_proj retune** | ~0 | bit-exact, bounded (act-quant is ~2% of MoE step) - low single-% | HIGH - block/grid retune of `quantize_mmq_nvfp4`, byte-identical output | md5 byte-identical | YES |
+| **B mmq_y-down warp-remap** | small (shared FP4 GEMM) | bit-exact, BW-neutral, predicted BOUNDED on this BW-bound model | LOW-MEDIUM - real kernel change (nwarps x tile_C coupling) | test-backend-ops MUL_MAT_ID + md5 | YES |
+| **C lm_head kernel swap** | 0 to ~3 ms (~0.9%, realistically ~0; uncertain it beats nvjet) | ~0 | LOW payoff - high kernel-writing effort, not guaranteed to beat cublasLt | md5 (BF16-rounded logits) | YES (essentially) |
+| **C persistent cgraph** | ~0.2 ms (~0.05%) | ~0 (B's re-graph already covers MoE host gap) | MEDIUM - touches ggml graph lifetime, for 0.05% | replay-only = bit-exact, md5 | YES |
+| **D f16 glue (Option 2)** | ~11-16 ms = 40-60% of residual -> 91.8% -> ~95-96% (NOT a close) | ~0 (dense-only lever) | LOW-MEDIUM - new norm.cu f16 kernels, multi-file | **NON-bit-exact, must pass the SAME KL<1e-3 that plain bf16-SSM FAILED** | NO - opt-in only |
+
+Notes that decide the ranking:
+- **B's re-graph helps ONLY MoE** (dense decode is already graphed; the disable is the MoE
+  MUL_MAT_ID `ne[2]>8` over-guard). It is the single highest-ROI item because it is already
+  built, measured, and gated - zero remaining build risk, just a default flip.
+- **A is the only lever that moves dense ABOVE vLLM** (103-108%) and it does it at-or-above
+  vLLM precision (vLLM keeps ALL temporal state f32; A keeps f32 on exactly the unsafe heads).
+  It reaches the largest mass (recurrence = 49.3% dense / ~48% MoE = ~6x what D can touch).
+- **C and D are dead-or-tiny for the bit-exact default.** C's bit-exact ceiling is <1% with
+  real risk; D is non-bit-exact, dense-only, and tops out at ~96% parity (not a close).
+
+## (2) Ranked build order (gain x tractability x gate) - A confirmed as the build lead
+
+1. **B re-graph (patch 0025) - LAND NOW.** Already built + measured + both gates PASSED. The
+   only remaining decision is flipping the default from env-gated (`LLAMA_MOE_FORCE_GRAPHS`) to
+   `should_use_mmq`-gated default-ON. Zero new build, measured +1.9-4.4% MoE, bit-exact. This
+   is not a "build" so much as a "ship"; it precedes A because it is free and de-risked.
+2. **A hybrid per-head SSM - THE BUILD LEAD (user-greenlit, CONFIRMED by evidence).** The only
+   lever that takes dense ABOVE vLLM and the only principled fix for the bf16-SSM KL failure.
+   Largest reachable mass, bounded build on an existing diff, KL-gated. Build plan in (3).
+3. **B M1 down_proj act-quant retune** - cheap bit-exact bank-shot, run after A while the GPU
+   is warm. Bounded (~2% act-quant tax), byte-identical-output retune.
+4. **B mmq_y-down warp-remap** - only if 1+2+3 leave MoE short of target; real kernel work,
+   predicted bounded on this BW-bound model.
+5. **C persistent cgraph** - a bit-exact ~0.05% micro-win for the default; build only if a
+   broad graph-lifetime refactor is happening anyway (not worth a standalone effort).
+6. **C lm_head BF16 kernel swap** - near-zero, uncertain, high effort. Effectively shelved.
+7. **D f16 glue (Option 2 norm.cu kernels)** - LAST, opt-in only, non-bit-exact, dense-only,
+   gated by the same KL threshold bf16-SSM failed. Build only if the last ~4% dense is chased
+   AFTER A lands and is shown insufficient. Skip Option 1 entirely (cast overhead eats the win).
+
+**Why A over B as the lead, despite B's re-graph being measured:** B's re-graph is already
+DONE - it is a ship, not a build. For the NEW build effort, A is correctly the lead: it is the
+only lever with a path ABOVE vLLM on dense, it attacks the largest mass (recurrence, shared by
+both models), and it converts the already-proven whole-bf16 win (490 t/s = ~117% of vLLM's 419 t/s, but KL
+FAIL) into a KL-passing form. B's remaining items (M1, mmq_y) are bounded single-% bank-shots
+that cannot reach parity on their own (the residual MoE gap is the FP4 grouped GEMM at the
+LPDDR5x BW floor + bf16 projections, both structural). So: ship 0025, then build A, then bank B.
+
+## (3) CONCRETE A BUILD PLAN (hand to the build agent)
+
+**Objective:** a per-head mixed-dtype SSM state cache - f32 on long-memory heads, bf16 on
+fast-decaying heads - that captures 50-70% of the whole-bf16 recurrence win (-25% to -35%/call)
+while PASSING KL<1e-3. Builds directly on the existing `BF16_SSM_STATE.diff` (untracked backup
+on DGX `~/llama-paged-dev`). Target dense ~430-454 t/s (103-108% of vLLM 419), MoE +13-25%
+recurrence share KL-passing. f32 default stays bit-exact (md5 == 0023 baseline).
+
+**Reuse VERBATIM from BF16_SSM_STATE.diff** (do NOT rewrite): `gdn_state_t<STATE_BF16>` alias,
+templated `__bfloat162float` load / `__float2bfloat16` store, the gather template, the dtype-
+detect dispatcher, `type_s`/`type_r` cparam wiring, the CPU mirror, the back-compat row convert,
+the bf16 fill path, and the test-backend-ops bf16 cases.
+
+**NEW work items (in build order):**
+
+1. **Head classifier (~80-150 LOC, do first, no GPU).** Host function over `ssm_a` (tensor
+   `SSM_A_NOSCAN`, `[n_v_heads]`, = `-exp(A_log)`) and `ssm_dt` (tensor `SSM_DT`, `[n_v_heads]`):
+   for each (layer il, head h) compute `tau_h = 1 / (|ssm_a[il][h]| * softplus(ssm_dt[il][h]))`;
+   set `head_is_bf16[il][h] = (tau_h <= T_thresh)`. Emit per-layer `n_f32`/`n_bf16` counts +
+   the `head_slot[il][h] = {is_bf16, local_idx}` map. Add cparam `ssm_hybrid_tau_thresh` / CLI
+   `--ssm-bf16-tau` (0 => all-f32 bit-exact default, since no head satisfies `tau_h <= T_thresh`;
+   inf => all-bf16; hybrid band in between).
+   Runs in microseconds at load, no data, no GPU. (Optional Tier-2: a short calibration pass
+   measuring per-head time-mean of actual `exp(g[h,t])` -> model-hash sidecar; only if Tier 1
+   lands just above the gate.)
+2. **Split-dtype cache layout (~150-250 LOC - THE BIGGEST piece).** In
+   `llama-memory-recurrent.cpp`: replace the single `s_l` ([S_v,S_v,H,slots] f32) with two
+   dtype-homogeneous sub-caches sized by per-layer head COUNT (this is what saves the bytes):
+   `s_l_f32 [S_v*S_v*n_f32, slots]` f32 + `s_l_bf16 [S_v*S_v*n_bf16, slots]` bf16. In
+   `build_rs` (`delta-net-base.cpp`): build the two views + pass the `head_slot` map; split the
+   `n_embd_s` accessors. q/k/v/g/beta KEEP natural head order (no activation permute - they come
+   from the projection GEMMs). Coarser per-LAYER fallback is REJECTED (long-memory heads span
+   most layers => too coarse; per-head is the right granularity).
+3. **Recurrence kernel: single launch, runtime per-head branch (~120-200 LOC).** Pass BOTH
+   bases (`const float* s_f32_base`, `const nv_bfloat16* s_bf16_base`) + the two `state_dst`
+   partition views + the device `head_slot[]` map. Branch on `head_slot[h_idx].is_bf16` at the
+   load site, the in-place store site, the gather, and the dispatcher. The branch is UNIFORM
+   within a block (all threads share `h_idx` = `blockIdx.x`) => **NO warp divergence**. The
+   recurrence math (the ~140-260 region) stays byte-for-byte f32-register, untouched. `keep_rs_t`
+   snapshots stay f32 (op-output scratch). The `STATE_BF16` template stays as the all-bf16
+   special case.
+4. **ids / in-place per-head.** `state_dst` becomes two partition views; `gdn_gather_nonident`
+   becomes per-head dtype-aware (copies each head's `S_v*S_v` block from the right partition of
+   `cache[ids[s]]`; still disjoint-scratch race-free). Each head writes its own partition slot
+   (read==write slot, loaded to registers before store) => the identity / in-place property is
+   preserved.
+5. **CPU mirror (ops.cpp)** per-head dtype branch for CI / CPU-offload parity.
+6. **test-backend-ops: a MIXED-dtype-state GATED_DELTA_NET case** (some heads f32, some bf16)
+   vs the CPU ref, covering decode + multi-token prefill + `keep_rs_t` (this is the R2
+   silent-corruption net - do NOT skip it).
+7. **Gate (GPU, GateBench harness, already built).** Sweep `T_thresh` to find the MINIMUM f32
+   fraction that passes: noise floor first, then the 256-tok KL gate, then the long-context
+   drift sweep 256/1024/2048/4096, BOTH models (dense q36-27b + MoE q36-35b-a3b). Pass bar =
+   **KL<1e-3 AND Same-top-p>=99.5% AND drift bounded**. nsys per-call confirms `f_bytes` =
+   `(n_f32 + n_bf16/2)/H` dropped. md5 check that `T_thresh=0` (no head satisfies
+   `tau_h <= T_thresh`, so all-f32) reproduces the f32 baseline (the bit-exact opt-out MUST be
+   preserved).
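Work item 1 and the `head_slot` map consumed by items 2-4 can be sketched on the host side as follows. This is a Python illustration of the stated formula, not the planned C++ implementation: `classify_layer` and `softplus` are hypothetical names, the per-head inputs are made-up values, and the layout (a per-partition running `local_idx`) is one assumed way to realise the `{is_bf16, local_idx}` map.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def classify_layer(ssm_a, ssm_dt, t_thresh):
    """Per-head slots for one layer: (is_bf16, local_idx) plus partition counts."""
    head_slot, n_f32, n_bf16 = [], 0, 0
    for a, dt in zip(ssm_a, ssm_dt):
        tau = 1.0 / (abs(a) * softplus(dt))   # decay time constant of this head
        if tau <= t_thresh:                   # fast-decaying head -> bf16 partition
            head_slot.append((True, n_bf16))
            n_bf16 += 1
        else:                                 # long-memory head -> f32 partition
            head_slot.append((False, n_f32))
            n_f32 += 1
    return head_slot, n_f32, n_bf16


# Made-up values for four heads (not taken from any model).
slots, n_f32, n_bf16 = classify_layer(
    ssm_a=[-0.01, -1.0, -0.02, -2.0],
    ssm_dt=[0.0, 0.0, 0.0, 0.0],
    t_thresh=10.0,
)
```

Heads keep their natural order in `slots`, so activations need no permutation; only the state lookup is redirected to `(partition, local_idx)`. The two counts size the `s_l_f32` and `s_l_bf16` sub-caches for the layer.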
+
+**Expected result (from the physics + the whole-bf16 measurement):** KLD contribution per head
+~ `(eps*tau_h)^2` (eps~2^-8~3.9e-3) is dominated by the top-tau heads, so removing the top
+~25-40% by tau cuts MeanKLD by 1-2 orders. Design band **f32 fraction f in [0.30, 0.50]**:
+- f=0.30 (n_bf16/H=0.70): `f_bytes`=0.65 -> ~2.20 ms/call (-35%), captures ~70% of the bf16
+  win -> dense **~454 t/s = ~108% of vLLM** (gate-likely, MeanKLD ~1e-3..1e-2).
+- f=0.50: `f_bytes`=0.75 -> ~2.54 ms/call (-25%), captures ~50% -> dense **~430 t/s = ~103% of
+  vLLM** (most robust pass; strict KL<1e-3 may need this fraction).
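The two design points above follow from the byte model `f_bytes = f + (1 - f)/2`. A minimal sketch, assuming per-call time scales linearly with state bytes; `hybrid_call_ms`, `win_captured` and `BASE_MS` are hypothetical names, and the all-f32 per-call time is not stated in the text but back-derived from the f=0.30 bullet.

```python
def hybrid_call_ms(f32_fraction, f32_call_ms):
    """Per-call time if it scales with state bytes (bf16 heads cost half)."""
    fb = f32_fraction + (1.0 - f32_fraction) / 2.0
    return fb, fb * f32_call_ms


def win_captured(f32_fraction):
    # Share of the all-bf16 saving (f_bytes = 0.5) that the hybrid keeps.
    fb, _ = hybrid_call_ms(f32_fraction, 1.0)
    return (1.0 - fb) / 0.5


BASE_MS = 2.20 / 0.65   # assumed all-f32 per-call time, back-derived from the f=0.30 bullet

fb_30, ms_30 = hybrid_call_ms(0.30, BASE_MS)
fb_50, ms_50 = hybrid_call_ms(0.50, BASE_MS)
```

Both bullets are consistent with one all-f32 baseline of roughly 3.4 ms/call, which is why the f=0.50 point reproduces ~2.54 ms from the f=0.30 figure alone.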
+
+The exact f is found by the T_thresh sweep. **MoE:** A keeps the +13-25% recurrence share
+KL-passing but does NOT by itself close the MoE GEMM gap (that is B). Joint ship gate = nsys
+per-call bytes down AND KL<1e-3 for BOTH models; neither alone ships. Hybrid is STRICTLY safer
+than whole-bf16 (we keep f32 exactly where bf16 is unsafe) and matches vLLM's f32 on those heads
+(vLLM keeps all-f32 everywhere).
+
+## (4) Ordered B / C / D queue with build triggers
+
+- **B-1 re-graph default flip (patch 0025): trigger = NOW / immediate.** Already built, measured
+  (+1.9-4.4% MoE), both gates PASSED. Flip env-gated -> `should_use_mmq`-gated default-ON. No
+  dependency on A. Ship first.
+- **B-2 down_proj act-quant retune (M1): trigger = after A's kernel work lands** (reuse the warm
+  GPU window). Bit-exact block/grid retune of `quantize_mmq_nvfp4`, byte-identical output.
+  Bounded ~1% (act-quant is ~2% of the MoE step). Run it; it is cheap.
+- **B-3 mmq_y-down warp-remap: trigger = ONLY if B-1 + B-2 + A leave MoE below the target.**
+  Real kernel change, BW-neutral, predicted bounded on this BW-bound model. Speculative; gate by
+  test-backend-ops MUL_MAT_ID + md5.
+- **C-1 persistent cgraph: trigger = ONLY if a broader ggml graph-lifetime refactor is already
+  in flight.** Standalone it is ~0.05%, not worth the graph-lifetime touch. Bit-exact (replay).
+- **C-2 lm_head BF16 kernel swap: trigger = effectively NEVER for the default** (0 to ~3 ms,
+  uncertain it beats nvjet, high effort). Documented; not queued.
+- **D Option 2 f16-glue norm.cu kernels: trigger = ONLY if dense parity is still wanted AFTER A
+  lands AND A is shown insufficient, AND an opt-in non-bit-exact mode is acceptable.** Multi-file,
+  recovers ~11 ms (norm/elementwise band), gated by the SAME KL<1e-3 that plain bf16-SSM failed.
+  Skip Option 1 (net-zero cast overhead). Lowest priority of all.
+
+**Bottom line:** ship 0025 now (free, measured MoE +1.9-4.4%), then build A (the only path
+ABOVE vLLM on dense, KL-gated, ~430-454 t/s = 103-108% of vLLM), then bank B-2/B-3 on MoE. C is
+last for the bit-exact default (<1%, dead-end); D is opt-in-only and dense-only, behind the KL
+gate, only if the last ~4% is ever chased. The recurrence is already PAST vLLM; A converts that
+proven win into a KL-passing form, and the MoE GEMM floor (the structural residual) is the one
+piece no bit-exact lever fully closes - vLLM ships purpose-built Marlin-NvFp4 there.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]