From 001d8334268dd80867273dfb690a7c9b52f7b9e9 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Fri, 26 Jun 2026 09:11:21 +0000
Subject: [PATCH] docs(paged): f16/bf16 glue probe - dense decode residual
 ceiling

Empirical probe on q36-27b-nvfp4 @npl128 (build f7409c2, patch 0023):
- attention KV cache default is ALREADY f16 (K/V f16) -> --cache-type f16 is a
  no-op; q8_0 within noise -> KV dtype is not a decode lever
- nsys node-trace decode budget: f32-glue (norms/elementwise/activations/attn,
  excl. SSM recurrence + NVFP4 GEMM) = 28.7 ms = 8.4% of step (40.9 ms = 12%
  incl. the non-FP4 cublas GEMM)
- f16 realistically recovers ~11-16 ms of the ~27 ms/step gap = ~40-60% of the
  8.2% residual -> ~95-96% parity, not a full close; non-bit-exact opt-in only

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 .../patches/paged/F16_DENSE_RESIDUAL_PROBE.md | 118 ++++++++++++++++++
 1 file changed, 118 insertions(+)
 create mode 100644 backend/cpp/llama-cpp/patches/paged/F16_DENSE_RESIDUAL_PROBE.md

diff --git a/backend/cpp/llama-cpp/patches/paged/F16_DENSE_RESIDUAL_PROBE.md b/backend/cpp/llama-cpp/patches/paged/F16_DENSE_RESIDUAL_PROBE.md
new file mode 100644
index 000000000..61d8672c0
--- /dev/null
+++ b/backend/cpp/llama-cpp/patches/paged/F16_DENSE_RESIDUAL_PROBE.md
@@ -0,0 +1,118 @@
+# F16/BF16 Glue Probe - the dense decode residual to vLLM
+
+Question: dense decode parity sits at llama 384.6 vs vLLM 418.8 t/s @ npl128 = 91.8%.
+The 49% SSM recurrence (f32 BOTH engines) and the 27% NVFP4 GEMM (W4A4 BOTH) are
+precision-matched. The residual ~8% may be partly that llama runs the NON-recurrence
+GLUE (attention, norms, activations, elementwise, residual stream) in F32 while vLLM
+runs the model in BF16. This probe settles, empirically on q36-27b-nvfp4 @npl128, how
+much of that residual is realistically f16/bf16-closable.
+
+Model: Qwen3.5-27B NVFP4 (dense). 64 layers = 16 attention + 48 gated-DeltaNet
+(SSM) recurrent. Build b104-f7409c2 (patch 0023), verified git-clean and coherent.
+The bf16 SSM work was never applied to the tree (only saved as a diff backup);
+ggml-cuda needed no recompile on rebuild, so the binary is bit-identical to clean 0023.
+
+## (1) Current KV / state dtype  (SETTLED)
+
+From the `-v` init log:
+
+- ATTENTION KV cache (16 of 64 layers):
+  `K (f16): 1280 MiB, V (f16): 1280 MiB`  =>  **DEFAULT IS ALREADY F16.**
+- RECURRENT cache (48 gated-DeltaNet layers):
+  `R (f32): 180 MiB` (conv state), `S (f32): 4608 MiB` (SSM state)  =>  **f32.**
+
+Consequence: the attention KV is ALREADY at vLLM's 16-bit bit-width. `--cache-type f16`
+is a literal no-op; the cheap KV lever is spent. The f32 lives in (a) the recurrent
+SSM/conv state (matched to vLLM, the bf16 version is shelved for failing the f32 KL
+gate) and (b) the intermediate-activation glue (norms, residual stream, attention
+compute, activations) - that glue is where llama still pays f32 vs vLLM bf16.
+
+## (2) Decode kernel budget  (nsys --cuda-graph-trace=node, npl128, 39 steady steps)
+
+step span 342.0 ms ; sum-of-kernels 338.8 ms ; **kern/span 99.0%** - the decode is
+GPU-bound, kernels back-to-back, nsys overhead negligible. The measured bench step
+(128 tok / 373.5 t/s = 342.8 ms) equals the nsys span, so the %-of-step figures below
+ARE wall-time fractions.
+
+OUT of scope - already precision-matched (83.2% of the step):
+
+| kernel | ms/step | % |
+|---|---:|---:|
+| gated_delta_net (SSM recurrence, f32 BOTH) | 167.1 | 49.3 |
+| mul_mat_q NVFP4 (W4A4 GEMM, BOTH)          |  93.0 | 27.4 |
+| quantize_mmq_nvfp4 (FP4 act-quant)         |  17.6 |  5.2 |
+| mul_mat_q stream_k fixup (FP4 reduction)   |   4.1 |  1.2 |
+
+F16-ABLE GLUE - f32 in llama, bf16 in vLLM:
+
+Budget A (clean compute glue, decoupled from the f32 state):
+
+| kernel | ms/step |
+|---|---:|
+| flash_attn_ext            | 11.94 |
+| unary_gated_op (silu)     |  5.16 |
+| k_bin_bcast (mul)         |  4.72 |
+| rms_norm                  |  3.58 |
+| k_bin_bcast (add, residual)|  1.67 |
+| l2_norm                   |  0.65 |
+| cpy_scalar                |  0.37 |
+| rope                      |  0.26 |
+| sigmoid                   |  0.22 |
+| softplus                  |  0.09 |
+| flash_attn fixups         |  0.08 |
+| **Budget A total**        | **28.74 ms = 8.4% of step** |
+
+Budget B (+ the non-FP4 cublas GEMM): + nvjet 12.17 ms => **40.91 ms = 12.0%**.
+
+Recurrence-coupled data movement (NOT bit-safe f16-able - needs the f32 state to go
+bf16, which is the shelved work that fails the f32 KL gate):
+ssm_conv 8.37 + k_get_rows_float 6.98 + k_set_rows 0.66 + gdn_gather 0.06 = 16.08 ms = 4.7%.
+
+## (3) Cache-type A/B  (decode_agg S_TG t/s, dense)
+
+| npl | DEFAULT | F16-explicit | Q8_0 |
+|---:|---:|---:|---:|
+|  32 | 209.05 | 208.75 | 208.63 |
+| 128 | 373.46 | 373.56 | 374.71 |
+
+- F16-explicit == DEFAULT (0.03% delta) => proves the default KV is already f16; the
+  flag is a no-op.
+- Q8_0 (8-bit, half the f16 KV bytes) is within noise at every npl => the attention KV
+  bandwidth is NOT a decode bottleneck (it is 16/64 layers; flash_attn is 3.5% of the
+  step). The KV-cache dtype is not a decode lever for this model.
+- Coherence (48-tok greedy, "The capital of France is"): default and q8_0 both fully
+  coherent; q8_0 only causes minor greedy-path divergence, no quality break. But since
+  q8_0 buys zero speed and is not bit-exact, it is pointless here.
+
+## Read: how much of the ~8% dense residual is f16-closable
+
+The gap is ~27 ms/step (llama 332.8 ms vs vLLM 305.7 ms at npl128).
+
+f16 does not zero the glue, it speeds it up. Realistic recovery:
+- Memory-bound glue (norms + elementwise + activations + copies + rope = 16.7 ms):
+  f16 halves the bytes => ~50% => ~8.4 ms.
+- flash_attn_ext (12.0 ms): KV is ALREADY f16 and the accumulation must stay f32
+  (vLLM also f32-accumulates), so only the Q/projection side helps => ~25% => ~3.0 ms.
+- Budget A realistic recovery ~= **11.4 ms**.
+- nvjet non-FP4 GEMM (12.2 ms): bf16 tensor cores vs f32 ~= ~40-50% => ~5 ms, but
+  uncertain (may already run TF32) => +nvjet recovery ~= **16 ms**.
+
+So f16/bf16 glue realistically recovers **~11 ms (glue only) to ~16 ms (+GEMM) of the
+~27 ms gap = roughly 40-60% of the dense residual.** That moves parity 91.8% ->
+~95-96%, NOT a full close. The remaining ~3-4% is structural: cublas GEMM efficiency
+on the non-FP4 paths, graph/launch scheduling vs vLLM, and the irreducible f32
+accumulation in attention and the recurrence.
+
+Caveats for a build decision:
+1. The single largest f16-able line (flash_attn 11.9 ms) is the LEAST recoverable
+   (KV already f16, accumulate stays f32). The cleanly recoverable mass is the
+   norms+elementwise+activations (~16.7 ms).
+2. The recurrence-coupled 4.7% (ssm_conv + state gather) is only f16-able by taking the
+   SSM/conv state to bf16 = the already-built, already-shelved work that fails the f32
+   KL gate. It is OUT of a bit-safe f16 build.
+3. f16 glue is NON-bit-exact (same category as the shelved bf16 SSM state). It would be
+   an OPT-IN fast path, not the bit-exact default. Realistic ceiling ~95-96% parity for
+   a meaningful (norms/elementwise/activations + optionally nvjet) f16 conversion, at
+   the cost of leaving the 95%-bit-exact f32 plateau.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]