From 89e62fc74f7b719792d38949cbaa8cb22907e3dc Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Fri, 26 Jun 2026 09:12:55 +0000
Subject: [PATCH] docs(paged): finalize f16 glue probe - cost analysis + build
 verdict

Synthesize the GPU kernel-budget probe with the read-only glue source
map. Add (4) the implementation cost - llama has no model-compute-dtype
knob, the residual stream is F32 by construction (ggml_mul_mat hardcodes
F32 output), so f16 glue is not a flag but an opt-in multi-file change
(norm.cu f16 kernels + f16 residual stream). Add the final verdict:
precision is not the dominant cause of the 8% residual (83% of the step
is already f32/W4A4-matched), f16 recovers only 40-60% of the gap and is
non-bit-exact, so do not build it as the default; ship the 95%-bit-exact
f32 plateau and target the structural cublas/graph-launch ~3-4% instead.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 .../patches/paged/F16_DENSE_RESIDUAL_PROBE.md | 66 +++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/backend/cpp/llama-cpp/patches/paged/F16_DENSE_RESIDUAL_PROBE.md b/backend/cpp/llama-cpp/patches/paged/F16_DENSE_RESIDUAL_PROBE.md
index 61d8672c0..2cd3af3e3 100644
--- a/backend/cpp/llama-cpp/patches/paged/F16_DENSE_RESIDUAL_PROBE.md
+++ b/backend/cpp/llama-cpp/patches/paged/F16_DENSE_RESIDUAL_PROBE.md
@@ -115,4 +115,70 @@ Caveats for a build decision:
    a meaningful (norms/elementwise/activations + optionally nvjet) f16 conversion, at
    the cost of leaving the 95%-bit-exact f32 plateau.
 
+## (4) What it costs to capture it: NOT a flag  (source map, read-only)
+
+The asymmetry confirmed at the source level (DGX `~/llama-paged-dev` @ f7409c2, tree
+git-clean; vLLM ref from BITEXACT_VS_VLLM.md):
+- vLLM `text_config.dtype = bfloat16` => the ENTIRE non-quantized compute (residual
+  stream, RMSNorm I/O with f32-internal reduction, FlashAttention out, SiLU, gating,
+  conv state) runs in BF16. Only the gated-DeltaNet temporal SSM state is f32
+  (`mamba_ssm_dtype="float32"`, matched to llama).
+- llama's intermediate activations are F32 **by construction, everywhere**:
+  `ggml_mul_mat` hardcodes an F32 result (ggml.c:3250), so the stream snaps back to F32
+  after EVERY projection (Q/K/V/O, wqkv, ssm in/out, ffn up/gate/down, eh_proj, lm_head).
+  `ggml_rms_norm`/`ggml_l2_norm`/`ggml_silu`/`ggml_add`/`ggml_mul`/`flash_attn_ext`/
+  `ggml_ssm_conv` all preserve/emit F32. There is no point where the stream is f16.
+
+There is **no vLLM-style global model-compute-dtype knob** in ggml/llama. You cannot flip
+one model-load flag. Three escalating options, all opt-in / non-bit-exact:
+
+- A flag: does not exist and cannot exist as-is - the F32 is structural, not a default.
+- Option 1 (targeted per-op f16, no new kernels): silu/sigmoid/softplus (unary.cu),
+  add/mul (binbcast.cu), rope already have f16 paths. But the residual stream stays F32,
+  so each op must be wrapped cast(F16)->op->cast(F32), adding 2 `cpy` ops per op. At
+  decode these ops are tiny and memory-bound; the cast traffic ~= the op traffic, so the
+  net win is near-zero or negative unless the cast is FUSED into the producer/consumer.
+  Crucially this CANNOT capture the norms - the largest glue item.
+- Option 2 (the real lever, multi-file code change): carry the residual stream in F16
+  across the layer, cast to F32 only at the quantize boundary. Requires (a) f16 projection
+  output (patch `ggml_mul_mat` to honor a dst-type, or a cpy->F16 after each proj),
+  (b) **NEW F16 template instantiations in norm.cu** for rms_norm / l2_norm / fused
+  rms+mul / fused rms+mul+add (today hard-`GGML_ASSERT(type==F32)` at norm.cu:441-442,
+  465-466, 525-527, 601-604) keeping the f32 reduction, (c) optionally an F16 ssm-conv.cu,
+  plus graph-dtype plumbing in qwen35.cpp / llama-graph.cpp to thread F16 through
+  inpL/cur/the residual adds. The single biggest code item is the norm.cu f16 kernels -
+  the exact band vLLM runs in bf16 that Option 1 cannot reach.
+
+Must-stay-f32 regardless (vLLM does the same): RMSNorm/L2Norm sum-of-squares reduction;
+FlashAttention KQ/softmax accumulation (forced `GGML_PREC_F32`, llama-graph.cpp:2117);
+the gated-DeltaNet recurrent SSM temporal state (f32 BOTH engines, out of scope); the
+src1->q8_1/nvfp4 activation quantization reads F32, so the stream must be F32 at every
+projection boundary no matter what.
+
+## Verdict: probe-further-then-decide, leaning not-worth-it for the default
+
+f16 does NOT meaningfully close the dense residual on its own, and what it can close is a
+multi-file non-bit-exact build, not a flag.
+
+- Precision is NOT the dominant cause of the 8% gap. 83.2% of the decode step (recurrence
+  49.3% + FP4 GEMM 27.4% + FP4 act-quant/fixup 6.4%) is already precision-matched f32/W4A4
+  on both engines. The f16-able glue is only 8.4% of the step (Budget A); of the ~27 ms
+  gap, f16 realistically recovers ~11 ms (glue) to ~16 ms (+ the uncertain nvjet GEMM) =
+  40-60% of the residual. The remaining ~3-4% is kernel/scheduling efficiency (non-FP4
+  cublas GEMM, graph-launch overhead, irreducible f32 accumulation) that f16 cannot touch.
+- The recoverable mass is the norm+elementwise+activation band, which is precisely the
+  part that needs NEW f16 norm kernels (Option 2). The no-new-kernel ops (Option 1) are
+  too small and their cast overhead likely eats the win.
+- Any version is opt-in / non-bit-exact, the same gate-failing category as the already
+  shelved bf16-SSM-state work. It cannot be the bit-exact f32 default; it is a second,
+  separately-maintained fast path with a ~95-96% ceiling.
+
+Recommendation: do NOT build the f16 glue path now. Ship the 95%-bit-exact f32 plateau
+(patches 0018-0023) as the default. If chasing the last 4% later, the only lever worth a
+build is Option 2's norm.cu f16 kernels + f16 residual stream (recovers the norm/elementwise
+band, ~11 ms); gate it behind an explicit opt-in flag and validate it against the same KL
+threshold as bf16-SSM before shipping. The non-FP4 cublas GEMM efficiency and graph-launch
+scheduling - the structural ~3-4% - are a better long-term target than precision, because
+they help the bit-exact default too.
+
 Assisted-by: Claude:opus-4.8 [Claude Code]