Source-verify each paged decode optimization as either quant-agnostic (it
operates on the f32 gated-DeltaNet/conv recurrent state, the paged serving host
path, or the generic MMQ/CUDA-graph routing) or NVFP4-specific (it only fires
inside the use_native_fp4 / GGML_TYPE_NVFP4 branch).

Findings: 14 of the 16 landed patches are quant-agnostic (0013/0014/0015/0016/
0018/0019/0020/0021/0022/0024/0025/0026/0028/0029). Of the remaining two, only
0023 (MoE FP4 act-quant de-dup, inside use_native_fp4) is NVFP4-specific and
active; 0017 is also NVFP4-only but default-off and inert (kill-gate, no win).

Correct the hypothesis on 0025: the actual patch relaxes the MUL_MAT_ID
CUDA-graph guard, gated on ggml_is_quantized + ggml_cuda_should_use_mmq (the
generic quantized grouped-MMQ path), so it is not NVFP4-specific. The
NVFP4-specific act-quant / quantize_mmq_nvfp4 work is LEVER 3, which ended in a
measurement STOP and never landed (no patch); LEVER 4 (NVFP4 projections)
failed the KL check and never shipped.

Add the relative-impact-by-quant estimate: the fixed f32-recurrence/host time
(ms) is the largest fraction of a decode step at NVFP4 and shrinks at Q8/bf16
as the weight read grows. Also add the A/B plan to prove generality on a Q4_K_M
requant of the same Qwen3.6: build the control first, apply an md5/KLD
bit-exact gate per path, run decode_agg at npl 32/128, and use 0023 as the
NVFP4-only negative control.
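
The shape of the relative-impact estimate can be sketched as below. This is an
illustration only: it assumes a memory-bound decode step whose time is a
quant-independent fixed cost plus the weight read at a given bandwidth, and
every number in it is a made-up placeholder, not a measurement from this work.

```python
def fixed_fraction(fixed_ms: float, weight_gb: float, bandwidth_gb_s: float) -> float:
    """Share of one decode step spent in the quant-independent fixed cost.

    Assumed model (not from the patches): step_ms = fixed_ms + weight_read_ms.
    """
    weight_read_ms = weight_gb / bandwidth_gb_s * 1000.0
    return fixed_ms / (fixed_ms + weight_read_ms)


# Hypothetical placeholders: fixed f32-recurrence/host cost, memory bandwidth,
# and bytes of weights read per step for each quantization.
FIXED_MS = 2.0
BANDWIDTH_GB_S = 1000.0
WEIGHT_GB = {"nvfp4": 2.0, "q8": 4.0, "bf16": 8.0}

fractions = {
    quant: fixed_fraction(FIXED_MS, gb, BANDWIDTH_GB_S)
    for quant, gb in WEIGHT_GB.items()
}

for quant, frac in fractions.items():
    print(f"{quant}: fixed cost is {frac:.0%} of the step")
```

With the fixed cost held constant, the fraction falls as the weight read grows,
which is why a quant-agnostic saving shows up largest at NVFP4.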

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>