From 5667dfe461b39eb7d166669b620094728572a010 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Sat, 27 Jun 2026 07:02:54 +0000
Subject: [PATCH] docs(paged): arch-generality audit - optimization
 classification (0017-0029)

Classify the paged-attention optimizations as arch-GENERAL (ship everywhere),
GB10-TUNED (per-arch retune), or Blackwell-precision-specific; add the per-arch
expected story (sm_100/Hopper/Ada/Metal/CPU) and the SAFETY gap (fused GDN/conv
ops are CUDA+CPU-only with backend-ungated emission). Extends the prior
build/gallery-targeting audit in the same file.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 .../patches/paged/ARCH_GENERALITY_AUDIT.md    | 183 ++++++++++++++++++
 1 file changed, 183 insertions(+)

diff --git a/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md b/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md
index e5f3ce9a0..5050079fe 100644
--- a/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md
+++ b/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md
@@ -218,4 +218,187 @@ description + tags. Recommend a one-line Blackwell-recommended hardware note +
 consistent `nvfp4`/`blackwell` tags on all six, and tempering the GB10 bench
 claims with the "runs slower off-Blackwell" caveat.
 
+## Section: optimization-generality (patches 0013/0016 + 0017-0029)
+
+Classifies each optimization as arch-GENERAL (ship everywhere, helps any arch),
+GB10-TUNED (needs per-arch retuning of the magnitude/constants), or
+Blackwell-PRECISION-specific (only meaningful where FP4-MMA exists). Derived from
+the patch commit bodies and the diffs they touch; bit-exactness verdicts come from
+the patches' own md5/test-backend-ops gates.
+
+Arch axis used: NVFP4 FP4-MMA needs `BLACKWELL_MMA_AVAILABLE` (sm_120/121 consumer
+and sm_100 datacenter); Hopper sm_90 / Ada sm_89 / Ampere sm_80-86 have none;
+Metal/CPU/AMD/Intel have no NVFP4-MMA. Datacenter Blackwell sm_100 has FP4-MMA but
+HBM3e (~8 TB/s) so it is COMPUTE-bound, not bandwidth-bound: every GB10
+"bandwidth-bound" verdict inverts there. The FP4-MMA kernel itself is UPSTREAM
+ggml-cuda gated by `BLACKWELL_MMA_AVAILABLE`; none of these patches add it - they
+reshape/route/dedup around it (0017/0020/0023/0025) or are precision-agnostic.
+
+### A. ARCH-GENERAL (ship everywhere; pure win or provably neutral)
+
+Graph-shape, host-side, or gather/copy-elimination changes. No FP4, no
+bandwidth-floor assumption. Bit-exact. Help or are neutral on any arch that runs
+the code path.
+
+- 0013 decoupled prefill-token budget - pure `update_slots()` scheduler policy,
+  zero libllama/ggml change, orthogonal to LLAMA_KV_PAGED, default-off
+  byte-identical. Latency/fairness lever (flattens decode-ITL spike from a
+  co-batched long prefill). No arch assumption.
+- 0016 dynamic decode-first prefill budget - supersedes 0013; still pure
+  `update_slots()` policy, default-off byte-identical, T==n_batch degenerate case
+  == stock. Arch-neutral, identical paged on/off.
+- 0024 paged-pool burst-reclaim - host-side block accounting + defrag + slot
+  release; never touches KV values or compute. Gated behind LLAMA_KV_PAGED. Fixes
+  a real fragmentation/throughput-collapse bug on long-lived servers.
+  Arch-independent host bookkeeping.
+- 0029 block-table within-step host cache - memcpy-reuse of the host block table
+  across full-attention layers in one step; bit-exact, LLAMA_PAGED_NO_BT_CACHE=1
+  disables it. Helps host-bound decode (dense +2.7% on GB10), neutral when
+  compute-bound (MoE flat). The faster the GPU (e.g. sm_100), the MORE host-bound
+  decode becomes, so the win should be BIGGER there.
+- 0028 recurrent-state (conv-tap) gather fusion - eliminates a k_get_rows by
+  reading cache[ids[s]] in-kernel; bit-identical. Removing a gather that vLLM has
+  no equivalent of is a win on ANY arch running the GDN path; it depends on neither
+  FP4 nor the bandwidth floor.
+- 0018 in-place SSM-state write-back + 0019 fused SSM-state gather + 0021
+  conv-state in-place fusion - remove a D2D state copy-back (0018), a state
+  get_rows (0019), and the 4-op conv chain + ring-state copy (0021), mirroring
+  vLLM's in-place recurrent update. Arithmetic byte-identical; what is removed is
+  plumbing, so wins on ANY arch running the gated-DeltaNet recurrence.
+- Paged KV core (0001-0012) - paged KV manager, on-demand alloc, prefix caching,
+  in-kernel paged read. No precision or bandwidth-floor assumption; the most
+  portable part of the work, helps capacity/serving anywhere it compiles.
+
+NOTE: 0018/0019/0021/0028 + the base GDN op have CUDA + CPU kernels ONLY (every
+gate is "CUDA0 vs CPU"). General within {CUDA, HIP/ROCm (hipified ggml-cuda), CPU};
+NOT covered on Metal/SYCL/Vulkan - see SAFETY #1.
+
+### B. GENERAL-IN-DIRECTION but the MAGNITUDE was measured on the GB10 floor
+
+Correct + beneficial everywhere, but the specific %/constants are GB10-bound.
+
+- 0020 o_proj GDN MMVQ->MMQ reshape - collapses the GDN output to 2D so the
+  ssm_out matmul sees src1->ne[1]=128 and routes to MMQ (M=128 GEMM that amortizes
+  the weight read across 128 tokens) instead of MMVQ (built for batch<=8 with the
+  128 sequences stuck in ne[2]). Zero-cost view change, bit-identical, gated to the
+  gated-DeltaNet path. UNIVERSAL: MMVQ (mul_mat_vec_q) is structurally a batch<=8
+  GEMV and cannot amortize the weight read at a real M=128; MMQ does, on ALL CUDA
+  archs (dp4a pre-tensor-core still amortizes) and on HIP. So MMQ > MMVQ at M=128
+  is NOT GB10-specific - pure win wherever MMQ exists. RE-TUNE: the +31.7%
+  magnitude is on the GB10 BW floor; smaller % on sm_100 but still correct.
+  REGRESSION RISK: only at a genuinely tiny real M (n_seqs<=8, e.g. single-stream
+  decode) could forcing MMQ be slower than MMVQ - see SAFETY #2. On Metal/Vulkan/SYCL
+  the MMVQ/MMQ split differs; the reshape is harmless (a view) but yields no benefit.
+- 0023 MoE NVFP4 activation-quantize de-dup - for broadcast up/gate proj (ne11==1)
+  quantize the unique token activations once and gather the identical FP4 blocks
+  instead of re-quantizing per expert; bit-exact, ..._DEDUP=0 off.
+  DIRECTION-GENERAL (de-duplicating identical work is always good) but
+  NVFP4-block-layout specific (uint4 copy of block_fp4_mmq) and only matters where
+  activation-quant is a measurable decode bucket - on a compute-bound arch the
+  saved quant time may be off the critical path (even on GB10 the MoE TG win is
+  only +1.7%).
+
+### C. GB10-TUNED (constants are GB10 winners; re-sweep per arch)
+
+- 0022 GDN recurrence occupancy/coalescing retune - column-folding template params
+  NUM_WARPS/COLS_PER_WARP, default (16,8), env-selectable GDN_NW/GDN_CPW. The
+  reduction/FMA order is byte-identical (md5-gateable); only the warp/block->column
+  assignment changes to raise memory-level parallelism on a BANDWIDTH-BOUND kernel.
+  (16,8) is explicitly "the measured GB10 winner". Textbook per-arch tune: optimal
+  values depend on DRAM latency / L2 / SM count / occupancy. SAFE everywhere
+  (bit-exact, env-overridable, no forbidden float4 load) but unlikely optimal off
+  GB10; on a compute-bound arch (sm_100) the kernel may not even be the
+  bottleneck. Needs a per-arch GDN_NW/CPW sweep.
+- 0017 FP4 dense-GEMM decode tile tune - shipped as a P0 bit-exact gate plus
+  DEFAULT-OFF occupancy levers (GGML_CUDA_FP4_MMQ_Y / ..._MINBLOCKS / ..._DENSE_MMQ_X).
+  Honest GB10 verdict was a KILL-GATE: every cheap occupancy probe REGRESSED on
+  sm_121 (the M=128 tile is already weight-read optimal). Nothing on by default =>
+  byte-identical to stock everywhere. On a DIFFERENT FP4 arch (sm_100) the
+  kill-gate could flip; the levers are in place and inert, ready to re-sweep.
+
+### D. Blackwell-PRECISION-specific (only meaningful where FP4-MMA exists)
+
+- 0025 NVFP4 MoE-decode re-graph - keeps CUDA graphs on for the grouped
+  stream-k mul_mat_q id-path; env-gated LLAMA_MOE_FORCE_GRAPHS, default-off
+  byte-identical. The CUDA-graph mechanism is general, but the specific guard
+  condition (mmvq_mmid_max==8 for NVFP4 on sm_121) and the "graphs are safe here"
+  reasoning are tied to the NVFP4 grouped path on Blackwell. On a non-FP4 arch the
+  node would not take that branch -> inert.
+- 0026 hybrid per-head SSM-state precision (bf16 SSM/conv cache) - adds
+  --cache-type-ssm/-conv + --ssm-bf16-tau (per-head f32-vs-bf16 by memory length).
+  Default f32 = bit-exact. PRECISION/bandwidth lever: bf16 halves the dominant GDN
+  decode byte stream, which only pays off on a BANDWIDTH-bound arch (GB10). On
+  sm_100 HBM3e it buys little. Value is bandwidth-floor specific; correctness is
+  precision-specific (opt-in, default-safe).
+- NVFP4 GGUFs + the 6 gallery -paged rows - inherently Blackwell-precision-specific
+  for the FAST path: NVFP4 weights only get FP4-MMA on sm_120/121/100. Elsewhere
+  they run via dequant (correct but slow), per the gallery-targeting section above.
+
+### Per-arch expected story
+
+- Consumer Blackwell sm_120/121 (dGPU sm_120 / GB10 sm_121): the validated target.
+  A dGPU sm_120 (GDDR7, roughly 1 TB/s and up) is less BW-starved than GB10
+  (LPDDR5x, 273 GB/s), so the bandwidth-floor wins (0018/0019/0022/0026) shrink in %
+  while the host-pipeline + graph wins (0029/0025) and the MMQ reshape (0020) hold.
+- Datacenter Blackwell sm_100 (HBM3e ~8 TB/s): FP4-MMA WORKS so NVFP4 stays fast
+  (precision bucket + 0025 carry over), BUT the BW floor is GONE -> compute-bound.
+  Re-tune: 0022 GDN_NW/CPW sweep; 0017 kill-gate may flip (levers ready). The
+  bandwidth-motivated wins (0018/0019, 0026 bf16-state) shrink toward neutral; the
+  host-pipeline/graph/MMQ-reshape general wins (0029/0025/0020) still help. Net:
+  works, faster GPU, needs a re-tune pass, do NOT assume the GB10 constants.
+- Hopper sm_90 / Ada sm_89 / Ampere sm_80-86: NO FP4-MMA. NVFP4 GGUFs + the FP4
+  levers (0017/0023/0025) are out of scope -> use a DIFFERENT quant (Q4_K/AWQ/GPTQ
+  etc). BUT the precision-agnostic infra still helps: paged KV core, scheduler
+  (0013/0016), burst-reclaim (0024), block-table cache (0029), SSM/conv
+  plumbing-removal (0018/0019/0021/0028); 0020 still routes the o_proj
+  MMVQ->MMQ in whatever quant it uses. Story: "no FP4 -> another quant, but paged +
+  SSM + scheduler infra is a pure win".
+- Metal / CPU / AMD-ROCm / Intel-SYCL / Vulkan (all built by the matrix): no
+  NVFP4-MMA; paged KV + scheduler infra is the portable value. CPU has reference
+  kernels for every fused op (the bit-exact gate is CUDA0-vs-CPU). ROCm/HIP reuses
+  ggml-cuda (hipify) so it inherits the fused-op kernels. Metal/SYCL/Vulkan do NOT
+  get the new fused-op kernels (SAFETY #1).
+
+### SAFETY / regression risks
+
+1. Fused GDN/conv ops are CUDA + CPU only; emission is NOT backend-gated.
+   0018/0019/0021/0028 add new ops (ggml_gated_delta_net_inplace[_ids],
+   ggml_ssm_conv_update_inplace[_ids]) implemented for CUDA + CPU only. They are
+   emitted by the graph builder whenever cparams.fused_gdn_ar/fused_gdn_ch is set
+   (constructor sets fused_gdn_ch=true, auto_fgdn=true), with NO check that the
+   active compute backend is CUDA/HIP. Fine on CUDA/HIP/CPU. On Metal/SYCL/Vulkan
+   two failure modes: (a) the new GATED_DELTA_NET op has no kernel -> loud
+   supports_op/assert (but the base GDN op may already be CUDA/CPU-only upstream,
+   so a qwen35 model likely cannot run there regardless); (b) the fused conv
+   variant REUSES GGML_OP_SSM_CONV discriminated by a non-null src[3]/src[4] - a
+   backend that supports plain SSM_CONV but ignores the discriminator would compute
+   the WRONG plain conv -> SILENT corruption. That is the one genuine
+   silent-correctness risk. RECOMMENDATION: gate fused-op emission on the compute
+   backend being CUDA/HIP (or add a supports_op guard rejecting the discriminated
+   SSM_CONV where the fused handling is absent).
+2. 0020 MMVQ->MMQ at tiny real M. MMQ is right at decode M=128 (the gallery
+   batched-serving regime); it could be the slower kernel only at a genuine M<=8
+   (e.g. single-stream decode, n_seqs=1). Bit-identical either way, so the only risk
+   is a perf regression at tiny batch on non-GB10 archs (never the measured GB10
+   case). Worth confirming the reshape still picks the better kernel at n_seqs=1.
+3. 0022 default (16,8) off GB10: safe (bit-exact, env-overridable) but non-optimal;
+   do not ship it as the default for sm_100/Hopper/Ada without a GDN_NW/CPW sweep.
+   No correctness risk.
+4. Gallery rows do not state a GPU-arch requirement (covered in the
+   gallery-targeting section): add a "Blackwell (sm_120/121/100) recommended" note.
+
+### One-line verdict
+
+The PORTABLE core (paged KV 0001-0012, scheduler 0013/0016, burst-reclaim 0024,
+block-table cache 0029, the SSM/conv plumbing-removal 0018/0019/0021/0028, the
+o_proj MMQ reshape 0020) is arch-general and ships everywhere it compiles -
+bit-exact, mostly default-safe, pure win or neutral. The FP4/NVFP4 levers
+(0017/0023/0025, the GGUFs, the gallery) are Blackwell-precision-specific. The
+occupancy/precision tunes (0017 levers, 0022, 0026) are GB10-bandwidth-floor-tuned
+and need a per-arch re-sweep (especially on sm_100 where the BW floor is gone and
+the regime flips to compute-bound). The single real SAFETY gap: the new fused
+GDN/conv ops are CUDA+CPU-only with backend-ungated emission, so a Vulkan/SYCL/Metal
+paged build of a gated-DeltaNet model could assert (GDN op) or silently miscompute
+(discriminated SSM_CONV) - emission should be gated on the compute backend.
+
 Assisted-by: Claude:opus-4.8 [Claude Code]