docs(paged): arch-generality audit - optimization classification (0017-0029)

Classify the paged-attention optimizations as arch-GENERAL (ship everywhere),
GB10-TUNED (per-arch retune), or Blackwell-precision-specific; add the per-arch
expected story (sm_100/Hopper/Ada/Metal/CPU) and the SAFETY gap (fused GDN/conv
ops are CUDA+CPU-only with backend-ungated emission). Extends the prior
build/gallery-targeting audit in the same file.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-27 07:02:54 +00:00
parent 34abf392fc
commit 5667dfe461


@@ -218,4 +218,187 @@ description + tags. Recommend a one-line Blackwell-recommended hardware note +
consistent `nvfp4`/`blackwell` tags on all six, and tempering the GB10 bench
claims with the "runs slower off-Blackwell" caveat.
## Section: optimization-generality (patches 0013/0016 + 0017-0029)
Classifies each optimization as arch-GENERAL (ship everywhere, helps any arch),
GB10-TUNED (needs per-arch retuning of the magnitude/constants), or
Blackwell-PRECISION-specific (only meaningful where FP4-MMA exists). Read from the
patch commit bodies + the diffs they touch; bit-exactness verdicts are the
patches' own md5/test-backend-ops gates.
Arch axis used: NVFP4 FP4-MMA needs `BLACKWELL_MMA_AVAILABLE` (sm_120/121 consumer
+ sm_100 datacenter); Hopper sm_90 / Ada sm_89 / Ampere sm_80-86 have none;
Metal/CPU/AMD/Intel have no NVFP4-MMA. Datacenter Blackwell sm_100 has FP4-MMA but
also HBM3e (~8 TB/s), so it is COMPUTE-bound, not bandwidth-bound: every GB10
"bandwidth-bound" verdict inverts there. The FP4-MMA kernel itself is UPSTREAM
ggml-cuda gated by `BLACKWELL_MMA_AVAILABLE`; none of these patches add it - they
reshape/route/dedup around it (0017/0020/0023/0025) or are precision-agnostic.
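
The arch axis above can be written down as a small lookup. This is only a sketch: the `Arch` record, the dictionary keys, and both helper names are illustrative and do not come from the patches; the flags simply transcribe what this audit states (FP4-MMA presence, and the bandwidth/compute verdict where one is given).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Arch:
    label: str
    fp4_mma: bool            # mirrors whether BLACKWELL_MMA_AVAILABLE would hold
    regime: Optional[str]    # "bandwidth", "compute", or None where no verdict is stated


# Values transcribed from this audit; the keys are illustrative names only.
ARCHS = {
    "sm_121": Arch("GB10, consumer Blackwell", True, "bandwidth"),
    "sm_100": Arch("datacenter Blackwell, HBM3e", True, "compute"),
    "sm_90": Arch("Hopper", False, None),
    "sm_89": Arch("Ada", False, None),
    "metal": Arch("Metal", False, None),
}


def nvfp4_path(arch: str) -> str:
    """NVFP4 weights take the FP4-MMA kernel only where it exists, else dequant."""
    return "fp4-mma" if ARCHS[arch].fp4_mma else "dequant"


def gb10_bandwidth_verdict_holds(arch: str) -> bool:
    """A GB10 'bandwidth-bound' verdict carries over only to a bandwidth-bound arch."""
    return ARCHS[arch].regime == "bandwidth"
```

For example, `nvfp4_path("sm_100")` stays on the fast path while `gb10_bandwidth_verdict_holds("sm_100")` is false, which is the inversion the paragraph describes.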
### A. ARCH-GENERAL (ship everywhere; pure win or provably neutral)
Graph-shape, host-side, or gather/copy-elimination changes. No FP4, no
bandwidth-floor assumption. Bit-exact. Help or are neutral on any arch that runs
the code path.
- 0013 decoupled prefill-token budget - pure `update_slots()` scheduler policy,
zero libllama/ggml change, orthogonal to LLAMA_KV_PAGED, default-off
byte-identical. Latency/fairness lever (flattens decode-ITL spike from a
co-batched long prefill). No arch assumption.
- 0016 dynamic decode-first prefill budget - supersedes 0013; still pure
`update_slots()` policy, default-off byte-identical, T==n_batch degenerate case
== stock. Arch-neutral, identical paged on/off.
- 0024 paged-pool burst-reclaim - host-side block accounting + defrag + slot
release; never touches KV values or compute. Gated behind LLAMA_KV_PAGED. Fixes
a real fragmentation/throughput-collapse bug on long-lived servers.
Arch-independent host bookkeeping.
- 0029 block-table within-step host cache - memcpy-reuse of the host block table
across full-attention layers in one step; bit-exact, disabled with
LLAMA_PAGED_NO_BT_CACHE=1. Helps host-bound decode (dense +2.7% on GB10), neutral
when compute-bound (MoE flat). The faster the GPU (e.g. sm_100), the MORE
host-bound decode becomes, so this win should be BIGGER off GB10.
- 0028 recurrent-state (conv-tap) gather fusion - eliminates a k_get_rows by
reading cache[ids[s]] in-kernel; bit-identical. vLLM has no equivalent of this
gather, so deleting it is a win on ANY arch running the GDN path; it is neither
FP4-specific nor bandwidth-floor specific.
- 0018 in-place SSM-state write-back + 0019 fused SSM-state gather + 0021
conv-state in-place fusion - remove a D2D state copy-back (0018), a state
get_rows (0019), and the 4-op conv chain + ring-state copy (0021), mirroring
vLLM's in-place recurrent update. Arithmetic byte-identical; what is removed is
plumbing, so wins on ANY arch running the gated-DeltaNet recurrence.
- Paged KV core (0001-0012) - paged KV manager, on-demand alloc, prefix caching,
in-kernel paged read. No precision or bandwidth-floor assumption; the most
portable part of the work, helps capacity/serving anywhere it compiles.
NOTE: 0018/0019/0021/0028 + the base GDN op have CUDA + CPU kernels ONLY (every
gate is "CUDA0 vs CPU"). General within {CUDA, HIP/ROCm (hipified ggml-cuda), CPU};
NOT covered on Metal/SYCL/Vulkan - see SAFETY #1.
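
The gather fusions (0019/0028) can be pictured with a toy model. The sketch below is plain Python rather than the CUDA kernels, and both function names are hypothetical: the unfused form first copies the selected rows out (standing in for the get_rows step), while the fused form reads `cache[ids[s]]` directly inside the loop. The per-row arithmetic is the same in both, in the same order, which is why the result is identical and only the temporary copy disappears.

```python
def conv_tap_unfused(cache, ids, weights):
    # Step 1: materialise the selected rows (stand-in for the separate gather).
    gathered = [list(cache[i]) for i in ids]
    # Step 2: run the op on the temporary copy.
    return [sum(x * w for x, w in zip(row, weights)) for row in gathered]


def conv_tap_fused(cache, ids, weights):
    # Read cache[ids[s]] directly inside the "kernel" loop; no temporary copy.
    return [
        sum(x * w for x, w in zip(cache[ids[s]], weights))
        for s in range(len(ids))
    ]
```

Both functions return the same list for any `cache`, `ids`, and `weights`; the toy dot product is a placeholder for the real conv-tap math, not a reproduction of it.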
### B. GENERAL-IN-DIRECTION but the MAGNITUDE was measured on the GB10 floor
Correct + beneficial everywhere, but the specific %/constants are GB10-bound.
- 0020 o_proj GDN MMVQ->MMQ reshape - collapses the GDN output to 2D so the
ssm_out matmul sees src1->ne[1]=128 and routes to MMQ (M=128 GEMM that amortizes
the weight read across 128 tokens) instead of MMVQ (built for batch<=8 with the
128 sequences stuck in ne[2]). Zero-cost view change, bit-identical, gated to the
gated-DeltaNet path. UNIVERSAL: MMVQ (mul_mat_vec_q) is structurally a batch<=8
GEMV and cannot amortize the weight read at a real M=128; MMQ does, on ALL CUDA
archs (dp4a pre-tensor-core still amortizes) and on HIP. So MMQ > MMVQ at M=128
is NOT GB10-specific - pure win wherever MMQ exists. RE-TUNE: the +31.7%
magnitude is on the GB10 BW floor; smaller % on sm_100 but still correct.
REGRESSION RISK: only at a genuinely tiny real M (single-stream decode n_seqs<=8)
could forcing MMQ be slower than MMVQ - see SAFETY #2. On Metal/Vulkan/SYCL the
MMVQ/MMQ split differs; the reshape is harmless (a view) but yields no benefit.
- 0023 MoE NVFP4 activation-quantize de-dup - for broadcast up/gate proj (ne11==1)
quantize the unique token activations once and gather the identical FP4 blocks
instead of re-quantizing per expert; bit-exact, disabled with ..._DEDUP=0.
DIRECTION-GENERAL (de-duplicating identical work is always good) but
NVFP4-block-layout specific (uint4 copy of block_fp4_mmq) and only matters where
activation-quant is a measurable decode bucket - on a compute-bound arch the
saved quant time may be off the critical path (even on GB10 the MoE TG win is
only +1.7%).
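
The 0020 reshape can be illustrated with a toy dispatch. This is a deliberate simplification under stated assumptions: `pick_kernel` routes on `ne[1]` alone with a threshold of 8 taken from the "batch<=8" wording above (the real ggml-cuda dispatch considers more than this), the pre-reshape shape assumes one token per sequence, and `K`, `pick_kernel`, and `collapse_to_2d` are hypothetical names.

```python
MMVQ_MAX_BATCH = 8  # assumption: taken from the "batch<=8" wording, not from ggml source


def pick_kernel(src1_ne):
    """Toy dispatch: route on src1 ne[1] only (the real dispatch checks more)."""
    return "MMVQ" if src1_ne[1] <= MMVQ_MAX_BATCH else "MMQ"


def collapse_to_2d(ne):
    """Fold ne[2] into ne[1]; a view change, the element count is unchanged."""
    ne0, ne1, ne2 = ne
    return (ne0, ne1 * ne2, 1)


K = 4096                 # hypothetical inner dimension
before = (K, 1, 128)     # assumed decode shape: 128 sequences stuck in ne[2]
after = collapse_to_2d(before)

print(pick_kernel(before), "->", pick_kernel(after))
```

In this toy model a single sequence `(K, 1, 1)` still lands on MMVQ after the collapse; whether the real dispatch behaves the same way at tiny M is exactly what SAFETY #2 asks to confirm.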
### C. GB10-TUNED (constants are GB10 winners; re-sweep per arch)
- 0022 GDN recurrence occupancy/coalescing retune - column-folding template params
NUM_WARPS/COLS_PER_WARP, default (16,8), env-selectable GDN_NW/GDN_CPW. The
reduction/FMA order is byte-identical (md5-gateable); only the warp/block->column
assignment changes to raise memory-level parallelism on a BANDWIDTH-BOUND kernel.
(16,8) is explicitly "the measured GB10 winner". Textbook per-arch tune: optimal
values depend on DRAM latency / L2 / SM count / occupancy. SAFE everywhere
(bit-exact, env-overridable, no forbidden float4 load) but unlikely optimal off
GB10; on a compute-bound arch (sm_100) the kernel may not even be the
bottleneck. Needs a per-arch GDN_NW/CPW sweep.
- 0017 FP4 dense-GEMM decode tile tune - shipped as a P0 bit-exact gate + DEFAULT-
OFF occupancy levers (GGML_CUDA_FP4_MMQ_Y / ..._MINBLOCKS / ..._DENSE_MMQ_X).
Honest GB10 verdict was a KILL-GATE: every cheap occupancy probe REGRESSED on
sm_121 (the M=128 tile is already weight-read optimal). Nothing on by default =>
byte-identical to stock everywhere. On a DIFFERENT FP4 arch (sm_100) the
kill-gate could flip; the levers are in place and inert, ready to re-sweep.
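
A per-arch re-sweep of the 0022 knobs could be driven by a harness along these lines. It is a sketch only: the candidate values are placeholders (which NUM_WARPS/COLS_PER_WARP pairs the kernel template actually instantiates is not stated here), and `run_bench` is a hypothetical callable that launches the benchmark with the given environment and returns tokens/s.

```python
import itertools


def sweep_configs(num_warps=(4, 8, 16, 32), cols_per_warp=(1, 2, 4, 8)):
    """Yield one env-override dict per (GDN_NW, GDN_CPW) candidate pair."""
    for nw, cpw in itertools.product(num_warps, cols_per_warp):
        yield {"GDN_NW": str(nw), "GDN_CPW": str(cpw)}


def best_config(run_bench, configs):
    """Return the env dict with the highest score.

    run_bench(env) -> float is supplied by the caller (hypothetical); it is
    expected to run the decode benchmark with env applied and return tokens/s.
    """
    results = [(run_bench(env), env) for env in configs]
    return max(results, key=lambda r: r[0])[1]
```

Because the change is bit-exact, the sweep only needs to compare throughput; the md5 gate can be rerun once on the winning pair.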
### D. Blackwell-PRECISION-specific (only meaningful where FP4-MMA exists)
- 0025 NVFP4 MoE-decode re-graph - keeps CUDA graphs on for the grouped
stream-k mul_mat_q id-path; env-gated LLAMA_MOE_FORCE_GRAPHS, default-off
byte-identical. The CUDA-graph mechanism is general, but the specific guard
condition (mmvq_mmid_max==8 for NVFP4 on sm_121) and the "graphs are safe here"
reasoning are tied to the NVFP4 grouped path on Blackwell. On a non-FP4 arch the
node would not take that branch -> inert.
- 0026 hybrid per-head SSM-state precision (bf16 SSM/conv cache) - adds
--cache-type-ssm/-conv + --ssm-bf16-tau (per-head f32-vs-bf16 by memory length).
Default f32 = bit-exact. PRECISION/bandwidth lever: bf16 halves the dominant GDN
decode byte stream, which only pays off on a BANDWIDTH-bound arch (GB10). On
sm_100 HBM3e it buys little. Value is bandwidth-floor specific; correctness is
precision-specific (opt-in, default-safe).
- NVFP4 GGUFs + the 6 gallery -paged rows - inherently Blackwell-precision-specific
for the FAST path: NVFP4 weights only get FP4-MMA on sm_120/121/100. Elsewhere
they run-via-dequant (correct, slow) per the gallery-targeting section above.
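
The bandwidth argument behind 0026 is simple arithmetic, sketched below. The two bandwidth figures are the ones quoted in this audit; `STATE_ELEMS` is a made-up state size for illustration, and the model counts a single pass over the state (a simplification, since a real decode step both reads and writes it).

```python
GB10_GB_S = 273.0      # LPDDR5x figure quoted in this audit
SM100_GB_S = 8000.0    # "~8 TB/s" HBM3e figure quoted in this audit


def stream_time_us(n_bytes: int, bandwidth_gb_s: float) -> float:
    """Lower bound on the time to stream n_bytes once at the given bandwidth."""
    return n_bytes / (bandwidth_gb_s * 1e9) * 1e6


def bf16_saving_us(state_elems: int, bandwidth_gb_s: float) -> float:
    """Time saved per pass by storing the state as bf16 (2 B) instead of f32 (4 B)."""
    f32_bytes = state_elems * 4
    bf16_bytes = state_elems * 2
    return stream_time_us(f32_bytes, bandwidth_gb_s) - stream_time_us(bf16_bytes, bandwidth_gb_s)


STATE_ELEMS = 32 * 1024 * 1024   # hypothetical state size, for illustration only
saving_gb10 = bf16_saving_us(STATE_ELEMS, GB10_GB_S)
saving_sm100 = bf16_saving_us(STATE_ELEMS, SM100_GB_S)
```

Whatever the state size, the ratio of the two savings is just the bandwidth ratio (8000 / 273, roughly 29x), so the absolute time bf16 buys back on sm_100 is a small fraction of what it buys on GB10.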
### Per-arch expected story
- Consumer Blackwell sm_120/121 (GB10 / dGPU): the validated target. dGPU sm_120
(GDDR7 ~1 TB/s) is less BW-starved than GB10 LPDDR5x 273 GB/s, so the
bandwidth-floor wins (0018/0019/0022/0026) shrink in % while the host-pipeline +
graph wins (0029/0025) and the MMQ reshape (0020) hold.
- Datacenter Blackwell sm_100 (HBM3e ~8 TB/s): FP4-MMA WORKS so NVFP4 stays fast
(precision bucket + 0025 carry over), BUT the BW floor is GONE -> compute-bound.
Re-tune: 0022 GDN_NW/CPW sweep; 0017 kill-gate may flip (levers ready). The
bandwidth-motivated wins (0018/0019, 0026 bf16-state) shrink toward neutral; the
host-pipeline/graph/MMQ-reshape general wins (0029/0025/0020) still help. Net:
it works and the GPU is faster, but it needs a re-tune pass; do NOT assume the GB10 constants.
- Hopper sm_90 / Ada sm_89 / Ampere sm_80-86: NO FP4-MMA. NVFP4 GGUFs + the FP4
levers (0017/0023/0025) are out of scope -> use a DIFFERENT quant (Q4_K/AWQ/GPTQ
etc). BUT the precision-agnostic infra still helps: paged KV core, scheduler
(0013/0016), burst-reclaim (0024), block-table cache (0029), SSM/conv
plumbing-removal (0018/0019/0021/0028); 0020 still routes the o_proj
MMVQ->MMQ in whatever quant it uses. Story: "no FP4 -> another quant, but paged +
SSM + scheduler infra is a pure win".
- Metal / CPU / AMD-ROCm / Intel-SYCL / Vulkan (all built by the matrix): no
NVFP4-MMA; paged KV + scheduler infra is the portable value. CPU has reference
kernels for every fused op (the bit-exact gate is CUDA0-vs-CPU). ROCm/HIP reuses
ggml-cuda (hipify) so it inherits the fused-op kernels. Metal/SYCL/Vulkan do NOT
get the new fused-op kernels (SAFETY #1).
### SAFETY / regression risks
1. Fused GDN/conv ops are CUDA + CPU only; emission is NOT backend-gated.
0018/0019/0021/0028 add new ops (ggml_gated_delta_net_inplace[_ids],
ggml_ssm_conv_update_inplace[_ids]) implemented for CUDA + CPU only. They are
emitted by the graph builder whenever cparams.fused_gdn_ar/fused_gdn_ch is set
(constructor sets fused_gdn_ch=true, auto_fgdn=true), with NO check that the
active compute backend is CUDA/HIP. Fine on CUDA/HIP/CPU. On Metal/SYCL/Vulkan
two failure modes: (a) the new GATED_DELTA_NET op has no kernel -> loud
supports_op/assert (but the base GDN op may already be CUDA/CPU-only upstream,
so a qwen35 model likely cannot run there regardless); (b) the fused conv
variant REUSES GGML_OP_SSM_CONV discriminated by a non-null src[3]/src[4] - a
backend that supports plain SSM_CONV but ignores the discriminator would compute
the WRONG plain conv -> SILENT corruption. That is the one genuine
silent-correctness risk. RECOMMENDATION: gate fused-op emission on the compute
backend being CUDA/HIP (or add a supports_op guard rejecting the discriminated
SSM_CONV where the fused handling is absent).
2. 0020 MMVQ->MMQ at tiny real M. MMQ is right at decode M=128 (the gallery
batched-serving regime); it would be wrong only at a genuine M<=8 (e.g.
single-stream decode, n_seqs=1). Bit-identical either way - only a potential perf regression
at tiny batch on non-GB10 archs (never the measured GB10 case). Worth confirming
the reshape still picks the better kernel at n_seqs=1 elsewhere.
3. 0022 default (16,8) off GB10: safe (bit-exact, env-overridable) but non-optimal;
do not ship it as the default for sm_100/Hopper/Ada without a GDN_NW/CPW sweep.
No correctness risk.
4. Gallery rows do not state a GPU-arch requirement (covered in the
gallery-targeting section): add a Blackwell (sm_120/121/100) recommended note.
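
The two mitigations recommended under risk #1 can be sketched as follows. This is a Python model, not ggml code: the backend names, both function names, and the boolean parameters are hypothetical, and the set of supporting backends is taken from this audit's own statement (CUDA, HIP, CPU).

```python
FUSED_OP_BACKENDS = {"CUDA", "HIP", "CPU"}  # backends this audit says have the fused kernels


def should_emit_fused(backend: str, fused_flag: bool) -> bool:
    """Option 1: the graph builder emits fused ops only on a supporting backend."""
    return fused_flag and backend in FUSED_OP_BACKENDS


def supports_op(op: str, has_src3_or_src4: bool, handles_fused_conv: bool) -> bool:
    """Option 2: a supports_op-style guard.

    A backend without fused handling must reject SSM_CONV when the
    discriminator (non-null src[3]/src[4]) is present, instead of silently
    running the plain conv.
    """
    if op == "SSM_CONV" and has_src3_or_src4 and not handles_fused_conv:
        return False
    return True
```

Either option turns failure mode (b) from silent corruption into an explicit fallback or rejection; plain SSM_CONV (no discriminator) remains accepted everywhere in this model.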
### One-line verdict
The PORTABLE core (paged KV 0001-0012, scheduler 0013/0016, burst-reclaim 0024,
block-table cache 0029, the SSM/conv plumbing-removal 0018/0019/0021/0028, the
o_proj MMQ reshape 0020) is arch-general and ships everywhere it compiles -
bit-exact, mostly default-safe, pure win or neutral. The FP4/NVFP4 levers
(0017/0023/0025, the GGUFs, the gallery) are Blackwell-precision-specific. The
occupancy/precision tunes (0017 levers, 0022, 0026) are GB10-bandwidth-floor-tuned
and need a per-arch re-sweep (especially on sm_100 where the BW floor is gone and
the regime flips to compute-bound). The single real SAFETY gap: the new fused
GDN/conv ops are CUDA+CPU-only with backend-ungated emission, so a Vulkan/SYCL/Metal
paged build of a gated-DeltaNet model could assert (GDN op) or silently miscompute
(discriminated SSM_CONV) - it should be compute-backend-gated.