From 5667dfe461b39eb7d166669b620094728572a010 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Sat, 27 Jun 2026 07:02:54 +0000
Subject: [PATCH] docs(paged): arch-generality audit - optimization classification (0017-0029)

Classify the paged-attention optimizations as arch-GENERAL (ship everywhere),
GB10-TUNED (per-arch retune), or Blackwell-precision-specific; add the per-arch
expected story (sm_100/Hopper/Ada/Metal/CPU) and the SAFETY gap (fused GDN/conv
ops are CUDA+CPU-only with backend-ungated emission). Extends the prior
build/gallery-targeting audit in the same file.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 .../patches/paged/ARCH_GENERALITY_AUDIT.md | 183 ++++++++++++++++++
 1 file changed, 183 insertions(+)

diff --git a/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md b/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md
index e5f3ce9a0..5050079fe 100644
--- a/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md
+++ b/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md
@@ -218,4 +218,187 @@
 description + tags. Recommend a one-line Blackwell-recommended hardware note +
 consistent `nvfp4`/`blackwell` tags on all six, and tempering the GB10 bench
 claims with the "runs slower off-Blackwell" caveat.
 
+## Section: optimization-generality (patches 0013/0016 + 0017-0029)
+
+Classifies each optimization as arch-GENERAL (ship everywhere, helps any arch),
+GB10-TUNED (needs per-arch retuning of the magnitude/constants), or
+Blackwell-PRECISION-specific (only meaningful where FP4-MMA exists). Read from the
+patch commit bodies + the diffs they touch; bit-exactness verdicts are the
+patches' own md5/test-backend-ops gates.
+
+Arch axis used: NVFP4 FP4-MMA needs `BLACKWELL_MMA_AVAILABLE` (sm_120/121 consumer +
+sm_100 datacenter); Hopper sm_90 / Ada sm_89 / Ampere sm_80-86 have none;
+Metal/CPU/AMD/Intel have no NVFP4-MMA. Datacenter Blackwell sm_100 has FP4-MMA but
+HBM3e (~8 TB/s) so it is COMPUTE-bound, not bandwidth-bound: every GB10
+"bandwidth-bound" verdict inverts there. The FP4-MMA kernel itself is UPSTREAM
+ggml-cuda gated by `BLACKWELL_MMA_AVAILABLE`; none of these patches add it - they
+reshape/route/dedup around it (0017/0020/0023/0025) or are precision-agnostic.
+
+### A. ARCH-GENERAL (ship everywhere; pure win or provably neutral)
+
+Graph-shape, host-side, or gather/copy-elimination changes. No FP4, no
+bandwidth-floor assumption. Bit-exact. Help or are neutral on any arch that runs
+the code path.
+
+- 0013 decoupled prefill-token budget - pure `update_slots()` scheduler policy,
+  zero libllama/ggml change, orthogonal to LLAMA_KV_PAGED, default-off
+  byte-identical. Latency/fairness lever (flattens decode-ITL spike from a
+  co-batched long prefill). No arch assumption.
+- 0016 dynamic decode-first prefill budget - supersedes 0013; still pure
+  `update_slots()` policy, default-off byte-identical, T==n_batch degenerate case
+  == stock. Arch-neutral, identical paged on/off.
+- 0024 paged-pool burst-reclaim - host-side block accounting + defrag + slot
+  release; never touches KV values or compute. Gated behind LLAMA_KV_PAGED. Fixes
+  a real fragmentation/throughput-collapse bug on long-lived servers.
+  Arch-independent host bookkeeping.
+- 0029 block-table within-step host cache - memcpy-reuse of the host block table
+  across full-attention layers in one step; bit-exact, LLAMA_PAGED_NO_BT_CACHE=1
+  off. Helps host-bound decode (dense +2.7% on GB10), neutral when compute-bound
+  (MoE flat). The faster the GPU (e.g. sm_100), the MORE host-bound decode is, so
+  the BIGGER this win elsewhere.
+- 0028 recurrent-state (conv-tap) gather fusion - eliminates a k_get_rows by
+  reading cache[ids[s]] in-kernel; bit-identical. Deleting a gather vLLM has no
+  equivalent of is a win on ANY arch running the GDN path; not FP4, not
+  bandwidth-floor specific.
+- 0018 in-place SSM-state write-back + 0019 fused SSM-state gather + 0021
+  conv-state in-place fusion - remove a D2D state copy-back (0018), a state
+  get_rows (0019), and the 4-op conv chain + ring-state copy (0021), mirroring
+  vLLM's in-place recurrent update. Arithmetic byte-identical; what is removed is
+  plumbing, so wins on ANY arch running the gated-DeltaNet recurrence.
+- Paged KV core (0001-0012) - paged KV manager, on-demand alloc, prefix caching,
+  in-kernel paged read. No precision or bandwidth-floor assumption; the most
+  portable part of the work, helps capacity/serving anywhere it compiles.
+
+NOTE: 0018/0019/0021/0028 + the base GDN op have CUDA + CPU kernels ONLY (every
+gate is "CUDA0 vs CPU"). General within {CUDA, HIP/ROCm (hipified ggml-cuda), CPU};
+NOT covered on Metal/SYCL/Vulkan - see SAFETY #1.
+
+### B. GENERAL-IN-DIRECTION but the MAGNITUDE was measured on the GB10 floor
+
+Correct + beneficial everywhere, but the specific %/constants are GB10-bound.
+
+- 0020 o_proj GDN MMVQ->MMQ reshape - collapses the GDN output to 2D so the
+  ssm_out matmul sees src1->ne[1]=128 and routes to MMQ (M=128 GEMM that amortizes
+  the weight read across 128 tokens) instead of MMVQ (built for batch<=8 with the
+  128 sequences stuck in ne[2]). Zero-cost view change, bit-identical, gated to the
+  gated-DeltaNet path. UNIVERSAL: MMVQ (mul_mat_vec_q) is structurally a batch<=8
+  GEMV and cannot amortize the weight read at a real M=128; MMQ does, on ALL CUDA
+  archs (dp4a pre-tensor-core still amortizes) and on HIP. So MMQ > MMVQ at M=128
+  is NOT GB10-specific - pure win wherever MMQ exists. RE-TUNE: the +31.7%
+  magnitude is on the GB10 BW floor; smaller % on sm_100 but still correct.
+  REGRESSION RISK: only at a genuinely tiny real M (single-stream decode n_seqs<=8)
+  could forcing MMQ be slower than MMVQ - see SAFETY #2. On Metal/Vulkan/SYCL the
+  MMVQ/MMQ split differs; the reshape is harmless (a view) but yields no benefit.
+- 0023 MoE NVFP4 activation-quantize de-dup - for broadcast up/gate proj (ne11==1)
+  quantize the unique token activations once and gather the identical FP4 blocks
+  instead of re-quantizing per expert; bit-exact, ..._DEDUP=0 off.
+  DIRECTION-GENERAL (de-duplicating identical work is always good) but
+  NVFP4-block-layout specific (uint4 copy of block_fp4_mmq) and only matters where
+  activation-quant is a measurable decode bucket - on a compute-bound arch the
+  saved quant time may be off the critical path (even on GB10 the MoE TG win is
+  only +1.7%).
+
+### C. GB10-TUNED (constants are GB10 winners; re-sweep per arch)
+
+- 0022 GDN recurrence occupancy/coalescing retune - column-folding template params
+  NUM_WARPS/COLS_PER_WARP, default (16,8), env-selectable GDN_NW/GDN_CPW. The
+  reduction/FMA order is byte-identical (md5-gateable); only the warp/block->column
+  assignment changes to raise memory-level parallelism on a BANDWIDTH-BOUND kernel.
+  (16,8) is explicitly "the measured GB10 winner". Textbook per-arch tune: optimal
+  values depend on DRAM latency / L2 / SM count / occupancy. SAFE everywhere
+  (bit-exact, env-overridable, no forbidden float4 load) but unlikely optimal off
+  GB10; on a compute-bound arch (sm_100) the kernel may not even be the
+  bottleneck. Needs a per-arch GDN_NW/CPW sweep.
+- 0017 FP4 dense-GEMM decode tile tune - shipped as a P0 bit-exact gate +
+  DEFAULT-OFF occupancy levers (GGML_CUDA_FP4_MMQ_Y / ..._MINBLOCKS /
+  ..._DENSE_MMQ_X). Honest GB10 verdict was a KILL-GATE: every cheap occupancy
+  probe REGRESSED on sm_121 (the M=128 tile is already weight-read optimal).
+  Nothing on by default => byte-identical to stock everywhere. On a DIFFERENT FP4
+  arch (sm_100) the kill-gate could flip; the levers are in place and inert,
+  ready to re-sweep.
+
+### D. Blackwell-PRECISION-specific (only meaningful where FP4-MMA exists)
+
+- 0025 MoE NVFP4 MoE-decode re-graph - keeps CUDA graphs on for the grouped
+  stream-k mul_mat_q id-path; env-gated LLAMA_MOE_FORCE_GRAPHS, default-off
+  byte-identical. The CUDA-graph mechanism is general, but the specific guard
+  condition (mmvq_mmid_max==8 for NVFP4 on sm_121) and the "graphs are safe here"
+  reasoning are tied to the NVFP4 grouped path on Blackwell. On a non-FP4 arch the
+  node would not take that branch -> inert.
+- 0026 hybrid per-head SSM-state precision (bf16 SSM/conv cache) - adds
+  --cache-type-ssm/-conv + --ssm-bf16-tau (per-head f32-vs-bf16 by memory length).
+  Default f32 = bit-exact. PRECISION/bandwidth lever: bf16 halves the dominant GDN
+  decode byte stream, which only pays off on a BANDWIDTH-bound arch (GB10). On
+  sm_100 HBM3e it buys little. Value is bandwidth-floor specific; correctness is
+  precision-specific (opt-in, default-safe).
+- NVFP4 GGUFs + the 6 gallery -paged rows - inherently Blackwell-precision-specific
+  for the FAST path: NVFP4 weights only get FP4-MMA on sm_120/121/100. Elsewhere
+  they run-via-dequant (correct, slow) per the gallery-targeting section above.
+
+### Per-arch expected story
+
+- Consumer Blackwell sm_120/121 (GB10 / dGPU): the validated target. dGPU sm_120
+  (GDDR7 ~1 TB/s) is less BW-starved than GB10 LPDDR5x 273 GB/s, so the
+  bandwidth-floor wins (0018/0019/0022/0026) shrink in % while the host-pipeline +
+  graph wins (0029/0025) and the MMQ reshape (0020) hold.
+- Datacenter Blackwell sm_100 (HBM3e ~8 TB/s): FP4-MMA WORKS so NVFP4 stays fast
+  (precision bucket + 0025 carry over), BUT the BW floor is GONE -> compute-bound.
+  Re-tune: 0022 GDN_NW/CPW sweep; 0017 kill-gate may flip (levers ready). The
+  bandwidth-motivated wins (0018/0019, 0026 bf16-state) shrink toward neutral; the
+  host-pipeline/graph/MMQ-reshape general wins (0029/0025/0020) still help. Net:
+  works, faster GPU, needs a re-tune pass, do NOT assume the GB10 constants.
+- Hopper sm_90 / Ada sm_89 / Ampere sm_80-86: NO FP4-MMA. NVFP4 GGUFs + the FP4
+  levers (0017/0023/0025) are out of scope -> use a DIFFERENT quant (Q4_K/AWQ/GPTQ
+  etc). BUT the precision-agnostic infra still helps: paged KV core, scheduler
+  (0013/0016), burst-reclaim (0024), block-table cache (0029), SSM/conv
+  plumbing-removal (0018/0019/0021/0028); 0020 still routes the o_proj
+  MMVQ->MMQ in whatever quant it uses. Story: "no FP4 -> another quant, but paged +
+  SSM + scheduler infra is a pure win".
+- Metal / CPU / AMD-ROCm / Intel-SYCL / Vulkan (all built by the matrix): no
+  NVFP4-MMA; paged KV + scheduler infra is the portable value. CPU has reference
+  kernels for every fused op (the bit-exact gate is CUDA0-vs-CPU). ROCm/HIP reuses
+  ggml-cuda (hipify) so it inherits the fused-op kernels. Metal/SYCL/Vulkan do NOT
+  get the new fused-op kernels (SAFETY #1).
+
+### SAFETY / regression risks
+
+1. Fused GDN/conv ops are CUDA + CPU only; emission is NOT backend-gated.
+   0018/0019/0021/0028 add new ops (ggml_gated_delta_net_inplace[_ids],
+   ggml_ssm_conv_update_inplace[_ids]) implemented for CUDA + CPU only. They are
+   emitted by the graph builder whenever cparams.fused_gdn_ar/fused_gdn_ch is set
+   (constructor sets fused_gdn_ch=true, auto_fgdn=true), with NO check that the
+   active compute backend is CUDA/HIP. Fine on CUDA/HIP/CPU. On Metal/SYCL/Vulkan
+   two failure modes: (a) the new GATED_DELTA_NET op has no kernel -> loud
+   supports_op/assert (but the base GDN op may already be CUDA/CPU-only upstream,
+   so a qwen35 model likely cannot run there regardless); (b) the fused conv
+   variant REUSES GGML_OP_SSM_CONV discriminated by a non-null src[3]/src[4] - a
+   backend that supports plain SSM_CONV but ignores the discriminator would compute
+   the WRONG plain conv -> SILENT corruption. That is the one genuine
+   silent-correctness risk. RECOMMENDATION: gate fused-op emission on the compute
+   backend being CUDA/HIP (or add a supports_op guard rejecting the discriminated
+   SSM_CONV where the fused handling is absent).
+2. 0020 MMVQ->MMQ at tiny real M. MMQ is right at decode M=128 (the gallery
+   batched-serving regime); it would be wrong only at a genuine M<=8 (single-stream
+   decode, n_seqs=1). Bit-identical either way - only a potential perf regression
+   at tiny batch on non-GB10 archs (never the measured GB10 case). Worth confirming
+   the reshape still picks the better kernel at n_seqs=1 elsewhere.
+3. 0022 default (16,8) off GB10: safe (bit-exact, env-overridable) but non-optimal;
+   do not ship it as the default for sm_100/Hopper/Ada without a GDN_NW/CPW sweep.
+   No correctness risk.
+4. Gallery rows do not state a GPU-arch requirement (covered in the
+   gallery-targeting section): add a Blackwell (sm_120/121/100) recommended note.
+
+### One-line verdict
+
+The PORTABLE core (paged KV 0001-0012, scheduler 0013/0016, burst-reclaim 0024,
+block-table cache 0029, the SSM/conv plumbing-removal 0018/0019/0021/0028, the
+o_proj MMQ reshape 0020) is arch-general and ships everywhere it compiles -
+bit-exact, mostly default-safe, pure win or neutral. The FP4/NVFP4 levers
+(0017/0023/0025, the GGUFs, the gallery) are Blackwell-precision-specific. The
+occupancy/precision tunes (0017 levers, 0022, 0026) are GB10-bandwidth-floor-tuned
+and need a per-arch re-sweep (especially on sm_100 where the BW floor is gone and
+the regime flips to compute-bound). The single real SAFETY gap: the new fused
+GDN/conv ops are CUDA+CPU-only with backend-ungated emission, so a Vulkan/SYCL/Metal
+paged build of a gated-DeltaNet model could assert (GDN op) or silently miscompute
+(discriminated SSM_CONV) - it should be compute-backend-gated.
+
Assisted-by: Claude:opus-4.8 [Claude Code]
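
Reviewer sketch for SAFETY #1 (not part of the patch): the recommended gating, modeled in plain Python rather than ggml code. All names here (`Op`, `FUSED_CAPABLE_BACKENDS`, `is_fused_conv`, `supports_op`, `should_emit_fused`) are hypothetical and do not exist in ggml or llama.cpp. Treating "either src[3] or src[4] is non-null" as the discriminator is an assumption drawn from the "non-null src[3]/src[4]" wording in the audit; the real check may differ.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical model of the audit's recommendation; NOT the ggml API.
# Backends the audit lists as having the fused kernels (CUDA, HIP via hipify, CPU).
FUSED_CAPABLE_BACKENDS = frozenset({"cuda", "hip", "cpu"})


@dataclass(frozen=True)
class Op:
    """Toy graph node: an op name plus its source slots (None = unset)."""
    name: str
    srcs: Tuple[Optional[str], ...] = ()


def is_fused_conv(op: Op) -> bool:
    # Assumption: the fused variant reuses the SSM_CONV op code and is
    # recognised by a non-null src[3] or src[4].
    return op.name == "SSM_CONV" and any(s is not None for s in op.srcs[3:5])


def supports_op(backend: str, op: Op) -> bool:
    """Guard option: reject the discriminated SSM_CONV where fused handling
    is absent, so the scheduler fails loudly instead of running a plain conv.
    Simplification: every other op is reported as supported."""
    if is_fused_conv(op):
        return backend in FUSED_CAPABLE_BACKENDS
    return True


def should_emit_fused(backend: str, fused_flag: bool) -> bool:
    """Emission option: the graph builder emits fused ops only when the flag
    is set AND the compute backend has the kernels."""
    return fused_flag and backend in FUSED_CAPABLE_BACKENDS


plain = Op("SSM_CONV", ("x", "w"))
fused = Op("SSM_CONV", ("x", "w", None, "state", "ids"))
```

Either option closes the silent-corruption path: with the emission gate a Metal/SYCL/Vulkan build never sees the fused node, and with the guard it sees the node but refuses it. The two are complementary, and the guard also covers graphs built by code that bypasses the emission check.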