docs(paged): cross-arch synthesis - ship verdict + minimum non-Blackwell fixes

Synthesizes the four ARCH_GENERALITY_AUDIT sections (build-matrix, gguf-gallery-targeting, optimization-generality, patch-arch-safety) into a single cross-arch ship decision: build-safety table per target, every patch bucketed (SAFE-EVERYWHERE / BLACKWELL-ONLY-clean-fallback / GB10-TUNED / RISKY), the NVFP4 gallery recommendation, a per-arch roadmap ranked by value/effort, the empirical-verification matrix (GB10 + M4 cover all but non-Blackwell NVIDIA), and the ship verdict. Verdict: SAFE to ship as Blackwell/Linux today; the build is arch-general (no GB10 pin; FP4 code is default-off + #if-guarded) and NVFP4 GGUFs run everywhere via dequant. The one hard prerequisite before extending paged to Metal/Vulkan/SYCL is closing the backend-ungated, default-on fused GDN/conv op emission (discriminated GGML_OP_SSM_CONV via non-null src[3], CUDA+CPU only, no supports_op guard) - latent on current Linux targets, silent miscompute on a future non-CUDA paged build of a gated-DeltaNet model. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 09:57:14 -04:00 · 2026-06-27 07:07:20 +00:00
parent 87cfd1fadb
commit af6e133759
1 changed files with 172 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md
+++ b/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md
@@ -494,4 +494,176 @@ default-off and `#if`-guarded. The sole non-build hazard is the latent
 discriminated-SSM_CONV silent-miscompute on a hypothetical Vulkan/SYCL/Metal GDN
 run, which should be closed by compute-backend-gating the fused-op emission.

+## Section: CROSS-ARCH SYNTHESIS (final verdict)
+
+Consolidates the four audit sections above into a single ship decision. The arch
+axis: NVFP4 FP4-MMA requires `BLACKWELL_MMA_AVAILABLE` = sm_120/121 (consumer
+Blackwell, GB10/RTX-50) + sm_100 (datacenter Blackwell). sm_90 Hopper / sm_89 Ada
+/ sm_80-86 Ampere = NO FP4-MMA. Metal/CPU/AMD/Intel = no NVFP4-MMA. GB10's wins
+are dominated by the LPDDR5x ~273 GB/s bandwidth floor; sm_100 has FP4-MMA but
+HBM3e ~8 TB/s so it is COMPUTE-bound and every "bandwidth-bound" GB10 verdict
+inverts there.
+
+### 1. BUILD SAFETY: does it build + run WITHOUT CRASHING off-Blackwell?
+
+YES on every target it builds for, with ONE latent silent-correctness hazard
+(not a crash) to close before claiming non-Blackwell support. The build is NOT
+GB10-pinned: there is no explicit CUDA arch list anywhere in the paged path
+(`CUDA_DOCKER_ARCH` empty in every matrix row, identical to stock llama-cpp), so
+the CUDA TUs compile the full upstream ggml arch fan and the NVFP4 FP4-MMA path
+is gated INSIDE the kernel by `BLACKWELL_MMA_AVAILABLE`, never by the matrix.
+
+| target | builds? | runs? | notes |
+|--------|---------|-------|-------|
+| CUDA sm_80/86/89/90 (Ampere/Ada/Hopper) | YES | YES | 0017 Blackwell code `#if`-stripped + default-off; all other device code portable. qwen35 hybrid models run (GDN + ssm_conv_update + gather have non-Blackwell kernels). NVFP4 GGUFs run via dequant/DP4A; FP4 levers inert, not broken. |
+| CUDA sm_100 (datacenter Blackwell, HBM3e) | YES | YES | every lever active + bit-exact; GB10-tuned launch defaults are correct but compute-bound regime => safe-but-suboptimal (re-sweep, do not assume GB10 constants). |
+| CPU (amd64 + arm64) | YES | YES | every new op ships a CPU mirror under the reused enums; host patches portable C++. |
+| ROCm/HIP, Intel SYCL, Vulkan | YES | partial | .cu hipifies cleanly (no Blackwell intrinsic outside `#if`; 0022's 512-thread launch within AMD limits). SYCL/Vulkan don't compile the .cu and lack the GDN op, so qwen35 hybrid models assert/fall back rather than run; classic non-qwen35 models unaffected. |
+| Metal / macOS | NOT BUILT | N/A | no `includeDarwin` row, no `metal:` capability key. Mac selection of this backend falls back to `default`=cpu (a Linux image) and does NOT run; no auto-fallthrough to stock llama-cpp. |
+
+No patch in 0017-0029 breaks a non-Blackwell CUDA build, the CPU build, or the
+ROCm/SYCL/Vulkan builds. The only thing that is not merely "suboptimal" is the
+fused-conv silent-miscompute hazard (item RISKY-1 below), and even that is latent
+because the co-emitted GDN op asserts first on the backends that lack it.
+
+### 2. EVERY patch/opt, four buckets
+
+SAFE-EVERYWHERE (ship as-is; bit-exact or default-off byte-identical; pure win or
+neutral on any arch that runs the path):
+- 0001-0012 paged KV core (manager, on-demand alloc, prefix caching, in-kernel paged read)
+- 0013 / 0016 prefill-token budget scheduler (pure `update_slots()` policy, default-off byte-identical)
+- 0018 in-place SSM-state write-back  (CUDA+CPU; see RISKY-1 for backend coverage)
+- 0019 fused SSM-state gather          (CUDA+CPU)
+- 0021 conv-state in-place fusion      (CUDA+CPU)
+- 0028 recurrent-state (conv-tap) gather fusion (CUDA+CPU)
+- 0020 o_proj GDN MMVQ->MMQ reshape (zero-cost view, bit-identical; MMQ>MMVQ at M=128 is universal; magnitude GB10-bound, perf-only caveat at tiny real M=1, see RISKY-2)
+- 0024 paged-pool burst-reclaim (pure host C++; fixes a real long-server fragmentation collapse)
+- 0029 block-table within-step host cache (host memcpy reuse, bit-exact; bigger win the FASTER the GPU, i.e. MORE host-bound decode elsewhere)
+
+BLACKWELL-ONLY, CLEAN FALLBACK (only meaningful where FP4-MMA exists; provably
+inert/byte-identical elsewhere, never a break):
+- 0017 FP4 dense-GEMM decode tile tune - levers `#if BLACKWELL_MMA_AVAILABLE` + `blackwell_mma_available(cc)` + `type==NVFP4`, ALL default-off => default build byte-identical to stock on every arch
+- 0023 MoE NVFP4 activation-quant de-dup - plain uint4 copy kernel reached ONLY inside the pre-existing `if (use_native_fp4)` branch (false off-Blackwell); never executes there
+- 0025 MoE NVFP4 decode re-graph - host-side CUDA-graph guard, env-gated `LLAMA_MOE_FORCE_GRAPHS` default-off; the NVFP4-grouped guard predicate is inert on non-FP4
+- NVFP4 GGUFs + 6 gallery rows - FAST path is sm_120/121/100 only; elsewhere run-via-dequant (correct, slow), never a load/run gate
+
+GB10-TUNED (works + safe everywhere, but the constants/magnitude are GB10
+bandwidth-floor winners; re-sweep per arch, no correctness risk):
+- 0022 GDN recurrence occupancy retune - column-fold default (16,8)=512thr/MIN_BLOCKS=2, bit-exact, env-overridable GDN_NW/GDN_CPW; within the 1024-thread limit on sm_70..120 + AMD. Optimal values depend on DRAM latency/L2/SM-count; on a compute-bound arch the kernel may not be the bottleneck.
+- 0026 bf16 per-head SSM/conv cache - default f32 bit-exact (opt-in `--cache-type-ssm/-conv`); bf16 only pays off on a bandwidth-bound arch, buys little on sm_100 HBM3e. bf16 is STORAGE-only (fill_kernel<nv_bfloat16>), the bf16-compute SSM plan is shelved so STATE_T stays f32 on the hot path.
+- 0017 / 0023 magnitudes (the % wins, not the gating) are also GB10-floor-bound.
+
+RISKY (fix before claiming non-Blackwell ship; neither is a crash, one is silent):
+- RISKY-1 (the one real gap) fused GDN/conv ops are CUDA+CPU-only with
+  backend-UNGATED, DEFAULT-ON emission. Confirmed: `cparams.fused_gdn_ch = true`
+  and `auto_fgdn = true` in the `llama_context` constructor; emission fires on
+  `(n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar)` with NO compute-backend
+  check. The fused conv variant REUSES `GGML_OP_SSM_CONV` discriminated by a
+  non-null `src[3]` (verified: CUDA `if (dst->src[3] != nullptr)` branch at the
+  top of `ggml_cuda_op_ssm_conv`, CPU mirror in ops.cpp, NO supports_op guard). A
+  backend that supports plain SSM_CONV but ignores `src[3]` would compute the
+  WRONG plain conv => SILENT corruption. Latent today only because the co-emitted
+  fork-custom GDN op is CUDA/CPU-only, so on Vulkan/SYCL the GDN node asserts
+  first and the qwen35 hybrid model cannot run there anyway, and Metal is not
+  built. FIX: gate fused-op emission on a CUDA/HIP compute backend, OR add a
+  supports_op guard that rejects the discriminated SSM_CONV where fused handling
+  is absent. This is the single thing that could miscompute silently; close it
+  before a Vulkan/SYCL/Metal paged build of a gated-DeltaNet model is ever shipped.
+- RISKY-2 (perf-only, not correctness) 0020 forces MMQ; at a genuine single-stream
+  decode M<=8 (n_seqs=1) MMQ could be slower than MMVQ off the GB10 batched
+  regime. Bit-identical either way. Confirm the reshape still picks the better
+  kernel at n_seqs=1 on non-GB10 archs.
+
+### 3. NVFP4-GGUF + gallery targeting recommendation
+
+Do NOT hardware-gate the entries (and you cannot: LocalAI has no microarch-gating
+primitive - `tags:` are free-text/search-only, `ModelConfig` has no
+hardware/requirements field, and backend `capabilities:` resolves by accelerator
+FAMILY only, serving `nvidia: cuda12-...-paged` to ANY NVIDIA GPU with no
+sub-nvidia resolution). The GGUFs run correctly everywhere via dequant, so the
+risk is PERFORMANCE-EXPECTATION, not correctness; a hard gate would wrongly block
+valid (slow) use. Recommended, in order:
+1. (zero-code, do now) Lead each of the 6 descriptions with one honest line:
+   "Hardware: Blackwell (RTX 50-series / GB10 / B200) recommended; runs on other
+   NVIDIA/CPU via NVFP4 dequant but WITHOUT FP4-MMA and below the quoted GB10
+   throughput." Temper the "90-117% of vLLM" claims with that caveat (those are
+   LPDDR5x-bandwidth-bound specific).
+2. (zero-code) Tag all six consistently with `nvfp4` + a new `blackwell` chip. The
+   four Qwopus/Qwen-MTP entries currently carry only `[llm, gguf]` and are not even
+   discoverable as NVFP4 despite being NVFP4 GGUFs - secondary correctness-of-metadata gap.
+3. (feature, later) A structured `recommended_hardware` field surfaced by the React
+   import dialog is the only way to express this machine-readably; it does not exist.
+
+### 4. Per-arch roadmap (ranked by value / effort)
+
+- sm_100 datacenter Blackwell - HIGH value, MEDIUM effort. FP4-MMA works so NVFP4
+  stays fast and the precision bucket (0017/0023/0025) carries over, but the BW
+  floor is gone => compute-bound. Needs: re-sweep 0022 GDN_NW/CPW; re-evaluate the
+  0017 kill-gate (levers ready, may flip); expect 0018/0019/0026 bandwidth wins to
+  shrink toward neutral while 0029/0025/0020 host/graph/MMQ wins still help. No
+  code change to be SAFE; a tuning pass to be OPTIMAL.
+- Metal / macOS - MEDIUM value, MEDIUM effort. Add the `includeDarwin`
+  `-metal-darwin-arm64-llama-cpp-localai-paged` row + a `metal:` capability key
+  (changed-backends.js already anticipates the source path). Delivers paged-KV +
+  scheduler value only (no NVFP4-MMA on Metal); still a strict win over today's
+  broken Mac selection. MUST also land RISKY-1 first (Metal would otherwise hit the
+  discriminated-SSM_CONV path if it ever gained an SSM_CONV kernel without the
+  discriminator).
+- CPU - LOW effort, already works. Reference kernels exist for every fused op;
+  paged KV + scheduler + reclaim are the portable value. Nothing to do.
+- Hopper sm_90 / Ada sm_89 / Ampere sm_80-86 - MEDIUM value, LOW effort (no FP4
+  work). No FP4-MMA => pair the precision-agnostic infra (paged KV, 0013/0016,
+  0024, 0029, 0018/0019/0021/0028, 0020) with a DIFFERENT quant (Q4_K/AWQ/GPTQ).
+  Messaging: "no NVFP4 here, use another quant, but paged + SSM + scheduler infra
+  is a pure win". The GGUFs/gallery rows are out of scope for these.
+
+### 5. What MUST be empirically verified (and on what hardware)
+
+- GB10 (sm_121, user has it): the validated target; already measured. Re-confirm
+  bit-exactness gates after RISKY-1 fix.
+- M4 Mac (user has it): (a) once an `includeDarwin` paged row exists, verify the
+  Metal build compiles + a NON-qwen35 model runs (paged KV path); (b) verify a
+  qwen35 hybrid model on Metal EITHER asserts loudly OR is correct - it must NOT
+  silently miscompute the discriminated SSM_CONV. This is the direct test of
+  RISKY-1 on real Metal. Do this BEFORE shipping a Metal paged build. Also verify
+  CPU correctness of every fused op on the Mac (arm64 CPU mirror).
+- non-Blackwell NVIDIA (sm_80/86/89/90 - user would need to ACQUIRE, e.g. cloud
+  A100/L4/L40S/H100): verify (a) the cuda12/cuda13 paged image runs a qwen35
+  hybrid model correctly (GDN + ssm_conv_update + gather non-Blackwell kernels),
+  (b) NVFP4 GGUFs load + produce correct output via dequant/DP4A (not garbage),
+  (c) RISKY-2: that 0020's forced MMQ does not regress single-stream (n_seqs=1)
+  decode latency vs MMVQ. This is the only bucket needing hardware acquisition;
+  everything else is covered by the GB10 + M4 the user already has.
+- sm_100 (datacenter Blackwell - cloud B200 if a tuning pass is wanted): only
+  needed to make sm_100 OPTIMAL, not to make it SAFE. Defer unless targeting it.
+
+### 6. SHIP DECISION
+
+SAFE TO SHIP TODAY as a Blackwell-targeted backend on Linux. The build is
+arch-general (same arch fan + variant set as stock llama-cpp), every targeted
+Linux variant builds and runs, and all Blackwell-specific code is default-off +
+`#if`-guarded so a non-Blackwell build is byte-identical to stock on the FP4 path.
+The NVFP4 GGUFs run everywhere via dequant (correct, slower), so broad gallery
+exposure is a performance-expectation issue, not a correctness one.
+
+MINIMUM to not break / mislead other archs:
+1. (correctness, before ANY Vulkan/SYCL/Metal paged build of a gated-DeltaNet
+   model) Close RISKY-1: compute-backend-gate the fused GDN/conv op emission, or
+   add a supports_op guard rejecting the discriminated SSM_CONV. This is the only
+   hard requirement; it is latent on the current Linux targets but becomes live
+   the moment a Metal/Vulkan/SYCL paged build of qwen35 exists.
+2. (availability, zero-risk) Add the `includeDarwin` paged row + `metal:` key so
+   Mac users get a working (paged-KV-only) build instead of a non-running
+   default=cpu selection with no fallthrough to stock.
+3. (expectation, zero-code) Add the Blackwell-recommended hardware note + the
+   "runs slower off-Blackwell via dequant" caveat to the 6 gallery descriptions
+   and tag all six `nvfp4` + `blackwell`.
+4. (perf, verify don't block) Confirm 0020 does not regress n_seqs=1 decode on
+   non-GB10 NVIDIA; if it does, gate the MMVQ->MMQ reshape on a real-M threshold.
+
+Items 2-4 do not block a Linux Blackwell ship. Item 1 blocks only a future
+non-CUDA paged build of a gated-DeltaNet model; on the current build targets the
+hazard is latent (the GDN op asserts first). Net: ship for Blackwell/Linux now;
+land item 1 before extending paged to Metal/Vulkan/SYCL.
+
 Assisted-by: Claude:opus-4.8 [Claude Code]