From af6e13375919e6c3bf526c5112994b0164f3082f Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Sat, 27 Jun 2026 07:07:20 +0000 Subject: [PATCH] docs(paged): cross-arch synthesis - ship verdict + minimum non-Blackwell fixes Synthesizes the four ARCH_GENERALITY_AUDIT sections (build-matrix, gguf-gallery-targeting, optimization-generality, patch-arch-safety) into a single cross-arch ship decision: build-safety table per target, every patch bucketed (SAFE-EVERYWHERE / BLACKWELL-ONLY-clean-fallback / GB10-TUNED / RISKY), the NVFP4 gallery recommendation, a per-arch roadmap ranked by value/effort, the empirical-verification matrix (GB10 + M4 cover all but non-Blackwell NVIDIA), and the ship verdict. Verdict: SAFE to ship as Blackwell/Linux today; the build is arch-general (no GB10 pin; FP4 code is default-off + #if-guarded) and NVFP4 GGUFs run everywhere via dequant. The one hard prerequisite before extending paged to Metal/Vulkan/SYCL is closing the backend-ungated, default-on fused GDN/conv op emission (discriminated GGML_OP_SSM_CONV via non-null src[3], CUDA+CPU only, no supports_op guard) - latent on current Linux targets, silent miscompute on a future non-CUDA paged build of a gated-DeltaNet model. Signed-off-by: Ettore Di Giacinto --- .../patches/paged/ARCH_GENERALITY_AUDIT.md | 172 ++++++++++++++++++ 1 file changed, 172 insertions(+) diff --git a/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md b/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md index 039d2356e..ab72bf9a4 100644 --- a/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md +++ b/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md @@ -494,4 +494,176 @@ default-off and `#if`-guarded. The sole non-build hazard is the latent discriminated-SSM_CONV silent-miscompute on a hypothetical Vulkan/SYCL/Metal GDN run, which should be closed by compute-backend-gating the fused-op emission. +## Section: CROSS-ARCH SYNTHESIS (final verdict) + +Consolidates the four audit sections above into a single ship decision. The arch +axis: NVFP4 FP4-MMA requires `BLACKWELL_MMA_AVAILABLE` = sm_120/121 (consumer +Blackwell, GB10/RTX-50) + sm_100 (datacenter Blackwell). sm_90 Hopper / sm_89 Ada +/ sm_80-86 Ampere = NO FP4-MMA. Metal/CPU/AMD/Intel = no NVFP4-MMA. GB10's wins +are dominated by the LPDDR5x ~273 GB/s bandwidth floor; sm_100 has FP4-MMA but +HBM3e ~8 TB/s so it is COMPUTE-bound and every "bandwidth-bound" GB10 verdict +inverts there. + +### 1. BUILD SAFETY: does it build + run WITHOUT CRASHING off-Blackwell? + +YES on every target it builds for, with ONE latent silent-correctness hazard +(not a crash) to close before claiming non-Blackwell support. The build is NOT +GB10-pinned: there is no explicit CUDA arch list anywhere in the paged path +(`CUDA_DOCKER_ARCH` empty in every matrix row, identical to stock llama-cpp), so +the CUDA TUs compile the full upstream ggml arch fan and the NVFP4 FP4-MMA path +is gated INSIDE the kernel by `BLACKWELL_MMA_AVAILABLE`, never by the matrix. + +| target | builds? | runs? | notes | +|--------|---------|-------|-------| +| CUDA sm_80/86/89/90 (Ampere/Ada/Hopper) | YES | YES | 0017 Blackwell code `#if`-stripped + default-off; all other device code portable. qwen35 hybrid models run (GDN + ssm_conv_update + gather have non-Blackwell kernels). NVFP4 GGUFs run via dequant/DP4A; FP4 levers inert, not broken. | +| CUDA sm_100 (datacenter Blackwell, HBM3e) | YES | YES | every lever active + bit-exact; GB10-tuned launch defaults are correct but compute-bound regime => safe-but-suboptimal (re-sweep, do not assume GB10 constants). | +| CPU (amd64 + arm64) | YES | YES | every new op ships a CPU mirror under the reused enums; host patches portable C++. | +| ROCm/HIP, Intel SYCL, Vulkan | YES | partial | .cu hipifies cleanly (no Blackwell intrinsic outside `#if`; 0022's 512-thread launch within AMD limits). SYCL/Vulkan don't compile the .cu and lack the GDN op, so qwen35 hybrid models assert/fall back rather than run; classic non-qwen35 models unaffected. | +| Metal / macOS | NOT BUILT | N/A | no `includeDarwin` row, no `metal:` capability key. Mac selection of this backend falls back to `default`=cpu (a Linux image) and does NOT run; no auto-fallthrough to stock llama-cpp. | + +No patch in 0017-0029 breaks a non-Blackwell CUDA build, the CPU build, or the +ROCm/SYCL/Vulkan builds. The only thing that is not merely "suboptimal" is the +fused-conv silent-miscompute hazard (item RISKY-1 below), and even that is latent +because the co-emitted GDN op asserts first on the backends that lack it. + +### 2. EVERY patch/opt, four buckets + +SAFE-EVERYWHERE (ship as-is; bit-exact or default-off byte-identical; pure win or +neutral on any arch that runs the path): +- 0001-0012 paged KV core (manager, on-demand alloc, prefix caching, in-kernel paged read) +- 0013 / 0016 prefill-token budget scheduler (pure `update_slots()` policy, default-off byte-identical) +- 0018 in-place SSM-state write-back (CUDA+CPU; see RISKY-1 for backend coverage) +- 0019 fused SSM-state gather (CUDA+CPU) +- 0021 conv-state in-place fusion (CUDA+CPU) +- 0028 recurrent-state (conv-tap) gather fusion (CUDA+CPU) +- 0020 o_proj GDN MMVQ->MMQ reshape (zero-cost view, bit-identical; MMQ>MMVQ at M=128 is universal; magnitude GB10-bound, perf-only caveat at tiny real M=1, see RISKY-2) +- 0024 paged-pool burst-reclaim (pure host C++; fixes a real long-server fragmentation collapse) +- 0029 block-table within-step host cache (host memcpy reuse, bit-exact; bigger win the FASTER the GPU, i.e. MORE host-bound decode elsewhere) + +BLACKWELL-ONLY, CLEAN FALLBACK (only meaningful where FP4-MMA exists; provably +inert/byte-identical elsewhere, never a break): +- 0017 FP4 dense-GEMM decode tile tune - levers `#if BLACKWELL_MMA_AVAILABLE` + `blackwell_mma_available(cc)` + `type==NVFP4`, ALL default-off => default build byte-identical to stock on every arch +- 0023 MoE NVFP4 activation-quant de-dup - plain uint4 copy kernel reached ONLY inside the pre-existing `if (use_native_fp4)` branch (false off-Blackwell); never executes there +- 0025 MoE NVFP4 decode re-graph - host-side CUDA-graph guard, env-gated `LLAMA_MOE_FORCE_GRAPHS` default-off; the NVFP4-grouped guard predicate is inert on non-FP4 +- NVFP4 GGUFs + 6 gallery rows - FAST path is sm_120/121/100 only; elsewhere run-via-dequant (correct, slow), never a load/run gate + +GB10-TUNED (works + safe everywhere, but the constants/magnitude are GB10 +bandwidth-floor winners; re-sweep per arch, no correctness risk): +- 0022 GDN recurrence occupancy retune - column-fold default (16,8)=512thr/MIN_BLOCKS=2, bit-exact, env-overridable GDN_NW/GDN_CPW; within the 1024-thread limit on sm_70..120 + AMD. Optimal values depend on DRAM latency/L2/SM-count; on a compute-bound arch the kernel may not be the bottleneck. +- 0026 bf16 per-head SSM/conv cache - default f32 bit-exact (opt-in `--cache-type-ssm/-conv`); bf16 only pays off on a bandwidth-bound arch, buys little on sm_100 HBM3e. bf16 is STORAGE-only (fill_kernel), the bf16-compute SSM plan is shelved so STATE_T stays f32 on the hot path. +- 0017 / 0023 magnitudes (the % wins, not the gating) are also GB10-floor-bound. + +RISKY (fix before claiming non-Blackwell ship; neither is a crash, one is silent): +- RISKY-1 (the one real gap) fused GDN/conv ops are CUDA+CPU-only with + backend-UNGATED, DEFAULT-ON emission. Confirmed: `cparams.fused_gdn_ch = true` + and `auto_fgdn = true` in the `llama_context` constructor; emission fires on + `(n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar)` with NO compute-backend + check. The fused conv variant REUSES `GGML_OP_SSM_CONV` discriminated by a + non-null `src[3]` (verified: CUDA `if (dst->src[3] != nullptr)` branch at the + top of `ggml_cuda_op_ssm_conv`, CPU mirror in ops.cpp, NO supports_op guard). A + backend that supports plain SSM_CONV but ignores `src[3]` would compute the + WRONG plain conv => SILENT corruption. Latent today only because the co-emitted + fork-custom GDN op is CUDA/CPU-only, so on Vulkan/SYCL the GDN node asserts + first and the qwen35 hybrid model cannot run there anyway, and Metal is not + built. FIX: gate fused-op emission on a CUDA/HIP compute backend, OR add a + supports_op guard that rejects the discriminated SSM_CONV where fused handling + is absent. This is the single thing that could miscompute silently; close it + before a Vulkan/SYCL/Metal paged build of a gated-DeltaNet model is ever shipped. +- RISKY-2 (perf-only, not correctness) 0020 forces MMQ; at a genuine single-stream + decode M<=8 (n_seqs=1) MMQ could be slower than MMVQ off the GB10 batched + regime. Bit-identical either way. Confirm the reshape still picks the better + kernel at n_seqs=1 on non-GB10 archs. + +### 3. NVFP4-GGUF + gallery targeting recommendation + +Do NOT hardware-gate the entries (and you cannot: LocalAI has no microarch-gating +primitive - `tags:` are free-text/search-only, `ModelConfig` has no +hardware/requirements field, and backend `capabilities:` resolves by accelerator +FAMILY only, serving `nvidia: cuda12-...-paged` to ANY NVIDIA GPU with no +sub-nvidia resolution). The GGUFs run correctly everywhere via dequant, so the +risk is PERFORMANCE-EXPECTATION, not correctness; a hard gate would wrongly block +valid (slow) use. Recommended, in order: +1. (zero-code, do now) Lead each of the 6 descriptions with one honest line: + "Hardware: Blackwell (RTX 50-series / GB10 / B200) recommended; runs on other + NVIDIA/CPU via NVFP4 dequant but WITHOUT FP4-MMA and below the quoted GB10 + throughput." Temper the "90-117% of vLLM" claims with that caveat (those are + LPDDR5x-bandwidth-bound specific). +2. (zero-code) Tag all six consistently with `nvfp4` + a new `blackwell` chip. The + four Qwopus/Qwen-MTP entries currently carry only `[llm, gguf]` and are not even + discoverable as NVFP4 despite being NVFP4 GGUFs - secondary correctness-of-metadata gap. +3. (feature, later) A structured `recommended_hardware` field surfaced by the React + import dialog is the only way to express this machine-readably; it does not exist. + +### 4. Per-arch roadmap (ranked by value / effort) + +- sm_100 datacenter Blackwell - HIGH value, MEDIUM effort. FP4-MMA works so NVFP4 + stays fast and the precision bucket (0017/0023/0025) carries over, but the BW + floor is gone => compute-bound. Needs: re-sweep 0022 GDN_NW/CPW; re-evaluate the + 0017 kill-gate (levers ready, may flip); expect 0018/0019/0026 bandwidth wins to + shrink toward neutral while 0029/0025/0020 host/graph/MMQ wins still help. No + code change to be SAFE; a tuning pass to be OPTIMAL. +- Metal / macOS - MEDIUM value, MEDIUM effort. Add the `includeDarwin` + `-metal-darwin-arm64-llama-cpp-localai-paged` row + a `metal:` capability key + (changed-backends.js already anticipates the source path). Delivers paged-KV + + scheduler value only (no NVFP4-MMA on Metal); still a strict win over today's + broken Mac selection. MUST also land RISKY-1 first (Metal would otherwise hit the + discriminated-SSM_CONV path if it ever gained an SSM_CONV kernel without the + discriminator). +- CPU - LOW effort, already works. Reference kernels exist for every fused op; + paged KV + scheduler + reclaim are the portable value. Nothing to do. +- Hopper sm_90 / Ada sm_89 / Ampere sm_80-86 - MEDIUM value, LOW effort (no FP4 + work). No FP4-MMA => pair the precision-agnostic infra (paged KV, 0013/0016, + 0024, 0029, 0018/0019/0021/0028, 0020) with a DIFFERENT quant (Q4_K/AWQ/GPTQ). + Messaging: "no NVFP4 here, use another quant, but paged + SSM + scheduler infra + is a pure win". The GGUFs/gallery rows are out of scope for these. + +### 5. What MUST be empirically verified (and on what hardware) + +- GB10 (sm_121, user has it): the validated target; already measured. Re-confirm + bit-exactness gates after RISKY-1 fix. +- M4 Mac (user has it): (a) once an `includeDarwin` paged row exists, verify the + Metal build compiles + a NON-qwen35 model runs (paged KV path); (b) verify a + qwen35 hybrid model on Metal EITHER asserts loudly OR is correct - it must NOT + silently miscompute the discriminated SSM_CONV. This is the direct test of + RISKY-1 on real Metal. Do this BEFORE shipping a Metal paged build. Also verify + CPU correctness of every fused op on the Mac (arm64 CPU mirror). +- non-Blackwell NVIDIA (sm_80/86/89/90 - user would need to ACQUIRE, e.g. cloud + A100/L4/L40S/H100): verify (a) the cuda12/cuda13 paged image runs a qwen35 + hybrid model correctly (GDN + ssm_conv_update + gather non-Blackwell kernels), + (b) NVFP4 GGUFs load + produce correct output via dequant/DP4A (not garbage), + (c) RISKY-2: that 0020's forced MMQ does not regress single-stream (n_seqs=1) + decode latency vs MMVQ. This is the only bucket needing hardware acquisition; + everything else is covered by the GB10 + M4 the user already has. +- sm_100 (datacenter Blackwell - cloud B200 if a tuning pass is wanted): only + needed to make sm_100 OPTIMAL, not to make it SAFE. Defer unless targeting it. + +### 6. SHIP DECISION + +SAFE TO SHIP TODAY as a Blackwell-targeted backend on Linux. The build is +arch-general (same arch fan + variant set as stock llama-cpp), every targeted +Linux variant builds and runs, and all Blackwell-specific code is default-off + +`#if`-guarded so a non-Blackwell build is byte-identical to stock on the FP4 path. +The NVFP4 GGUFs run everywhere via dequant (correct, slower), so broad gallery +exposure is a performance-expectation issue, not a correctness one. + +MINIMUM to not break / mislead other archs: +1. (correctness, before ANY Vulkan/SYCL/Metal paged build of a gated-DeltaNet + model) Close RISKY-1: compute-backend-gate the fused GDN/conv op emission, or + add a supports_op guard rejecting the discriminated SSM_CONV. This is the only + hard requirement; it is latent on the current Linux targets but becomes live + the moment a Metal/Vulkan/SYCL paged build of qwen35 exists. +2. (availability, zero-risk) Add the `includeDarwin` paged row + `metal:` key so + Mac users get a working (paged-KV-only) build instead of a non-running + default=cpu selection with no fallthrough to stock. +3. (expectation, zero-code) Add the Blackwell-recommended hardware note + the + "runs slower off-Blackwell via dequant" caveat to the 6 gallery descriptions + and tag all six `nvfp4` + `blackwell`. +4. (perf, verify don't block) Confirm 0020 does not regress n_seqs=1 decode on + non-GB10 NVIDIA; if it does, gate the MMVQ->MMQ reshape on a real-M threshold. + +Items 2-4 do not block a Linux Blackwell ship. Item 1 blocks only a future +non-CUDA paged build of a gated-DeltaNet model; on the current build targets the +hazard is latent (the GDN op asserts first). Net: ship for Blackwell/Linux now; +land item 1 before extending paged to Metal/Vulkan/SYCL. + Assisted-by: Claude:opus-4.8 [Claude Code]