docs(paged): ARCH audit - NVFP4 GGUF off-Blackwell portability + gallery targeting gap

NVFP4 (GGML_TYPE_NVFP4=40 / MOSTLY_NVFP4) GGUFs are portable: full CPU/CUDA-DP4A/ generic-MMA/Vulkan dequant coverage. FP4-MMA is a runtime Blackwell-only speed tier (mmq.cu use_native_fp4 flag), not a load/run gate. Off-Blackwell = runs via dequant, correct-but-slow, never fail/garbage. Gallery has no microarch-gating primitive (tags are search-only, capabilities map is family-level nvidia/amd/ metal, model struct has no hardware field), so the 6 -paged entries can only express Blackwell-targeting via description prose + tags. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 09:57:14 -04:00 · 2026-06-27 07:00:34 +00:00
parent 683e22500f
commit 34abf392fc
1 changed files with 104 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md
+++ b/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md
@@ -114,4 +114,108 @@ nvidia-cuda-13, nvidia-cuda-12, nvidia-l4t-cuda-12/13.  NO `metal:` key.
  NVFP4 acceleration - still a correctness/availability win over the current
  broken selection.)

+## Section: gguf-gallery-targeting (NVFP4 portability + hardware gating)
+
+### 1. NVFP4 GGUFs LOAD + RUN on non-Blackwell - runs-via-dequant, NOT FP4-MMA-required
+
+The published GGUFs use `file_type` MOSTLY_NVFP4 / `GGML_TYPE_NVFP4` (type id 40).
+This is a standard ggml block-quant type with FULL software dequant + matmul
+coverage across every backend, NOT a Blackwell-only format. Verified against the
+paged backend's pinned ggml source (pin 0a2677c6, same upstream as stock
+llama-cpp):
+
+- CPU (any arch, amd64 + arm64): full support, no special hardware.
+  - `ggml/src/ggml-cpu/quants.c`: `quantize_row_nvfp4` (from_float) +
+    `ggml_vec_dot_nvfp4_q8_0_generic` (the matmul dot product), dequant via the
+    `kvalues_mxfp4` lookup table. Registered in the CPU type-traits table
+    (`ggml-cpu.c` line 283: `[GGML_TYPE_NVFP4] = { .from_float=..., .vec_dot=... }`).
+  - NVFP4 handled in all the CPU op switches (`ops.cpp` lines 674, 1125, 1255,
+    4424, 4701, 4925, 5651). LOADS + RUNS correctly on a pure-CPU host, just slow.
+- CUDA, NON-Blackwell (Pascal/Volta/Turing/Ampere sm_80-86 / Ada sm_89 /
+  Hopper sm_90): RUNS correctly via the integer-quantized matmul paths, no
+  FP4-MMA needed.
+  - `convert.cu` registers `dequantize_row_nvfp4_cuda` as both the to_float and
+    to_fp16 dequant kernel (lines 759, 814) - the generic dequant->GEMM path.
+  - `mmvq.cu`: `vec_dot_nvfp4_q8_1` (DP4A integer dot, works on any GPU with
+    dp4a, i.e. Pascal sm_61+). This is the decode (gemv) path.
+  - `mmq.cuh`: NVFP4 has a `MMQ_DP4A_TXS_Q8_0_16` DP4A tile AND a separate
+    `MMQ_MMA_TILE_X_K_NVFP4` tile explicitly commented "NVFP4 Generic" (line
+    222), DISTINCT from `MMQ_MMA_TILE_X_K_FP4` "MXFP4 and NVFP4 Blackwell" (line
+    221). So there are three tiers: DP4A (oldest), generic-MMA (Turing+), and
+    Blackwell-native FP4-MMA.
+  - The Blackwell path is a runtime FLAG, not a requirement:
+    `mmq.cu` line 125 `const bool use_native_fp4 = blackwell_mma_available(cc)
+    && (... NVFP4)`. When false (non-Blackwell), it falls through to the generic
+    quantized kernel. Grep for any abort/unsupported on NVFP4+blackwell = NONE.
+    No `GGML_ABORT`, no garbage - just the non-MMA kernel.
+- Vulkan: has `dequant_nvfp4.comp` + NVFP4 in `ggml-vulkan.cpp` / dequant_funcs
+  - LOADS + RUNS on Vulkan hosts (AMD/Intel/NVIDIA) via dequant.
+- Metal: NVFP4 referenced only in `ggml-metal-device.m` (type registration /
+  size), NO Metal NVFP4 compute kernel. On Apple Silicon NVFP4 tensors would
+  fall back to the CPU backend op-by-op (correct but slow) IF a Metal build
+  existed - which for THIS backend it does not (see build-targeting Section 3).
+
+Bottom line: the NVFP4 GGUFs are PORTABLE. A Hopper/Ada/Ampere/CPU/Vulkan host
+loads and runs them correctly (bit-faithful dequant), just WITHOUT the FP4-MMA
+speedup. FP4-MMA is a Blackwell-only performance tier layered on top of a
+fully-general software path, NOT a load/run gate. Off-Blackwell = runs-via-dequant,
+correct-but-slow; never fail/garbage.
+
+### 2. Gallery hardware-targeting GAP: nothing stops a non-Blackwell user
+
+The 6 -paged entries declare NO machine-readable hardware targeting. The only
+Blackwell signal is free prose in `description:` ("native Blackwell NVFP4
+(FP4-MMA)", "Benchmarked on GB10 / DGX Spark") and a `nvfp4` string in `tags:`.
+
+How LocalAI's gallery CAN express hardware gating (what exists):
+- `tags:` are FREE-TEXT, search-only. `core/gallery/gallery.go` line 89 just does
+  `strings.Contains(lower(join(tags)), term)` for the search box + line 128
+  collects them for filter chips. There is NO tag that gates install or warns;
+  the `nvfp4` tag is purely discoverability.
+- The model `ModelConfig` struct (`core/gallery/models.go`) has only
+  Description/Icon/License/URLs/Name/ConfigFile/Files/PromptTemplates. There is
+  NO capabilities / requirements / hardware field at the MODEL level. (Signing
+  `verification:` is the only structured gate, unrelated to hardware.)
+- The `capabilities:` map (default/nvidia/intel/amd/metal/vulkan/...) is a
+  BACKEND-level concept in `backend/index.yaml` (paged entry lines 100-111). It
+  selects the backend IMAGE by detected accelerator FAMILY (nvidia vs amd vs
+  metal vs cpu). Crucially it does NOT and CANNOT distinguish Blackwell sm_120/121
+  from older NVIDIA - `nvidia: cuda12-llama-cpp-localai-paged` is served to ANY
+  NVIDIA GPU. There is no sub-nvidia (microarch) gating mechanism in the gallery
+  or the backend capability resolver.
+
+So the gating gap is real: a non-Blackwell user browsing the gallery is offered
+the NVFP4 entries with no machine-readable signal that they will run far below
+the advertised "90-117% of vLLM" numbers (those numbers are GB10/LPDDR5x-bound
+specific). It will install and run correctly, just slowly, and the bench claims
+in the description will not hold.
+
+### 3. How to express Blackwell-targeting (recommendation)
+
+Given there is no microarch-gating primitive, the honest options are, in order:
+
+a. DESCRIPTION + TAG (only thing available today, zero code): the entries already
+   say "native Blackwell NVFP4 (FP4-MMA)" - tighten it to a leading one-line
+   "Hardware: Blackwell (RTX 50-series / GB10 / B200) recommended; runs on other
+   NVIDIA/CPU via NVFP4 dequant but WITHOUT the FP4-MMA speedup and below the
+   quoted GB10 throughput." Add a `blackwell` tag alongside `nvfp4` for the
+   filter chip. This is the existing convention (other entries use free prose +
+   `nvidia` tag, e.g. line 2395; quant trade-offs are described in prose, e.g.
+   the Gemma "Mobile-optimized" notes lines 1312/1366). No other gallery entry
+   today encodes a GPU-microarch requirement, so prose is the de-facto standard.
+b. If a structured signal is wanted, it would need a NEW field (e.g. a
+   `recommended_hardware` / `requires` note surfaced by the React UI import
+   dialog) - that is a feature, not a config tweak, and does not exist yet.
+c. The `nvfp4` tag should at minimum be present on ALL six entries - the four
+   Qwopus/Qwen-MTP entries at lines 819/854/890 have only `[llm, gguf]` tags and
+   omit `nvfp4`, so they are not even discoverable/filterable as NVFP4, despite
+   being NVFP4 GGUFs. Inconsistent tagging is a secondary gap.
+
+Verdict (gallery-targeting): NVFP4 GGUFs are safe to ship broadly (they run
+everywhere via dequant, never fail), so the risk is PERFORMANCE-EXPECTATION, not
+correctness. LocalAI has no microarch gating primitive; the only lever is the
+description + tags. Recommend a one-line Blackwell-recommended hardware note +
+consistent `nvfp4`/`blackwell` tags on all six, and tempering the GB10 bench
+claims with the "runs slower off-Blackwell" caveat.
+
 Assisted-by: Claude:opus-4.8 [Claude Code]