docs(paged): arch-generality audit - build-targeting (CUDA arch fan + variants + Metal gap)

The llama-cpp-localai-paged backend sets NO explicit CUDA arch list anywhere (CUDA_DOCKER_ARCH empty in every matrix row; compile.sh only injects -DCMAKE_CUDA_ARCHITECTURES when non-empty), so it compiles the full upstream ggml default arch fan - bit-identical targeting to stock llama-cpp, NOT Blackwell-only. NVFP4 FP4-MMA is gated inside the kernel by BLACKWELL_MMA_AVAILABLE, not by the build matrix, so the binary is arch-portable. Variants: CUDA 12/13 + l4t arm64, ROCm, SYCL f32/f16, Vulkan amd64/arm64, CPU amd64/arm64 (CPU_ALL_VARIANTS) - same Linux set as stock llama-cpp, not CUDA-only. Single gap vs stock: NO Metal/Darwin row in includeDarwin and NO metal: capability key in the meta-backend. macOS hosts fall back to the default cpu (Linux) image, which will not run, and do not auto-fall to stock llama-cpp. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 09:57:14 -04:00 · 2026-06-27 06:59:36 +00:00
parent db6ebc53b2
commit 683e22500f
1 changed files with 117 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md
+++ b/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md
@@ -0,0 +1,117 @@
+# ARCH_GENERALITY_AUDIT - llama-cpp-localai-paged backend
+
+Source/build/gallery audit (no GPU, no hardware). Maps how arch-general the
+paged backend's BUILD targeting is, and whether non-Blackwell / Metal / CPU
+hosts get a working build.
+
+## Section: backend-build-matrix (build targeting)
+
+### 1. CUDA arch list: NOT Blackwell-only - it is the FULL upstream ggml default
+
+There is NO explicit CUDA arch list anywhere in the paged build path:
+
+- `.docker/llama-cpp-localai-paged-compile.sh` only injects
+  `-DCMAKE_CUDA_ARCHITECTURES=${CUDA_DOCKER_ARCH}` *when* `CUDA_DOCKER_ARCH` is
+  non-empty (`if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]`).
+- NO `backend-matrix.yml` row for `llama-cpp-localai-paged` sets
+  `CUDA_DOCKER_ARCH` (nor does any stock `llama-cpp` row). It is empty.
+- `backend/cpp/llama-cpp/Makefile` (reused verbatim by the paged wrapper) sets
+  only `-DGGML_CUDA=ON` (+ `-DGGML_NATIVE=OFF`). It never sets
+  `CMAKE_CUDA_ARCHITECTURES` / `CUDA_DOCKER_ARCH`.
+
+=> The compiled arch fan is whatever upstream llama.cpp / ggml-cuda picks by
+default with `GGML_NATIVE=OFF` (the full multi-arch default, which includes
+Blackwell sm_120 alongside the older archs ggml ships). This is BIT-IDENTICAL to
+how the stock llama-cpp backend is targeted - the paged wrapper copies and reuses
+the exact same Makefile + CMakeLists + prepare.sh, only forcing `LLAMA_PAGED=on`.
+
+Consequence for NVFP4: the FP4-MMA kernel is compile-time gated *inside* the
+ggml-cuda TU by `BLACKWELL_MMA_AVAILABLE` (sm_120/121 consumer, sm_100
+datacenter). Because the build emits the full arch fan (not a Blackwell-only
+list), the sm_120 NVFP4-MMA path is present for Blackwell AND the older archs get
+their own kernels (NVFP4 runs the non-MMA fallback at runtime on
+Ampere/Ada/Hopper). NOTHING in the build pins this to GB10/sm_121. The binary is
+arch-portable; only the NVFP4 *speedup* is Blackwell-specific, by kernel gating,
+not by build targeting.
+
+### 2. Variants built: CUDA + ROCm + SYCL + Vulkan + CPU (NOT CUDA-only)
+
+`backend-matrix.yml` `include:` (Linux) ships, for `llama-cpp-localai-paged`,
+one row per stock-llama-cpp Linux row (10 rows, lines 4889-5046):
+
+- cublas CUDA 12.8 (linux/amd64)
+- cublas CUDA 13.0 (linux/amd64)
+- cublas CUDA 13.0 arm64 l4t (jetson)
+- cublas CUDA 12.0 arm64 l4t (jetson)
+- hipblas / ROCm 7.2.1 (linux/amd64) - AMDGPU_TARGETS = gfx908..gfx1201
+- sycl_f32 (Intel oneAPI)
+- sycl_f16 (Intel oneAPI)
+- vulkan (linux/amd64)
+- vulkan (linux/arm64)
+- CPU (linux/amd64) + CPU (linux/arm64), built via the ggml
+  `CPU_ALL_VARIANTS` single-build (dlopen libggml-cpu-*.so by host CPU feature;
+  arm64 uses gcc-14 for SME).
+
+So it is NOT CUDA-only. Per image, `compile.sh` builds: the accelerator variant
+(or CPU_ALL_VARIANTS when BUILD_TYPE empty) + grpc-server + rpc-server.
+
+### 3. THE GAP vs stock llama-cpp: NO Metal / Darwin build
+
+This is the single build-targeting divergence:
+
+- stock `llama-cpp` HAS a Darwin row in `includeDarwin`
+  (`-metal-darwin-arm64-llama-cpp`, line 5071) and a `metal:` capability key
+  (`metal: "metal-llama-cpp"`, index.yaml line 25).
+- `llama-cpp-localai-paged` has ZERO entries in `includeDarwin` (grep past line
+  5048 = none) and NO `metal:` capability key in its meta-backend.
+- There is NO `metal-*-llama-cpp-localai-paged` tag anywhere in
+  backend-matrix.yml or backend/index.yaml.
+
+`scripts/changed-backends.js` already anticipates a future darwin paged row
+(lines 78-81 map `backend === "llama-cpp-localai-paged"` to the C++ source dir),
+but no such matrix row exists, so it is currently dead/forward-looking code.
+
+Everything else (CUDA arch fan, ROCm gfx list, SYCL, Vulkan, CPU) matches stock
+llama-cpp exactly.
+
+### 4. Does a non-Blackwell / Metal / CPU host get a working build of THIS backend?
+
+Meta-backend capabilities map (index.yaml lines 101-111):
+default(cpu), nvidia(cuda12), intel(sycl-f16), amd(rocm), vulkan, nvidia-l4t,
+nvidia-cuda-13, nvidia-cuda-12, nvidia-l4t-cuda-12/13.  NO `metal:` key.
+
+- Non-Blackwell NVIDIA (Ampere sm_80-86 / Ada sm_89 / Hopper sm_90 / datacenter
+  Blackwell sm_100): selects the SAME cuda12 / cuda13 image. That image is
+  compiled for the full arch fan, so it RUNS. NVFP4 falls back to the non-MMA
+  path on pre-Blackwell; on sm_100 it gets FP4-MMA but is compute-bound (HBM3e),
+  not the LPDDR5x-bound GB10 regime the patches were tuned for. WORKS, just
+  without the GB10-specific bandwidth win.
+- AMD / Intel / Vulkan / CPU (amd64 + arm64) Linux hosts: each has its own
+  matching variant in the map + matrix. WORKS.
+- Metal / macOS Apple Silicon: NO `metal:` key and NO darwin build. Capability
+  resolution falls back to `default` = `cpu-llama-cpp-localai-paged`, which is a
+  Linux (amd64/arm64) image, NOT a macOS-native build, so it will NOT run on
+  macOS. And because this is a SEPARATE meta-backend, it does NOT fall through to
+  the stock `llama-cpp` backend - a Mac user who explicitly selects
+  llama-cpp-localai-paged gets a non-running selection and must manually pick the
+  stock llama-cpp backend instead. DOES NOT WORK on Metal/macOS; no auto-fallback
+  to stock.
+
+## Verdict (build-targeting)
+
+- Arch-general on Linux: YES. The build is NOT Blackwell-only; it targets the
+  exact same full CUDA arch fan + the same ROCm/SYCL/Vulkan/CPU variant set as
+  stock llama-cpp. Any Linux host that can run stock llama-cpp can run THIS
+  backend; the NVFP4 speedup is the only Blackwell-gated piece, and that gating
+  is inside the kernel, not in the build matrix.
+- Single gap: NO Metal/Darwin variant and NO `metal:` capability key. macOS /
+  Apple Silicon hosts have no working build of this backend and do not auto-fall
+  to stock llama-cpp. To close the gap, add an `includeDarwin` row
+  (`-metal-darwin-arm64-llama-cpp-localai-paged`, mirroring the stock llama-cpp
+  darwin row + the C++ source build path that changed-backends.js already
+  anticipates) and a `metal:` key to the paged meta-backend. (Note: NVFP4 has no
+  Metal MMA path, so a Metal build would deliver paged-KV behaviour only, no
+  NVFP4 acceleration - still a correctness/availability win over the current
+  broken selection.)
+
+Assisted-by: Claude:opus-4.8 [Claude Code]