feat(paged): restrict llama-cpp-localai-paged to CUDA-only build targets

The paged backend previously built for cublas/cuda, cpu, vulkan, sycl, hipblas and darwin/metal. On non-CUDA the patchset's wins are inert: the GDN fusions are gated off (patch 0030) and NVFP4 falls back to dequant, so the backend is neutral-to-negative there (README section 4c). The darwin grpc-server link also fails on undefined upstream server symbols, turning CI red. Both broken and pointless off-CUDA, so ship CUDA-only.

- backend-matrix.yml: drop the hipblas, sycl f32/f16, cpu amd64/arm64, vulkan amd64/arm64 and metal-darwin rows for this backend; keep the four cublas rows (cuda-12, cuda-13, nvidia-l4t cuda-12 and cuda-13).
- index.yaml: meta-backend (and -development) capabilities are now CUDA-only with default pointing at cuda12 (mirrors faster-qwen3-tts); removed the orphaned cpu/rocm/sycl/vulkan/metal variant entries.
- Removed the now-unused darwin build script and its Makefile target / .NOTPARALLEL entry / backend_build_darwin.yml step.
- Documented the CUDA-only build coverage in the patch README and plan.

Non-CUDA users should use the stock llama-cpp backend.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 18:06:58 -04:00 · 2026-06-27 12:29:15 +00:00
parent 9115c2c52c
commit a4e730979d
7 changed files with 25 additions and 299 deletions
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/LOCALAI_LLAMACPP_BACKEND_PLAN.md
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/LOCALAI_LLAMACPP_BACKEND_PLAN.md
@@ -3,6 +3,13 @@
 Scoping deliverable only. NOTHING is changed by this document. It is grounded in the
 actual repo structure (read 2026-06-26 in worktree feat+paged-attention), not assumptions.

+SHIPPED REALITY (update 2026-06-27): the backend ships CUDA-only. The matrix rows and
+the index.yaml meta-backend keep ONLY the CUDA/cublas variants (cuda-12, cuda-13, and
+the nvidia-l4t arm64 cuda-12/cuda-13 Jetson rows). The cpu / vulkan / sycl / hipblas /
+metal-darwin variants discussed below as optional/phase-2 were NOT shipped (and the
+darwin row was removed): off-CUDA the patchset's wins gate off, so it is neutral-to-
+negative there and non-CUDA users should use the stock llama-cpp backend (README 4c).
+
 ================================================================================
 0. GROUND TRUTH (what the repo actually does today)
 ================================================================================
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/README.md
@@ -344,6 +344,14 @@ in a recommended/gallery config.

 ## 8. Models

+> **Build coverage: CUDA-only.** This backend ships only the CUDA/cublas build
+> targets (cuda-12, cuda-13, and the nvidia-l4t arm64 cuda-12/cuda-13 Jetson
+> rows). There are no cpu / vulkan / sycl / hipblas / metal-darwin builds: the
+> patchset's wins are CUDA/Blackwell-specific (section 4c), so off-CUDA the
+> backend is neutral-to-negative and non-CUDA users should run the stock
+> `llama-cpp` backend instead. The `backend/index.yaml` meta-backend resolves
+> `default`/`nvidia` to a CUDA variant accordingly.
+
 The benchmarked NVFP4 GGUFs are published and wired into the LocalAI gallery:

 | Gallery entry | Weights (HuggingFace) | Notes |