LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-28 10:27:30 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	a4e730979d	feat(paged): restrict llama-cpp-localai-paged to CUDA-only build targets The paged backend previously built for cublas/cuda, cpu, vulkan, sycl, hipblas and darwin/metal. On non-CUDA the patchset's wins are inert: the GDN fusions are gated off (patch 0030) and NVFP4 falls back to dequant, so the backend is neutral-to-negative there (README section 4c). The darwin grpc-server link also fails on undefined upstream server symbols, turning CI red. Both broken and pointless off-CUDA, so ship CUDA-only. - backend-matrix.yml: drop the hipblas, sycl f32/f16, cpu amd64/arm64, vulkan amd64/arm64 and metal-darwin rows for this backend; keep the four cublas rows (cuda-12, cuda-13, nvidia-l4t cuda-12 and cuda-13). - index.yaml: meta-backend (and -development) capabilities are now CUDA-only with default pointing at cuda12 (mirrors faster-qwen3-tts); removed the orphaned cpu/rocm/sycl/vulkan/metal variant entries. - Removed the now-unused darwin build script and its Makefile target / .NOTPARALLEL entry / backend_build_darwin.yml step. - Documented the CUDA-only build coverage in the patch README and plan. Non-CUDA users should use the stock llama-cpp backend. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 12:29:15 +00:00
Ettore Di Giacinto	9115c2c52c	docs(paged): correct Vulkan/SYCL note (GDN op IS upstream) + CUDA-only rationale The gated-DeltaNet + SSM_CONV ops have upstream Metal/Vulkan/SYCL kernels, so the Qwen3.6 hybrids run there (non-fused) - the earlier 'no Vulkan kernel' note was wrong. The patchset's fusions are gated off off-CUDA, so the backend ships CUDA-only; non-CUDA users use stock llama-cpp. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 12:18:11 +00:00
Ettore Di Giacinto	984c8fcbea	docs(paged): Layer-2 upstream scope for native fused-GDN kernels (Metal/Vulkan/SYCL) Source-only analysis of what it would take to give the gated-DeltaNet decode fusions (0018 in-place state write-back, 0019 fused recurrent-state gather, 0021 ssm_conv_update_inplace, 0028 conv-tap gather fusion) native kernels on the non-CUDA compute backends, so the patch-series decode win extends past CUDA-family hardware. Key findings: - The base GGML_OP_GATED_DELTA_NET and GGML_OP_SSM_CONV kernels ALREADY exist upstream on Metal, Vulkan AND SYCL (the README's no-Vulkan-kernel line is stale). The Qwen3.6 hybrids run on all three today via the non-fused path; Layer-2 is the decode SPEEDUP, not enabling the model to run. - Per backend the new work is only the FUSION plumbing: redirect the GDN state write (in-place), add the ids read, write one new conv-update kernel + its ids variant, two tiny gather kernels, plus supports_op + op-handler + (Vulkan) pipeline/push-constant/descriptor wiring. Builders, CPU refs, model graph and test-backend-ops cases are shared and already done. - Bit-exactness is feasible per backend by construction (the fusions redirect addresses, not the f32 reduction order); test-backend-ops (backendX-vs-CPU) is the gate. - The 0030 name allow-list should become capability-driven (make supports_op authoritative for the discriminated src slots). - Ranked: ops-first PR, then Metal (highest value/effort, fixed simdgroup = simplest bit-exactness), then SYCL (near-verbatim CUDA mirror, cheapest to author), then Vulkan (widest hardware reach but the shader-gen + variant matrix + subgroup variance make it the capstone). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 12:11:24 +00:00
Ettore Di Giacinto	4a9a1dd247	docs(paged): Mac stock-vs-patched bench + Vulkan note + cross-backend learnings Section 4(c): real Apple M4/Metal numbers (Qwen3-8B Q4_K_M, stock vs patched) - patchset is neutral-to-slightly-negative on Metal (the in-kernel block-table read is CUDA-only; NVFP4/GDN-fusions inert), so prefer stock llama-cpp on Apple Silicon. Vulkan: same picture, worse (no upstream GDN op). Section 6: cross-backend learnings + upstream candidates (the GDN decode-plumbing fusions are the portable, bit-exact, CPU-mirrored win worth upstreaming). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 11:05:37 +00:00
Ettore Di Giacinto	78fac9a28f	refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series Move ALL paged-attention content out of the stock backend/cpp/llama-cpp backend and into backend/cpp/llama-cpp-localai-paged, so the stock backend is pure upstream llama.cpp and the paged backend owns and applies its own vendored patch series. - Delete the dead early-exploration scaffold backend/cpp/llama-cpp/paged/ (kernel/w4a16 Marlin scaffold, standalone paged_kv_manager, bench/loadgen, its own 0001-0002 patches, dense-era design docs, tests). Zero references repo-wide. - Move backend/cpp/llama-cpp/patches/ (the 28-patch paged series + paged/README + 3 operational docs, plus the kernel/ scaffold patch and the top-level paged README/BENCHMARKS) to backend/cpp/llama-cpp-localai-paged/patches/. The stock backend keeps no patches/ dir; it had no non-paged base patches. - Purify the stock backend: remove the LLAMA_PAGED make variable, the patches/paged apply loop, and the LLAMA_PAGED passthrough to prepare.sh; remove the paged-series handling from prepare.sh. The stock llama.cpp target now only clones the pin and applies its own (currently empty) base patches/ series. The runtime paged option hooks in the shared grpc-server.cpp are untouched (inert without the patches). - The paged backend's Makefile now applies its OWN patches/paged/0*.patch onto each freshly cloned tree via strict git apply (apply-paged-patches), after the copied stock infra clones the pin and applies base patches. - Repoint every reference to the old patches/paged path: the upstream canary workflow + apply script, bump_deps.yaml, gallery/index.yaml, the docs, backend/index.yaml, backend-matrix.yml, the top-level Makefile comments, and the moved PIN_SYNC / README docs. Drop the now-removed LLAMA_PAGED=on build-toggle from comments. Verified: the full 28-patch series applies strict-clean (git apply, exit 0) to a clean ggml-org/llama.cpp checkout at the pinned c299a92c, and the repointed canary apply script resolves and applies the series end to end. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 11:01:22 +00:00

5 Commits