LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-27 09:57:14 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	a4e730979d	feat(paged): restrict llama-cpp-localai-paged to CUDA-only build targets The paged backend previously built for cublas/cuda, cpu, vulkan, sycl, hipblas and darwin/metal. On non-CUDA the patchset's wins are inert: the GDN fusions are gated off (patch 0030) and NVFP4 falls back to dequant, so the backend is neutral-to-negative there (README section 4c). The darwin grpc-server link also fails on undefined upstream server symbols, turning CI red. Both broken and pointless off-CUDA, so ship CUDA-only. - backend-matrix.yml: drop the hipblas, sycl f32/f16, cpu amd64/arm64, vulkan amd64/arm64 and metal-darwin rows for this backend; keep the four cublas rows (cuda-12, cuda-13, nvidia-l4t cuda-12 and cuda-13). - index.yaml: meta-backend (and -development) capabilities are now CUDA-only with default pointing at cuda12 (mirrors faster-qwen3-tts); removed the orphaned cpu/rocm/sycl/vulkan/metal variant entries. - Removed the now-unused darwin build script and its Makefile target / .NOTPARALLEL entry / backend_build_darwin.yml step. - Documented the CUDA-only build coverage in the patch README and plan. Non-CUDA users should use the stock llama-cpp backend. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 12:29:15 +00:00
Ettore Di Giacinto	9115c2c52c	docs(paged): correct Vulkan/SYCL note (GDN op IS upstream) + CUDA-only rationale The gated-DeltaNet + SSM_CONV ops have upstream Metal/Vulkan/SYCL kernels, so the Qwen3.6 hybrids run there (non-fused) - the earlier 'no Vulkan kernel' note was wrong. The patchset's fusions are gated off off-CUDA, so the backend ships CUDA-only; non-CUDA users use stock llama-cpp. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 12:18:11 +00:00
Ettore Di Giacinto	984c8fcbea	docs(paged): Layer-2 upstream scope for native fused-GDN kernels (Metal/Vulkan/SYCL) Source-only analysis of what it would take to give the gated-DeltaNet decode fusions (0018 in-place state write-back, 0019 fused recurrent-state gather, 0021 ssm_conv_update_inplace, 0028 conv-tap gather fusion) native kernels on the non-CUDA compute backends, so the patch-series decode win extends past CUDA-family hardware. Key findings: - The base GGML_OP_GATED_DELTA_NET and GGML_OP_SSM_CONV kernels ALREADY exist upstream on Metal, Vulkan AND SYCL (the README's no-Vulkan-kernel line is stale). The Qwen3.6 hybrids run on all three today via the non-fused path; Layer-2 is the decode SPEEDUP, not enabling the model to run. - Per backend the new work is only the FUSION plumbing: redirect the GDN state write (in-place), add the ids read, write one new conv-update kernel + its ids variant, two tiny gather kernels, plus supports_op + op-handler + (Vulkan) pipeline/push-constant/descriptor wiring. Builders, CPU refs, model graph and test-backend-ops cases are shared and already done. - Bit-exactness is feasible per backend by construction (the fusions redirect addresses, not the f32 reduction order); test-backend-ops (backendX-vs-CPU) is the gate. - The 0030 name allow-list should become capability-driven (make supports_op authoritative for the discriminated src slots). - Ranked: ops-first PR, then Metal (highest value/effort, fixed simdgroup = simplest bit-exactness), then SYCL (near-verbatim CUDA mirror, cheapest to author), then Vulkan (widest hardware reach but the shader-gen + variant matrix + subgroup variance make it the capstone). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 12:11:24 +00:00
Ettore Di Giacinto	4a9a1dd247	docs(paged): Mac stock-vs-patched bench + Vulkan note + cross-backend learnings Section 4(c): real Apple M4/Metal numbers (Qwen3-8B Q4_K_M, stock vs patched) - patchset is neutral-to-slightly-negative on Metal (the in-kernel block-table read is CUDA-only; NVFP4/GDN-fusions inert), so prefer stock llama-cpp on Apple Silicon. Vulkan: same picture, worse (no upstream GDN op). Section 6: cross-backend learnings + upstream candidates (the GDN decode-plumbing fusions are the portable, bit-exact, CPU-mirrored win worth upstreaming). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 11:05:37 +00:00
Ettore Di Giacinto	78fac9a28f	refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series Move ALL paged-attention content out of the stock backend/cpp/llama-cpp backend and into backend/cpp/llama-cpp-localai-paged, so the stock backend is pure upstream llama.cpp and the paged backend owns and applies its own vendored patch series. - Delete the dead early-exploration scaffold backend/cpp/llama-cpp/paged/ (kernel/w4a16 Marlin scaffold, standalone paged_kv_manager, bench/loadgen, its own 0001-0002 patches, dense-era design docs, tests). Zero references repo-wide. - Move backend/cpp/llama-cpp/patches/ (the 28-patch paged series + paged/README + 3 operational docs, plus the kernel/ scaffold patch and the top-level paged README/BENCHMARKS) to backend/cpp/llama-cpp-localai-paged/patches/. The stock backend keeps no patches/ dir; it had no non-paged base patches. - Purify the stock backend: remove the LLAMA_PAGED make variable, the patches/paged apply loop, and the LLAMA_PAGED passthrough to prepare.sh; remove the paged-series handling from prepare.sh. The stock llama.cpp target now only clones the pin and applies its own (currently empty) base patches/ series. The runtime paged option hooks in the shared grpc-server.cpp are untouched (inert without the patches). - The paged backend's Makefile now applies its OWN patches/paged/0*.patch onto each freshly cloned tree via strict git apply (apply-paged-patches), after the copied stock infra clones the pin and applies base patches. - Repoint every reference to the old patches/paged path: the upstream canary workflow + apply script, bump_deps.yaml, gallery/index.yaml, the docs, backend/index.yaml, backend-matrix.yml, the top-level Makefile comments, and the moved PIN_SYNC / README docs. Drop the now-removed LLAMA_PAGED=on build-toggle from comments. 
Verified: the full 28-patch series applies strict-clean (git apply, exit 0) to a clean ggml-org/llama.cpp checkout at the pinned c299a92c, and the repointed canary apply script resolves and applies the series end to end. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 11:01:22 +00:00
Ettore Di Giacinto	fb2dc33d52	docs(paged): consolidate the dev-trail docs into one canonical README The paged-attention patch directory had accumulated ~55 scattered dev docs (results, progress, scope, lever, and gap-analysis notes). Consolidate the durable content of all of them into one canonical backend/cpp/llama-cpp/patches/paged/README.md covering: what the patchset is, the architecture (paged KV + block-table flash-attn, the gated-DeltaNet SSM decode path, NVFP4 FP4-MMA, the decode-first scheduler), the full 0001-0030 patch series table with bit-exact status, the GB10 benchmarks (patched-vs-stock-vs-vLLM + the Apple M4 architectural note), the dev notes (bit-exact methodology, the per-path gate, the MoE-parity conclusion, the rejected/flat levers, the opt-in bf16-SSM mode), arch+quant generality, the pin + canary maintenance policy, and the published NVFP4 gallery models. Delete the consolidated-away dev trail. Keep the three operational docs the README links to: PIN_SYNC_c299a92c.md (canary reference), PAGED_BITEXACT_NOTE.md (per-path gate reference) and LOCALAI_LLAMACPP_BACKEND_PLAN.md (the ship-as-own-backend design-of-record), plus the benchmark plots + csv. The .patch files and the unit/bench .cpp are untouched. Repoint every external reference to a deleted doc at the new README: grpc-server.cpp, docs/content/features/backends.md, gallery/index.yaml, the canary apply script (PIN_BUMP_APPLY_CHECK.md -> README), and the base patches/README.md (ADDITIVE_DESIGN.md -> README). The canary's PIN_SYNC reference still resolves; its inert SSM_DECODE_FIX_RESULTS.md glob (a patch-internal path matcher, not a repo-doc link) is left intact. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 09:25:47 +00:00
Ettore Di Giacinto	a5a5b2ad80	feat(paged): bump llama.cpp pin 9d5d882d -> c299a92c (bit-exact verified) Advance the paged-attention backend's owned llama.cpp pin by 23 upstream commits. The shipped source-only patch series (0001-0030, 28 patches) applies strict-clean (git apply, exit 0) on a fresh c299a92c checkout with no re-export needed, and the bit-exact gate is GREEN on every path on GB10 (CUDA sm_121): - md5 greedy decode (-ngl 99 -fa on -n 48 --temp 0 --seed 1): dense non-paged/paged 5951a5b4, MoE non-paged 07db32c2, MoE paged 8cb0ce23; all match the established baselines. - test-backend-ops CUDA0: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16, SSM_CONV_UPDATE_IDS 16/16, GATED_DELTA_NET 84/84, MUL_MAT 1146/1146, MUL_MAT_ID 806/806; all OK. The 23-commit upstream jump did not change our decode output. The .patch files are kept byte-identical (they already apply strict-clean at the new pin); only the pin, the PIN_SYNC evidence doc, and the canary/gallery doc references change. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 08:57:33 +00:00
Ettore Di Giacinto	7e1832b868	fix(paged): strip stray dev-doc hunks so patch series applies on a clean checkout The shipped from-patches build applies the paged series with strict `git apply` (backend/cpp/llama-cpp/Makefile `llama.cpp` target: `git apply --verbose "$p" || { ...; exit 1; }`), which is atomic: a hunk against a file missing from the tree rejects the whole patch and fails the build. Four patches carried hunks against dev-only docs that live in the DGX dev tree but are absent from a clean ggml-org/llama.cpp checkout, so the build only succeeded on the DGX and FAILED on CI / any clean checkout: 0019 -> SSM_DECODE_FIX_RESULTS.md (modify hunk = the root reject) 0020 -> LEVER1_OPROJ_MMQ_RESULTS.md (create) 0021 -> CONV_STATE_FUSION_RESULTS.md (create) 0028 -> LEVER1_GATHER_PROGRESS.md, LEVER1_GATHER_RESULTS.md (create) 0019's reject cascaded to 0021/0022/0026/0028 (which build on 0019's code). Strip each `diff --git a/<devdoc>` section plus its diffstat line, `create mode` trailer, and correct the summary count. Every llama.cpp SOURCE hunk is left byte-identical (verified by sha256 of each patch's source-diff tail). Verified on a fresh clone of ggml-org/llama.cpp at the pin 9d5d882d: BEFORE, strict `git apply` failed at 0019 (cascade 0019/0021/0022/0026/0028); AFTER, the full series 0001-0030 applies with exit 0 (sentinel created, zero stray docs). The tolerant `patch -p1` fallback in prepare.sh also applies with zero rejects. PIN_SYNC_9d5d882d.md documents the durable fix: re-exports/pin-syncs must keep patches source-only (export with a source pathspec / `:!*.md`, gate with a strict `git apply` on a clean checkout). The upcoming c299a92c pin-bump re-export must produce source-only patches too. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 08:39:27 +00:00
Ettore Di Giacinto	2bee7a5ab1	ci(paged): add early-warning canary for vendored llama.cpp paged patches The paged backend (backend/cpp/llama-cpp-localai-paged) pins its own verified llama.cpp tip and is excluded from the nightly auto-bumper so a naive bump can never silently break the shipped build. That exclusion also removed the early warning of upstream drift. This restores the signal without touching the pin. Add .github/workflows/llama-cpp-paged-canary.yml (weekly + workflow_dispatch): - apply-check job (ubuntu-latest, toolchain-free): resolve the latest ggml-org/llama.cpp master tip, shallow-checkout it, and apply the full paged series 0001-0030 in order with the build's own git-apply method via the new shared helper .github/scripts/paged-canary-apply.sh. Red on any apply break. - compile job (needs apply-check): on the exact tip it validated, build the paged backend (cublas) inside the same base-grpc-cuda-12 toolchain and the same `make grpc-server` target the shipped build uses, so a red means upstream drift, not toolchain noise. nvcc compiles the kernels with no GPU present. Red here = run a PIN_SYNC (rebase + bit-exact gate + re-export), then bump the paged Makefile pin. The canary is signal-only: it opens no PR and never moves the pin, so the shipped build and the dep-bump PRs stay green regardless. It is fully separate from bump_deps. The lone pre-existing quirk in the series (patch 0019 carries a stray modify hunk against the dev-only doc SSM_DECODE_FIX_RESULTS.md, absent from any clean upstream checkout; git apply is atomic so it rejects the whole patch and cascades to 0021/0022/0026/0028) is handled path-scoped: the helper excludes only that dev-doc and still applies 0019's real code hunks atomically, mirroring prepare.sh's tolerance, so the quirk never false-positives the canary but a genuine code break in 0019 still turns it red. 
Point the existing pin comments in backend/cpp/llama-cpp-localai-paged/Makefile and .github/workflows/bump_deps.yaml at this canary as the drift signal, and document it in the PIN_SYNC doc: canary red -> do a pin-sync. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 08:29:09 +00:00
Ettore Di Giacinto	e160041f05	chore(paged): decouple paged llama.cpp pin from the nightly auto-bumper The llama-cpp-localai-paged backend reused backend/cpp/llama-cpp's LLAMA_VERSION, which .github/workflows/bump_deps.yaml auto-bumps nightly to the latest ggml-org/llama.cpp master tip. The stock backend is patch-free so that bump is safe, but the paged backend applies a vendored patch series (backend/cpp/llama-cpp/patches/paged/) hand-verified bit-exact against ONE specific tip. A naive bump moves the tip out from under the patches and breaks 'git apply' at build time - a dep-bump PR would go red (or, worse, the break surfaces later in a release build). Mirror the turboquant precedent: give the paged wrapper its OWN LLAMA_VERSION pin (the verified 9d5d882d) and force it into every copied build via LLAMA_VERSION=$(LLAMA_VERSION), so the nightly stock bump no longer drags the paged build to an unverified tip. Unlike turboquant (whose fork branch carries the patches and is safe to auto-bump), the paged series is vendored, so it gets NO bump_deps.yaml entry: it is advanced only by the manual PIN_SYNC process. Add cross-referencing comments in both Makefiles and bump_deps.yaml. Also add PIN_BUMP_APPLY_CHECK.md: an apply-feasibility report for the latest tip (c299a92c, 23 commits ahead). The full series applies CLEAN under 'git apply' with only benign line offsets and zero conflicts; the lone failure (0019) is a pre-existing stray dev-doc hunk, identical on the current pin, not a bump regression. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 08:02:37 +00:00
Ettore Di Giacinto	400930db19	Merge remote-tracking branch 'origin/master' into worktree-feat+paged-attention # Conflicts: # gallery/index.yaml	2026-06-27 07:48:49 +00:00
LocalAI [bot]	0258f8af55	fix(backends): repair release CI build/test breaks (kokoros, fish-speech, llama-cpp-quantization, sglang) (#10547) * fix(kokoros): implement new Backend RPCs to fix the build The backend.proto grew six RPCs (SoundDetection, Depth, TokenClassify, Score and the bidi-streaming Forward) that the kokoros gRPC service never implemented, so the trait impl no longer satisfies `Backend`: error[E0046]: not all trait items implemented, missing: `sound_detection`, `depth`, `token_classify`, `score`, `ForwardStream`, `forward` kokoros is a TTS backend with no use for these, so add `unimplemented` stubs (plus the `ForwardStream` associated type) matching the existing pattern for every other unsupported RPC in this file. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(fish-speech): add setuptools-rust for the editable source install install.sh installs the fish-speech source tree editable with `--no-build-isolation`, which means the build backends of its transitive dependencies must already be present in the venv. One of them builds a Rust extension and its metadata step fails with: ModuleNotFoundError: No module named 'setuptools_rust' Add setuptools-rust to requirements.txt so installRequirements provisions it before the editable install runs. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(llama-cpp-quantization): vendor convert_hf_to_gguf.py with conversion/ Upstream llama.cpp split the model-specific logic out of the single convert_hf_to_gguf.py file into a sibling `conversion/` package, so the script now starts with `from conversion import ...`. Downloading just the one file therefore fails at runtime with: ModuleNotFoundError: No module named 'conversion' Clone the repo (reusing the clone already needed to build llama-quantize) and copy both the script and the `conversion/` package into the backend dir. 
Python puts the script's own directory on sys.path[0], so the package resolves when it sits beside the script. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(sglang): pin the CPU source build to sglang v0.5.11 The CPU profile builds sgl-kernel from a `git clone` of sglang with no ref, so it always tracks master. Recent master added CPU kernels (e.g. mamba/fla.cpp) that fail to compile in our builder: constexpr variable 'scale' must be initialized by a constant static library kineto_LIBRARY-NOTFOUND not found Pin the clone to v0.5.11, the same release the GPU path already floors on (requirements-cublas12-after.txt). Overridable via SGLANG_VERSION so the pin can be bumped deliberately. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 09:42:22 +02:00
Ettore Di Giacinto	202a29f980	feat(paged): Metal/darwin build availability for llama-cpp-localai-paged Close the single build-targeting gap the cross-arch audit (ARCH_GENERALITY_AUDIT.md section 6, item 2) flagged: the paged backend had no Metal/darwin variant and no metal: capability key, so a Mac user selecting llama-cpp-localai-paged fell back to default=cpu (a Linux image) that does not run, with no fallthrough to stock llama-cpp. Mirror exactly how stock llama-cpp does darwin: - .github/backend-matrix.yml: add the includeDarwin row (-metal-darwin-arm64-llama-cpp-localai-paged, arch arm64, lang go) next to the stock llama-cpp darwin row. - backend/index.yaml: add the metal: capability key to the llama-cpp-localai-paged meta-backend plus the metal-llama-cpp-localai-paged and -development variant entries (URIs match the matrix tag-suffix); add Metal to tags. - scripts/build/llama-cpp-localai-paged-darwin.sh: new bespoke darwin build, a line-for-line mirror of llama-cpp-darwin.sh swapping the paged wrapper dir, binary names, ggml-shared-libs dir and output tar. Same CPU_ALL_VARIANTS + Metal path (GGML_METAL=ON via the reused llama-cpp Makefile when OS=Darwin; --target ggml pulls in ggml-metal via add_dependencies) with LLAMA_PAGED=on. - Makefile: add backends/llama-cpp-localai-paged-darwin target (+ .NOTPARALLEL). - .github/workflows/backend_build_darwin.yml: give the paged backend the same bespoke darwin build step as stock llama-cpp, share the llama ccache restore (save stays stock-only to avoid a same-run key collision), and exclude it from the generic build-darwin-go-backend step. - scripts/changed-backends.js: comment-only - the paged darwin path mapping was already present (forward-looking); update the stale "if a metal row is ever added" note now that the row exists. 
Metal delivers paged-KV only (NVFP4 FP4-MMA is CUDA/Blackwell-only); the GDN/conv fused ops have no Metal kernel, so a gated-DeltaNet (qwen35) model falls back to the CPU reference op at runtime - made SAFE by the fused-op backend gate (patch 0030). This is config; the Metal build runs in CI on the next push and is runtime-tested on the M4 Mac. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 07:42:08 +00:00
Ettore Di Giacinto	621a20d2b5	feat(paged): backend-gate fused GDN/discriminated SSM_CONV emission (patch 0030) Closes audit RISKY-1 (the one latent silent-miscompute hazard). The fused/in-place Gated Delta Net op (0018/0019/0026) and the discriminated SSM_CONV decode op (0021/0028, which REUSE GGML_OP_SSM_CONV / GGML_OP_GATED_DELTA_NET via a non-null src[3]/src[4] discriminator) are CUDA+CPU-only but were emitted DEFAULT-ON (cparams.fused_gdn_ar/ch=true, auto_fgdn=true) with no backend guard. A backend that supports plain SSM_CONV but ignores the discriminator (Vulkan/SYCL/Metal) would run the wrong plain conv => silent corruption. Fix: in llama_context::sched_reserve(), before the auto_fgdn resolution, force fused_gdn_ar = fused_gdn_ch = auto_fgdn = false when any non-CPU compute backend is not CUDA-family (reg name not "CUDA"/"ROCm"/"MUSA"). Every emission site keys off these flags, so the graph falls back to the upstream non-fused plain ggml_ssm_conv + ggml_silu path that every backend handles. On CUDA the reg name is "CUDA", the flags are left untouched, and the decode graph is byte-identical. Mirror of DGX paged patch 0030; adds FUSED_OP_BACKEND_GATE_RESULTS.md. Verified GPU-free: reconstructed pin 9d5d882d + paged 0001-0029 + 0030, CPU-only build (GGML_CUDA=OFF) of libllama + test-backend-ops links with 0 errors; 0030 applies cleanly via git apply and patch -p1. test-backend-ops correctness for SSM_CONV/SSM_CONV_UPDATE(_IDS)/GATED_DELTA_NET is CUDA0-vs-CPU (pending DGX, tunnel offline this session); registered test cases will exercise it. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 07:32:49 +00:00
Ettore Di Giacinto	af6e133759	docs(paged): cross-arch synthesis - ship verdict + minimum non-Blackwell fixes Synthesizes the four ARCH_GENERALITY_AUDIT sections (build-matrix, gguf-gallery-targeting, optimization-generality, patch-arch-safety) into a single cross-arch ship decision: build-safety table per target, every patch bucketed (SAFE-EVERYWHERE / BLACKWELL-ONLY-clean-fallback / GB10-TUNED / RISKY), the NVFP4 gallery recommendation, a per-arch roadmap ranked by value/effort, the empirical-verification matrix (GB10 + M4 cover all but non-Blackwell NVIDIA), and the ship verdict. Verdict: SAFE to ship as Blackwell/Linux today; the build is arch-general (no GB10 pin; FP4 code is default-off + #if-guarded) and NVFP4 GGUFs run everywhere via dequant. The one hard prerequisite before extending paged to Metal/Vulkan/SYCL is closing the backend-ungated, default-on fused GDN/conv op emission (discriminated GGML_OP_SSM_CONV via non-null src[3], CUDA+CPU only, no supports_op guard) - latent on current Linux targets, silent miscompute on a future non-CUDA paged build of a gated-DeltaNet model. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 07:07:20 +00:00
Ettore Di Giacinto	87cfd1fadb	docs(paged): quant-generality audit - SSM/serving opts are quant-agnostic (0013-0029) Source-verify each paged decode optimization as quant-agnostic (operates on the f32 gated-DeltaNet/conv recurrent state, the paged serving host path, or the generic MMQ/CUDA-graph routing) vs NVFP4-specific (only fires inside the use_native_fp4 / GGML_TYPE_NVFP4 branch). Findings: 14 of 16 landed patches are quant-agnostic (0013/0014/0015/0016/0018/ 0019/0020/0021/0022/0024/0025/0026/0028/0029). Only 0023 (MoE FP4 act-quant de-dup, inside use_native_fp4) is NVFP4-specific; 0017 is NVFP4-only but default-off/inert (kill-gate, no win). Corrects the hypothesis on 0025: the actual patch is the MUL_MAT_ID CUDA-graph guard relaxation gated on ggml_is_quantized + ggml_cuda_should_use_mmq (the generic quantized grouped-MMQ path), NOT NVFP4. The NVFP4-specific act-quant / quantize_mmq_nvfp4 work is LEVER 3, which was a measurement STOP and never landed (no patch); LEVER 4 (NVFP4 projections) KL-failed and never shipped. Adds the relative-impact-by-quant estimate (fixed f32-recurrence/host ms is the largest step fraction at NVFP4, shrinks at Q8/bf16 as the weight read grows) and the A/B plan to prove generality on a Q4_K_M requant of the same Qwen3.6 (build the control first, md5/KLD bit-exact gate per path, decode_agg npl 32/128, with 0023 as the NVFP4-only negative control). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 07:06:14 +00:00
Ettore Di Giacinto	2a2de1d6c1	docs(paged): patch-arch-safety classification for patches 0018-0029 Build-break / miscompile audit of the paged patch series. Classifies each patch general/Blackwell-gated/risky, records the only conditional arch surface (0017, fully #if-gated + default-off), and gives the per-target build-safety verdict (sm_80-90 CUDA / sm_100 / Metal-not-a-target / CPU / ROCm-SYCL-Vulkan). Flags the one latent silent-correctness hazard: fused GDN/conv ops reuse GGML_OP_SSM_CONV via a src discriminator with CUDA+CPU-only kernels and backend-ungated emission. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 07:04:05 +00:00
Ettore Di Giacinto	5667dfe461	docs(paged): arch-generality audit - optimization classification (0017-0029) Classify the paged-attention optimizations as arch-GENERAL (ship everywhere), GB10-TUNED (per-arch retune), or Blackwell-precision-specific; add the per-arch expected story (sm_100/Hopper/Ada/Metal/CPU) and the SAFETY gap (fused GDN/conv ops are CUDA+CPU-only with backend-ungated emission). Extends the prior build/gallery-targeting audit in the same file. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 07:02:54 +00:00
Ettore Di Giacinto	34abf392fc	docs(paged): ARCH audit - NVFP4 GGUF off-Blackwell portability + gallery targeting gap NVFP4 (GGML_TYPE_NVFP4=40 / MOSTLY_NVFP4) GGUFs are portable: full CPU/CUDA-DP4A/ generic-MMA/Vulkan dequant coverage. FP4-MMA is a runtime Blackwell-only speed tier (mmq.cu use_native_fp4 flag), not a load/run gate. Off-Blackwell = runs via dequant, correct-but-slow, never fail/garbage. Gallery has no microarch-gating primitive (tags are search-only, capabilities map is family-level nvidia/amd/ metal, model struct has no hardware field), so the 6 -paged entries can only express Blackwell-targeting via description prose + tags. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 07:00:34 +00:00
Ettore Di Giacinto	683e22500f	docs(paged): arch-generality audit - build-targeting (CUDA arch fan + variants + Metal gap) The llama-cpp-localai-paged backend sets NO explicit CUDA arch list anywhere (CUDA_DOCKER_ARCH empty in every matrix row; compile.sh only injects -DCMAKE_CUDA_ARCHITECTURES when non-empty), so it compiles the full upstream ggml default arch fan - bit-identical targeting to stock llama-cpp, NOT Blackwell-only. NVFP4 FP4-MMA is gated inside the kernel by BLACKWELL_MMA_AVAILABLE, not by the build matrix, so the binary is arch-portable. Variants: CUDA 12/13 + l4t arm64, ROCm, SYCL f32/f16, Vulkan amd64/arm64, CPU amd64/arm64 (CPU_ALL_VARIANTS) - same Linux set as stock llama-cpp, not CUDA-only. Single gap vs stock: NO Metal/Darwin row in includeDarwin and NO metal: capability key in the meta-backend. macOS hosts fall back to the default cpu (Linux) image, which will not run, and do not auto-fall to stock llama-cpp. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 06:59:36 +00:00
Ettore Di Giacinto	db6ebc53b2	feat(paged): block-table within-step host cache (patch 0029) Mirror of paged-dev commit e2acb3b (lever 5). get_block_table() is recomputed once per full-attention layer per decode step, but the KV cell layout is fixed for the whole step (it only changes in apply()). This caches the table the first time it is built in a step and memcpy-reuses the identical bytes for the rest, invalidating in apply(). Bit-exact; toggle off with LLAMA_PAGED_NO_BT_CACHE=1. Host-side get_block_table time (llama-batched-bench, npp128 ntg128 npl128, cache OFF -> ON): MoE 112.94 -> 14.82 ms (-87%), dense 193.78 -> 16.90 ms (-91%). Dense decode is partly host-bound and gains (TG 364.8 -> 374.7 t/s, ~96% of the vLLM 391 t/s @npl128 reference); MoE decode is compute-bound (FP4 GEMM) so the saved host time is off the critical path and MoE TG is flat. Details in LEVER5_HOSTPIPE_RESULTS.md. Also records the per-path bit-exactness gate (PAGED_BITEXACT_NOTE.md): the paged-MoE greedy md5 (8cb0ce23) differs from the non-paged md5 (07db32c2) by a benign FP-accumulation-order difference of the paged attention reduction, not a bug. KL-validated vs the f16 reference (16 chunks, c512): KLD(paged||f16) = 0.13600 <= KLD(nonpaged||f16) = 0.13660, PPL(paged) = 7.4009 ~ PPL(nonpaged) = 7.3896 (within +/- 0.29). Canonical references are now per path: non-paged MoE 07db32c2 and paged MoE 8cb0ce23; dense is bit-exact across paths (5951a5b4). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 01:47:08 +00:00
Ettore Di Giacinto	9b0e4e544c	docs(paged): residual-assess FINAL - MoE at bit-exact ceiling, hunt DONE Conclude the MoE-parity hunt. The two remaining sub-levers in the 20.3-vs-13.8 ms projection bucket are both bit-changing or at the BW floor: - convert-glue (3.24 ms/step, measured: 1.73 input f32->bf16 + 1.52 output bf16->f32): NOT bit-exact eliminable. ggml-cuda.cu:1663-1690 rounds the f32 GEMM accumulator to bf16 (CUDA_R_16BF dst) then widens to f32; that bf16-rounded value is load-bearing for the shipped md5. Removing the round-trip (f32-direct output, bf16 residual stream, or NVFP4 weights) all rebaseline md5. A precision boundary, like lever 4. - bf16 projection GEMM (17.27 ms/step): BW-bound at the LPDDR5x floor (~4.7 GB/step at 273 GB/s; M=128 -> 128 FLOP/byte vs >900 ridge). nvjet already TMA-streams the weights; cutlass reads the same bytes. No kernel lever; only fewer bytes (quantize) helps - rejected on quality. Corrects the body premise that vLLM runs these projections as NVFP4-Marlin: vLLM runs the same nvidia-modelopt checkpoint that keeps them BF16, so the projection bucket is a matched-precision gap, not a quant gap. Realistic bit-exact MoE ceiling ~86-88% of vLLM; shipped lever 1 (86.3%) is at it. No one-more-lever for MoE. Only clean win left is DENSE (+0.41% lever 5), gated behind resolving the paged-MoE baseline md5 drift. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 01:05:06 +00:00
LocalAI [bot]	f0d0bff232	fix(llama-cpp): stop reinterpreting plain-string message content as JSON (#10524) (#10538) The llama-cpp gRPC backend reconstructs OpenAI messages from proto for the tokenizer-template path and blindly json::parse'd each message's content string. LocalAI's Go layer always flattens content to a plain string, so a user prompt that merely looks like JSON (e.g. mealie's ingredient array ["1/4 cup brown sugar", ...]) was reinterpreted as structured content parts and rejected by oaicompat_chat_params_parse with "unsupported content[].type". Normalize content per role instead: user/system/developer content is opaque text and is never JSON-sniffed; assistant/tool content still collapses a literal JSON null/object (tool-call bookkeeping) to a string, but a plain string is never turned into an array/scalar. The array defense is role-independent, so the role gate only governs the benign null/object case. While here, extract the duplicated per-message reconstruction and the pre-template content sanitization into shared, unit-tested helpers (message_content.h) so the streaming (PredictStream) and non-streaming (Predict) paths cannot drift. This removes ~490 lines of copy-pasted defensive code, the dead tool-role parse branches, and the redundant Predict-only tool_calls branch, while preserving the prior #7324 (null content -> "") and #7528 (tool array content -> string) fixes. Tests: - backend/cpp/llama-cpp/message_content_test.cpp: standalone C++ unit tests for all three helpers (#10524, #7324, #7528, multimodal), discovered and run by `make test-backend-cpp` and a new generic tests-backend-cpp CI job. Also wired as an opt-in CMake/ctest target (-DLLAMA_GRPC_BUILD_TESTS=ON). - core/schema/message_test.go: Go regression pinning that ToProto flattens a JSON-array-looking text part to the verbatim string. - prepare.sh now copies message_content.h into the build tree. 
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 01:42:05 +02:00
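Annotation on f0d0bff232: the per-role rule described in the message above can be sketched as follows. This is a minimal Python illustration, not the C++ helper in message_content.h. The function name `normalize_content` and the choice to re-serialize a JSON object with `json.dumps` are assumptions; the commit only says that null/object content "collapses to a string".

```python
import json

# Roles whose content is opaque text and is never JSON-sniffed (per the commit message).
OPAQUE_ROLES = {"user", "system", "developer"}


def normalize_content(role: str, content: str) -> str:
    """Return the plain-string content to pass on for one message (hypothetical helper)."""
    if role in OPAQUE_ROLES:
        return content  # a prompt that merely looks like JSON stays verbatim
    try:
        parsed = json.loads(content)
    except ValueError:
        return content  # not JSON at all: ordinary assistant/tool text
    if parsed is None:
        return ""  # literal JSON null -> "" (the #7324 behaviour)
    if isinstance(parsed, dict):
        # Assumption: an object is collapsed to its serialized text.
        return json.dumps(parsed)
    # Arrays and scalars are never promoted to structured content,
    # regardless of role: keep the original string.
    return content
```

With this rule the mealie-style prompt `["1/4 cup brown sugar"]` sent as a user message comes back unchanged instead of being parsed into content parts.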
Ettore Di Giacinto	e3f8149f3b	docs(paged): lever-4 KL-gate FAIL - NVFP4 MoE projections cost ~6% PPL, no-ship Re-quantizing the MoE GGUF's bf16 GDN/attn projections to NVFP4 (the lever-4 scope hypothesis) fails the KL gate on every axis vs the shipping NVFP4 baseline: PPL +6.51% (FULL) / +6.15% (CONS) against a <1% gate, mean KLD-to-f16 0.164/0.172 vs baseline 0.137, top-1 argmax agreement down ~2.2-2.6 points. Both projq variants rejected; in_proj_ba being kept bf16 (CONS) recovered almost nothing, so the damage is in the bulk attn/GDN projections. Root cause: the bf16 projections are a deliberate modelopt precision choice, not a provenance accident. vLLM runs the same modelopt checkpoint, so it keeps these projections bf16 too - the baseline GGUF already matches vLLM. The ~20.3ms projection-GEMM bucket is the price of high-precision projections that vLLM also pays; it is not the llama-vs-vLLM lever it appeared to be. The speed win is only purchasable with a 6% PPL regression. MoE stays at 86.3% of vLLM @ npl128. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 23:36:38 +00:00
Ettore Di Giacinto	9a1be79f04	docs(paged): lever-4 scope - NVFP4 the still-bf16 MoE GDN/attn projections (+6.5ms bucket) Scopes lever 4 (read-only, no GPU) on top of the flat levers 2+3. Root cause: the MoE GGUF (nvidia modelopt, 241 NVFP4 tensors) quantized only the experts and left the GDN/attn linear projections in BF16, while the dense GGUF (unsloth, 304 NVFP4 tensors) already has them NVFP4 (proven: dense ssm_out runs FP4 MMQ; dense decode at 96.6% of vLLM). Lever 4 = re-quantize the MoE GGUF's bf16 GDN/attn projections to NVFP4, the same move vLLM makes on the identical weights - the +6.5ms projections bucket, the largest single banked MoE gain available. Path: offline re-quantize to a new GGUF variant (expanded --tensor-type); zero kernel code - the loader sidecar-scale path + tuned mul_mat_q<NVFP4> are already in tree and proven by the dense GGUF. Bit-changing => KL-gate, not md5. KL expected to pass (per-step non-accumulating weight quant, unlike the failed bf16-state; experts already W4A4-clean); lm_head is the one risky tensor (gate on argmax-agreement). Expected ~+4-6.5ms => MoE 86.3% -> ~88-91% of vLLM. Recommend a separate OPT-IN gallery variant (preserve the bit-exact default; promote to default only if the KL gate is clean). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 22:57:05 +00:00
LocalAI [bot]	2c96c2d08e	chore: ⬆️ Update mudler/parakeet.cpp to `f469a57270a1cc4554acb15febf60e56619673b9` (#10530) ⬆️ Update mudler/parakeet.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-27 00:50:51 +02:00
Ettore Di Giacinto	62c407ed55	docs(paged): lever1 gather-fusion bench landed - checkpoint + attribution (patch 0028) Anchors the rigorous same-session A/B validation of patch 0028 (residual conv-state tap k_get_rows fusion) on this worktree branch with sign-off attribution. The regenerated 0028 patch + bench-updated LEVER1_GATHER_RESULTS.md first landed via a concurrent origin/master merge (`c1f1d1e8e`) that swept the staged files; this records the provenance and the bench summary in the checkpoint. Gate (bit-exact, greedy --temp 0 --seed 1 -n 48): dense q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439, MoE q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd (both == baseline; base == lever1). decode_agg npl128: dense 369.95 -> 377.83 t/s (+2.13%, 96.6% of vLLM), MoE 763.47 -> 777.95 t/s (+1.90%, 86.3% of vLLM). nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920), 358.37 -> 133.52 ms, step -3.13 ms. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 21:41:45 +00:00
Ettore Di Giacinto	c1f1d1e8ea	Merge remote-tracking branch 'origin/master' into worktree-feat+paged-attention # Conflicts: # gallery/index.yaml	2026-06-26 21:38:56 +00:00
Ettore Di Giacinto	79edfd26a3	feat(gallery): -paged suffix rename + qwopus NVFP4-MTP paged variants Rename the two base NVFP4 entries to a consistent -paged suffix (qwen3.6-27b-nvfp4 -> qwen3.6-27b-nvfp4-paged, qwen3.6-35b-a3b-nvfp4 -> qwen3.6-35b-a3b-nvfp4-paged) so all four base/MTP paged entries share the naming convention. Update the two matching examples in the backend plan doc. Add qwopus3.6-27b-v2-mtp-nvfp4-paged and qwopus3.6-27b-coder-mtp-nvfp4-paged: verbatim copies of the stock qwopus NVFP4-MTP entries (same GGUF uri/sha256, sampling, template, tags, function block) rewired onto the LocalAI paged-attention stack (backend llama-cpp-localai-paged; f16, flash_attention, 131072 context, 99 gpu_layers, batch 512; paged_kv + max_batch_tokens:512 + kv_unified:false + parallel:128). The stock entries are left untouched. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 21:26:14 +00:00
Ettore Di Giacinto	b1667b48ea	feat(paged): qwen35 recurrent-state gather fusion (patch 0028) Fuse the residual k_get_rows_float in the gated-DeltaNet decode path (the biggest single kernel vLLM lacks per MOE_GAP_VS_VLLM.md, ~5.2 ms/step MoE). 0019 fused the SSM-state gather, 0021 fused the conv compute but kept a build_rs gather for the conv taps; nsys located that conv-state tap gather (n_embd_r=24576 floats x 128 seqs, ~720 x ~115 us per 24-step window) as the last k_get_rows in the GDN path. New op ggml_ssm_conv_update_inplace_ids reads each sequence's prior conv taps from cache[ids[s]] in-kernel (identity in place from the write slot, non-identity via a disjoint scratch), mirroring the 0019 in-place + ids fusion. Bit-exact: read VALUES unchanged, only the read path changes. Helps both dense and MoE (shared GDN conv). GATE test-backend-ops (CUDA0 2/2): SSM_CONV_UPDATE_IDS, SSM_CONV_UPDATE, SSM_CONV, GATED_DELTA_NET, GET_ROWS all PASS. GATE greedy md5 (-temp 0 -seed 1 -n 48) BYTE-IDENTICAL both models: q36-27b-nvfp4 5951a5b4..., q36-35b-a3b-nvfp4 07db32c2... nsys: k_get_rows<float,float> 10174 -> 9454 instances, 186.3 -> 102.8 ms (720 conv gathers eliminated, replaced by a ~1.1 us no-op gather). Built and gated on the DGX llama tree (branch paged, commit 944636c, f32 default). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 20:59:59 +00:00
LocalAI [bot]	17c1fc74b2	fix(backends): darwin packaging for silero-vad (last Linux-only Go backend) (#10528) fix(backends): darwin packaging for silero-vad silero-vad was the last Go backend with Linux-only darwin packaging: - package.sh fell through to "Could not detect architecture" -> exit 1 on macOS (no Darwin branch), so its darwin image never packaged. - run.sh exported LD_LIBRARY_PATH, which macOS dyld ignores, so the bundled libonnxruntime.dylib couldn't be found at runtime. Add a Darwin branch to package.sh (skip the glibc/ld.so bundling; add an @loader_path/lib rpath so @rpath resolves to package/lib/) and a DYLD_LIBRARY_PATH branch to run.sh — mirroring the piper darwin fix (#10525). Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 22:31:06 +02:00
Ettore Di Giacinto	6c6a925213	docs(paged): MoE-vs-vLLM DECIDE synthesis - reject W4A16 Marlin, the GEMM is a llama win Cross-agent synthesis on top of the both-engine nsys decomposition (`3b5957157`): settle the user's "can we do what vLLM does on MoE?" question with the three converging investigations (groundtruth measurement + vllm-marlin source-read + marlin-port feasibility). Verdict: vLLM's ~15% MoE-decode lead is NOT the Marlin GEMM (that bucket is a -1.7 ms llama WIN: native FP4-MMA W4A4 47.3 vs Marlin W4A16 50.0 at the ragged tiny-M decode shape, both at the LPDDR5x BW floor). The gap is bf16 dense-projection bandwidth (+6.5), recurrence state-gather plumbing (+6.6, led by k_get_rows 5.2), graph/stream-overlap overhead (~+7), W4A4 act-quant tax (+3.3), and router/glue (+5.4). A W4A16/Marlin grouped MoE GEMM is REJECTED (default and opt-in): it would regress the 27% GEMM bucket to half-rate bf16 MMA, re-enter the GB10 occupancy wall the dense scaffold already STOPPED at, and its entire intrinsic upside is the ~2% act-quant tax - smaller than the bit-exact +1.9% the 0025 re-graph already banked, and closeable bit-exactly by fusing the act-quant. Recommended build (none a new MoE GEMM): (1) fuse the k_get_rows SSM-state gather (bit-exact, ~+5, biggest single-kernel win); (2) extend CUDA-graph coverage + stream overlap (bit-exact, ~+7); (3) fuse the W4A4 act-quant into RMSNorm/SiLU (bit-exact, +3.3); (4) NVFP4-quantize the still-bf16 GDN/attn projections + lm_head (bit-changing, +6.5, the same NVFP4-dense-quant move vLLM makes). Bit-exact levers alone reach ~94% of vLLM; with the projection quant ~96-97%, parity-or-better physically in reach since both heaviest kernels (SSM core, MoE GEMM) are already llama wins. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 20:14:30 +00:00
Ettore Di Giacinto	3b59571579	docs(paged): both-engine MoE decode decomposition - the 15% is NOT the Marlin GEMM Ground-truth side-by-side per-kernel ms/step of the MoE decode gap on DGX GB10. llama (752 t/s, step 169.8ms) vs vLLM graphs-on (901-equiv, step 142.0ms): 27.8ms gap. Headline: the grouped MoE-expert GEMM is a llama WIN - native FP4-MMA W4A4 47.3ms vs vLLM Marlin W4A16 50.0ms at the tiny-M decode shape. A Marlin-style W4A16 MoE GEMM would be slower; it is not the lever (extends the w4a16-marlin DENSE verdict). The 15% lives elsewhere: bf16 projections + convert glue (+6.5ms), recurrence state-gather plumbing (+6.6ms, led by k_get_rows 5.2ms), graph coverage + stream overlap (~+7ms), W4A4 act-quant tax (+3.3ms), router/glue (+5.4ms). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 20:11:40 +00:00
Ettore Di Giacinto	b3d3323105	feat(paged): wire ssm_bf16_tau model option for hybrid SSM-state fast mode Patch 0026 added the hybrid per-head bf16 SSM-state opt-in as the ssm_hybrid_tau_thresh cparam + the --ssm-bf16-tau CLI flag (default 0 = bit-exact f32). Expose it per-model via the LocalAI gallery/model YAML `options:` list, mirroring the paged_kv / max_batch_tokens setenv hooks. - grpc-server.cpp: new `ssm_bf16_tau` (alias `ssm_hybrid_tau`) option -> setenv(LLAMA_SSM_BF16_TAU) when the value parses to a positive float. It does NOT reference the paged-only common_params field, so the turboquant fork (which lacks patch 0026) stays byte-clean. - patch 0026 (common.cpp common_context_params_to_llama): getenv fallback feeds cparams.ssm_hybrid_tau_thresh from LLAMA_SSM_BF16_TAU only when the --ssm-bf16-tau CLI flag is unset (0). Absent/non-positive env => untouched, so stock stays bit-exact; the CLI flag takes precedence when set. - docs: backend/index.yaml note, docs backends.md, gallery header NOTE (referencing A_HYBRID_SSM_RESULTS.md; the 2 NVFP4 entries stay bit-exact). Byte-safe when unset: with no ssm_bf16_tau option the env is never touched and the default f32 bit-exact recurrence is preserved. Verified the parse + consume code paths with a standalone compile-and-run (option string -> LLAMA_SSM_BF16_TAU -> tau, plus 0 / garbage / CLI-precedence / unset cases). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 19:51:00 +00:00
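Annotation on b3d3323105: the option-to-env-to-cparam precedence described above can be sketched as follows. This is a Python illustration under stated assumptions, not the grpc-server.cpp or common.cpp code. The function names are hypothetical; only the env name LLAMA_SSM_BF16_TAU, the "positive float" condition, and the "CLI flag wins when set (non-zero)" rule come from the commit message.

```python
ENV_NAME = "LLAMA_SSM_BF16_TAU"


def option_to_env(value: str, env: dict) -> None:
    """Model-YAML option hook: export the env var only for a positive float."""
    try:
        tau = float(value)
    except ValueError:
        return  # garbage: leave the environment untouched
    if tau > 0:
        env[ENV_NAME] = value


def effective_tau(cli_tau: float, env: dict) -> float:
    """Resolve the threshold: CLI flag first, then env fallback, else 0 (f32, bit-exact)."""
    if cli_tau != 0.0:
        return cli_tau  # the CLI flag takes precedence when set
    raw = env.get(ENV_NAME)
    if raw is None:
        return 0.0  # unset: default recurrence is preserved
    try:
        tau = float(raw)
    except ValueError:
        return 0.0
    return tau if tau > 0 else 0.0
```

A plain dict stands in for the process environment so the sketch stays side-effect free.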
Ettore Di Giacinto	9c1c2a6a16	docs(paged): B-3 mmq_y-down warp-remap NEGATIVE - bit-exact MoE ceiling ~85% of vLLM B-3 (the 0017-deferred mmq_y-down warp-remap of the NVFP4 grouped FP4-MMA mul_mat_q) was built bit-exact on the clean 0025 base and measured: the grouped GEMM kernel itself runs -1.3% (occupancy did rise via the nwarps=4 warp-remap / 128 threads-per-CTA), but end-to-end MoE decode is FLAT (npl128 +0.4%, npl32 +0.3%, within noise) because the stream-k fixup grows +42% (mmq_y=64 doubles the row-tiles) and the step is SSM/BW-bound. md5 PASS both models, test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID 806/806 PASS. No patch 0028; DGX dev tree reverted to pristine 0025. Assessment: the bit-exact MoE GEMM/launch track is exhausted (B-1 re-graph banked ~82->85%; B-2 and B-3 are 0). Honest bit-exact MoE ceiling = ~85% of vLLM @npl128. The residual is the structural Marlin-NvFp4 grouped-GEMM gap that no bit-exact lever closes. Recommend shipping the ~85% bit-exact default and exposing the held 0026 bf16-SSM as a default-off opt-in (it reaches ~95% but is non-bit-exact and fails the MoE KL gate). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 19:10:24 +00:00
Ettore Di Giacinto	1f857f179e	docs(paged): B-2 down_proj act-quant retune RESULT - negative (no headroom) B-2 / M1 (SPEEDUP_HUNT rank #2): bit-exact block/grid/occupancy retune of quantize_mmq_nvfp4 (the MoE down_proj activation-quant, ~2% of the MoE decode step). Built+measured on a clean 0025 base (DGX GB10 sm_121), then reverted - it does not lift. Finding: the existing blockDim.x=128 is ALREADY the kernel-level optimum for quantize_mmq_nvfp4 on GB10. nsys (8193 invocations): block=128 total 117.4M ns is the fastest; 64 +8.7%, 192 +9.9%, 256 +6.9%. End-to-end MoE decode_agg is flat within 0.4% noise across all block sizes {32..256} (npl32 ~438, npl128 ~751 t/s). The act-quant is ~2% of a BW-bound step, so even a perfect kernel caps the win at ~2%, and 128 is already optimal => measured 0%. Same outcome as patch 0015 (M-tile) and 0017 (MINBLOCKS): no occupancy headroom on this 256-tiny-expert BW-bound model. Bit-exactness proven: md5 identical at block 64/128/256 for both models (the per-thread quant body is untouched; thread->output map is invariant to blockDim.x). Gate at default: dense 5951a5b4 == ref, MoE 07db32c2 == ref, MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS. MoE stays ~85% of vLLM @npl128 / ~87% @npl32 - still well below vLLM, so the remaining MoE lever is B-3 (mmq_y-down warp-remap on the grouped FP4 GEMM). No patch 0027; dev tree reverted to pristine 0025. Full data in B_MOE_RESULTS.md. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 18:31:51 +00:00
Ettore Di Giacinto	33dfe7fd41	feat(paged): qwen35 hybrid per-head f32/bf16 SSM state - carry fix + gate sweep (patch 0026) Regenerate patch 0026 with the hybrid-decode carry fix and record the KL/throughput gate-sweep results. Fix: clear(data=true) zeroes the whole recurrent buffer including the head_slot maps, which were uploaded only once at construction; after the post-warmup reset every head read head_slot==0 (f32-local-0), collapsing the split and producing incoherent decode. Persist head_slot_host and re-upload via upload_head_slots() after every buffer clear. Hybrid decode is now coherent and the cross-op state carry is byte-exact (write==read, both partitions). Gate result: de-risk PASS (test-backend-ops 84/84; T=0 md5 == 0023 baseline, both models). Ship gate FAILS - no T_thresh meets MeanKLD<1e-3 AND same-top-p>=99.5% with a meaningful speedup. The premise that the bf16 error concentrates in long-memory heads is refuted: KL scales with the bf16 head count and saturates ~0.06/~91% (MoE saturates at the minimal split). The carry is byte-exact, so this is genuine bf16 sensitivity, not a bug. The byte-saving lever is real (dense +12.4%, MoE +11.5% decode @npl128 at T=128) but cannot meet the strict KL bar. Shipped default-off (f32, bit-exact opt-out); hybrid is opt-in only and not recommended in the gallery config. Full tables in A_HYBRID_SSM_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 17:44:05 +00:00
Ettore Di Giacinto	fe5bd3f53d	feat(paged): qwen35 hybrid per-head f32/bf16 SSM state (patch 0026) Lever A patch + build/de-risk results. Splits the persisted gated-DeltaNet recurrent state per head: f32 on long-memory heads (where bf16 rounding does not contract and the KL error concentrates), bf16 on fast-decaying heads, classified at model load by tau_h = 1/(|ssm_a|*softplus(ssm_dt)). Default ssm_hybrid_tau_thresh = 0.0 keeps every head f32 (bit-exact opt-out). De-risk gates BOTH PASS: test-backend-ops GATED_DELTA_NET CUDA0 OK (incl 32 hybrid mixed CUDA-vs-CPU cases); default all-f32 greedy md5 == 0023 baseline both models (dense 5951a5b4d624ce891e22ab5fca9bc439, MoE 07db32c2bcb78d17a43ed18bc22705cd). Known open issue (opt-in hybrid only; default unaffected): hybrid-ON model decode (ids in-place path) is incoherent; classifier/cache/kernel-params verified correct, bug isolated to the ids in-place cross-step state path. See A_HYBRID_SSM_RESULTS.md. Not ready for the GateSweep until fixed. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code]	2026-06-26 16:21:33 +00:00
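Annotation on fe5bd3f53d: the per-head classifier tau_h = 1/(|ssm_a|*softplus(ssm_dt)) can be sketched as below. This is a Python illustration, not the patch 0026 code. It assumes one scalar ssm_a and ssm_dt per head, and it assumes the split rule "bf16 when tau_h is below the threshold". That rule is inferred from the message (long-memory heads stay f32, and a threshold of 0.0 keeps every head f32) and is not stated explicitly.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def head_tau(ssm_a: float, ssm_dt: float) -> float:
    """Memory timescale of one head: tau_h = 1 / (|ssm_a| * softplus(ssm_dt))."""
    return 1.0 / (abs(ssm_a) * softplus(ssm_dt))


def classify_heads(ssm_a, ssm_dt, tau_thresh=0.0):
    """Return 'f32' or 'bf16' per head (assumed rule: bf16 only if tau_h < tau_thresh)."""
    return [
        "bf16" if head_tau(a, dt) < tau_thresh else "f32"
        for a, dt in zip(ssm_a, ssm_dt)
    ]
```

Because tau_h is always positive, the default threshold of 0.0 never selects bf16, which matches the bit-exact opt-out described in the commit.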
Ettore Di Giacinto	6bfca146d6	docs(paged): speedup-hunt C section + final RANK + PLAN synthesis Append lever C (structural dense residual: lm_head + scheduling) findings and the master RANK + PLAN section to SPEEDUP_HUNT.md. Per-lever scorecard (gain x tractability x gate), ranked build order, the concrete A build plan for the hybrid per-head f32/bf16 SSM state cache, and the ordered B/C/D queue with each one's build trigger. Verdict: ship the MoE re-graph (patch 0025, measured +1.9-4.4%, both gates PASSED) now; build A as the lead (only lever ABOVE vLLM on dense, KL-gated, ~430-454 t/s = 103-108% of vLLM); bank B-2/B-3 on MoE; C last (<1% bit-exact, dead-end); D opt-in-only and dense-only behind the same KL gate bf16-SSM failed. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 14:56:53 +00:00
Ettore Di Giacinto	4d3fecd524	docs(paged): MoE decode re-graph lever (patch 0025) + speedup-hunt B findings Mirror of llama.cpp dev-tree patch 0025 (qwen35moe NVFP4 MoE-decode re-graph) and the GPU-agent B findings in SPEEDUP_HUNT.md: re-confirmed MoE decode decomposition @npl128, the measured re-graph lever (+4.4%/+2.9%/+1.9% decode_agg at npl 32/64/128; bit-exact: test-backend-ops MUL_MAT_ID 806/806 + parallel-greedy np16 byte-identical ON==OFF), grouped-GEMM occupancy headroom (exhausted on this bandwidth-bound model), and the W4A16 assessment (rejected: non-bit-exact, slower BF16 MMA). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 14:53:14 +00:00
Ettore Di Giacinto	ec7c1b1f68	feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 patches) The worktree merge bumped LLAMA_VERSION 8be759e6 -> 9d5d882d. This re-syncs the paged patch-stack (0001-0024) to the new tip: the stack was rebased onto 9d5d882d on the DGX dev tree, rebuilt clean (CUDA sm_121), and re-validated bit-exact before re-exporting the LocalAI .patch files. Re-exporting each shipped patch from its rebased commit and diffing body-to-body against the committed files identifies exactly 4 that changed and no longer git-apply to 9d5d882d: - 0008 cross-request prefix share: re-anchored the [paged 0008] commit block to the refactored update_slots() lambda (continue->return, batch.n_tokens-> batch.size()); identical env-guarded logic. - 0013 static prefill budget: budget var-block / while-gate / admission-break re-expressed against the refactored loop (add_ok=false idiom). - 0015 expert-density MoE token-tile auto-select: pure context re-anchor; upstream inserted a test_mul_mat_id case at the hunk anchor in test-backend-ops.cpp. The inserted lines are unchanged. (This one rebased cleanly via 3-way but its committed .patch no longer applies with plain git apply, so it is caught by the per-patch apply-check, not by the rebase conflict count.) - 0016 dynamic decode-first budget: dynamic budget block + n_decode_in_batch = batch.size() + add_ok=false against the refactored loop. All four are byte-faithful format-patch exports of the gate-green rebased commits. Applying the full corrected series to a fresh 9d5d882d reproduces the gate-green tree byte-for-byte across every code file. The other 7 touched patches (0009/0017/0018/0019/0020/0021/0024) are LINENUM-only (hunk bodies byte-identical, only @@ line-numbers shifted) and still apply cleanly, so they are left unchanged. The remaining patches are identical. 
Validation on the rebased build (NVFP4 Qwen3.6, GB10 sm_121): - test-backend-ops CUDA0: GATED_DELTA_NET 36/36, SSM_CONV 45/45, MUL_MAT 1146/1146, MUL_MAT_ID 806/806 all OK. - greedy md5 (-fa on -n 48 --temp 0 --seed 1): dense q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439 and MoE q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd, both == baseline. - decode S_TG @npl128: dense 366.41 t/s (ref 373.2, -1.8%), MoE 751.11 t/s (ref 745.7, +0.7%), both within noise. Details in backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 14:12:36 +00:00
Ettore Di Giacinto	30a2b590d9	Merge branch 'master' into worktree-feat+paged-attention (llama.cpp pin -> 9d5d882d) Sync to master (12 commits) + the llama.cpp pin bump 8be759e6 -> 9d5d882d. Conflicts resolved: - Makefile .NOTPARALLEL: union (keep both backends/llama-cpp-localai-paged and master's backends/privacy-filter-darwin). - gallery/index.yaml: our 2 base NVFP4 entries (qwen3.6-27b-nvfp4, qwen3.6-35b-a3b-nvfp4) for the paged backend prepended to master's full list; master keeps its own *-nvfp4-mtp variants (distinct entries). Go build + YAML validated; the 8 duplicate gallery names are pre-existing in master, not introduced here. The patchset still needs re-verification against the new tip (pin-sync, next step). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 13:16:13 +00:00
LocalAI [bot]	068d397acf	fix(backends): set rpath on the piper darwin binary so it can load its bundled libs (#10525) The metal-darwin-arm64-piper backend crashed at launch on macOS: DYLD "Library missing" Library not loaded: @rpath/libucd.dylib Referenced from: .../piper Reason: no LC_RPATH's found The piper binary links libucd, libespeak-ng, libpiper_phonemize and libonnxruntime via @rpath, but ships with no LC_RPATH, so dyld cannot expand @rpath and aborts before piper runs. The libraries themselves are already bundled in package/lib/ by package.sh. Additionally, package.sh's architecture detection only handled the Linux glibc loaders (/lib64/ld-linux-x86-64.so.2, /lib/ld-linux-aarch64.so.1) and otherwise hit `echo "Error: Could not detect architecture"; exit 1`, so on macOS packaging failed outright. Add a Darwin branch (before the Linux checks) that skips the glibc/ld.so bundling macOS has no use for and instead runs `install_name_tool -add_rpath @loader_path/lib` on the piper binary, so @rpath resolves to the bundled package/lib/ directory. Also mirror sherpa-onnx/opus in run.sh: export DYLD_LIBRARY_PATH on Darwin (LD_LIBRARY_PATH is Linux-only) as a defensive fallback. Validated by hand on Apple Silicon: with the rpath added, piper synthesized a real WAV. The darwin build is validated in CI. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 15:10:15 +02:00
Ettore Di Giacinto	167768cac3	feat(backend): llama-cpp-localai-paged variant + NVFP4 Qwen3.6 gallery New backend = stock llama-cpp grpc-server + the paged patchset (forces LLAMA_PAGED=on), shipped as its own meta-backend (mirrors turboquant, simpler: no fork pin, no grpc-server patching - the paged runtime hooks already exist in grpc-server.cpp). Stock llama-cpp untouched (LLAMA_PAGED?=on retained; the de-risk flip deferred for sign-off). Gallery: qwen3.6-27b-nvfp4 (dense) + qwen3.6-35b-a3b-nvfp4 (MoE) with the benchmark run config (paged_kv, max_batch_tokens, parallel, flash_attention, f16), mudler/ GGUF uris (sha256 TODO until publish). Importer dropdown entry + tests. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 12:58:56 +00:00
Ettore Di Giacinto	125d10a782	feat(paged): paged-pool burst-reclaim (truncate + defrag + slot release) (patch 0024) Fixes the paged-pool burst-degradation bug (OTHER_PATHS_INVESTIGATION.md section C Part 2): on a long-lived llama-server with LLAMA_KV_PAGED=1, a high-fan-out prefill burst strands KV blocks in the host-side paged pool, so a later lower-npl prefill draws from a depleted/fragmented pool and its throughput collapses (the benchmark's "restart per npl" crutch). Decode is unaffected. The fix changes only host-side block accounting and placement, never KV values or compute, and is gated behind LLAMA_KV_PAGED (LLAMA_PAGED_NO_RECLAIM=1 restores the pre-fix behavior). Fix-1 reclaim trailing blocks: PagedKVManager::truncate(seq, n_keep) frees every block beyond ceil(n_keep/bs) (ref-counted); called from llama_kv_cache::seq_rm for the p1==MAX && p0>0 partial-tail case so the manager tracks the kv-cache exactly. Fix-2 defrag on empty: when the pool is fully idle, defrag_free_pool() relinks the free queue into ascending block-id order (FreeBlockQueue::rebuild), preserving content-cache hashes. Fix-3 release on slot completion: server_slot::release() issues prompt_clear() under the paged engine so a finished-idle slot returns its blocks promptly. Validation (DGX GB10, q36-27b-nvfp4 = qwen35 hybrid; HEAD f7409c2 = patch 0023): - Bit-exact: greedy md5 identical across paged off / paged on / paged on+NO_RECLAIM (5951a5b4d624ce891e22ab5fca9bc439), == the 0023 baseline. test-backend-ops unaffected (no ggml op touched). - Host unit test: truncate reclaims exactly 16 trailing blocks; defrag restores ascending popleft order. UNIT PASS. - Model A/B (one binary, NO_RECLAIM): fragmentation prefill ratio 0.944 -> 0.998; 64 idle slots strand 2048 blocks, reclaim returns the pool to fresh (2527). 
- Server A/B (FRESH-npl8 -> BURST-npl64 -> POST-npl8): POST-npl8 prefill collapses 488 -> 44 t/s with NO_RECLAIM (the bug; investigation saw 507 -> 65), restored to 532 t/s (fresh 525, within 1%) with the fix. Paged release-log count 17 -> 96 (Fix-3 fires per slot completion). Canary tokens identical fresh-vs-post in both arms (bit-exact serving). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 10:44:33 +00:00
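Annotation on 125d10a782: Fix-1 (truncate) and Fix-2 (defrag on empty) can be illustrated with a toy host-side block pool. This is a Python sketch under stated assumptions, not the PagedKVManager implementation. The class `ToyPagedPool` and its fields are hypothetical; only the ceil(n_keep/bs) rule, the ref-counting, and the ascending free-queue order on an idle pool come from the commit message.

```python
import math
from collections import deque


class ToyPagedPool:
    """Hypothetical stand-in for a paged KV block manager (accounting only, no KV data)."""

    def __init__(self, n_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = deque(range(n_blocks))  # free queue, popped from the left
        self.refcount = [0] * n_blocks
        self.seq_blocks = {}  # seq id -> list of block ids, in token order

    def allocate(self, seq, n_tokens: int) -> None:
        """Grow a sequence until it owns ceil(n_tokens / block_size) blocks."""
        need = math.ceil(n_tokens / self.block_size)
        blocks = self.seq_blocks.setdefault(seq, [])
        while len(blocks) < need:
            b = self.free.popleft()
            self.refcount[b] = 1
            blocks.append(b)

    def truncate(self, seq, n_keep: int) -> int:
        """Fix-1: free every block beyond ceil(n_keep / block_size); return the count freed."""
        keep = math.ceil(n_keep / self.block_size)
        blocks = self.seq_blocks.get(seq, [])
        freed = 0
        while len(blocks) > keep:
            b = blocks.pop()
            self.refcount[b] -= 1
            if self.refcount[b] == 0:  # only unreferenced blocks return to the pool
                self.free.append(b)
                freed += 1
        return freed

    def defrag_free_pool(self) -> None:
        """Fix-2: when the pool is fully idle, relink the free queue in ascending id order."""
        if all(rc == 0 for rc in self.refcount):
            self.free = deque(sorted(self.free))
```

For example, with 64 blocks of 16 tokens, allocating 512 tokens and then truncating to 256 returns 16 trailing blocks, loosely mirroring the host unit test mentioned in the message. The sketch changes only block bookkeeping, in line with the commit's claim that KV values and compute are untouched.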
Ettore Di Giacinto	b061e4aef0	docs(paged): OTHER_PATHS investigation - rank 4 post-0023 paths, pick paged-pool burst bug as first build target Synthesis of the four read-only/GPU investigations (A MoE grouped-GEMM, B cublas lm_head, C TTFT/paged-pool burst, D dense CUDA-graph): - A: llama already has the sorted-grouped-FP4-MMA GEMM (higher tier than vLLM's GB10 W4A16 Marlin fallback); standalone bit-exact kernel win is bounded on this bandwidth-bound a3b model. Keep down_proj quantize retune (M1) as a cheap bank-shot; fold the decode-graph (M2) into a later shared GDN+MoE decode-graph project. - B: lm_head is BF16 (not FP4), nvjet already ~72% of peak HBM; bit-exact ceiling <1%, the only big win (NVFP4 head) is non-bit-exact and unfair vs vLLM. Dead end. Rank last. - C: paged-pool burst-degradation BUG (Part 2) is a true correctness defect (prefill collapses 507->65 t/s after a burst, restart cures it): reclamation gap on partial seq_rm + free-queue fragmentation. Plus the static decode-first budget (Part 1) explains 903s/213s burst TTFT and the chunked-interleave fix. - D: f32 dense CUDA-graph is STABLE (<1%, no bimodality); the brief's bimodality was the shelved BF16 SSM path. Closed. First build target: the paged-pool burst-degradation bug fix (Fix-1 truncate-on-partial-seq_rm + Fix-2 defrag-on-empty + Fix-3 release-on-slot- completion). Small, localized, default-off byte-identical, crisp repro (npl64 burst then npl8: prefill within 10% of fresh + num_free restored). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 09:42:55 +00:00
LocalAI [bot]	6afe127cd4	fix(backends): make the opus backend build and package on macOS/Darwin (#10523) The opus Go backend (WebRTC audio codec) never built on macOS, so the published master-metal-darwin-arm64-opus image shipped source only — no opus binary and no libopusshim — because every step assumed Linux. - Makefile: hardcoded libopusshim.so with no OS handling. Mirror sherpa-onnx: SHIM_EXT=so / dylib on Darwin and build libopusshim.$(SHIM_EXT). On Darwin link the shim with -undefined dynamic_lookup so it resolves opus_encoder_ctl from the already globally-loaded libopus (codec.go dlopens it RTLD_GLOBAL first) instead of baking an absolute Homebrew path into the dylib, keeping the packaged shim relocatable. - run.sh: hardcoded LD_LIBRARY_PATH + libopusshim.so even on macOS. Add a Darwin branch exporting DYLD_LIBRARY_PATH and the .dylib shim, like sherpa-onnx/run.sh. - package.sh: bundle libopusshim.$(SHIM_EXT) and libopus*.dylib (not just .so) into package/lib so the OCI image (which ships package/.) is self-contained on a runtime with no Homebrew; add a Darwin arch branch so it doesn't warn/skip. - backend_build_darwin.yml: install + link opus and pkg-config via brew so the Makefile's `pkg-config opus` resolves on the macOS runner, and cache opus' Cellar dir. Go code is unchanged; darwin build is validated in CI. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 11:19:50 +02:00
Ettore Di Giacinto	89e62fc74f	docs(paged): finalize f16 glue probe - cost analysis + build verdict Synthesize the GPU kernel-budget probe with the read-only glue source map. Add (4) the implementation cost - llama has no model-compute-dtype knob, the residual stream is F32 by construction (ggml_mul_mat hardcodes F32 output), so f16 glue is not a flag but an opt-in multi-file change (norm.cu f16 kernels + f16 residual stream). Add the final verdict: precision is not the dominant cause of the 8% residual (83% of the step is already f32/W4A4-matched), f16 recovers only 40-60% of the gap and is non-bit-exact, so do not build it as the default; ship the 95%-bit-exact f32 plateau and target the structural cublas/graph-launch ~3-4% instead. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 09:12:55 +00:00
Ettore Di Giacinto	001d833426	docs(paged): f16/bf16 glue probe - dense decode residual ceiling Empirical probe on q36-27b-nvfp4 @npl128 (build f7409c2, patch 0023): - attention KV cache default is ALREADY f16 (K/V f16) -> --cache-type f16 is a no-op; q8_0 within noise -> KV dtype is not a decode lever - nsys node-trace decode budget: f32-glue (norms/elementwise/activations/attn, excl. SSM recurrence + NVFP4 GEMM) = 28.7 ms = 8.4% of step (40.9 ms = 12% incl. the non-FP4 cublas GEMM) - f16 realistically recovers ~11-16 ms of the ~27 ms/step gap = ~40-60% of the 8.2% residual -> ~95-96% parity, not a full close; non-bit-exact opt-in only Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 09:11:21 +00:00
Ettore Di Giacinto	00f92659f8	docs(paged): correct vLLM recurrent-state precision (f32, not bf16) Earlier text claimed bf16 = vLLM's own precision; that was a refuted byte-gate draft re-surfacing. The settled finding (BITEXACT_VS_VLLM.md, proven 3 ways) is that vLLM keeps the gated-DeltaNet TEMPORAL state in f32 (only its conv state is bf16). So bf16 temporal is BELOW vLLM's recurrent precision, not a match; and at equal f32 precision llama's recurrence already beats vLLM (84.6% vs 82.4% peak). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 06:22:08 +00:00

1692 Commits