LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-30 03:17:01 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	c4058eb4da	feat(paged): tail-fusion (0042) + full-step decode CUDA graph default-on (0043); FP4-MMA W4A4 (0034) + Marlin W4A16 (0035) MoE-GEMM scaffolds default-off 0042 fuses the pre-norm residual add into RMSNorm (+0.5% prefill, bit-exact). 0043 makes the full-step MoE decode CUDA graph default-on (+2-4% decode, bit-exact; removes ~18x per-step host kernel re-issue, A/B-confirmed). 0034 (native FP4-MMA W4A4) and 0035 (Marlin-style W4A16 grouped MoE GEMM) are correct + bit-exact but regress vs the int8 FP4-MMQ in-backend on GB10 (bf16 MMA is ~half the int8 rate); shipped default-off as validated mechanisms and recorded negatives per the parity methodology. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-29 06:15:10 +00:00
Ettore Di Giacinto	f1c98ff0b9	fix(paged): revert S3 decode-stable scheduler to default-OFF (A/B regression) Patch 0041 (LLAMA_PAGED_DECODE_STABLE) was made default-on-when-paged, but a measured end-to-end A/B proved that is a serving mistake. S3 defers prefill admission on the period-8 cadence, which delays prompt admission: 2.5x worse TTFT (60s vs 24s at N=256) and 20-29% lower end-to-end throughput, with no end-to-end win at any concurrency. Its apparent decode_agg gain was a metric artifact (faster per-step decode bought by starving prefill). Flip the s3_enabled default so an unset LLAMA_PAGED_DECODE_STABLE means OFF; the mechanism stays available as an explicit opt-in (LLAMA_PAGED_DECODE_STABLE=1) for decode-dominated, low-arrival traffic where TTFT is not a concern. The default now prefers prompt prefill admission for good TTFT. S1 (patch 0040) keeps shipping default-on; only S3's default changes. Re-exports patch 0041 (change folded into its source commit) and updates the README 0041 row plus the decode-serving narrative to record the A/B finding. Greedy md5 gate unchanged (single-sequence llama-completion path, not update_slots): paged MoE 8cb0ce23, dense 5951a5b4. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-29 05:00:11 +00:00
Ettore Di Giacinto	b028c81eda	docs(paged): record padded/fixed-slot decode shape as tested-and-rejected The S1 section-(a) padded/fixed-slot decode shape (the scoped follow-up to push serving graph reuse from ~72% toward ~100%) was implemented in an isolated worktree off the committed S1/S3/tail base, built CUDA-only, and benched on GB10. Verdict: REJECTED. It is bit-exact and provably inert, but it regresses serving throughput at every concurrency and does not close the vLLM gap. Implementation (default-off, LLAMA_PAGED_PAD_DECODE): on a pure-decode step (n_prompt_budgeted == 0) emit a masked-inert dummy decode for every idle slot so n_tokens / n_seqs / n_seqs_unq / n_outputs and the seq-id set stay constant; a release()-side guard keeps a finished slot warm under padding. Each dummy is its own sequence (private recurrent state, per-stream paged attention, logits discarded), so it cannot perturb a real stream. Gates: single-seq greedy md5 bit-exact (dense 5951a5b4, paged-MoE 8cb0ce23). The literal per-stream ON-vs-OFF identity gate is unachievable - concurrent cuBLAS/FA decode is not bit-reproducible run-to-run even with padding off (OFF-vs-OFF diverging streams: dense 3/16, MoE 8/16). The achievable inertness gate passed: ON-vs-OFF per-stream prefix-agreement equals the OFF-vs-OFF noise floor exactly (MoE 0.940/0.940, dense 0.812/0.812), so the dummy slots leak nothing. Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10), burst decode tok/s/seq: n=8 S1+S3 28.16 / PAD 6.05 / vLLM 44.8; n=128 S1+S3 4.53 / PAD 4.32 / vLLM 6.87. Staggered aggregate tok/s: baseline (reuse 0%) 757.6, S1+S3 (reuse 72%) 763.3, PAD (reuse 38%) 558.0. Why it fails: (1) serving decode here is GPU-compute-bound, not host-rebuild-bound - baseline reuse 0% ~= S1+S3 reuse 72% on aggregate tok/s, so closing reuse buys ~nothing (the earlier 542->762 host-bound delta did not reproduce); (2) padding adds dummy-row compute proportional to pad_width - real_load, catastrophic at low load; (3) in continuous serving padding cannot hold a constant width (perpetual prefill churn) so reuse drops 72% -> 38%; (4) the completion-driven batch shrink padding prevents is itself a throughput win in a compute-bound regime. The residual burst gap is GPU-compute, which a host-side reuse lever cannot close. Patch series unchanged: this rejected lever is NOT added to patches/paged/. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 20:47:43 +00:00
Ettore Di Giacinto	2fa8ef8fc5	fix(paged): make patch 0031 apply on the 0001-0030 base; default S3 on under paged KV FIX A (patch 0031 compose break): the chunked GDN prefill patch carried '#include <cuda_bf16.h>' and '#include <type_traits>' as CONTEXT lines, but those were introduced by the dropped bf16-tau patch 0026, so on the bf16-tau-free 0001-0030 base only '#include <cstdlib>' is present and 'git apply' failed. The same 0026 drop also shifted 0031's later hunks off their context (the ', hyb' kernel-launch arg, the 'STATE_BF16, HYBRID' template params, and the GDN_LAUNCH_ARGS list). Regenerated 0031 against a fresh pin(0ed235ea) + 0001-0030 tree: the chunked kernel now SELF-PROVIDES the cuda_bf16.h / type_traits includes (adds them, plus the climits it needs for INT_MAX) and the dispatch guard is the 2-param 'if constexpr (!KDA && !keep_rs_t)' form. Behaviour is unchanged: 0031 stays opt-in, default OFF (GDN_CHUNK_MIN), a recorded negative. The full 0001-0042 series now applies clean on 0ed235ea ('git apply --check' green for every patch). FIX B (patch 0041 S3 default): the decode-shape-stable scheduler defaulted OFF. Make it default ON whenever paged KV is active (LLAMA_KV_PAGED set), still overridable to off via LLAMA_PAGED_DECODE_STABLE=0. Minimal host-side change in update_slots(); re-exported from the dev tree, README 0041 row updated to match. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 19:37:05 +00:00
Ettore Di Giacinto	d706980c2b	feat(paged): close the continuous-serving decode gap (S1+S3, patches 0040/0041) Add the two decode-serving graph-reuse levers (validated on GB10) that close the host-bound serving gap (paged dropped to ~3.7 vs vLLM ~5.9 tok/s/seq in real continuous serving while tying it in static batched-bench). - 0040 S1 paged decode-graph reuse: the paged decode inputs never overrode llm_graph_input_i::can_reuse (defaults false), so the host rebuilt the ggml graph on EVERY decode step (layer-A reuse 0%). Add a 256-bucketed-shape can_reuse + a live-mctx refresh from the owning attn input. Bit-exact (md5 byte-identical reuse on/off). Static batched-bench: paged reuse 0% -> 95.5%. - 0041 S3 decode-shape-stable scheduling: keep co-batched prefill out of decode steps so the scheduler emits the reuse-stable pure-decode shape S1 can reuse. Default-off policy on top of 0016; bit-exact (per-stream independent). S1+S3 together (128-client staggered serving, MoE Qwen3.6-35B-A3B-NVFP4): graph reuse 0% -> 72.2%, hostproc 15.98 -> 6.31 ms/step, decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9). S1 alone is insufficient (13.8%); S3 is the multiplier. S2 (double-buffer set_inputs) dropped: Phase-0 put set_inputs at ~0.05 ms/step, so it has nothing to recover. README patch table + DECODE_SERVING_SCOPE.md updated with results and the padded/fixed-slot follow-up. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 18:04:28 +00:00
Ettore Di Giacinto	000705321f	feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold (patch 0033, default-off) Option (a) of PREFILL_GEMM_SCOPE.md: route large-M (prefill) NVFP4 dense weight GEMMs off the decode-tuned FP4-MMQ kernel onto the dequant->bf16 cuBLAS (nvjet) tensor-core path, wired via an M-threshold in ggml_cuda_should_use_mmq. Lands the validated, bit-exact-gated mechanism and records the honest GB10 result: it is a regression, so it ships default-off (== stock), mirroring the patch-0017 default-off discipline. Three-edit scaffold (no new kernel): should_use_mmq routes NVFP4+Blackwell+dense M>LLAMA_FP4_PREFILL_M to cuBLAS; op_mul_mat_cublas gains an NVFP4 branch that dequants the FP4 weights to a transient bf16 pool buffer (not cached - stays FP4-resident) and runs cublasGemmEx CUDA_R_16BF/COMPUTE_32F; ggml_get_to_bf16_cuda gains the NVFP4 case. Bit-exact gate PASS (benign): test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID 806/806; the forced path (LLAMA_FP4_PREFILL_M=64) is green CUDA-vs-CPU at NVFP4 large-M shapes; greedy md5 on q36-27b is byte-identical to FP4-MMQ both for short prefill (5951a5b4, decode untouched) and for a >threshold prefill that exercises the bf16 path (5f3967df - no greedy argmax flips). Performance REGRESSES on GB10 (S_PP, q36-27b dense, A/B via env): M=512 958.99 -> 486.65 (-49%), M=1024 1013.65 -> 587.27 (-42%), M=2048 918.46 -> 649.42 (-29%). The scope premise (FP4-MMQ ~3% of FP4 peak at large M) is false here: FP4-MMQ beats bf16-cuBLAS because bf16 peak is ~half FP4 peak and the per-step weight dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate, only partially amortizing as M grows. Default-off keeps stock S_PP (966.98). Phase 2 (MoE grouped large-M) not implemented: it inherits the same bf16-peak<FP4-peak ceiling plus a per-expert dequant, so grouped bf16-cuBLAS would regress for the same reason; a real prefill GEMM win needs option (b), a native FP4-MMA large-M kernel. Full A/B in docs/PREFILL_GEMM_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 17:42:15 +00:00
Ettore Di Giacinto	4bdd26a7f0	docs(paged): scope tensor-core (mma) chunked GDN prefill kernel Scopes the follow-up recorded by patch 0031 + README section 5: replace the serial per-thread reductions of the chunked gated-DeltaNet prefill scan with mma.sync tensor-core matmuls and lift the 1-block/SM occupancy ceiling, the path that would beat the tuned sequential scan and close the GDN prefill bucket toward vLLM's ~2.5x-cheaper chunked scan. Confirmed (not assumed) the GB10/sm_121a tensor-core reality: consumer Blackwell (SM12x) has NO wgmma (Hopper-only) and NO tcgen05/TMEM (sm_100a data-center only); the usable path is the extended mma.sync family. So the kernel is a warp-synchronous mma.sync + cp.async design (reusing ggml's mma.cuh tiles), not a wgmma/TMA/tcgen05 design - patch 0031's 'mma/wgmma' shorthand reads as mma only on this part. Design: register-resident state frees the 64KB that forced C=16, admitting C=64 under the 99KB shared opt-in; tf32 inputs / f32 accumulate with a 3xtf32 precision ladder; decays/gamma/beta stay f32 outside the mma to preserve the bounded de-gating; A-inverse via blocked forward substitution (FLA UT transform) with mma off-diagonal coupling. Mechanism: chunking cuts state-BW ~Cx, mma absorbs the O(C^2) intra-chunk flops the serial 0031 could not. Honest: multi-week, high risk, no vendor kernel to route to on sm_121; gains beat the sequential scan and close most of the bucket but not full sm_100-class parity. KL-gate binding (NMSE likely fails at reduced precision). Phased: re-profile -> two-product PoC -> full intra-chunk + C=64 + reg-state -> occupancy/cp.async; opt-in default-OFF until A/B-proven. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 17:23:51 +00:00
Ettore Di Giacinto	9a28f23134	docs(paged): scope the continuous-serving decode gap (host-bound, design-only) Add DECODE_SERVING_SCOPE.md: the decode KERNEL is at parity in static batched-bench (~6.1 tok/s/seq ~ vLLM ~5.9 at npl128) but continuous serving through llama-server update_slots() drops to ~3.7 (-39%) while vLLM sustains ~5.9. Scope shows the gap is the scheduler/host loop, not the kernel. Root-cause hypothesis from source: continuous batching's batch-shape + seq-set churn breaks BOTH graph-reuse layers every step - llama-context can_reuse/ allow_reuse (n_tokens + seq-set must match) and the CUDA ggml_cuda_graph update_required memcmp (ne/nb/data ptrs) - so the GPU idles while the host rebuilds + re-captures the graph and runs un-graphed set_inputs. vLLM avoids this with padded/bucketed decode shapes + piecewise CUDA graphs. Documents that the shipped scheduler patches (0008/0013/0016/0024/0025/0029) target prefill freezing + burst collapse, NOT decode-step graph reuse, which is why the serving gap survives them; notes the README s.5 'lever 2 graph coverage FLAT' verdict was static-regime and is reopened here for serving only. Ranks host-side, bit-exact-safe levers: S1 bucketed/padded decode-step shape for graph reuse, S2 double-buffer/overlap per-step host work, S3 graph-shape-stable scheduling (extend 0016). Specifies a Phase-0 profile to confirm host-bound before any build, reusing the in-tree [L5INSTR] hostproc/set_inputs/ get_block_table timers, the 'graphs reused' perf counter, LLAMA_GRAPH_REUSE_DISABLE and nsys GPU-busy%, with vLLM ground-truthed at the same concurrency. No kernel code; no GPU run in this pass. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 17:14:51 +00:00
Ettore Di Giacinto	e610347367	feat(paged): chunked parallel-scan GDN prefill kernel (patch 0031) Adds patch 0031 to the paged llama.cpp series: an FLA-style chunked parallel-scan prefill kernel for gated DeltaNet (the upstream gated_delta_net.cu "Add chunked kernel for even faster pre-fill" TODO). Scope: non-KDA scalar gate, f32 state, final-state-only, homogeneous. Bit-exact-benign (NEW per-path): test-backend-ops GATED_DELTA_NET 91/91 within the 1e-7 NMSE gate vs the CPU reference (patch adds 8 S_v=128 prefill cases: exact-multiple / tail / multi-seq / GQA / permuted); numpy prototype confirms f32 chunked-vs-sequential NMSE ~1e-13. OPT-IN, default OFF: GB10's 99KB dynamic-smem opt-in forces C=16 (the 128x128 f32 state is 64KB of the all-shared layout), pinning the kernel to 1 block/SM with serial dk-reductions. Measured ~761 t/s chunked vs ~971 t/s sequential (~22%% slower) on q36-27b-nvfp4 prefill, so it defaults OFF (enable with GDN_CHUNK_MIN=<n>); the backend default is regression-free. Beating the 84.7%-of-peak sequential scan needs tensor-core matmuls / register-resident state with larger chunks (recorded in README section 5). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 17:09:38 +00:00
Ettore Di Giacinto	11128cb080	docs(paged): scope the large-M NVFP4 prefill GEMM lever (design only) Design + plan for the #1 prefill lever: NVFP4 weight GEMM at large M, where MMQ (decode/M<=128-tuned, 1 CTA/SM, 128-col tile cap) is ~3.4x slower than vLLM's marlin/cutlass large-M path (~51% of the prefill gap). Recommends (a) dequant->bf16 cuBLAS routed by an M-threshold (dense first, MoE grouped-cuBLAS second); rejects (b) a from-scratch Marlin/FP4 kernel as a multi-week project. Key enabling finding: NVFP4->bf16 dequant kernels already exist, and NVFP4 is currently force-excluded from the tensor-core cuBLAS path (falls to f32 Sgemm) - relaxing that one guard is the pivot. Honest: bf16-cuBLAS banks ~60-75% of the GEMM gap, not full 68us/tok parity (bf16 TC peak ~half FP4). Design only - no kernel, no GPU run. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]	2026-06-28 16:42:23 +00:00
Ettore Di Giacinto	4cd90bfae9	paged: drop bf16-tau (patch 0026), subsumed by decode fusions (tau=100000 flat, zero speed benefit) The opt-in hybrid per-head bf16 SSM-state lever (ssm_bf16_tau, patch 0026) is removed from the llama-cpp-localai-paged patch series. Clean re-measurement after the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache) landed shows it buys nothing: forcing ALL gated-DeltaNet heads to bf16 (tau=100000, the most aggressive setting) gives flat decode throughput, 780.6 vs 780.0 t/s. The mode engages but adds zero speed because it is subsumed by the fusions. The earlier "+12%" was measured before the fusions completed. bf16-tau was a precision trade (not bit-exact, ~91% same-top-p) plus extra bug surface and extra CUDA template-instantiation compile cost with no offsetting benefit. Dependency check: no later patch (0028/0029/0030) depends on 0026. 0030's only mention is a description comment; its code keys off fused_gdn_ar/ch/auto_fgdn, which originate in 0018/0019/0021 (before 0026). The remaining series (0001-0025, 0028-0030) applies clean with git apply --check against the pin 0ed235ea2c17a19fc8238668653946721ed136fd. The Makefile applies the series by glob (patches/paged/0*.patch); the resulting gap at 0026 is tolerated (0005/0027 are already absent). Removed: - patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch - the dead ssm_bf16_tau / ssm_hybrid_tau option handler in the shared grpc-server.cpp (it only set LLAMA_SSM_BF16_TAU, now a no-op the library no longer reads) - the patched+bf16-tau benchmark columns and llama-patched-bf16tau rows (README + final_benchmark.csv), the ssm_bf16_tau option text in backend index.yaml, the gallery NOTE block, and the docs/features/backends.md mention. The rejected-lever lesson is kept (why it was dropped: subsumed, tau=100000 flat) in the backend README section 5, the paged-backend agent guide, and the vLLM-parity methodology, so it is not re-tried. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 16:06:06 +00:00
Ettore Di Giacinto	2c59805267	fix(paged): rpc cmake target renamed rpc-server -> ggml-rpc-server at pin 0ed235ea llama.cpp renamed the RPC tool target (tools/rpc/CMakeLists.txt: set(TARGET ggml-rpc-server)) at the 0ed235ea pin. master already updated the stock llama-cpp Makefile to match (--target ggml-rpc-server, cp bin/ggml-rpc-server); the paged backend's separate Makefile copy was left stale and its -grpc (RPC) variant failed with 'No rule to make target rpc-server' (grpc-server itself built to 100%). Mirror the stock rename in the paged Makefile. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 11:10:07 +00:00
Ettore Di Giacinto	c51ff4cec9	docs(paged): scope porting the portable benefits to Metal/SYCL/Vulkan (+ROCm) Add ACCELERATOR_PORTING_SCOPE.md, the umbrella scope for taking the paged backend's accelerator-portable wins off the CUDA family. It builds on (does not duplicate) UPSTREAM_LAYER2_SCOPE.md, which stays the GDN/SSM-fusion detail (benefit #1), and adds: - Benefit #2 (paged KV in-kernel block-table flash-attn read, 0009-0011): new per-backend feasibility from source analysis of the Metal/SYCL/Vulkan flash-attn kernels. SYCL EASY (near line-for-line CUDA mirror), Metal EASY-MEDIUM (decode already routes to the vec kernel), Vulkan MEDIUM (the fast coopmat2 NVIDIA decode path cannot do the indexed read; push-constants are full). Universal constraint: only the vec/scalar decode kernel admits the per-cell indexed read, so route block-table ops onto vec (as CUDA's 0009-0010 dispatch guard already does) and leave the fast MM/coopmat2 path contiguous-only. This is the lever that flips paged KV from neutral-to-slightly-negative to non-negative off CUDA. - Benefit #3 (decode-first scheduler, 0013/0016): confirmed a free portable win - host-side update_slots() policy, zero kernel work, runs on any accelerator as-is. - Benefit #4 (NVFP4 FP4-MMA, 0017/0023/0025): out of scope (Blackwell only); flags the backend-agnostic analogues of the act-quant dedup and the graph-coverage lever without over-claiming a port. - A ROCm note: ROCm rides the CUDA/HIP path (validate, don't re-port); FP4-MMA stays Blackwell-only. Benefits #1 and #2 share the port shape and rank Metal->SYCL->Vulkan, so they bundle into one per-backend PR behind a shared ops-first PR. Cross-link added from UPSTREAM_LAYER2_SCOPE.md. All gates are test-backend-ops on-target (no Metal/SYCL/Vulkan/ROCm hardware here). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 08:34:32 +00:00
Ettore Di Giacinto	ea72a56e2c	Merge origin/master + pin-sync paged backend to 0ed235ea master auto-bumped the stock llama-cpp pin 9d5d882d -> 0ed235ea and updated the shared grpc-server.cpp. The paged backend's pin must track the stock pin (the grpc-server.cpp is shared), so bump its LLAMA_VERSION to match. All 28 paged patches apply clean on 0ed235ea (verified against a fresh upstream clone). The bf16-tau state-serialization fix (patch 0026) is included. Bit-exact gate + full grpc-server build verify on GPU/CI to follow. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 07:56:47 +00:00
Ettore Di Giacinto	1f3e5ba301	fix(paged): serialize both SSM partitions in hybrid bf16-tau state save/restore (patch 0026) The opt-in ssm_bf16_tau hybrid mode splits a gated-DeltaNet layer's recurrent SSM state into an f32 partition (s_l) and a bf16 partition (s_l_bf16). The recurrent state serialization paths (state_write_data / state_read_data) were never updated for the split: they read/wrote s_l using the FULL hparams.n_embd_s() (S_vS_vH) row width, but a split layer's s_l only holds S_vS_vn_f32, so the access overruns the smaller tensor (a ggml_backend tensor read out of bounds), and the bf16 fast-head partition was never persisted at all. This is what broke high-concurrency serving with --ssm-bf16-tau: the server's context-checkpoint feature serializes per-sequence state via state_seq_get_data. With a checkpoint enabled, even a single request triggered the out-of-bounds read; at higher concurrency the cell range starts at a higher base slot so the overrun reaches further (hard abort in a debug build, silent state corruption then 1-token-then-EOS on restore in a release build). The static batched-bench never exercises save/restore so it did not catch it; the GDN decode kernel and per-head partition offsets were already correct (decode with checkpoints disabled is fine at N=8/16/32). Fix: serialize the f32 partition and, when the layer is split, the bf16 partition right after it, each with its OWN row width (tensor ne[0]). head_slot is rebuilt deterministically at load (same model + tau), so it is not serialized. Non-split layers have ne[0] == n_embd_s() and no bf16 partition, so their on-disk format and behavior are byte-identical (the default f32 path and the bit-exact gate are unaffected). Verified on GB10/DGX with Qwen3.6-35B-A3B-NVFP4 + --ssm-bf16-tau 64 via a continuous-batching llama-server: with context checkpoints enabled, N=8, N=16 and N=32 (slot reuse + restore) all now produce full coherent 128-token output and the server stays up; pre-fix the same config aborted on the first checkpoint. Assisted-by: Claude:claude-opus-4-8[1m] [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 07:47:17 +00:00
LocalAI [bot]	de2ec2f136	feat(backends): add voice-detect + face-detect ggml backends (replace Python insightface/speaker-recognition) (#10441 ) * feat(voice-detect): add Go purego backend for voice-detect.cpp Add backend/go/voice-detect implementing the Backend gRPC voice subset (VoiceEmbed/VoiceVerify/VoiceAnalyze) over libvoicedetect.so via purego, mirroring the parakeet-cpp / omnivoice-cpp backends. The flat voicedetect_capi C ABI is dlopen'd cgo-less; malloc'd string and float-vector returns are owned by Go and released through the matching capi free functions, with the per-ctx last error surfaced into Go errors. Calls are serialized via base.SingleThread since the C context is not reentrant. Proto field mapping: - VoiceEmbed: VoiceEmbedRequest.audio (path) -> embed_path -> Embedding+Model. - VoiceVerify: audio1/audio2 + threshold (<=0 falls back to the verify_threshold option, default 0.25) -> verify_paths -> verified/distance/ threshold/confidence/model/processing_time_ms. - VoiceAnalyze: audio (path) -> analyze_path_json; the JSON age/gender/emotion document maps to a single VoiceAnalysis segment (start/end 0; gender "label" -> dominant_gender with the remaining float scores as the gender map; emotion label/scores -> dominant_emotion/emotion). The Makefile pins voice-detect.cpp to 47546430, clones+builds libvoicedetect.so with ggml static-linked (PIC, GGML_NATIVE off) so dlopen needs no external libggml/libvoicedetect; ldd on the artifact shows only system libs. Ginkgo tests cover option parsing and analyze-JSON mapping; embed/verify smoke specs gate on VOICEDETECT_BACKEND_TEST_MODEL + VOICEDETECT_BACKEND_TEST_WAV. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(voice-detect): wire backend into index, gallery and build Register the voice-detect.cpp speaker-recognition + voice-analysis backend (added in Voice-INT-A) into LocalAI's distribution surfaces, mirroring the ced backend (the closest mudler C++/ggml audio analogue): - backend/index.yaml: add the &voicedetect meta-backend (capabilities platform map, no top-level uri) plus the full set of concrete per-arch image entries (cpu/cuda12/cuda13/metal/rocm/sycl/vulkan/l4t and the -development variants). Referential integrity audited - every alias target resolves. - gallery/index.yaml: add 5 model entries on backend voice-detect - ECAPA-TDNN, WeSpeaker ResNet34, 3D-Speaker ERes2Net, CAM++ and the wav2vec2 age/gender/emotion analyze model. The engine architecture is read from GGUF metadata (voicedetect.arch) at load. GGUF artifacts are not yet published: each files: entry points at the intended mudler/voice-detect-gguf location with a TODO to fill sha256 after upload (no fabricated hashes). - .github/backend-matrix.yml: add the linux build matrix block + the darwin metal entry mirroring ced. - .github/workflows/bump_deps.yaml: track mudler/voice-detect.cpp via VOICEDETECT_VERSION (pin 47546430, = 4754643). - core/config/backend_capabilities.go: register voice-detect in the backend capability map (VoiceVerify/VoiceEmbed/VoiceAnalyze -> speaker_recognition), mirroring speaker-recognition. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(face-detect): add purego Go backend for face-detect.cpp Add the LocalAI Go backend that dlopens libfacedetect.so (the flat facedetect_capi_* C-ABI) via purego, mirroring the sibling voice-detect backend. Implements the Face subset of the Backend gRPC service: - Embeddings(PredictOptions): Images[0] base64 -> temp file -> embed_path -> L2-normalized ArcFace embedding. - Detect(DetectOptions): src -> detect_path_json -> Detection boxes (class_name "face", [x1,y1,x2,y2] -> x/y/w/h). - FaceVerify(FaceVerifyRequest): two images + threshold + anti_spoof -> verify_paths; best-effort img areas via detect. - FaceAnalyze(FaceAnalyzeRequest): img -> analyze_path_json -> per-face age + gender ("M"/"F" normalized to "Man"/"Woman"). The Makefile pins face-detect.cpp to 636a1963 and builds the shared lib with ggml + vendored libjpeg-turbo static (PIC), so the .so is ldd-clean (no libggml) and exports only facedetect_capi_* (no jpeg_ symbols). Gated Ginkgo e2e mirrors voice-detect. Note for the gallery-wiring task: backend registration (index.yaml, gallery, core/config/backend_capabilities.go) is intentionally not touched here. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(voice-detect): replace em dashes in net-new descriptions Project style forbids em/en dashes. Replace the three U+2014 chars introduced by the voice-detect gallery/index wiring with `-`/`:`. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(face-detect): wire backend into index, gallery and build Register the face-detect.cpp face detection / embedding / verification / analysis backend (added in Face-INT-A) into LocalAI's distribution surfaces, mirroring the voice-detect wiring (the closest mudler C++/ggml recognition analogue): - backend/index.yaml: add the &facedetect meta-backend (capabilities platform map, no top-level uri to avoid the meta-backend gotcha) plus the full set of concrete per-arch image entries (cpu/cuda12/cuda13/ metal/rocm/sycl-f16/sycl-f32/vulkan/l4t and the -development variants), 22 entries. Referential integrity audited: every alias target resolves. - gallery/index.yaml: add 4 model entries on backend face-detect - face-detect-buffalo-l/m/s (insightface SCRFD + ArcFace/MBF, NON-COMMERCIAL) and face-detect-yunet-sface (OpenCV-Zoo YuNet + SFace, APACHE-2.0, the commercial-friendly alternative). The detector/embedder architecture is read from GGUF metadata (facedetect.arch) at load; only the real verify_threshold option is set (0.35 buffalo, 0.363 sface). GGUF artifacts are not yet published: each files: entry points at the intended mudler/face-detect-gguf location with a TODO to fill sha256 after upload (no fabricated hashes). - core/config/backend_capabilities.go: register face-detect in the backend capability map (Embedding/Detect/FaceVerify/FaceAnalyze -> face_recognition), mirroring insightface. - .github/backend-matrix.yml: add the linux build matrix block + the darwin metal entry mirroring voice-detect. - .github/workflows/bump_deps.yaml: track mudler/face-detect.cpp via FACEDETECT_VERSION (pin 636a1963). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(recon): voice-detect metal build branch + face-detect gallery usecases Add the missing metal BUILD_TYPE branch to the voice-detect Makefile forwarding -DVOICEDETECT_GGML_METAL=ON, mirroring face-detect, so the darwin metal CI artifact is built with the Metal backend instead of CPU-only. Expand the 4 face-detect gallery models' known_usecases to [face_recognition, detection, embeddings] to match the backend capabilities map and the mirrored insightface-buffalo entries, so auto-selection for /v1/detect and /embeddings works. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * docs(recon): document voice-detect and face-detect ggml backends Document the new standalone C++/ggml biometric backends as the recommended/default option for face and voice recognition, keeping the existing Python insightface / speaker-recognition backends framed as the legacy path. - features/face-recognition.md: add a face-detect (ggml) backend section with the gallery entries (buffalo-l/m/s non-commercial, yunet-sface Apache-2.0), licensing, and verify/detect/analyze quickstart. - features/voice-recognition.md: add a voice-detect (ggml) backend section with the gallery entries (ecapa-tdnn, wespeaker-resnet34, eres2net, campplus speaker recognizers; emotion-wav2vec2 non-commercial analyze head) and quickstart. - reference/compatibility-table.md: add face-detect.cpp and voice-detect.cpp rows to the Vision, Detection & Recognition table. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(gallery): publish recon backend GGUF uris + sha256 Fill in the published HuggingFace GGUF uris and verified sha256 for the 9 recon gallery entries (voice-detect-* and face-detect-), and remove the TODO publish markers. Correct the eres2net, campplus, and emotion-wav2vec2 uris to the actual published filenames. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] feat(gallery): re-embed buffalo anti-spoof + add audeering age/gender voice model Update the 3 buffalo face-detect GGUF sha256 (anti-spoof ensemble now embedded and re-uploaded under the same filenames/uris) and note the FaceVerify anti_spoof request flag in each description. Add a new voice-detect-age-gender-wav2vec2 gallery entry mirroring the emotion model. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(gallery): add face-detect-buffalo-sc and antelopev2 packs Add gallery entries for two newly-published insightface face packs on the face-detect backend: buffalo_sc (smallest pack, SCRFD-500M + small ArcFace) and antelopev2 (higher-accuracy, SCRFD-10G + ArcFace glint360k R100, 512-d). Both are non-commercial research-only. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(recon): honor LocalAI per-model threads in voice/face-detect backends LocalAI spawns one backend process per model and serves requests concurrently, so the engines' own min(hardware_concurrency, 8) default can oversubscribe cores. Forward the per-model Threads value from the gRPC LoadModel options into the engine via VOICEDETECT_THREADS / FACEDETECT_THREADS (read at backend construction) before the capi load. A non-positive Threads is treated as unset, leaving the engine default. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump backend pins to CPU-optimized engine commits voice-detect.cpp -> 0d9c1b3 (radix-2 FFT FBank, threads, flash attn + cached pos-conv); face-detect.cpp -> 523aee1 (thread-gated direct conv, threads). Brings the CPU optimizations into the LocalAI backend builds. GGUF format and parity unchanged, so the published HF GGUFs remain valid. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump backend pins to round-2 CPU-optimized engines voice-detect.cpp -> fe7e6a3 (ERes2Net 1x1->mul_mat, CAM++ layout+context, wav2vec2 conv-LN, ECAPA capture-drop, AVX512 dispatch opt-in); face-detect.cpp -> 9c8adb7 (AVX2 Winograd F(2x2,3x3) for SCRFD/ArcFace 3x3 convs, ArcFace BN-fold). Parity unchanged (cosine=1.0); GGUF format unchanged, HF GGUFs valid. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump backend pins to round-3 Winograd engines voice-detect.cpp -> 45122ec (Winograd F(2x2,3x3) for WeSpeaker/ERes2Net 3x3 convs, -22%/-20% @8t); face-detect.cpp -> cd5c962 (Winograd F(4x4,3x3) for SCRFD large maps, -22% @1t on top of F(2x2), more load-stable). Parity held (cosine=1.0); GGUF format unchanged, HF GGUFs valid. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump backend pins to round-4 Winograd engines (CPU opt complete) voice-detect.cpp -> d2839ca (CAM++ FCM 2D convs through Winograd, -15.5%/-10.3%); face-detect.cpp -> c1db23d (AVX2-vectorized Winograd tile transforms, SCRFD detect -14%/-9.6%). Final CPU optimization round; the conv-kernel lever class is now exhausted (parity held cosine=1.0; GGUF/parity unchanged, HF GGUFs valid). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump face-detect pin to deep-kernel engine (7ae5c4d) face-detect.cpp -> 7ae5c4d: register-blocked winograd-domain GEMM microkernel (2.8x isolated GFLOP/s), AVX-512 zmm evolution behind runtime CPUID dispatch (ship-safe, AVX2 fallback bit-identical), bias/relu fused into the winograd output transform, and SFace Conv+BN fold + bias/PReLU fusion. SCRFD detect ~1.4x faster end-to-end vs the round-4 baseline; parity bit-exact; portable single binary (function-multiversioned, no global -mavx512f). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump voice-detect pin to ECAPA operand-order win (e9c56ae) voice-detect.cpp -> e9c56ae: weight-as-src0 mul_mat order in ECAPA's F32 conv1d_same (routes through tinyBLAS sgemm); ECAPA embed 1.67x @1t / ~1.3x @8t, parity cosine=1.0. Isolated to encoder.cpp (ECAPA-only); ERes2Net/CAM++/WeSpeaker do not call conv1d_same so are provably unaffected. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump pins to FMA-throughput engines (voice f7b9f89, face 2d2d5f0) face -> 2d2d5f0: route ArcFace 3x3 body convs through the AVX-512 winograd microkernel (kWinoMinSize 80->14); ArcFace 1.62x @1t, SCRFD detect to 0.966 of MLAS @1t, no regression. voice -> f7b9f89: runtime-CPUID-dispatched AVX-512 winograd-GEMM microkernel (ship-safe, AVX2 fallback bit-identical); WeSpeaker 1.90x @1t. Parity cosine=1.0 throughout; portable single binaries. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump pins to MLAS-class direct-conv engines (voice 7ecfd07, face be22d67) Hand-tuned nChw16c AVX-512 register-tiled direct-conv microkernel (~263 GFLOP/s, within 6-7% of MLAS per-op efficiency), runtime-CPUID-dispatched + AVX2 fallback, fused bias/relu. voice 7ecfd07: default 3x3-s1 kernel for WeSpeaker (+37%/+32%) + ERes2Net, CAM++ pinned to Winograd. face be22d67: shape-gated to the ArcFace recognizer body (+25-27% @8t); SCRFD detector stays on Winograd (no regression). Parity cosine=1.0 / detect <=1px on AVX-512 + AVX2 paths. Portable single binaries. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump voice pin to Phase-A blocked backbone (f4e7eef) WeSpeaker ResNet34 runs as one nChw16c blocked island (2 reorders/forward vs ~60) on AVX-512, default; per-conv directconv fallback on AVX2. +2.9% @1t / +17-19% @8t vs per-conv directconv, parity cosine=1.0. The conv microkernel is already FMA-bound near peak (~0.86-0.98x MLAS-implied); residual to MLAS is sub-peak edge + non-conv tail, documented in docs/cpu-optimization.md. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump pins to breadth blocked-backbone (voice 7f66871, face d80092b) voice 7f66871: AVX2-vectorized (ymm) blocked island - AVX2-only hosts now run the blocked backbone for WeSpeaker (2.3x over per-conv-AVX2, cosine=1.0); ERes2Net stays per-conv (blocked regresses, opt-in only); CAM++ Winograd-pinned. face d80092b: ArcFace recognizer blocked island, AVX-512 default (-13% @8t, ~0.90x MLAS, the closest conv result), auto per-conv on AVX2; SCRFD untouched on Winograd (0 island invocations during detect). Parity cosine=1.0 / detect <=1px throughout. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump pins to small-spatial + stem conv kernels (voice 99b1804, face 47fdab6) Measured-gap-driven conv kernels: small-spatial (fill the register tile when output width <= tile width) + small-IC stem + strided-1x1/downsample recovery. ArcFace recognizer 0.57 -> 0.70x MLAS @1t (the closest conv model), WeSpeaker 0.65 -> 0.79x @1t. Parity cosine=1.0 / detect <=1px. The OC-block-sharing lever was a measured dead-end (deep stride-1 is L3-weight-bandwidth bound, not read-port bound) and was NOT shipped. Kernel ceiling reached; further gap needs an algorithm-class change (cache-blocked weight-stationary GEMM, or q8 weights). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump pins to GPU persistent-graph + multi-model-safe cache (voice 45d2e6b, face 0a4799a) GPU wins (CUDA/ggml backend, no CPU-path change): persistent per-shape graph+context cache in Backend::compute() eliminates the per-call cudaGraph re-instantiation churn -> wav2vec2 emotion+age-gender now AT GPU parity with torch-cuDNN on GB10 (0.97-0.98x), CAM++ -5.7ms; bit-identical parity. Cache hardened multi-model-safe (invalidate-on-free keyed by the ModelLoader weights buffer) so LocalAI multi-model hosting cannot stale-hit. Conv models still trail cuDNN (im2col-materialization-bound) - cuDNN implicit-GEMM lever next. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump pins to cuDNN-conv-capable engines (voice b6e4356, face 6107a24) Adds the opt-in cuDNN implicit-GEMM conv path (VOICEDETECT_GGML_CUDNN / FACEDETECT_GGML_CUDNN, DEFAULT OFF -> zero build/runtime dep until enabled). On GPU it kills the im2col-materialization bottleneck and reaches torch-cuDNN parity on the spill-bound convs: SCRFD detect 14.8->6.4ms (2.3x, ~parity), WeSpeaker ~parity, ERes2Net beats torch (1.10x); ArcFace/CAM++ neutral (no spill). Parity exact (SCRFD <=1px, cosine=1.0). To USE it in LocalAI, the CUDA backend build must enable the flag AND bundle libcudnn - deferred until a cuDNN-bundled GPU image; flag stays OFF here. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(recon): enable cuDNN conv path on arm64+CUDA13 recon backends The voice-detect.cpp / face-detect.cpp engines have an opt-in cuDNN implicit-GEMM conv path behind VOICEDETECT_GGML_CUDNN / FACEDETECT_GGML_CUDNN (default OFF) that kills im2col on the GPU and reaches torch-cuDNN parity (SCRFD 2.3x, WeSpeaker/ERes2Net parity), measured on the GB10 (arm64, CUDA 13, sm_121a). Enable it for the CUDA build, but only where cuDNN actually ships: the arm64 + CUDA 13 image (GB10/Jetson/L4T). x86 CUDA images carry no cuDNN, so flipping it on globally for BUILD_TYPE=cublas would be a link failure. The Makefiles gate on CUDA_MAJOR_VERSION=13 + arch (TARGETARCH from the matrix/Docker build, uname -m fallback for local builds). backend/Dockerfile.golang already installs the runtime libcudnn9-cuda-13 in the arm64+CUDA13 apt block; add the matching libcudnn9-dev-cuda-13 so the build-time link resolves. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump voice-detect pin to ERes2Net blocked-default (30beecd) Defaults VD_ERES2NET_BLOCKED ON: routes the ERes2Net Res2Net body through the blocked nChw16c AVX-512 directconv island instead of the 1x1 mul_mat fast path (CONT-transpose + skinny low-K GEMM). On the shipped GGML_NATIVE=OFF build (ggml mul_mat is AVX2-only) this wins ~2x at every thread count (2.07x@1t, 2.2x@4t, 2.05x@8t); pure-AVX2 fallback still 1.3-1.62x. Parity exact (cosine=1.000000 vs golden), so registered voices + verify/identify thresholds are unaffected. The prior default-OFF rested on a stale comment whose 23pct regression only held on the non-shipping GGML_NATIVE=ON build. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * docs(readme): announce native voice-detect + face-detect backends in Latest News Add a Latest News entry for the new from-scratch C++/ggml biometric backends (voice-detect.cpp + face-detect.cpp) that replace the Python insightface and speaker-recognition backends: no Python/onnxruntime at inference, self-contained GGUF, bit-exact parity, GPU cuDNN parity. Mirrors the parakeet.cpp / locate-anything.cpp native-backend news entries. Refs PR #10441. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): re-pin to the squashed engine release commits The voice-detect.cpp and face-detect.cpp histories were squashed to a single release commit, which orphaned the previous pins (voice 30beecd, face 6107a24). Re-pin to the new single-commit SHAs (voice 3d51077, face 06914b0); the tree is identical, so the backend build is unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 09:29:08 +02:00
LocalAI [bot]	d3a26f961d	fix(ik-llama): port multimodal path to mtmd API and bump to f96eaddb (#10534 ) (#10568 ) * fix(ik-llama): port multimodal path to mtmd API and bump to f96eaddb (#10534) The IK_LLAMA_VERSION bump to f96eaddba8bed6a9a5e628bbf6a566775c70b49c pulls in upstream commit "Prune examples/llava", which deletes examples/llava (clip.* / llava.). The ik-llama backend's grpc-server.cpp built a local `myclip` library from those files and called the removed clip/llava C API, so the bump no longer builds. ik_llama keeps its multimodal stack in the surviving `mtmd` library (examples/mtmd/, public headers mtmd.h + mtmd-helper.h). This ports the backend's multimodal path onto the high-level mtmd_ / mtmd_helper_* API in place, leaving the text path (which still uses ik_llama's retained old common API) untouched: - Makefile: bump IK_LLAMA_VERSION to f96eaddb. - prepare.sh: drop the clip/llava source copy + sed block; mtmd is a library target, no source copy needed. - CMakeLists.txt: remove the `myclip` target; link `mtmd` and add its include dir; build grpc-server as C++17 (mtmd headers require it). - patches: drop 0002 (targeted the deleted examples/llava/clip.cpp; the mtmd clip.cpp never calls ggml_quantize_chunk, so the fix is unneeded). Keep 0001 (verified still applies). - grpc-server.cpp / utils.hpp: replace clip_model_load + clip_image_load_from_bytes + llava_image_embed_make_with_clip_img + the manual [img-N] prefix splitting and per-image llava_embd_batch decode loop with mtmd_init_from_file (moved after the model load, which it requires), mtmd_helper_bitmap_init_from_buf, mtmd_tokenize and mtmd_helper_eval_chunks. Legacy [img-N] tags are translated, in order, into mtmd media markers (mtmd_default_marker()); the post-image suffix text stays on the normal token path so the sampling loop is unchanged. Supersedes #10534. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(ik-llama): align json alias to ordered_json to resolve mtmd.h conflict (#10534) mtmd.h declares `using json = nlohmann::ordered_json` at global scope (and its mtmd.cpp depends on it), while ik_llama's whole server/common stack also uses ordered_json. Our grpc-server.cpp/utils.hpp kept a plain `nlohmann::json` alias, which now collides with mtmd.h once it is included for the multimodal port: "conflicting declaration 'using json = ...'". Switch our two aliases to ordered_json to match; it is API-compatible (utils.hpp already used ordered_json for its log helper) and our json never crosses into an unordered-json API. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 08:57:11 +02:00
LocalAI [bot]	13b1ae53bc	chore: ⬆️ Update ggml-org/llama.cpp to `0ed235ea2c17a19fc8238668653946721ed136fd` (#10536 ) * ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * fix(llama-cpp): link server-stream.cpp TU into grpc-server for upstream 0ed235ea (#10536) Upstream llama.cpp 0ed235ea added an SSE stream-resumption layer in a new translation unit tools/server/server-stream.cpp, which defines stream_session, stream_pipe_producer and the g_stream_sessions manager. server-context.cpp (already #included into grpc-server.cpp) now calls into it via spipe->cleanup(), stream_aware_should_stop() and stream_session_attach_pipe(), so without the new TU the grpc-server link fails on every arch with: undefined reference to `stream_pipe_producer::cleanup()' prepare.sh already copies every tools/server/* file into tools/grpc-server/, so the source is present; the only missing piece was including its definitions. Add an __has_include-guarded #include "server-stream.cpp" before server-context.cpp, mirroring the existing server-chat.cpp and server-schema.cpp guards, keeping the source compatible with older pins/forks that predate the split. The file is self-contained (its only external symbols come from server-common, already in the TU) so it adds no new undefined references; the http route-handler factories it also defines are unused in the grpc path but harmless. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(llama-cpp): build renamed ggml-rpc-server target for upstream 0ed235ea (#10536) Upstream renamed the RPC server CMake target and binary from `rpc-server` to `ggml-rpc-server` (tools/rpc/CMakeLists.txt: `set(TARGET ggml-rpc-server)`), so the RPC-enabled grpc build failed with "No rule to make target 'rpc-server'". The grpc-server itself links fine after the server-stream.cpp fix; this only updates the RPC target name and the binary path copied to llama-cpp-rpc-server. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 08:56:40 +02:00
LocalAI [bot]	e68ca109c5	chore: ⬆️ Update CrispStrobe/CrispASR to `6514c9da00b03a2f0f1b49a43fae4f3a01a41844` (#10535 ) ⬆️ Update CrispStrobe/CrispASR Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-28 08:56:24 +02:00
LocalAI [bot]	6740e988d2	chore: ⬆️ Update ggml-org/whisper.cpp to `0ae02cdb2c7317b50991367c165736ce42ed96ac` (#10532 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-28 08:56:06 +02:00
Ettore Di Giacinto	4da769c1ca	paged headers: self-include <cstddef>/<cstdint> for size_t/uintN_t (fix amd64/non-arm64 build; compile-only) Vendored paged headers used size_t / uintN_t without including <cstddef> / <cstdint>. The arm64 DGX toolchain provides them transitively so the build passed there, but amd64/older toolchains do not, failing the CI amd64 build one header at a time ('size_t' does not name a type -> cascade). paged-kv-manager.h was already fixed. This adds the missing includes to the remaining vendored headers at the point each is created/rewritten in the patch series so every src/paged.h self-includes both: paged-attn.h (0003): add <cstddef> (had <cstdint>) * paged-alloc.h (0007): add <cstddef> (had <cstdint>) * paged-prefix-api.h (0007): add <cstddef> + <cstdint> (had only llama.h) The .cpp units include their own paged header, so they inherit the includes transitively. Whole series still applies clean on the pinned llama.cpp. Compile-only change: no runtime behavior change, bit-exactness unaffected. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 06:18:56 +00:00
Ettore Di Giacinto	23b11a5239	paged-kv-manager.h: add missing <cstddef> for size_t Fixes cuda-13 amd64 / non-arm64 build where size_t was used without the header (arm64 cuda-13 pulled it in transitively; amd64/cuda-12 toolchains do not). Compile-only change, bit-exactness unaffected. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 04:09:16 +00:00
Ettore Di Giacinto	9bb8994c4e	chore(paged): drop CUDA-12 variants of llama-cpp-localai-paged, keep CUDA-13 only The paged backend targets Blackwell sm_121a, which CUDA 12.0 cannot target at all, so the CUDA-12 variants were pointless. They were also broken: the cublas-12 / nvidia-l4t / arm64 build failed to compile paged-kv-manager.cpp ("no declaration matches ...", a ~10-function mismatch the older cuda-12-base gcc rejects). CUDA-13 compiles it fine (confirmed on GB10). Removed (config-only, scoped to the paged backend): - backend-matrix.yml: the two CUDA-12 paged rows (-gpu-nvidia-cuda-12-llama-cpp-localai-paged, -nvidia-l4t-arm64-llama-cpp-localai-paged) - backend/index.yaml: CUDA-12 capability keys (nvidia-cuda-12, nvidia-l4t-cuda-12, nvidia-l4t) on both meta-backends, repointed default/nvidia to the cuda13 amd64 variant, and dropped the orphaned cuda12-* / nvidia-l4t-arm64-* variant definitions (latest + -development). Kept CUDA-13 only: cuda13-llama-cpp-localai-paged (amd64) and cuda13-nvidia-l4t-arm64-llama-cpp-localai-paged (l4t arm64). Matrix tag-suffixes <-> index variant URIs form a clean 2:2 bijection. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 01:37:54 +00:00
LocalAI [bot]	471e38e4e7	chore: ⬆️ Update leejet/stable-diffusion.cpp to `9956436c925a367daeab097598b1ea1f32d3503f` (#10533 ) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-28 01:55:44 +02:00
Ettore Di Giacinto	0b84fda496	docs(paged): add the bf16-tau opt-in line to the decode plots Per request, the plots now show all four series: llama.cpp (standard), vLLM, LocalAI's llama.cpp patches (bit-exact hero), and LocalAI's patches + bf16-tau (opt-in ceiling, +3% to +17% over the patches, ahead of vLLM at every dense width and MoE npl>=32). Subtitle flags bf16-tau as opt-in / not bit-exact. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 22:25:02 +00:00
Ettore Di Giacinto	1431f72b92	docs(paged): regenerate decode plots (3-way) from re-measured data + overview Rebuild the two committed decode plots from the re-measured CSV and add a combined overview. Three series per the comparison that matters: llama.cpp (standard) vs vLLM vs LocalAI's llama.cpp patches; x-over-standard called out at npl128. bf16-tau stays out of the plot (it remains in the CSV + the README table as the opt-in row). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 22:20:12 +00:00
Ettore Di Giacinto	3466094c68	docs(paged): re-measure DGX benchmarks on one harness (stock/patched/bf16-tau) Re-run the GB10/DGX-Spark llama-batched-bench matrix (dense q36-27b + MoE q36-35b-a3b, npl 8/32/64/128, -fa on -ngl 99 -npp 128 -ntg 128) so the CSV and README section 4 carry a single consistent set of llama numbers with all three configs: - stock: separately-built unpatched llama.cpp at this backend's exact pin 9d5d882d (toggling LLAMA_KV_PAGED on the patched binary does NOT reproduce stock - the SSM decode fusions are compiled in, not env-gated). - patched: paged binary, LLAMA_KV_PAGED=1 (+LLAMA_MOE_FORCE_GRAPHS=1 for MoE). - patched+bf16-tau: patched plus --ssm-bf16-tau 64 (opt-in, NOT bit-exact, ~91% same-top-p). final_benchmark.csv now has stock + patched + bf16-tau + vllm rows for both models at all four widths (the prior CSV had no stock and no bf16-tau rows). peak_gb is dropped: the GB10's unified LPDDR5x reports [N/A] to nvidia-smi and the bench does not print it, so per-run peak could not be captured this session. Patch series gives up to 2.46x (dense) / 2.26x (MoE) over true-stock; opt-in bf16-tau adds a further +3% to +17% on top of patched (growing with width). vLLM column is kept from the prior session (not re-run) and labeled as such. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 22:05:59 +00:00
Ettore Di Giacinto	ed5eb705c7	docs(paged): drop moot PIN_SYNC_c299a92c record, repoint to README sec 7 The paged backend's llama.cpp pin was reverted from c299a92c back to 9d5d882d (== stock), so docs/PIN_SYNC_c299a92c.md (a blow-by-blow of the reverted sync) is dead weight. The pin-sync PROCESS stays documented in the three live places: the Makefile comment, README section 7 (Pin + maintenance policy), and .agents/llama-cpp-localai-paged-backend.md. Delete the doc and repoint every reference to it (Makefile, README, .agents, canary script + workflow) at README section 7. No functional paths change: the canary's patches-dir glob (patches/paged/0*.patch) is untouched. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 21:34:10 +00:00
LocalAI [bot]	8aba4fdba3	chore(fish-speech): drop the darwin/metal build target (#10561 ) The fish-speech metal-darwin-arm64 backend build has been failing on every release (v4.5.3, v4.5.4, v4.5.5) and is a standing red on the darwin backend matrix. fish-speech pulls `tokenizers` transitively from its upstream source (`pip install -e fish-speech-src`), and on darwin/arm64 there is no prebuilt wheel for the pinned old `tokenizers` version, so pip builds it from source. Modern rustc rejects that old crate as a hard error: error: casting `&T` to `&mut T` is undefined behavior ... --> tokenizers-lib/src/models/bpe/trainer.rs:517:47 = note: `#[deny(invalid_reference_casting)]` on by default error: could not compile `tokenizers` (lib) due to 1 previous error This is deterministic, not a flake, and there is no clean fix that does not either pin a stale Rust toolchain or downgrade a soundness lint guarding real UB. Until upstream fish-speech moves to a tokenizers version that compiles on current toolchains, drop darwin support so the release backend build stays green. The Linux/CUDA/ROCm/Intel/L4T variants are unaffected. Removes: - the `-metal-darwin-arm64-fish-speech` entry from `includeDarwin` in backend-matrix.yml - the `metal:` capability mappings and the concrete `metal-fish-speech` / `metal-fish-speech-development` gallery entries in backend/index.yaml - the now-unused darwin-only requirements-mps.txt Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 23:24:21 +02:00
Ettore Di Giacinto	53f66a6f03	fix(paged): revert pin to 9d5d882d (== stock); c299a92c broke grpc-server link The c299a92c bump diverged 23 commits ahead of the stock llama-cpp pin. grpc-server.cpp is SHARED with the stock backend and tracks the stock pin; c299a92c's upstream server-API refactor pulled stream_* helpers into the headers grpc-server.cpp includes, whose definitions the stock-aligned build does not compile -> every paged variant failed to LINK (undefined reference to stream_aware_should_stop / stream_pipe_producer::cleanup / stream_session_attach_pipe). The bump was greedy-md5 bit-exact, but the bit-exact gate never exercises the full grpc-server build, so it slipped through. Revert LLAMA_VERSION to 9d5d882d (== stock pin, where the patches are bit-exact AND grpc-server links - the original DGX-proven baseline). Document the hard constraint in the Makefile, README, PIN_SYNC record, and the .agents guide: the paged pin must track the stock pin, and a pin-sync must pass the full CI grpc-server build, not only the bit-exact gate. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 20:28:28 +00:00
Ettore Di Giacinto	08b754f910	chore(paged): keep patches/ patch-only; README to backend root, docs to docs/ The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv, dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv. Restore the invariant that patches/ holds only the .patch series. Moves: - patches/paged/README.md -> README.md (canonical doc at the backend root) - patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md, final_benchmark.csv, qwen36_.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/ - patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README) Deletes: - patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section) - patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide) Repoint every reference to the moved files: README internal links (docs/ + the .github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md, .github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml, the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml, docs/content/features/backends.md, gallery/index.yaml. The build apply glob PAGED_PATCHES_DIR/0.patch (PAGED_PATCHES_DIR := .../patches/paged) is unchanged and still resolves to the 28 patches. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 13:20:05 +00:00
Ettore Di Giacinto	a4e730979d	feat(paged): restrict llama-cpp-localai-paged to CUDA-only build targets The paged backend previously built for cublas/cuda, cpu, vulkan, sycl, hipblas and darwin/metal. On non-CUDA the patchset's wins are inert: the GDN fusions are gated off (patch 0030) and NVFP4 falls back to dequant, so the backend is neutral-to-negative there (README section 4c). The darwin grpc-server link also fails on undefined upstream server symbols, turning CI red. Both broken and pointless off-CUDA, so ship CUDA-only. - backend-matrix.yml: drop the hipblas, sycl f32/f16, cpu amd64/arm64, vulkan amd64/arm64 and metal-darwin rows for this backend; keep the four cublas rows (cuda-12, cuda-13, nvidia-l4t cuda-12 and cuda-13). - index.yaml: meta-backend (and -development) capabilities are now CUDA-only with default pointing at cuda12 (mirrors faster-qwen3-tts); removed the orphaned cpu/rocm/sycl/vulkan/metal variant entries. - Removed the now-unused darwin build script and its Makefile target / .NOTPARALLEL entry / backend_build_darwin.yml step. - Documented the CUDA-only build coverage in the patch README and plan. Non-CUDA users should use the stock llama-cpp backend. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 12:29:15 +00:00
Ettore Di Giacinto	9115c2c52c	docs(paged): correct Vulkan/SYCL note (GDN op IS upstream) + CUDA-only rationale The gated-DeltaNet + SSM_CONV ops have upstream Metal/Vulkan/SYCL kernels, so the Qwen3.6 hybrids run there (non-fused) - the earlier 'no Vulkan kernel' note was wrong. The patchset's fusions are gated off off-CUDA, so the backend ships CUDA-only; non-CUDA users use stock llama-cpp. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 12:18:11 +00:00
Ettore Di Giacinto	984c8fcbea	docs(paged): Layer-2 upstream scope for native fused-GDN kernels (Metal/Vulkan/SYCL) Source-only analysis of what it would take to give the gated-DeltaNet decode fusions (0018 in-place state write-back, 0019 fused recurrent-state gather, 0021 ssm_conv_update_inplace, 0028 conv-tap gather fusion) native kernels on the non-CUDA compute backends, so the patch-series decode win extends past CUDA-family hardware. Key findings: - The base GGML_OP_GATED_DELTA_NET and GGML_OP_SSM_CONV kernels ALREADY exist upstream on Metal, Vulkan AND SYCL (the README's no-Vulkan-kernel line is stale). The Qwen3.6 hybrids run on all three today via the non-fused path; Layer-2 is the decode SPEEDUP, not enabling the model to run. - Per backend the new work is only the FUSION plumbing: redirect the GDN state write (in-place), add the ids read, write one new conv-update kernel + its ids variant, two tiny gather kernels, plus supports_op + op-handler + (Vulkan) pipeline/push-constant/descriptor wiring. Builders, CPU refs, model graph and test-backend-ops cases are shared and already done. - Bit-exactness is feasible per backend by construction (the fusions redirect addresses, not the f32 reduction order); test-backend-ops (backendX-vs-CPU) is the gate. - The 0030 name allow-list should become capability-driven (make supports_op authoritative for the discriminated src slots). - Ranked: ops-first PR, then Metal (highest value/effort, fixed simdgroup = simplest bit-exactness), then SYCL (near-verbatim CUDA mirror, cheapest to author), then Vulkan (widest hardware reach but the shader-gen + variant matrix + subgroup variance make it the capstone). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 12:11:24 +00:00
LocalAI [bot]	d11b202dd2	fix(backends): whisper darwin run.sh loads whichever fallback lib exists (.so/.dylib) (#10553 ) fix(backends): whisper darwin run.sh loads whichever fallback lib exists The macOS branch hardcoded WHISPER_LIBRARY=$CURDIR/libgowhisper-fallback.dylib, but the cmake build emits a Mach-O named libgowhisper-fallback.so on darwin, so the Go loader panicked at runtime ("dlopen ...dylib: no such file") and the backend exited ("grpc service not ready") — breaking e.g. the silero-vad-ggml VAD on darwin. Pick whichever of .dylib/.so is present so it is robust to the build's naming either way. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 14:07:56 +02:00
Ettore Di Giacinto	4a9a1dd247	docs(paged): Mac stock-vs-patched bench + Vulkan note + cross-backend learnings Section 4(c): real Apple M4/Metal numbers (Qwen3-8B Q4_K_M, stock vs patched) - patchset is neutral-to-slightly-negative on Metal (the in-kernel block-table read is CUDA-only; NVFP4/GDN-fusions inert), so prefer stock llama-cpp on Apple Silicon. Vulkan: same picture, worse (no upstream GDN op). Section 6: cross-backend learnings + upstream candidates (the GDN decode-plumbing fusions are the portable, bit-exact, CPU-mirrored win worth upstreaming). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 11:05:37 +00:00
Ettore Di Giacinto	78fac9a28f	refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series Move ALL paged-attention content out of the stock backend/cpp/llama-cpp backend and into backend/cpp/llama-cpp-localai-paged, so the stock backend is pure upstream llama.cpp and the paged backend owns and applies its own vendored patch series. - Delete the dead early-exploration scaffold backend/cpp/llama-cpp/paged/ (kernel/w4a16 Marlin scaffold, standalone paged_kv_manager, bench/loadgen, its own 0001-0002 patches, dense-era design docs, tests). Zero references repo-wide. - Move backend/cpp/llama-cpp/patches/ (the 28-patch paged series + paged/README + 3 operational docs, plus the kernel/ scaffold patch and the top-level paged README/BENCHMARKS) to backend/cpp/llama-cpp-localai-paged/patches/. The stock backend keeps no patches/ dir; it had no non-paged base patches. - Purify the stock backend: remove the LLAMA_PAGED make variable, the patches/paged apply loop, and the LLAMA_PAGED passthrough to prepare.sh; remove the paged-series handling from prepare.sh. The stock llama.cpp target now only clones the pin and applies its own (currently empty) base patches/ series. The runtime paged option hooks in the shared grpc-server.cpp are untouched (inert without the patches). - The paged backend's Makefile now applies its OWN patches/paged/0*.patch onto each freshly cloned tree via strict git apply (apply-paged-patches), after the copied stock infra clones the pin and applies base patches. - Repoint every reference to the old patches/paged path: the upstream canary workflow + apply script, bump_deps.yaml, gallery/index.yaml, the docs, backend/index.yaml, backend-matrix.yml, the top-level Makefile comments, and the moved PIN_SYNC / README docs. Drop the now-removed LLAMA_PAGED=on build-toggle from comments. Verified: the full 28-patch series applies strict-clean (git apply, exit 0) to a clean ggml-org/llama.cpp checkout at the pinned c299a92c, and the repointed canary apply script resolves and applies the series end to end. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 11:01:22 +00:00
Ettore Di Giacinto	fb2dc33d52	docs(paged): consolidate the dev-trail docs into one canonical README The paged-attention patch directory had accumulated ~55 scattered dev docs (results, progress, scope, lever, and gap-analysis notes). Consolidate the durable content of all of them into one canonical backend/cpp/llama-cpp/patches/paged/README.md covering: what the patchset is, the architecture (paged KV + block-table flash-attn, the gated-DeltaNet SSM decode path, NVFP4 FP4-MMA, the decode-first scheduler), the full 0001-0030 patch series table with bit-exact status, the GB10 benchmarks (patched-vs-stock-vs-vLLM + the Apple M4 architectural note), the dev notes (bit-exact methodology, the per-path gate, the MoE-parity conclusion, the rejected/flat levers, the opt-in bf16-SSM mode), arch+quant generality, the pin + canary maintenance policy, and the published NVFP4 gallery models. Delete the consolidated-away dev trail. Keep the three operational docs the README links to: PIN_SYNC_c299a92c.md (canary reference), PAGED_BITEXACT_NOTE.md (per-path gate reference) and LOCALAI_LLAMACPP_BACKEND_PLAN.md (the ship-as-own-backend design-of-record), plus the benchmark plots + csv. The .patch files and the unit/bench .cpp are untouched. Repoint every external reference to a deleted doc at the new README: grpc-server.cpp, docs/content/features/backends.md, gallery/index.yaml, the canary apply script (PIN_BUMP_APPLY_CHECK.md -> README), and the base patches/README.md (ADDITIVE_DESIGN.md -> README). The canary's PIN_SYNC reference still resolves; its inert SSM_DECODE_FIX_RESULTS.md glob (a patch-internal path matcher, not a repo-doc link) is left intact. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 09:25:47 +00:00
Ettore Di Giacinto	a5a5b2ad80	feat(paged): bump llama.cpp pin 9d5d882d -> c299a92c (bit-exact verified) Advance the paged-attention backend's owned llama.cpp pin by 23 upstream commits. The shipped source-only patch series (0001-0030, 28 patches) applies strict-clean (git apply, exit 0) on a fresh c299a92c checkout with no re-export needed, and the bit-exact gate is GREEN on every path on GB10 (CUDA sm_121): - md5 greedy decode (-ngl 99 -fa on -n 48 --temp 0 --seed 1): dense non-paged/paged 5951a5b4, MoE non-paged 07db32c2, MoE paged 8cb0ce23; all match the established baselines. - test-backend-ops CUDA0: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16, SSM_CONV_UPDATE_IDS 16/16, GATED_DELTA_NET 84/84, MUL_MAT 1146/1146, MUL_MAT_ID 806/806; all OK. The 23-commit upstream jump did not change our decode output. The .patch files are kept byte-identical (they already apply strict-clean at the new pin); only the pin, the PIN_SYNC evidence doc, and the canary/gallery doc references change. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 08:57:33 +00:00
Ettore Di Giacinto	7e1832b868	fix(paged): strip stray dev-doc hunks so patch series applies on a clean checkout The shipped from-patches build applies the paged series with strict `git apply` (backend/cpp/llama-cpp/Makefile `llama.cpp` target: `git apply --verbose "$p" \|\| { ...; exit 1; }`), which is atomic: a hunk against a file missing from the tree rejects the whole patch and fails the build. Four patches carried hunks against dev-only docs that live in the DGX dev tree but are absent from a clean ggml-org/llama.cpp checkout, so the build only succeeded on the DGX and FAILED on CI / any clean checkout: 0019 -> SSM_DECODE_FIX_RESULTS.md (modify hunk = the root reject) 0020 -> LEVER1_OPROJ_MMQ_RESULTS.md (create) 0021 -> CONV_STATE_FUSION_RESULTS.md (create) 0028 -> LEVER1_GATHER_PROGRESS.md, LEVER1_GATHER_RESULTS.md (create) 0019's reject cascaded to 0021/0022/0026/0028 (which build on 0019's code). Strip each `diff --git a/<devdoc>` section plus its diffstat line, `create mode` trailer, and correct the summary count. Every llama.cpp SOURCE hunk is left byte-identical (verified by sha256 of each patch's source-diff tail). Verified on a fresh clone of ggml-org/llama.cpp at the pin 9d5d882d: BEFORE, strict `git apply` failed at 0019 (cascade 0019/0021/0022/0026/0028); AFTER, the full series 0001-0030 applies with exit 0 (sentinel created, zero stray docs). The tolerant `patch -p1` fallback in prepare.sh also applies with zero rejects. PIN_SYNC_9d5d882d.md documents the durable fix: re-exports/pin-syncs must keep patches source-only (export with a source pathspec / `:!*.md`, gate with a strict `git apply` on a clean checkout). The upcoming c299a92c pin-bump re-export must produce source-only patches too. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 08:39:27 +00:00
Ettore Di Giacinto	2bee7a5ab1	ci(paged): add early-warning canary for vendored llama.cpp paged patches The paged backend (backend/cpp/llama-cpp-localai-paged) pins its own verified llama.cpp tip and is excluded from the nightly auto-bumper so a naive bump can never silently break the shipped build. That exclusion also removed the early warning of upstream drift. This restores the signal without touching the pin. Add .github/workflows/llama-cpp-paged-canary.yml (weekly + workflow_dispatch): - apply-check job (ubuntu-latest, toolchain-free): resolve the latest ggml-org/llama.cpp master tip, shallow-checkout it, and apply the full paged series 0001-0030 in order with the build's own git-apply method via the new shared helper .github/scripts/paged-canary-apply.sh. Red on any apply break. - compile job (needs apply-check): on the exact tip it validated, build the paged backend (cublas) inside the same base-grpc-cuda-12 toolchain and the same `make grpc-server` target the shipped build uses, so a red means upstream drift, not toolchain noise. nvcc compiles the kernels with no GPU present. Red here = run a PIN_SYNC (rebase + bit-exact gate + re-export), then bump the paged Makefile pin. The canary is signal-only: it opens no PR and never moves the pin, so the shipped build and the dep-bump PRs stay green regardless. It is fully separate from bump_deps. The lone pre-existing quirk in the series (patch 0019 carries a stray modify hunk against the dev-only doc SSM_DECODE_FIX_RESULTS.md, absent from any clean upstream checkout; git apply is atomic so it rejects the whole patch and cascades to 0021/0022/0026/0028) is handled path-scoped: the helper excludes only that dev-doc and still applies 0019's real code hunks atomically, mirroring prepare.sh's tolerance, so the quirk never false-positives the canary but a genuine code break in 0019 still turns it red. Point the existing pin comments in backend/cpp/llama-cpp-localai-paged/Makefile and .github/workflows/bump_deps.yaml at this canary as the drift signal, and document it in the PIN_SYNC doc: canary red -> do a pin-sync. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 08:29:09 +00:00
Ettore Di Giacinto	e160041f05	chore(paged): decouple paged llama.cpp pin from the nightly auto-bumper The llama-cpp-localai-paged backend reused backend/cpp/llama-cpp's LLAMA_VERSION, which .github/workflows/bump_deps.yaml auto-bumps nightly to the latest ggml-org/llama.cpp master tip. The stock backend is patch-free so that bump is safe, but the paged backend applies a vendored patch series (backend/cpp/llama-cpp/patches/paged/) hand-verified bit-exact against ONE specific tip. A naive bump moves the tip out from under the patches and breaks 'git apply' at build time - a dep-bump PR would go red (or, worse, the break surfaces later in a release build). Mirror the turboquant precedent: give the paged wrapper its OWN LLAMA_VERSION pin (the verified 9d5d882d) and force it into every copied build via LLAMA_VERSION=$(LLAMA_VERSION), so the nightly stock bump no longer drags the paged build to an unverified tip. Unlike turboquant (whose fork branch carries the patches and is safe to auto-bump), the paged series is vendored, so it gets NO bump_deps.yaml entry: it is advanced only by the manual PIN_SYNC process. Add cross-referencing comments in both Makefiles and bump_deps.yaml. Also add PIN_BUMP_APPLY_CHECK.md: an apply-feasibility report for the latest tip (c299a92c, 23 commits ahead). The full series applies CLEAN under 'git apply' with only benign line offsets and zero conflicts; the lone failure (0019) is a pre-existing stray dev-doc hunk, identical on the current pin, not a bump regression. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 08:02:37 +00:00
Ettore Di Giacinto	400930db19	Merge remote-tracking branch 'origin/master' into worktree-feat+paged-attention # Conflicts: # gallery/index.yaml	2026-06-27 07:48:49 +00:00
LocalAI [bot]	0258f8af55	fix(backends): repair release CI build/test breaks (kokoros, fish-speech, llama-cpp-quantization, sglang) (#10547 ) * fix(kokoros): implement new Backend RPCs to fix the build The backend.proto grew six RPCs (SoundDetection, Depth, TokenClassify, Score and the bidi-streaming Forward) that the kokoros gRPC service never implemented, so the trait impl no longer satisfies `Backend`: error[E0046]: not all trait items implemented, missing: `sound_detection`, `depth`, `token_classify`, `score`, `ForwardStream`, `forward` kokoros is a TTS backend with no use for these, so add `unimplemented` stubs (plus the `ForwardStream` associated type) matching the existing pattern for every other unsupported RPC in this file. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(fish-speech): add setuptools-rust for the editable source install install.sh installs the fish-speech source tree editable with `--no-build-isolation`, which means the build backends of its transitive dependencies must already be present in the venv. One of them builds a Rust extension and its metadata step fails with: ModuleNotFoundError: No module named 'setuptools_rust' Add setuptools-rust to requirements.txt so installRequirements provisions it before the editable install runs. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(llama-cpp-quantization): vendor convert_hf_to_gguf.py with conversion/ Upstream llama.cpp split the model-specific logic out of the single convert_hf_to_gguf.py file into a sibling `conversion/` package, so the script now starts with `from conversion import ...`. Downloading just the one file therefore fails at runtime with: ModuleNotFoundError: No module named 'conversion' Clone the repo (reusing the clone already needed to build llama-quantize) and copy both the script and the `conversion/` package into the backend dir. Python puts the script's own directory on sys.path[0], so the package resolves when it sits beside the script. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(sglang): pin the CPU source build to sglang v0.5.11 The CPU profile builds sgl-kernel from a `git clone` of sglang with no ref, so it always tracks master. Recent master added CPU kernels (e.g. mamba/fla.cpp) that fail to compile in our builder: constexpr variable 'scale' must be initialized by a constant static library kineto_LIBRARY-NOTFOUND not found Pin the clone to v0.5.11, the same release the GPU path already floors on (requirements-cublas12-after.txt). Overridable via SGLANG_VERSION so the pin can be bumped deliberately. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 09:42:22 +02:00
Ettore Di Giacinto	202a29f980	feat(paged): Metal/darwin build availability for llama-cpp-localai-paged Close the single build-targeting gap the cross-arch audit (ARCH_GENERALITY_AUDIT.md section 6, item 2) flagged: the paged backend had no Metal/darwin variant and no metal: capability key, so a Mac user selecting llama-cpp-localai-paged fell back to default=cpu (a Linux image) that does not run, with no fallthrough to stock llama-cpp. Mirror exactly how stock llama-cpp does darwin: - .github/backend-matrix.yml: add the includeDarwin row (-metal-darwin-arm64-llama-cpp-localai-paged, arch arm64, lang go) next to the stock llama-cpp darwin row. - backend/index.yaml: add the metal: capability key to the llama-cpp-localai-paged meta-backend plus the metal-llama-cpp-localai-paged and -development variant entries (URIs match the matrix tag-suffix); add Metal to tags. - scripts/build/llama-cpp-localai-paged-darwin.sh: new bespoke darwin build, a line-for-line mirror of llama-cpp-darwin.sh swapping the paged wrapper dir, binary names, ggml-shared-libs dir and output tar. Same CPU_ALL_VARIANTS + Metal path (GGML_METAL=ON via the reused llama-cpp Makefile when OS=Darwin; --target ggml pulls in ggml-metal via add_dependencies) with LLAMA_PAGED=on. - Makefile: add backends/llama-cpp-localai-paged-darwin target (+ .NOTPARALLEL). - .github/workflows/backend_build_darwin.yml: give the paged backend the same bespoke darwin build step as stock llama-cpp, share the llama ccache restore (save stays stock-only to avoid a same-run key collision), and exclude it from the generic build-darwin-go-backend step. - scripts/changed-backends.js: comment-only - the paged darwin path mapping was already present (forward-looking); update the stale "if a metal row is ever added" note now that the row exists. Metal delivers paged-KV only (NVFP4 FP4-MMA is CUDA/Blackwell-only); the GDN/conv fused ops have no Metal kernel, so a gated-DeltaNet (qwen35) model falls back to the CPU reference op at runtime - made SAFE by the fused-op backend gate (patch 0030). This is config; the Metal build runs in CI on the next push and is runtime-tested on the M4 Mac. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 07:42:08 +00:00
Ettore Di Giacinto	621a20d2b5	feat(paged): backend-gate fused GDN/discriminated SSM_CONV emission (patch 0030) Closes audit RISKY-1 (the one latent silent-miscompute hazard). The fused/in-place Gated Delta Net op (0018/0019/0026) and the discriminated SSM_CONV decode op (0021/0028, which REUSE GGML_OP_SSM_CONV / GGML_OP_GATED_DELTA_NET via a non-null src[3]/src[4] discriminator) are CUDA+CPU-only but were emitted DEFAULT-ON (cparams.fused_gdn_ar/ch=true, auto_fgdn=true) with no backend guard. A backend that supports plain SSM_CONV but ignores the discriminator (Vulkan/SYCL/Metal) would run the wrong plain conv => silent corruption. Fix: in llama_context::sched_reserve(), before the auto_fgdn resolution, force fused_gdn_ar = fused_gdn_ch = auto_fgdn = false when any non-CPU compute backend is not CUDA-family (reg name not "CUDA"/"ROCm"/"MUSA"). Every emission site keys off these flags, so the graph falls back to the upstream non-fused plain ggml_ssm_conv + ggml_silu path that every backend handles. On CUDA the reg name is "CUDA", the flags are left untouched, and the decode graph is byte-identical. Mirror of DGX paged patch 0030; adds FUSED_OP_BACKEND_GATE_RESULTS.md. Verified GPU-free: reconstructed pin 9d5d882d + paged 0001-0029 + 0030, CPU-only build (GGML_CUDA=OFF) of libllama + test-backend-ops links with 0 errors; 0030 applies cleanly via git apply and patch -p1. test-backend-ops correctness for SSM_CONV/SSM_CONV_UPDATE(_IDS)/GATED_DELTA_NET is CUDA0-vs-CPU (pending DGX, tunnel offline this session); registered test cases will exercise it. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 07:32:49 +00:00
Ettore Di Giacinto	af6e133759	docs(paged): cross-arch synthesis - ship verdict + minimum non-Blackwell fixes Synthesizes the four ARCH_GENERALITY_AUDIT sections (build-matrix, gguf-gallery-targeting, optimization-generality, patch-arch-safety) into a single cross-arch ship decision: build-safety table per target, every patch bucketed (SAFE-EVERYWHERE / BLACKWELL-ONLY-clean-fallback / GB10-TUNED / RISKY), the NVFP4 gallery recommendation, a per-arch roadmap ranked by value/effort, the empirical-verification matrix (GB10 + M4 cover all but non-Blackwell NVIDIA), and the ship verdict. Verdict: SAFE to ship as Blackwell/Linux today; the build is arch-general (no GB10 pin; FP4 code is default-off + #if-guarded) and NVFP4 GGUFs run everywhere via dequant. The one hard prerequisite before extending paged to Metal/Vulkan/SYCL is closing the backend-ungated, default-on fused GDN/conv op emission (discriminated GGML_OP_SSM_CONV via non-null src[3], CUDA+CPU only, no supports_op guard) - latent on current Linux targets, silent miscompute on a future non-CUDA paged build of a gated-DeltaNet model. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 07:07:20 +00:00
Ettore Di Giacinto	87cfd1fadb	docs(paged): quant-generality audit - SSM/serving opts are quant-agnostic (0013-0029) Source-verify each paged decode optimization as quant-agnostic (operates on the f32 gated-DeltaNet/conv recurrent state, the paged serving host path, or the generic MMQ/CUDA-graph routing) vs NVFP4-specific (only fires inside the use_native_fp4 / GGML_TYPE_NVFP4 branch). Findings: 14 of 16 landed patches are quant-agnostic (0013/0014/0015/0016/0018/ 0019/0020/0021/0022/0024/0025/0026/0028/0029). Only 0023 (MoE FP4 act-quant de-dup, inside use_native_fp4) is NVFP4-specific; 0017 is NVFP4-only but default-off/inert (kill-gate, no win). Corrects the hypothesis on 0025: the actual patch is the MUL_MAT_ID CUDA-graph guard relaxation gated on ggml_is_quantized + ggml_cuda_should_use_mmq (the generic quantized grouped-MMQ path), NOT NVFP4. The NVFP4-specific act-quant / quantize_mmq_nvfp4 work is LEVER 3, which was a measurement STOP and never landed (no patch); LEVER 4 (NVFP4 projections) KL-failed and never shipped. Adds the relative-impact-by-quant estimate (fixed f32-recurrence/host ms is the largest step fraction at NVFP4, shrinks at Q8/bf16 as the weight read grows) and the A/B plan to prove generality on a Q4_K_M requant of the same Qwen3.6 (build the control first, md5/KLD bit-exact gate per path, decode_agg npl 32/128, with 0023 as the NVFP4-only negative control). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 07:06:14 +00:00
Ettore Di Giacinto	2a2de1d6c1	docs(paged): patch-arch-safety classification for patches 0018-0029 Build-break / miscompile audit of the paged patch series. Classifies each patch general/Blackwell-gated/risky, records the only conditional arch surface (0017, fully #if-gated + default-off), and gives the per-target build-safety verdict (sm_80-90 CUDA / sm_100 / Metal-not-a-target / CPU / ROCm-SYCL-Vulkan). Flags the one latent silent-correctness hazard: fused GDN/conv ops reuse GGML_OP_SSM_CONV via a src discriminator with CUDA+CPU-only kernels and backend-ungated emission. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 07:04:05 +00:00
Ettore Di Giacinto	5667dfe461	docs(paged): arch-generality audit - optimization classification (0017-0029) Classify the paged-attention optimizations as arch-GENERAL (ship everywhere), GB10-TUNED (per-arch retune), or Blackwell-precision-specific; add the per-arch expected story (sm_100/Hopper/Ada/Metal/CPU) and the SAFETY gap (fused GDN/conv ops are CUDA+CPU-only with backend-ungated emission). Extends the prior build/gallery-targeting audit in the same file. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 07:02:54 +00:00

1 2 3 4 5 ...

1724 Commits