mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-29 19:06:43 -04:00
2fa8ef8fc53ccb2d932a1a89472486dfc80e0b59
7058 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
2fa8ef8fc5 |
fix(paged): make patch 0031 apply on the 0001-0030 base; default S3 on under paged KV
FIX A (patch 0031 compose break): the chunked GDN prefill patch carried
'#include <cuda_bf16.h>' and '#include <type_traits>' as CONTEXT lines, but
those were introduced by the dropped bf16-tau patch 0026, so on the
bf16-tau-free 0001-0030 base only '#include <cstdlib>' is present and 'git
apply' failed. The same 0026 drop also shifted 0031's later hunks off their
context (the ', hyb' kernel-launch arg, the 'STATE_BF16, HYBRID' template
params, and the GDN_LAUNCH_ARGS list). Regenerated 0031 against a fresh
pin(0ed235ea) + 0001-0030 tree: the chunked kernel now SELF-PROVIDES the
cuda_bf16.h / type_traits includes (adds them, plus the climits it needs for
INT_MAX) and the dispatch guard is the 2-param 'if constexpr (!KDA &&
!keep_rs_t)' form. Behaviour is unchanged: 0031 stays opt-in, default OFF
(GDN_CHUNK_MIN), a recorded negative. The full 0001-0042 series now applies
clean on 0ed235ea ('git apply --check' green for every patch).
FIX B (patch 0041 S3 default): the decode-shape-stable scheduler defaulted OFF.
Make it default ON whenever paged KV is active (LLAMA_KV_PAGED set), still
overridable to off via LLAMA_PAGED_DECODE_STABLE=0. Minimal host-side change in
update_slots(); re-exported from the dev tree, README 0041 row updated to match.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
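The FIX B default described above can be sketched as a small resolution function. This is only an illustration: the real check is C++ inside update_slots(), the name `decode_stable_enabled` is hypothetical, and two details are assumptions of this sketch rather than facts from the commit (that LLAMA_KV_PAGED counts as "set" when present and non-empty, and that any explicit LLAMA_PAGED_DECODE_STABLE value other than `0` turns the scheduler on).

```python
def decode_stable_enabled(env):
    """Resolve the S3 decode-shape-stable default from an environment mapping.

    Hypothetical helper: mirrors the policy in the commit message, not the
    actual C++ code in update_slots().
    """
    override = env.get("LLAMA_PAGED_DECODE_STABLE")
    if override:
        # An explicit setting wins. "0" forces the scheduler off (per the
        # commit); treating every other value as "on" is an assumption.
        return override != "0"
    # No override: default ON only when paged KV is active.
    # Assumption: "set" means present and non-empty.
    return bool(env.get("LLAMA_KV_PAGED"))
```

Passing `os.environ` as `env` would evaluate the policy for the current process; a plain dict is enough to explore the four combinations.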
|
|
d706980c2b |
feat(paged): close the continuous-serving decode gap (S1+S3, patches 0040/0041)
Add the two decode-serving graph-reuse levers (validated on GB10) that close the host-bound serving gap (paged dropped to ~3.7 vs vLLM ~5.9 tok/s/seq in real continuous serving while tying it in static batched-bench).
- 0040 S1 paged decode-graph reuse: the paged decode inputs never overrode llm_graph_input_i::can_reuse (defaults false), so the host rebuilt the ggml graph on EVERY decode step (layer-A reuse 0%). Add a 256-bucketed-shape can_reuse + a live-mctx refresh from the owning attn input. Bit-exact (md5 byte-identical reuse on/off). Static batched-bench: paged reuse 0% -> 95.5%.
- 0041 S3 decode-shape-stable scheduling: keep co-batched prefill out of decode steps so the scheduler emits the reuse-stable pure-decode shape S1 can reuse. Default-off policy on top of 0016; bit-exact (per-stream independent).
S1+S3 together (128-client staggered serving, MoE Qwen3.6-35B-A3B-NVFP4): graph reuse 0% -> 72.2%, hostproc 15.98 -> 6.31 ms/step, decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9). S1 alone is insufficient (13.8%); S3 is the multiplier.
S2 (double-buffer set_inputs) dropped: Phase-0 put set_inputs at ~0.05 ms/step, so it has nothing to recover.
README patch table + DECODE_SERVING_SCOPE.md updated with results and the padded/fixed-slot follow-up.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
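The "256-bucketed-shape can_reuse" idea from patch 0040 can be sketched as follows. This is a simplified model, not the patch itself: which dimension is bucketed and what else the real can_reuse compares are not stated in the commit, so the (token count, KV length) pair and the names `bucket` and `can_reuse` below are assumptions made for illustration.

```python
BUCKET = 256  # bucket size named in the commit message


def bucket(n, size=BUCKET):
    """Round n up to the next multiple of size (at least one bucket)."""
    return max(1, -(-n // size)) * size


def can_reuse(prev_shape, next_shape):
    """Hypothetical reuse test on (n_tokens, n_kv) pairs.

    The graph is treated as reusable when the token count is identical and
    the KV length falls in the same 256-wide bucket, so small per-step
    growth does not force a rebuild.
    """
    prev_tokens, prev_kv = prev_shape
    next_tokens, next_kv = next_shape
    return prev_tokens == next_tokens and bucket(prev_kv) == bucket(next_kv)
```

Under this model a decode step that grows the KV length from 1000 to 1001 keeps the same bucket (1024) and reuses the graph, while crossing from 1024 to 1025 moves to the next bucket and rebuilds once.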
|
|
000705321f |
feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold (patch 0033, default-off)
Option (a) of PREFILL_GEMM_SCOPE.md: route large-M (prefill) NVFP4 dense weight GEMMs off the decode-tuned FP4-MMQ kernel onto the dequant->bf16 cuBLAS (nvjet) tensor-core path, wired via an M-threshold in ggml_cuda_should_use_mmq. Lands the validated, bit-exact-gated mechanism and records the honest GB10 result: it is a regression, so it ships default-off (== stock), mirroring the patch-0017 default-off discipline. Three-edit scaffold (no new kernel): should_use_mmq routes NVFP4+Blackwell+dense M>LLAMA_FP4_PREFILL_M to cuBLAS; op_mul_mat_cublas gains an NVFP4 branch that dequants the FP4 weights to a transient bf16 pool buffer (not cached - stays FP4-resident) and runs cublasGemmEx CUDA_R_16BF/COMPUTE_32F; ggml_get_to_bf16_cuda gains the NVFP4 case. Bit-exact gate PASS (benign): test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID 806/806; the forced path (LLAMA_FP4_PREFILL_M=64) is green CUDA-vs-CPU at NVFP4 large-M shapes; greedy md5 on q36-27b is byte-identical to FP4-MMQ both for short prefill (5951a5b4, decode untouched) and for a >threshold prefill that exercises the bf16 path (5f3967df - no greedy argmax flips). Performance REGRESSES on GB10 (S_PP, q36-27b dense, A/B via env): M=512 958.99 -> 486.65 (-49%), M=1024 1013.65 -> 587.27 (-42%), M=2048 918.46 -> 649.42 (-29%). The scope premise (FP4-MMQ ~3% of FP4 peak at large M) is false here: FP4-MMQ beats bf16-cuBLAS because bf16 peak is ~half FP4 peak and the per-step weight dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate, only partially amortizing as M grows. Default-off keeps stock S_PP (966.98). Phase 2 (MoE grouped large-M) not implemented: it inherits the same bf16-peak<FP4-peak ceiling plus a per-expert dequant, so grouped bf16-cuBLAS would regress for the same reason; a real prefill GEMM win needs option (b), a native FP4-MMA large-M kernel. Full A/B in docs/PREFILL_GEMM_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
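The routing rule and the reported S_PP deltas from patch 0033 can be checked with a short sketch. The predicate is a hypothetical stand-in for the condition added to ggml_cuda_should_use_mmq (its name, arguments, and the use of `None` to model the default-off state are assumptions); the throughput pairs are copied from the commit message.

```python
def routes_to_bf16_cublas(weight_type, is_blackwell, is_dense, m, threshold):
    """Hypothetical predicate for the large-M NVFP4 prefill route.

    threshold=None models the default-off state (stock behaviour); how the
    real code represents "off" is an assumption of this sketch.
    """
    if threshold is None:
        return False
    return weight_type == "NVFP4" and is_blackwell and is_dense and m > threshold


def pct_change(before, after):
    """Relative change in percent, negative for a regression."""
    return (after - before) / before * 100.0


# S_PP (q36-27b dense) before -> after, keyed by M, from the commit message.
S_PP = {512: (958.99, 486.65), 1024: (1013.65, 587.27), 2048: (918.46, 649.42)}
changes = {m: round(pct_change(b, a)) for m, (b, a) in S_PP.items()}
```

The computed `changes` match the rounded -49% / -42% / -29% figures quoted in the message, and the shrinking deficit as M grows is consistent with the partial amortization of the per-step dequant that the message describes.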
|
|
4bdd26a7f0 |
docs(paged): scope tensor-core (mma) chunked GDN prefill kernel
Scopes the follow-up recorded by patch 0031 + README section 5: replace the serial per-thread reductions of the chunked gated-DeltaNet prefill scan with mma.sync tensor-core matmuls and lift the 1-block/SM occupancy ceiling, the path that would beat the tuned sequential scan and close the GDN prefill bucket toward vLLM's ~2.5x-cheaper chunked scan. Confirmed (not assumed) the GB10/sm_121a tensor-core reality: consumer Blackwell (SM12x) has NO wgmma (Hopper-only) and NO tcgen05/TMEM (sm_100a data-center only); the usable path is the extended mma.sync family. So the kernel is a warp-synchronous mma.sync + cp.async design (reusing ggml's mma.cuh tiles), not a wgmma/TMA/tcgen05 design - patch 0031's 'mma/wgmma' shorthand reads as mma only on this part. Design: register-resident state frees the 64KB that forced C=16, admitting C=64 under the 99KB shared opt-in; tf32 inputs / f32 accumulate with a 3xtf32 precision ladder; decays/gamma/beta stay f32 outside the mma to preserve the bounded de-gating; A-inverse via blocked forward substitution (FLA UT transform) with mma off-diagonal coupling. Mechanism: chunking cuts state-BW ~Cx, mma absorbs the O(C^2) intra-chunk flops the serial 0031 could not. Honest: multi-week, high risk, no vendor kernel to route to on sm_121; gains beat the sequential scan and close most of the bucket but not full sm_100-class parity. KL-gate binding (NMSE likely fails at reduced precision). Phased: re-profile -> two-product PoC -> full intra-chunk + C=64 + reg-state -> occupancy/cp.async; opt-in default-OFF until A/B-proven. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
9a28f23134 |
docs(paged): scope the continuous-serving decode gap (host-bound, design-only)
Add DECODE_SERVING_SCOPE.md: the decode KERNEL is at parity in static batched-bench (~6.1 tok/s/seq ~ vLLM ~5.9 at npl128) but continuous serving through llama-server update_slots() drops to ~3.7 (-39%) while vLLM sustains ~5.9. Scope shows the gap is the scheduler/host loop, not the kernel. Root-cause hypothesis from source: continuous batching's batch-shape + seq-set churn breaks BOTH graph-reuse layers every step - llama-context can_reuse/allow_reuse (n_tokens + seq-set must match) and the CUDA ggml_cuda_graph update_required memcmp (ne/nb/data ptrs) - so the GPU idles while the host rebuilds + re-captures the graph and runs un-graphed set_inputs. vLLM avoids this with padded/bucketed decode shapes + piecewise CUDA graphs. Documents that the shipped scheduler patches (0008/0013/0016/0024/0025/0029) target prefill freezing + burst collapse, NOT decode-step graph reuse, which is why the serving gap survives them; notes the README s.5 'lever 2 graph coverage FLAT' verdict was static-regime and is reopened here for serving only. Ranks host-side, bit-exact-safe levers: S1 bucketed/padded decode-step shape for graph reuse, S2 double-buffer/overlap per-step host work, S3 graph-shape-stable scheduling (extend 0016). Specifies a Phase-0 profile to confirm host-bound before any build, reusing the in-tree [L5INSTR] hostproc/set_inputs/get_block_table timers, the 'graphs reused' perf counter, LLAMA_GRAPH_REUSE_DISABLE and nsys GPU-busy%, with vLLM ground-truthed at the same concurrency. No kernel code; no GPU run in this pass. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
e610347367 |
feat(paged): chunked parallel-scan GDN prefill kernel (patch 0031)
Adds patch 0031 to the paged llama.cpp series: an FLA-style chunked parallel-scan prefill kernel for gated DeltaNet (the upstream gated_delta_net.cu "Add chunked kernel for even faster pre-fill" TODO). Scope: non-KDA scalar gate, f32 state, final-state-only, homogeneous. Bit-exact-benign (NEW per-path): test-backend-ops GATED_DELTA_NET 91/91 within the 1e-7 NMSE gate vs the CPU reference (patch adds 8 S_v=128 prefill cases: exact-multiple / tail / multi-seq / GQA / permuted); numpy prototype confirms f32 chunked-vs-sequential NMSE ~1e-13. OPT-IN, default OFF: GB10's 99KB dynamic-smem opt-in forces C=16 (the 128x128 f32 state is 64KB of the all-shared layout), pinning the kernel to 1 block/SM with serial dk-reductions. Measured ~761 t/s chunked vs ~971 t/s sequential (~22% slower) on q36-27b-nvfp4 prefill, so it defaults OFF (enable with GDN_CHUNK_MIN=<n>); the backend default is regression-free. Beating the 84.7%-of-peak sequential scan needs tensor-core matmuls / register-resident state with larger chunks (recorded in README section 5). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
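The shared-memory budget and the slowdown quoted for patch 0031 follow from simple arithmetic, reproduced here as a sanity check. The only assumption is that "KB" in the commit message means 1024 bytes; all other numbers come from the message.

```python
KIB = 1024               # assumption: the commit's "KB" is 1024 bytes
S_V = 128                # state is S_V x S_V per head
F32_BYTES = 4
SMEM_OPT_IN = 99 * KIB   # GB10 dynamic shared-memory opt-in limit per block

# The f32 recurrent state kept entirely in shared memory.
state_bytes = S_V * S_V * F32_BYTES
# What is left for the per-chunk q/k/v tiles and scratch, which is why the
# chunk size is squeezed down to C=16 in the all-shared layout.
remaining_bytes = SMEM_OPT_IN - state_bytes

# Measured prefill throughput from the commit message (tokens per second).
chunked_tps, sequential_tps = 761.0, 971.0
slowdown_pct = (1.0 - chunked_tps / sequential_tps) * 100.0
```

With these inputs the state takes 64 KiB, leaving 35 KiB of the 99 KiB budget, and the chunked kernel comes out roughly 22% slower than the sequential scan, matching the figures in the message.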
|
|
11128cb080 |
docs(paged): scope the large-M NVFP4 prefill GEMM lever (design only)
Design + plan for the #1 prefill lever: NVFP4 weight GEMM at large M, where MMQ (decode/M<=128-tuned, 1 CTA/SM, 128-col tile cap) is ~3.4x slower than vLLM's marlin/cutlass large-M path (~51% of the prefill gap). Recommends (a) dequant->bf16 cuBLAS routed by an M-threshold (dense first, MoE grouped-cuBLAS second); rejects (b) a from-scratch Marlin/FP4 kernel as a multi-week project. Key enabling finding: NVFP4->bf16 dequant kernels already exist, and NVFP4 is currently force-excluded from the tensor-core cuBLAS path (falls to f32 Sgemm) - relaxing that one guard is the pivot. Honest: bf16-cuBLAS banks ~60-75% of the GEMM gap, not full 68us/tok parity (bf16 TC peak ~half FP4). Design only - no kernel, no GPU run. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] |
||
|
|
4cd90bfae9 |
paged: drop bf16-tau (patch 0026), subsumed by decode fusions (tau=100000 flat, zero speed benefit)
The opt-in hybrid per-head bf16 SSM-state lever (ssm_bf16_tau, patch 0026) is removed from the llama-cpp-localai-paged patch series. Clean re-measurement after the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache) landed shows it buys nothing: forcing ALL gated-DeltaNet heads to bf16 (tau=100000, the most aggressive setting) gives flat decode throughput, 780.6 vs 780.0 t/s. The mode engages but adds zero speed because it is subsumed by the fusions. The earlier "+12%" was measured before the fusions completed. bf16-tau was a precision trade (not bit-exact, ~91% same-top-p) plus extra bug surface and extra CUDA template-instantiation compile cost with no offsetting benefit.
Dependency check: no later patch (0028/0029/0030) depends on 0026. 0030's only mention is a description comment; its code keys off fused_gdn_ar/ch/auto_fgdn, which originate in 0018/0019/0021 (before 0026). The remaining series (0001-0025, 0028-0030) applies clean with git apply --check against the pin 0ed235ea2c17a19fc8238668653946721ed136fd. The Makefile applies the series by glob (patches/paged/0*.patch); the resulting gap at 0026 is tolerated (0005/0027 are already absent).
Removed:
- patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch
- the dead ssm_bf16_tau / ssm_hybrid_tau option handler in the shared grpc-server.cpp (it only set LLAMA_SSM_BF16_TAU, now a no-op the library no longer reads)
- the patched+bf16-tau benchmark columns and llama-patched-bf16tau rows (README + final_benchmark.csv), the ssm_bf16_tau option text in backend index.yaml, the gallery NOTE block, and the docs/features/backends.md mention.
The rejected-lever lesson is kept (why it was dropped: subsumed, tau=100000 flat) in the backend README section 5, the paged-backend agent guide, and the vLLM-parity methodology, so it is not re-tried.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
2c59805267 |
fix(paged): rpc cmake target renamed rpc-server -> ggml-rpc-server at pin 0ed235ea
llama.cpp renamed the RPC tool target (tools/rpc/CMakeLists.txt: set(TARGET ggml-rpc-server)) at the 0ed235ea pin. master already updated the stock llama-cpp Makefile to match (--target ggml-rpc-server, cp bin/ggml-rpc-server); the paged backend's separate Makefile copy was left stale and its -grpc (RPC) variant failed with 'No rule to make target rpc-server' (grpc-server itself built to 100%). Mirror the stock rename in the paged Makefile. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
c51ff4cec9 |
docs(paged): scope porting the portable benefits to Metal/SYCL/Vulkan (+ROCm)
Add ACCELERATOR_PORTING_SCOPE.md, the umbrella scope for taking the paged backend's accelerator-portable wins off the CUDA family. It builds on (does not duplicate) UPSTREAM_LAYER2_SCOPE.md, which stays the GDN/SSM-fusion detail (benefit #1), and adds:
- Benefit #2 (paged KV in-kernel block-table flash-attn read, 0009-0011): new per-backend feasibility from source analysis of the Metal/SYCL/Vulkan flash-attn kernels. SYCL EASY (near line-for-line CUDA mirror), Metal EASY-MEDIUM (decode already routes to the vec kernel), Vulkan MEDIUM (the fast coopmat2 NVIDIA decode path cannot do the indexed read; push-constants are full). Universal constraint: only the vec/scalar decode kernel admits the per-cell indexed read, so route block-table ops onto vec (as CUDA's 0009-0010 dispatch guard already does) and leave the fast MM/coopmat2 path contiguous-only. This is the lever that flips paged KV from neutral-to-slightly-negative to non-negative off CUDA.
- Benefit #3 (decode-first scheduler, 0013/0016): confirmed a free portable win - host-side update_slots() policy, zero kernel work, runs on any accelerator as-is.
- Benefit #4 (NVFP4 FP4-MMA, 0017/0023/0025): out of scope (Blackwell only); flags the backend-agnostic analogues of the act-quant dedup and the graph-coverage lever without over-claiming a port.
- A ROCm note: ROCm rides the CUDA/HIP path (validate, don't re-port); FP4-MMA stays Blackwell-only.
Benefits #1 and #2 share the port shape and rank Metal->SYCL->Vulkan, so they bundle into one per-backend PR behind a shared ops-first PR. Cross-link added from UPSTREAM_LAYER2_SCOPE.md. All gates are test-backend-ops on-target (no Metal/SYCL/Vulkan/ROCm hardware here).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
ea72a56e2c |
Merge origin/master + pin-sync paged backend to 0ed235ea
master auto-bumped the stock llama-cpp pin 9d5d882d -> 0ed235ea and updated the shared grpc-server.cpp. The paged backend's pin must track the stock pin (the grpc-server.cpp is shared), so bump its LLAMA_VERSION to match. All 28 paged patches apply clean on 0ed235ea (verified against a fresh upstream clone). The bf16-tau state-serialization fix (patch 0026) is included. Bit-exact gate + full grpc-server build verify on GPU/CI to follow. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
1f3e5ba301 |
fix(paged): serialize both SSM partitions in hybrid bf16-tau state save/restore (patch 0026)
The opt-in ssm_bf16_tau hybrid mode splits a gated-DeltaNet layer's recurrent SSM state into an f32 partition (s_l) and a bf16 partition (s_l_bf16). The recurrent state serialization paths (state_write_data / state_read_data) were never updated for the split: they read/wrote s_l using the FULL hparams.n_embd_s() (S_v*S_v*H) row width, but a split layer's s_l only holds S_v*S_v*n_f32, so the access overruns the smaller tensor (a ggml_backend tensor read out of bounds), and the bf16 fast-head partition was never persisted at all. This is what broke high-concurrency serving with --ssm-bf16-tau: the server's context-checkpoint feature serializes per-sequence state via state_seq_get_data. With a checkpoint enabled, even a single request triggered the out-of-bounds read; at higher concurrency the cell range starts at a higher base slot so the overrun reaches further (hard abort in a debug build, silent state corruption then 1-token-then-EOS on restore in a release build). The static batched-bench never exercises save/restore so it did not catch it; the GDN decode kernel and per-head partition offsets were already correct (decode with checkpoints disabled is fine at N=8/16/32). Fix: serialize the f32 partition and, when the layer is split, the bf16 partition right after it, each with its OWN row width (tensor ne[0]). head_slot is rebuilt deterministically at load (same model + tau), so it is not serialized. Non-split layers have ne[0] == n_embd_s() and no bf16 partition, so their on-disk format and behavior are byte-identical (the default f32 path and the bit-exact gate are unaffected). Verified on GB10/DGX with Qwen3.6-35B-A3B-NVFP4 + --ssm-bf16-tau 64 via a continuous-batching llama-server: with context checkpoints enabled, N=8, N=16 and N=32 (slot reuse + restore) all now produce full coherent 128-token output and the server stays up; pre-fix the same config aborted on the first checkpoint. 
Assisted-by: Claude:claude-opus-4-8[1m] [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
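The out-of-bounds read fixed in this commit comes down to row-width arithmetic, sketched below. The function name and the example head counts are hypothetical, and the sketch assumes the bf16 partition holds exactly the heads that are not in the f32 partition; the commit only states the full width (S_v*S_v*H) and the f32 partition width (S_v*S_v*n_f32).

```python
def ssm_row_widths(s_v, n_heads, n_f32):
    """Per-row element counts for a split gated-DeltaNet SSM state.

    Returns (full, f32_part, bf16_part). Assumption: every head that is
    not in the f32 partition lives in the bf16 partition.
    """
    full = s_v * s_v * n_heads    # hparams.n_embd_s(), the pre-fix row width
    f32_part = s_v * s_v * n_f32  # what a split layer's s_l actually holds
    bf16_part = full - f32_part   # what s_l_bf16 holds under the assumption
    return full, f32_part, bf16_part


# Hypothetical split: 32 heads, 8 kept in f32, S_v = 128.
full, f32_part, bf16_part = ssm_row_widths(128, 32, 8)

# Pre-fix, each row of s_l was accessed with the full width, so this many
# elements per row fell outside the smaller f32 tensor.
overrun_per_row = full - f32_part
```

For a non-split layer (`n_f32 == n_heads`) the f32 width equals the full width and the bf16 width is zero, which is why the default path's serialized format is unchanged by the fix.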
|
|
de2ec2f136 |
feat(backends): add voice-detect + face-detect ggml backends (replace Python insightface/speaker-recognition) (#10441)
* feat(voice-detect): add Go purego backend for voice-detect.cpp Add backend/go/voice-detect implementing the Backend gRPC voice subset (VoiceEmbed/VoiceVerify/VoiceAnalyze) over libvoicedetect.so via purego, mirroring the parakeet-cpp / omnivoice-cpp backends. The flat voicedetect_capi C ABI is dlopen'd cgo-less; malloc'd string and float-vector returns are owned by Go and released through the matching capi free functions, with the per-ctx last error surfaced into Go errors. Calls are serialized via base.SingleThread since the C context is not reentrant. Proto field mapping: - VoiceEmbed: VoiceEmbedRequest.audio (path) -> embed_path -> Embedding+Model. - VoiceVerify: audio1/audio2 + threshold (<=0 falls back to the verify_threshold option, default 0.25) -> verify_paths -> verified/distance/ threshold/confidence/model/processing_time_ms. - VoiceAnalyze: audio (path) -> analyze_path_json; the JSON age/gender/emotion document maps to a single VoiceAnalysis segment (start/end 0; gender "label" -> dominant_gender with the remaining float scores as the gender map; emotion label/scores -> dominant_emotion/emotion). The Makefile pins voice-detect.cpp to 47546430, clones+builds libvoicedetect.so with ggml static-linked (PIC, GGML_NATIVE off) so dlopen needs no external libggml/libvoicedetect; ldd on the artifact shows only system libs. Ginkgo tests cover option parsing and analyze-JSON mapping; embed/verify smoke specs gate on VOICEDETECT_BACKEND_TEST_MODEL + VOICEDETECT_BACKEND_TEST_WAV. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(voice-detect): wire backend into index, gallery and build Register the voice-detect.cpp speaker-recognition + voice-analysis backend (added in Voice-INT-A) into LocalAI's distribution surfaces, mirroring the ced backend (the closest mudler C++/ggml audio analogue): - backend/index.yaml: add the &voicedetect meta-backend (capabilities platform map, no top-level uri) plus the full set of concrete per-arch image entries (cpu/cuda12/cuda13/metal/rocm/sycl/vulkan/l4t and the -development variants). Referential integrity audited - every alias target resolves. - gallery/index.yaml: add 5 model entries on backend voice-detect - ECAPA-TDNN, WeSpeaker ResNet34, 3D-Speaker ERes2Net, CAM++ and the wav2vec2 age/gender/emotion analyze model. The engine architecture is read from GGUF metadata (voicedetect.arch) at load. GGUF artifacts are not yet published: each files: entry points at the intended mudler/voice-detect-gguf location with a TODO to fill sha256 after upload (no fabricated hashes). - .github/backend-matrix.yml: add the linux build matrix block + the darwin metal entry mirroring ced. - .github/workflows/bump_deps.yaml: track mudler/voice-detect.cpp via VOICEDETECT_VERSION (pin 47546430, = 4754643). - core/config/backend_capabilities.go: register voice-detect in the backend capability map (VoiceVerify/VoiceEmbed/VoiceAnalyze -> speaker_recognition), mirroring speaker-recognition. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(face-detect): add purego Go backend for face-detect.cpp Add the LocalAI Go backend that dlopens libfacedetect.so (the flat facedetect_capi_* C-ABI) via purego, mirroring the sibling voice-detect backend. Implements the Face subset of the Backend gRPC service: - Embeddings(PredictOptions): Images[0] base64 -> temp file -> embed_path -> L2-normalized ArcFace embedding. 
- Detect(DetectOptions): src -> detect_path_json -> Detection boxes (class_name "face", [x1,y1,x2,y2] -> x/y/w/h). - FaceVerify(FaceVerifyRequest): two images + threshold + anti_spoof -> verify_paths; best-effort img areas via detect. - FaceAnalyze(FaceAnalyzeRequest): img -> analyze_path_json -> per-face age + gender ("M"/"F" normalized to "Man"/"Woman"). The Makefile pins face-detect.cpp to 636a1963 and builds the shared lib with ggml + vendored libjpeg-turbo static (PIC), so the .so is ldd-clean (no libggml) and exports only facedetect_capi_* (no jpeg_ symbols). Gated Ginkgo e2e mirrors voice-detect. Note for the gallery-wiring task: backend registration (index.yaml, gallery, core/config/backend_capabilities.go) is intentionally not touched here. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(voice-detect): replace em dashes in net-new descriptions Project style forbids em/en dashes. Replace the three U+2014 chars introduced by the voice-detect gallery/index wiring with `-`/`:`. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(face-detect): wire backend into index, gallery and build Register the face-detect.cpp face detection / embedding / verification / analysis backend (added in Face-INT-A) into LocalAI's distribution surfaces, mirroring the voice-detect wiring (the closest mudler C++/ggml recognition analogue): - backend/index.yaml: add the &facedetect meta-backend (capabilities platform map, no top-level uri to avoid the meta-backend gotcha) plus the full set of concrete per-arch image entries (cpu/cuda12/cuda13/ metal/rocm/sycl-f16/sycl-f32/vulkan/l4t and the -development variants), 22 entries. Referential integrity audited: every alias target resolves. 
- gallery/index.yaml: add 4 model entries on backend face-detect - face-detect-buffalo-l/m/s (insightface SCRFD + ArcFace/MBF, NON-COMMERCIAL) and face-detect-yunet-sface (OpenCV-Zoo YuNet + SFace, APACHE-2.0, the commercial-friendly alternative). The detector/embedder architecture is read from GGUF metadata (facedetect.arch) at load; only the real verify_threshold option is set (0.35 buffalo, 0.363 sface). GGUF artifacts are not yet published: each files: entry points at the intended mudler/face-detect-gguf location with a TODO to fill sha256 after upload (no fabricated hashes). - core/config/backend_capabilities.go: register face-detect in the backend capability map (Embedding/Detect/FaceVerify/FaceAnalyze -> face_recognition), mirroring insightface. - .github/backend-matrix.yml: add the linux build matrix block + the darwin metal entry mirroring voice-detect. - .github/workflows/bump_deps.yaml: track mudler/face-detect.cpp via FACEDETECT_VERSION (pin 636a1963). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(recon): voice-detect metal build branch + face-detect gallery usecases Add the missing metal BUILD_TYPE branch to the voice-detect Makefile forwarding -DVOICEDETECT_GGML_METAL=ON, mirroring face-detect, so the darwin metal CI artifact is built with the Metal backend instead of CPU-only. Expand the 4 face-detect gallery models' known_usecases to [face_recognition, detection, embeddings] to match the backend capabilities map and the mirrored insightface-buffalo entries, so auto-selection for /v1/detect and /embeddings works. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * docs(recon): document voice-detect and face-detect ggml backends Document the new standalone C++/ggml biometric backends as the recommended/default option for face and voice recognition, keeping the existing Python insightface / speaker-recognition backends framed as the legacy path. - features/face-recognition.md: add a face-detect (ggml) backend section with the gallery entries (buffalo-l/m/s non-commercial, yunet-sface Apache-2.0), licensing, and verify/detect/analyze quickstart. - features/voice-recognition.md: add a voice-detect (ggml) backend section with the gallery entries (ecapa-tdnn, wespeaker-resnet34, eres2net, campplus speaker recognizers; emotion-wav2vec2 non-commercial analyze head) and quickstart. - reference/compatibility-table.md: add face-detect.cpp and voice-detect.cpp rows to the Vision, Detection & Recognition table. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(gallery): publish recon backend GGUF uris + sha256 Fill in the published HuggingFace GGUF uris and verified sha256 for the 9 recon gallery entries (voice-detect-* and face-detect-*), and remove the TODO publish markers. Correct the eres2net, campplus, and emotion-wav2vec2 uris to the actual published filenames. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(gallery): re-embed buffalo anti-spoof + add audeering age/gender voice model Update the 3 buffalo face-detect GGUF sha256 (anti-spoof ensemble now embedded and re-uploaded under the same filenames/uris) and note the FaceVerify anti_spoof request flag in each description. Add a new voice-detect-age-gender-wav2vec2 gallery entry mirroring the emotion model. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(gallery): add face-detect-buffalo-sc and antelopev2 packs Add gallery entries for two newly-published insightface face packs on the face-detect backend: buffalo_sc (smallest pack, SCRFD-500M + small ArcFace) and antelopev2 (higher-accuracy, SCRFD-10G + ArcFace glint360k R100, 512-d). Both are non-commercial research-only. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(recon): honor LocalAI per-model threads in voice/face-detect backends LocalAI spawns one backend process per model and serves requests concurrently, so the engines' own min(hardware_concurrency, 8) default can oversubscribe cores. Forward the per-model Threads value from the gRPC LoadModel options into the engine via VOICEDETECT_THREADS / FACEDETECT_THREADS (read at backend construction) before the capi load. A non-positive Threads is treated as unset, leaving the engine default. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump backend pins to CPU-optimized engine commits voice-detect.cpp -> 0d9c1b3 (radix-2 FFT FBank, threads, flash attn + cached pos-conv); face-detect.cpp -> 523aee1 (thread-gated direct conv, threads). Brings the CPU optimizations into the LocalAI backend builds. GGUF format and parity unchanged, so the published HF GGUFs remain valid. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump backend pins to round-2 CPU-optimized engines voice-detect.cpp -> fe7e6a3 (ERes2Net 1x1->mul_mat, CAM++ layout+context, wav2vec2 conv-LN, ECAPA capture-drop, AVX512 dispatch opt-in); face-detect.cpp -> 9c8adb7 (AVX2 Winograd F(2x2,3x3) for SCRFD/ArcFace 3x3 convs, ArcFace BN-fold). Parity unchanged (cosine=1.0); GGUF format unchanged, HF GGUFs valid. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump backend pins to round-3 Winograd engines voice-detect.cpp -> 45122ec (Winograd F(2x2,3x3) for WeSpeaker/ERes2Net 3x3 convs, -22%/-20% @8t); face-detect.cpp -> cd5c962 (Winograd F(4x4,3x3) for SCRFD large maps, -22% @1t on top of F(2x2), more load-stable). Parity held (cosine=1.0); GGUF format unchanged, HF GGUFs valid. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump backend pins to round-4 Winograd engines (CPU opt complete) voice-detect.cpp -> d2839ca (CAM++ FCM 2D convs through Winograd, -15.5%/-10.3%); face-detect.cpp -> c1db23d (AVX2-vectorized Winograd tile transforms, SCRFD detect -14%/-9.6%). Final CPU optimization round; the conv-kernel lever class is now exhausted (parity held cosine=1.0; GGUF/parity unchanged, HF GGUFs valid). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump face-detect pin to deep-kernel engine (7ae5c4d) face-detect.cpp -> 7ae5c4d: register-blocked winograd-domain GEMM microkernel (2.8x isolated GFLOP/s), AVX-512 zmm evolution behind runtime CPUID dispatch (ship-safe, AVX2 fallback bit-identical), bias/relu fused into the winograd output transform, and SFace Conv+BN fold + bias/PReLU fusion. SCRFD detect ~1.4x faster end-to-end vs the round-4 baseline; parity bit-exact; portable single binary (function-multiversioned, no global -mavx512f). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump voice-detect pin to ECAPA operand-order win (e9c56ae) voice-detect.cpp -> e9c56ae: weight-as-src0 mul_mat order in ECAPA's F32 conv1d_same (routes through tinyBLAS sgemm); ECAPA embed 1.67x @1t / ~1.3x @8t, parity cosine=1.0. 
Isolated to encoder.cpp (ECAPA-only); ERes2Net/CAM++/WeSpeaker do not call conv1d_same so are provably unaffected. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump pins to FMA-throughput engines (voice f7b9f89, face 2d2d5f0) face -> 2d2d5f0: route ArcFace 3x3 body convs through the AVX-512 winograd microkernel (kWinoMinSize 80->14); ArcFace 1.62x @1t, SCRFD detect to 0.966 of MLAS @1t, no regression. voice -> f7b9f89: runtime-CPUID-dispatched AVX-512 winograd-GEMM microkernel (ship-safe, AVX2 fallback bit-identical); WeSpeaker 1.90x @1t. Parity cosine=1.0 throughout; portable single binaries. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump pins to MLAS-class direct-conv engines (voice 7ecfd07, face be22d67) Hand-tuned nChw16c AVX-512 register-tiled direct-conv microkernel (~263 GFLOP/s, within 6-7% of MLAS per-op efficiency), runtime-CPUID-dispatched + AVX2 fallback, fused bias/relu. voice 7ecfd07: default 3x3-s1 kernel for WeSpeaker (+37%/+32%) + ERes2Net, CAM++ pinned to Winograd. face be22d67: shape-gated to the ArcFace recognizer body (+25-27% @8t); SCRFD detector stays on Winograd (no regression). Parity cosine=1.0 / detect <=1px on AVX-512 + AVX2 paths. Portable single binaries. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump voice pin to Phase-A blocked backbone (f4e7eef) WeSpeaker ResNet34 runs as one nChw16c blocked island (2 reorders/forward vs ~60) on AVX-512, default; per-conv directconv fallback on AVX2. +2.9% @1t / +17-19% @8t vs per-conv directconv, parity cosine=1.0. The conv microkernel is already FMA-bound near peak (~0.86-0.98x MLAS-implied); residual to MLAS is sub-peak edge + non-conv tail, documented in docs/cpu-optimization.md. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump pins to breadth blocked-backbone (voice 7f66871, face d80092b) voice 7f66871: AVX2-vectorized (ymm) blocked island - AVX2-only hosts now run the blocked backbone for WeSpeaker (2.3x over per-conv-AVX2, cosine=1.0); ERes2Net stays per-conv (blocked regresses, opt-in only); CAM++ Winograd-pinned. face d80092b: ArcFace recognizer blocked island, AVX-512 default (-13% @8t, ~0.90x MLAS, the closest conv result), auto per-conv on AVX2; SCRFD untouched on Winograd (0 island invocations during detect). Parity cosine=1.0 / detect <=1px throughout. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump pins to small-spatial + stem conv kernels (voice 99b1804, face 47fdab6) Measured-gap-driven conv kernels: small-spatial (fill the register tile when output width <= tile width) + small-IC stem + strided-1x1/downsample recovery. ArcFace recognizer 0.57 -> 0.70x MLAS @1t (the closest conv model), WeSpeaker 0.65 -> 0.79x @1t. Parity cosine=1.0 / detect <=1px. The OC-block-sharing lever was a measured dead-end (deep stride-1 is L3-weight-bandwidth bound, not read-port bound) and was NOT shipped. Kernel ceiling reached; further gap needs an algorithm-class change (cache-blocked weight-stationary GEMM, or q8 weights). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump pins to GPU persistent-graph + multi-model-safe cache (voice 45d2e6b, face 0a4799a) GPU wins (CUDA/ggml backend, no CPU-path change): persistent per-shape graph+context cache in Backend::compute() eliminates the per-call cudaGraph re-instantiation churn -> wav2vec2 emotion+age-gender now AT GPU parity with torch-cuDNN on GB10 (0.97-0.98x), CAM++ -5.7ms; bit-identical parity. 
Cache hardened multi-model-safe (invalidate-on-free keyed by the ModelLoader weights buffer) so LocalAI multi-model hosting cannot stale-hit. Conv models still trail cuDNN (im2col-materialization-bound) - cuDNN implicit-GEMM lever next. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump pins to cuDNN-conv-capable engines (voice b6e4356, face 6107a24) Adds the opt-in cuDNN implicit-GEMM conv path (VOICEDETECT_GGML_CUDNN / FACEDETECT_GGML_CUDNN, DEFAULT OFF -> zero build/runtime dep until enabled). On GPU it kills the im2col-materialization bottleneck and reaches torch-cuDNN parity on the spill-bound convs: SCRFD detect 14.8->6.4ms (2.3x, ~parity), WeSpeaker ~parity, ERes2Net beats torch (1.10x); ArcFace/CAM++ neutral (no spill). Parity exact (SCRFD <=1px, cosine=1.0). To USE it in LocalAI, the CUDA backend build must enable the flag AND bundle libcudnn - deferred until a cuDNN-bundled GPU image; flag stays OFF here. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(recon): enable cuDNN conv path on arm64+CUDA13 recon backends The voice-detect.cpp / face-detect.cpp engines have an opt-in cuDNN implicit-GEMM conv path behind VOICEDETECT_GGML_CUDNN / FACEDETECT_GGML_CUDNN (default OFF) that kills im2col on the GPU and reaches torch-cuDNN parity (SCRFD 2.3x, WeSpeaker/ERes2Net parity), measured on the GB10 (arm64, CUDA 13, sm_121a). Enable it for the CUDA build, but only where cuDNN actually ships: the arm64 + CUDA 13 image (GB10/Jetson/L4T). x86 CUDA images carry no cuDNN, so flipping it on globally for BUILD_TYPE=cublas would be a link failure. The Makefiles gate on CUDA_MAJOR_VERSION=13 + arch (TARGETARCH from the matrix/Docker build, uname -m fallback for local builds). 
backend/Dockerfile.golang already installs the runtime libcudnn9-cuda-13 in the arm64+CUDA13 apt block; add the matching libcudnn9-dev-cuda-13 so the build-time link resolves. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump voice-detect pin to ERes2Net blocked-default (30beecd) Defaults VD_ERES2NET_BLOCKED ON: routes the ERes2Net Res2Net body through the blocked nChw16c AVX-512 directconv island instead of the 1x1 mul_mat fast path (CONT-transpose + skinny low-K GEMM). On the shipped GGML_NATIVE=OFF build (ggml mul_mat is AVX2-only) this wins ~2x at every thread count (2.07x@1t, 2.2x@4t, 2.05x@8t); pure-AVX2 fallback still 1.3-1.62x. Parity exact (cosine=1.000000 vs golden), so registered voices + verify/identify thresholds are unaffected. The prior default-OFF rested on a stale comment whose 23pct regression only held on the non-shipping GGML_NATIVE=ON build. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * docs(readme): announce native voice-detect + face-detect backends in Latest News Add a Latest News entry for the new from-scratch C++/ggml biometric backends (voice-detect.cpp + face-detect.cpp) that replace the Python insightface and speaker-recognition backends: no Python/onnxruntime at inference, self-contained GGUF, bit-exact parity, GPU cuDNN parity. Mirrors the parakeet.cpp / locate-anything.cpp native-backend news entries. Refs PR #10441. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): re-pin to the squashed engine release commits The voice-detect.cpp and face-detect.cpp histories were squashed to a single release commit, which orphaned the previous pins (voice 30beecd, face 6107a24). Re-pin to the new single-commit SHAs (voice 3d51077, face 06914b0); the tree is identical, so the backend build is unchanged. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
d3a26f961d |
fix(ik-llama): port multimodal path to mtmd API and bump to f96eaddb (#10534) (#10568)
* fix(ik-llama): port multimodal path to mtmd API and bump to f96eaddb (#10534)
The IK_LLAMA_VERSION bump to f96eaddba8bed6a9a5e628bbf6a566775c70b49c pulls in upstream commit "Prune examples/llava", which deletes examples/llava (clip.* / llava.*). The ik-llama backend's grpc-server.cpp built a local `myclip` library from those files and called the removed clip/llava C API, so the bump no longer builds.
ik_llama keeps its multimodal stack in the surviving `mtmd` library (examples/mtmd/, public headers mtmd.h + mtmd-helper.h). This ports the backend's multimodal path onto the high-level mtmd_* / mtmd_helper_* API in place, leaving the text path (which still uses ik_llama's retained old common API) untouched:
- Makefile: bump IK_LLAMA_VERSION to f96eaddb.
- prepare.sh: drop the clip/llava source copy + sed block; mtmd is a library target, no source copy needed.
- CMakeLists.txt: remove the `myclip` target; link `mtmd` and add its include dir; build grpc-server as C++17 (mtmd headers require it).
- patches: drop 0002 (targeted the deleted examples/llava/clip.cpp; the mtmd clip.cpp never calls ggml_quantize_chunk, so the fix is unneeded). Keep 0001 (verified still applies).
- grpc-server.cpp / utils.hpp: replace clip_model_load + clip_image_load_from_bytes + llava_image_embed_make_with_clip_img + the manual [img-N] prefix splitting and per-image llava_embd_batch decode loop with mtmd_init_from_file (moved after the model load, which it requires), mtmd_helper_bitmap_init_from_buf, mtmd_tokenize and mtmd_helper_eval_chunks. Legacy [img-N] tags are translated, in order, into mtmd media markers (mtmd_default_marker()); the post-image suffix text stays on the normal token path so the sampling loop is unchanged.
Supersedes #10534.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(ik-llama): align json alias to ordered_json to resolve mtmd.h conflict (#10534) mtmd.h declares `using json = nlohmann::ordered_json` at global scope (and its mtmd.cpp depends on it), while ik_llama's whole server/common stack also uses ordered_json. Our grpc-server.cpp/utils.hpp kept a plain `nlohmann::json` alias, which now collides with mtmd.h once it is included for the multimodal port: "conflicting declaration 'using json = ...'". Switch our two aliases to ordered_json to match; it is API-compatible (utils.hpp already used ordered_json for its log helper) and our json never crosses into an unordered-json API. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
13b1ae53bc |
chore: ⬆️ Update ggml-org/llama.cpp to 0ed235ea2c17a19fc8238668653946721ed136fd (#10536)
* ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * fix(llama-cpp): link server-stream.cpp TU into grpc-server for upstream 0ed235ea (#10536) Upstream llama.cpp 0ed235ea added an SSE stream-resumption layer in a new translation unit tools/server/server-stream.cpp, which defines stream_session, stream_pipe_producer and the g_stream_sessions manager. server-context.cpp (already #included into grpc-server.cpp) now calls into it via spipe->cleanup(), stream_aware_should_stop() and stream_session_attach_pipe(), so without the new TU the grpc-server link fails on every arch with: undefined reference to `stream_pipe_producer::cleanup()' prepare.sh already copies every tools/server/* file into tools/grpc-server/, so the source is present; the only missing piece was including its definitions. Add an __has_include-guarded #include "server-stream.cpp" before server-context.cpp, mirroring the existing server-chat.cpp and server-schema.cpp guards, keeping the source compatible with older pins/forks that predate the split. The file is self-contained (its only external symbols come from server-common, already in the TU) so it adds no new undefined references; the http route-handler factories it also defines are unused in the grpc path but harmless. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(llama-cpp): build renamed ggml-rpc-server target for upstream 0ed235ea (#10536) Upstream renamed the RPC server CMake target and binary from `rpc-server` to `ggml-rpc-server` (tools/rpc/CMakeLists.txt: `set(TARGET ggml-rpc-server)`), so the RPC-enabled grpc build failed with "No rule to make target 'rpc-server'". The grpc-server itself links fine after the server-stream.cpp fix; this only updates the RPC target name and the binary path copied to llama-cpp-rpc-server. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
e68ca109c5 |
chore: ⬆️ Update CrispStrobe/CrispASR to 6514c9da00b03a2f0f1b49a43fae4f3a01a41844 (#10535)
⬆️ Update CrispStrobe/CrispASR Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
6740e988d2 |
chore: ⬆️ Update ggml-org/whisper.cpp to 0ae02cdb2c7317b50991367c165736ce42ed96ac (#10532)
⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
4da769c1ca |
paged headers: self-include <cstddef>/<cstdint> for size_t/uintN_t (fix amd64/non-arm64 build; compile-only)
Vendored paged headers used size_t / uintN_t without including <cstddef> /
<cstdint>. The arm64 DGX toolchain provides them transitively so the build
passed there, but amd64/older toolchains do not, failing the CI amd64 build one
header at a time ('size_t' does not name a type -> cascade).
paged-kv-manager.h was already fixed. This adds the missing includes to the
remaining vendored headers at the point each is created/rewritten in the patch
series so every src/paged*.h self-includes both:
* paged-attn.h (0003): add <cstddef> (had <cstdint>)
* paged-alloc.h (0007): add <cstddef> (had <cstdint>)
* paged-prefix-api.h (0007): add <cstddef> + <cstdint> (had only llama.h)
The .cpp units include their own paged header, so they inherit the includes
transitively. Whole series still applies clean on the pinned llama.cpp.
Compile-only change: no runtime behavior change, bit-exactness unaffected.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
23b11a5239 |
paged-kv-manager.h: add missing <cstddef> for size_t
Fixes cuda-13 amd64 / non-arm64 build where size_t was used without the header (arm64 cuda-13 pulled it in transitively; amd64/cuda-12 toolchains do not). Compile-only change, bit-exactness unaffected. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
9bb8994c4e |
chore(paged): drop CUDA-12 variants of llama-cpp-localai-paged, keep CUDA-13 only
The paged backend targets Blackwell sm_121a, which CUDA 12.0 cannot target
at all, so the CUDA-12 variants were pointless. They were also broken: the
cublas-12 / nvidia-l4t / arm64 build failed to compile paged-kv-manager.cpp
("no declaration matches ...", a ~10-function mismatch the older
cuda-12-base gcc rejects). CUDA-13 compiles it fine (confirmed on GB10).
Removed (config-only, scoped to the paged backend):
- backend-matrix.yml: the two CUDA-12 paged rows
(-gpu-nvidia-cuda-12-llama-cpp-localai-paged,
-nvidia-l4t-arm64-llama-cpp-localai-paged)
- backend/index.yaml: CUDA-12 capability keys (nvidia-cuda-12,
nvidia-l4t-cuda-12, nvidia-l4t) on both meta-backends, repointed
default/nvidia to the cuda13 amd64 variant, and dropped the orphaned
cuda12-* / nvidia-l4t-arm64-* variant definitions (latest + -development).
Kept CUDA-13 only: cuda13-llama-cpp-localai-paged (amd64) and
cuda13-nvidia-l4t-arm64-llama-cpp-localai-paged (l4t arm64). Matrix
tag-suffixes <-> index variant URIs form a clean 2:2 bijection.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
ade9cc9e37 |
fix(openresponses): bound resume-stream buffer and enforce response ownership (#10569)
The background=true resumable-stream path had two latent issues.
1. Unbounded resume buffer. AppendEvent grew StreamEvents without limit, so a long-running or abandoned background generation could consume process memory without bound. The store now caps the buffer (event count and total bytes, mirroring llama.cpp's byte-capped slot ring), evicting oldest events from the front and advancing a droppedThrough watermark. GetEventsAfter returns ErrOffsetLost when the requested starting_after is below the watermark, and handleStreamResume surfaces that as HTTP 409 before committing to the SSE response, so a resuming client gets a clear error instead of a silently truncated stream.
2. Missing ownership check (IDOR). GET /responses/:id, its stream resume, and /cancel looked up responses purely by ID, letting any caller who knows or guesses an ID read or cancel another caller's response. Responses now carry the creating caller's identity (auth.GetUser), stamped at creation and compared on read/cancel/resume; a mismatch returns 404 (not 403) so existence is not leaked. Backward compatible: responses with no owner (single-key / no-auth deployments) remain accessible.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
471e38e4e7 |
chore: ⬆️ Update leejet/stable-diffusion.cpp to 9956436c925a367daeab097598b1ea1f32d3503f (#10533)
⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
f3d829e2ef |
feat(distributed): add LOCALAI_DISTRIBUTED_SHARED_MODELS to skip staging on shared volumes (#10556) (#10566)
In distributed mode, even when the frontend and workers share the same models directory via a shared volume mount, starting a model on a worker re-staged (re-downloaded) it: stageModelFiles always uploads model files into a tracking-key-namespaced subdir on the worker, and the staging probe only checks that staged location, so a file already present on the shared volume at the canonical path was never reused. Add a config switch LOCALAI_DISTRIBUTED_SHARED_MODELS (default false). When enabled, the operator asserts that all nodes mount the SAME models directory at the SAME path, so staging is unnecessary: the frontend's absolute model paths are already valid on the worker. In that mode stageModelFiles returns the cloned opts unchanged without uploading, leaving the path fields pointing at their canonical absolute paths so the worker loads them directly from the shared volume. The value is plumbed from DistributedConfig through SmartRouterOptions into the SmartRouter. Docs and docker-compose.distributed.yaml updated. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
91885c2c7e |
fix(distributed): return empty backend list for agent nodes instead of failing backend.list (#10545) (#10565)
Opening an AGENT-type worker node's detail page errored with "failed to list backends on node" / NATS "nodes.<id>.backend.list: no responders available". Agent workers only subscribe to agent.*, jobs.*, mcp.* and <prefix>.backend.stop; they never subscribe to backend.list, so the per-node ListBackendsOnNodeEndpoint request had no responder and timed out. The aggregate cluster-wide list already guards this in managers_distributed.go (skip nodes whose NodeType is set and not "backend"). The single-node endpoint lacked the same guard. Thread the NodeRegistry into ListBackendsOnNodeEndpoint and short-circuit to an empty (non-nil) list for non-backend node types before issuing the doomed NATS request, mirroring the aggregate-list gate so both views stay consistent. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
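The node-type gate described in this commit can be sketched as follows. This is a hedged illustration, not the endpoint code: the `node` struct, the `listBackendsOnNode` helper, and the injected `request` callback (standing in for the NATS round trip) are hypothetical; only the rule "skip nodes whose NodeType is set and not backend" and the empty non-nil result come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a hypothetical, trimmed-down registry entry.
type node struct {
	ID       string
	NodeType string // empty means a legacy node, treated as a backend node
}

// listBackendsOnNode short-circuits for non-backend node types and only
// issues the request (a stand-in for the NATS call) for backend nodes.
func listBackendsOnNode(n node, request func(nodeID string) ([]string, error)) ([]string, error) {
	if n.NodeType != "" && n.NodeType != "backend" {
		return []string{}, nil // empty but non-nil, so it serializes as [] rather than null
	}
	return request(n.ID)
}

func main() {
	// Simulates the failure mode from the commit message for a node with no subscriber.
	noResponders := func(string) ([]string, error) {
		return nil, errors.New("no responders available")
	}
	list, err := listBackendsOnNode(node{ID: "n1", NodeType: "agent"}, noResponders)
	fmt.Println(len(list), list != nil, err)
}
```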
||
|
|
f1fcafb888 |
fix(gallery): match mmproj/model quant as a whole token so F16 no longer selects BF16 (#10559) (#10564)
pickPreferredGroup matched a quant preference against the shard base filename with strings.Contains. Because `f16` is a substring of `bf16`, asking for the `F16` mmproj quant would wrongly satisfy a `BF16` file and select it when its group came first. Match the preference as a whole token instead: it must be delimited by a non-alphanumeric character (or the string start/end) on both outer edges. Separators inside the preference itself (e.g. `ud-q4_k_xl`) are left untouched, and all occurrences are scanned before rejecting. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
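The whole-token rule described in this commit can be sketched in Go as below. This is an illustration under stated assumptions, not the code in pickPreferredGroup: the helper names `containsToken` and `isAlnum` are hypothetical, and the case-insensitive comparison is assumed from the F16/bf16 example in the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// isAlnum reports whether b is an ASCII letter or digit.
func isAlnum(b byte) bool {
	return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') || (b >= '0' && b <= '9')
}

// containsToken reports whether pref occurs in name delimited on both outer
// edges by a non-alphanumeric byte or by the start/end of the string.
// Every occurrence is checked before the match is rejected.
func containsToken(name, pref string) bool {
	name, pref = strings.ToLower(name), strings.ToLower(pref)
	if pref == "" {
		return false
	}
	for start := 0; start <= len(name)-len(pref); {
		i := strings.Index(name[start:], pref)
		if i < 0 {
			return false
		}
		begin := start + i
		end := begin + len(pref)
		leftOK := begin == 0 || !isAlnum(name[begin-1])
		rightOK := end == len(name) || !isAlnum(name[end])
		if leftOK && rightOK {
			return true
		}
		start = begin + 1 // keep scanning past the rejected occurrence
	}
	return false
}

func main() {
	fmt.Println(containsToken("mmproj-bf16.gguf", "F16"))
	fmt.Println(containsToken("mmproj-f16.gguf", "F16"))
	fmt.Println(containsToken("model-ud-q4_k_xl.gguf", "UD-Q4_K_XL"))
}
```

Only the outer edges of the preference are checked, so separators inside a preference such as `ud-q4_k_xl` do not affect the result.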
||
|
|
fdff114701 |
ci(vibevoice): skip the ASR transcription e2e on release tag builds (#10567)
The `tests-vibevoice-cpp-grpc-transcription` job downloads the vibevoice ASR model (`vibevoice-asr-q4_k.gguf`, ~10 GB) and decodes it through the e2e-backends harness. On release tag pushes the detect step forces the full matrix (run-all=true), so this job runs and consistently times out: the inner `go test -timeout 30m` cannot pull a 10 GB file from HuggingFace's throttled Xet CDN within budget (curl --max-time 600 x5 retries overruns the deadline), leaving an orphaned curl and a 30m panic. It has been red on every release (v4.5.3/4/5). Guard the job's `if` with `!startsWith(github.ref, 'refs/tags/')` so it no longer runs on tag/release builds. It still runs on PRs and branch pushes that touch vibevoice-cpp, so real regressions are caught off the release path. A proper fix (a small ASR test GGUF) can re-enable it on tags later. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
0b84fda496 |
docs(paged): add the bf16-tau opt-in line to the decode plots
Per request, the plots now show all four series: llama.cpp (standard), vLLM, LocalAI's llama.cpp patches (bit-exact hero), and LocalAI's patches + bf16-tau (opt-in ceiling, +3% to +17% over the patches, ahead of vLLM at every dense width and MoE npl>=32). Subtitle flags bf16-tau as opt-in / not bit-exact. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
1431f72b92 |
docs(paged): regenerate decode plots (3-way) from re-measured data + overview
Rebuild the two committed decode plots from the re-measured CSV and add a combined overview. Three series per the comparison that matters: llama.cpp (standard) vs vLLM vs LocalAI's llama.cpp patches; x-over-standard called out at npl128. bf16-tau stays out of the plot (it remains in the CSV + the README table as the opt-in row). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
266fcc79ad |
docs(agents): fix A/B-bench gotcha - env-toggle != stock for compiled-in wins
The DGX re-run showed toggling LLAMA_KV_PAGED on/off on the patched binary does NOT reproduce stock: the dominant SSM decode fusions are compiled in, not runtime-gated, so the toggle measures only the (here ~neutral) paged-KV part. True stock needs a separately-built unpatched binary at the same pin. Correct the methodology skill's per-lever discipline + apples-to-apples rule accordingly. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
3466094c68 |
docs(paged): re-measure DGX benchmarks on one harness (stock/patched/bf16-tau)
Re-run the GB10/DGX-Spark llama-batched-bench matrix (dense q36-27b + MoE q36-35b-a3b, npl 8/32/64/128, -fa on -ngl 99 -npp 128 -ntg 128) so the CSV and README section 4 carry a single consistent set of llama numbers with all three configs:
- stock: separately-built unpatched llama.cpp at this backend's exact pin 9d5d882d (toggling LLAMA_KV_PAGED on the patched binary does NOT reproduce stock - the SSM decode fusions are compiled in, not env-gated).
- patched: paged binary, LLAMA_KV_PAGED=1 (+LLAMA_MOE_FORCE_GRAPHS=1 for MoE).
- patched+bf16-tau: patched plus --ssm-bf16-tau 64 (opt-in, NOT bit-exact, ~91% same-top-p).
final_benchmark.csv now has stock + patched + bf16-tau + vllm rows for both models at all four widths (the prior CSV had no stock and no bf16-tau rows). peak_gb is dropped: the GB10's unified LPDDR5x reports [N/A] to nvidia-smi and the bench does not print it, so per-run peak could not be captured this session.
Patch series gives up to 2.46x (dense) / 2.26x (MoE) over true-stock; opt-in bf16-tau adds a further +3% to +17% on top of patched (growing with width). vLLM column is kept from the prior session (not re-run) and labeled as such.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
1154be5eea |
fix(config): fall back to DefaultContextSize for unparseable GGUFs; pin NVFP4 gallery context_size (#10563)
The GGUF metadata parser (gpustack/gguf-parser-go) cannot read NVFP4-quantized GGUFs at all: it errors with "read tensor info 0: This quantized type is currently unsupported" because NVFP4 is a ggml tensor type it does not know. When ParseGGUFFile errors, the llama-cpp defaults hook skips guessGGUFFromFile entirely and the deferred fallback sets the context window to the conservative GGUFFallbackContextSize (1024). The result: a model that trains to 262144 tokens runs with n_ctx=1024, and every prompt over ~1k tokens fails with "request (N tokens) exceeds the available context size (1024 tokens)".
Two changes:
- Drop GGUFFallbackContextSize (1024) and fall back to DefaultContextSize (4096) in both the GGUF run-estimate path (gguf.go) and the deferred hook fallback (hooks_llamacpp.go). 1024 is a sensible floor for a tiny CPU GGUF but a footgun for a large, long-context model whose header simply cannot be parsed. Strengthen the existing "GGUF unreadable" test to assert the value.
- Set context_size explicitly on the four NVFP4 gallery entries (qwen3.6-35b-a3b-nvfp4-mtp, qwopus3.6-27b-v2-mtp-nvfp4, qwopus3.6-27b-coder-mtp-nvfp4, qwen3.6-27b-nvfp4-mtp) so the parser failure is irrelevant for them. 32768 matches sibling Qwen entries and is safe on memory; operators can raise it toward the 262144 train length.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
ed5eb705c7 |
docs(paged): drop moot PIN_SYNC_c299a92c record, repoint to README sec 7
The paged backend's llama.cpp pin was reverted from c299a92c back to 9d5d882d (== stock), so docs/PIN_SYNC_c299a92c.md (a blow-by-blow of the reverted sync) is dead weight. The pin-sync PROCESS stays documented in the three live places: the Makefile comment, README section 7 (Pin + maintenance policy), and .agents/llama-cpp-localai-paged-backend.md. Delete the doc and repoint every reference to it (Makefile, README, .agents, canary script + workflow) at README section 7. No functional paths change: the canary's patches-dir glob (patches/paged/0*.patch) is untouched. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
8aba4fdba3 |
chore(fish-speech): drop the darwin/metal build target (#10561)
The fish-speech metal-darwin-arm64 backend build has been failing on every
release (v4.5.3, v4.5.4, v4.5.5) and is a standing red on the darwin backend
matrix. fish-speech pulls `tokenizers` transitively from its upstream source
(`pip install -e fish-speech-src`), and on darwin/arm64 there is no prebuilt
wheel for the pinned old `tokenizers` version, so pip builds it from source.
Modern rustc rejects that old crate as a hard error:
error: casting `&T` to `&mut T` is undefined behavior ...
--> tokenizers-lib/src/models/bpe/trainer.rs:517:47
= note: `#[deny(invalid_reference_casting)]` on by default
error: could not compile `tokenizers` (lib) due to 1 previous error
This is deterministic, not a flake, and there is no clean fix that does not
either pin a stale Rust toolchain or downgrade a soundness lint guarding real
UB. Until upstream fish-speech moves to a tokenizers version that compiles on
current toolchains, drop darwin support so the release backend build stays
green. The Linux/CUDA/ROCm/Intel/L4T variants are unaffected.
Removes:
- the `-metal-darwin-arm64-fish-speech` entry from `includeDarwin` in
backend-matrix.yml
- the `metal:` capability mappings and the concrete `metal-fish-speech` /
`metal-fish-speech-development` gallery entries in backend/index.yaml
- the now-unused darwin-only requirements-mps.txt
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
d7d7721eae |
feat(distributed): SyncedMap component + migrate finetune/quant/agent-tasks to cross-replica state (#10542)
* feat(distributed): add SyncedMap cross-replica in-memory state component Introduce core/services/syncstate.SyncedMap[K,V]: a thread-safe in-memory map that keeps itself consistent across frontend replicas via NATS, with an optional pluggable durable Store and hydrate-from-source convergence. Several features keep process-local state surfaced to the API (finetune/quant jobs, agent tasks, model configs) and each hand-wired the same in-memory + NATS broadcast + read-through-store legs - or forgot to, reintroducing cross-replica staleness. SyncedMap makes that consistency a configuration choice: - local writes mutate the map, write through the Store, then broadcast a delta; - the apply path is memory-only and never re-publishes or re-writes the Store (structural echo-loop guard, mirroring galleryop.mergeStatus); - on Start and on NATS reconnect the map re-hydrates from the source (Store, else Loader); an optional periodic Reconcile repairs silent drift; - standalone mode (nil NATS client) is a strict in-memory no-op. Reconnect re-hydrate is wired via a new *messaging.Client.OnReconnect callback, consumed through an optional type-assertion so MessagingClient stays minimal. Adds messaging.SubjectSyncStateDelta and a reusable testutil.FakeBus (synchronous in-process MessagingClient with wildcard matching) for adopter tests. Component only; service migrations follow in subsequent commits. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * refactor(finetune): back jobs with SyncedMap for cross-replica consistency FineTuneService kept jobs in a process-local map and, although it wrote them to Postgres, ListJobs/GetJob never read the store back and the wired natsClient was never used - so in distributed mode a job created on one replica was invisible to the others. Replace the map and the dead client with a syncstate.SyncedMap keyed by job ID, value *schema.FineTuneJob (the exact REST shape, so responses are unchanged). 
- Add a Store adapter (core/services/finetune/syncstore.go) over FineTuneStore, plus FineTuneStore.ListAll (global hydrate; per-user List kept) and an idempotent Upsert (create-or-update; Create alone fails on dup key).
- Writes go through SyncedMap.Set/Delete (write-through + broadcast); reads use List/Get. The on-disk state.json path becomes the standalone Loader, keeping single-node restart recovery (stale->stopped / exporting->failed fixups).
- Fold SetNATSClient/SetFineTuneStore into NewFineTuneService; app.go passes the distributed NATS client + store when distributed, nil otherwise.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* refactor(agentpool): back agent tasks with SyncedMap for cross-replica consistency
AgentJobService.ListTasks read the process-local tasks map only, while ListJobs already read through the DB persister + dispatcher NATS - so in distributed mode a task created on one replica was invisible to the others. Back tasks with a syncstate.SyncedMap keyed by task ID (value schema.Task, the exact REST shape); jobs are left untouched.
- Store adapter (task_syncstore.go) over the existing JobPersister (LoadTasks/SaveTask/DeleteTask); reads svc.persister/userID live so a persister swap needs no rebuild. No new persister methods required.
- Task reads -> SyncedMap.List/Get; create/update -> Set (write-through + broadcast); delete -> Delete. The file persister now owns its own task set so the write-through path does not re-enter the SyncedMap lock (deadlock guard).
- The distributed NATS client is not available at construction (start() precedes initDistributed), so it is injected via SetTaskSyncNATS, which rebuilds the still-empty map before Start/hydrate. Wired at the main, restart, and per-user (UserServicesManager) distributed sites.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* refactor(quantization): back jobs with SyncedMap + durable QuantStore
QuantizationService kept jobs in a process-local map persisted only to a local state.json, so in distributed mode jobs were neither visible across replicas nor durable cluster-wide. Back jobs with a syncstate.SyncedMap keyed by job ID (value *schema.QuantizationJob, the exact REST shape).
- New distributed.QuantStore (GORM, table quantization_jobs) mirroring FineTuneStore: Create/Get/ListAll/Upsert(idempotent)/Delete, registered for AutoMigrate via distributed.InitStores (Stores.Quant).
- New adapter (quantization/syncstore.go) over QuantStore implementing syncstate.Store, with record<->schema conversion.
- Reads go through List/Get, writes through Set/Delete (write-through + broadcast); state.json is kept as the standalone Loader for single-node restart recovery (stale-job fixups preserved).
- app.go passes the distributed NATS client + QuantStore when distributed, nil otherwise; Start/Close lifecycle mirrors finetune.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(syncstate): annotate gosec G118 false positive on lifeCtx
gosec flagged the WithCancel in Start as "cancellation function not called" because the returned cancel is stored on the struct rather than called/deferred in scope. It is invoked in Close (covered by tests), and lifeCtx must outlive Start to drive the reconnect/reconcile goroutines. Suppress the verified false positive with a justified #nosec G118.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* test(distributed): e2e two-replica SyncedMap sync over real NATS + Postgres
Adds the real-infrastructure counterpart to the fake-bus unit tests, in the existing distributed e2e suite (testcontainers NATS + PostgreSQL).
Two SyncedMap instances stand in for two frontend replicas - each with its OWN NATS connection to a shared server and a SHARED Postgres store (the distributed-mode invariant) - and assert, over the wire:
- a create on replica A is observed by replica B;
- an update and a delete propagate A -> B (delete prunes, which a reload cannot);
- a late-joining replica recovers a job it never received a delta for, via store hydrate on Start (the at-most-once gap a fake bus cannot exercise);
- a local Set is written through to the shared Postgres store.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
c548150f99 |
fix(distributed): missing agent NATS permission (#10549)
Signed-off-by: Nicholas Ciechanowski <nicholas@ciech.anow.ski> |
||
|
|
ec26b86dd4 |
docs: ⬆️ update docs version mudler/LocalAI (#10560)
⬆️ Update docs version mudler/LocalAI
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
|
||
|
|
53f66a6f03 |
fix(paged): revert pin to 9d5d882d (== stock); c299a92c broke grpc-server link
The c299a92c bump diverged 23 commits ahead of the stock llama-cpp pin. grpc-server.cpp is SHARED with the stock backend and tracks the stock pin; c299a92c's upstream server-API refactor pulled stream_* helpers into the headers grpc-server.cpp includes, whose definitions the stock-aligned build does not compile -> every paged variant failed to LINK (undefined reference to stream_aware_should_stop / stream_pipe_producer::cleanup / stream_session_attach_pipe). The bump was greedy-md5 bit-exact, but the bit-exact gate never exercises the full grpc-server build, so it slipped through.
Revert LLAMA_VERSION to 9d5d882d (== stock pin, where the patches are bit-exact AND grpc-server links - the original DGX-proven baseline). Document the hard constraint in the Makefile, README, PIN_SYNC record, and the .agents guide: the paged pin must track the stock pin, and a pin-sync must pass the full CI grpc-server build, not only the bit-exact gate.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
08b754f910 |
chore(paged): keep patches/ patch-only; README to backend root, docs to docs/
The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv,
dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv.
Restore the invariant that patches/ holds only the .patch series.
Moves:
- patches/paged/README.md -> README.md (canonical doc at the backend root)
- patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md,
final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/
- patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README)
Deletes:
- patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section)
- patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide)
Repoint every reference to the moved files: README internal links (docs/ + the
.github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md,
.github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml,
the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml,
docs/content/features/backends.md, gallery/index.yaml.
The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged)
is unchanged and still resolves to the 28 patches.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
db14006fcd |
docs(agents): add paged-backend maintenance + vLLM-parity methodology skills
Two .agents guides (indexed in AGENTS.md):
- llama-cpp-localai-paged-backend.md: what the CUDA-only paged backend is, the patchset scope, the bit-exact gate, the manual pin-sync + weekly canary, the CUDA-only / stock-stays-pure invariants, and the Metal/SYCL/Vulkan follow-up scope.
- vllm-parity-methodology.md: the decode-parity playbook (bit-exact gating, profile-don't-assume, both-engine ground-truth, per-lever A/B, recording rejected levers, multi-agent GPU orchestration).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
a4e730979d |
feat(paged): restrict llama-cpp-localai-paged to CUDA-only build targets
The paged backend previously built for cublas/cuda, cpu, vulkan, sycl, hipblas and darwin/metal. On non-CUDA the patchset's wins are inert: the GDN fusions are gated off (patch 0030) and NVFP4 falls back to dequant, so the backend is neutral-to-negative there (README section 4c). The darwin grpc-server link also fails on undefined upstream server symbols, turning CI red. Both broken and pointless off-CUDA, so ship CUDA-only.
- backend-matrix.yml: drop the hipblas, sycl f32/f16, cpu amd64/arm64, vulkan amd64/arm64 and metal-darwin rows for this backend; keep the four cublas rows (cuda-12, cuda-13, nvidia-l4t cuda-12 and cuda-13).
- index.yaml: meta-backend (and -development) capabilities are now CUDA-only with default pointing at cuda12 (mirrors faster-qwen3-tts); removed the orphaned cpu/rocm/sycl/vulkan/metal variant entries.
- Removed the now-unused darwin build script and its Makefile target / .NOTPARALLEL entry / backend_build_darwin.yml step.
- Documented the CUDA-only build coverage in the patch README and plan.
Non-CUDA users should use the stock llama-cpp backend.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
9115c2c52c |
docs(paged): correct Vulkan/SYCL note (GDN op IS upstream) + CUDA-only rationale
The gated-DeltaNet + SSM_CONV ops have upstream Metal/Vulkan/SYCL kernels, so the Qwen3.6 hybrids run there (non-fused) - the earlier 'no Vulkan kernel' note was wrong. The patchset's fusions are gated off off-CUDA, so the backend ships CUDA-only; non-CUDA users use stock llama-cpp.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
984c8fcbea |
docs(paged): Layer-2 upstream scope for native fused-GDN kernels (Metal/Vulkan/SYCL)
Source-only analysis of what it would take to give the gated-DeltaNet decode fusions (0018 in-place state write-back, 0019 fused recurrent-state gather, 0021 ssm_conv_update_inplace, 0028 conv-tap gather fusion) native kernels on the non-CUDA compute backends, so the patch-series decode win extends past CUDA-family hardware.
Key findings:
- The base GGML_OP_GATED_DELTA_NET and GGML_OP_SSM_CONV kernels ALREADY exist upstream on Metal, Vulkan AND SYCL (the README's no-Vulkan-kernel line is stale). The Qwen3.6 hybrids run on all three today via the non-fused path; Layer-2 is the decode SPEEDUP, not enabling the model to run.
- Per backend the new work is only the FUSION plumbing: redirect the GDN state write (in-place), add the ids read, write one new conv-update kernel + its ids variant, two tiny gather kernels, plus supports_op + op-handler + (Vulkan) pipeline/push-constant/descriptor wiring. Builders, CPU refs, model graph and test-backend-ops cases are shared and already done.
- Bit-exactness is feasible per backend by construction (the fusions redirect addresses, not the f32 reduction order); test-backend-ops (backendX-vs-CPU) is the gate.
- The 0030 name allow-list should become capability-driven (make supports_op authoritative for the discriminated src slots).
- Ranked: ops-first PR, then Metal (highest value/effort, fixed simdgroup = simplest bit-exactness), then SYCL (near-verbatim CUDA mirror, cheapest to author), then Vulkan (widest hardware reach but the shader-gen + variant matrix + subgroup variance make it the capstone).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
d11b202dd2 |
fix(backends): whisper darwin run.sh loads whichever fallback lib exists (.so/.dylib) (#10553)
fix(backends): whisper darwin run.sh loads whichever fallback lib exists
The macOS branch hardcoded WHISPER_LIBRARY=$CURDIR/libgowhisper-fallback.dylib,
but the cmake build emits a Mach-O named libgowhisper-fallback.so on darwin, so
the Go loader panicked at runtime ("dlopen ...dylib: no such file") and the
backend exited ("grpc service not ready") — breaking e.g. the silero-vad-ggml
VAD on darwin. Pick whichever of .dylib/.so is present so it is robust to the
build's naming either way.
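A minimal sketch of the selection described above, assuming a hypothetical helper named `pick_fallback_lib` (the real run.sh may be structured differently, and the .dylib-before-.so order is an assumption made for illustration):

```shell
#!/bin/sh
# Hypothetical helper, not the actual run.sh code: print the first fallback
# library that exists in directory $1, trying .dylib first and then .so.
pick_fallback_lib() {
    dir="$1"
    for ext in dylib so; do
        candidate="$dir/libgowhisper-fallback.$ext"
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "no libgowhisper-fallback.{dylib,so} found in $dir" >&2
    return 1
}

# Possible use in a launcher script (CURDIR as in the message above):
#   WHISPER_LIBRARY="$(pick_fallback_lib "$CURDIR")" || exit 1
#   export WHISPER_LIBRARY
```

Failing loudly when neither file exists keeps the error at the launcher instead of surfacing later as a dlopen panic in the Go loader.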
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
v4.5.5
|
||
|
|
4a9a1dd247 |
docs(paged): Mac stock-vs-patched bench + Vulkan note + cross-backend learnings
Section 4(c): real Apple M4/Metal numbers (Qwen3-8B Q4_K_M, stock vs patched) - patchset is neutral-to-slightly-negative on Metal (the in-kernel block-table read is CUDA-only; NVFP4/GDN-fusions inert), so prefer stock llama-cpp on Apple Silicon. Vulkan: same picture, worse (no upstream GDN op).
Section 6: cross-backend learnings + upstream candidates (the GDN decode-plumbing fusions are the portable, bit-exact, CPU-mirrored win worth upstreaming).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
78fac9a28f |
refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series
Move ALL paged-attention content out of the stock backend/cpp/llama-cpp backend and into backend/cpp/llama-cpp-localai-paged, so the stock backend is pure upstream llama.cpp and the paged backend owns and applies its own vendored patch series.
- Delete the dead early-exploration scaffold backend/cpp/llama-cpp/paged/ (kernel/w4a16 Marlin scaffold, standalone paged_kv_manager, bench/loadgen, its own 0001-0002 patches, dense-era design docs, tests). Zero references repo-wide.
- Move backend/cpp/llama-cpp/patches/ (the 28-patch paged series + paged/README + 3 operational docs, plus the kernel/ scaffold patch and the top-level paged README/BENCHMARKS) to backend/cpp/llama-cpp-localai-paged/patches/. The stock backend keeps no patches/ dir; it had no non-paged base patches.
- Purify the stock backend: remove the LLAMA_PAGED make variable, the patches/paged apply loop, and the LLAMA_PAGED passthrough to prepare.sh; remove the paged-series handling from prepare.sh. The stock llama.cpp target now only clones the pin and applies its own (currently empty) base patches/ series. The runtime paged option hooks in the shared grpc-server.cpp are untouched (inert without the patches).
- The paged backend's Makefile now applies its OWN patches/paged/0*.patch onto each freshly cloned tree via strict git apply (apply-paged-patches), after the copied stock infra clones the pin and applies base patches.
- Repoint every reference to the old patches/paged path: the upstream canary workflow + apply script, bump_deps.yaml, gallery/index.yaml, the docs, backend/index.yaml, backend-matrix.yml, the top-level Makefile comments, and the moved PIN_SYNC / README docs. Drop the now-removed LLAMA_PAGED=on build-toggle from comments.
Verified: the full 28-patch series applies strict-clean (git apply, exit 0) to a clean ggml-org/llama.cpp checkout at the pinned c299a92c, and the repointed canary apply script resolves and applies the series end to end.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
fb2dc33d52 |
docs(paged): consolidate the dev-trail docs into one canonical README
The paged-attention patch directory had accumulated ~55 scattered dev docs (results, progress, scope, lever, and gap-analysis notes). Consolidate the durable content of all of them into one canonical backend/cpp/llama-cpp/patches/paged/README.md covering: what the patchset is, the architecture (paged KV + block-table flash-attn, the gated-DeltaNet SSM decode path, NVFP4 FP4-MMA, the decode-first scheduler), the full 0001-0030 patch series table with bit-exact status, the GB10 benchmarks (patched-vs-stock-vs-vLLM + the Apple M4 architectural note), the dev notes (bit-exact methodology, the per-path gate, the MoE-parity conclusion, the rejected/flat levers, the opt-in bf16-SSM mode), arch+quant generality, the pin + canary maintenance policy, and the published NVFP4 gallery models.
Delete the consolidated-away dev trail. Keep the three operational docs the README links to: PIN_SYNC_c299a92c.md (canary reference), PAGED_BITEXACT_NOTE.md (per-path gate reference) and LOCALAI_LLAMACPP_BACKEND_PLAN.md (the ship-as-own-backend design-of-record), plus the benchmark plots + csv. The .patch files and the unit/bench .cpp are untouched.
Repoint every external reference to a deleted doc at the new README: grpc-server.cpp, docs/content/features/backends.md, gallery/index.yaml, the canary apply script (PIN_BUMP_APPLY_CHECK.md -> README), and the base patches/README.md (ADDITIVE_DESIGN.md -> README). The canary's PIN_SYNC reference still resolves; its inert SSM_DECODE_FIX_RESULTS.md glob (a patch-internal path matcher, not a repo-doc link) is left intact.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
a5a5b2ad80 |
feat(paged): bump llama.cpp pin 9d5d882d -> c299a92c (bit-exact verified)
Advance the paged-attention backend's owned llama.cpp pin by 23 upstream commits. The shipped source-only patch series (0001-0030, 28 patches) applies strict-clean (git apply, exit 0) on a fresh c299a92c checkout with no re-export needed, and the bit-exact gate is GREEN on every path on GB10 (CUDA sm_121):
- md5 greedy decode (-ngl 99 -fa on -n 48 --temp 0 --seed 1): dense non-paged/paged 5951a5b4, MoE non-paged 07db32c2, MoE paged 8cb0ce23; all match the established baselines.
- test-backend-ops CUDA0: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16, SSM_CONV_UPDATE_IDS 16/16, GATED_DELTA_NET 84/84, MUL_MAT 1146/1146, MUL_MAT_ID 806/806; all OK.
The 23-commit upstream jump did not change our decode output. The .patch files are kept byte-identical (they already apply strict-clean at the new pin); only the pin, the PIN_SYNC evidence doc, and the canary/gallery doc references change.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
7e1832b868 |
fix(paged): strip stray dev-doc hunks so patch series applies on a clean checkout
The shipped from-patches build applies the paged series with strict `git apply`
(backend/cpp/llama-cpp/Makefile `llama.cpp` target:
`git apply --verbose "$p" || { ...; exit 1; }`), which is atomic: a hunk against
a file missing from the tree rejects the whole patch and fails the build. Four
patches carried hunks against dev-only docs that live in the DGX dev tree but are
absent from a clean ggml-org/llama.cpp checkout, so the build only succeeded on
the DGX and FAILED on CI / any clean checkout:
0019 -> SSM_DECODE_FIX_RESULTS.md (modify hunk = the root reject)
0020 -> LEVER1_OPROJ_MMQ_RESULTS.md (create)
0021 -> CONV_STATE_FUSION_RESULTS.md (create)
0028 -> LEVER1_GATHER_PROGRESS.md, LEVER1_GATHER_RESULTS.md (create)
0019's reject cascaded to 0021/0022/0026/0028 (which build on 0019's code). Strip
each `diff --git a/<devdoc>` section along with its diffstat line and `create mode`
trailer, and correct the summary count. Every llama.cpp SOURCE hunk is left
byte-identical (verified by sha256 of each patch's source-diff tail).
Verified on a fresh clone of ggml-org/llama.cpp at the pin 9d5d882d: BEFORE,
strict `git apply` failed at 0019 (cascade 0019/0021/0022/0026/0028); AFTER, the
full series 0001-0030 applies with exit 0 (sentinel created, zero stray docs).
The tolerant `patch -p1` fallback in prepare.sh also applies with zero rejects.
PIN_SYNC_9d5d882d.md documents the durable fix: re-exports/pin-syncs must keep
patches source-only (export with a source pathspec / `:!*.md`, gate with a strict
`git apply` on a clean checkout). The upcoming c299a92c pin-bump re-export must
produce source-only patches too.
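The strict gate described above can be sketched roughly as follows. The function name `apply_series_strict` and its arguments are hypothetical and are not the Makefile's actual target; only `git apply --verbose` and the `0*.patch` glob come from the message itself:

```shell
#!/bin/sh
# Hypothetical helper (illustrative names): apply every 0*.patch found in
# directory $2 to the git checkout at $1, in glob order, stopping at the
# first failure. $2 should be an absolute path, because `git -C` changes
# directory before the patch path is resolved.
# `git apply` is all-or-nothing per patch: a hunk against a file missing
# from the tree rejects the whole patch and leaves its other files untouched.
apply_series_strict() {
    tree="$1"
    patches="$2"
    for p in "$patches"/0*.patch; do
        [ -e "$p" ] || { echo "no patches in $patches" >&2; return 1; }
        git -C "$tree" apply --verbose "$p" || {
            echo "strict apply failed at $p" >&2
            return 1
        }
    done
}
```

Running this against a fresh clone at the pin, rather than a dev tree, is what would expose a stray dev-doc hunk: the doc is absent from the clean checkout, so the patch carrying it is rejected as a whole.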
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
2bee7a5ab1 |
ci(paged): add early-warning canary for vendored llama.cpp paged patches
The paged backend (backend/cpp/llama-cpp-localai-paged) pins its own verified llama.cpp tip and is excluded from the nightly auto-bumper so a naive bump can never silently break the shipped build. That exclusion also removed the early warning of upstream drift. This restores the signal without touching the pin.
Add .github/workflows/llama-cpp-paged-canary.yml (weekly + workflow_dispatch):
- apply-check job (ubuntu-latest, toolchain-free): resolve the latest ggml-org/llama.cpp master tip, shallow-checkout it, and apply the full paged series 0001-0030 in order with the build's own git-apply method via the new shared helper .github/scripts/paged-canary-apply.sh. Red on any apply break.
- compile job (needs apply-check): on the exact tip it validated, build the paged backend (cublas) inside the same base-grpc-cuda-12 toolchain and the same `make grpc-server` target the shipped build uses, so a red means upstream drift, not toolchain noise. nvcc compiles the kernels with no GPU present. Red here = run a PIN_SYNC (rebase + bit-exact gate + re-export), then bump the paged Makefile pin.
The canary is signal-only: it opens no PR and never moves the pin, so the shipped build and the dep-bump PRs stay green regardless. It is fully separate from bump_deps.
The lone pre-existing quirk in the series (patch 0019 carries a stray modify hunk against the dev-only doc SSM_DECODE_FIX_RESULTS.md, absent from any clean upstream checkout; git apply is atomic so it rejects the whole patch and cascades to 0021/0022/0026/0028) is handled path-scoped: the helper excludes only that dev-doc and still applies 0019's real code hunks atomically, mirroring prepare.sh's tolerance, so the quirk never false-positives the canary but a genuine code break in 0019 still turns it red.
Point the existing pin comments in backend/cpp/llama-cpp-localai-paged/Makefile and .github/workflows/bump_deps.yaml at this canary as the drift signal, and document it in the PIN_SYNC doc: canary red -> do a pin-sync.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
e160041f05 |
chore(paged): decouple paged llama.cpp pin from the nightly auto-bumper
The llama-cpp-localai-paged backend reused backend/cpp/llama-cpp's LLAMA_VERSION, which .github/workflows/bump_deps.yaml auto-bumps nightly to the latest ggml-org/llama.cpp master tip. The stock backend is patch-free so that bump is safe, but the paged backend applies a vendored patch series (backend/cpp/llama-cpp/patches/paged/) hand-verified bit-exact against ONE specific tip. A naive bump moves the tip out from under the patches and breaks 'git apply' at build time - a dep-bump PR would go red (or, worse, the break surfaces later in a release build).
Mirror the turboquant precedent: give the paged wrapper its OWN LLAMA_VERSION pin (the verified 9d5d882d) and force it into every copied build via LLAMA_VERSION=$(LLAMA_VERSION), so the nightly stock bump no longer drags the paged build to an unverified tip. Unlike turboquant (whose fork branch carries the patches and is safe to auto-bump), the paged series is vendored, so it gets NO bump_deps.yaml entry: it is advanced only by the manual PIN_SYNC process. Add cross-referencing comments in both Makefiles and bump_deps.yaml.
Also add PIN_BUMP_APPLY_CHECK.md: an apply-feasibility report for the latest tip (c299a92c, 23 commits ahead). The full series applies CLEAN under 'git apply' with only benign line offsets and zero conflicts; the lone failure (0019) is a pre-existing stray dev-doc hunk, identical on the current pin, not a bump regression.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|