mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-28 10:27:30 -04:00
be1ae9338ba1756be178dde5098d6d088512abef
6873 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
be1ae9338b |
fix(distributed): missing agent NATS permissions (#10571)
Signed-off-by: Nicholas Ciechanowski <nicholas@ciech.anow.ski> |
||
|
|
923c47020d |
fix(launcher): robust binary download/upgrade (resume, rate-limit, UX) (#10575)
* fix(launcher): resume flaky downloads, drop redundant percent, fit dialogs
The binary upgrade/download flow had three rough edges:
- The status label printed "Downloading... N%" right next to a progress
bar already showing the percent. Replace it with a human-readable byte
readout ("Downloading... 12.3 MB / 45.6 MB").
- A failed download (GitHub releases are flaky) had no recourse and always
restarted from byte 0. Stream to "<dest>.part" and resume via a
"Range: bytes=N-" request (handling 206/200/416), renaming to the final
path only after checksum verification; on checksum failure the file is
discarded so the next attempt starts clean. Add a Retry button that
appears on failure and resumes from the partial file.
- Progress/install dialogs were hardcoded to oversized dimensions, leaving
a blank gap below "View Release Notes". Size each window to its content
with a sane minimum width.
Also unify the three near-identical download-progress popups into one
Launcher.showDownloadProgressWindow helper (and delete a dead unused copy
in ui.go) so the behaviour stays consistent across every entry point.
The progress callback now reports (downloaded, total) byte counts instead
of a single fraction. Resume/retry behaviour is covered by httptest-backed
unit tests in release_manager_test.go.
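The resume flow described above can be sketched roughly as follows. This is an illustration only, not the launcher's code: `rangeHeader`, `decide`, `formatProgress` and the `resumeAction` values are hypothetical names, the 416 handling is one plausible reading of "handling 206/200/416", and 1 MB is assumed to be 1,000,000 bytes.

```go
package main

import (
	"fmt"
	"net/http"
)

// resumeAction is a hypothetical enum: what to do with "<dest>.part"
// once the server has answered a "Range: bytes=N-" request.
type resumeAction int

const (
	appendToPart    resumeAction = iota // 206: range honored, append the body to the part file
	restartFromZero                     // 200: range ignored, full body follows, truncate first
	recheckPart                         // 416: nothing past offset N (assumption: verify what is on disk)
	failDownload                        // any other status
)

// rangeHeader builds the Range header value for resuming at offset.
func rangeHeader(offset int64) string {
	return fmt.Sprintf("bytes=%d-", offset)
}

// decide maps the response status of a ranged request to an action.
func decide(status int) resumeAction {
	switch status {
	case http.StatusPartialContent:
		return appendToPart
	case http.StatusOK:
		return restartFromZero
	case http.StatusRequestedRangeNotSatisfiable:
		return recheckPart
	default:
		return failDownload
	}
}

// formatProgress renders the byte readout that replaces the percent label.
// The decimal MB unit is an assumption; the commit does not state it.
func formatProgress(downloaded, total int64) string {
	const mb = 1000 * 1000
	return fmt.Sprintf("Downloading... %.1f MB / %.1f MB",
		float64(downloaded)/mb, float64(total)/mb)
}

func main() {
	fmt.Println(rangeHeader(4096))
	fmt.Println(decide(http.StatusPartialContent) == appendToPart)
	fmt.Println(formatProgress(12_300_000, 45_600_000))
}
```

Keeping the status decision in a pure function makes it easy to cover with table-driven tests, separate from the httptest-backed tests the commit mentions.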
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(launcher): resolve latest version via redirect to dodge GitHub API 403
On a fresh Linux start with no LocalAI installed, the download failed with
"failed to fetch latest release: status 403". The cause is the unauthenticated
api.github.com rate limit (60 requests/hour, per IP): on shared/NAT/CGNAT/cloud
addresses it is exhausted almost immediately and every request 403s.
Resolve the latest version by following the github.com "releases/latest"
redirect instead, reading the tag from the final ".../releases/tag/<tag>" URL.
That endpoint is not subject to the API rate limit. Only the version is ever
consumed by callers, so the tag is sufficient. The JSON API is kept as a
fallback, now honoring GITHUB_TOKEN and reporting rate-limit 403/429 clearly
instead of an opaque status code.
Covered by an httptest-backed unit test that asserts the redirect path is used.
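Reading the tag from the final redirect URL might look like the sketch below. `tagFromReleaseURL` is a hypothetical helper, not the launcher's function. In real use the input would be the URL of the last request after `net/http` has followed the redirect (`resp.Request.URL`).

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// tagFromReleaseURL extracts "<tag>" from a URL whose path ends in
// ".../releases/tag/<tag>". Hypothetical helper for illustration.
func tagFromReleaseURL(finalURL string) (string, error) {
	u, err := url.Parse(finalURL)
	if err != nil {
		return "", err
	}
	const marker = "/releases/tag/"
	i := strings.LastIndex(u.Path, marker)
	if i < 0 {
		return "", fmt.Errorf("no release tag in %q", finalURL)
	}
	tag := strings.Trim(u.Path[i+len(marker):], "/")
	if tag == "" {
		return "", fmt.Errorf("empty release tag in %q", finalURL)
	}
	return tag, nil
}

func main() {
	tag, err := tagFromReleaseURL("https://github.com/mudler/LocalAI/releases/tag/v4.5.5")
	fmt.Println(tag, err)
}
```

If the redirect does not end on a tag URL, the helper returns an error, and a caller could then fall back to the JSON API as the commit describes.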
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
b7a1dec773 |
fix(kokoro): add explicit click dep so spacy CLI works on intel build (#10572)
The kokoro install.sh ends with `python -m spacy download en_core_web_sm`.
spaCy's CLI imports typer -> click, so click must be present at that point.
On the intel build profile, install.sh adds `--upgrade --index-strategy=unsafe-first-match`
against the Intel pip index. With that resolution strategy, click is not
resolved/installed, so the spacy CLI import fails with:
ModuleNotFoundError: No module named 'click'
make: *** [Makefile:3: kokoro] Error 1
Other profiles (cpu/cublas) pull click in transitively and build fine; only
the intel profile breaks. This surfaced in the v4.5.5 release CI as the
gpu-intel-kokoro backend image build failure.
Make click an explicit dependency in the base requirements.txt (installed for
every profile) so it is always present before `python -m spacy download` runs,
regardless of index resolution. Unpinned: spacy constrains the version.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
de2ec2f136 |
feat(backends): add voice-detect + face-detect ggml backends (replace Python insightface/speaker-recognition) (#10441)
* feat(voice-detect): add Go purego backend for voice-detect.cpp Add backend/go/voice-detect implementing the Backend gRPC voice subset (VoiceEmbed/VoiceVerify/VoiceAnalyze) over libvoicedetect.so via purego, mirroring the parakeet-cpp / omnivoice-cpp backends. The flat voicedetect_capi C ABI is dlopen'd cgo-less; malloc'd string and float-vector returns are owned by Go and released through the matching capi free functions, with the per-ctx last error surfaced into Go errors. Calls are serialized via base.SingleThread since the C context is not reentrant. Proto field mapping: - VoiceEmbed: VoiceEmbedRequest.audio (path) -> embed_path -> Embedding+Model. - VoiceVerify: audio1/audio2 + threshold (<=0 falls back to the verify_threshold option, default 0.25) -> verify_paths -> verified/distance/ threshold/confidence/model/processing_time_ms. - VoiceAnalyze: audio (path) -> analyze_path_json; the JSON age/gender/emotion document maps to a single VoiceAnalysis segment (start/end 0; gender "label" -> dominant_gender with the remaining float scores as the gender map; emotion label/scores -> dominant_emotion/emotion). The Makefile pins voice-detect.cpp to 47546430, clones+builds libvoicedetect.so with ggml static-linked (PIC, GGML_NATIVE off) so dlopen needs no external libggml/libvoicedetect; ldd on the artifact shows only system libs. Ginkgo tests cover option parsing and analyze-JSON mapping; embed/verify smoke specs gate on VOICEDETECT_BACKEND_TEST_MODEL + VOICEDETECT_BACKEND_TEST_WAV. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(voice-detect): wire backend into index, gallery and build Register the voice-detect.cpp speaker-recognition + voice-analysis backend (added in Voice-INT-A) into LocalAI's distribution surfaces, mirroring the ced backend (the closest mudler C++/ggml audio analogue): - backend/index.yaml: add the &voicedetect meta-backend (capabilities platform map, no top-level uri) plus the full set of concrete per-arch image entries (cpu/cuda12/cuda13/metal/rocm/sycl/vulkan/l4t and the -development variants). Referential integrity audited - every alias target resolves. - gallery/index.yaml: add 5 model entries on backend voice-detect - ECAPA-TDNN, WeSpeaker ResNet34, 3D-Speaker ERes2Net, CAM++ and the wav2vec2 age/gender/emotion analyze model. The engine architecture is read from GGUF metadata (voicedetect.arch) at load. GGUF artifacts are not yet published: each files: entry points at the intended mudler/voice-detect-gguf location with a TODO to fill sha256 after upload (no fabricated hashes). - .github/backend-matrix.yml: add the linux build matrix block + the darwin metal entry mirroring ced. - .github/workflows/bump_deps.yaml: track mudler/voice-detect.cpp via VOICEDETECT_VERSION (pin 47546430, = 4754643). - core/config/backend_capabilities.go: register voice-detect in the backend capability map (VoiceVerify/VoiceEmbed/VoiceAnalyze -> speaker_recognition), mirroring speaker-recognition. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(face-detect): add purego Go backend for face-detect.cpp Add the LocalAI Go backend that dlopens libfacedetect.so (the flat facedetect_capi_* C-ABI) via purego, mirroring the sibling voice-detect backend. Implements the Face subset of the Backend gRPC service: - Embeddings(PredictOptions): Images[0] base64 -> temp file -> embed_path -> L2-normalized ArcFace embedding. 
- Detect(DetectOptions): src -> detect_path_json -> Detection boxes (class_name "face", [x1,y1,x2,y2] -> x/y/w/h). - FaceVerify(FaceVerifyRequest): two images + threshold + anti_spoof -> verify_paths; best-effort img areas via detect. - FaceAnalyze(FaceAnalyzeRequest): img -> analyze_path_json -> per-face age + gender ("M"/"F" normalized to "Man"/"Woman"). The Makefile pins face-detect.cpp to 636a1963 and builds the shared lib with ggml + vendored libjpeg-turbo static (PIC), so the .so is ldd-clean (no libggml) and exports only facedetect_capi_* (no jpeg_ symbols). Gated Ginkgo e2e mirrors voice-detect. Note for the gallery-wiring task: backend registration (index.yaml, gallery, core/config/backend_capabilities.go) is intentionally not touched here. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(voice-detect): replace em dashes in net-new descriptions Project style forbids em/en dashes. Replace the three U+2014 chars introduced by the voice-detect gallery/index wiring with `-`/`:`. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(face-detect): wire backend into index, gallery and build Register the face-detect.cpp face detection / embedding / verification / analysis backend (added in Face-INT-A) into LocalAI's distribution surfaces, mirroring the voice-detect wiring (the closest mudler C++/ggml recognition analogue): - backend/index.yaml: add the &facedetect meta-backend (capabilities platform map, no top-level uri to avoid the meta-backend gotcha) plus the full set of concrete per-arch image entries (cpu/cuda12/cuda13/ metal/rocm/sycl-f16/sycl-f32/vulkan/l4t and the -development variants), 22 entries. Referential integrity audited: every alias target resolves. 
- gallery/index.yaml: add 4 model entries on backend face-detect - face-detect-buffalo-l/m/s (insightface SCRFD + ArcFace/MBF, NON-COMMERCIAL) and face-detect-yunet-sface (OpenCV-Zoo YuNet + SFace, APACHE-2.0, the commercial-friendly alternative). The detector/embedder architecture is read from GGUF metadata (facedetect.arch) at load; only the real verify_threshold option is set (0.35 buffalo, 0.363 sface). GGUF artifacts are not yet published: each files: entry points at the intended mudler/face-detect-gguf location with a TODO to fill sha256 after upload (no fabricated hashes). - core/config/backend_capabilities.go: register face-detect in the backend capability map (Embedding/Detect/FaceVerify/FaceAnalyze -> face_recognition), mirroring insightface. - .github/backend-matrix.yml: add the linux build matrix block + the darwin metal entry mirroring voice-detect. - .github/workflows/bump_deps.yaml: track mudler/face-detect.cpp via FACEDETECT_VERSION (pin 636a1963). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(recon): voice-detect metal build branch + face-detect gallery usecases Add the missing metal BUILD_TYPE branch to the voice-detect Makefile forwarding -DVOICEDETECT_GGML_METAL=ON, mirroring face-detect, so the darwin metal CI artifact is built with the Metal backend instead of CPU-only. Expand the 4 face-detect gallery models' known_usecases to [face_recognition, detection, embeddings] to match the backend capabilities map and the mirrored insightface-buffalo entries, so auto-selection for /v1/detect and /embeddings works. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * docs(recon): document voice-detect and face-detect ggml backends Document the new standalone C++/ggml biometric backends as the recommended/default option for face and voice recognition, keeping the existing Python insightface / speaker-recognition backends framed as the legacy path. - features/face-recognition.md: add a face-detect (ggml) backend section with the gallery entries (buffalo-l/m/s non-commercial, yunet-sface Apache-2.0), licensing, and verify/detect/analyze quickstart. - features/voice-recognition.md: add a voice-detect (ggml) backend section with the gallery entries (ecapa-tdnn, wespeaker-resnet34, eres2net, campplus speaker recognizers; emotion-wav2vec2 non-commercial analyze head) and quickstart. - reference/compatibility-table.md: add face-detect.cpp and voice-detect.cpp rows to the Vision, Detection & Recognition table. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(gallery): publish recon backend GGUF uris + sha256 Fill in the published HuggingFace GGUF uris and verified sha256 for the 9 recon gallery entries (voice-detect-* and face-detect-*), and remove the TODO publish markers. Correct the eres2net, campplus, and emotion-wav2vec2 uris to the actual published filenames. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(gallery): re-embed buffalo anti-spoof + add audeering age/gender voice model Update the 3 buffalo face-detect GGUF sha256 (anti-spoof ensemble now embedded and re-uploaded under the same filenames/uris) and note the FaceVerify anti_spoof request flag in each description. Add a new voice-detect-age-gender-wav2vec2 gallery entry mirroring the emotion model. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(gallery): add face-detect-buffalo-sc and antelopev2 packs Add gallery entries for two newly-published insightface face packs on the face-detect backend: buffalo_sc (smallest pack, SCRFD-500M + small ArcFace) and antelopev2 (higher-accuracy, SCRFD-10G + ArcFace glint360k R100, 512-d). Both are non-commercial research-only. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(recon): honor LocalAI per-model threads in voice/face-detect backends LocalAI spawns one backend process per model and serves requests concurrently, so the engines' own min(hardware_concurrency, 8) default can oversubscribe cores. Forward the per-model Threads value from the gRPC LoadModel options into the engine via VOICEDETECT_THREADS / FACEDETECT_THREADS (read at backend construction) before the capi load. A non-positive Threads is treated as unset, leaving the engine default. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump backend pins to CPU-optimized engine commits voice-detect.cpp -> 0d9c1b3 (radix-2 FFT FBank, threads, flash attn + cached pos-conv); face-detect.cpp -> 523aee1 (thread-gated direct conv, threads). Brings the CPU optimizations into the LocalAI backend builds. GGUF format and parity unchanged, so the published HF GGUFs remain valid. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump backend pins to round-2 CPU-optimized engines voice-detect.cpp -> fe7e6a3 (ERes2Net 1x1->mul_mat, CAM++ layout+context, wav2vec2 conv-LN, ECAPA capture-drop, AVX512 dispatch opt-in); face-detect.cpp -> 9c8adb7 (AVX2 Winograd F(2x2,3x3) for SCRFD/ArcFace 3x3 convs, ArcFace BN-fold). Parity unchanged (cosine=1.0); GGUF format unchanged, HF GGUFs valid. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump backend pins to round-3 Winograd engines voice-detect.cpp -> 45122ec (Winograd F(2x2,3x3) for WeSpeaker/ERes2Net 3x3 convs, -22%/-20% @8t); face-detect.cpp -> cd5c962 (Winograd F(4x4,3x3) for SCRFD large maps, -22% @1t on top of F(2x2), more load-stable). Parity held (cosine=1.0); GGUF format unchanged, HF GGUFs valid. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump backend pins to round-4 Winograd engines (CPU opt complete) voice-detect.cpp -> d2839ca (CAM++ FCM 2D convs through Winograd, -15.5%/-10.3%); face-detect.cpp -> c1db23d (AVX2-vectorized Winograd tile transforms, SCRFD detect -14%/-9.6%). Final CPU optimization round; the conv-kernel lever class is now exhausted (parity held cosine=1.0; GGUF/parity unchanged, HF GGUFs valid). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump face-detect pin to deep-kernel engine (7ae5c4d) face-detect.cpp -> 7ae5c4d: register-blocked winograd-domain GEMM microkernel (2.8x isolated GFLOP/s), AVX-512 zmm evolution behind runtime CPUID dispatch (ship-safe, AVX2 fallback bit-identical), bias/relu fused into the winograd output transform, and SFace Conv+BN fold + bias/PReLU fusion. SCRFD detect ~1.4x faster end-to-end vs the round-4 baseline; parity bit-exact; portable single binary (function-multiversioned, no global -mavx512f). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump voice-detect pin to ECAPA operand-order win (e9c56ae) voice-detect.cpp -> e9c56ae: weight-as-src0 mul_mat order in ECAPA's F32 conv1d_same (routes through tinyBLAS sgemm); ECAPA embed 1.67x @1t / ~1.3x @8t, parity cosine=1.0. 
Isolated to encoder.cpp (ECAPA-only); ERes2Net/CAM++/WeSpeaker do not call conv1d_same so are provably unaffected. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump pins to FMA-throughput engines (voice f7b9f89, face 2d2d5f0) face -> 2d2d5f0: route ArcFace 3x3 body convs through the AVX-512 winograd microkernel (kWinoMinSize 80->14); ArcFace 1.62x @1t, SCRFD detect to 0.966 of MLAS @1t, no regression. voice -> f7b9f89: runtime-CPUID-dispatched AVX-512 winograd-GEMM microkernel (ship-safe, AVX2 fallback bit-identical); WeSpeaker 1.90x @1t. Parity cosine=1.0 throughout; portable single binaries. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump pins to MLAS-class direct-conv engines (voice 7ecfd07, face be22d67) Hand-tuned nChw16c AVX-512 register-tiled direct-conv microkernel (~263 GFLOP/s, within 6-7% of MLAS per-op efficiency), runtime-CPUID-dispatched + AVX2 fallback, fused bias/relu. voice 7ecfd07: default 3x3-s1 kernel for WeSpeaker (+37%/+32%) + ERes2Net, CAM++ pinned to Winograd. face be22d67: shape-gated to the ArcFace recognizer body (+25-27% @8t); SCRFD detector stays on Winograd (no regression). Parity cosine=1.0 / detect <=1px on AVX-512 + AVX2 paths. Portable single binaries. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump voice pin to Phase-A blocked backbone (f4e7eef) WeSpeaker ResNet34 runs as one nChw16c blocked island (2 reorders/forward vs ~60) on AVX-512, default; per-conv directconv fallback on AVX2. +2.9% @1t / +17-19% @8t vs per-conv directconv, parity cosine=1.0. The conv microkernel is already FMA-bound near peak (~0.86-0.98x MLAS-implied); residual to MLAS is sub-peak edge + non-conv tail, documented in docs/cpu-optimization.md. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump pins to breadth blocked-backbone (voice 7f66871, face d80092b) voice 7f66871: AVX2-vectorized (ymm) blocked island - AVX2-only hosts now run the blocked backbone for WeSpeaker (2.3x over per-conv-AVX2, cosine=1.0); ERes2Net stays per-conv (blocked regresses, opt-in only); CAM++ Winograd-pinned. face d80092b: ArcFace recognizer blocked island, AVX-512 default (-13% @8t, ~0.90x MLAS, the closest conv result), auto per-conv on AVX2; SCRFD untouched on Winograd (0 island invocations during detect). Parity cosine=1.0 / detect <=1px throughout. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump pins to small-spatial + stem conv kernels (voice 99b1804, face 47fdab6) Measured-gap-driven conv kernels: small-spatial (fill the register tile when output width <= tile width) + small-IC stem + strided-1x1/downsample recovery. ArcFace recognizer 0.57 -> 0.70x MLAS @1t (the closest conv model), WeSpeaker 0.65 -> 0.79x @1t. Parity cosine=1.0 / detect <=1px. The OC-block-sharing lever was a measured dead-end (deep stride-1 is L3-weight-bandwidth bound, not read-port bound) and was NOT shipped. Kernel ceiling reached; further gap needs an algorithm-class change (cache-blocked weight-stationary GEMM, or q8 weights). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump pins to GPU persistent-graph + multi-model-safe cache (voice 45d2e6b, face 0a4799a) GPU wins (CUDA/ggml backend, no CPU-path change): persistent per-shape graph+context cache in Backend::compute() eliminates the per-call cudaGraph re-instantiation churn -> wav2vec2 emotion+age-gender now AT GPU parity with torch-cuDNN on GB10 (0.97-0.98x), CAM++ -5.7ms; bit-identical parity. 
Cache hardened multi-model-safe (invalidate-on-free keyed by the ModelLoader weights buffer) so LocalAI multi-model hosting cannot stale-hit. Conv models still trail cuDNN (im2col-materialization-bound) - cuDNN implicit-GEMM lever next. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump pins to cuDNN-conv-capable engines (voice b6e4356, face 6107a24) Adds the opt-in cuDNN implicit-GEMM conv path (VOICEDETECT_GGML_CUDNN / FACEDETECT_GGML_CUDNN, DEFAULT OFF -> zero build/runtime dep until enabled). On GPU it kills the im2col-materialization bottleneck and reaches torch-cuDNN parity on the spill-bound convs: SCRFD detect 14.8->6.4ms (2.3x, ~parity), WeSpeaker ~parity, ERes2Net beats torch (1.10x); ArcFace/CAM++ neutral (no spill). Parity exact (SCRFD <=1px, cosine=1.0). To USE it in LocalAI, the CUDA backend build must enable the flag AND bundle libcudnn - deferred until a cuDNN-bundled GPU image; flag stays OFF here. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(recon): enable cuDNN conv path on arm64+CUDA13 recon backends The voice-detect.cpp / face-detect.cpp engines have an opt-in cuDNN implicit-GEMM conv path behind VOICEDETECT_GGML_CUDNN / FACEDETECT_GGML_CUDNN (default OFF) that kills im2col on the GPU and reaches torch-cuDNN parity (SCRFD 2.3x, WeSpeaker/ERes2Net parity), measured on the GB10 (arm64, CUDA 13, sm_121a). Enable it for the CUDA build, but only where cuDNN actually ships: the arm64 + CUDA 13 image (GB10/Jetson/L4T). x86 CUDA images carry no cuDNN, so flipping it on globally for BUILD_TYPE=cublas would be a link failure. The Makefiles gate on CUDA_MAJOR_VERSION=13 + arch (TARGETARCH from the matrix/Docker build, uname -m fallback for local builds). 
backend/Dockerfile.golang already installs the runtime libcudnn9-cuda-13 in the arm64+CUDA13 apt block; add the matching libcudnn9-dev-cuda-13 so the build-time link resolves. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): bump voice-detect pin to ERes2Net blocked-default (30beecd) Defaults VD_ERES2NET_BLOCKED ON: routes the ERes2Net Res2Net body through the blocked nChw16c AVX-512 directconv island instead of the 1x1 mul_mat fast path (CONT-transpose + skinny low-K GEMM). On the shipped GGML_NATIVE=OFF build (ggml mul_mat is AVX2-only) this wins ~2x at every thread count (2.07x@1t, 2.2x@4t, 2.05x@8t); pure-AVX2 fallback still 1.3-1.62x. Parity exact (cosine=1.000000 vs golden), so registered voices + verify/identify thresholds are unaffected. The prior default-OFF rested on a stale comment whose 23pct regression only held on the non-shipping GGML_NATIVE=ON build. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * docs(readme): announce native voice-detect + face-detect backends in Latest News Add a Latest News entry for the new from-scratch C++/ggml biometric backends (voice-detect.cpp + face-detect.cpp) that replace the Python insightface and speaker-recognition backends: no Python/onnxruntime at inference, self-contained GGUF, bit-exact parity, GPU cuDNN parity. Mirrors the parakeet.cpp / locate-anything.cpp native-backend news entries. Refs PR #10441. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * chore(recon): re-pin to the squashed engine release commits The voice-detect.cpp and face-detect.cpp histories were squashed to a single release commit, which orphaned the previous pins (voice 30beecd, face 6107a24). Re-pin to the new single-commit SHAs (voice 3d51077, face 06914b0); the tree is identical, so the backend build is unchanged. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
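Several of the small mappings named in the commit above (threshold fallback, box conversion, gender label normalization, thread forwarding) can be sketched as pure functions. All names here are hypothetical and the real backends may structure this differently. Passing unknown gender labels through unchanged is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// box is an x/y/w/h rectangle, the shape Detect is described as returning.
type box struct{ X, Y, W, H float64 }

// boxFromCorners converts [x1,y1,x2,y2] corners to x/y/w/h.
func boxFromCorners(x1, y1, x2, y2 float64) box {
	return box{X: x1, Y: y1, W: x2 - x1, H: y2 - y1}
}

// effectiveThreshold applies the "<=0 falls back to the verify_threshold
// option" rule; option would carry the configured value (default 0.25 for voice).
func effectiveThreshold(requested, option float64) float64 {
	if requested <= 0 {
		return option
	}
	return requested
}

// normalizeGender maps the engine's "M"/"F" labels to "Man"/"Woman".
// Other labels pass through unchanged (assumption).
func normalizeGender(label string) string {
	switch label {
	case "M":
		return "Man"
	case "F":
		return "Woman"
	default:
		return label
	}
}

// threadsEnv returns the value to export via VOICEDETECT_THREADS or
// FACEDETECT_THREADS, and false when Threads is non-positive (treated as unset).
func threadsEnv(threads int) (string, bool) {
	if threads <= 0 {
		return "", false
	}
	return strconv.Itoa(threads), true
}

func main() {
	fmt.Println(boxFromCorners(10, 20, 110, 70))
	fmt.Println(effectiveThreshold(0, 0.25), normalizeGender("F"))
	fmt.Println(threadsEnv(4))
}
```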
|
|
d3a26f961d |
fix(ik-llama): port multimodal path to mtmd API and bump to f96eaddb (#10534) (#10568)
* fix(ik-llama): port multimodal path to mtmd API and bump to f96eaddb (#10534)
The IK_LLAMA_VERSION bump to f96eaddba8bed6a9a5e628bbf6a566775c70b49c pulls in upstream commit "Prune examples/llava", which deletes examples/llava (clip.* / llava.*). The ik-llama backend's grpc-server.cpp built a local `myclip` library from those files and called the removed clip/llava C API, so the bump no longer builds.
ik_llama keeps its multimodal stack in the surviving `mtmd` library (examples/mtmd/, public headers mtmd.h + mtmd-helper.h). This ports the backend's multimodal path onto the high-level mtmd_* / mtmd_helper_* API in place, leaving the text path (which still uses ik_llama's retained old common API) untouched:
- Makefile: bump IK_LLAMA_VERSION to f96eaddb.
- prepare.sh: drop the clip/llava source copy + sed block; mtmd is a library target, no source copy needed.
- CMakeLists.txt: remove the `myclip` target; link `mtmd` and add its include dir; build grpc-server as C++17 (mtmd headers require it).
- patches: drop 0002 (targeted the deleted examples/llava/clip.cpp; the mtmd clip.cpp never calls ggml_quantize_chunk, so the fix is unneeded). Keep 0001 (verified still applies).
- grpc-server.cpp / utils.hpp: replace clip_model_load + clip_image_load_from_bytes + llava_image_embed_make_with_clip_img + the manual [img-N] prefix splitting and per-image llava_embd_batch decode loop with mtmd_init_from_file (moved after the model load, which it requires), mtmd_helper_bitmap_init_from_buf, mtmd_tokenize and mtmd_helper_eval_chunks. Legacy [img-N] tags are translated, in order, into mtmd media markers (mtmd_default_marker()); the post-image suffix text stays on the normal token path so the sampling loop is unchanged.
Supersedes #10534.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(ik-llama): align json alias to ordered_json to resolve mtmd.h conflict (#10534)
mtmd.h declares `using json = nlohmann::ordered_json` at global scope (and its mtmd.cpp depends on it), while ik_llama's whole server/common stack also uses ordered_json. Our grpc-server.cpp/utils.hpp kept a plain `nlohmann::json` alias, which now collides with mtmd.h once it is included for the multimodal port: "conflicting declaration 'using json = ...'".
Switch our two aliases to ordered_json to match; it is API-compatible (utils.hpp already used ordered_json for its log helper) and our json never crosses into an unordered-json API.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
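The in-order translation of legacy `[img-N]` tags into media markers, described in the ik-llama commit above, is done in C++ in the backend. The Go sketch below only illustrates the idea. `translateImageTags` is a hypothetical name, and the marker string is assumed to be what `mtmd_default_marker()` returns.

```go
package main

import (
	"fmt"
	"regexp"
)

// mediaMarker is assumed to match the value returned by mtmd_default_marker().
const mediaMarker = "<__media__>"

// imgTag matches legacy image tags such as [img-12].
var imgTag = regexp.MustCompile(`\[img-(\d+)\]`)

// translateImageTags replaces every [img-N] tag with the media marker and
// returns the N values in order of appearance, so the caller can line up
// the image bitmaps with the markers.
func translateImageTags(prompt string) (string, []string) {
	var ids []string
	for _, m := range imgTag.FindAllStringSubmatch(prompt, -1) {
		ids = append(ids, m[1])
	}
	return imgTag.ReplaceAllLiteralString(prompt, mediaMarker), ids
}

func main() {
	out, ids := translateImageTags("Compare [img-2] with [img-1].")
	fmt.Println(out)
	fmt.Println(ids)
}
```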
|
|
13b1ae53bc |
chore: ⬆️ Update ggml-org/llama.cpp to 0ed235ea2c17a19fc8238668653946721ed136fd (#10536)
* ⬆️ Update ggml-org/llama.cpp
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(llama-cpp): link server-stream.cpp TU into grpc-server for upstream 0ed235ea (#10536)
Upstream llama.cpp 0ed235ea added an SSE stream-resumption layer in a new translation unit tools/server/server-stream.cpp, which defines stream_session, stream_pipe_producer and the g_stream_sessions manager. server-context.cpp (already #included into grpc-server.cpp) now calls into it via spipe->cleanup(), stream_aware_should_stop() and stream_session_attach_pipe(), so without the new TU the grpc-server link fails on every arch with:
undefined reference to `stream_pipe_producer::cleanup()'
prepare.sh already copies every tools/server/* file into tools/grpc-server/, so the source is present; the only missing piece was including its definitions. Add an __has_include-guarded #include "server-stream.cpp" before server-context.cpp, mirroring the existing server-chat.cpp and server-schema.cpp guards, keeping the source compatible with older pins/forks that predate the split.
The file is self-contained (its only external symbols come from server-common, already in the TU) so it adds no new undefined references; the http route-handler factories it also defines are unused in the grpc path but harmless.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(llama-cpp): build renamed ggml-rpc-server target for upstream 0ed235ea (#10536)
Upstream renamed the RPC server CMake target and binary from `rpc-server` to `ggml-rpc-server` (tools/rpc/CMakeLists.txt: `set(TARGET ggml-rpc-server)`), so the RPC-enabled grpc build failed with "No rule to make target 'rpc-server'". The grpc-server itself links fine after the server-stream.cpp fix; this only updates the RPC target name and the binary path copied to llama-cpp-rpc-server.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
e68ca109c5 |
chore: ⬆️ Update CrispStrobe/CrispASR to 6514c9da00b03a2f0f1b49a43fae4f3a01a41844 (#10535)
⬆️ Update CrispStrobe/CrispASR
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
6740e988d2 |
chore: ⬆️ Update ggml-org/whisper.cpp to 0ae02cdb2c7317b50991367c165736ce42ed96ac (#10532)
⬆️ Update ggml-org/whisper.cpp
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
ade9cc9e37 |
fix(openresponses): bound resume-stream buffer and enforce response ownership (#10569)
The background=true resumable-stream path had two latent issues.
1. Unbounded resume buffer. AppendEvent grew StreamEvents without limit, so a long-running or abandoned background generation could consume process memory without bound. The store now caps the buffer (event count and total bytes, mirroring llama.cpp's byte-capped slot ring), evicting oldest events from the front and advancing a droppedThrough watermark. GetEventsAfter returns ErrOffsetLost when the requested starting_after is below the watermark, and handleStreamResume surfaces that as HTTP 409 before committing to the SSE response, so a resuming client gets a clear error instead of a silently truncated stream.
2. Missing ownership check (IDOR). GET /responses/:id, its stream resume, and /cancel looked up responses purely by ID, letting any caller who knows or guesses an ID read or cancel another caller's response. Responses now carry the creating caller's identity (auth.GetUser), stamped at creation and compared on read/cancel/resume; a mismatch returns 404 (not 403) so existence is not leaked. Backward compatible: responses with no owner (single-key / no-auth deployments) remain accessible.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
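The capped resume buffer and the ownership check from the commit above can be sketched as follows. This is a minimal illustration and not the store's code: `resumeBuffer`, its fields and `accessStatus` are hypothetical, sequence numbers are assumed to start at 1, and only `ErrOffsetLost` and the `droppedThrough` watermark are named in the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// ErrOffsetLost mirrors the error named in the commit message.
var ErrOffsetLost = errors.New("requested offset was evicted from the resume buffer")

type event struct {
	seq  int
	data string
}

// resumeBuffer caps both the event count and the total payload bytes.
type resumeBuffer struct {
	maxEvents, maxBytes int
	events              []event
	bytes, nextSeq      int
	droppedThrough      int // highest evicted sequence number, 0 if none
}

// Append stores an event, then evicts from the front until both caps hold
// (always keeping at least the newest event).
func (b *resumeBuffer) Append(data string) int {
	b.nextSeq++
	b.events = append(b.events, event{b.nextSeq, data})
	b.bytes += len(data)
	for len(b.events) > 1 && (len(b.events) > b.maxEvents || b.bytes > b.maxBytes) {
		old := b.events[0]
		b.events = b.events[1:]
		b.bytes -= len(old.data)
		b.droppedThrough = old.seq
	}
	return b.nextSeq
}

// EventsAfter returns events with seq > startingAfter, or ErrOffsetLost
// when part of that range has already been evicted.
func (b *resumeBuffer) EventsAfter(startingAfter int) ([]string, error) {
	if startingAfter < b.droppedThrough {
		return nil, ErrOffsetLost
	}
	var out []string
	for _, e := range b.events {
		if e.seq > startingAfter {
			out = append(out, e.data)
		}
	}
	return out, nil
}

// accessStatus returns 404 on an owner mismatch so existence is not leaked;
// responses without an owner stay accessible.
func accessStatus(owner, caller string) int {
	if owner == "" || owner == caller {
		return http.StatusOK
	}
	return http.StatusNotFound
}

func main() {
	b := &resumeBuffer{maxEvents: 3, maxBytes: 1 << 20}
	for _, d := range []string{"a", "b", "c", "d", "e"} {
		b.Append(d)
	}
	fmt.Println(b.EventsAfter(2))
	fmt.Println(b.EventsAfter(1))
	fmt.Println(accessStatus("alice", "bob"))
}
```

A handler would map `ErrOffsetLost` to HTTP 409 before writing any SSE headers, as the commit describes.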
|
|
471e38e4e7 |
chore: ⬆️ Update leejet/stable-diffusion.cpp to 9956436c925a367daeab097598b1ea1f32d3503f (#10533)
⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
f3d829e2ef |
feat(distributed): add LOCALAI_DISTRIBUTED_SHARED_MODELS to skip staging on shared volumes (#10556) (#10566)
In distributed mode, even when the frontend and workers share the same models directory via a shared volume mount, starting a model on a worker re-staged (re-downloaded) it: stageModelFiles always uploads model files into a tracking-key-namespaced subdir on the worker, and the staging probe only checks that staged location, so a file already present on the shared volume at the canonical path was never reused. Add a config switch LOCALAI_DISTRIBUTED_SHARED_MODELS (default false). When enabled, the operator asserts that all nodes mount the SAME models directory at the SAME path, so staging is unnecessary: the frontend's absolute model paths are already valid on the worker. In that mode stageModelFiles returns the cloned opts unchanged without uploading, leaving the path fields pointing at their canonical absolute paths so the worker loads them directly from the shared volume. The value is plumbed from DistributedConfig through SmartRouterOptions into the SmartRouter. Docs and docker-compose.distributed.yaml updated. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
91885c2c7e |
fix(distributed): return empty backend list for agent nodes instead of failing backend.list (#10545) (#10565)
Opening an AGENT-type worker node's detail page errored with "failed to list backends on node" / NATS "nodes.<id>.backend.list: no responders available". Agent workers only subscribe to agent.*, jobs.*, mcp.* and <prefix>.backend.stop; they never subscribe to backend.list, so the per-node ListBackendsOnNodeEndpoint request had no responder and timed out. The aggregate cluster-wide list already guards this in managers_distributed.go (skip nodes whose NodeType is set and not "backend"). The single-node endpoint lacked the same guard. Thread the NodeRegistry into ListBackendsOnNodeEndpoint and short-circuit to an empty (non-nil) list for non-backend node types before issuing the doomed NATS request, mirroring the aggregate-list gate so both views stay consistent. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
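The guard described in the commit above amounts to a short-circuit before the request is sent. A minimal Python sketch, assuming a hypothetical `request_backend_list` callable in place of the NATS request; the rule itself (skip nodes whose type is set and is not "backend") is taken from the commit message.

```python
def list_backends_on_node(node_type, request_backend_list):
    """Return an empty (non-None) list for non-backend nodes without issuing a request."""
    # A node type that is set and is not "backend" has no backend.list responder.
    if node_type and node_type != "backend":
        return []
    return request_backend_list()
```

An unset node type falls through to the request, which matches the "is set and not backend" wording of the aggregate-list gate.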
|
|
f1fcafb888 |
fix(gallery): match mmproj/model quant as a whole token so F16 no longer selects BF16 (#10559) (#10564)
pickPreferredGroup matched a quant preference against the shard base filename with strings.Contains. Because `f16` is a substring of `bf16`, asking for the `F16` mmproj quant would wrongly satisfy a `BF16` file and select it when its group came first. Match the preference as a whole token instead: it must be delimited by a non-alphanumeric character (or the string start/end) on both outer edges. Separators inside the preference itself (e.g. `ud-q4_k_xl`) are left untouched, and all occurrences are scanned before rejecting. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
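The whole-token rule from the commit above can be illustrated with a small Python sketch. The function name `matches_whole_token` is hypothetical, and case-insensitive comparison is an assumption inferred from the `F16` / `bf16` example; the real code is Go and may differ in detail.

```python
def matches_whole_token(filename, preference):
    """True if preference occurs in filename delimited by non-alphanumerics or string edges."""
    name = filename.lower()       # assumption: matching is case-insensitive
    pref = preference.lower()
    if not pref:
        return False
    start = name.find(pref)
    while start != -1:
        end = start + len(pref)
        left_ok = start == 0 or not name[start - 1].isalnum()
        right_ok = end == len(name) or not name[end].isalnum()
        if left_ok and right_ok:
            return True
        # Keep scanning: a later occurrence may be properly delimited.
        start = name.find(pref, start + 1)
    return False
```

Only the outer edges of the preference are checked, so separators inside a preference such as `ud-q4_k_xl` do not affect the result.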
|
|
fdff114701 |
ci(vibevoice): skip the ASR transcription e2e on release tag builds (#10567)
The `tests-vibevoice-cpp-grpc-transcription` job downloads the vibevoice ASR model (`vibevoice-asr-q4_k.gguf`, ~10 GB) and decodes it through the e2e-backends harness. On release tag pushes the detect step forces the full matrix (run-all=true), so this job runs and consistently times out: the inner `go test -timeout 30m` cannot pull a 10 GB file from HuggingFace's throttled Xet CDN within budget (curl --max-time 600 x5 retries overruns the deadline), leaving an orphaned curl and a 30m panic. It has been red on every release (v4.5.3/4/5). Guard the job's `if` with `!startsWith(github.ref, 'refs/tags/')` so it no longer runs on tag/release builds. It still runs on PRs and branch pushes that touch vibevoice-cpp, so real regressions are caught off the release path. A proper fix (a small ASR test GGUF) can re-enable it on tags later. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
1154be5eea |
fix(config): fall back to DefaultContextSize for unparseable GGUFs; pin NVFP4 gallery context_size (#10563)
The GGUF metadata parser (gpustack/gguf-parser-go) cannot read NVFP4-quantized GGUFs at all: it errors with "read tensor info 0: This quantized type is currently unsupported" because NVFP4 is a ggml tensor type it does not know. When ParseGGUFFile errors, the llama-cpp defaults hook skips guessGGUFFromFile entirely and the deferred fallback sets the context window to the conservative GGUFFallbackContextSize (1024). The result: a model that trains to 262144 tokens runs with n_ctx=1024, and every prompt over ~1k tokens fails with "request (N tokens) exceeds the available context size (1024 tokens)".
Two changes:
- Drop GGUFFallbackContextSize (1024) and fall back to DefaultContextSize (4096) in both the GGUF run-estimate path (gguf.go) and the deferred hook fallback (hooks_llamacpp.go). 1024 is a sensible floor for a tiny CPU GGUF but a footgun for a large, long-context model whose header simply cannot be parsed. Strengthen the existing "GGUF unreadable" test to assert the value.
- Set context_size explicitly on the four NVFP4 gallery entries (qwen3.6-35b-a3b-nvfp4-mtp, qwopus3.6-27b-v2-mtp-nvfp4, qwopus3.6-27b-coder-mtp-nvfp4, qwen3.6-27b-nvfp4-mtp) so the parser failure is irrelevant for them. 32768 matches sibling Qwen entries and is safe on memory; operators can raise it toward the 262144 train length.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
8aba4fdba3 |
chore(fish-speech): drop the darwin/metal build target (#10561)
The fish-speech metal-darwin-arm64 backend build has been failing on every
release (v4.5.3, v4.5.4, v4.5.5) and is a standing red on the darwin backend
matrix. fish-speech pulls `tokenizers` transitively from its upstream source
(`pip install -e fish-speech-src`), and on darwin/arm64 there is no prebuilt
wheel for the pinned old `tokenizers` version, so pip builds it from source.
Modern rustc rejects that old crate as a hard error:
error: casting `&T` to `&mut T` is undefined behavior ...
--> tokenizers-lib/src/models/bpe/trainer.rs:517:47
= note: `#[deny(invalid_reference_casting)]` on by default
error: could not compile `tokenizers` (lib) due to 1 previous error
This is deterministic, not a flake, and there is no clean fix that does not
either pin a stale Rust toolchain or downgrade a soundness lint guarding real
UB. Until upstream fish-speech moves to a tokenizers version that compiles on
current toolchains, drop darwin support so the release backend build stays
green. The Linux/CUDA/ROCm/Intel/L4T variants are unaffected.
Removes:
- the `-metal-darwin-arm64-fish-speech` entry from `includeDarwin` in
backend-matrix.yml
- the `metal:` capability mappings and the concrete `metal-fish-speech` /
`metal-fish-speech-development` gallery entries in backend/index.yaml
- the now-unused darwin-only requirements-mps.txt
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
d7d7721eae |
feat(distributed): SyncedMap component + migrate finetune/quant/agent-tasks to cross-replica state (#10542)
* feat(distributed): add SyncedMap cross-replica in-memory state component Introduce core/services/syncstate.SyncedMap[K,V]: a thread-safe in-memory map that keeps itself consistent across frontend replicas via NATS, with an optional pluggable durable Store and hydrate-from-source convergence. Several features keep process-local state surfaced to the API (finetune/quant jobs, agent tasks, model configs) and each hand-wired the same in-memory + NATS broadcast + read-through-store legs - or forgot to, reintroducing cross-replica staleness. SyncedMap makes that consistency a configuration choice: - local writes mutate the map, write through the Store, then broadcast a delta; - the apply path is memory-only and never re-publishes or re-writes the Store (structural echo-loop guard, mirroring galleryop.mergeStatus); - on Start and on NATS reconnect the map re-hydrates from the source (Store, else Loader); an optional periodic Reconcile repairs silent drift; - standalone mode (nil NATS client) is a strict in-memory no-op. Reconnect re-hydrate is wired via a new *messaging.Client.OnReconnect callback, consumed through an optional type-assertion so MessagingClient stays minimal. Adds messaging.SubjectSyncStateDelta and a reusable testutil.FakeBus (synchronous in-process MessagingClient with wildcard matching) for adopter tests. Component only; service migrations follow in subsequent commits. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * refactor(finetune): back jobs with SyncedMap for cross-replica consistency FineTuneService kept jobs in a process-local map and, although it wrote them to Postgres, ListJobs/GetJob never read the store back and the wired natsClient was never used - so in distributed mode a job created on one replica was invisible to the others. Replace the map and the dead client with a syncstate.SyncedMap keyed by job ID, value *schema.FineTuneJob (the exact REST shape, so responses are unchanged). 
- Add a Store adapter (core/services/finetune/syncstore.go) over FineTuneStore, plus FineTuneStore.ListAll (global hydrate; per-user List kept) and an idempotent Upsert (create-or-update; Create alone fails on dup key). - Writes go through SyncedMap.Set/Delete (write-through + broadcast); reads use List/Get. The on-disk state.json path becomes the standalone Loader, keeping single-node restart recovery (stale->stopped / exporting->failed fixups). - Fold SetNATSClient/SetFineTuneStore into NewFineTuneService; app.go passes the distributed NATS client + store when distributed, nil otherwise. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * refactor(agentpool): back agent tasks with SyncedMap for cross-replica consistency AgentJobService.ListTasks read the process-local tasks map only, while ListJobs already read through the DB persister + dispatcher NATS - so in distributed mode a task created on one replica was invisible to the others. Back tasks with a syncstate.SyncedMap keyed by task ID (value schema.Task, the exact REST shape); jobs are left untouched. - Store adapter (task_syncstore.go) over the existing JobPersister (LoadTasks/SaveTask/DeleteTask); reads svc.persister/userID live so a persister swap needs no rebuild. No new persister methods required. - Task reads -> SyncedMap.List/Get; create/update -> Set (write-through + broadcast); delete -> Delete. The file persister now owns its own task set so the write-through path does not re-enter the SyncedMap lock (deadlock guard). - The distributed NATS client is not available at construction (start() precedes initDistributed), so it is injected via SetTaskSyncNATS, which rebuilds the still-empty map before Start/hydrate. Wired at the main, restart, and per-user (UserServicesManager) distributed sites. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * refactor(quantization): back jobs with SyncedMap + durable QuantStore QuantizationService kept jobs in a process-local map persisted only to a local state.json, so in distributed mode jobs were neither visible across replicas nor durable cluster-wide. Back jobs with a syncstate.SyncedMap keyed by job ID (value *schema.QuantizationJob, the exact REST shape). - New distributed.QuantStore (GORM, table quantization_jobs) mirroring FineTuneStore: Create/Get/ListAll/Upsert(idempotent)/Delete, registered for AutoMigrate via distributed.InitStores (Stores.Quant). - New adapter (quantization/syncstore.go) over QuantStore implementing syncstate.Store, with record<->schema conversion. - Reads go through List/Get, writes through Set/Delete (write-through + broadcast); state.json is kept as the standalone Loader for single-node restart recovery (stale-job fixups preserved). - app.go passes the distributed NATS client + QuantStore when distributed, nil otherwise; Start/Close lifecycle mirrors finetune. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(syncstate): annotate gosec G118 false positive on lifeCtx gosec flagged the WithCancel in Start as "cancellation function not called" because the returned cancel is stored on the struct rather than called/deferred in scope. It is invoked in Close (covered by tests), and lifeCtx must outlive Start to drive the reconnect/reconcile goroutines. Suppress the verified false positive with a justified #nosec G118. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * test(distributed): e2e two-replica SyncedMap sync over real NATS + Postgres Adds the real-infrastructure counterpart to the fake-bus unit tests, in the existing distributed e2e suite (testcontainers NATS + PostgreSQL). 
Two SyncedMap instances stand in for two frontend replicas - each with its OWN NATS connection to a shared server and a SHARED Postgres store (the distributed-mode invariant) - and assert, over the wire: - a create on replica A is observed by replica B; - an update and a delete propagate A -> B (delete prunes, which a reload cannot); - a late-joining replica recovers a job it never received a delta for, via store hydrate on Start (the at-most-once gap a fake bus cannot exercise); - a local Set is written through to the shared Postgres store. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
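The write path and apply path of the SyncedMap commit above can be sketched as below. This is a simplified Python model, not the Go component: `FakeBus`, `SyncedMapSketch`, the tuple-shaped delta, and the shared dict standing in for the durable Store are all assumptions for illustration. What it tries to show is the structural echo-loop guard: local writes go memory, then store, then broadcast, while applying a remote delta only touches memory.

```python
import threading


class FakeBus:
    """Synchronous in-process stand-in for the messaging client (illustrative)."""

    def __init__(self):
        self.handlers = []

    def subscribe(self, handler):
        self.handlers.append(handler)

    def publish(self, delta):
        for handler in list(self.handlers):
            handler(delta)


class SyncedMapSketch:
    def __init__(self, bus=None, store=None):
        self.bus = bus              # None means standalone mode
        self.store = store          # a shared dict plays the durable store here
        self.data = {}
        self.published = 0
        self.lock = threading.Lock()
        if bus is not None:
            bus.subscribe(self.apply)

    def start(self):
        # Hydrate from the source so a late joiner recovers missed deltas.
        if self.store is not None:
            with self.lock:
                self.data = dict(self.store)

    def set(self, key, value):
        with self.lock:
            self.data[key] = value
        if self.store is not None:
            self.store[key] = value              # write-through
        self._broadcast(("set", key, value))

    def delete(self, key):
        with self.lock:
            self.data.pop(key, None)
        if self.store is not None:
            self.store.pop(key, None)
        self._broadcast(("delete", key, None))

    def _broadcast(self, delta):
        if self.bus is not None:
            self.published += 1
            self.bus.publish(delta)

    def apply(self, delta):
        # Memory-only: never re-publishes and never rewrites the store.
        op, key, value = delta
        with self.lock:
            if op == "delete":
                self.data.pop(key, None)
            else:
                self.data[key] = value
```

Because `apply` never calls `_broadcast`, a delta received from a peer cannot trigger another publish, which is the property the commit calls a structural echo-loop guard.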
|
|
c548150f99 |
fix(distributed): missing agent NATS permission (#10549)
Signed-off-by: Nicholas Ciechanowski <nicholas@ciech.anow.ski> |
||
|
|
ec26b86dd4 |
docs: ⬆️ update docs version mudler/LocalAI (#10560)
⬆️ Update docs version mudler/LocalAI Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
d11b202dd2 |
fix(backends): whisper darwin run.sh loads whichever fallback lib exists (.so/.dylib) (#10553)
fix(backends): whisper darwin run.sh loads whichever fallback lib exists
The macOS branch hardcoded WHISPER_LIBRARY=$CURDIR/libgowhisper-fallback.dylib,
but the cmake build emits a Mach-O named libgowhisper-fallback.so on darwin, so
the Go loader panicked at runtime ("dlopen ...dylib: no such file") and the
backend exited ("grpc service not ready") — breaking e.g. the silero-vad-ggml
VAD on darwin. Pick whichever of .dylib/.so is present so it is robust to the
build's naming either way.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
v4.5.5
|
||
|
|
e95018ef70 |
chore(model gallery): 🤖 add 1 new models via gallery agent (#10544)
chore(model gallery): 🤖 add new models via gallery agent Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
0258f8af55 |
fix(backends): repair release CI build/test breaks (kokoros, fish-speech, llama-cpp-quantization, sglang) (#10547)
* fix(kokoros): implement new Backend RPCs to fix the build
The backend.proto grew five RPCs (SoundDetection, Depth, TokenClassify,
Score and the bidi-streaming Forward) that the kokoros gRPC service never
implemented, so the trait impl no longer satisfies `Backend`:
error[E0046]: not all trait items implemented, missing:
`sound_detection`, `depth`, `token_classify`, `score`,
`ForwardStream`, `forward`
kokoros is a TTS backend with no use for these, so add `unimplemented`
stubs (plus the `ForwardStream` associated type) matching the existing
pattern for every other unsupported RPC in this file.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(fish-speech): add setuptools-rust for the editable source install
install.sh installs the fish-speech source tree editable with
`--no-build-isolation`, which means the build backends of its transitive
dependencies must already be present in the venv. One of them builds a
Rust extension and its metadata step fails with:
ModuleNotFoundError: No module named 'setuptools_rust'
Add setuptools-rust to requirements.txt so installRequirements provisions
it before the editable install runs.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(llama-cpp-quantization): vendor convert_hf_to_gguf.py with conversion/
Upstream llama.cpp split the model-specific logic out of the single
convert_hf_to_gguf.py file into a sibling `conversion/` package, so the
script now starts with `from conversion import ...`. Downloading just the
one file therefore fails at runtime with:
ModuleNotFoundError: No module named 'conversion'
Clone the repo (reusing the clone already needed to build llama-quantize)
and copy both the script and the `conversion/` package into the backend
dir. Python puts the script's own directory on sys.path[0], so the package
resolves when it sits beside the script.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(sglang): pin the CPU source build to sglang v0.5.11
The CPU profile builds sgl-kernel from a `git clone` of sglang with no
ref, so it always tracks master. Recent master added CPU kernels (e.g.
mamba/fla.cpp) that fail to compile in our builder:
constexpr variable 'scale' must be initialized by a constant
static library kineto_LIBRARY-NOTFOUND not found
Pin the clone to v0.5.11, the same release the GPU path already floors on
(requirements-cublas12-after.txt). Overridable via SGLANG_VERSION so the
pin can be bumped deliberately.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
14b29ebf4e |
fix(backends): derive darwin RUN_BINARY from the exec line only (#10541)
golang-darwin.sh's packaging check derived the launch binary by grepping every $CURDIR/... reference in run.sh and taking the last one. Backends that pick a runtime CPU variant assign it via unquoted `LIBRARY=$CURDIR/libgo<x>-avx512.so` lines, so the heuristic returned `libgo<x>-avx512.so` - a variant Darwin never builds (arm64 builds only fallback) - and the check then failed with "package/libgo<x>-avx512.so not found ... refusing to package (#10267)", breaking the darwin builds for whisper, sam3-cpp, vibevoice-cpp and friends.
Scan only the `exec` line(s) (the actual launch contract) and tolerate a quoted `exec "$CURDIR"/<binary>`. parakeet-cpp's parakeet-cpp-grpc and the quoted-only backends (sherpa/piper/opus) resolve correctly; no Linux change.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
v4.5.4
|
||
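The exec-line scan from the commit above might look like this in outline. The actual check lives in a shell script; this Python sketch, with the hypothetical helper `exec_binaries` and an assumed regular expression, only illustrates the idea of ignoring `LIBRARY=$CURDIR/...` assignments and reading the launch target from `exec` lines, quoted or not.

```python
import re

# Matches: exec $CURDIR/bin, exec "$CURDIR"/bin, exec "$CURDIR/bin"
EXEC_LINE = re.compile(r'^\s*exec\s+"?\$CURDIR"?/([^\s"]+)')


def exec_binaries(run_sh_text):
    """Return the $CURDIR-relative targets of every exec line, in order."""
    found = []
    for line in run_sh_text.splitlines():
        match = EXEC_LINE.match(line)
        if match:
            found.append(match.group(1))
    return found


# Illustrative run.sh content, not copied from any real backend.
SAMPLE_RUN_SH = """#!/bin/bash
LIBRARY=$CURDIR/libgoexample-fallback.so
LIBRARY=$CURDIR/libgoexample-avx512.so
exec "$CURDIR"/example-backend "$@"
"""
```

With the sample above, the library assignment lines are never considered, so a CPU-variant library cannot be mistaken for the launch binary.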
|
|
f0d0bff232 |
fix(llama-cpp): stop reinterpreting plain-string message content as JSON (#10524) (#10538)
The llama-cpp gRPC backend reconstructs OpenAI messages from proto for the tokenizer-template path and blindly json::parse'd each message's content string. LocalAI's Go layer always flattens content to a plain string, so a user prompt that merely looks like JSON (e.g. mealie's ingredient array ["1/4 cup brown sugar", ...]) was reinterpreted as structured content parts and rejected by oaicompat_chat_params_parse with "unsupported content[].type".
Normalize content per role instead: user/system/developer content is opaque text and is never JSON-sniffed; assistant/tool content still collapses a literal JSON null/object (tool-call bookkeeping) to a string, but a plain string is never turned into an array/scalar. The array defense is role-independent, so the role gate only governs the benign null/object case.
While here, extract the duplicated per-message reconstruction and the pre-template content sanitization into shared, unit-tested helpers (message_content.h) so the streaming (PredictStream) and non-streaming (Predict) paths cannot drift. This removes ~490 lines of copy-pasted defensive code, the dead tool-role parse branches, and the redundant Predict-only tool_calls branch, while preserving the prior #7324 (null content -> "") and #7528 (tool array content -> string) fixes.
Tests:
- backend/cpp/llama-cpp/message_content_test.cpp: standalone C++ unit tests for all three helpers (#10524, #7324, #7528, multimodal), discovered and run by `make test-backend-cpp` and a new generic tests-backend-cpp CI job. Also wired as an opt-in CMake/ctest target (-DLLAMA_GRPC_BUILD_TESTS=ON).
- core/schema/message_test.go: Go regression pinning that ToProto flattens a JSON-array-looking text part to the verbatim string.
- prepare.sh now copies message_content.h into the build tree.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
v4.5.3
|
||
|
|
64150ca7ab |
fix(distributed): broadcast admin model-config changes across replicas (#10540)
In distributed mode the admin model endpoints (/models/edit, /models/import, /models/toggle-state and the PATCH config-json endpoint) wrote the YAML to the shared models dir but reloaded only the local replica's in-memory ModelConfigLoader. With multiple frontend replicas behind one service, a save landed on whichever replica handled the request; peers kept serving their stale in-memory view, so a load-balanced request was a coin-flip between old and new config (a created alias visible on one replica and missing on the other, an edited alias target diverging, etc.). The NATS cache-invalidation channel (SubjectCacheInvalidateModels + OnModelsChanged) already existed for the gallery install/delete path; these admin endpoints simply never published on it. Wire them up via a new GalleryService.BroadcastModelsChanged helper (no-op in standalone mode). Also fix delete propagation: LoadModelConfigsFromPath is additive and never drops an entry whose file is gone, so the subscriber hook (which only reloaded from disk) could not propagate a removal. ApplyRemoteChange now honors the event op - pruning the element on "delete" and reloading otherwise - and shuts down any running instance of the affected model so the new config takes effect. This closes the same latent gap on the gallery delete path. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
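The delete-propagation fix in the commit above rests on one observation: an additive reload can never drop an entry. A minimal Python sketch of the op-aware handler, assuming a hypothetical event shape (`{"op": ..., "name": ...}`), a dict of configs, and a set of running model names; none of these mirror the Go types.

```python
def apply_remote_change(configs, running, event, load_from_disk):
    """Apply a peer's model-config change to the local in-memory view."""
    name = event["name"]
    if event.get("op") == "delete":
        # Reloading from disk is additive, so removal must prune explicitly.
        configs.pop(name, None)
    else:
        configs.update(load_from_disk())
    # Stop any running instance so the new config takes effect on next load.
    running.discard(name)
```

The `else` branch is the pre-existing behaviour (reload from disk); the `delete` branch is what lets a removal reach peers at all.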
|
|
f98b0f1c1e |
fix(gpu-libs): bundle transitive deps of GPU runtime libs (#10537) (#10539)
fix(gpu-libs): bundle transitive deps of GPU runtime libs The per-vendor packagers in package-gpu-libs.sh copy an explicit allowlist of top-level GPU runtime libraries (libamdhip64, libhipblas, librocblas, the CUDA/Intel equivalents, ...) but never resolved their transitive dependencies. Backends run through the bundled lib/ld.so with LD_LIBRARY_PATH=lib, so any transitive dep not in the allowlist is a fatal "cannot open shared object file" at load time. On recent ROCm (base image rocm 7.2.1) the runtime libs link against librocprofiler-register.so.0, which is not in the allowlist, so the rocm llama-cpp backend (and every other GPU backend sharing this script) failed to load with: librocprofiler-register.so.0: cannot open shared object file The Vulkan path already solved this class of problem with copy_elf_deps (ldd-based transitive resolution), but that sweep was only wired into the Vulkan ICD path. This adds a generic sweep_transitive_deps that runs the same ldd resolution over everything the allowlist already bundled, and wires it into the ROCm, CUDA and Intel packagers. ldd returns the full recursive closure, so one pass suffices; core libc-family deps are skipped via is_core_lib so we never shadow the loader's own libc/libstdc++. Adds a self-contained regression test (gcc + ldd) that fabricates a primary lib linking a transitive lib and asserts the sweep bundles the dependency. Fixes #10537 Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
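The ldd-based sweep described in the commit above can be approximated as follows. The real implementation is a shell function in package-gpu-libs.sh; this Python sketch only parses ldd-style text. The `CORE_PREFIXES` list is an assumed stand-in for whatever `is_core_lib` actually checks, and the sample output (including its paths) is fabricated for illustration.

```python
# Assumption: a plausible set of libc-family names to skip; the real list may differ.
CORE_PREFIXES = (
    "libc.so", "libm.so", "libdl.so", "libpthread.so", "librt.so",
    "libstdc++.so", "libgcc_s.so", "ld-linux",
)


def is_core_lib(name):
    return name.startswith(CORE_PREFIXES)


def parse_ldd(output):
    """Map library name -> resolved absolute path from ldd-style output."""
    deps = {}
    for line in output.splitlines():
        if " => " not in line:
            continue                      # vdso and loader lines have no arrow
        name, _, rest = line.strip().partition(" => ")
        path = rest.split(" (")[0].strip()
        if path.startswith("/"):          # skips "not found" entries
            deps[name] = path
    return deps


def missing_deps(ldd_output, bundled):
    """Dependencies that are neither already bundled nor core libc-family libs."""
    return {
        name: path
        for name, path in parse_ldd(ldd_output).items()
        if name not in bundled and not is_core_lib(name)
    }


SAMPLE_LDD = """\
\tlinux-vdso.so.1 (0x00007ffd00000000)
\tlibrocprofiler-register.so.0 => /opt/rocm/lib/librocprofiler-register.so.0 (0x00007f0000000000)
\tlibstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f0000100000)
\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0000200000)
\t/lib64/ld-linux-x86-64.so.2 (0x00007f0000300000)
"""
```

Since ldd already reports the full recursive closure, running this once per bundled library is enough in the sketch, consistent with the commit's remark that one pass suffices.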
|
|
2c96c2d08e |
chore: ⬆️ Update mudler/parakeet.cpp to f469a57270a1cc4554acb15febf60e56619673b9 (#10530)
⬆️ Update mudler/parakeet.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
f01a969f7b |
docs: ⬆️ update docs version mudler/LocalAI (#10531)
⬆️ Update docs version mudler/LocalAI Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
56600eec3e |
fix(nodes): show a node's existing labels on the detail view (#10529)
fix(nodes): return labels in single-node GET so the detail view shows them The node detail view (/app/nodes/:id) reads `node.labels` to render a node's existing labels, but the single-node GET endpoint returned a bare BackendNode whose Labels live in a separate table - so the list was always empty and operators could only add labels, never see what was already set (#10527). The same response also lacked in_flight_count and model_count. Add NodeRegistry.GetWithExtras, mirroring the existing List vs ListWithExtras split: bare Get stays cheap for the routing hot paths and existence checks, while the detail endpoint uses the enriched variant to attach the labels map and live counts. No frontend change is needed - the UI already renders existing labels once the data is present. Closes #10527 Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
c4fa256cdf |
chore(model gallery): 🤖 add 1 new models via gallery agent (#10526)
chore(model gallery): 🤖 add new models via gallery agent Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
17c1fc74b2 |
fix(backends): darwin packaging for silero-vad (last Linux-only Go backend) (#10528)
fix(backends): darwin packaging for silero-vad silero-vad was the last Go backend with Linux-only darwin packaging: - package.sh fell through to "Could not detect architecture" -> exit 1 on macOS (no Darwin branch), so its darwin image never packaged. - run.sh exported LD_LIBRARY_PATH, which macOS dyld ignores, so the bundled libonnxruntime.dylib couldn't be found at runtime. Add a Darwin branch to package.sh (skip the glibc/ld.so bundling; add an @loader_path/lib rpath so @rpath resolves to package/lib/) and a DYLD_LIBRARY_PATH branch to run.sh — mirroring the piper darwin fix (#10525). Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
068d397acf |
fix(backends): set rpath on the piper darwin binary so it can load its bundled libs (#10525)
The metal-darwin-arm64-piper backend crashed at launch on macOS:
DYLD "Library missing"
Library not loaded: @rpath/libucd.dylib
Referenced from: .../piper
Reason: no LC_RPATH's found
The piper binary links libucd, libespeak-ng, libpiper_phonemize and
libonnxruntime via @rpath, but ships with no LC_RPATH, so dyld cannot
expand @rpath and aborts before piper runs. The libraries themselves are
already bundled in package/lib/ by package.sh.
Additionally, package.sh's architecture detection only handled the Linux
glibc loaders (/lib64/ld-linux-x86-64.so.2, /lib/ld-linux-aarch64.so.1)
and otherwise hit `echo "Error: Could not detect architecture"; exit 1`,
so on macOS packaging failed outright.
Add a Darwin branch (before the Linux checks) that skips the glibc/ld.so
bundling macOS has no use for and instead runs
`install_name_tool -add_rpath @loader_path/lib` on the piper binary, so
@rpath resolves to the bundled package/lib/ directory.
Also mirror sherpa-onnx/opus in run.sh: export DYLD_LIBRARY_PATH on
Darwin (LD_LIBRARY_PATH is Linux-only) as a defensive fallback.
Validated by hand on Apple Silicon: with the rpath added, piper
synthesized a real WAV. The darwin build is validated in CI.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
5b3572f8b8 |
feat(macos): sign and notarize the DMG, app, and server binary (#10510)
Produce a Gatekeeper-clean macOS distribution with no user workaround:
- Launcher DMG + the LocalAI.app inside it are built via fyne, codesigned with the Developer ID under the hardened runtime, then the DMG is signed, notarized (notarytool) and stapled. Replaces macos-dmg-creator (which had no signing hook) with fyne package + hdiutil so we control the .app before packaging.
- The bare local-ai darwin server binary is signed + notarized via GoReleaser's native notarize block (quill backend, runs on Linux).
- All signing is gated on secrets being present, so forks/PRs/local builds stay unsigned and green (contrib/macos/sign-and-notarize.sh no-ops).
- Add hardened-runtime entitlements and FyneApp.toml for deterministic packaging; update macOS install docs to drop the quarantine workaround.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
6afe127cd4 |
fix(backends): make the opus backend build and package on macOS/Darwin (#10523)
The opus Go backend (WebRTC audio codec) never built on macOS, so the published master-metal-darwin-arm64-opus image shipped source only - no opus binary and no libopusshim - because every step assumed Linux.
- Makefile: hardcoded libopusshim.so with no OS handling. Mirror sherpa-onnx: SHIM_EXT=so / dylib on Darwin and build libopusshim.$(SHIM_EXT). On Darwin link the shim with -undefined dynamic_lookup so it resolves opus_encoder_ctl from the already globally-loaded libopus (codec.go dlopens it RTLD_GLOBAL first) instead of baking an absolute Homebrew path into the dylib, keeping the packaged shim relocatable.
- run.sh: hardcoded LD_LIBRARY_PATH + libopusshim.so even on macOS. Add a Darwin branch exporting DYLD_LIBRARY_PATH and the .dylib shim, like sherpa-onnx/run.sh.
- package.sh: bundle libopusshim.$(SHIM_EXT) and libopus*.dylib (not just .so) into package/lib so the OCI image (which ships package/.) is self-contained on a runtime with no Homebrew; add a Darwin arch branch so it doesn't warn/skip.
- backend_build_darwin.yml: install + link opus and pkg-config via brew so the Makefile's `pkg-config opus` resolves on the macOS runner, and cache opus' Cellar dir.
Go code is unchanged; darwin build is validated in CI.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
v4.5.2
|
||
|
|
f58dcefed4 |
fix(backends): ship the package/ dir for darwin go backend images (#10522)
fix(backends): ship the package/ dir for darwin go backends
golang-darwin.sh packaged the whole backend source/build dir as the OCI image (backend/go/$BACKEND/.), so the runtime dylibs ended up under package/lib and backend-assets/lib while run.sh looks in $CURDIR/lib. As a result a backend like sherpa-onnx could not dlopen its libsherpa-shim.dylib at runtime and exited immediately (the model then 500s with "grpc service not ready"); it started fine only when run from inside package/.
Ship package/. instead - the self-contained run.sh + binary + lib/ bundle - matching the Linux Dockerfile.golang (`COPY .../package/. ./`). Backends that don't assemble a package/ fall back to the backend dir, and the binary-existence guard now checks the directory actually shipped.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
v4.5.1
|
||
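The "ship package/ if it exists, otherwise the backend dir, and check the binary in whichever directory is shipped" rule from the commit above can be sketched as follows. The function name `pick_ship_dir` and its arguments are hypothetical; the real logic lives in golang-darwin.sh and may differ in detail.

```shell
#!/bin/bash
# Hypothetical sketch of the directory-selection rule; not the actual
# golang-darwin.sh code.
# Usage: pick_ship_dir <backend_dir> <binary_name>
# Prints the directory to package, or returns 1 if the binary is missing
# from the directory that would actually be shipped.
pick_ship_dir() {
    local backend_dir="$1" binary="$2"
    local ship_dir="$backend_dir"

    # Prefer the self-contained bundle when the backend assembles one.
    if [ -d "$backend_dir/package" ]; then
        ship_dir="$backend_dir/package"
    fi

    # Guard against the directory that is shipped, not the source dir.
    if [ ! -f "$ship_dir/$binary" ]; then
        echo "binary '$binary' not found in $ship_dir" >&2
        return 1
    fi

    echo "$ship_dir"
}
```

Checking the binary in the shipped directory rather than the source directory is the point of the guard: a build that leaves the binary outside package/ then fails at packaging time instead of producing an image that cannot start.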
|
|
11b062f8f4 |
chore(model gallery): 🤖 add 1 new models via gallery agent (#10521)
chore(model gallery): 🤖 add new models via gallery agent Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
114eeaae81 |
feat(backends): make PreferDevelopmentBackends install the development image as primary (#10520)
When LOCALAI_PREFER_DEV_BACKENDS is set, install the -development image as the primary backend URI (keeping the released image reachable as the first fallback), instead of only reaching development as a download fallback when the released image is missing. This lets an operator force backends built from the development branch — e.g. to pick up a fix already on master before a release. Threads PreferDevelopmentBackends through SystemState so InstallBackend can see it, and reuses the same development-URI convention as the existing failure-path fallback (released tag -> branch tag + dev suffix). The unexported developmentURI helper is covered by a Ginkgo spec. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
d388f874de |
feat(backends): darwin/Metal build for the privacy-filter backend (#10513)
* feat(backends): darwin/Metal build for the privacy-filter backend (timeboxed try)
The privacy-filter.cpp engine is already Metal-capable on Apple Silicon: it pulls ggml and never forces GGML_METAL=OFF, and ggml defaults Metal ON on Apple, so a plain Darwin build is Metal-enabled. grpc++/protobuf resolve from Homebrew via find_package(... CONFIG). It just had no darwin build path - the existing package.sh and run.sh are Linux-only and there was no make target / workflow step.
Adds the bespoke darwin path, modeled on the ds4 one:
- scripts/build/privacy-filter-darwin.sh: native make grpc-server, otool -L dylib bundling, create-oci-image (no Linux package.sh).
- Makefile: backends/privacy-filter-darwin target (+ .NOTPARALLEL).
- .github/workflows/backend_build_darwin.yml: gated build step for privacy-filter.
- scripts/changed-backends.js: inferBackendPathDarwin special-case -> backend/cpp.
- .github/backend-matrix.yml: includeDarwin entry (lang go, like ds4/llama-cpp).
- backend/index.yaml: metal: capability + metal-privacy-filter(-development) entries.
- backend/cpp/privacy-filter/run.sh: DYLD_LIBRARY_PATH branch on Darwin.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* fix(privacy-filter): macOS proto include + bundle ggml dylibs
Validated natively on an M4 (the build/package/load chain now works with Metal):
- CMakeLists.txt: hw_grpc_proto compiles the generated proto/grpc sources but only linked the binary dir, so on macOS it could not find protobuf's headers (runtime_version.h) - Homebrew puts them under /opt/homebrew, not /usr/include. Link protobuf::libprotobuf + gRPC::grpc++ so their include dirs propagate. No-op on Linux (apt headers are already on the default search path).
- privacy-filter-darwin.sh: bundle the ggml shared libs the binary @rpath-links (libggml{,-base,-cpu,-blas,-metal}); the otool -L walk only catches on-disk absolute deps and missed them. Resolved at runtime by run.sh's DYLD_LIBRARY_PATH.
M4 check: arm64 grpc-server links @rpath/libggml-metal.0.dylib; with the 15 ggml dylibs + grpc/protobuf bundled, it loads clean (no dyld errors) and prints usage.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
86677495a2 |
chore: ⬆️ Update ggml-org/llama.cpp to 9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1 (#10514)
⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
253aedff06 |
chore: ⬆️ Update CrispStrobe/CrispASR to 8f1218141b792b8868861c1af17ba1e361b05dc0 (#10502)
⬆️ Update CrispStrobe/CrispASR Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
74f07ecc35 |
fix(backends): quote $CURDIR in run.sh (fixes backends in paths with spaces) (#10519)
fix(backends): quote $CURDIR in run.sh so backends work in paths with spaces The backend launcher scripts derive their own directory with CURDIR=$(dirname "$(realpath $0)") and then referenced it unquoted as $CURDIR (e.g. [ -f $CURDIR/lib/ld.so ], export LD_LIBRARY_PATH=$CURDIR/lib:..., exec $CURDIR/<binary> "$@"). When a backend is installed under a path that contains a space - notably macOS's ~/Library/Application Support/... - bash word-splits the unquoted $CURDIR, so the test builtin fails with "binary operator expected" and exec tries to run ".../Library/Application", yielding "No such file or directory". The backend never starts, surfacing as a gRPC "service not ready" error and an HTTP 500. Quote $CURDIR (and the realpath "$0") in every affected run.sh; no logic changes. Co-authored-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
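The word-splitting failure described in the `$CURDIR` fix above is easy to reproduce in isolation. The sketch below is illustrative only: the path and the `count_args` helper are made up for the demonstration, and no LocalAI script is invoked.

```shell
#!/bin/bash
# Illustrative only: shows how an unquoted expansion of a path containing
# a space is split into several words, while the quoted form stays intact.

# Print how many arguments the function received.
count_args() { echo "$#"; }

# Example path with a space, similar to macOS's Application Support dir.
CURDIR="/Users/me/Library/Application Support/backend"

unquoted=$(count_args $CURDIR/lib)    # split at the space: two words
quoted=$(count_args "$CURDIR/lib")    # one word, as intended

echo "unquoted=$unquoted quoted=$quoted"
```

With the unquoted form, `exec $CURDIR/<binary>` receives `/Users/me/Library/Application` as the command to run, which is the "No such file or directory" failure the commit message reports.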
|
|
ae0da454a7 |
chore: pin localrecall to tagged v0.6.3 (#10518)
#10517 pinned the pseudo-version of the postgres connection-timeout fix; mudler/LocalRecall@v0.6.3 now tags that exact commit. Use the clean release tag instead of the pseudo-version. No code change. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
179210b970 |
chore: bump localrecall for postgres per-connection timeouts (#10517)
* chore: bump localrecall for postgres per-connection timeouts
Pulls mudler/LocalRecall#49: sets lock_timeout / idle_in_transaction (default on) + opt-in statement_timeout on every pooled connection, so a corrupt/wedged index (e.g. a BM25 insert spinning on a buffer-content lock) can no longer hold its relation lock forever and head-of-line block the whole vector store.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* docs(agents): document PostgreSQL connection safety timeouts
Note the POSTGRES_LOCK_TIMEOUT / POSTGRES_IDLE_IN_TRANSACTION_TIMEOUT / POSTGRES_STATEMENT_TIMEOUT env vars read by the embedded vector store, and that safe defaults are on automatically.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
6c03e46390 |
chore: ⬆️ Update ikawrakow/ik_llama.cpp to b84902d2ad27c34f989f23947200c4b91b1568fd (#10515)
⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> |
||
|
|
f2ed63e39a |
docs(backends): make OS coverage explicit + require darwin support for new backends (#10516)
docs(backends): make OS coverage explicit + require darwin for new backends
The backend matrix is the source of truth for which OS a backend ships on, but that was never written down, so backends were landing Linux-only by default even when the engine builds fine on macOS.
- .github/backend-matrix.yml: header block documenting the two matrices (include = Linux, includeDarwin = macOS/Apple Silicon) and the policy that new backends target every OS they can build for.
- .agents/adding-backends.md: a 'Cover every OS' subsection in step 2 (full darwin wiring: includeDarwin entry, index.yaml metal: + metal-<backend> entries, run.sh DYLD branch + inferBackendPathDarwin case for C++ backends, the hw_grpc_proto protobuf/grpc link gotcha, and the path-filter touch) plus a verification-checklist item.
- AGENTS.md (CLAUDE.md): Quick Reference pointer so it surfaces every session.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
286c508ce0 |
feat(backends): darwin build for the localvqe backend (acoustic echo cancellation) (#10512)
feat(backends): darwin build for the localvqe backend
LocalVQE (acoustic echo cancellation / noise suppression / dereverberation) already builds on Darwin - its Makefile takes the OS=Darwin branch with GGML_METAL=OFF (upstream is CPU + Vulkan only), producing a native arm64 CPU image. It was just never wired into CI.
- .github/backend-matrix.yml: add localvqe to includeDarwin (build-type metal, lang go) - the darwin/arm64 build profile; the backend itself stays CPU.
- backend/index.yaml: metal: capability + concrete metal-localvqe(-development) entries pointing at the -metal-darwin-arm64-localvqe images.
- backend/go/localvqe/Makefile: note on the existing Darwin branch (also the per-backend change the CI path filter needs to build it here).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
d1a9d59917 |
feat(backends): darwin/Metal builds for vision C++/ggml backends (depth-anything, locate-anything, rfdetr-cpp, sam3-cpp) (#10511)
feat(backends): darwin/Metal builds for the vision C++/ggml backends
depth-anything-cpp, locate-anything-cpp, rfdetr-cpp and sam3-cpp already carry a Darwin/Metal path in their Makefiles (GGML_METAL=ON when build-type=metal), but were never wired into CI, so no Metal image was published and Apple Silicon could not install them.
- .github/backend-matrix.yml: add the four to includeDarwin (build-type metal, lang go), matching the other go+ggml *-cpp Metal entries.
- backend/index.yaml: add metal: to each backend's capabilities map (main and -development) plus concrete metal-<backend>(-development) entries pointing at the latest/master -metal-darwin-arm64-<backend> images.
- backend/go/*/Makefile: a one-line note on the existing Darwin branch (also the per-backend change the CI path filter needs to actually build them here).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
f72046b5b5 |
fix(auth): make advisory locks dialect-aware and harden SQLite DSN (#10509)
* fix(auth): make advisory locks dialect-aware and harden SQLite DSN
Fixes #10506. Two failures hit deployments that use the default SQLite auth database:
1. advisorylock executed PostgreSQL-only SQL (pg_advisory_lock / pg_try_advisory_lock) unconditionally. On a SQLite auth DB the job store, agent store and node registry migrations failed with "no such function: pg_advisory_lock". WithLockCtx/TryWithLockCtx now branch on the gorm dialect: PostgreSQL keeps the cross-process advisory lock, every other dialect uses a context-aware, per-key in-process lock (a SQLite auth DB is effectively single-process, so serializing within the process is sufficient).
2. The SQLite auth DSN set no busy timeout, so transient SQLITE_BUSY over network-backed storage (SMB/CIFS/NFS, e.g. Azure Files) failed the auth migration immediately with "database is locked". The DSN now sets _busy_timeout=5000 and _txlock=immediate (caller-supplied values are preserved). WAL is intentionally not enabled since its shared-memory mmap does not work over network filesystems. Docs note that PostgreSQL should be used when the data directory lives on shared storage.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* test(jobs): regression test for #10506 SQLite job store migration
Exercises the exact caller chain that failed in the issue: auth.InitDB(sqlite) -> jobs.NewJobStore -> advisorylock.WithLockCtx -> AutoMigrate. Before the dialect-aware advisory lock fix this failed with "no such function: pg_advisory_lock"; the test now asserts it migrates cleanly on a SQLite auth DB.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
79783120dd |
fix(config): gate parallel-slot default on per-device VRAM too (#10485) (#10507)
The first #10485 fix (#10494) made the Blackwell physical-batch boost per-device/context-aware, which neutralized the big compute-buffer OOM, but the reporter's 2x16 GiB consumer Blackwell still OOM'd. Tracing the post-fix log: the model now loads its weights, builds the main context and warms up fine, and dies only on the *last* allocation: the MTP draft context's 800 MiB KV cache on the tighter device.
#10411 changed only two defaults: the physical batch (now gated) and a VRAM-scaled parallel-slot count. The KV cache is unified (n_ctx_seq == full context proves slots share the budget, so parallel doesn't multiply KV), but n_seq_max=4 still adds per-slot compute-graph / context-checkpoint / output scratch. On a device packed ~99% by a 27B model spanning both cards, that overhead is the few-hundred-MiB straw, which is why reverting #10411 (and only #10411) restores a working load.
Gate the parallel-slot default on the same per-device headroom predicate as the batch boost: when a large context already fills a single card (largeContextForDevice), keep n_parallel=1. A user running one big-context model that barely fits across two consumer GPUs is not serving four concurrent tenants. Small contexts and large unified-memory devices (GB10) keep full concurrency. Applied on both the single-host path and the distributed router.
Also make the auto-tuning visible and reversible (the debugging here needed DEBUG logs and a git bisect):
- Log the effective performance-relevant runtime options at INFO once per model load ("effective runtime tuning …": context, n_batch, n_gpu_layers, parallel, flash_attention, f16) so an admin can see what will run and pin or override any value in the model YAML.
- LOCALAI_DISABLE_HARDWARE_DEFAULTS=true skips the hardware auto-tuning entirely (mirrors LOCALAI_DISABLE_GUESSING) for stock llama.cpp behavior.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
4ac67d255d |
feat: single-build ggml CPU_ALL_VARIANTS for llama-cpp + turboquant (x86/arm64/apple) (#10497)
* feat(llama-cpp): single x86 CPU build via ggml CPU_ALL_VARIANTS
Replace the per-microarch avx/avx2/avx512/fallback multi-binary build on
x86 with a single grpc-server plus the dlopen-able libggml-cpu-*.so set
that ggml's backend registry selects at runtime by probing host CPU
features. One build instead of four, broader microarch coverage (adds
alderlake AVX-VNNI, zen4 AVX512-BF16, sapphirerapids AMX), and the
shell-side /proc/cpuinfo probing in run.sh goes away.
Build/link notes:
- CPU_ALL_VARIANTS requires GGML_BACKEND_DL + BUILD_SHARED_LIBS=ON, so
ggml/llama become shared objects. SHARED_LIBS is now a make variable
(default OFF) so the override survives the recursive sub-make into the
VARIANT build dir instead of being re-clobbered by the base flags.
- The cpu-all target also builds "--target ggml": the per-microarch
backends are runtime-dlopened, not link deps, so they only compile via
ggml's add_dependencies().
- hw_grpc_proto is pinned STATIC. Under BUILD_SHARED_LIBS=ON it would
otherwise become a DSO referencing hidden-visibility symbols in the
static libprotobuf.a, which fails to link ("hidden symbol ... is
referenced by DSO"). Keeping it static links gRPC/protobuf into the
executable while only ggml/llama stay shared, so no PIC or base-image
change is required.
- package.sh bundles the libggml-*.so set into package/lib; ggml finds
them by scanning the bundled ld.so directory (/proc/self/exe), which
run.sh launches from.
Scope: x86 only. arm64/darwin keep the single fallback build. The
ik-llama-cpp / turboquant forks and the other ggml C++ backends are
unchanged; the same recipe applies but is out of scope here.
Validated with a full docker build plus a live inference smoke test:
the model loads, ggml selects the AVX512_BF16 variant on a Zen-class
host, and tokens generate correctly.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(llama-cpp,turboquant): extend CPU_ALL_VARIANTS to arm64 + turboquant
- llama-cpp: x86 AND arm64 now use the single llama-cpp-cpu-all build
(only hipblas keeps the fallback build). ggml's arm64 variant table
(armv8.x / armv9.x, plus apple_m* on darwin) is selected at runtime.
- turboquant: same recipe via a turboquant-cpu-all target. turboquant
copies backend/cpp/llama-cpp's CMakeLists.txt + Makefile per flavor, so
the hw_grpc_proto STATIC fix and the SHARED_LIBS / EXTRA_CMAKE_ARGS
make-vars are inherited; the target just passes SHARED_LIBS=ON, the DL
flags and --target ggml through, then collects the .so set. run.sh and
package.sh updated to ship/select turboquant-cpu-all.
- Makefile lib-collection find now also matches *.dylib (for the darwin
build, which emits dylibs rather than .so).
ik-llama-cpp is intentionally left unchanged: its pinned ggml has no
CPU_ALL_VARIANTS support and its IQK kernels require AVX2, so the
per-microarch dynamic backend set does not apply.
Scope still excludes the darwin packaging wiring (separate change).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(llama-cpp,turboquant): arm64 gcc-14 for SME variants + darwin cpu-all packaging
- arm64: ggml CPU_ALL_VARIANTS builds armv9.2 SME variants whose -march=...+sme
is rejected by the Ubuntu 24.04 default gcc-13. Build the arm64 variants with
gcc-14 (installed in the compile step). The host only selects a variant it
actually supports at runtime, but every variant must still compile.
- darwin: scripts/build/llama-cpp-darwin.sh builds llama-cpp-cpu-all instead of
the fallback binary, keeps Metal (GGML_METAL stays ON; --target ggml also builds
ggml-metal). The per-microarch libggml-cpu-*.dylib are placed in the package
root next to the binary (darwin has no bundled ld.so, so ggml's executable-dir
scan looks there), while the other shared dylibs go in lib/ for DYLD_LIBRARY_PATH.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(llama-cpp-darwin): distribute ggml backends by suffix (.so root, .dylib lib)
ggml emits its loadable backends (per-microarch CPU variants, metal, blas) with a
.so suffix even on darwin, while the core libraries (ggml-base/ggml/llama/
llama-common/mtmd) use .dylib. Split the distribution by suffix: .so DL backends
go in the package root for ggml's executable-directory scan, .dylib core libs go
in lib/ for DYLD_LIBRARY_PATH. The previous .dylib name-pattern matched none of the
variants.
Verified on an M4: ggml loads the apple_m4 CPU variant (SME=1) and Metal, model
loads and generates correct tokens.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(llama-cpp,turboquant): only CPU_ALL_VARIANTS for pure-CPU builds, GPU uses fallback
The previous gate sent every non-hipblas build through llama-cpp-cpu-all, so the
GPU image builds (cublas, sycl_f16/f32, vulkan, nvidia l4t) compiled the whole CPU
microarch variant matrix on top of their already-huge GPU backend - blowing the
build time (the sycl job was only 59% done after 2h11m) - and the arm64 l4t build
failed at `apt-get install gcc-14` (exit 100) on the Jetson base.
Gate on an empty BUILD_TYPE instead: only the pure CPU image (build-type: '' in
.github/backend-matrix.yml) builds the CPU_ALL_VARIANTS set; every GPU build gets a
single fallback CPU grpc-server, since the accelerator does the compute. This also
confines the arm64 gcc-14 step (needed for the armv9.2 SME variants) to the CPU
build, away from the GPU base images.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* docs(llama-cpp): correct run.sh comment for arm64/darwin cpu-all
arm64 and darwin CPU images now also ship llama-cpp-cpu-all (not fallback-only);
only GPU images ship fallback-only. Fix the stale comment to match.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
|