Mark the Phase 6 serving classifier complete, preserve the old parity final as historical, and scope Phase 7 source candidates with explicit md5 and op gates.
Assisted-by: Codex:gpt-5
Record both-engine serving nsys buckets, rejected sampler short-circuit, and rejected GDN/MMQ env grids for the GB10 parity work.
Assisted-by: Codex:gpt-5
Record the Phase 5 Wq shared-memory padding experiment, its gates, sub-threshold benchmark gain, and the decision to ship no 0051 patch.
Assisted-by: Codex:gpt-5
Mirror fork commit d9b9be0be as patch 0050 and record the Phase 4 W4A16 shared-memory padding gates, benchmarks, and mirror verification.
Assisted-by: Codex:gpt-5
Record the Phase 3 scale-broadcast experiment, its md5 and MUL_MAT_ID gates, the prefill regression, and the decision to ship no 0050 patch.
Assisted-by: Codex:gpt-5
Mirror fork commit 7dfa0e175 as patch 0049 and record the Phase 2 GB10 W4A16 shape sweep, md5 gates, MUL_MAT_ID checks, and mirror verification.
Assisted-by: Codex:gpt-5
Mirror the fork-first W4A16 packed tile metadata commit into the LocalAI paged patch series, record the Phase 1 benchmark result, and keep the implementation plan checked off.
Assisted-by: Codex:gpt-5
Record the clean forced W4A16 baseline, default comparison, selected metadata target, and completed plan checkpoint for the GB10 parity reopen.
Assisted-by: Codex:gpt-5
Record the clean DGX build retry, binary provenance, canonical greedy md5 gates, and completed plan steps for the GB10 parity reopen.
Assisted-by: Codex:gpt-5
Add the Superpowers implementation plan for the GB10 parity reopen, including Phase 0 provenance, decode repro, W4A16 kill gates, and later kernel workstream entry criteria.
Assisted-by: Codex:gpt-5
The opt-in hybrid per-head bf16 SSM-state lever (ssm_bf16_tau, patch 0026) is
removed from the llama-cpp-localai-paged patch series. Clean re-measurement after
the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
landed shows it buys nothing: forcing ALL gated-DeltaNet heads to bf16
(tau=100000, the most aggressive setting) gives flat decode throughput, 780.6 vs
780.0 t/s. The mode engages but adds zero speed because it is subsumed by the
fusions. The earlier "+12%" was measured before the fusions completed. bf16-tau
was a precision trade (not bit-exact, ~91% same-top-p) plus extra bug surface and
extra CUDA template-instantiation compile cost with no offsetting benefit.
Dependency check: no later patch (0028/0029/0030) depends on 0026. 0030's only
mention is a description comment; its code keys off fused_gdn_ar/ch/auto_fgdn,
which originate in 0018/0019/0021 (before 0026). The remaining series (0001-0025,
0028-0030) applies clean with git apply --check against the pin
0ed235ea2c17a19fc8238668653946721ed136fd. The Makefile applies the series by glob
(patches/paged/0*.patch); the resulting gap at 0026 is tolerated (0005/0027 are
already absent).
Removed:
- patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch
- the dead ssm_bf16_tau / ssm_hybrid_tau option handler in the shared
grpc-server.cpp (it only set LLAMA_SSM_BF16_TAU, now a no-op the library no
longer reads)
- the patched+bf16-tau benchmark columns and llama-patched-bf16tau rows
(README + final_benchmark.csv), the ssm_bf16_tau option text in backend
index.yaml, the gallery NOTE block, and the docs/features/backends.md mention.
The rejected-lever lesson is kept (why it was dropped: subsumed, tau=100000 flat)
in the backend README section 5, the paged-backend agent guide, and the
vLLM-parity methodology, so it is not re-tried.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
master auto-bumped the stock llama-cpp pin 9d5d882d -> 0ed235ea and updated the
shared grpc-server.cpp. The paged backend's pin must track the stock pin (the
grpc-server.cpp is shared), so bump its LLAMA_VERSION to match. All 28 paged
patches apply clean on 0ed235ea (verified against a fresh upstream clone). The
bf16-tau state-serialization fix (patch 0026) is included. Bit-exact gate + full
grpc-server build verify on GPU/CI to follow.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(voice-detect): add Go purego backend for voice-detect.cpp
Add backend/go/voice-detect implementing the Backend gRPC voice subset
(VoiceEmbed/VoiceVerify/VoiceAnalyze) over libvoicedetect.so via purego,
mirroring the parakeet-cpp / omnivoice-cpp backends.
The flat voicedetect_capi C ABI is dlopen'd cgo-less; malloc'd string and
float-vector returns are owned by Go and released through the matching capi
free functions, with the per-ctx last error surfaced into Go errors. Calls are
serialized via base.SingleThread since the C context is not reentrant.
Proto field mapping:
- VoiceEmbed: VoiceEmbedRequest.audio (path) -> embed_path -> Embedding+Model.
- VoiceVerify: audio1/audio2 + threshold (<=0 falls back to the
verify_threshold option, default 0.25) -> verify_paths -> verified/distance/
threshold/confidence/model/processing_time_ms.
- VoiceAnalyze: audio (path) -> analyze_path_json; the JSON age/gender/emotion
document maps to a single VoiceAnalysis segment (start/end 0; gender "label"
-> dominant_gender with the remaining float scores as the gender map; emotion
label/scores -> dominant_emotion/emotion).
The Makefile pins voice-detect.cpp to 47546430, clones+builds libvoicedetect.so
with ggml static-linked (PIC, GGML_NATIVE off) so dlopen needs no external
libggml/libvoicedetect; ldd on the artifact shows only system libs. Ginkgo
tests cover option parsing and analyze-JSON mapping; embed/verify smoke specs
gate on VOICEDETECT_BACKEND_TEST_MODEL + VOICEDETECT_BACKEND_TEST_WAV.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(voice-detect): wire backend into index, gallery and build
Register the voice-detect.cpp speaker-recognition + voice-analysis
backend (added in Voice-INT-A) into LocalAI's distribution surfaces,
mirroring the ced backend (the closest mudler C++/ggml audio analogue):
- backend/index.yaml: add the &voicedetect meta-backend (capabilities
platform map, no top-level uri) plus the full set of concrete per-arch
image entries (cpu/cuda12/cuda13/metal/rocm/sycl/vulkan/l4t and the
-development variants). Referential integrity audited - every alias
target resolves.
- gallery/index.yaml: add 5 model entries on backend voice-detect -
ECAPA-TDNN, WeSpeaker ResNet34, 3D-Speaker ERes2Net, CAM++ and the
wav2vec2 age/gender/emotion analyze model. The engine architecture is
read from GGUF metadata (voicedetect.arch) at load. GGUF artifacts are
not yet published: each files: entry points at the intended
mudler/voice-detect-gguf location with a TODO to fill sha256 after
upload (no fabricated hashes).
- .github/backend-matrix.yml: add the linux build matrix block + the
darwin metal entry mirroring ced.
- .github/workflows/bump_deps.yaml: track mudler/voice-detect.cpp via
VOICEDETECT_VERSION (pin 47546430, = 4754643).
- core/config/backend_capabilities.go: register voice-detect in the
backend capability map (VoiceVerify/VoiceEmbed/VoiceAnalyze ->
speaker_recognition), mirroring speaker-recognition.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(face-detect): add purego Go backend for face-detect.cpp
Add the LocalAI Go backend that dlopens libfacedetect.so (the flat
facedetect_capi_* C-ABI) via purego, mirroring the sibling voice-detect
backend. Implements the Face subset of the Backend gRPC service:
- Embeddings(PredictOptions): Images[0] base64 -> temp file -> embed_path
-> L2-normalized ArcFace embedding.
- Detect(DetectOptions): src -> detect_path_json -> Detection boxes
(class_name "face", [x1,y1,x2,y2] -> x/y/w/h).
- FaceVerify(FaceVerifyRequest): two images + threshold + anti_spoof ->
verify_paths; best-effort img areas via detect.
- FaceAnalyze(FaceAnalyzeRequest): img -> analyze_path_json -> per-face
age + gender ("M"/"F" normalized to "Man"/"Woman").
The Makefile pins face-detect.cpp to 636a1963 and builds the shared lib
with ggml + vendored libjpeg-turbo static (PIC), so the .so is
ldd-clean (no libggml) and exports only facedetect_capi_* (no jpeg_
symbols). Gated Ginkgo e2e mirrors voice-detect.
Note for the gallery-wiring task: backend registration (index.yaml,
gallery, core/config/backend_capabilities.go) is intentionally not
touched here.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(voice-detect): replace em dashes in net-new descriptions
Project style forbids em/en dashes. Replace the three U+2014 chars
introduced by the voice-detect gallery/index wiring with `-`/`:`.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(face-detect): wire backend into index, gallery and build
Register the face-detect.cpp face detection / embedding / verification /
analysis backend (added in Face-INT-A) into LocalAI's distribution
surfaces, mirroring the voice-detect wiring (the closest mudler C++/ggml
recognition analogue):
- backend/index.yaml: add the &facedetect meta-backend (capabilities
platform map, no top-level uri to avoid the meta-backend gotcha) plus
the full set of concrete per-arch image entries (cpu/cuda12/cuda13/
metal/rocm/sycl-f16/sycl-f32/vulkan/l4t and the -development variants),
22 entries. Referential integrity audited: every alias target resolves.
- gallery/index.yaml: add 4 model entries on backend face-detect -
face-detect-buffalo-l/m/s (insightface SCRFD + ArcFace/MBF, NON-COMMERCIAL)
and face-detect-yunet-sface (OpenCV-Zoo YuNet + SFace, APACHE-2.0, the
commercial-friendly alternative). The detector/embedder architecture is
read from GGUF metadata (facedetect.arch) at load; only the real
verify_threshold option is set (0.35 buffalo, 0.363 sface). GGUF
artifacts are not yet published: each files: entry points at the
intended mudler/face-detect-gguf location with a TODO to fill sha256
after upload (no fabricated hashes).
- core/config/backend_capabilities.go: register face-detect in the
backend capability map (Embedding/Detect/FaceVerify/FaceAnalyze ->
face_recognition), mirroring insightface.
- .github/backend-matrix.yml: add the linux build matrix block + the
darwin metal entry mirroring voice-detect.
- .github/workflows/bump_deps.yaml: track mudler/face-detect.cpp via
FACEDETECT_VERSION (pin 636a1963).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(recon): voice-detect metal build branch + face-detect gallery usecases
Add the missing metal BUILD_TYPE branch to the voice-detect Makefile
forwarding -DVOICEDETECT_GGML_METAL=ON, mirroring face-detect, so the
darwin metal CI artifact is built with the Metal backend instead of
CPU-only.
Expand the 4 face-detect gallery models' known_usecases to
[face_recognition, detection, embeddings] to match the backend
capabilities map and the mirrored insightface-buffalo entries, so
auto-selection for /v1/detect and /embeddings works.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* docs(recon): document voice-detect and face-detect ggml backends
Document the new standalone C++/ggml biometric backends as the
recommended/default option for face and voice recognition, keeping the
existing Python insightface / speaker-recognition backends framed as the
legacy path.
- features/face-recognition.md: add a face-detect (ggml) backend section
with the gallery entries (buffalo-l/m/s non-commercial, yunet-sface
Apache-2.0), licensing, and verify/detect/analyze quickstart.
- features/voice-recognition.md: add a voice-detect (ggml) backend
section with the gallery entries (ecapa-tdnn, wespeaker-resnet34,
eres2net, campplus speaker recognizers; emotion-wav2vec2 non-commercial
analyze head) and quickstart.
- reference/compatibility-table.md: add face-detect.cpp and
voice-detect.cpp rows to the Vision, Detection & Recognition table.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(gallery): publish recon backend GGUF uris + sha256
Fill in the published HuggingFace GGUF uris and verified sha256 for the
9 recon gallery entries (voice-detect-* and face-detect-*), and remove
the TODO publish markers. Correct the eres2net, campplus, and
emotion-wav2vec2 uris to the actual published filenames.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(gallery): re-embed buffalo anti-spoof + add audeering age/gender voice model
Update the 3 buffalo face-detect GGUF sha256 (anti-spoof ensemble now
embedded and re-uploaded under the same filenames/uris) and note the
FaceVerify anti_spoof request flag in each description. Add a new
voice-detect-age-gender-wav2vec2 gallery entry mirroring the emotion
model.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(gallery): add face-detect-buffalo-sc and antelopev2 packs
Add gallery entries for two newly-published insightface face packs on
the face-detect backend: buffalo_sc (smallest pack, SCRFD-500M + small
ArcFace) and antelopev2 (higher-accuracy, SCRFD-10G + ArcFace glint360k
R100, 512-d). Both are non-commercial research-only.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(recon): honor LocalAI per-model threads in voice/face-detect backends
LocalAI spawns one backend process per model and serves requests
concurrently, so the engines' own min(hardware_concurrency, 8) default
can oversubscribe cores. Forward the per-model Threads value from the
gRPC LoadModel options into the engine via VOICEDETECT_THREADS /
FACEDETECT_THREADS (read at backend construction) before the capi load.
A non-positive Threads is treated as unset, leaving the engine default.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump backend pins to CPU-optimized engine commits
voice-detect.cpp -> 0d9c1b3 (radix-2 FFT FBank, threads, flash attn + cached
pos-conv); face-detect.cpp -> 523aee1 (thread-gated direct conv, threads).
Brings the CPU optimizations into the LocalAI backend builds. GGUF format and
parity unchanged, so the published HF GGUFs remain valid.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump backend pins to round-2 CPU-optimized engines
voice-detect.cpp -> fe7e6a3 (ERes2Net 1x1->mul_mat, CAM++ layout+context,
wav2vec2 conv-LN, ECAPA capture-drop, AVX512 dispatch opt-in); face-detect.cpp
-> 9c8adb7 (AVX2 Winograd F(2x2,3x3) for SCRFD/ArcFace 3x3 convs, ArcFace
BN-fold). Parity unchanged (cosine=1.0); GGUF format unchanged, HF GGUFs valid.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump backend pins to round-3 Winograd engines
voice-detect.cpp -> 45122ec (Winograd F(2x2,3x3) for WeSpeaker/ERes2Net 3x3
convs, -22%/-20% @8t); face-detect.cpp -> cd5c962 (Winograd F(4x4,3x3) for
SCRFD large maps, -22% @1t on top of F(2x2), more load-stable). Parity held
(cosine=1.0); GGUF format unchanged, HF GGUFs valid.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump backend pins to round-4 Winograd engines (CPU opt complete)
voice-detect.cpp -> d2839ca (CAM++ FCM 2D convs through Winograd, -15.5%/-10.3%);
face-detect.cpp -> c1db23d (AVX2-vectorized Winograd tile transforms, SCRFD
detect -14%/-9.6%). Final CPU optimization round; the conv-kernel lever class is
now exhausted (parity held cosine=1.0; GGUF/parity unchanged, HF GGUFs valid).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump face-detect pin to deep-kernel engine (7ae5c4d)
face-detect.cpp -> 7ae5c4d: register-blocked winograd-domain GEMM microkernel
(2.8x isolated GFLOP/s), AVX-512 zmm evolution behind runtime CPUID dispatch
(ship-safe, AVX2 fallback bit-identical), bias/relu fused into the winograd
output transform, and SFace Conv+BN fold + bias/PReLU fusion. SCRFD detect
~1.4x faster end-to-end vs the round-4 baseline; parity bit-exact; portable
single binary (function-multiversioned, no global -mavx512f).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump voice-detect pin to ECAPA operand-order win (e9c56ae)
voice-detect.cpp -> e9c56ae: weight-as-src0 mul_mat order in ECAPA's F32
conv1d_same (routes through tinyBLAS sgemm); ECAPA embed 1.67x @1t / ~1.3x @8t,
parity cosine=1.0. Isolated to encoder.cpp (ECAPA-only); ERes2Net/CAM++/WeSpeaker
do not call conv1d_same so are provably unaffected.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump pins to FMA-throughput engines (voice f7b9f89, face 2d2d5f0)
face -> 2d2d5f0: route ArcFace 3x3 body convs through the AVX-512 winograd
microkernel (kWinoMinSize 80->14); ArcFace 1.62x @1t, SCRFD detect to 0.966 of
MLAS @1t, no regression. voice -> f7b9f89: runtime-CPUID-dispatched AVX-512
winograd-GEMM microkernel (ship-safe, AVX2 fallback bit-identical); WeSpeaker
1.90x @1t. Parity cosine=1.0 throughout; portable single binaries.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump pins to MLAS-class direct-conv engines (voice 7ecfd07, face be22d67)
Hand-tuned nChw16c AVX-512 register-tiled direct-conv microkernel (~263 GFLOP/s,
within 6-7% of MLAS per-op efficiency), runtime-CPUID-dispatched + AVX2 fallback,
fused bias/relu. voice 7ecfd07: default 3x3-s1 kernel for WeSpeaker (+37%/+32%)
+ ERes2Net, CAM++ pinned to Winograd. face be22d67: shape-gated to the ArcFace
recognizer body (+25-27% @8t); SCRFD detector stays on Winograd (no regression).
Parity cosine=1.0 / detect <=1px on AVX-512 + AVX2 paths. Portable single binaries.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump voice pin to Phase-A blocked backbone (f4e7eef)
WeSpeaker ResNet34 runs as one nChw16c blocked island (2 reorders/forward vs
~60) on AVX-512, default; per-conv directconv fallback on AVX2. +2.9% @1t /
+17-19% @8t vs per-conv directconv, parity cosine=1.0. The conv microkernel is
already FMA-bound near peak (~0.86-0.98x MLAS-implied); residual to MLAS is
sub-peak edge + non-conv tail, documented in docs/cpu-optimization.md.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump pins to breadth blocked-backbone (voice 7f66871, face d80092b)
voice 7f66871: AVX2-vectorized (ymm) blocked island - AVX2-only hosts now run
the blocked backbone for WeSpeaker (2.3x over per-conv-AVX2, cosine=1.0);
ERes2Net stays per-conv (blocked regresses, opt-in only); CAM++ Winograd-pinned.
face d80092b: ArcFace recognizer blocked island, AVX-512 default (-13% @8t, ~0.90x
MLAS, the closest conv result), auto per-conv on AVX2; SCRFD untouched on Winograd
(0 island invocations during detect). Parity cosine=1.0 / detect <=1px throughout.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump pins to small-spatial + stem conv kernels (voice 99b1804, face 47fdab6)
Measured-gap-driven conv kernels: small-spatial (fill the register tile when
output width <= tile width) + small-IC stem + strided-1x1/downsample recovery.
ArcFace recognizer 0.57 -> 0.70x MLAS @1t (the closest conv model), WeSpeaker
0.65 -> 0.79x @1t. Parity cosine=1.0 / detect <=1px. The OC-block-sharing lever
was a measured dead-end (deep stride-1 is L3-weight-bandwidth bound, not
read-port bound) and was NOT shipped. Kernel ceiling reached; further gap needs
an algorithm-class change (cache-blocked weight-stationary GEMM, or q8 weights).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump pins to GPU persistent-graph + multi-model-safe cache (voice 45d2e6b, face 0a4799a)
GPU wins (CUDA/ggml backend, no CPU-path change): persistent per-shape graph+context
cache in Backend::compute() eliminates the per-call cudaGraph re-instantiation churn
-> wav2vec2 emotion+age-gender now AT GPU parity with torch-cuDNN on GB10 (0.97-0.98x),
CAM++ -5.7ms; bit-identical parity. Cache hardened multi-model-safe (invalidate-on-free
keyed by the ModelLoader weights buffer) so LocalAI multi-model hosting cannot stale-hit.
Conv models still trail cuDNN (im2col-materialization-bound) - cuDNN implicit-GEMM lever next.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump pins to cuDNN-conv-capable engines (voice b6e4356, face 6107a24)
Adds the opt-in cuDNN implicit-GEMM conv path (VOICEDETECT_GGML_CUDNN /
FACEDETECT_GGML_CUDNN, DEFAULT OFF -> zero build/runtime dep until enabled).
On GPU it kills the im2col-materialization bottleneck and reaches torch-cuDNN
parity on the spill-bound convs: SCRFD detect 14.8->6.4ms (2.3x, ~parity),
WeSpeaker ~parity, ERes2Net beats torch (1.10x); ArcFace/CAM++ neutral (no
spill). Parity exact (SCRFD <=1px, cosine=1.0). To USE it in LocalAI, the CUDA
backend build must enable the flag AND bundle libcudnn - deferred until a
cuDNN-bundled GPU image; flag stays OFF here.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(recon): enable cuDNN conv path on arm64+CUDA13 recon backends
The voice-detect.cpp / face-detect.cpp engines have an opt-in cuDNN
implicit-GEMM conv path behind VOICEDETECT_GGML_CUDNN / FACEDETECT_GGML_CUDNN
(default OFF) that kills im2col on the GPU and reaches torch-cuDNN parity
(SCRFD 2.3x, WeSpeaker/ERes2Net parity), measured on the GB10
(arm64, CUDA 13, sm_121a).
Enable it for the CUDA build, but only where cuDNN actually ships: the
arm64 + CUDA 13 image (GB10/Jetson/L4T). x86 CUDA images carry no cuDNN,
so flipping it on globally for BUILD_TYPE=cublas would be a link failure.
The Makefiles gate on CUDA_MAJOR_VERSION=13 + arch (TARGETARCH from the
matrix/Docker build, uname -m fallback for local builds).
backend/Dockerfile.golang already installs the runtime libcudnn9-cuda-13
in the arm64+CUDA13 apt block; add the matching libcudnn9-dev-cuda-13 so
the build-time link resolves.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): bump voice-detect pin to ERes2Net blocked-default (30beecd)
Defaults VD_ERES2NET_BLOCKED ON: routes the ERes2Net Res2Net body through the
blocked nChw16c AVX-512 directconv island instead of the 1x1 mul_mat fast path
(CONT-transpose + skinny low-K GEMM). On the shipped GGML_NATIVE=OFF build (ggml
mul_mat is AVX2-only) this wins ~2x at every thread count (2.07x@1t, 2.2x@4t,
2.05x@8t); pure-AVX2 fallback still 1.3-1.62x. Parity exact (cosine=1.000000 vs
golden), so registered voices + verify/identify thresholds are unaffected. The
prior default-OFF rested on a stale comment whose 23pct regression only held on
the non-shipping GGML_NATIVE=ON build.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* docs(readme): announce native voice-detect + face-detect backends in Latest News
Add a Latest News entry for the new from-scratch C++/ggml biometric backends
(voice-detect.cpp + face-detect.cpp) that replace the Python insightface and
speaker-recognition backends: no Python/onnxruntime at inference, self-contained
GGUF, bit-exact parity, GPU cuDNN parity. Mirrors the parakeet.cpp /
locate-anything.cpp native-backend news entries. Refs PR #10441.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(recon): re-pin to the squashed engine release commits
The voice-detect.cpp and face-detect.cpp histories were squashed to a single
release commit, which orphaned the previous pins (voice 30beecd, face 6107a24).
Re-pin to the new single-commit SHAs (voice 3d51077, face 06914b0); the tree is
identical, so the backend build is unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
In distributed mode, even when the frontend and workers share the same
models directory via a shared volume mount, starting a model on a worker
re-staged (re-downloaded) it: stageModelFiles always uploads model files
into a tracking-key-namespaced subdir on the worker, and the staging probe
only checks that staged location, so a file already present on the shared
volume at the canonical path was never reused.
Add a config switch LOCALAI_DISTRIBUTED_SHARED_MODELS (default false). When
enabled, the operator asserts that all nodes mount the SAME models directory
at the SAME path, so staging is unnecessary: the frontend's absolute model
paths are already valid on the worker. In that mode stageModelFiles returns
the cloned opts unchanged without uploading, leaving the path fields pointing
at their canonical absolute paths so the worker loads them directly from the
shared volume.
The value is plumbed from DistributedConfig through SmartRouterOptions into
the SmartRouter. Docs and docker-compose.distributed.yaml updated.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv,
dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv.
Restore the invariant that patches/ holds only the .patch series.
Moves:
- patches/paged/README.md -> README.md (canonical doc at the backend root)
- patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md,
final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/
- patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README)
Deletes:
- patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section)
- patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide)
Repoint every reference to the moved files: README internal links (docs/ + the
.github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md,
.github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml,
the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml,
docs/content/features/backends.md, gallery/index.yaml.
The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged)
is unchanged and still resolves to the 28 patches.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Move ALL paged-attention content out of the stock backend/cpp/llama-cpp
backend and into backend/cpp/llama-cpp-localai-paged, so the stock backend is
pure upstream llama.cpp and the paged backend owns and applies its own vendored
patch series.
- Delete the dead early-exploration scaffold backend/cpp/llama-cpp/paged/
(kernel/w4a16 Marlin scaffold, standalone paged_kv_manager, bench/loadgen,
its own 0001-0002 patches, dense-era design docs, tests). Zero references
repo-wide.
- Move backend/cpp/llama-cpp/patches/ (the 28-patch paged series + paged/README
+ 3 operational docs, plus the kernel/ scaffold patch and the top-level paged
README/BENCHMARKS) to backend/cpp/llama-cpp-localai-paged/patches/. The stock
backend keeps no patches/ dir; it had no non-paged base patches.
- Purify the stock backend: remove the LLAMA_PAGED make variable, the
patches/paged apply loop, and the LLAMA_PAGED passthrough to prepare.sh;
remove the paged-series handling from prepare.sh. The stock llama.cpp target
now only clones the pin and applies its own (currently empty) base patches/
series. The runtime paged option hooks in the shared grpc-server.cpp are
untouched (inert without the patches).
- The paged backend's Makefile now applies its OWN patches/paged/0*.patch onto
each freshly cloned tree via strict git apply (apply-paged-patches), after the
copied stock infra clones the pin and applies base patches.
- Repoint every reference to the old patches/paged path: the upstream canary
workflow + apply script, bump_deps.yaml, gallery/index.yaml, the docs,
backend/index.yaml, backend-matrix.yml, the top-level Makefile comments, and
the moved PIN_SYNC / README docs. Drop the now-removed LLAMA_PAGED=on
build-toggle from comments.
Verified: the full 28-patch series applies strict-clean (git apply, exit 0) to
a clean ggml-org/llama.cpp checkout at the pinned c299a92c, and the repointed
canary apply script resolves and applies the series end to end.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The paged-attention patch directory had accumulated ~55 scattered dev docs
(results, progress, scope, lever, and gap-analysis notes). Consolidate the
durable content of all of them into one canonical
backend/cpp/llama-cpp/patches/paged/README.md covering: what the patchset is,
the architecture (paged KV + block-table flash-attn, the gated-DeltaNet SSM
decode path, NVFP4 FP4-MMA, the decode-first scheduler), the full 0001-0030
patch series table with bit-exact status, the GB10 benchmarks
(patched-vs-stock-vs-vLLM + the Apple M4 architectural note), the dev notes
(bit-exact methodology, the per-path gate, the MoE-parity conclusion, the
rejected/flat levers, the opt-in bf16-SSM mode), arch+quant generality, the
pin + canary maintenance policy, and the published NVFP4 gallery models.
Delete the consolidated-away dev trail. Keep the three operational docs the
README links to: PIN_SYNC_c299a92c.md (canary reference), PAGED_BITEXACT_NOTE.md
(per-path gate reference) and LOCALAI_LLAMACPP_BACKEND_PLAN.md (the
ship-as-own-backend design-of-record), plus the benchmark plots + csv. The
.patch files and the unit/bench .cpp are untouched.
Repoint every external reference to a deleted doc at the new README:
grpc-server.cpp, docs/content/features/backends.md, gallery/index.yaml, the
canary apply script (PIN_BUMP_APPLY_CHECK.md -> README), and the base
patches/README.md (ADDITIVE_DESIGN.md -> README). The canary's PIN_SYNC
reference still resolves; its inert SSM_DECODE_FIX_RESULTS.md glob (a
patch-internal path matcher, not a repo-doc link) is left intact.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Patch 0026 added the hybrid per-head bf16 SSM-state opt-in as the
ssm_hybrid_tau_thresh cparam + the --ssm-bf16-tau CLI flag (default 0 =
bit-exact f32). Expose it per-model via the LocalAI gallery/model YAML
`options:` list, mirroring the paged_kv / max_batch_tokens setenv hooks.
- grpc-server.cpp: new `ssm_bf16_tau` (alias `ssm_hybrid_tau`) option ->
setenv(LLAMA_SSM_BF16_TAU) when the value parses to a positive float. It
does NOT reference the paged-only common_params field, so the turboquant
fork (which lacks patch 0026) stays byte-clean.
- patch 0026 (common.cpp common_context_params_to_llama): getenv fallback
feeds cparams.ssm_hybrid_tau_thresh from LLAMA_SSM_BF16_TAU only when the
--ssm-bf16-tau CLI flag is unset (0). Absent/non-positive env => untouched,
so stock stays bit-exact; the CLI flag takes precedence when set.
- docs: backend/index.yaml note, docs backends.md, gallery header NOTE
(referencing A_HYBRID_SSM_RESULTS.md; the 2 NVFP4 entries stay bit-exact).
Byte-safe when unset: with no ssm_bf16_tau option the env is never touched
and the default f32 bit-exact recurrence is preserved. Verified the parse +
consume code paths with a standalone compile-and-run (option string ->
LLAMA_SSM_BF16_TAU -> tau, plus 0 / garbage / CLI-precedence / unset cases).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Sync to master (12 commits) + the llama.cpp pin bump 8be759e6 -> 9d5d882d.
Conflicts resolved:
- Makefile .NOTPARALLEL: union (keep both backends/llama-cpp-localai-paged and
master's backends/privacy-filter-darwin).
- gallery/index.yaml: our 2 base NVFP4 entries (qwen3.6-27b-nvfp4, qwen3.6-35b-a3b-nvfp4)
for the paged backend prepended to master's full list; master keeps its own
*-nvfp4-mtp variants (distinct entries). Go build + YAML validated; the 8 duplicate
gallery names are pre-existing in master, not introduced here.
The patchset still needs re-verification against the new tip (pin-sync, next step).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
New backend = stock llama-cpp grpc-server + the paged patchset (forces LLAMA_PAGED=on),
shipped as its own meta-backend (mirrors turboquant, simpler: no fork pin, no
grpc-server patching - the paged runtime hooks already exist in grpc-server.cpp).
Stock llama-cpp untouched (LLAMA_PAGED?=on retained; the de-risk flip deferred for
sign-off). Gallery: qwen3.6-27b-nvfp4 (dense) + qwen3.6-35b-a3b-nvfp4 (MoE) with the
benchmark run config (paged_kv, max_batch_tokens, parallel, flash_attention, f16),
mudler/ GGUF uris (sha256 TODO until publish). Importer dropdown entry + tests.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Produce a Gatekeeper-clean macOS distribution with no user workaround:
- Launcher DMG + the LocalAI.app inside it are built via fyne, codesigned
with the Developer ID under the hardened runtime, then the DMG is signed,
notarized (notarytool) and stapled. Replaces macos-dmg-creator (which had
no signing hook) with fyne package + hdiutil so we control the .app before
packaging.
- The bare local-ai darwin server binary is signed + notarized via
GoReleaser's native notarize block (quill backend, runs on Linux).
- All signing is gated on secrets being present, so forks/PRs/local builds
stay unsigned and green (contrib/macos/sign-and-notarize.sh no-ops).
- Add hardened-runtime entitlements and FyneApp.toml for deterministic
packaging; update macOS install docs to drop the quarantine workaround.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* chore: bump localrecall for postgres per-connection timeouts
Pulls mudler/LocalRecall#49: sets lock_timeout / idle_in_transaction
(default on) + opt-in statement_timeout on every pooled connection, so a
corrupt/wedged index (e.g. a BM25 insert spinning on a buffer-content lock)
can no longer hold its relation lock forever and head-of-line block the
whole vector store.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* docs(agents): document PostgreSQL connection safety timeouts
Note the POSTGRES_LOCK_TIMEOUT / POSTGRES_IDLE_IN_TRANSACTION_TIMEOUT /
POSTGRES_STATEMENT_TIMEOUT env vars read by the embedded vector store, and
that safe defaults are on automatically.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(auth): make advisory locks dialect-aware and harden SQLite DSN
Fixes#10506.
Two failures hit deployments that use the default SQLite auth database:
1. advisorylock executed PostgreSQL-only SQL (pg_advisory_lock /
pg_try_advisory_lock) unconditionally. On a SQLite auth DB the job
store, agent store and node registry migrations failed with
"no such function: pg_advisory_lock". WithLockCtx/TryWithLockCtx now
branch on the gorm dialect: PostgreSQL keeps the cross-process advisory
lock, every other dialect uses a context-aware, per-key in-process lock
(a SQLite auth DB is effectively single-process, so serializing within
the process is sufficient).
2. The SQLite auth DSN set no busy timeout, so transient SQLITE_BUSY over
network-backed storage (SMB/CIFS/NFS, e.g. Azure Files) failed the auth
migration immediately with "database is locked". The DSN now sets
_busy_timeout=5000 and _txlock=immediate (caller-supplied values are
preserved). WAL is intentionally not enabled since its shared-memory
mmap does not work over network filesystems. Docs note that PostgreSQL
should be used when the data directory lives on shared storage.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* test(jobs): regression test for #10506 SQLite job store migration
Exercises the exact caller chain that failed in the issue:
auth.InitDB(sqlite) -> jobs.NewJobStore -> advisorylock.WithLockCtx ->
AutoMigrate. Before the dialect-aware advisory lock fix this failed with
"no such function: pg_advisory_lock"; the test now asserts it migrates
cleanly on a SQLite auth DB.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The first #10485 fix (#10494) made the Blackwell physical-batch boost
per-device/context-aware, which neutralized the big compute-buffer OOM, but
the reporter's 2x16 GiB consumer Blackwell still OOM'd. Tracing the post-fix
log: the model now loads its weights, builds the main context and warms up
fine, and dies only on the *last* allocation — the MTP draft context's 800 MiB
KV cache on the tighter device.
#10411 changed only two defaults: the physical batch (now gated) and a
VRAM-scaled parallel-slot count. The KV cache is unified (n_ctx_seq == full
context proves slots share the budget, so parallel doesn't multiply KV), but
n_seq_max=4 still adds per-slot compute-graph / context-checkpoint / output
scratch. On a device packed ~99% by a 27B model spanning both cards, that
overhead is the few-hundred-MiB straw — which is why reverting #10411 (and only
#10411) restores a working load.
Gate the parallel-slot default on the same per-device headroom predicate as the
batch boost: when a large context already fills a single card
(largeContextForDevice), keep n_parallel=1. A user running one big-context model
that barely fits across two consumer GPUs is not serving four concurrent
tenants. Small contexts and large unified-memory devices (GB10) keep full
concurrency. Applied on both the single-host path and the distributed router.
Also make the auto-tuning visible and reversible (the debugging here needed
DEBUG logs and a git bisect):
- Log the effective performance-relevant runtime options at INFO once per
model load ("effective runtime tuning …": context, n_batch, n_gpu_layers,
parallel, flash_attention, f16) so an admin can see what will run and pin or
override any value in the model YAML.
- LOCALAI_DISABLE_HARDWARE_DEFAULTS=true skips the hardware auto-tuning
entirely (mirrors LOCALAI_DISABLE_GUESSING) for stock llama.cpp behavior.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(http): harden BaseURL proxy scheme/host detection
Split comma-separated X-Forwarded-Proto and honor the RFC 7239 Forwarded
header so generated links use https behind common reverse-proxy setups.
Refs #10482
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(http): honor explicit external base URL in BaseURL
When _external_base_url is set in the request context it dictates the
origin (scheme+host+port); the proxy path prefix is still appended.
Refs #10482
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(config): generalize LOCALAI_BASE_URL to ExternalBaseURL
LOCALAI_BASE_URL now sets a single instance-wide external base URL used
for OAuth callbacks and all self-referential links. A Pre middleware
stamps it into the request context for middleware.BaseURL.
Refs #10482
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs: document LOCALAI_BASE_URL and reverse-proxy headers
Refs #10482
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(http): cover parseForwarded edge cases; clarify base-url flag group
Adds direct unit coverage for quoted/malformed/multi-element Forwarded
headers and regroups the external base URL flag away from auth-only.
Refs #10482
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): add main-model cpu_moe/n_cpu_moe options
Mirror the existing draft_cpu_moe/draft_n_cpu_moe siblings for the main
model, matching upstream --cpu-moe / --n-cpu-moe (common/arg.cpp). Lets
users keep MoE expert weights on CPU to manage VRAM on large MoE models.
Closes part of #10483
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): forward unknown '-' options to upstream arg parser
Any options: entry starting with '-' is collected and passed verbatim to
llama.cpp's own common_params_parse (LLAMA_EXAMPLE_SERVER) at the end of
params_parse, so every upstream llama-server flag works without a new
hand-wired branch. Passthrough runs last and wins on overlap; n_parallel is
snapshotted to survive parser_init's SERVER reset, and help/usage/completion
flags are skipped to avoid exiting the backend.
Closes#10483
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(llama-cpp): document cpu_moe/n_cpu_moe and option passthrough
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(llama-cpp): terminate tensor/kv override vectors after passthrough
The tensor_buft_overrides padding and the kv/draft override terminators
ran before the generic option passthrough, so a passthrough flag
(--cpu-moe, --override-tensor, --override-kv, ...) appended a real entry
after the null sentinel - tripping the model loader's
back().pattern == nullptr assertion (crash) or being silently dropped.
Move all three termination/padding blocks to the end of params_parse,
after both the named-option loop and common_params_parse have pushed
their real entries. Also widen the exit()-flag skip list so --version,
--license, --list-devices and --cache-list cannot terminate the backend.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
pii_default_detectors was applied to the live config only by a live
POST /api/settings (ApplyRuntimeSettings) — neither the startup loader nor
the config file watcher read it back. So after a restart the persisted
default detectors were dropped, and the cloud-proxy MITM listener (which
resolves each intercept host's detectors once at start via ResolvePIIPolicy)
came up with an empty set and forwarded intercepted traffic unredacted, even
though the MITM model had pii.enabled:true and the defaults were on disk.
Request-side default redaction broke the same way.
- startup.go: loadRuntimeSettingsFromFile now applies pii_default_detectors,
before startMITMIfConfigured, with env > file precedence.
- config_file_watcher.go: apply pii_default_detectors on live file edits,
matching the existing env-guard pattern used for the other fields.
- settings endpoint: rebuild the MITM listener when pii_default_detectors
changes (its per-host detector map is frozen at listener start), not only
on a mitm_listen change — so toggling a default detector takes effect on
cloud-proxy traffic immediately.
- new LOCALAI_PII_DEFAULT_DETECTORS env var / CLI flag (WithPIIDefaultDetectors)
so the default detector set can be pinned at boot for immutable deployments.
Assisted-by: Claude:claude-opus-4-8 Claude-Code
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
* feat(realtime): add pipeline.compaction config + resolution
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(realtime): extract itemID helper, reuse in item.retrieve
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(realtime): drop duplicate Ginkgo bootstrap, fold specs into openai suite
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): implement conversation.item.delete
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): implement input_audio_buffer.clear
Add a handler for the input_audio_buffer.clear client event that discards
a partially-captured utterance (raw PCM + buffered Opus frames) via a
unit-tested clearInputAudio helper, then acks with input_audio_buffer.cleared.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): implement conversation.item.truncate (text)
Clears both .Text and .Transcript of the assistant content part at
contentIndex so barge-in truncation also works for audio turns whose
spoken words live in .Transcript.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): add Conversation.Memory + pair-safe compactionCut
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(realtime): compactionCut returns 0 for keep<=0 (no-cap sentinel, avoids panic)
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* style(realtime): gofmt compaction test helper closures
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): inject rolling memory into the prompt + summary builders
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): server-side summarize-then-drop compactor
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(realtime): unit-test prefixMatches eviction-safety predicate
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): resolve summarizer model + schedule compaction per turn
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(realtime): document conversation compaction + new item events
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(realtime): resolve summary model inside compaction goroutine (lazy, off-path)
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(realtime): reuse reasoning.ExtractReasoningComplete for summary stripping
Replace the bespoke <think> regex in the compactor with the shared
pkg/reasoning extractor (via spokenReasoningConfig), matching the rest of
the realtime path and covering all reasoning tag families, not just <think>.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(config): register pipeline.compaction fields in meta registry
TestAllFieldsHaveRegistryEntries requires every ModelConfig field to have
a UI/meta registry entry; add the four pipeline.compaction.* leaves so they
render with proper labels/descriptions instead of the reflection fallback.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
LocalAI pulls models from OCI registries (via go-containerregistry), the
Ollama registry, and OCI blob stores (via oras), but every request went
out with the underlying library's generic User-Agent, so registry
operators had no way to attribute traffic to LocalAI.
Add an oci.UserAgent() helper that returns "LocalAI" (or
"LocalAI/<version>" when the binary is built with a version stamp via
internal.Version) and wire it into all three pull paths:
- pkg/oci/image.go: remote.WithUserAgent on the go-containerregistry
image and digest requests
- pkg/oci/ollama.go: a User-Agent header on the Ollama manifest request
- pkg/oci/blob.go: a LocalAI User-Agent on the oras blob client. This
mirrors oras' auth.DefaultClient (same retry.DefaultClient policy);
only the advertised User-Agent changes.
Implements #6258.
Assisted-by: Claude:claude-opus-4-8 golangci-lint
Signed-off-by: Vijay Sai <vijaysaijnv@gmail.com>
* feat(ced): sketch sound-classification backend (CED audio tagger)
Wires ced.cpp (CED, 527-class AudioSet sound-event tagger; baby cry,
footsteps, glass, alarms, dog bark) into LocalAI as a Go/purego backend.
SKETCH (backend skeleton real; core REST wiring + CI/gallery is a checklist
in DESIGN.md):
- backend/backend.proto: new SoundDetection rpc + SoundClass messages
(run `make protogen-go` to regenerate pkg/grpc/proto).
- backend/go/ced: main.go (purego dlopen libced.so + ced_capi.h),
goced.go (Ced gRPC backend: Load + SoundDetection), Makefile
(clone-at-pin CED_VERSION, ggml static-PIC shared build), run.sh,
package.sh, .gitignore.
- DESIGN.md: REST /v1/audio/classification wiring (handler/route/capability
registration checklist), gallery/index + CI registration, and a scoping
note for the realtime/websocket live-recognition path (sliding-window
classify over the existing ws transport + voicegate; the ced C-API
per-PCM entry point is already window-friendly).
Backend code does not compile until protogen-go regenerates the pb types
and a libced.so is built (Makefile clones+builds it).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): REST /v1/audio/classification endpoint + capability registration
Wires the ced sound-event classification backend (AudioSet audio tagger)
end to end through the REST surface, mirroring the transcription path.
- Handler: core/http/endpoints/openai/sound_classification.go parses the
multipart audio upload, temp-files it, resolves the model config and
calls the SoundDetection RPC; returns {model, detections[]} JSON.
- Backend wrapper: core/backend/sound_classification.go (ModelSoundDetection)
loads the model and normalizes the proto response into schema types.
- Schema: core/schema/sound_classification.go (SoundClassificationResult).
- gRPC layer: SoundDetection wired through the LocalAI wrapper (interface,
Backend client, Client, embed, server, base default) so the loader-typed
client exposes the RPC; proto regenerated via make protogen-go.
- Route: POST /v1/audio/classification (+ /audio/classification alias) with
the audio/multipart default-model middleware in routes/openai.go.
- Capability surfaces: swagger @Tags/@Router on the handler; FLAG_SOUND_
CLASSIFICATION usecase flag + UsecaseSoundClassification + UsecaseInfoMap +
GuessUsecases + ModalityGroups + GetAllModelConfigUsecases; meta usecase
option; /api/instructions audio area updated; auth RouteFeatureRegistry +
FeatureAudioClassification (APIFeatures, default ON) + FeatureMetas; UI
usecaseFilters, capabilities.js CAP_SOUND_CLASSIFICATION, Models.jsx filter
+ i18n; docs page features/audio-classification.md + whats-new + crosslink.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): realtime sound-event detection over the websocket API
When a realtime pipeline configures a sound-classification model, each
VAD-committed utterance (the same window the transcription path produces)
is also run through the CED sound-event classifier and the scored AudioSet
tags are emitted as a new server event. No new backend rpc is needed: the
SoundDetection gRPC method already exists on this branch.
- config: add Pipeline.SoundDetection (yaml/json sound_detection,omitempty)
beside Transcription/VAD.
- realtime: add Model.SoundDetection(ctx, audio, topK, threshold) to the
ModelInterface; implement it on wrappedModel and transcriptOnlyModel by
calling backend.ModelSoundDetection with the session's sound-classification
model config (mirrors how Transcribe dispatches). Load the optional config
in newModel / newTranscriptionOnlyModel; nil config keeps it additive.
- types: add ConversationItemSoundDetectionEvent (item_id, content_index,
detections[]{label,score,index}) with type conversation.item.sound_detection,
its ServerEventType constant and MarshalJSON, mirroring the transcription
completed event.
- realtime: add emitSoundDetection (unary path: classify the committed window,
build the event, t.SendEvent) and wire it at the utterance-commit hook right
after emitTranscription; gated on session.SoundDetectionEnabled (resolved
from Pipeline.SoundDetection at session setup, defaults top_k=5, threshold=0).
Its error is logged via xlog but never aborts the turn.
- test: Ginkgo specs for emitSoundDetection (tags emitted, empty detections,
classifier error) plus a SoundDetection method on the fakeModel double.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ced): implement SoundDetection in nodes backend test doubles
The SoundDetection method added to the grpc backend interface left two
test doubles (fakeBackendClient, fakeGRPCBackend) incomplete, so
core/services/nodes failed to compile under `go vet`/`go test` (go build
missed it: the doubles live in _test.go). Add the method to both,
mirroring their existing Detect mock. Repairs CI for the nodes package.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): decouple realtime sound detection from VAD (sound-only sessions)
Sound-event detection must activate on sounds, not speech, so it no longer
runs through the voice VAD/transcription path. A sound-detection-only
pipeline (sound_detection set, no transcription/LLM) now:
- is accepted by prepareRealtimeConfig (sound_detection counts as a pipeline
stage),
- builds a lightweight model via newSoundDetectionOnlyModel (no VAD/STT/LLM/TTS
loaded), and
- defaults the session to turn_detection none (no VAD) with no transcription
stage, so the client drives windowing via input_audio_buffer.commit
(option A: client-side sliding window). The per-PCM C-API already supports
arbitrary windows.
commitUtterance gains a sound-only branch: it emits the
conversation.item.sound_detection event (scored AudioSet tags) and stops -
no transcription, no LLM response. generateResponse is now guarded on a
transcription stage being present, so a sound-only turn never invokes the LLM.
Existing transcription/VAD sessions are unchanged (additive). Added a
commitUtterance sound-only Ginkgo spec asserting it emits the sound event and
neither transcribes nor generates a response. go vet + golangci-lint
(new-from-merge-base) clean; openai suite green.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): register sound-classification backend in gallery + CI
Mechanical backend-image registration for the ced sound-event classifier,
mirroring the parakeet-cpp Go/purego backend everywhere it is wired up.
- .github/backend-matrix.yml: add the ced build matrix, field-for-field copies
of the parakeet-cpp entries (cpu amd64/arm64, cublas cuda 12/13 amd64,
l4t cuda-13 arm64, l4t-jetpack cuda-12 arm64, sycl f32/f16, vulkan
amd64/arm64, rocm hipblas, and the metal darwin entry), changing only
backend and tag-suffix. dockerfile stays ./backend/Dockerfile.golang.
- backend/index.yaml: add the &ced meta anchor (capabilities map per platform)
plus ced-development and the per-arch image entries, each uri/mirror
tag-suffix matching the matrix exactly. The model gallery (GGUF) entry is
intentionally deferred pending the HuggingFace publish (TODO note inline).
- scripts/changed-backends.js: add an explicit item.backend === "ced" branch in
inferBackendPath mapping to backend/go/ced/, same mechanism and ordering as
the parakeet-cpp branch (before the generic golang fallthrough).
- .github/workflows/bump_deps.yaml: register mudler/ced.cpp -> CED_VERSION in
backend/go/ced/Makefile so the daily bot bumps the pin.
- swagger/{docs.go,swagger.json,swagger.yaml}: regenerated via make swagger so
the existing /v1/audio/classification annotations land in the generated spec.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): server-side windowing for realtime sound detection (option B)
Adds an optional server-driven sliding-window classifier so a sound-only
realtime client only has to stream audio (no input_audio_buffer.commit):
- Pipeline.sound_detection_window_ms / sound_detection_hop_ms config knobs.
When both > 0 on a sound-only session, the server classifies the last
window of streamed audio every hop and emits a conversation.item.sound_
detection event; the input buffer is trimmed to one window so a long
stream stays bounded. When unset, the session stays client-driven
(option A). Runs independent of VAD (sound events are not speech).
- handleSoundWindow (ticker) + classifySoundWindow (one tick, extracted so
it is unit-testable) + writeWindowWAV, which declares the true
InputSampleRate (NewWAVHeaderWithRate) so the classifier resamples
correctly. Goroutine is started after toggleVAD and torn down with the
session (close + wg.Wait).
- Register pipeline.sound_detection (+window_ms/hop_ms) in the config meta
registry; the earlier realtime commit added pipeline.sound_detection
without a registry entry, failing TestAllFieldsHaveRegistryEntries. This
fixes that and covers the two new knobs.
Tests: classifySoundWindow emits an event + trims the buffer to one window,
no-ops on too-little audio; writeWindowWAV declares the given sample rate.
go build/vet + golangci-lint (new-from-merge-base) clean; config + openai
suites green.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): add ced-base GGUF model gallery entries (f16 + q8_0)
The ced-base weights are now published at mudler/ced-base-gguf (Apache-2.0,
converted from mispeech/ced-base). Adds gallery/ced.yaml (backend: ced +
known_usecases: sound_classification) and two gallery/index.yaml entries
(ced-base-f16 default, ced-base-q8 smallest) with sha256-pinned files, and
removes the now-resolved TODO from backend/index.yaml.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): add tiny/mini/small GGUF model gallery entries
Publishes the rest of the CED family (same architecture, metadata-driven port
verified end-to-end on ced-tiny) to mudler/ced-{tiny,mini,small}-gguf and adds
their f16 + q8_0 gallery entries:
ced-tiny (5.5M, edge/Pi-class) f16 11MB / q8_0 6MB
ced-mini (9.6M) f16 19MB / q8_0 11MB
ced-small (22M) f16 42MB / q8_0 23MB
All sha256-pinned. ced-base remains the accuracy default.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ced): point gallery entries at the consolidated mudler/ced-gguf repo
All CED quantizations (tiny/mini/small/base, f16/q8_0) now live in a single
HuggingFace repo, mudler/ced-gguf, instead of per-model repos. Repoint the 8
gallery model entries' urls + file uris accordingly. sha256 and filenames are
unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ced): bump CED_VERSION to the short-clip fix
Pin the ced backend to ced.cpp 99c6ed3, which fixes a crash on any clip
shorter than target_length (~10.11s): time_pos_embed was added at its full
63-frame grid instead of being sliced to the clip's actual time grid, tripping
ggml_can_repeat in ggml_add. Surfaced by the live realtime e2e (sub-10s
windows) and gated with a short-clip parity test upstream.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(ced): list ced.cpp as a LocalAI-team engine + backend-guide directive
- README.md: add ced.cpp to the "native C/C++/GGML engines developed and
maintained by the LocalAI project" table.
- docs/content/features/backends.md: add a Sound Classification backend
category (sound-event classification / audio tagging) listing ced.cpp.
- .agents/adding-backends.md: add a "Documenting the backend" section and two
verification-checklist items requiring new backends to be documented in the
backends.md category list, and in-house native engines to be added to the
README maintained-engines table. This directive was missing.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ced): repin CED_VERSION to the v0.1.0 release commit
ced.cpp history was squashed into a single release commit (tagged v0.1.0), so
the previous pin (99c6ed3) no longer exists upstream. Pin to c04ac14, the
v0.1.0 release commit, so the backend builds against a commit that exists.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ced): silence gosec G304/G103 + govet unsafeptr on audited paths
- sound_classification.go: os.Create(dst) where dst = temp dir + path.Base of
the upload (no traversal). #nosec G304, matching the depth-anything-cpp handler.
- goced.go: reading a NUL-terminated C string from a libced-owned buffer.
#nosec G103 (gosec) + //nolint:govet (golangci-lint's unsafeptr check), since
the uintptr is a C-owned malloc'd buffer, not Go-GC memory.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): add voice_recognition enforce + identity config
Add Enforce *bool and Identity *VoiceIdentityConfig to
PipelineVoiceRecognition, plus EnforceGate/IdentityEnabled/
AnnounceEnabled/PersonalizeEnabled helpers. Enforce nil defaults to
gating (backward compatible); identity surfacing is independent of the
gate.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): add Speaker type and conversation.item.speaker event
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(realtime): split voiceGate into Resolve + authorize
Split the speaker authorization into a Resolve step (embed once, produce a
types.Speaker identity) and a pure authorize policy step, with a 0..100
confidence score mirroring /v1/voice/identify. The legacy Authorize wrapper is
kept so existing specs stay green.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): resolve speaker per turn and emit conversation.item.speaker
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): personalize LLM turns with recognized speaker
Set the per-message name field on each recognized user turn and append a
current-speaker note to the system message, both gated by the voice
recognition identity config.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(realtime): document speaker identity surfacing and personalization
Document the new voice_recognition keys (enforce, identity.*) and the
LocalAI-extension conversation.item.speaker server event in the realtime
feature docs.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(realtime): cover when:first+identity re-resolution and multi-speaker history
Add two integration specs to harden the speaker-aware realtime path:
- when:first with an Identity block re-resolves the speaker every turn even
though re-authorization is skipped after the first match: a later resolve
error now fails closed, while a clean later resolve still surfaces and names
the speaker.
- multi-speaker history attribution: each user turn carries its own per-message
name and the injected system note reflects the latest speaker.
Test-only change; no production behavior was modified.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): surface speaker labels in conversation.item.speaker
Carry the registered speaker's labels (identify mode) on types.Speaker so
they flow into the conversation.item.speaker event and the stored item.
Verify mode has no labels, so the field is omitted there.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(e2e): cover conversation.item.speaker over a real websocket
Add a realtime-pipeline-identity config (verify mode, enforce:false, identity
announce+announce_unknown+personalize) and two e2e specs driving the real
server over a real WebSocket with the mock VoiceEmbed backend: an authorized
speaker yields a conversation.item.speaker event naming e2e-speaker (matched
true) and reaches response.done; an unauthorized speaker yields an unknown
(matched false, no name) event and still responds, proving enforce:false
never drops a turn.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(config): register voice_recognition enforce + identity fields
The meta registry coverage test (TestAllFieldsHaveRegistryEntries) requires
every config field to have an entry in core/config/meta/registry.go. The new
voice_recognition.enforce and voice_recognition.identity.* fields were missing,
failing tests-linux and tests-apple. Add registry entries (toggles) so the
fields are surfaced in the model-config editor and the coverage test passes.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
* feat(config): add model alias field and self-validation
Add ModelConfig.Alias (yaml: alias), IsAlias(), and an alias
short-circuit at the top of Validate() that rejects self-reference and
forbids setting backend/parameters.model on a pure-redirect alias.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(config): resolve and validate model alias targets in the loader
Assisted-by: Claude:opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(middleware): resolve model aliases and stamp requested/served identity
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(modeladmin): reject alias configs with invalid targets on create/edit
Validate alias targets at create/swap entry points (ImportModelEndpoint,
EditYAML, PatchConfig) so a dangling, chained, or disabled alias target is
rejected at save time rather than surfacing as a runtime error.
Assisted-by: Claude:opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(api): add GET /api/aliases to list model aliases
Adds an admin-gated read-only endpoint that lists every model alias
config as {name, target} pairs, backed by the loader's existing
GetAllModelsConfigs().
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(mcp): add set_alias and list_aliases tools
Expose model-alias management over the LocalAI Assistant MCP surface:
list_aliases (read-only, GET /api/aliases) and set_alias (mutating).
SetAlias is swap-first: PATCH /api/models/config-json/:name swaps an
existing alias's target (validated, non-destructive) and a 404 falls
back to POST /models/import to create a fresh {name, alias} config. The
inproc client mirrors this via ConfigService.PatchConfig + a create path
modeled on ImportModelEndpoint. Deletion reuses delete_model.
Assisted-by: Claude:claude-opus-4 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* style(mcp): replace em dashes in alias tool comments
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(config-meta): expose alias as a model-select field
Add an 'alias' section to DefaultSections() and an 'alias' field override
in DefaultRegistry() so the schema-driven React editor renders the new
top-level ModelConfig.Alias field as a model picker in its own section.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): add alias template card and Manage alias badge
Add an 'Alias / Routing' template to the create-flow gallery that seeds a
minimal name + alias config, and a read-only 'alias -> target' badge on the
Manage Models tab. The capabilities row payload does not carry the alias
field, so the badge resolves targets from GET /api/aliases looked up by name.
Assisted-by: Claude:claude-opus-4 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs: document model aliases
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(swagger): regenerate for GET /api/aliases
Adds the /api/aliases path and AliasInfo schema generated from the
ListAliasesEndpoint annotation.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(localai): check os.RemoveAll error in aliases_test
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix: correct alias conversion docs and advertise /api/aliases in instructions
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(mcp): write alias config 0600 to satisfy gosec G306
The inproc createAlias path wrote the alias YAML with 0644, which gosec
flags as a new G306 finding on the PR. The LocalAI process is the sole
reader/writer of model configs, so 0600 is correct and keeps the scan clean.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
docs: document the privacy-filter.cpp backend in README and compatibility table
The privacy-filter.cpp backend (#10360) was registered in backend/index.yaml
and referenced from the PII feature docs, but was missing from the backend
catalog surfaces. Add it to the README "Backends built by us" table, the
compatibility table (Utilities & Other, CPU/CUDA 13/Vulkan), and the backend
type list in the backends feature doc.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Squashed feat/pii-ner-tier-engine rebased onto master (was 45 commits; see
backup/pii-ner-tier-engine-prerebase). Net change:
- privacy-filter.cpp: standalone GGML engine for the openai-privacy-filter
PII/NER token classifier, wired as a LocalAI gRPC backend (CPU/CUDA/Vulkan).
TokenClassify moves off the patched llama.cpp path onto this backend.
- PII filter reworked to be NER-centric (encoder/NER detection tier scanning
whole conversations as one document), with a recreated bounded restricted-
regex secret-matching pattern detector tier alongside it (per-model
pii_detection.builtins / .patterns + core/services/routing/piipattern).
- Detection labelled by source (ner vs pattern); backend trace / confidence /
debug observability; analyze/redact exposed as a synchronous API.
- Instance-wide default detector policy + per-usecase default-on; request
filtering extended to completions, embeddings, edits & Ollama.
- React UI: NER-centric PII editor, detector-models table, pattern/builtins
editor, middleware default-policy UI.
- Gallery: privacy-filter-multilingual token-classify model + NER install
filter; token_classify known_usecase; batch sized to context for NER models.
privacy-filter backend registered in the backend gallery (cpu/vulkan/cuda-13
meta + image entries with a capabilities map) matching its CI matrix jobs,
and an /import-model auto-detect importer (PrivacyFilterImporter, narrow
privacy-filter GGUF detection) replacing the prior pref-only registration.
Reconciled against master's independent evolution:
- Dropped master's PIIPatternOverrides feature (global-pattern runtime
overrides + /api/pii/patterns API + runtime_settings.json persistence). The
per-model NER + pattern-detector design supersedes it; it was built on the
global redactor pattern set this branch replaced.
- Reverted the llama.cpp Score carry-patch (0006-server-task-type-score):
removed the patch and restored master's grpc-server.cpp Score RPC (direct
llama_decode, slot-loop bypass) and LLAMA_VERSION pin, plus master's
model_config validation forbidding score + chat/completion/embeddings on
llama-cpp. token_classify is unaffected (it runs on the privacy-filter
backend, not llama-cpp).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Bring the Backend & Model Compatibility Table up to the full set of
backends published in backend/index.yaml (60+), organized by modality
with per-backend acceleration targets. Add an "Available Backends"
pointer and expand the backend-type list in the backends feature doc.
Update the README backend count to 60+ and add a "Backends built by us"
section listing the native C/C++/GGML engines maintained by the LocalAI
project (parakeet.cpp, voxtral.c, vibevoice.cpp, rf-detr.cpp,
locate-anything.cpp, depth-anything.cpp, LocalVQE, local-store).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The Hugo relearn theme does not provide an "alert" shortcode, so the
docs deploy failed at the Build site step:
failed to extract shortcode: template for shortcode "alert" not found
docs/content/features/image-generation.md:106
Convert the vae_decode_only note to the theme-supported notice shortcode
used everywhere else in the docs.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(config): add chat_template_kwargs model field + resolver
Adds the ChatTemplateKwargs model-config map and RequestMetadata carrier,
plus ResolveChatTemplateKwargs which layers the config map under coerced
request metadata. Foundation for generic jinja chat-template kwargs (issue #10329).
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(backend): forward resolved chat_template_kwargs blob to backends
gRPCPredictOpts now merges per-request client metadata over the server-derived
enable_thinking/reasoning_effort (reaching all backends via the standalone keys)
and serialises the resolved chat_template_kwargs map into a JSON blob for
llama.cpp, written last so a client cannot clobber it. Issue #10329.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(http): wire request metadata to config.RequestMetadata
The OpenAI request metadata field was parsed but unused; stamp it onto the
per-request ModelConfig so gRPCPredictOpts forwards it as chat_template_kwargs
overrides. Issue #10329.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): generic chat_template_kwargs merge (drop per-key blocks)
Replace the per-key enable_thinking/reasoning_effort handling in both the
streaming and non-streaming chat paths with a single block that parses the
chat_template_kwargs JSON blob resolved by the Go layer and merges every key
into body_json. New jinja template levers (e.g. preserve_thinking) now need
no C++ change. Issue #10329.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs: document custom chat_template_kwargs (model + per-request)
Issue #10329.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(backend): pin reasoning_effort as a string in the chat_template_kwargs blob
Issue #10329.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(http): e2e guard pinning chat_template_kwargs forwarded to gRPC
Adds an ECHO_PREDICT_METADATA marker to the mock-backend that echoes the
received PredictOptions.Metadata, and an app_test.go spec that drives a real
/v1/chat/completions request (model chat_template_kwargs + per-request metadata
override) and asserts the exact metadata + chat_template_kwargs blob the REST
layer forwards to gRPC. Locks the REST->gRPC contract against regressions. Issue #10329.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(config): grandfather chat_template_kwargs in registry coverage
chat_template_kwargs is a free-form map[string]any (like engine_args, already
on the list), not a scalar the config UI registry can surface, so it is exempt
from the registry-entry requirement. Fixes the TestAllFieldsHaveRegistryEntries
failure introduced by the new field. Issue #10329.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>