Commit Graph

1175 Commits

LocalAI [bot]
47cc3dc8d7 chore: ⬆️ Update ggml-org/llama.cpp to 361fe72acb7b9bd79059cc177cbeda99b35b5db9 (#9548)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-25 08:58:27 +02:00
dependabot[bot]
9ab3496de2 chore(deps): bump rustls-webpki from 0.103.10 to 0.103.13 in /backend/rust/kokoros in the cargo group across 1 directory (#9546)
chore(deps): bump rustls-webpki

Bumps the cargo group with 1 update in the /backend/rust/kokoros directory: [rustls-webpki](https://github.com/rustls/webpki).


Updates `rustls-webpki` from 0.103.10 to 0.103.13
- [Release notes](https://github.com/rustls/webpki/releases)
- [Commits](https://github.com/rustls/webpki/compare/v/0.103.10...v/0.103.13)

---
updated-dependencies:
- dependency-name: rustls-webpki
  dependency-version: 0.103.13
  dependency-type: indirect
  dependency-group: cargo
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-24 22:02:58 +02:00
Andreas Egli
1d0de757c3 fix: add hipblaslt library (#9541)
Signed-off-by: Andreas Egli <github@kharan.ch>
2026-04-24 18:50:03 +02:00
LocalAI [bot]
1c9592c77f chore: ⬆️ Update leejet/stable-diffusion.cpp to b8bdffc19962be7e5a84bfefeb2e31bd885b571a (#9521)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-24 15:15:15 +02:00
Richard Palethorpe
13734ae9fa feat: Add Sherpa ONNX backend for ASR and TTS (#8523)
feat(backend): Add Sherpa ONNX backend and Omnilingual ASR

Adds a new Go backend wrapping sherpa-onnx via purego (no cgo). Same
approach as opus/stablediffusion-ggml/whisper — a thin C shim
(csrc/shim.c + shim.h → libsherpa-shim.so) wraps the bits purego
can't reach directly: nested struct config writes, result-struct field
reads, and the streaming TTS callback trampoline. The Go side uses
opaque uintptr handles and purego.NewCallback for the TTS callback.
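
The purego pattern in miniature, as a hedged sketch for a Linux/macOS
host: the shim library name is from this message, but the exported
symbols and their signatures are hypothetical stand-ins.

    package main

    import (
        "fmt"

        "github.com/ebitengine/purego"
    )

    func main() {
        // Open the shim; RTLD_NOW resolves symbols eagerly.
        lib, err := purego.Dlopen("libsherpa-shim.so", purego.RTLD_NOW|purego.RTLD_GLOBAL)
        if err != nil {
            panic(err)
        }

        // Bind a C export to a Go func value by symbol name
        // (hypothetical symbol; returns an opaque handle).
        var shimCreateTTS func(configJSON string) uintptr
        purego.RegisterLibFunc(&shimCreateTTS, lib, "shim_create_tts")

        // Callback trampoline: purego.NewCallback turns a Go func into
        // a C function pointer the shim can invoke per PCM chunk.
        onChunk := purego.NewCallback(func(samples uintptr, n int32) int32 {
            fmt.Println("received", n, "samples")
            return 1 // non-zero: keep streaming
        })

        handle := shimCreateTTS(`{"model":"vits-ljs"}`)
        _, _ = handle, onChunk
    }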

Supports:
- VAD via sherpa-onnx's Silero VAD
- Offline ASR: Whisper, Paraformer, SenseVoice, Omnilingual CTC
- Online/streaming ASR: zipformer transducer with endpoint detection
  (AudioTranscriptionStream emits delta events during decode)
- Offline TTS: VITS (LJS, etc.)
- Streaming TTS: sherpa-onnx's callback API → PCM chunks on a channel,
  prefixed by a streaming WAV header

Gallery entries: omnilingual-0.3b-ctc-q8-sherpa (1600-language offline
ASR), streaming-zipformer-en-sherpa (low-latency streaming ASR),
silero-vad-sherpa, vits-ljs-sherpa.

E2E coverage: tests/e2e-backends for offline + streaming ASR,
tests/e2e for the full realtime pipeline (VAD + STT + TTS).

Assisted-by: claude-opus-4-7-1M [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-24 14:40:06 +02:00
Ettore Di Giacinto
c0920f3273 fix(ik-llama-cpp): patch clip.cpp for new ggml_quantize_chunk signature (#9531)
Bumps ik_llama.cpp pin to 16996aeab7. Upstream 286ce32...16996ae adds a
trailing `const struct quantize_user_data *` parameter to
`ggml_quantize_chunk` (PR ikawrakow/ik_llama.cpp#1677) but leaves
`examples/llava/clip.cpp` unchanged because their build has moved to
`examples/mtmd/`. LocalAI's prepare.sh still copies from
`examples/llava/`, so the dead 7-arg call reaches the grpc-server
compile and fails. Patch the call site to pass `nullptr` for the new
param.

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash]
2026-04-24 13:07:26 +02:00
LocalAI [bot]
7c1934b183 chore: ⬆️ Update ggml-org/llama.cpp to 187a45637054881ecacf17f8e2f6f8f2ba7df1c7 (#9520)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-24 09:17:06 +02:00
Ettore Di Giacinto
4906cbad04 feat: add biometrics UI (#9524)
* feat(react-ui): add Face & Voice Recognition pages

Expose the face and voice biometrics endpoints
(/v1/face/*, /v1/voice/*) through the React UI. Each page has four
tabs driving the six endpoints per modality: Analyze (demographics
with bounding boxes / waveform segments), Compare (verify with a
match gauge and live threshold slider), Enrollment (register /
identify / forget with a top-K matches view), Embedding (raw
vector inspector with sparkline + copy).

MediaInput supports file upload plus live capture: webcam
snap-to-canvas for face, MediaRecorder -> AudioContext ->
16-bit PCM mono WAV transcode for voice (libsndfile on the
backend only handles WAV/FLAC/OGG natively).

Sidebar gets a new Biometrics section feature-gated on
face_recognition / voice_recognition; routes are wrapped in
<RequireFeature>. No new dependencies -- Font Awesome icons
picked from the Free set.

Assisted-by: Claude:Opus 4.7

* fix(localai): accept data URI prefixes with codec/charset params

Browser MediaRecorder produces data URIs like
  data:audio/webm;codecs=opus;base64,...
so the pre-';base64,' section can carry multiple parameter
segments. The `^data:([^;]+);base64,` regex in pkg/utils/base64.go
and core/http/endpoints/localai/audio.go only matched exactly one
segment, so recordings straight from the React UI's live-capture
tab failed the strip and then tripped the base64 decoder on the
leading 'data:' literal, surfacing as
  "invalid audio base64: illegal base64 data at input byte 4"

Widened both regexes to `^data:[^,]+?;base64,` so any number of
';param=value' segments between the mime type and ';base64,' are
tolerated. Added a regression test covering the MediaRecorder
shape.
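
Both patterns are quoted verbatim above; a self-contained check
against the MediaRecorder shape:

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        uri := "data:audio/webm;codecs=opus;base64,AAAA"

        old := regexp.MustCompile(`^data:([^;]+);base64,`)
        wide := regexp.MustCompile(`^data:[^,]+?;base64,`)

        // The old pattern admits exactly one segment before ';base64,',
        // so the extra ';codecs=opus' parameter defeats it; the widened
        // pattern tolerates any number of ';param=value' segments.
        fmt.Println(old.MatchString(uri))  // false
        fmt.Println(wide.MatchString(uri)) // true
    }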

Assisted-by: Claude:Opus 4.7

* fix(insightface): scope pack ONNX loading to known manifests

LocalAI's gallery extracts buffalo_* zips flat into the models
directory, which inevitably mixes with ONNX files from other
backends (opencv face engine, MiniFASNet antispoof, WeSpeaker
voice embedding) and older buffalo pack installs. Feeding those
foreign files into insightface's model_zoo.get_model() blows up
inside the router -- it assumes a 4-D NCHW input and indexes
`input_shape[2]` on tensors that aren't shaped like a face model,
raising IndexError mid-load and leaving the backend unusable.

The router's dispatch isn't amenable to per-file try/except alone
(first-file-wins picks det_10g.onnx from buffalo_l even when the
user asked for buffalo_sc -- alphabetical order happens to favour
the wrong pack). Instead, ship an explicit manifest of the
upstream v0.7 pack contents and scope the glob to that when the
requested pack is known. The manifest is small and stable; future
packs can be added alongside or fall through to the tolerance
loop, which also swallows any remaining IndexError / ValueError
from foreign files with a clear `[insightface] skipped` stderr
line for diagnostics.

Assisted-by: Claude:Opus 4.7

* fix(speaker-recognition): extract FBank features for rank-3 ONNX encoders

Pre-exported speaker-encoder ONNX graphs come in two shapes:

  rank-2  [batch, samples]           -- some 3D-Speaker exports,
                                        take raw waveform directly.
  rank-3  [batch, frames, n_mels]    -- WeSpeaker and most Kaldi-
                                        lineage encoders, expect
                                        pre-computed Kaldi FBank.

OnnxDirectEngine unconditionally fed `audio.reshape(1, -1)` --
correct for rank-2, IndexError-on-input_shape[3] on rank-3, which
surfaced to the UI as
  "Invalid rank for input: feats Got: 2 Expected: 3"

Detect the input rank at session init and run Kaldi FBank
(80-dim, 25ms/10ms frames, dither=0.0, per-utterance CMN) before
the forward pass when rank>=3. All knobs are configurable via
backend options for encoders that deviate from defaults.

torchaudio.compliance.kaldi is already in the backend's
requirements (SpeechBrain pulls torchaudio in), so no new
dependency.

Assisted-by: Claude:Opus 4.7

* fix(biometrics): isolate face and voice vector stores

Face (ArcFace, 512-D) and voice (ECAPA-TDNN 192-D / WeSpeaker
256-D) biometric embeddings were colliding inside a single
in-memory local-store instance. Enrolling one after the other
failed with
  "Try to add key with length N when existing length is M"
because local-store correctly refuses to mix dimensions in one
keyspace.

The registries were constructed with `storeName=""`, which in
StoreBackend() is just a WithModel() call. But ModelLoader's
cache is keyed on `modelID`, not `model` -- so both registries
collapsed to the same `modelID=""` slot and reused the same
backend process despite looking isolated on paper.
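
A toy reduction of the collision; the map stands in for ModelLoader's
cache, and the real option plumbing differs:

    package main

    import "fmt"

    type opts struct{ modelID, model string }

    func main() {
        cache := map[string]string{} // keyed on modelID, like the loader

        load := func(o opts) {
            if p, ok := cache[o.modelID]; ok {
                fmt.Printf("%s reuses %s (collision)\n", o.model, p)
                return
            }
            cache[o.modelID] = "process-for-" + o.model
        }

        load(opts{modelID: "", model: "face-store"})  // spawns a process
        load(opts{modelID: "", model: "voice-store"}) // reuses the "" slot

        // With the fix below, the store name fills both fields and the
        // cache key separates the namespaces.
        load(opts{modelID: "localai-voice-biometrics", model: "localai-voice-biometrics"})
    }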

Three complementary fixes:

  1. application.go -- give each registry a distinct default
     namespace ("localai-face-biometrics" /
     "localai-voice-biometrics"). The comment claimed
     isolation; now it's actually enforced.

  2. stores.go -- pass the storeName as both WithModelID and
     WithModel so the ModelLoader cache key separates
     namespaces and the loader spawns distinct processes.

  3. local-store/store.go -- drop the Load() `opts.Model != ""`
     guard. It was there to prevent generic model-loading loops
     from picking up local-store by accident, but that auto-load
     path is being retired; the guard now just blocks legitimate
     namespace isolation. opts.Model is treated as a tag; the
     per-tuple process isolation upstream handles discrimination.

Assisted-by: Claude:Opus 4.7

* fix(gallery): stale-file cleanup and upgrade-tmp directory safety

Two related robustness fixes for backend install/upgrade:

pkg/downloader/uri.go
  OCI downloads passed through
      if filepath.Ext(filePath) != "" ...
          filePath = filepath.Dir(filePath)
  which was intended to redirect file-shaped download targets
  into their parent directory for OCI extraction. The heuristic
  misfires on directory-shaped paths with a dot-suffix --
  gallery.UpgradeBackend uses
      tmpPath = "<backendsPath>/<name>.upgrade-tmp"
  and Go's filepath.Ext treats ".upgrade-tmp" as an extension.
  The rewrite landed the extraction at "<backendsPath>/", which
  then **overwrote the real install** (backends/<name>/) with a
  flat-layout file and left a stray run.sh at the top level. The
  tmp dir itself stayed empty, so the validation step that
  checked "<tmpPath>/run.sh" predictably failed with
      "upgrade validation failed: run.sh not found in new backend"
  Every manual upgrade silently corrupted the backends tree this
  way. Guard the rewrite behind "target isn't already an existing
  directory" -- InstallBackend / UpgradeBackend both pre-create
  the target as a directory, so they get the correct behaviour;
  existing file-path callers with a genuine dot-extension still
  get the parent redirect.
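
  A runnable illustration of the misfire plus the described guard (the
  path is illustrative; the guard shape is a sketch, not the actual diff):

      package main

      import (
          "fmt"
          "os"
          "path/filepath"
      )

      func main() {
          target := "/backends/foo.upgrade-tmp"

          // Go treats everything after the last dot of the last path
          // element as an extension, even on a directory-shaped path:
          fmt.Println(filepath.Ext(target)) // ".upgrade-tmp"

          // Sketch of the guard: targets that already exist as
          // directories (InstallBackend/UpgradeBackend pre-create them)
          // are left alone; anything else keeps the old parent redirect.
          if info, err := os.Stat(target); err == nil && info.IsDir() {
              // extract into target itself
          } else if filepath.Ext(target) != "" {
              target = filepath.Dir(target)
          }
          fmt.Println(target) // "/backends" here: the demo path doesn't exist
      }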

core/gallery/backends.go
  InstallBackend's MkdirAll returned ENOTDIR when something at
  the target path was already a file (legacy dev builds dropped
  golang backend binaries directly at `<backendsPath>/<name>`
  instead of nesting them under their own subdir). That
  permanently blocked reinstall and upgrade for anyone carrying
  that state, since every retry hit the same error. Detect a
  pre-existing non-directory, warn, and remove it before the
  MkdirAll so the fresh install can write the correct nested
  layout with metadata.json + run.sh.

Assisted-by: Claude:Opus 4.7

* fix(galleryop): refresh upgrade cache after backend ops

UpgradeChecker caches the last upgrade-check result and only
refreshes on the 6-hour tick or after an auto-upgrade cycle.
Manual upgrades (POST /api/backends/upgrade/:name) go through
the async galleryop worker, which completes the upgrade
correctly but never tells UpgradeChecker to re-check -- so
/api/backends/upgrades continued to list a just-upgraded backend
as upgradeable, indistinguishable from a failed upgrade, for up
to six hours.

Add an optional `OnBackendOpCompleted func()` hook on
GalleryService that fires after every successful install /
upgrade / delete on the backend channel (async, so a slow
callback doesn't stall the queue). startup.go wires it to
UpgradeChecker.TriggerCheck after both services exist. Result:
the upgrade banner clears within milliseconds of the worker
finishing.
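
The hook's shape, reduced to a runnable sketch; everything beyond the
OnBackendOpCompleted field name is invented for illustration:

    package main

    import (
        "fmt"
        "time"
    )

    type GalleryService struct {
        // Fires after every successful install / upgrade / delete
        // processed on the backend channel.
        OnBackendOpCompleted func()
    }

    func (g *GalleryService) finishBackendOp() {
        if cb := g.OnBackendOpCompleted; cb != nil {
            go cb() // async: a slow callback must not stall the op queue
        }
    }

    func main() {
        g := &GalleryService{}
        // startup.go analog: wire the upgrade checker once both exist.
        g.OnBackendOpCompleted = func() { fmt.Println("TriggerCheck") }
        g.finishBackendOp()
        time.Sleep(10 * time.Millisecond) // let the demo goroutine run
    }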

Assisted-by: Claude:Opus 4.7

* build: prepend GOPATH/bin to PATH for protogen-go

install-go-tools runs `go install` for protoc-gen-go and
protoc-gen-go-grpc, which writes them into `go env GOPATH`/bin.
That directory isn't on every dev's PATH, and protoc resolves
its code-gen plugins via PATH, so the immediately-following
protoc invocation fails with
  "protoc-gen-go: program not found"
which in turn blocks `make build` and any
`make backends/%` target that depends on build.

Prepend `go env GOPATH`/bin to PATH for the protoc invocation
so the freshly-installed plugins are found without requiring a
shell-profile change.

Assisted-by: Claude:Opus 4.7

* refactor(ui-api): non-blocking backend upgrade handler with opcache

POST /api/backends/upgrade/:name used to send the ManagementOp
directly onto the unbuffered BackendGalleryChannel, which blocked
the HTTP request whenever the galleryop worker was busy with a
prior operation. The op also didn't show up in /api/operations,
so the Backends UI couldn't reflect upgrade progress on the
affected row.

Register the op in opcache immediately, wrap it in a cancellable
context, store the cancellation function on the GalleryService,
and push onto the channel from a goroutine so the handler
returns right away. Response gains a `jobID` field and a
`message` string so clients have a consistent handle regardless
of whether the op is queued or running.
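
The handoff in miniature; op and cache types here are simplified
stand-ins, and the real handler responds over HTTP rather than printing:

    package main

    import (
        "context"
        "fmt"
    )

    type op struct {
        id     string
        ctx    context.Context
        cancel context.CancelFunc
    }

    func main() {
        ch := make(chan op)            // unbuffered, like BackendGalleryChannel
        opcache := map[string]string{} // /api/operations analog
        done := make(chan struct{})

        go func() { // galleryop worker analog
            o := <-ch
            opcache[o.id] = "done"
            close(done)
        }()

        // Handler: register the op first, then push from a goroutine so
        // the response is immediate even when the worker is busy.
        ctx, cancel := context.WithCancel(context.Background())
        defer cancel() // cancellation handle kept so the op can be aborted
        o := op{id: "job-123", ctx: ctx, cancel: cancel}
        opcache[o.id] = "queued"
        go func() { ch <- o }()

        fmt.Println(`{"jobID": "job-123", "message": "upgrade queued"}`)
        <-done
    }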

Pairs with the OnBackendOpCompleted hook added in the galleryop
commit — together the UI sees the upgrade start, watches
progress via /api/operations, and drops the "upgradeable" flag
the moment the worker finishes.

Assisted-by: Claude:Opus 4.7
2026-04-24 08:50:34 +02:00
Ettore Di Giacinto
f5eb13d3c2 feat(insightface): add antispoofing (liveness) detection (#9515)
* feat(insightface): add antispoofing (liveness) detection

Light up the anti_spoofing flag that was parked during the first pass.
Both FaceVerify and FaceAnalyze now run the Silent-Face MiniFASNetV2 +
MiniFASNetV1SE ensemble (~4 MB, Apache 2.0, CPU <10ms) when the flag is
set. Failed liveness on either image vetoes FaceVerify regardless of
embedding similarity. Every insightface* gallery entry now ships the
MiniFASNet ONNX weights so existing packs light up after reinstall.

Setting the flag against a model without the MiniFASNet files returns
FAILED_PRECONDITION (HTTP 412) with a clear install message — no
silent is_real=false.

FaceVerifyResponse gained per-image img{1,2}_is_real and
img{1,2}_antispoof_score (proto 9-12); FaceAnalysis's existing
is_real/antispoof_score fields are now populated. Schema fields are
pointers so they are fully absent from the JSON response when
anti_spoofing was not requested — avoids collapsing "not checked" with
"checked and fake" under Go's omitempty on bool.

Validated end-to-end over HTTP against a local install:
- verify + anti_spoofing, both real -> verified=true, score ~0.76
- verify + anti_spoofing, img2 spoof -> verified=false, img2_is_real=false
- analyze + anti_spoofing -> is_real and score per face
- flag against model without MiniFASNet -> HTTP 412 fail-loud

Assisted-by: Claude:claude-opus-4-7 go vet

* test(insightface): wire test target into test-extra

The root Makefile's `test-extra` already runs
`$(MAKE) -C backend/python/insightface test`, but the backend's
Makefile never defined the target — so the command silently errored
and the suite was never executed in CI. Adding the two-line target
(matching ace-step/Makefile) hooks `test.sh` → `runUnittests` →
`python -m unittest test.py`, which discovers both the pre-existing
engine classes (InsightFaceEngineTest, OnnxDirectEngineTest) and the
new AntispoofingTest. Each class skips gracefully when its weights
can't be downloaded from a network-restricted runner.

Assisted-by: Claude:claude-opus-4-7

* test(insightface): exercise antispoofing in e2e-backends (both paths)

Add a `face_antispoof` capability to the Ginkgo e2e suite and extend
the existing FaceVerify + FaceAnalyze specs with liveness assertions
covering BOTH paths:

  real fixture -> is_real=true, score>0, verified stays true
  spoof fixture -> is_real=false, verified vetoed to false

The spoof fixture is upstream's own `image_F2.jpg` (via the yakhyo
mirror) — verified locally against the MiniFASNetV2+V1SE ensemble to
classify as is_real=false with score ~0.013. That makes the assertion
deterministic across CI runs; synthetic/derived spoofs fool the model
unpredictably and would be flaky.

Makefile wires it up end-to-end:
- New INSIGHTFACE_ANTISPOOF_* cache dir + two ONNX downloads with
  pinned SHAs, matching the gallery entries.
- insightface-antispoof-models target shared by both backend configs.
- FACE_SPOOF_IMAGE_URL passed via BACKEND_TEST_FACE_SPOOF_IMAGE_URL.
- Both e2e targets (buffalo-sc + opencv) now:
  * depend on insightface-antispoof-models
  * pass antispoof_v2_onnx / antispoof_v1se_onnx in BACKEND_TEST_OPTIONS
  * include face_antispoof in BACKEND_TEST_CAPS

backend_test.go adds the new capability constant and a faceSpoofFile
fixture resolved the same way as faceFile1/2/3. Spoof assertions are
gated on both capFaceAntispoof AND faceSpoofFile being set, so a test
config that omits the spoof fixture degrades gracefully to "real path
only" instead of failing.

Assisted-by: Claude:claude-opus-4-7 go vet
2026-04-23 18:28:15 +02:00
Ettore Di Giacinto
ed648b3b4e fix(llama-cpp): include server-chat.cpp in grpc-server translation unit (#9511)
* fix(llama-cpp): include server-chat.cpp in grpc-server translation unit

Upstream llama.cpp refactor (ggml-org/llama.cpp#20690) moved the
OAI/Anthropic/Responses and transcription conversion helpers out of
server-common.cpp into a new server-chat.cpp, and server-task.cpp and
server-context.cpp now call those symbols (convert_transcriptions_to_chatcmpl,
server_chat_convert_responses_to_chatcmpl, server_chat_convert_anthropic_to_oai,
server_chat_msg_diff_to_json_oaicompat) via server-chat.h.

grpc-server.cpp builds as a single translation unit by #include-ing the
upstream .cpp files directly. Without including server-chat.cpp, the
declarations are satisfied at compile time via server-chat.h but the
link step fails with undefined references once LLAMA_VERSION crosses
the refactor commit (134d6e54).

Guard the include with __has_include so the same source stays buildable
on older LLAMA_VERSION pins that predate the refactor (where prepare.sh
won't copy server-chat.cpp into tools/grpc-server/).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(llama-cpp): bump LLAMA_VERSION to 0d0764dfd

Bump to ggml-org/llama.cpp@0d0764dfd2.
Paired with the preceding grpc-server server-chat.cpp include so the
refactor at 134d6e54 links cleanly. Supersedes PR #9494.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-23 14:59:39 +02:00
Ettore Di Giacinto
04f1a0285d fix(ik-llama-cpp): adapt to common_grammar struct in sampling.h (#9512)
Upstream ik_llama.cpp commit e0596bf6 ("Autoparser") changed
common_params_sampling::grammar from std::string to a common_grammar
struct (type + grammar), which broke our two direct accesses:

 - JSON ingest fed the field through json_value<common_grammar>(...),
   for which nlohmann has no from_json adapter.
 - JSON export emitted the struct directly, for which nlohmann has no
   to_json adapter.

Wrap the incoming JSON string in common_grammar{COMMON_GRAMMAR_TYPE_USER, ...}
and serialize via the inner .grammar member, mirroring upstream's
examples/server/server-context.cpp.

Also bump IK_LLAMA_VERSION to 286ce324baed17c95faec77792eaa6bdb1c7a5f5
so the local-ai side lines up with the dependency bump in #9496.

Assisted-by: Claude-Code:claude-opus-4-7
2026-04-23 13:45:06 +02:00
Ettore Di Giacinto
181ebb6df4 feat: voice recognition (#9500)
* feat(voice-recognition): add /v1/voice/{verify,analyze,embed} + speaker-recognition backend

Audio analog to face recognition. Adds three gRPC RPCs
(VoiceVerify / VoiceAnalyze / VoiceEmbed), their Go service and HTTP
layers, a new FLAG_SPEAKER_RECOGNITION capability flag, and a Python
backend scaffold under backend/python/speaker-recognition/ wrapping
SpeechBrain ECAPA-TDNN with a parallel OnnxDirectEngine for
WeSpeaker / 3D-Speaker ONNX exports.

The kokoros Rust backend gets matching unimplemented trait stubs —
tonic's async_trait has no defaults, so adding an RPC without Rust
stubs breaks the build (same regression fixed by eb01c772 for face).

Swagger, /api/instructions, and the auth RouteFeatureRegistry /
APIFeatures list are updated so the endpoints surface everywhere a
client or admin UI looks.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): add 1:N identify + register/forget endpoints

Mirrors the face-recognition register/identify/forget surface. New
package core/services/voicerecognition/ carries a Registry interface
and a local-store-backed implementation (same in-memory vector-store
plumbing facerecognition uses, separate instance so the embedding
spaces stay isolated).

Handlers under /v1/voice/{register,identify,forget} reuse
backend.VoiceEmbed to compute the probe vector, then delegate the
nearest-neighbour search to the registry. Default cosine-distance
threshold is tuned for ECAPA-TDNN on VoxCeleb (0.25, EER ~1.9%).
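
The verify decision then reduces to a distance-versus-threshold check.
A sketch assuming distance = 1 - cosine similarity; the vectors below
are toy stand-ins for real 192-d embeddings:

    package main

    import (
        "fmt"
        "math"
    )

    // cosineDistance = 1 - (a·b)/(|a||b|); 0 for identical directions.
    func cosineDistance(a, b []float32) float64 {
        var dot, na, nb float64
        for i := range a {
            dot += float64(a[i]) * float64(b[i])
            na += float64(a[i]) * float64(a[i])
            nb += float64(b[i]) * float64(b[i])
        }
        return 1 - dot/(math.Sqrt(na)*math.Sqrt(nb))
    }

    func main() {
        probe := []float32{0.1, 0.9, 0.4}
        enrolled := []float32{0.1, 0.8, 0.5}

        const threshold = 0.25 // default from this message (VoxCeleb tuning)
        d := cosineDistance(probe, enrolled)
        fmt.Printf("distance=%.4f verified=%v\n", d, d < threshold)
    }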

As with the face registry, the current backing is in-memory only — a
pgvector implementation is a future constructor-level swap.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): gallery, docs, CI and e2e coverage

- backend/index.yaml: speaker-recognition backend entry + CPU and
  CUDA-12 image variants (plus matching development variants).
- gallery/index.yaml: speechbrain-ecapa-tdnn (default) and
  wespeaker-resnet34 model entries. The WeSpeaker SHA-256 is a
  deliberate placeholder — the HF URI must be curl'd and its hash
  filled in before the entry installs.
- docs/content/features/voice-recognition.md: API reference + quickstart,
  mirrors the face-recognition docs.
- React UI: CAP_SPEAKER_RECOGNITION flag export (consumers follow face's
  precedent — no dedicated tab yet).
- tests/e2e-backends: voice_embed / voice_verify / voice_analyze specs.
  Helper resolveFaceFixture is reused as-is — the only thing face/voice
  share is "download a file into workDir", so no need for a new helper.
- Makefile: docker-build-speaker-recognition + test-extra-backend-
  speaker-recognition-{ecapa,all} targets. Audio fixtures default to
  VCTK p225/p226 samples from HuggingFace.
- CI: test-extra.yml grows a tests-speaker-recognition-grpc job
  mirroring insightface. backend.yml matrix gains CPU + CUDA-12 image
  build entries — scripts/changed-backends.js auto-picks these up.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): wire a working /v1/voice/analyze head

Adds AnalysisHead: a lazy-loading age / gender / emotion inference
wrapper that plugs into both SpeechBrainEngine and OnnxDirectEngine.

Defaults to two open-licence HuggingFace checkpoints:
  - audeering/wav2vec2-large-robust-24-ft-age-gender (Apache 2.0) —
    age regression + 3-way gender (female / male / child).
  - superb/wav2vec2-base-superb-er (Apache 2.0) — 4-way emotion.

Both are optional and degrade gracefully when transformers or the
model can't be loaded — the engine raises NotImplementedError so the
gRPC layer returns 501 instead of a generic 500.

Emotion classes pass through from the model (neutral/happy/angry/sad
on the default checkpoint); the e2e test now accepts any non-empty
dominant gender so custom age_gender_model overrides don't fail it.

Adds transformers to the backend's CPU and CUDA-12 requirements.

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): pin real WeSpeaker ResNet34 ONNX SHA-256

Replaces the placeholder hash in gallery/index.yaml with the actual
SHA-256 (7bb2f06e…) of the upstream
Wespeaker/wespeaker-voxceleb-resnet34-LM ONNX at ~25MB. `local-ai
models install wespeaker-resnet34` now succeeds.

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): soundfile loader + honest analyze default

Two issues surfaced on first end-to-end smoke with the actual backend
image:

1. torchaudio.load in torchaudio 2.8+ requires the torchcodec package
   for audio decoding. Switch SpeechBrainEngine._load_waveform to the
   already-present soundfile (listed in requirements.txt) plus a numpy
   linear resample to 16kHz. Drops a heavy ffmpeg-linked dep and the
   codepath we never exercise (torchaudio's ffmpeg backend).

2. The AnalysisHead was defaulting to audeering/wav2vec2-large-robust-
   24-ft-age-gender, but AutoModelForAudioClassification silently
   mangles that checkpoint — it reports the age head weights as
   UNEXPECTED and re-initialises the classifier head with random
   values, so the "gender" output is noise and there is no age output
   at all. Make age/gender opt-in instead (empty default; users wire
   a cleanly-loadable Wav2Vec2ForSequenceClassification checkpoint via
   age_gender_model: option). Emotion keeps its working Superb default.
   Also broaden _infer_age_gender's tensor-shape handling and catch
   runtime exceptions so a dodgy age/gender head never takes down the
   whole analyze call.

Docs and README updated to match the new policy.

Verified with the branch-scoped gallery on localhost:
- voice/embed    → 192-d ECAPA-TDNN vector
- voice/verify   → same-clip dist≈6e-08 verified=true; cross-speaker
                   dist 0.76–0.99 verified=false (as expected)
- voice/register/identify/forget → round-trip works, 404 on unknown id
- voice/analyze  → emotion populated, age/gender omitted (opt-in)

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): real CI audio fixtures + fixture-agnostic verify spec

Two issues surfaced after CI actually ran the speaker-recognition e2e
target (I'd curl-tested against a running server but hadn't run the
make target locally):

1. The default BACKEND_TEST_VOICE_AUDIO_* URLs pointed at
   huggingface.co/datasets/CSTR-Edinburgh/vctk paths that return 404
   (the dataset is gated). Swap them for the speechbrain test samples
   served from github.com/speechbrain/speechbrain/raw/develop/ —
   public, no auth, correct 16kHz mono format.

2. The VoiceVerify spec required d(file1,file2) < 0.4, assuming
   file1/file2 were same-speaker. The speechbrain samples are three
   different speakers (example1/2/5), and there is no easy un-gated
   source of true same-speaker audio pairs (VoxCeleb/VCTK/LibriSpeech
   are all license- or size-gated for CI use). Replace the ceiling
   check with a relative-ordering assertion: d(pair) > d(same-clip)
   for both file2 and file3 — that's enough to prove the embeddings
   encode speaker info, and it works with any three non-identical
   clips. Actual speaker ordering d(1,2) vs d(1,3) is logged but not
   asserted.
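
The weakened-but-robust property in plain Go; the distances are
stand-ins shaped like the figures reported earlier in this log:

    package main

    import "fmt"

    func main() {
        // Stand-in distances in the shape the suite observes:
        dSame := 6e-08 // d(file1, file1), identical clip
        d12 := 0.82    // d(file1, file2)
        d13 := 0.91    // d(file1, file3)

        // Asserted: any non-identical pair is farther apart than a
        // clip is from itself, i.e. the embeddings carry speaker info.
        fmt.Println("separation holds:", d12 > dSame && d13 > dSame)

        // Logged but not asserted: which cross-speaker pair is closer.
        fmt.Printf("d12=%.2f d13=%.2f\n", d12, d13)
    }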

Local run: 4/4 voice specs pass (Health, LoadModel, VoiceEmbed,
VoiceVerify) on the built backend image. 12 non-voice specs skipped
as expected.

Assisted-by: Claude:claude-opus-4-7

* fix(ci): checkout with submodules in the reusable backend_build workflow

The kokoros Rust backend build fails with

    failed to read .../sources/Kokoros/kokoros/Cargo.toml: No such file

because the reusable backend_build.yml workflow's actions/checkout
step was missing `submodules: true`. Dockerfile.rust does `COPY .
/LocalAI`, and without the submodule files the subsequent `cargo
build` can't find the vendored Kokoros crate.

The bug pre-dates this PR — scripts/changed-backends.js only triggers
the kokoros image job when something under backend/rust/kokoros or
the shared proto changes, so master had been coasting past it. The
voice-recognition proto addition re-broke it.

Other checkouts in backend.yml (llama-cpp-darwin) and test-extra.yml
(insightface, kokoros, speaker-recognition) already pass
`submodules: true`; this brings the shared backend image builder in
line.

Assisted-by: Claude:claude-opus-4-7
2026-04-23 12:07:14 +02:00
LocalAI [bot]
eb00d9b178 chore: ⬆️ Update leejet/stable-diffusion.cpp to c97702e1057c2fe13a7074cd9069cb9dd6edc1bf (#9495)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-23 09:32:21 +02:00
Ettore Di Giacinto
eb01c77214 fix(kokoros): implement face_verify and face_analyze trait stubs (#9499)
The backend.proto was updated to add FaceVerify and FaceAnalyze RPCs
(face detection support), but the Rust KokorosService was never updated
to match the regenerated tonic trait, breaking compilation with E0046:

    not all trait items implemented, missing: `face_verify`, `face_analyze`

Stubs both methods as unimplemented, matching the pattern used for the
other RPCs Kokoros does not support.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-04-22 22:51:18 +02:00
orbisai0security
bbeacf140d fix: remove unsafe sprintf() in grpc-server.cpp (#9486)
fix: V-001 security vulnerability

Automated security fix generated by Orbis Security AI
2026-04-22 21:57:29 +02:00
Ettore Di Giacinto
20baec77ab feat(face-recognition): add insightface/onnx backend for 1:1 verify, 1:N identify, embedding, detection, analysis (#9480)
* feat(face-recognition): add insightface backend for 1:1 verify, 1:N identify, embedding, detection, analysis

Adds face recognition as a new first-class capability in LocalAI via the
`insightface` Python backend, with a pluggable two-engine design so
non-commercial (insightface model packs) and commercial-safe
(OpenCV Zoo YuNet + SFace) models share the same gRPC/HTTP surface.

New gRPC RPCs (backend/backend.proto):
  * FaceVerify(FaceVerifyRequest) returns FaceVerifyResponse
  * FaceAnalyze(FaceAnalyzeRequest) returns FaceAnalyzeResponse

Existing Embedding and Detect RPCs are reused (face image in
PredictOptions.Images / DetectOptions.src) for face embedding and
face detection respectively.

New HTTP endpoints under /v1/face/:
  * verify     — 1:1 image pair same-person decision
  * analyze    — per-face age + gender (emotion/race reserved)
  * register   — 1:N enrollment; stores embedding in vector store
  * identify   — 1:N recognition; detect → embed → StoresFind
  * forget     — remove a registered face by opaque ID

Service layer (core/services/facerecognition/) introduces a
`Registry` interface with one in-memory `storeRegistry` impl backed
by LocalAI's existing local-store gRPC vector backend. HTTP handlers
depend on the interface, not on StoresSet/StoresFind directly, so a
persistent PostgreSQL/pgvector implementation can be slotted in via a
single constructor change in core/application (TODO marker in the
package doc).
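
A plausible shape for that interface; only the register/identify/forget
surface is given by this message, so the method signatures are
illustrative:

    package facerecognition

    import "context"

    // Registry is the seam the HTTP handlers depend on. The in-memory
    // storeRegistry implements it over the local-store gRPC backend; a
    // persistent pgvector implementation would slot in behind the same
    // interface.
    type Registry interface {
        Register(ctx context.Context, id string, embedding []float32) error
        Identify(ctx context.Context, probe []float32, topK int) ([]Match, error)
        Forget(ctx context.Context, id string) error
    }

    // Match pairs a registered ID with its distance to the probe.
    type Match struct {
        ID       string
        Distance float32
    }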

New usecase flag FLAG_FACE_RECOGNITION; insightface is also wired
into FLAG_DETECTION so /v1/detection works for face bounding boxes.

Gallery (backend/index.yaml) ships three entries:
  * insightface-buffalo-l   — SCRFD-10GF + ArcFace R50 + genderage
                              (~326MB pre-baked; non-commercial research use only)
  * insightface-opencv      — YuNet + SFace (~40MB pre-baked; Apache 2.0)
  * insightface-buffalo-s   — SCRFD-500MF + MBF (runtime download; non-commercial)

Python backend (backend/python/insightface/):
  * engines.py — FaceEngine protocol with InsightFaceEngine and
    OnnxDirectEngine; resolves model paths relative to the backend
    directory so the same gallery config works in docker-scratch and
    in the e2e-backends rootfs-extraction harness.
  * backend.py — gRPC servicer implementing Health, LoadModel, Status,
    Embedding, Detect, FaceVerify, FaceAnalyze.
  * install.sh — pre-bakes buffalo_l + OpenCV YuNet/SFace inside the
    backend directory so first-run is offline-clean (the final scratch
    image only preserves files under /<backend>/).
  * test.py — parametrized unit tests over both engines.

Tests:
  * Registry unit tests (go test -race ./core/services/facerecognition/...)
    — in-memory fake grpc.Backend, table-driven, covers register/
    identify/forget/error paths + concurrent access.
  * tests/e2e-backends/backend_test.go extended with face caps
    (face_detect, face_embed, face_verify, face_analyze); relative
    ordering + configurable verifyCeiling per engine.
  * Makefile targets: test-extra-backend-insightface-buffalo-l,
    -opencv, and the -all aggregate.
  * CI: .github/workflows/test-extra.yml gains tests-insightface-grpc,
    auto-triggered by changes under backend/python/insightface/.

Docs:
  * docs/content/features/face-recognition.md — feature page with
    license table, quickstart (defaults to the commercial-safe model),
    models matrix, API reference, 1:N workflow, storage caveats.
  * Cross-refs in object-detection.md, stores.md, embeddings.md, and
    whats-new.md.
  * Contributor README at backend/python/insightface/README.md.

Verified end-to-end:
  * buffalo_l: 6/6 specs (health, load, face_detect, face_embed,
    face_verify, face_analyze).
  * opencv: 5/5 specs (same minus face_analyze — SFace has no
    demographic head; correctly skipped via BACKEND_TEST_CAPS).

Assisted-by: Claude:claude-opus-4-7

* fix(face-recognition): move engine selection to model gallery, collapse backend entries

The previous commit put engine/model_pack options on backend gallery
entries (`backend/index.yaml`). That was wrong — `GalleryBackend`
(core/gallery/backend_types.go:32) has no `options` field, so the
YAML decoder silently dropped those keys and all three "different
insightface-*" backend entries resolved to the same container image
with no distinguishing configuration.

Correct split:

  * `backend/index.yaml` now has ONE `insightface` backend entry
    shipping the CPU + CUDA 12 container images. The Python backend
    bundles both the non-commercial insightface model packs
    (buffalo_l / buffalo_s) and the commercial-safe OpenCV Zoo
    weights (YuNet + SFace); the active engine is selected at
    LoadModel time via `options: ["engine:..."]`.

  * `gallery/index.yaml` gains three model entries —
    `insightface-buffalo-l`, `insightface-opencv`,
    `insightface-buffalo-s` — each setting the appropriate
    `overrides.backend` + `overrides.options` so installing one
    actually gives the user the intended engine. This matches how
    `rfdetr-base` lives in the model gallery against the `rfdetr`
    backend.

The earlier e2e tests passed despite this bug because the Makefile
targets pass `BACKEND_TEST_OPTIONS` directly to LoadModel via gRPC,
bypassing any gallery resolution entirely. No code changes needed.

Assisted-by: Claude:claude-opus-4-7

* feat(face-recognition): cover all supported models in the gallery + drop weight baking

Follows up on the model-gallery split: adds entries for every model
configuration either engine actually supports, and switches weight
delivery from image-baked to LocalAI's standard gallery mechanism.

Gallery now has seven `insightface-*` model entries (gallery/index.yaml):

  insightface (family)  — non-commercial research use
    • buffalo-l   (326MB)  — SCRFD-10GF + ResNet50 + genderage, default
    • buffalo-m   (313MB)  — SCRFD-2.5GF + ResNet50 + genderage
    • buffalo-s   (159MB)  — SCRFD-500MF + MBF + genderage
    • buffalo-sc  (16MB)   — SCRFD-500MF + MBF, recognition only
                             (no landmarks, no demographics — analyze
                             returns empty attributes)
    • antelopev2  (407MB)  — SCRFD-10GF + ResNet100@Glint360K + genderage

  OpenCV Zoo family — Apache 2.0 commercial-safe
    • opencv       — YuNet + SFace fp32 (~40MB)
    • opencv-int8  — YuNet + SFace int8 (~12MB, ~3x smaller, faster on CPU)

Model weights are no longer baked into the backend image. The image
now ships only the Python runtime + libraries (~275MB content size,
~1.18GB disk vs ~1.21GB when weights were baked). Weights flow through
LocalAI's gallery mechanism:

  * OpenCV variants list `files:` with ONNX URIs + SHA-256, so
    `local-ai models install insightface-opencv` pulls them into the
    models directory exactly like any other gallery-managed model.

  * insightface packs (upstream distributes .zip archives only, not
    individual ONNX files) auto-download on first LoadModel via
    FaceAnalysis' built-in machinery, rooted at the LocalAI models
    directory so they live alongside everything else — same pattern
    `rfdetr` uses with `inference.get_model()`.

Backend changes (backend/python/insightface/):

  * backend.py — LoadModel propagates `ModelOptions.ModelPath` (the
    LocalAI models directory) to engines via a `_model_dir` hint.
    This replaces the earlier ModelFile-dirname approach; ModelPath
    is the canonical "models directory" variable set by the Go loader
    (pkg/model/initializers.go:144) and is always populated.

  * engines.py::_resolve_model_path — picks up `model_dir` and searches
    it (plus basename-in-model-dir) before falling back to the dev
    script-dir. This is how OnnxDirectEngine finds gallery-downloaded
    YuNet/SFace files by filename only.

  * engines.py::_flatten_insightface_pack — new helper that works
    around an upstream packaging inconsistency: buffalo_l/s/sc zips
    expand flat, but buffalo_m and antelopev2 zips wrap their ONNX
    files in a redundant `<name>/` directory. insightface's own
    loader looks one level too shallow and fails. We call
    `ensure_available()` explicitly, flatten if nested, then hand to
    FaceAnalysis.

  * engines.py::InsightFaceEngine.prepare — root-resolution order now
    includes the `_model_dir` hint so packs download into the LocalAI
    models directory by default.

  * install.sh — no longer pre-downloads any weights. Everything is
    gallery-managed now.

  * smoke.py (new) — parametrized smoke test that iterates over every
    gallery configuration, simulating the LocalAI install flow
    (creates a models dir, fetches OpenCV files with checksum
    verification, lets insightface auto-download its packs), then
    runs detect + embed + verify (+ analyze where supported) through
    the in-process BackendServicer.

  * test.py — OnnxDirectEngineTest no longer hardcodes `/models/opencv/`
    paths; downloads ONNX files to a temp dir at setUpClass time and
    passes ModelPath accordingly.

Registry change (core/services/facerecognition/store_registry.go):

  * `dim=0` in NewStoreRegistry now means "accept whatever dimension
    arrives" — needed because the backend supports 512-d ArcFace/MBF
    and 128-d SFace via the same Registry. A non-zero dim still fails
    fast with ErrDimensionMismatch.

  * core/application plumbs `faceEmbeddingDim = 0`, explaining the
    rationale in the comment.

Backend gallery description updated to reflect that the image carries
no weights — it's just Python + engines.

Smoke-tested all 7 configurations against the rebuilt image (with the
flatten fix applied), exit 0:

    PASS: insightface-buffalo-l    faces=6 dim=512 same-dist=0.000
    PASS: insightface-buffalo-sc   faces=6 dim=512 same-dist=0.000
    PASS: insightface-buffalo-s    faces=6 dim=512 same-dist=0.000
    PASS: insightface-buffalo-m    faces=6 dim=512 same-dist=0.000
    PASS: insightface-antelopev2   faces=6 dim=512 same-dist=0.000
    PASS: insightface-opencv       faces=6 dim=128 same-dist=0.000
    PASS: insightface-opencv-int8  faces=6 dim=128 same-dist=0.000
    7/7 passed

Assisted-by: Claude:claude-opus-4-7

* fix(face-recognition): pre-fetch OpenCV ONNX for e2e target; drop stale pre-baked claim

CI regression from the previous commit: I moved OpenCV Zoo weight
delivery to LocalAI's gallery `files:` mechanism, but the
test-extra-backend-insightface-opencv target was still passing
relative paths `detector_onnx:models/opencv/yunet.onnx` in
BACKEND_TEST_OPTIONS. The e2e suite drives LoadModel directly over
gRPC without going through the gallery, so those relative paths
resolved to nothing and OpenCV's ONNXImporter failed:

    LoadModel failed: Failed to load face engine:
    OpenCV(4.13.0) ... Can't read ONNX file: models/opencv/yunet.onnx

Fix: add an `insightface-opencv-models` prerequisite target that
fetches the two ONNX files (YuNet + SFace) to a deterministic host
cache at /tmp/localai-insightface-opencv-cache/, verifies SHA-256,
and skips the download on re-runs. The opencv test target depends on
it and passes absolute paths in BACKEND_TEST_OPTIONS, so the backend
finds the files via its normal absolute-path resolution branch.

Also refresh the buffalo_l comment: it no longer says "pre-baked"
(nothing is — the pack auto-downloads from upstream's GitHub release
on first LoadModel, same as in CI).

Locally verified: `make test-extra-backend-insightface-opencv` passes
5/5 specs (health, load, face_detect, face_embed, face_verify).

Assisted-by: Claude:claude-opus-4-7

* feat(face-recognition): add POST /v1/face/embed + correct /v1/embeddings docs

The docs promised that /v1/embeddings returns face vectors when you
send an image data-URI. That was never true: /v1/embeddings is
OpenAI-compatible and text-only by contract — its handler goes
through `core/backend/embeddings.go::ModelEmbedding`, which sets
`predictOptions.Embeddings = s` (a string of TEXT to embed) and never
populates `predictOptions.Images[]`. The Python backend's Embedding
gRPC method does handle Images[] (that's how /v1/face/register reaches
it internally via `backend.FaceEmbed`), but the HTTP embeddings
endpoint wasn't wired to populate it.

Rather than overload /v1/embeddings with image-vs-text detection —
messy, and the endpoint is OpenAI-compatible by design — add a
dedicated /v1/face/embed endpoint that wraps `backend.FaceEmbed`
(already used internally by /v1/face/register and /v1/face/identify).

Matches LocalAI's convention of a dedicated path per non-standard flow
(/v1/rerank, /v1/detection, /v1/face/verify etc.).

Response:

    {
      "embedding": [<dim> floats, L2-normed],
      "dim": int,           // 512 for ArcFace R50 / MBF, 128 for SFace
      "model": "<name>"
    }

Live-tested on the opencv engine: returns a 128-d L2-normalized vector
(sum(x^2) = 1.0000). Sentinel in docs updated to note /v1/embeddings
is text-only and point image users at /v1/face/embed instead.

Assisted-by: Claude:claude-opus-4-7

* fix(http): map malformed image input + gRPC status codes to proper 4xx

Image-input failures on LocalAI's single-image endpoints (/v1/detection,
/v1/face/{verify,analyze,embed,register,identify}) have historically
returned 500 — even when the client was the one who sent garbage.
Classic example: you POST an "image" that isn't a URL, isn't a
data-URI, and isn't a valid JPEG/PNG — the server shouldn't claim
that's its fault.

Two helpers land in core/http/endpoints/localai/images.go and every
single-image handler is switched over:

  * decodeImageInput(s)
      Wraps utils.GetContentURIAsBase64 and turns any failure
      (invalid URL, not a data-URI, download error, etc.) into
      echo.NewHTTPError(400, "invalid image input: ...").

  * mapBackendError(err)
      Inspects the gRPC status on a backend call error and maps:
        INVALID_ARGUMENT     → 400 Bad Request
        NOT_FOUND            → 404 Not Found
        FAILED_PRECONDITION  → 412 Precondition Failed
        UNIMPLEMENTED        → 501 Not Implemented
      All other codes fall through unchanged (still 500).
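
A sketch of that mapping, using the echo and grpc-status packages the
message names (helper internals beyond the described switch are
assumed):

    package localai

    import (
        "net/http"

        "github.com/labstack/echo/v4"
        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // mapBackendError converts gRPC status codes on backend errors into
    // client-appropriate HTTP errors; anything unrecognized stays a 500.
    func mapBackendError(err error) error {
        st, ok := status.FromError(err)
        if !ok {
            return err
        }
        switch st.Code() {
        case codes.InvalidArgument:
            return echo.NewHTTPError(http.StatusBadRequest, st.Message())
        case codes.NotFound:
            return echo.NewHTTPError(http.StatusNotFound, st.Message())
        case codes.FailedPrecondition:
            return echo.NewHTTPError(http.StatusPreconditionFailed, st.Message())
        case codes.Unimplemented:
            return echo.NewHTTPError(http.StatusNotImplemented, st.Message())
        }
        return err
    }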

Before, my 1×1 PNG error-path test returned:
    HTTP 500 "rpc error: code = InvalidArgument desc = failed to decode one or both images"
After:
    HTTP 400 "failed to decode one or both images"

Scope-limited to the LocalAI single-image endpoints. The multi-modal
paths (middleware/request.go, openresponses/responses.go,
openai/realtime.go) intentionally log-and-skip individual media parts
when decoding fails — different design intent (graceful degradation
of a multi-part message), not a 400-worthy failure. Left untouched.

Live-verified: every error case in /tmp/face_errors.py now returns
4xx with a meaningful message; the "image with no face (1x1 PNG)"
case specifically went from 500 → 400.

Assisted-by: Claude:claude-opus-4-7

* refactor(face-recognition): insightface packs go through gallery files:, drop FaceAnalysis

Follows up on the discovery that LocalAI's gallery `files:` mechanism
handles archives (zip, tar.gz, …) via mholt/archiver/v3 — the rhasspy
piper voices use exactly this pattern. Insightface packs are zip
archives, so we can now deliver them the same way every other
gallery-managed model gets delivered: declaratively, checksum-verified,
through LocalAI's standard download+extract pipeline.

Two changes:

1. Gallery (gallery/index.yaml) — every insightface-* entry gains a
   `files:` list with the pack zip's URI + SHA-256. `local-ai models
   install insightface-buffalo-l` now fetches the zip, verifies the
   hash, and extracts it into the models directory. No more reliance
   on insightface's library-internal `ensure_available()` auto-download
   or its hardcoded `BASE_REPO_URL`.

2. InsightFaceEngine (backend/python/insightface/engines.py) — drops
   the FaceAnalysis wrapper and drives insightface's `model_zoo`
   directly. The ~50 lines FaceAnalysis provides — glob ONNX files,
   route each through `model_zoo.get_model()`, build a
   `{taskname: model}` dict, loop per-face at inference — are
   reimplemented in `InsightFaceEngine`. The actual inference classes
   (RetinaFace, ArcFaceONNX, Attribute, Landmark) are still
   insightface's — we only replicate the glue, so drift risk against
   upstream is minimal.

   Why drop FaceAnalysis: it hard-codes a `<root>/models/<name>/*.onnx`
   layout that doesn't match what LocalAI's zip extraction produces.
   LocalAI unpacks archives flat into `<models_dir>`. Upstream packs
   are inconsistent — buffalo_l/s/sc ship ONNX at the zip root (lands
   at `<models_dir>/*.onnx`), buffalo_m/antelopev2 wrap in a redundant
   `<name>/` dir (lands at `<models_dir>/<name>/*.onnx`). The new
   `_locate_insightface_pack` helper searches both locations plus
   legacy paths and returns whichever has ONNX files. Replaces the
   earlier `_flatten_insightface_pack` helper (which tried to fight
   FaceAnalysis's layout expectations; now we just find the files
   wherever they are).

Net effect for users: install once via LocalAI's managed flow,
weights live alongside every other model, progress shows in the
jobs endpoint, no first-load network call. Same API surface,
cleaner plumbing.

Assisted-by: Claude:claude-opus-4-7

* fix(face-recognition): CI's insightface e2e path needs the pack pre-fetched

The e2e suite drives LoadModel over gRPC without going through LocalAI's
gallery flow, so the engine's `_model_dir` option (normally populated
from ModelPath) is empty. Previously the insightface target relied on
FaceAnalysis auto-download to paper over this, but we dropped
FaceAnalysis in favor of direct model_zoo calls — so the buffalo_l
target started failing at LoadModel with "no insightface pack found".

Mirror the opencv target's pre-fetch pattern: download buffalo_sc.zip
(same SHA as the gallery entry), extract it on the host, and pass
`root:<dir>` so the engine locates the pack without needing
ModelPath. Switched to buffalo_sc (smallest pack, ~16MB) to keep CI
fast; it covers the same insightface engine code path as buffalo_l.

Face analyze cap dropped since buffalo_sc has no age/gender head.

Assisted-by: Claude:claude-opus-4-7[1m]

* feat(face-recognition): surface face-recognition in advertised feature maps

The six /v1/face/* endpoints were missing from every place LocalAI
advertises its feature surface to clients:

  * api_instructions — the machine-readable capability index at
    GET /api/instructions. Added `face-recognition` as a dedicated
    instruction area with an intro that calls out the in-memory
    registry caveat and the /v1/face/embed vs /v1/embeddings split.
  * auth/permissions — added FeatureFaceRecognition constant, routed
    all six face endpoints through it so admins can gate them per-user
    like any other API feature. Default ON (matches the other API
    features).
  * React UI capabilities — CAP_FACE_RECOGNITION symbol mapped to
    FLAG_FACE_RECOGNITION. Declared only for now; the Face page is a
    follow-up (noted in the plan).

Instruction count bumped 9 → 10; test updated.

Assisted-by: Claude:claude-opus-4-7[1m]

* docs(agents): capture advertising-surface steps in the endpoint guide

Before this change, adding a new /v1/* endpoint reliably missed one or
more of: the swagger @Tags annotation, the /api/instructions registry,
the auth RouteFeatureRegistry, and the React UI CAP_* symbol. The
endpoint would work but be invisible to API consumers, admins, and the
UI — and nothing in the existing docs said to look in those places.

Extend .agents/api-endpoints-and-auth.md with a new "Advertising
surfaces" section covering all four surfaces (swagger tags, /api/
instructions, capabilities.js, docs/), and expand the closing checklist
so it's impossible to ship a feature without visiting each one. Hoist a
one-liner reminder into AGENTS.md's Quick Reference so agents skim it
before diving in.

Assisted-by: Claude:claude-opus-4-7[1m]
2026-04-22 21:55:41 +02:00
LocalAI [bot]
cd7b035716 chore: ⬆️ Update ggml-org/llama.cpp to 5a4cd6741fc33227cdacb329f355ab21f8481de2 (#9479)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-22 08:58:19 +02:00
Ettore Di Giacinto
39573ecd2a chore(whisperx): drop ROCm/hipblas build target (#9474)
whisperx has no upstream AMD GPU support and its core transcription path
(faster-whisper -> ctranslate2) falls back to CPU on AMD since the PyPI
ctranslate2 is CUDA-only. The torch rocm wheels would accelerate only the
alignment/diarization stages, producing a misleadingly half-working image.

Drop the hipblas variant rather than shipping a partially accelerated build
users can't distinguish from the real thing. AMD hosts now fall through
the capability map to cpu-whisperx / cpu-whisperx-development.

Also removes the now-dangling rocm-whisperx assertion from
pkg/system/capabilities_test.go and the ROCm mention from the whisperx
row in docs/content/reference/compatibility-table.md.

Assisted-by: Claude Code:claude-opus-4-7
2026-04-21 21:50:18 +02:00
LocalAI [bot]
8bb1e8f21f chore: ⬆️ Update ggml-org/llama.cpp to cf8b0dbda9ac0eac30ee33f87bc6702ead1c4664 (#9448)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-21 11:15:45 +02:00
LocalAI [bot]
cd94a0b61a chore: ⬆️ Update ggml-org/whisper.cpp to fc674574ca27cac59a15e5b22a09b9d9ad62aafe (#9450)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-21 11:09:05 +02:00
LocalAI [bot]
5973c0a9df chore: ⬆️ Update ikawrakow/ik_llama.cpp to d4824131580b94ffa7b0e91c955e2b237c2fe16e (#9447)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-21 00:07:19 +02:00
Ettore Di Giacinto
60633c4dd5 fix(stable-diffusion.ggml): force mp4 container in ffmpeg mux (#9435)
gen_video's ffmpeg subprocess was relying on the filename extension to
choose the output container. Distributed LocalAI hands the backend a
staging path (e.g. /staging/localai-output-NNN.tmp) that is renamed to
.mp4 only after the backend returns, so ffmpeg saw a .tmp extension and
bailed with "Unable to choose an output format". Inference had already
completed and the frames were piped in, producing the cryptic
"video inference failed (code 1)" at the API layer.

Pass -f mp4 explicitly so the container is selected by flag instead of
by filename suffix.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 00:41:54 +02:00
LocalAI [bot]
28091d626e chore: ⬆️ Update ikawrakow/ik_llama.cpp to 00ba208a5c036eee72d4a631b4f57c126095cb03 (#9430)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-20 00:01:48 +02:00
LocalAI [bot]
babbbc6ec8 chore: ⬆️ Update ggml-org/llama.cpp to 4eac5b45095a4e8a1ff1cce4f6d030e0872fb4ad (#9429)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-19 23:39:19 +02:00
LocalAI [bot]
3804497186 chore: ⬆️ Update leejet/stable-diffusion.cpp to 44cca3d626d301e2215d5e243277e8f0e65bfa78 (#9428)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-19 23:39:07 +02:00
Ettore Di Giacinto
510f791ccc feat(gallery): add stablediffusion-ggml-development meta backend
2026-04-19 20:16:33 +00:00
Ettore Di Giacinto
369c50a41c fix(turboquant): drop ignore-eos patch, bump fork to b8967-627ebbc (#9423)
* fix(turboquant): drop ignore-eos patch, bump fork to b8967-627ebbc

The upstream PR #21203 (server: respect the ignore_eos flag) has been
merged into the TheTom/llama-cpp-turboquant feature/turboquant-kv-cache
branch. With the fix now in-tree, 0001-server-respect-the-ignore-eos-flag.patch
no longer applies (git apply sees its additions already present) and the
nightly turboquant bump fails.

Retire the patch and bump the pin to the first fork revision that carries
the merged fix (tag feature-turboquant-kv-cache-b8967-627ebbc). This matches
the contract in apply-patches.sh: drop patches once the fork catches up.

* fix(turboquant): patch out get_media_marker() call in grpc-server copy

CI turboquant docker build was failing with:

  grpc-server.cpp:2825:40: error: use of undeclared identifier
  'get_media_marker'

The call was added by 7809c5f5 (PR #9412) to propagate the mtmd random
per-server media marker upstream landed in ggml-org/llama.cpp#21962. The
TheTom/llama-cpp-turboquant fork branched before that PR, so its
server-common.cpp has no such symbol.

Extend patch-grpc-server.sh to substitute get_media_marker() with the
legacy "<__media__>" literal in the build-time grpc-server.cpp copy
under turboquant-<flavor>-build/. The fork's mtmd_default_marker()
returns exactly that string, and the Go layer falls back to the same
sentinel when media_marker is empty, so behavior on the turboquant path
is unchanged. Patched copy only — the shared source under
backend/cpp/llama-cpp/ keeps compiling against vanilla upstream.

Verified by running `make docker-build-turboquant` locally end-to-end:
all five flavors (avx, avx2, avx512, fallback, grpc+rpc-server) now
compile past the previous failure and the image tags successfully.
2026-04-19 21:05:21 +02:00
Ettore Di Giacinto
9cd8d7951f fix(kokoros): implement audio_transcription_stream trait stub (#9422)
The backend.proto was updated to add AudioTranscriptionStream RPC, but
the Rust KokorosService was never updated to match the regenerated
tonic trait, breaking compilation with E0046.

Stubs the new streaming method as unimplemented, matching the pattern
used for the other streaming RPCs Kokoros does not support.
2026-04-19 13:29:58 +02:00
LocalAI [bot]
884bfb84c9 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 8befd92ea5f702494ea9813fe42a52fb015db5fe (#9418)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-19 09:27:11 +02:00
LocalAI [bot]
e94a9a8f10 chore: ⬆️ Update leejet/stable-diffusion.cpp to 7d33d4b2ddeafa672761a5880ec33bdff452504d (#9417)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-19 09:26:58 +02:00
Ettore Di Giacinto
054c4b4b45 feat(stable-diffusion.ggml): add support for video generation (#9420)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-19 09:26:33 +02:00
LocalAI [bot]
6e49dba27c chore: ⬆️ Update ggml-org/llama.cpp to 4f02d4733934179386cbc15b3454be26237940bb (#9415)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-19 09:26:05 +02:00
Keith Mattix II
8839a71c87 fix(rocm): add gfx1151 support and expose AMDGPU_TARGETS build-arg (#9410)
Add gfx1151 (AMD Strix Halo / Ryzen AI MAX) to the default AMDGPU_TARGETS
list in the llama-cpp backend Makefile. ROCm 7.2.1 ships with gfx1151
Tensile libraries, so this architecture should be included in default builds.

Also expose AMDGPU_TARGETS as an ARG/ENV in Dockerfile.llama-cpp so that
users building for non-default GPU architectures can override the target
list via --build-arg AMDGPU_TARGETS=<arch>. Previously, passing
-DAMDGPU_TARGETS=<arch> through CMAKE_ARGS was silently overridden by
the Makefile's own append of the default target list.
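
For example, a Strix Halo user (or anyone targeting an architecture
outside the default list) can now build with (image tag illustrative):

  docker build -f Dockerfile.llama-cpp \
      --build-arg AMDGPU_TARGETS=gfx1151 \
      -t llama-cpp-rocm-gfx1151 .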

Fixes #9374

Signed-off-by: Keith Mattix <keithmattix2@gmail.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-18 20:39:40 +02:00
Ettore Di Giacinto
117f6430b8 fix(turboquant): resolve common.h by detecting llama-common vs common target (#9413)
The shared grpc-server CMakeLists hardcoded `llama-common`, the post-rename
target name in upstream llama.cpp. The turboquant fork branched before that
rename and still exposes the helpers library as `common`, so the unknown
target name silently degraded to a plain `-lllama-common` link flag, the
PUBLIC include directory was never propagated, and tools/server/server-task.h
failed to find common.h during turboquant-<flavor> builds.
2026-04-18 20:30:28 +02:00
Ettore Di Giacinto
7809c5f5d0 fix(vision): propagate mtmd media marker from backend via ModelMetadata (#9412)
Upstream llama.cpp (PR #21962) switched the server-side mtmd media
marker to a random per-server string and removed the legacy
"<__media__>" backward-compat replacement in mtmd_tokenizer. The
Go layer still emitted the hardcoded "<__media__>", so on the
non-tokenizer-template path the prompt arrived with a marker mtmd
did not recognize and tokenization failed with "number of bitmaps
(1) does not match number of markers (0)".

Report the active media marker via ModelMetadataResponse.media_marker
and substitute the sentinel "<__media__>" with it right before the
gRPC call, after the backend has been loaded and probed. Also skip
the Go-side multimodal templating entirely when UseTokenizerTemplate
is true — llama.cpp's oaicompat_chat_params_parse already injects its
own marker and StringContent is unused in that path. Backends that do
not expose the field keep the legacy "<__media__>" behavior.
2026-04-18 20:30:13 +02:00
LocalAI [bot]
ad742738cb chore: ⬆️ Update ikawrakow/ik_llama.cpp to 52efa12fdae390d1dca6ecd7ca00010fe51f651e (#9404)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-18 09:21:32 +02:00
LocalAI [bot]
86c673fd94 chore: ⬆️ Update ggml-org/whisper.cpp to 166c20b473d5f4d04052e699f992f625ea2a2fdd (#9403)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-18 00:42:32 +02:00
Ettore Di Giacinto
c49feb546f fix(llama-cpp): rename linked target common -> llama-common (#9408)
Upstream llama.cpp (45cac7ca) renamed the CMake library target
`common` to `llama-common`. Linking the old name caused
`target_include_directories(... PUBLIC .)` from the common/ dir
to not propagate, so `#include "common.h"` failed when building
grpc-server.
2026-04-18 00:42:05 +02:00
LocalAI [bot]
7dbd9c056a chore: ⬆️ Update ggml-org/llama.cpp to 4fbdabdc61c04d1262b581e1b8c0c3b119f688ff (#9381)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-17 08:13:04 +02:00
Ettore Di Giacinto
5837b14888 chore: ⬆️ Update TheTom/llama-cpp-turboquant to `45f8a066ed5f5bb38c695cec532f6cef9f4efa9d` (#9385)
chore: ⬆️ Update TheTom/llama-cpp-turboquant to `45f8a066ed5f5bb38c695cec532f6cef9f4efa9d`

Drop 0002-ggml-rpc-bump-op-count-to-97.patch; the fork now has
GGML_OP_COUNT == 97 and RPC_PROTO_PATCH_VERSION 2 upstream.

Fetch all tags in backend/cpp/llama-cpp/Makefile so tag-only commits
(the new turboquant pin is reachable only through the tag
feature-turboquant-kv-cache-b8821-45f8a06) can be checked out.
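
The fetch step now amounts to something like (sketch, assuming the
checkout lives in llama.cpp/):

  # a commit reachable only through a tag is not downloaded by a
  # branch-only fetch, so pull tags explicitly before checking out
  git -C llama.cpp fetch --tags origin
  git -C llama.cpp checkout 45f8a066ed5f5bb38c695cec532f6cef9f4efa9d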
2026-04-17 08:12:21 +02:00
LocalAI [bot]
b6a68e5df4 chore: ⬆️ Update leejet/stable-diffusion.cpp to a564fdf642780d1df123f1c413b19961375b8346 (#9383)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-17 08:11:55 +02:00
LocalAI [bot]
c6dfb4acaf chore: ⬆️ Update ikawrakow/ik_llama.cpp to eaf83865a132f66e8f49efe0e78491625942f068 (#9382)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-17 08:11:41 +02:00
Ettore Di Giacinto
a0cbc46be9 refactor(tinygrad): reuse tinygrad.apps.llm instead of vendored Transformer (#9380)
Drop the 295-line vendor/llama.py fork in favor of `tinygrad.apps.llm`,
which now provides the Transformer blocks, GGUF loader (incl. Q4/Q6/Q8
quantization), KV-cache and generate loop we were maintaining ourselves.

What changed:
- New vendor/appsllm_adapter.py (~90 LOC) — HF -> GGUF-native state-dict
  keymap, Transformer kwargs builder, `_embed_hidden` helper, and a hard
  rejection of qkv_bias models (Qwen2 / 2.5 are no longer supported; the
  apps.llm Transformer hard-codes `bias=False` on Q/K/V projections).
- backend.py routes both safetensors and GGUF paths through
  apps.llm.Transformer. Generation now delegates to its (greedy-only)
  `generate()`; Temperature / TopK / TopP / RepetitionPenalty are still
  accepted on the wire but ignored — documented in the module docstring.
- Jinja chat render now passes `enable_thinking=False` so Qwen3's
  reasoning preamble doesn't eat the tool-call token budget on small
  models.
- Embedding path uses `_embed_hidden` (block stack + output_norm) rather
  than the custom `embed()` method we were carrying on the vendored
  Transformer.
- test.py gains TestAppsLLMAdapter covering the keymap rename, tied
  embedding fallback, unknown-key skipping, and qkv_bias rejection.
- Makefile fixtures move from Qwen/Qwen2.5-0.5B-Instruct to Qwen/Qwen3-0.6B
  (apps.llm-compatible) and tool_parser from qwen3_xml to hermes (the
  HF chat template emits hermes-style JSON tool calls).

Verified with the docker-backed targets:
  test-extra-backend-tinygrad             5/5 PASS
  test-extra-backend-tinygrad-embeddings  3/3 PASS
  test-extra-backend-tinygrad-whisper     4/4 PASS
  test-extra-backend-tinygrad-sd          3/3 PASS
2026-04-16 22:41:18 +02:00
Ettore Di Giacinto
b4e30692a2 feat(backends): add sglang (#9359)
* feat(backends): add sglang

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(sglang): force AVX-512 CXXFLAGS and disable CI e2e job

sgl-kernel's shm.cpp uses __m512 AVX-512 intrinsics unconditionally;
-march=native fails on CI runners without AVX-512 in /proc/cpuinfo.
Force -march=sapphirerapids so the build always succeeds, matching
sglang upstream's docker/xeon.Dockerfile recipe.

The resulting binary still requires an AVX-512 capable CPU at runtime,
so disable tests-sglang-grpc in test-extra.yml for the same reason
tests-vllm-grpc is disabled. Local runs with make test-extra-backend-sglang
still work on hosts with the right SIMD baseline.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(sglang): patch CMakeLists.txt instead of CXXFLAGS for AVX-512

CXXFLAGS with -march=sapphirerapids was being overridden by
add_compile_options(-march=native) in sglang's CPU CMakeLists.txt,
since CMake appends those flags after CXXFLAGS. Sed-patch the
CMakeLists.txt directly after cloning to replace -march=native.
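
The patch step is essentially (sketch; the exact CMakeLists location
inside the sglang checkout is assumed):

  # pin the SIMD baseline instead of probing the build host
  sed -i 's/-march=native/-march=sapphirerapids/g' \
      sgl-kernel/CMakeLists.txt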

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-16 22:40:56 +02:00
LocalAI [bot]
7f88a3ba30 chore: ⬆️ Update leejet/stable-diffusion.cpp to c41c5ded7af85e01b7fe442ff7950c720706d53a (#9366)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-16 09:04:33 +02:00
LocalAI [bot]
df2d25cee5 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 1163af96cf6bb4a4b819f998f84c153a49768b99 (#9368)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-16 01:13:08 +02:00
LocalAI [bot]
96cd561d9d chore: ⬆️ Update ggml-org/llama.cpp to b3d758750a268bf93f084ccfa3060fb9a203192a (#9370)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-16 01:12:39 +02:00
Ettore Di Giacinto
6f0051301b feat(backend): add tinygrad multimodal backend (experimental) (#9364)
* feat(backend): add tinygrad multimodal backend

Wire tinygrad as a new Python backend covering LLM text generation with
native tool-call extraction, embeddings, Stable Diffusion 1.x image
generation, and Whisper speech-to-text from a single self-contained
container.

Backend (`backend/python/tinygrad/`):
- `backend.py` gRPC servicer with LLM Predict/PredictStream (auto-detects
  Llama / Qwen2 / Mistral architecture from `config.json`, supports
  safetensors and GGUF), Embedding via mean-pooled last hidden state,
  GenerateImage via the vendored SD1.x pipeline, AudioTranscription +
  AudioTranscriptionStream via the vendored Whisper inference loop, plus
  Tokenize / ModelMetadata / Status / Free.
- Vendored upstream model code under `vendor/` (MIT, headers preserved):
  llama.py with an added `qkv_bias` flag for Qwen2-family bias support
  and an `embed()` method that returns the last hidden state, plus
  clip.py, unet.py, stable_diffusion.py (trimmed to drop the MLPerf
  training branch that pulls `mlperf.initializers`), audio_helpers.py
  and whisper.py (trimmed to drop the pyaudio listener).
- Pluggable tool-call parsers under `tool_parsers/`: hermes (Qwen2.5 /
  Hermes), llama3_json (Llama 3.1+), qwen3_xml (Qwen 3), mistral
  (Mistral / Mixtral). Auto-selected from model architecture or `Options`.
- `install.sh` pins Python 3.11.14 (tinygrad >=0.12 needs >=3.11; the
  default portable python is 3.10).
- `package.sh` bundles libLLVM.so.1 + libedit/libtinfo/libgomp/libsndfile
  into the scratch image. `run.sh` sets `CPU_LLVM=1` and `LLVM_PATH` so
  tinygrad's CPU device uses the in-process libLLVM JIT instead of
  shelling out to the missing `clang` binary.
- Local unit tests for Health and the four parsers in `test.py`.

Build wiring:
- Root `Makefile`: `.NOTPARALLEL`, `prepare-test-extra`, `test-extra`,
  `BACKEND_TINYGRAD = tinygrad|python|.|false|true`,
  docker-build-target eval, and `docker-build-backends` aggregator.
- `.github/workflows/backend.yml`: cpu / cuda12 / cuda13 build matrix
  entries (mirrors the transformers backend placement).
- `backend/index.yaml`: `&tinygrad` meta + cpu/cuda12/cuda13 image
  entries (latest + development).

E2E test wiring:
- `tests/e2e-backends/backend_test.go` gains an `image` capability that
  exercises GenerateImage and asserts a non-empty PNG is written to
  `dst`. New `BACKEND_TEST_IMAGE_PROMPT` / `BACKEND_TEST_IMAGE_STEPS`
  knobs.
- Five new make targets next to `test-extra-backend-vllm`:
  - `test-extra-backend-tinygrad` — Qwen2.5-0.5B-Instruct + hermes,
    mirrors the vllm target 1:1 (5/9 specs in ~57s).
  - `test-extra-backend-tinygrad-embeddings` — same model, embeddings
    via LLM hidden state (3/9 in ~10s).
  - `test-extra-backend-tinygrad-sd` — stable-diffusion-v1-5 mirror,
    health/load/image (3/9 in ~10min, 4 diffusion steps on CPU).
  - `test-extra-backend-tinygrad-whisper` — openai/whisper-tiny.en
    against jfk.wav from whisper.cpp samples (4/9 in ~49s).
  - `test-extra-backend-tinygrad-all` aggregate.

All four targets land green on the first MVP pass: 15 specs total, 0
failures across LLM+tools, embeddings, image generation, and speech
transcription.

* refactor(tinygrad): collapse to a single backend image

tinygrad generates its own GPU kernels (PTX renderer for CUDA, the
autogen ctypes wrappers for HIP / Metal / WebGPU) and never links
against cuDNN, cuBLAS, or any toolkit-version-tied library. The only
runtime dependency that varies across hosts is the driver's libcuda.so.1
/ libamdhip64.so, which are injected into the container at run time by
the nvidia-container / rocm runtimes. So unlike torch- or vLLM-based
backends, there is no reason to ship per-CUDA-version images.

- Drop the cuda12-tinygrad and cuda13-tinygrad build-matrix entries
  from .github/workflows/backend.yml. The sole remaining entry is
  renamed to -tinygrad (from -cpu-tinygrad) since it is no longer
  CPU-only.
- Collapse backend/index.yaml to a single meta + development pair.
  The meta anchor carries the latest uri directly; the development
  entry points at the master tag.
- run.sh picks the tinygrad device at launch time by probing
  /usr/lib/... for libcuda.so.1 / libamdhip64.so (see the sketch after
  this list). When libcuda is visible we set CUDA=1 + CUDA_PTX=1 so
  tinygrad uses its own PTX renderer (avoids any nvrtc/toolkit
  dependency); otherwise we fall back to HIP or CLANG. CPU_LLVM=1 +
  LLVM_PATH keep the in-process libLLVM JIT for the CLANG path.
- backend.py's _select_tinygrad_device() is trimmed to a CLANG-only
  fallback since production device selection happens in run.sh.
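
A condensed sketch of that probe (shown with ldconfig rather than the
literal /usr/lib path walk; the HIP variable name is assumed from the
fallback order described above):

  # choose the tinygrad device from whichever driver library the
  # container runtime injected; default to the in-process LLVM JIT
  if ldconfig -p | grep -q libcuda.so.1; then
    export CUDA=1 CUDA_PTX=1   # tinygrad's PTX renderer, no nvrtc needed
  elif ldconfig -p | grep -q libamdhip64.so; then
    export HIP=1
  else
    export CPU_LLVM=1 LLVM_PATH=/usr/lib/libLLVM.so.1   # path illustrative
  fi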

Re-ran test-extra-backend-tinygrad after the change:
  Ran 5 of 9 Specs in 56.541 seconds — 5 Passed, 0 Failed
2026-04-15 19:48:23 +02:00
LocalAI [bot]
62862ca06b chore: ⬆️ Update ggml-org/llama.cpp to fae3a28070fe4026f87bd6a544aba1b2d1896566 (#9357)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-15 01:25:41 +02:00
Ettore Di Giacinto
95efb8a562 feat(backend): add turboquant llama.cpp-fork backend (#9355)
* feat(backend): add turboquant llama.cpp-fork backend

turboquant is a llama.cpp fork (TheTom/llama-cpp-turboquant, branch
feature/turboquant-kv-cache) that adds a TurboQuant KV-cache scheme.
It ships as a first-class backend reusing backend/cpp/llama-cpp sources
via a thin wrapper Makefile: each variant target copies ../llama-cpp
into a sibling build dir and invokes llama-cpp's build-llama-cpp-grpc-server
with LLAMA_REPO/LLAMA_VERSION overridden to point at the fork. No
duplication of grpc-server.cpp — upstream fixes flow through automatically.
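
Per flavor the wrapper reduces to roughly (sketch; repo URL and pin are
illustrative):

  # reuse llama-cpp's sources and grpc-server build, pointed at the fork
  cp -rf ../llama-cpp turboquant-avx2-build
  make -C turboquant-avx2-build build-llama-cpp-grpc-server \
      LLAMA_REPO=https://github.com/TheTom/llama-cpp-turboquant \
      LLAMA_VERSION=<fork-pin-sha>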

Wires up the full matrix (CPU, CUDA 12/13, L4T, L4T-CUDA13, ROCm, SYCL
f32/f16, Vulkan) in backend.yml and the gallery entries in index.yaml,
adds a tests-turboquant-grpc e2e job driven by BACKEND_TEST_CACHE_TYPE_K/V=q8_0
to exercise the KV-cache config path (backend_test.go gains dedicated env
vars wired into ModelOptions.CacheTypeKey/Value — a generic improvement
usable by any llama.cpp-family backend), and registers a nightly auto-bump
PR in bump_deps.yaml tracking feature/turboquant-kv-cache.

scripts/changed-backends.js gets a special-case so edits to
backend/cpp/llama-cpp/ also retrigger the turboquant CI pipeline, since
the wrapper reuses those sources.

* feat(turboquant): carry upstream patches against fork API drift

turboquant branched from llama.cpp before upstream commit 66060008
("server: respect the ignore eos flag", #21203) which added the
`logit_bias_eog` field to `server_context_meta` and a matching
parameter to `server_task::params_from_json_cmpl`. The shared
backend/cpp/llama-cpp/grpc-server.cpp depends on that field, so
building it against the fork unmodified fails.

Cherry-pick that commit as a patch file under
backend/cpp/turboquant/patches/ and apply it to the cloned fork
sources via a new apply-patches.sh hook called from the wrapper
Makefile. Simplifies the build flow too: instead of hopping through
llama-cpp's build-llama-cpp-grpc-server indirection, the wrapper now
drives the copied Makefile directly (clone -> patch -> build).

Drop the corresponding patch whenever the fork catches up with
upstream — the build fails fast if a patch stops applying, which
is the signal to retire it.
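
A minimal sketch of the hook (patch layout as described above; the
clone path is assumed):

  # apply every carried patch to the cloned fork; a patch that has been
  # merged upstream stops applying and fails the build, which is the
  # cue to delete it
  for p in patches/*.patch; do
    [ -f "$p" ] || continue        # no patches left to carry
    git -C llama.cpp apply "$p" || exit 1
  done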

* docs: add turboquant backend section + clarify cache_type_k/v

Document the new turboquant (llama.cpp fork with TurboQuant KV-cache)
backend alongside the existing llama-cpp / ik-llama-cpp sections in
features/text-generation.md: when to pick it, how to install it from
the gallery, and a YAML example showing backend: turboquant together
with cache_type_k / cache_type_v.

Also expand the cache_type_k / cache_type_v table rows in
advanced/model-configuration.md to spell out the accepted llama.cpp
quantization values and note that these fields apply to all
llama.cpp-family backends, not just vLLM.

* feat(turboquant): patch ggml-rpc GGML_OP_COUNT assertion

The fork adds new GGML ops bringing GGML_OP_COUNT to 97, but
ggml/include/ggml-rpc.h static-asserts it equals 96, breaking
the GGML_RPC=ON build paths (turboquant-grpc / turboquant-rpc-server).
Carry a one-line patch that updates the expected count so the
assertion holds. Drop this patch whenever the fork fixes it upstream.

* feat(turboquant): allow turbo* KV-cache types and exercise them in e2e

The shared backend/cpp/llama-cpp/grpc-server.cpp carries its own
allow-list of accepted KV-cache types (kv_cache_types[]) and rejects
anything outside it before the value reaches llama.cpp's parser. That
list only contains the standard llama.cpp types — turbo2/turbo3/turbo4
would throw "Unsupported cache type" at LoadModel time, meaning
nothing the LocalAI gRPC layer accepted was actually fork-specific.

Add a build-time augmentation step (patch-grpc-server.sh, called from
the turboquant wrapper Makefile) that inserts GGML_TYPE_TURBO2_0/3_0/4_0
into the allow-list of the *copied* grpc-server.cpp under
turboquant-<flavor>-build/. The original file under backend/cpp/llama-cpp/
is never touched, so the stock llama-cpp build keeps compiling against
vanilla upstream which has no notion of those enum values.

Switch test-extra-backend-turboquant to set
BACKEND_TEST_CACHE_TYPE_K=turbo3 / _V=turbo3 so the e2e gRPC suite
actually runs the fork's TurboQuant KV-cache code paths (turbo3 also
auto-enables flash_attention in the fork). Picking q8_0 here would
only re-test the standard llama.cpp path that the upstream llama-cpp
backend already covers.
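
Since these are plain environment knobs, other fork cache types can be
exercised ad hoc, e.g.:

  # try the fork's turbo4 scheme instead of the default turbo3 pin
  make test-extra-backend-turboquant \
      BACKEND_TEST_CACHE_TYPE_K=turbo4 BACKEND_TEST_CACHE_TYPE_V=turbo4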

Refresh the docs (text-generation.md + model-configuration.md) to
list turbo2/turbo3/turbo4 explicitly and call out that you only get
the TurboQuant code path with this backend + a turbo* cache type.

* fix(turboquant): rewrite patch-grpc-server.sh in awk, not python3

The builder image (ubuntu:24.04 stage-2 in Dockerfile.turboquant)
does not install python3, so the python-based augmentation step
errored with `python3: command not found` at make time. Switch to
awk, which the ubuntu base image already ships and is available
everywhere the rest of the wrapper Makefile runs.
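
The awk pass can stay tiny: print the extra enum entries right after
the allow-list's opening line (sketch; the match pattern and the
declaration's exact shape are assumed):

  awk '/kv_cache_types\[\]/ {
         print                              # keep the opening line
         print "    GGML_TYPE_TURBO2_0,"    # fork-only enum values
         print "    GGML_TYPE_TURBO3_0,"
         print "    GGML_TYPE_TURBO4_0,"
         next
       }
       { print }' grpc-server.cpp > grpc-server.cpp.new \
    && mv grpc-server.cpp.new grpc-server.cpp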

* Apply suggestion from @mudler

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-15 01:25:04 +02:00