LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-11 02:07:27 -04:00

Author	SHA1	Message	Date
LocalAI [bot]	fba8c9c498	fix(distributed): track in-flight for non-LLM inference methods (VAD, diarize, voice, ...) (#10238 ) fix(distributed): track in-flight for non-LLM inference methods InFlightTrackingClient only wrapped a subset of the grpc.Backend inference methods (Predict, Embeddings, TTS, AudioTranscription, Detect, Rerank, ...). Methods like VAD were left as embedded passthrough, so track() never ran for them. In distributed mode every model is loaded with in_flight=1 as a reservation; that reservation is only released by the OnFirstComplete callback, which fires after the first tracked inference call completes. A VAD-only model (e.g. silero-vad) never calls a tracked method, so the reservation is never released and in-flight stays pinned at 1 forever - which also blocks the router's idle-eviction logic. Wrap the remaining unary inference methods (VAD, Diarize, Face, Voice, TokenClassify, Score, AudioEncode, AudioDecode, AudioTransform) with the same track()/reconcile() pattern. The three bidi-stream constructors (AudioTransformStream, AudioToAudioStream, Forward) are deliberately left as passthrough - their inference spans the stream lifetime, not the constructor call, so track() there would fire onFirstComplete before any data flows. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> v4.4.0	2026-06-10 16:29:50 +02:00
LocalAI [bot]	6b2badb837	chore: ⬆️ Update CrispStrobe/CrispASR to `c29f6653a516a3001d923944dad8892072cc7334` (#10236 ) ⬆️ Update CrispStrobe/CrispASR Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-10 16:16:24 +02:00
LocalAI [bot]	8b8506d01a	chore: ⬆️ Update ggml-org/llama.cpp to `039e20a2db9e87b2477c76cc04905f3e1acad77f` (#10223 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-10 12:22:03 +02:00
LocalAI [bot]	6910a0bb48	chore: ⬆️ Update antirez/ds4 to `91bafb5acd5a6cf00b1e55ef68bf40ddd207bee7` (#10234 ) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-10 12:08:19 +02:00
LocalAI [bot]	cffd03b522	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `e6f8112f3ba126eed3ff5b30cdd08085414a7516` (#10233 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-10 12:07:49 +02:00
LocalAI [bot]	bf448d3794	chore: ⬆️ Update ggml-org/whisper.cpp to `df7638d8229a243af8a4b5a8ae557e0d74e0a0ae` (#10220 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-10 01:16:29 +02:00
LocalAI [bot]	1d4a12f7c0	chore: ⬆️ Update CrispStrobe/CrispASR to `97cad527d247edefc904e6c40c4cf5ee78bed055` (#10221 ) ⬆️ Update CrispStrobe/CrispASR Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-10 01:16:17 +02:00
LocalAI [bot]	186d62801d	chore: ⬆️ Update leejet/stable-diffusion.cpp to `19bdfe22d255d5b4dff39d449318b9bc5ea2317f` (#10222 ) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-10 01:16:06 +02:00
LocalAI [bot]	da4ed05429	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `2768b6251548b78b6610e95edad13f888ad95982` (#10219 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-10 01:15:54 +02:00
LocalAI [bot]	ec1eea4f45	chore: ⬆️ Update antirez/ds4 to `512d07cb08f234b704b5a5959aa9e2d4c466eeb0` (#10224 ) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-10 01:15:42 +02:00
LocalAI [bot]	b203b32e57	feat(realtime): make WebRTC ICE candidates configurable (#10231 ) The /v1/realtime WebRTC handler created the peer connection with a bare webrtc.Configuration and no SettingEngine, so pion gathered a host ICE candidate for every local interface. Under Docker host networking that includes bridge addresses (docker0/veth, 172.x) a remote browser cannot route to; the call establishes on a good pair and then drops once ICE consent freshness checks fail on the unreachable candidates. Add two opt-in knobs, applied via a pion SettingEngine: - LOCALAI_WEBRTC_NAT_1TO1_IPS: advertise these IPs as the host candidates (e.g. the host LAN IP) - LOCALAI_WEBRTC_ICE_INTERFACES: restrict ICE gathering to these interfaces Defaults are unchanged (empty => current all-interface behavior). Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-09 22:28:03 +02:00
Ching	48a8ce98aa	fix(cli): handle chat output errors (#10229 ) Propagate terminal write errors from the chat prompt and explicitly ignore stream close errors during cleanup. Update chat tests to assert response writer errors so errcheck passes without hiding failed writes. Tests: - go test -count=1 ./core/cli/chat - go test -count=1 ./core/cli Assisted-by: Codex:GPT-5 Signed-off-by: Ching Kao <0980124jim@gmail.com>	2026-06-09 19:10:24 +02:00
Ching	8344d1c865	feat(cli): add interactive chat mode (#10226 ) Add an opt-in `local-ai chat` command for testing chat models directly from the terminal without manually sending curl requests. The command connects to a running LocalAI server, lists available models through the existing OpenAI-compatible API, streams chat completions, and supports interactive commands such as `/models`, `/model`, `/clear`, and `/exit`. Keep `local-ai run` focused on the server lifecycle so the web UI, API clients, and multiple chat terminals can coexist against the same server. Document the new command and terminal workflow in the README and CLI docs. Tests: - go test -count=1 ./core/cli/chat - go test -count=1 ./core/cli Assisted-by: Codex:GPT-5 Signed-off-by: Ching Kao <0980124jim@gmail.com>	2026-06-09 14:58:44 +00:00
Pete	d2e6b93369	feat(agents): surface KB source citations in RAG responses (#10228 ) * dev knowledge.go structure Signed-off-by: Pete Chen <petechentw@gmail.com> * feat(agents): append KB source citations to responses Render structured KB citations as a Sources block after agent responses, linking each source to the existing raw collection entry endpoint. Keep long-term memory writes on the original model response so citation blocks do not get stored back into the knowledge base. Tested with: go test ./core/services/agents Assisted-by: Codex:gpt-5 Signed-off-by: Pete Chen <petechentw@gmail.com> * Collect KB citations from tool searches Signed-off-by: Pete Chen <petechentw@gmail.com> * fix(agents): append KB sources in local chats Apply the shared KB citation post-processing to standalone LocalAGI chat responses so the React agent chat receives the same clickable Sources block as the native executor path. Also fix the run target to use the current cmd/local-ai entrypoint. Assisted-by: Codex:gpt-5 Signed-off-by: Pete Chen <petechentw@gmail.com> --------- Signed-off-by: Pete Chen <petechentw@gmail.com> Co-authored-by: shihyunhuang <shihyunhuang88@gmail.com> Co-authored-by: TLoE419 <tloemizuchizu@gmail.com> Co-authored-by: Ching Kao <0980124jim@gmail.com>	2026-06-09 16:32:56 +02:00
LocalAI [bot]	e1ec03d33f	fix(reasoning): stop prefilled <think> from swallowing tag-less answers (#10225 ) * fix(reasoning): stop prefilled <think> from swallowing tag-less answers When a chat template injects the thinking start token into the prompt (so DetectThinkingStartToken returns e.g. "<think>"), the model's output begins inside a reasoning block and carries only the closing tag. The non-jinja autoparser fallback (peg-native "pure content" mode, issue #9985) prepends the start token so the extractor can pair it with the model's </think>. But on a COMPLETE response that contains no closing tag, the model answered directly with no reasoning at all. Prepending the start token there manufactures an unclosed block that swallows the entire answer into reasoning, leaving the OpenAI `content` field empty. This breaks short/direct answers — session names, JSON summaries, any terse completion where the model skips the think block — which come back with empty content. Regression surfaced by #9991, which added the defensive prefill extraction to the complete-response paths. Add reasoning.ExtractReasoningComplete: it only honors a prefilled start token when the response actually contains the matching closing tag (proof a reasoning block exists). Genuine reasoning tags already in the content still extract; tag-less content stays content. Apply it at every complete-response site (applyAutoparserOverride, realtime, openresponses). The streaming per-token extractor is intentionally left on ExtractReasoningWithConfig — mid-stream an as-yet-unclosed block is legitimate and must surface as reasoning deltas. Also adds reasoning.ClosingTokenForStart and hoists the default reasoning tag pairs to package scope so both helpers share one source of truth. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(reasoning): cover the enable_thinking=false non-thinking-mode regression Adds the end-to-end case that actually broke session summaries / auto-titles and was not covered before: a request with enable_thinking=false against a <think>-capable model. In non-thinking mode the model emits no reasoning block, so llama.cpp's autoparser returns ChatDeltas with content set and reasoning_content empty (verified against stock llama-server: same model with chat_template_kwargs.enable_thinking=false returns reasoning_content=null, content="hello"). thinkingStartToken is still "<think>" because it is detected per-model from the enable_thinking=true render, so the old code prepended it and swallowed the answer. The test fails without the ExtractReasoningComplete gate. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 09:02:04 +02:00
LocalAI [bot]	9323f4b5ca	feat(llama-cpp): video input support (mtmd #24269 ) (#10216 ) * chore(llama-cpp): bump to 8f83d6c for mtmd video input support Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(llama-cpp): forward video input to mtmd (template + non-template paths) Wire request->videos() into grpc-server.cpp mirroring the existing image and audio handling: a video_data build + non-template files extraction, and input_video chat chunks on the tokenizer-template path. allow_video is auto-set at model load by the vendored upstream chat_params. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ui): add video attachment support to the chat UI Mirror the image/audio attachment path for video: emit video_url content parts, accept video/* in the picker, keep video files as base64, show a film icon badge, and render attached video inline with a <video> player. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(llama-cpp): patch mtmd video stdin double-close (heap crash) Upstream mtmd video input (ggml-org/llama.cpp#24269) double-fcloses the ffmpeg/ffprobe stdin FILE: feed_stdin() fclose()s the FILE returned by subprocess_stdin() (which is sp->stdin_file), then subprocess_destroy() fclose()s the same pointer again -> heap corruption that aborts the backend on any base64 input_video request (the CLI --video file path is unaffected). Vendor a one-line fix (null sp->stdin_file after fclose) via prepare.sh's patches/ until upstream merges it. Verified e2e with gemma-4-e2b-it-qat-q4_0: video frames decode via ffmpeg and the model answers correctly (red clip -> 'Red', blue -> 'Blue'). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(llama-cpp): re-pin to upstream #24316, drop vendored stdin patch Upstream replaced the ad-hoc video stdin handling with a proper RAII refactor (ggml-org/llama.cpp#24316, "mtmd: refactor video subproc handling"), which includes the same `sp->stdin_file = nullptr` guard our patch added (plus join-before-destroy ordering). Re-pin LLAMA_VERSION to that branch head and drop patches/0001 - it's now redundant. Verified e2e with gemma-4-e2b-it-qat-q4_0: no crash, video frames decode and the model answers correctly (red clip -> "Red", blue -> "Blue"). NOTE: #24316 is not yet merged, so this pins to its branch-head commit (28ca1e60). Re-pin to the squash-merge commit on master once it lands, otherwise `git fetch` may lose the commit after the branch is deleted. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-08 23:17:50 +02:00
LocalAI [bot]	c20225fc13	chore: ⬆️ Update CrispStrobe/CrispASR to `f7838a306687f22c281d29c250f879a4ab3df2d7` (#10177 ) * ⬆️ Update CrispStrobe/CrispASR Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * fix(crispasr): link crispasr-lib CMake target instead of crispasr The dependency-bump regeneration of this branch reset CMakeLists.txt to master and dropped the prior link-target fix, reintroducing the `cannot find -lcrispasr` failure. Upstream CrispASR (f7838a3) defines the library as the CMake target `crispasr-lib` (with OUTPUT_NAME crispasr); there is no target named `crispasr`, so target_link_libraries falls back to a bare `-lcrispasr` linker flag that cannot be resolved. Point the link at the real target name. Verified locally: CPU cmake-configure of the bumped source generates a gocrispasr link line referencing sources/CrispASR/src/libcrispasr.a with no dangling -lcrispasr. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] --------- Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-08 16:01:19 +02:00
LocalAI [bot]	337acc4c37	chore: ⬆️ Update antirez/ds4 to `c463029c205c2ec8d7ab6c0df4a3f52979091286` (#10189 ) * ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * fix(ds4): link ds4_ssd.o into the backend build Upstream antirez/ds4 splits the SSD expert-cache into its own ds4_ssd.c translation unit, whose symbols (ds4_ssd_memory_lock_acquire/release, ds4_ssd_cache_experts_for_byte_budget, ds4_ssd_auto_cache_plan) are referenced by ds4.c/ds4_cpu.o. The dependency-bump automation regenerated this branch from clean master and dropped the prior linkage fix, so the cpu-ds4 / cublas-ds4 backend builds fail again with undefined references. Re-apply the ds4_ssd.o linkage GPU-agnostically (mirroring ds4_distributed.o) in both the backend Makefile (DS4_OBJ_TARGET + the engine-object build rule for every GPU mode) and CMakeLists.txt (list(APPEND DS4_OBJS ds4_ssd.o)). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] --------- Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-08 11:15:32 +02:00
LocalAI [bot]	618e90cd13	feat(gallery): add Gemma 4 QAT family + MTP speculative-decoding pairs (#10215 ) Add the remaining official Google Gemma 4 QAT Q4_0 GGUFs (E2B, E4B, 26B-A4B, 31B) next to the existing 12B entry, each shipping its multimodal mmproj. Also add three MTP (Multi-Token Prediction) speculative-decoding bundles that pair each QAT target with a QAT-matched assistant/drafter head: - 12B <- Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF - 26B-A4B <- boxwrench/gemma-4-qat-mtp-assistant-heads - 31B <- boxwrench/gemma-4-qat-mtp-assistant-heads The assistant heads use the gemma4_assistant architecture and are not standalone chat models, so each entry bundles the target + draft and sets draft_model together with the draft-mtp spec options (spec_type:draft-mtp / spec_n_max:6 / spec_p_min:0.75), matching MTPSpecOptions() in core/config/mtp.go. QAT-matched heads raise draft acceptance substantially over generic non-QAT heads. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-08 10:26:42 +02:00
LocalAI [bot]	92dea961c2	fix: distributed backend reinstall/upgrade UI stuck on 'reinstalling' (#10214 ) * fix(galleryop): self-evict terminal ops from OpCache.GetStatus The processingBackends map (the UI 'reinstalling' spinner source) only cleared an op when a client polled /api/backends/job/:uid. The Manage-page Reinstall and Upgrade buttons never poll, so completed installs leaked into processingBackends forever and the backend card spun 'reinstalling' even though the install had finished. Evict terminal ops on the list read instead; DeleteUUID already broadcasts the eviction so peer replicas converge. Reproduced on a live 5-node distributed cluster: 5 backends sat in processingBackends with underlying jobs reporting completed:true,progress:100. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): clear pending backend ops behind offline/draining nodes ListDuePendingBackendOps filters status=healthy, so a backend op queued against a node that went offline (stale heartbeat) or draining (admin action) was never retried, aged out, or deleted - it leaked forever and kept the UI operation spinning. Add DeleteStalePendingBackendOps and run it each reconcile pass: draining nodes are cleared immediately (model rows already purged), offline nodes once their heartbeat is older than a grace window (blip protection). Reproduced on a live cluster: orphaned llama-cpp install rows targeting an offline (nvidia-thor) and a draining (mac-mini-m4) node sat at attempts=0 indefinitely. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): stream per-node progress during backend upgrade The install dispatch subscribed to a per-op progress subject and streamed per-node download ticks; the upgrade dispatch did a bare 15-minute blocking NATS round-trip with no subscription, so the UI showed progress:0 the whole time (the 'reinstalling but nothing happens' report on a slow node). Thread the op ID through BackendManager.UpgradeBackend -> the distributed manager -> the adapter, and have the adapter subscribe to the per-op progress subject before the request (extracted into a shared subscribeProgress helper reused by install/upgrade/force-fallback). The worker's upgradeBackend now creates the same DebouncedInstallProgressPublisher installBackend uses. An upgrade is a force-reinstall, so it reuses SubjectNodeBackendInstallProgress rather than minting a new subject - no new NATS permission, no new rolling-update compat surface. Reconciler-driven retries pass empty opID/onProgress and stay on the silent path. Reproduced on a live cluster: upgrade of llama-cpp-development on agx-orin-slow sat at progress:0 for 4+ minutes with no per-node feedback. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(galleryop): persist cancellation + periodically reap orphaned ops Two distributed gaps surfaced when a replica was killed mid-upgrade on a live cluster, leaving the backend stuck 'processing' in the UI forever: 1. CancelOperation flipped the in-memory status to cancelled and broadcast a NATS event but never persisted the terminal status. On the next replica restart the still-active row re-hydrated straight back into processingBackends and the UI spun again. It now calls store.Cancel(id) so the cancel survives a restart. 2. CleanStale (which marks abandoned active ops failed) only ran once on startup, so an op orphaned AFTER startup - its owning replica's foreground handler goroutine gone - was never reaped until the next restart. Add GalleryService.ReapStaleOperations and run it on a 15m ticker (CleanStale now returns the reaped count for observability). Neither is covered by the OpCache self-evict fix: an orphaned op never reaches Processed, so it would never self-evict. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(review): address self-review findings on the distributed install fixes Three findings from an adversarial review of this branch: 1. CRITICAL - OpCache.GetStatus crashed under concurrent load. m.Map() returns the live internal map by reference, so deleting from it on the read path was an unsynchronized write to a map four HTTP handlers poll every ~1s -> a 'concurrent map writes' fatal. Rewritten to iterate a Keys() snapshot, build a fresh result map, and apply evictions via the locked DeleteUUID after the loop. Added a -race concurrency regression guard. 2. HIGH - GetStatus evicted failed ops too, hiding them from /api/operations and breaking the dismiss-failed-op flow (the panel keeps Error != nil ops so the admin can read the error and click Dismiss). Eviction now fires only for terminal ops with Error == nil (success/cancelled); failures are retained. 3. MEDIUM - DeleteStalePendingBackendOps missed StatusUnhealthy nodes. A node marked unhealthy on a NATS ErrNoResponders never transitions to offline (health.go skips re-marking it), so its pending ops leaked exactly like the offline case. Unhealthy is now reaped via the same stale-heartbeat grace path (a fresh-heartbeat node is recovering and keeps its op). Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(review-2): don't evict the still-installing soft-path; don't spin on failed ops Second review pass found two issues: 1. MEDIUM (Go) - OpCache.GetStatus evicted the ErrWorkerStillInstalling soft-path op. That op is deliberately Processed=true with no error to show a yellow in-progress state when a worker timed out the NATS round-trip but is still installing in the background; the reconciler confirms the real outcome later. Evicting it (and broadcasting OpEnd + marking the DB completed) hid an install that may still fail. Eviction is now scoped to a clean success (progress 100 + 'completed', matching the job-poll's historical condition) or a cancellation - the soft-path (progress != 100) and failures are kept. 2. MEDIUM (React) - the Backends gallery card rendered ANY operation as an 'Installing...' spinner, so a failed op (now intentionally kept in the list for the OperationsBar error + Dismiss) spun forever. Exclude errored ops from the card spinner, mirroring Models.jsx (isInstalling already excludes op.error). The error + Dismiss still surface in the global OperationsBar. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(ui): refresh Manage backends table when an operation settles The Manage backends table fetched installed backends only on mount/after delete and checked upgrades only on tab activation. After a reinstall/upgrade completed neither re-ran, so the installed-version cell and the 'update available' badge stayed stale until the user switched tabs - the op looked like it 'did nothing'. Watch the operations list (via useOperations) and re-fetch installed backends + available upgrades whenever the count settles, mirroring the operations.length watch Backends.jsx already uses. Consolidates the prior tab-activation upgrades check into the same effect. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-08 10:03:02 +02:00
LocalAI [bot]	2e93186043	chore: ⬆️ Update ggml-org/llama.cpp to `9e3b928fd8c9d14dbf15a8768b9fdd7e5c721d66` (#10210 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-08 09:35:17 +02:00
LocalAI [bot]	d07037e817	chore: ⬆️ Update leejet/stable-diffusion.cpp to `b3d56d0ba1bd437886079e339118e8e75bb79ee7` (#10211 ) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-08 09:03:57 +02:00
LocalAI [bot]	f6cc90d258	chore: ⬆️ Update mudler/parakeet.cpp to `e270af73b94c9a5c37ec516230219ed4580e1db6` (#10212 ) ⬆️ Update mudler/parakeet.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-07 23:52:44 +02:00
Adira	2c804bef5a	fix(config): skip vocab arrays and mmap GGUF headers to speed up startup (#10213 ) When the models directory holds many GGUF files, startup parsed every model's full GGUF — including the tokenizer vocab arrays (tokenizer.ggml.tokens/scores/merges, often >100k entries) — once per model while guessing defaults. On slow storage (e.g. a models directory on a Docker volume) those hundreds of thousands of tiny reads dominate boot time before the HTTP server comes up. The default-guessing path and the VRAM metadata reader only consume scalar metadata and array lengths, never the array contents. Parse with SkipLargeMetadata (seek past large arrays) and UseMMap (fault in a few header pages instead of issuing per-element read() syscalls). For a 256k-token vocab this cuts the parse from ~524k read() syscalls to 8. The mapping is released when ParseGGUFFile returns. Fixes #9790 Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>	2026-06-07 23:33:52 +02:00
LocalAI [bot]	6070402477	chore(model gallery): 🤖 add 1 new models via gallery agent (#10209 ) chore(model gallery): 🤖 add new models via gallery agent Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-07 22:09:32 +02:00
LocalAI [bot]	67f80a152b	fix(mtp): don't auto-enable self-spec MTP for draft-only assistant GGUFs (#10208 ) Gemma4 MTP (ggml-org/llama.cpp#23398) registers the prediction head as a separate `gemma4-assistant` architecture. That assistant GGUF still carries `<arch>.nextn_predict_layers`, so the architecture-agnostic detection in HasEmbeddedMTPHead matched it and appended the `spec_type:draft-mtp` defaults. Unlike the DeepSeek/Qwen embedded-head models, an assistant checkpoint cannot self-speculate: it is a draft model that requires a paired target context (`ctx_other`) and throws if loaded alone. Auto-applying the self-spec defaults to a standalone assistant import therefore produces a broken config. Guard the detection against draft-only assistant architectures (the `-assistant` suffix is upstream's naming convention) so importing one no longer yields a self-speculation config. Two-model target+draft pairing remains expressible manually via `draft_model:` and is left to a follow-up. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-07 22:09:02 +02:00
LocalAI [bot]	a7cb587d96	feat(parakeet-cpp): real segment timestamps (NeMo-faithful) (#10207 ) * feat(parakeet-cpp): real segment timestamps (NeMo-faithful) Offline: replace the single synthetic whole-clip segment with multiple segments grouped exactly like NeMo's get_segment_offsets - a new segment after sentence-ending punctuation ('. ? !'), each carrying start/end and its time-window token ids. The optional model option segment_gap_threshold (NeMo's unit: encoder FRAMES, default 0=off) adds NeMo's silence-gap split, converted to seconds via the JSON frame_sec the engine now reports. Per-segment words are still gated behind timestamp_granularities=["word"]; a zero-word document falls back to a single text segment. Streaming: when libparakeet.so exposes the ABI v4 JSON entry points (probed), drive parakeet_capi_stream_feed_json / _finalize_json and accumulate the streamed per-word timestamps into per-utterance segments (EOU stays the boundary), so streaming FinalResult segments now carry start/end. Falls back to the text-only feed against an older library. Pure-Go specs cover splitWordsIntoSegments (punctuation + gap rules, NeMo elif order, fallback), transcriptResultFromDoc (multi-segment, token windows, word-granularity gate), and the streaming segmenter. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(audio): document parakeet-cpp segment timestamps + segment_gap_threshold Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(parakeet-cpp): update model-gated specs for multi-segment output The offline AudioTranscription specs asserted the old single synthetic segment (Segments HaveLen(1), Segments[0].Text == res.Text). With NeMo-faithful segmentation a multi-sentence clip now yields multiple punctuation-delimited segments, so assert the new contract instead: one-or-more time-ordered segments, each with text and (under word granularity) per-segment words whose span tracks the segment start/end. Caught by running the model-gated suite on the dgx (GB10) against the real tdt_ctc-110m + realtime_eou models. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-07 22:08:24 +02:00
LocalAI [bot]	f7c74ad2da	chore: ⬆️ Update ggml-org/llama.cpp to `31e82494c0a3913c919c1027fa70500fbf4c07dd` (#10191 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-07 10:43:17 +02:00
LocalAI [bot]	7402d1fd20	chore(turboquant): bump to 7d9715f1 + fix compilation against rebased fork (#10205 ) * chore(turboquant): bump TheTom/llama-cpp-turboquant to 7d9715f1 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(turboquant): drop obsolete legacy-spec shim after fork rebased The TheTom/llama-cpp-turboquant fork (pin c9aa86a) rebased past the upstream common_params_speculative refactor (ggml-org/llama.cpp #22397/#22838/#22964), the model_tgt rename (#22838) and get_media_marker (#21962). The old fork-compat shim forced now-wrong legacy code paths, breaking the build with errors like 'struct common_params_speculative has no member named mparams_dft / type' and 'server_context_impl has no member named model'. Remove the obsolete LOCALAI_LEGACY_LLAMA_CPP_SPEC branches from the shared grpc-server.cpp (stock llama-cpp and the modern fork both take the modern path now), and narrow the one remaining gap (the fork still lacks common_params::checkpoint_min_step) to a dedicated LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP guard injected by patch-grpc-server.sh. The patch script now only adds the turbo2/3/4 KV-cache types and injects that one macro. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(turboquant): HIP-port the fork's CUDA additions (copy2d 3D-peer + cudaEventCreate) The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs that ggml's HIP/MUSA shim does not provide, breaking the -gpu-rocm-hipblas-turboquant build. patches/0001-hip-guard-copy2d-peer-fastpath.patch (applied by apply-patches.sh) ports them: - Guard ggml_cuda_copy2d_across_devices's 3D-peer copy fast path with #if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) so HIP/MUSA fall through to the existing cudaMemcpyAsync staging fallback (HIP genuinely lacks cudaMemcpy3DPeerAsync, per the fork's own comment). - Create the device event in ggml_backend_cuda_device_event_new with the HIP-aliased cudaEventCreateWithFlags(.., cudaEventDisableTiming) instead of the un-aliased plain cudaEventCreate, matching this file's own usage elsewhere. CUDA builds are unaffected. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * ci(turboquant): drop the ROCm/hipblas build flavor The TheTom/llama-cpp-turboquant fork is not ROCm-clean at the current pin: beyond the CUDA-API gaps already patched (3D-peer copy, cudaEventCreate), its llama.cpp base fails to compile the flash-attention MMA f16 kernels for head-dim 640 under HIP (cols_per_warp evaluates to 0 -> division-by-zero / non-constant static asserts in fattn-mma-f16.cuh). That is a deep ggml-on-ROCm kernel issue, not something a small fork patch can paper over. Drop -gpu-rocm-hipblas-turboquant from the build matrix so turboquant still ships for cpu / cublas / vulkan / sycl. Re-add it once the fork's HIP path compiles (or upstream ggml fixes the large-head-dim MMA kernels for ROCm). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-07 10:42:06 +02:00
LocalAI [bot]	8c42695ef8	chore: ⬆️ Update ggml-org/whisper.cpp to `a8ec021f2750a473ff4a8f3883bc9fdf5feafa84` (#10202 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-07 08:37:42 +02:00
LocalAI [bot]	72e3241431	chore: ⬆️ Update mudler/parakeet.cpp to `abd0087dcc92ec5ad1f96f9fd86c49eb26a5ce67` (#10204 ) ⬆️ Update mudler/parakeet.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-07 00:37:28 +02:00
LocalAI [bot]	cd2bf95862	fix(docs): use relearn notice shortcode instead of unsupported alert (#10206 ) The Hugo relearn theme does not provide an "alert" shortcode, so the docs deploy failed at the Build site step: failed to extract shortcode: template for shortcode "alert" not found docs/content/features/distributed-mode.md:136 Convert the warning block to the theme-supported notice shortcode used everywhere else in the docs. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-07 00:37:12 +02:00
LocalAI [bot]	f64b72dd7d	feat: support Ideogram4 in stablediffusion-ggml backend + gallery (#10201 ) * feat(stablediffusion-ggml): support Ideogram4 unconditional diffusion model Bump stable-diffusion.cpp from 1f9ee88 to b9254dd, the upstream commit that adds Ideogram4 support (leejet/stable-diffusion.cpp#1609). Ideogram4 derives its classifier-free guidance from a separate unconditional diffusion model, exposed upstream through the new sd_ctx_params_t.uncond_diffusion_model_path field. Wire that field into the gosd wrapper via a new uncond_diffusion_model_path option. The _path suffix is deliberate: the Go loader only resolves options whose name contains "path" to an absolute path under the model directory, so this keeps the option consistent with diffusion_model_path and high_noise_diffusion_model_path. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(gallery): add Ideogram4 stablediffusion-ggml models Single-file GGUF weights for Ideogram4 are now published (stduhpf/ideogram-4-gguf), so add the model to the gallery. Ideogram4 is a text-to-image model with strong, accurate in-image text rendering, driven by a Qwen3-VL-8B text encoder and real classifier-free guidance from a separate unconditional diffusion model (the uncond_diffusion_model_path support added in the preceding commit). Two index entries, both built on gallery/virtual.yaml with the full config inlined in overrides (same pattern as the other models, no dedicated template file): - ideogram-4-iq4nl-ggml (4-bit, ~11.6GB diffusion) - ideogram-4-q8_0-ggml (8-bit, ~20GB diffusion) Each bundles the diffusion + unconditional GGUF (stduhpf), the Qwen3-VL-8B-Instruct text encoder (unsloth), and the FLUX.2 VAE (Comfy-Org mirror, non-gated). cfg_scale is 7 to match the upstream Ideogram4 default, since it performs real CFG unlike the guidance-distilled Flux/Z-Image models. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-06 22:50:12 +02:00
LocalAI [bot]	03c84cff28	feat(parakeet-cpp): nemotron-3.5-asr multilingual streaming model + request language support (#10199 ) * feat(parakeet-cpp): honor request language (multilingual nemotron) on batched + streaming paths Reads opts.GetLanguage() and threads it through to the new parakeet_capi_transcribe_pcm_batch_json_lang and parakeet_capi_stream_begin_lang C-API entry points, both probed with Dlsym so the backend still loads against an older libparakeet.so (falling back to the non-lang paths, i.e. model default). parakeet.cpp's batched C-API takes a single target_lang for the whole batch, so the dispatcher only coalesces same-language requests: a request whose language differs from the batch leader is held as a single carry-over and becomes the leader of the next batch, never dropped and never left waiting (including on shutdown). A new batcher test asserts no dispatched batch is ever mixed-language and that every submitted request still receives a reply. Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(gallery): add parakeet-cpp-nemotron-3.5-asr-streaming-0.6b; bump parakeet.cpp pin Adds the multilingual prompt-conditioned streaming model to the gallery (q8_0 default, OpenMDW-1.1) and bumps the parakeet-cpp backend pin to the parakeet.cpp commit that ships nemotron support plus batched causal subsampling and the batched target_lang C-API. Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-06 13:53:10 +02:00
LocalAI [bot]	9bc69c9e5f	chore(model gallery): 🤖 add 1 new models via gallery agent (#10200 ) chore(model gallery): 🤖 add new models via gallery agent Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-06 13:52:46 +02:00
LocalAI [bot]	1e6c9cfd60	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `6b9de3dbaa21ae95ea80638e5ee836795cc48c93` (#10190 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-06 09:42:43 +02:00
LocalAI [bot]	0e6712f734	chore: ⬆️ Update mudler/parakeet.cpp to `843600590f96a31467a5199f827c253f34c110f7` (#10198 ) chore(parakeet-cpp): bump pin to banded long-audio attention (843600590) Update PARAKEET_VERSION to mudler/parakeet.cpp@843600590f (merge of parakeet.cpp#9). Brings NeMo rel_pos_local_attn banded/Longformer attention with the chunk-matmul construction: long audio now uses O(T*window) attention instead of global O(T^2), fixing the encoder OOM on long clips (~16.6-min clip: 54GB->9.4GB peak, ~4x faster) at NeMo's full [128,128] window. Short clips are unchanged (global path). No C-ABI change. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-06 09:25:25 +02:00
LocalAI [bot]	0e4cee9a97	chore: bump LocalAGI + localrecall (fix pgvector hybrid search seqscan, #10186 ) (#10192 ) chore: bump LocalAGI and localrecall (index-backed RRF hybrid search) Bumps the agent stack to pull in the PostgreSQL hybrid-search fix: - mudler/localrecall -> v0.6.3-...-9a3b3321a9cd (mudler/LocalRecall#46, merged) - mudler/LocalAGI -> ...-14aed1ae4336 (mudler/LocalAGI#477, merged) localrecall's hybrid search previously sorted on a wrapped scalar similarity expression, which blinded the planner into a full sequential scan over every row and exceeded the statement timeout on large collections, returning an empty result set. It now uses the canonical Reciprocal Rank Fusion pattern (index-backed candidate retrieval + FULL OUTER JOIN + weighted RRF). Fixes #10186 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-06 09:16:59 +02:00
Copilot	352b7ec604	Harden gallery-agent Hugging Face fetches against transient rate limiting (#10187 ) * Initial plan * fix: retry HuggingFace trending fetch on transient rate limits * fix: handle body close/write errors in huggingface retry paths --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>	2026-06-05 23:43:06 +02:00
LocalAI [bot]	ba706422fb	chore: ⬆️ Update vllm-project/vllm cu130 wheel to `0.22.1` (#10188 ) ⬆️ Update vllm-project/vllm cu130 wheel Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-05 23:42:50 +02:00
LocalAI [bot]	e837921c2c	feat: forward reasoning_effort to the backend so jinja models honor it (#10184 ) * feat: forward reasoning_effort to the backend so jinja models honor it reasoning_effort was only mapped to the binary enable_thinking toggle and otherwise reached Go-side templates — it was never sent to the backend. So jinja-templated models whose chat template keys on reasoning_effort (gpt-oss Harmony, LFM2.5) could not be driven by it: LFM2.5 ignores enable_thinking and kept emitting <think>. Forward the effective reasoning_effort to the backend as a chat_template_kwarg (mirroring enable_thinking) in grpc-server.cpp, and put it in PredictOptions metadata (gRPCPredictOpts). Add a config-level default: ModelConfig.reasoning_effort and Pipeline.reasoning_effort, resolved by ModelConfig.ApplyReasoningEffort (request value overrides config default, none->disable / level->enable, an operator's reasoning.disable wins). request.go now uses that helper. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): set the pipeline LLM's reasoning_effort Apply Pipeline.ReasoningEffort to the pipeline's LLM config when the realtime model is built (per-session copy, overrides the LLM's own reasoning_effort), and surface the resolved effort on the template input so Go-templated models get it too. jinja models receive it via the backend metadata. This lets a realtime pipeline disable thinking on models that only honor reasoning_effort (e.g. LFM2.5), which enable_thinking can't. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 13:45:43 +00:00
Richard Palethorpe	73385713ca	feat(distributed): enforce registration token for worker file transfer (#10183 ) The worker HTTP file-transfer server is authenticated by the registration token via checkBearerToken, which fails open on an empty token: every /v1/files, /v1/files-list and /v1/backend-logs request is then served unauthenticated, granting read/write to the worker's models/staging/data directories. The fail-open was also silent (the only auth log sat on the unreachable reject branch), and the worker process never runs DistributedConfig.Validate(), so the existing frontend warning did not cover the component that exposes the server. Mirror the NatsRequireAuth pattern: keep anonymous as the default but make it loud and opt-in enforceable. - Log a prominent warning when the file-transfer server starts tokenless. - Add LOCALAI_REGISTRATION_REQUIRE_AUTH: DistributedConfig.Validate() errors on an empty token (frontend) and the worker refuses to start (fail-fast, before registration), so production can fail closed. Also satisfies the F-003 suggestion to fail Validate() on distributed + empty token. - Add LOCALAI_DISTRIBUTED_REQUIRE_AUTH umbrella switch implying both RegistrationRequireAuth and NatsRequireAuth — one production knob locking down the registration/file-transfer layer and the NATS bus together; the granular flags remain available as single-layer overrides. Wired into the frontend, supervisor worker, and agent worker (vLLM worker has neither a NATS connection nor a file-transfer server, so it is left untouched). - Document in distributed-mode.md (warning callout + flag tables). Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-06-05 14:34:28 +02:00
LocalAI [bot]	a4e671779a	chore: ⬆️ Update ggml-org/whisper.cpp to `99613cb720b65036237d44b52f753b51f75c2797` (#10178 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-05 09:04:25 +02:00
LocalAI [bot]	7051b2e0a1	chore: ⬆️ Update ggml-org/llama.cpp to `7c158fbb4aec1bdc9c81d6ca0e785139f4826fae` (#10179 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-05 09:04:10 +02:00
LocalAI [bot]	469737101a	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `1520eda980564241434b791ce2bbbd128c4be9ea` (#10180 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-05 09:03:08 +02:00
LocalAI [bot]	858257eaf0	fix(distributed): self-heal stale 'model not loaded' routing (#10181 ) * fix(distributed): self-heal stale 'model not loaded' routing In distributed mode the registry can list a model as loaded on a node while the worker has evicted it (autonomous LRU eviction, an out-of-band unload, etc.) yet the backend process survives. The router's cached-node check only verifies the process is alive (probeHealth), so it routes there and inference fails with "<backend>: model not loaded" — and stays broken until the controller restarts and rebuilds its registry. InFlightTrackingClient now reconciles this: when a tracked inference call returns a model-not-loaded error, it drops the stale replica row (RemoveNodeModel) so the next request reloads the model on a healthy node instead of routing back to the evicted one. The original error is returned unchanged; only the registry is corrected. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(distributed): typed model-not-loaded error via gRPC status code Replace the controller-side error-string match with a shared, code-aware helper. Go error types don't survive the gRPC boundary, so the signal is carried as a status code (FailedPrecondition): - pkg/grpc/grpcerrors: ModelNotLoaded(backend) constructor + IsModelNotLoaded(err) checker (status-code first, message fallback for backends not yet migrated). - InFlightTrackingClient.reconcile now uses grpcerrors.IsModelNotLoaded. - Migrate the Go backends that emit this error (parakeet-cpp, cloud-proxy, rfdetr-cpp) to the typed constructor. Acting on a false positive is harmless (the model is just reloaded). Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 09:01:36 +02:00
Adira	ef80a0e825	fix(config): add face/speaker recognition constants and register insightface + speaker-recognition (#10110 ) FLAG_FACE_RECOGNITION and FLAG_SPEAKER_RECOGNITION already existed as ModelConfigUsecase bitmask flags, and GuessUsecases already gate-checks both backends by name — but BackendCapabilities had no entries for either, so the UI could not classify them. Also missing were the Method* constants for the five proto-defined RPCs these backends implement (FaceVerify, FaceAnalyze, VoiceVerify, VoiceEmbed, VoiceAnalyze) and the corresponding Usecase* strings and UsecaseInfoMap entries needed to wire them into the rest of the capability system. Changes: - Add MethodFaceVerify, MethodFaceAnalyze, MethodVoiceVerify, MethodVoiceEmbed, MethodVoiceAnalyze GRPCMethod constants - Add UsecaseFaceRecognition ("face_recognition") and UsecaseSpeakerRecognition ("speaker_recognition") Usecase constants - Add UsecaseInfoMap entries for both new usecases, referencing the existing FLAG_FACE_RECOGNITION and FLAG_SPEAKER_RECOGNITION flags - Register insightface: Embedding + Detect + FaceVerify + FaceAnalyze - Register speaker-recognition: VoiceVerify + VoiceEmbed + VoiceAnalyze Follows up on #10107 which left these two out because they needed new constants first. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>	2026-06-04 21:48:01 +02:00
LocalAI [bot]	92726f7631	fix(distributed): stage directory-based models to remote nodes (#10175 ) Distributed file-staging treated every model path field (ModelFile, etc.) as a single regular file: it os.Open'd the path and streamed its fd as the HTTP PUT body. For directory-based models — e.g. qwen3-tts-cpp, whose weights and tokenizer ggufs live under one directory referenced by parameters.model — opening the directory succeeds but reading its fd returns EISDIR, so routing the model to a remote NATS worker failed with "read /models/<model>: is a directory". Single-file models were unaffected, so only multi-file pipelines (e.g. the realtime TTS stage) broke. stageModelFiles now detects a directory path field and stages each contained file individually (via the new stageDirectory helper), preserving structure with the existing StagingKeyMapper and rewriting the field to the remote directory (deriving ModelPath as before). countStageableFiles makes the progress total count a directory's files so the staging tracker stays accurate. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-04 18:05:38 +02:00
LocalAI [bot]	994063ba9a	feat(qwen3-tts-cpp): normalize request language for flexible matching (#10174 ) The qwen3-tts.cpp backend honored the request `language` field only via exact lowercase two-letter codes in the C++ language_to_id table, silently defaulting to English for anything else (en-US, EN, english, ...). Add normalizeLanguage() in the Go handler: lowercase + trim, strip the region/locale suffix (en-US, pt_BR, zh-Hans -> en/pt/zh), and resolve common English full names (english -> en). The canonical codes match the existing C++ table, so no C++ change is needed. Covered by a pure-Go Ginkgo spec. Also document the language field and accepted forms under the Qwen3-TTS docs. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-04 17:26:31 +02:00
LocalAI [bot]	c1a55cf72d	chore: ⬆️ Update mudler/parakeet.cpp to `b11fe5bca78ad8b342dd559a43d76df3984bb447` (#10167 ) ⬆️ Update mudler/parakeet.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-04 12:07:09 +02:00

1 2 3 4 5 ...

6626 Commits