LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-14 11:49:33 -04:00

Author	SHA1	Message	Date
LocalAI [bot]	7637f8cf1b	feat(distributed): declarative per-model scheduling via env/args (#10308) * feat(distributed): add SpreadAll column and authoritative scheduling seeding Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): parse declarative model scheduling config (env/file) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): reconcile spread_all to one replica per matching node Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): wire LOCALAI_MODEL_SCHEDULING env/args and startup seeding Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): expose spread_all on the scheduling API endpoint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): add spread-to-all-nodes mode to the scheduling UI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document LOCALAI_MODEL_SCHEDULING env/args Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): clarify replica modes and all-nodes spread in scheduling config Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-13 18:31:06 +02:00
LocalAI [bot]	f0e001b7f8	fix(xsysinfo): container-aware total RAM detection (cgroup/lxcfs) (#8059 ) (#10288 ) fix(xsysinfo): make reported system RAM total cgroup/lxcfs-aware (#8059) GetSystemRAMInfo derived Total from memory.TotalMemory(), which on Linux uses syscall.Sysinfo().Totalram - the HOST kernel total. lxcfs/LXD does NOT virtualize that value, while MemAvailable (used for Free/Available) IS virtualized. Inside an LXD/container with a 128Gi host but a ~10Gi container view this produced Total=128Gi, Available=10Gi => Used=118Gi, reporting ~92% RAM usage on an idle container. Derive Total instead from the minimum of all non-zero, non-unlimited candidates: cgroup v2 memory.max, cgroup v1 memory.limit_in_bytes (the kernel unlimited sentinel is ignored), /proc/meminfo MemTotal (which lxcfs virtualizes), and the syscall.Sysinfo total as the bare-metal fallback. On bare metal every candidate is unlimited or equals the host total, so behavior is unchanged. The selection/parsing lives in a pure function chooseTotalMemory(...) taking file CONTENTS, unit-tested without a real LXD host; OS file reads stay in a thin wrapper. Assisted-by: claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-13 18:13:06 +02:00
pos-ei-don	cf9debf4eb	model: fix case-insensitive suffix matching and skip .bak files in ListFilesInModelPath (#10306) model: skip .bak files and fix case-insensitive suffix matching in ListFilesInModelPath	2026-06-13 17:46:46 +02:00
LocalAI [bot]	e1556aa1dc	fix(react-ui): make agent chat timestamps format-agnostic (#9867 ) (#10290 ) fix(agents): make React agent chat timestamps format-agnostic The agent SSE bridge emits the json_message timestamp in three different encodings depending on deploy mode: an RFC3339 string (standalone agent pool), Unix milliseconds (local dispatcher), and Unix nanoseconds (the older NATS path). The React AgentChat handler passed data.timestamp straight through, so the standalone string and any numeric value outside the millisecond range rendered as "Invalid Timestamp" or a constant epoch-ish time. Add a small pure helper, normalizeTimestampMs, that accepts an RFC3339 string or a numeric epoch in s/ms/us/ns and returns JS milliseconds, falling back to Date.now() on null/empty/unparseable input. Use it in the json_message handler so the rendered time is correct regardless of which backend path produced it. Fixes #9867 Assisted-by: claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-13 11:05:21 +02:00
LocalAI [bot]	53cbb578a9	chore(model gallery): 🤖 add 1 new models via gallery agent (#10304) chore(model gallery): 🤖 add new models via gallery agent Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-13 11:03:03 +02:00
LocalAI [bot]	99c8205740	fix(react-ui): stop Talk pipeline overflow and center collapsed-rail avatar (#10305 ) Two small visual fixes in the React UI: - Talk page pipeline summary: the four-column grid used `repeat(4, 1fr)`, which resolves to `minmax(auto, 1fr)` so each track refuses to shrink below the min-content width of its `nowrap` model name. Long names (e.g. a verbose GGUF LLM id) blew the grid out past the container despite the per-cell ellipsis styling. Switching to `minmax(0, 1fr)` lets the tracks shrink and the ellipsis take effect. - Sidebar user avatar: the desktop collapsed look centers the avatar via `.sidebar.collapsed .sidebar-user{-link}` rules, but the tablet icon-rail (640-1023px) collapses visually through `.sidebar:not(.open)` without necessarily carrying the `.collapsed` class, so the avatar kept its left-aligned negative margins and looked misaligned. Mirror the centering rules under `.sidebar:not(.open)`. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-13 11:02:48 +02:00
LocalAI [bot]	d7162b9f89	ci(darwin): build the ds4 backend for darwin/arm64 (metal) (#10303 ) The gallery has metal-ds4 / metal-ds4-development entries, and the build recipe exists (make backends/ds4-darwin, special-cased in backend_build_darwin.yml), but ds4 was never listed in the darwin matrix, so no metal-darwin-arm64-ds4 image was ever published and the entries dangled. - Add ds4 to the darwin matrix (includeDarwin), mirroring the llama-cpp form (the reusable workflow builds it via 'make backends/ds4-darwin'). - Fix inferBackendPathDarwin in scripts/changed-backends.js to map ds4 to backend/cpp/ds4/ (like llama-cpp): ds4 is C++ but the matrix entry carries lang=go, so without this its darwin build would only ever run on a release (FORCE_ALL), never incrementally when backend/cpp/ds4 changes. sherpa-onnx and speaker-recognition are already in the darwin matrix on master and are not changed here. Assisted-by: claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-13 11:02:32 +02:00
LocalAI [bot]	3351b62c91	chore(model gallery): 🤖 add 1 new models via gallery agent (#10302) chore(model gallery): 🤖 add new models via gallery agent Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-13 10:59:23 +02:00
LocalAI [bot]	0eca930b8d	fix(gallery): correct meta-backend definitions for platform auto-selection (#10299 ) fix(gallery): correct meta-backend definitions in backend/index.yaml Backends that ship per-platform images must be meta backends (a capabilities map and NO uri) so the right variant is auto-selected per platform - mirroring llama-cpp/whisper. Several entries were misdefined; fixed here: - Concrete base + metal sibling (could not select the Apple Silicon variant): silero-vad, piper, kitten-tts, local-store (+ their -development). Converted each anchor to a meta and added the cpu-<name> concrete. - mlx family (mlx, mlx-vlm, mlx-audio, mlx-distributed + -development): anchor had both a uri AND a capabilities map, so IsMeta() was false and the map was ignored (always resolved to the metal-darwin image); the metal-<name> target did not exist. Removed the uri and added the missing metal-<name> concretes. - Dangling capability targets: diffusers/kokoro nvidia-l4t-cuda-12 repointed to the existing nvidia-l4t-<name> concrete; coqui nvidia-cuda-13 key removed (no cuda13-coqui image). - locate-anything: the meta existed but its concrete entries were never added, so it was un-installable on every platform. Added the full concrete set plus the locate-anything-development meta, mirroring rfdetr-cpp. Image tags grounded against the published quay.io tags. - trl (cuda12/13): repointed the stale 'cublas-cuda12/13-trl' image tags to the actually-published 'gpu-nvidia-cuda-12/13-trl' tags (fixes #9236). Assisted-by: claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-13 10:43:14 +02:00
LocalAI [bot]	81ab62e874	chore(model gallery): 🤖 add 1 new models via gallery agent (#10298) chore(model gallery): 🤖 add new models via gallery agent Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-13 09:58:11 +02:00
LocalAI [bot]	0413fc03f8	fix(gallery): make opus a meta backend for platform auto-selection (#9813 ) (#10291 ) fix(gallery): make opus a meta backend so the platform variant is auto-selected (#9813) The realtime/WebRTC path loads the "opus" codec backend by name, but on macOS arm64 only "metal-opus" is installable, so Load("opus") failed with "opus backend not available". The root cause: unlike llama-cpp and whisper, the opus entry was a concrete CPU backend (it carried a uri and no capabilities map) rather than a meta backend, so nothing mapped "opus" to the platform-appropriate variant. Restructure opus to mirror llama-cpp/whisper: "opus" becomes a meta backend with a capabilities map (default -> cpu-opus, metal -> metal-opus) and no uri; the CPU image moves to a new "cpu-opus" concrete (and its dev variant to "cpu-opus-development"). Installing "opus" now resolves to metal-opus on Apple Silicon and cpu-opus elsewhere, and Load("opus") works on every platform via the meta pointer - so the realtime endpoint needs no special casing. This reverts the realtime_webrtc.go resolution helper from the earlier approach in favor of the gallery-level fix. Assisted-by: claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-13 09:51:02 +02:00
LocalAI [bot]	7088572f75	fix(neutts): pin torchaudio to match torch (fixes undefined symbol) (#9798) (#10292) fix(neutts): pin torchaudio to match torch to avoid ABI mismatch (#9798) neucodec pulls torchaudio transitively but it was unpinned, so an incompatible torchaudio could be resolved against the pinned torch==2.8.0, producing the 'undefined symbol: torch_library_impl' load failure. Pin torchaudio==2.8.0 alongside torch in the cpu and cublas12 requirements. Assisted-by: claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-13 09:28:41 +02:00
LocalAI [bot]	c1e8440f5b	fix(deps): bump cogito to fix MCP image-result panic (#10101) (#10294) fix(mcp): bump cogito to handle non-text tool result content Fixes #10101: the API panicked with "interface conversion: mcp.Content is mcp.ImageContent, not mcp.TextContent" when an MCP tool returned an image. Upstream cogito PR #50 replaced the unchecked TextContent assertion in the tool-result loop with a contentToString type-switch that handles image (and other non-text) content blocks gracefully. Bump github.com/mudler/cogito to v0.10.1-0.20260609212329-bf4010d31047, which includes the fix. Assisted-by: claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-13 09:28:25 +02:00
LocalAI [bot]	8f0059123b	feat(gallery): add 60 piper TTS voices across 42 languages (Phase 2) (#10296 ) Extends the piper voice set with a couple of voices per language for 42 more languages (Arabic, Bulgarian, Catalan, Czech, Welsh, Danish, Greek, Spanish, Basque, Persian, Finnish, French, Hindi, Hungarian, Indonesian, Icelandic, Georgian, Kazakh, Luxembourgish, Latvian, Malayalam, Nepali, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Albanian, Swedish, Swahili, Telugu, Turkish, Ukrainian, Urdu, Vietnamese, Chinese, ...), run through the crispasr backend's backend:piper engine and hosted at LocalAI-Community/piper-voices-GGUF. All converted from rhasspy/piper-voices with CrispASR's convert-piper-to-gguf.py and screened end-to-end on the pinned engine. Only single-speaker low/medium voices are included; high-quality decoders and multi-speaker models segfault and are excluded (e.g. zh_CN-chaowen dropped, huayan kept). Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-13 09:19:21 +02:00
LocalAI [bot]	a906438a69	fix(config): backend-gate the top_k=40 sampler default (#6632 ) (#10285 ) fix(config): gate top_k=40 default on backend family (#6632) SetDefaults injected top_k=40 (llama.cpp's sampling default) for every model config regardless of backend. That value is wrong for backends whose native default differs: mlx_lm's intended default is top_k=0 (disabled) and mlx does not remap 0->40, so a client that omits top_k silently got 40 shipped to mlx, changing sampling. The mlx backend's own getattr(request,'TopK',0) fallback is dead because proto3 int32 is always present. Gate the injection on backend family via UsesLlamaSamplerDefaults: keep top_k=40 for the llama.cpp family and for the empty/auto backend (the GGUF auto-detect path resolves to llama.cpp, so existing behavior is preserved), but leave TopK nil for the known non-llama backends (mlx, mlx-vlm, mlx-distributed). gRPCPredictOpts now sends 0 when TopK is nil, which is the value mlx actually wants. Only TopK is gated - the confirmed bug. The sibling sampler defaults (top_p, temperature, min_p) are left global to avoid widening scope and introducing nil-deref risk; revisit per-backend if needed. Assisted-by: claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-13 09:04:25 +02:00
LocalAI [bot]	d28a5b6da1	chore: ⬆️ Update mudler/locate-anything.cpp to `92c1682da792c1e8a5dec91acc2be4b02c742ded` (#10282) ⬆️ Update mudler/locate-anything.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-13 09:01:17 +02:00
LocalAI [bot]	edeacf22c4	fix(realtime): keep transcription model on a language-only session.update (#10295) A transcription session.update that carries only a language (no model), e.g. a client forcing the STT input language, has an empty Transcription.Model. updateSession unconditionally copied that into session.ModelConfig.Pipeline.Transcription, blanking the pipeline's configured transcription backend. The next utterance then transcribed against an empty model and the backend RPC failed with "unimplemented" (surfaced to the client as transcription_failed), so transcription silently stopped whenever a language was selected. Only adopt the incoming transcription model when it is non-empty, and preserve the existing model otherwise (mirroring updateTransSession). Signed-off-by: mudler <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-13 01:01:36 +02:00
Aniruddh Jha	51f4f67c47	fix(agents): emit chat event timestamps in milliseconds (#9867 ) (#10243 ) Agent chat replies rendered a broken timestamp in the web UI ("Invalid Timestamp" / "12:00 AM", identical for every reply) because the SSE timestamp unit was inconsistent across producers. EventBridge.PublishEvent emitted Unix nanoseconds while the local dispatcher (dispatcher.go) already emitted Unix milliseconds, and the React UI fed the value straight into `new Date(ts)` after dividing by 1e6. Nanoseconds also overflow JS's safe-integer range (~1.7e18). Standardize on Unix milliseconds: switch PublishEvent to UnixMilli and drop the /1e6 conversion in AgentChat.jsx so both SSE paths agree and match the React UI's expectation. Add a regression test asserting the published timestamp is in milliseconds.	2026-06-12 23:18:44 +02:00
LocalAI [bot]	cf71e291b4	fix(darwin): fix vibevoice-cpp build linkage + fail-safe go backend packaging (#10276) * fix(darwin): never package a go backend build tree as a working image The darwin/arm64 vibevoice-cpp image shipped the source tree with a half-built CMake directory (build-libgovibevoicecpp-fallback.so/) and no backend binary, so the backend could never start: run.sh exec'd a vibevoice-cpp binary that was not in the package and LocalAI timed out waiting for the gRPC service. Two durable, backend-agnostic defenses: - backend/go/vibevoice-cpp/Makefile: mirror whisper's cleanup discipline so a partial CMake tree cannot survive into packaging. Run `make purge` before each variant build and `rm -rfv build` after. The old recipe only removed its build dir after a successful `mv`, so a failed build left the half-built tree behind. - scripts/build/golang-darwin.sh: before creating the OCI image, remove any stray build- directory and assert that the binary run.sh launches actually exists. A build that produced no binary now fails the job loudly instead of publishing a source tree as a working backend. The binary name is derived from run.sh's `exec $CURDIR/<binary>` line (parakeet-cpp launches parakeet-cpp-grpc, so it is not always ${BACKEND}) with a ${BACKEND} fallback. The underlying native build failure that left vibevoice-cpp half-built still needs to be reproduced and fixed on Apple Silicon; this change ensures such a failure can never again be published as a working image. Refs #10267 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(vibevoice-cpp): build libvibevoice.a on darwin (link target, not path) The darwin build failed with: No rule to make target 'vibevoice/libvibevoice.a', needed by 'libgovibevoicecpp.so'. Stop. The upstream vibevoice project is added with add_subdirectory(... EXCLUDE_FROM_ALL), so its `vibevoice` static-library target is only built when something links it as a target. The Apple branch linked only `$<TARGET_FILE:vibevoice>` - a bare archive path with no target reference - so CMake never emitted a rule to build libvibevoice.a, while the Linux branch worked because it passes the `vibevoice` target name inside the --whole-archive flags. Link the `vibevoice` target on Apple (establishing the build dependency) and apply -force_load as a separate link option to keep whole-archive semantics so purego can dlsym the vv_capi_* symbols. Refs #10267 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-12 23:13:50 +02:00
LocalAI [bot]	a7a7bd646b	fix(mlx): route vision-language models to the mlx-vlm backend (#10274 ) Vision-language checkpoints such as mlx-community/gemma-4-E4B-it-qat-4bit declare the "image-text-to-text" pipeline tag on HuggingFace. The mlx importer hardcoded backend "mlx" for every mlx-community model, so these VLMs were served by the text-only mlx-lm backend whose tokenizer does not carry the processor chat template. The template was never applied and the model produced degenerate, looping output that echoed the prompt. Detect the "image-text-to-text" pipeline tag in the importer and route those models to mlx-vlm, which applies the processor-aware chat template. An explicit backend preference still wins. As a defensive backstop, the mlx backend now warns loudly when the loaded model has no chat template, so a misrouted VLM surfaces the problem instead of silently looping. Fixes #10269 Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-12 23:12:42 +02:00
LocalAI [bot]	cec93d2e00	docs: ⬆️ update docs version mudler/LocalAI (#10279) ⬆️ Update docs version mudler/LocalAI Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-12 23:12:30 +02:00
LocalAI [bot]	722bdb87e9	chore: ⬆️ Update mudler/parakeet.cpp to `b8012f11e5269126eddb7f4fd02f891a2ccc29b0` (#10281 ) * ⬆️ Update mudler/parakeet.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * fix(parakeet-cpp): close streaming segments on <EOB> after ABI v5 eou/eob split parakeet.cpp ABI v5 (the pin this PR bumps to) splits the streaming JSON "eou" flag: in v4 "eou":1 fired for either <EOU> (end of utterance) or <EOB> (backchannel); in v5 "eou" means <EOU> only, with a new separate "eob" field for the backchannel token. The streamSegmenter closed a segment on "eou" alone, so after the bump a backchannel token would silently stop ending a segment and merge into the next utterance. Read the new "eob" field and flush on either signal to preserve the v4 segmentation boundaries. The flat stream_feed eou_out path is unaffected: its mask is still non-zero for either event. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-12 23:12:04 +02:00
LocalAI [bot]	50dea8c983	feat(crispasr): bundle espeak-ng and add piper TTS voices to the gallery (#10283 ) CrispASR's piper backend phonemizes non-English text via espeak-ng (dlopen, the MIT-clean path; English uses a built-in G2P). The FROM scratch crispasr image shipped none of it, so non-English piper voices loaded but failed synthesis with "phonemization failed". Bundle the espeak-ng runtime so they work: - Dockerfile.golang: install espeak-ng-data + libespeak-ng1 and its libpcaudio0 / libsonic0 deps in the crispasr builder (espeak's dlopen fails without the latter two). - package.sh: copy libespeak-ng.so.1, libpcaudio.so.0, libsonic.so.0 into package/lib/ and the espeak-ng-data dir into the package root. - run.sh: export CRISPASR_ESPEAK_DATA_PATH so the bundled data is found. Add 9 single-speaker piper voices (de/en/it, incl. Italian paola + riccardo) to the gallery, run through backend:piper, hosted at LocalAI-Community/piper-voices-GGUF (converted from rhasspy/piper-voices with CrispASR's convert-piper-to-gguf.py). Only single-speaker low/medium voices are included; the engine does not yet support multi-speaker or high-quality piper decoders. All 9 verified end-to-end: each synthesizes a WAV at the model's native sample rate using only the image-bundled espeak payload. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-12 23:10:30 +02:00
LocalAI [bot]	46ba70632b	fix(crispasr): write piper TTS WAV at the model's native sample rate (#10277 ) CrispASR's piper backend returns PCM at the voice's native rate (from the GGUF piper.sample_rate key: 16 kHz for x_low/low, 22.05 kHz for medium/high) and does not resample, but the Go WAV encoder hardcoded 24000 Hz. Every piper voice was therefore written with a wrong header and played back at the wrong pitch/speed. Read piper.sample_rate from the model's GGUF metadata at Load via the vendored gguf-parser-go and use it for the WAV header, falling back to the 24 kHz default for the other CrispASR TTS engines (vibevoice/orpheus/chatterbox/qwen3-tts) that emit 24 kHz and carry no such key. Adds unit specs (minimal crafted GGUFs + WAV-header decode) and an env-gated end-to-end spec (CRISPASR_PIPER_MODEL_PATH). Verified e2e: en_GB-cori-medium synthesizes a 22050 Hz WAV through backend:piper. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-12 23:10:17 +02:00
LocalAI [bot]	60facc7252	fix(darwin): publish sherpa-onnx and speaker-recognition images for darwin/arm64 (#10275 ) Neither the sherpa-onnx nor the speaker-recognition backend had a darwin/arm64 image, so `local-ai backends install` failed with "no child with platform darwin/arm64" on macOS. This left /v1/audio/diarization (the sherpa-onnx path) and /v1/voice/embed without any usable backend on Apple Silicon. Both backends build on darwin/arm64: - sherpa-onnx (Go) already fetches the onnxruntime osx-arm64 runtime in its Makefile; it only needed a darwin matrix entry (build-type metal, lang go, like whisper and silero-vad). - speaker-recognition (Python) needed a requirements-mps.txt so the mps build installs plain onnxruntime (which ships a macOS arm64 wheel) instead of the onnxruntime-gpu pulled by its base requirements (which does not). Add both to the includeDarwin build matrix, wire the metal capability and metal image aliases into the gallery, and add the speaker-recognition requirements-mps.txt. Fixes #10268 Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-12 22:32:42 +02:00
LocalAI [bot]	8c8204d3c4	feat(parakeet-cpp): enable GGML_CUDA_GRAPHS in the cublas build (#10273) ggml leaves GGML_CUDA_GRAPHS off by default. Passing -DGGML_CUDA_GRAPHS=ON for cublas builds lets the CUDA backend capture and replay the compute graph for a small free speedup (about 1% measured on a GB10, never negative). It is not gated by parakeet.cpp's CMake options, so it passes straight through to ggml. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-12 18:47:36 +02:00
LocalAI [bot]	4ce0f6102a	chore(model gallery): 🤖 add 1 new models via gallery agent (#10270) chore(model gallery): 🤖 add new models via gallery agent Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-12 16:21:35 +02:00
Richard Palethorpe	085fc53bbc	fix(router): production-ready request router + auto-size batch for embedding/rerank (#10104 ) * fix(router): score classifier production-readiness Conversation trimming runs through the classifier model's chat template and trims by exact token count, sized to the model's n_batch which is now scaled to context so long probes can't crash the backend. Missing chat_message templates are a hard error at router build time. Router- facing factories (Embedder/Scorer/Reranker/TokenCounter) re-resolve ModelConfig per call so a model installed post-startup doesn't bind a stub Backend="" config and silently fall into the loader's auto- iterate path. New 'vector_store' backend trace recorded inside localVectorStore on every Search/Insert — including the backend-load-failure path that previously vanished into an xlog.Warn — with outcome tagging (hit/miss/empty_store/backend_load_error/find_error/insert_error/ok). Companion cleanup drops misleading similarity:0 and input_tokens_count:0 from non-hit and text-mode traces. Gallery local-store-development aliases to 'local-store' so the master image satisfies pkg/model.LocalStoreBackend lookups from the embedding cache. Misc: llama-cpp TokenizeString reads the correct 'prompt' JSON key (the original bug); ModelTokenize nil-guard; non-fatal mitm proxy startup; PII 'route_local' renamed to 'allow' with docs/UI in sync; model-editor footer no longer eats the edit area on small screens; several config-editor template/dropdown/section fixes. Tests: e2e router specs (casual/code-hint + long-conversation trim), vector_store trace specs, lazy-factory specs, gallery dev-alias resolution, Playwright trace badge + scroll regression. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(backend): auto-size batch to context for embedding and rerank models Embedding and rerank models pool over the whole input in a single physical batch (n_ubatch). 
With batch left at the 512 default, the backend rejects longer inputs with "input is too large to process", silently capping a large-context embedder (e.g. 8k/32k) at 512 tokens. Size n_batch to the context for these single-pass usecases, mirroring the existing FLAG_SCORE behaviour; an explicit batch: still wins. Extracts EffectiveContextSize/EffectiveBatchSize from grpcModelOpts so the effective decode window has one home for other callers to reuse. Adds an e2e-aio regression test that embeds a >512-token input. The AIO embedding model is switched to nomic-embed-text-v1.5 (2048 context) because the previous granite model was capped at 512 tokens and could not exercise the larger batch. Assisted-by: claude-code:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(gallery): raise arch-router scoring output cap via parallel:64 Scoring decodes the whole prompt+candidate in a single llama_decode and reads one logit row per candidate token. The vendored llama.cpp server caps causal output rows at n_parallel, so the default of 1 aborts with GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) on multi-token route labels. Set options: [parallel:64] on both arch-router quant entries to lift the cap; kv_unified (the grpc-server default) keeps the full context per sequence, so this does not split the KV cache. Assisted-by: claude-code:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-06-12 16:21:15 +02:00
LocalAI [bot]	56cc4f63fc	feat(backend): locate-anything-cpp (open-vocabulary object detection via ggml) (#10264 ) * feat(backend): add locate-anything-cpp backend (open-vocab detection via la_capi) A Go/purego backend wrapping locate-anything.cpp's la_capi C ABI, implementing the gRPC Detect RPC: image + open-vocabulary text prompt -> labeled boxes. Mirrors backend/go/rfdetr-cpp; static-links ggml into a per-CPU-variant .so. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(backend): register locate-anything-cpp in build matrix Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(gallery): locate-anything gallery entry + model importer Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(backend): locate-anything-cpp Load+Detect wire test Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(gallery): add locate-anything-3b model to the gallery index Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(backend): register locate-anything.cpp in bump_deps auto-bump Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: mudler <mudler@localai.io> * ci(test): e2e smoke for locate-anything-cpp in test-extra (loads the 3B + image, runs Detect) Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: mudler <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: mudler <mudler@localai.io> Co-authored-by: mudler <mudler@localai.io>	2026-06-12 14:59:07 +02:00
LocalAI [bot]	a53f34e78f	chore: ⬆️ Update ggml-org/llama.cpp to `4c6595503fe45d5a39f88d194e270f64c7424677` (#10261) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-12 14:57:52 +02:00
Dedy F. Setyawan	1cea96f09f	feat(react-ui): add Indonesian language support (#10266 ) Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com>	2026-06-12 10:08:58 +02:00
LocalAI [bot]	006a9d38c7	chore: ⬆️ Update mudler/parakeet.cpp to `9db92be63179a27201d3b88d5d40c545b2ac48ae` (#10263 ) ⬆️ Update mudler/parakeet.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-12 09:18:21 +02:00
LocalAI [bot]	892ce951ce	chore: ⬆️ Update antirez/ds4 to `d881f2a05e8ff6bec001315a36b794b4aa310173` (#10262 ) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-12 09:18:07 +02:00
LocalAI [bot]	7cda221d36	docs: ⬆️ update docs version mudler/LocalAI (#10259 ) ⬆️ Update docs version mudler/LocalAI Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-12 09:17:49 +02:00
LocalAI [bot]	9a88eb81e7	chore: ⬆️ Update CrispStrobe/CrispASR to `d745bda4386ae0f9d1d2f23fff8ec95d76428221` (#10260 ) ⬆️ Update CrispStrobe/CrispASR Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-12 09:17:34 +02:00
pos-ei-don	58cdc050e9	fix(cuda): install cuda-nvrtc-dev alongside the other CUDA dev packages (#10257 ) Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com> v4.4.2	2026-06-11 23:57:00 +02:00
pos-ei-don	b962f4a192	fix(vllm): parse tool_call function arguments before applying the chat template (#10256 ) Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>	2026-06-11 23:55:38 +02:00
LocalAI [bot]	b6fcb3e1db	chore: ⬆️ Update CrispStrobe/CrispASR to `4b27392ffd0991a857594652cbb8b57e585bcd7b` (#10241 ) ⬆️ Update CrispStrobe/CrispASR Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-11 18:33:58 +02:00
LocalAI [bot]	ff09683d84	chore: ⬆️ Update ggml-org/llama.cpp to `ac4cddeb0dbd778f650bf568f6f08344a06abe3a` (#10239 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-11 18:33:38 +02:00
LocalAI [bot]	f618636c71	docs: fix broken relref to realtime page (#10255 ) Hugo fails the gh-pages build with REF_NOT_FOUND because the relref in model-configuration.md uses the 'docs/' prefix; refs are resolved relative to content/, so the page lives at 'features/openai-realtime' (as the other ref in the same file already uses). Assisted-by: Claude Code:claude-fable-5 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> v4.4.1	2026-06-11 18:32:50 +02:00
LocalAI [bot]	892fc49949	feat(realtime): stream the LLM / TTS / transcription pipeline stages (#10176 ) * feat(realtime): pipeline streaming + disable_thinking config Add a nested pipeline.streaming.{llm,tts,transcription} block plus pipeline.disable_thinking, with StreamLLM/StreamTTS/StreamTranscription/ ThinkingDisabled helpers. Pointer-bools so unset keeps the unary path; existing configs are unaffected. Wiring into the realtime handler follows. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): sentence segmenter for streamed LLM->TTS pipelining streamSegmenter accumulates streamed LLM tokens and emits complete sentence/clause segments (terminator+whitespace, or newline) so TTS can synthesize each segment as it completes instead of waiting for the whole reply. Pure helper; the streaming handler wiring consumes it next. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): streaming TTS/transcription methods on Model interface Add TTSStream and TranscribeStream to the realtime Model interface and implement them on wrappedModel (delegating to backend.ModelTTSStream / ModelTranscriptionStream) and transcriptOnlyModel. ttsStream adapts the backend's WAV-framed stream (44-byte header carrying the sample rate, then PCM) into raw PCM + sample rate for the realtime transports. Handler wiring that consumes these (flag-gated) follows. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): emitSpeech with flag-gated streaming TTS emitSpeech synthesizes a piece of text and forwards audio to the client, streaming one output_audio.delta per backend PCM chunk when the pipeline sets streaming.tts, or one delta for the whole utterance otherwise. WebRTC gets raw PCM (it resamples internally); WebSocket gets base64 PCM at the session rate. 
It emits no transcript/audio-done events so a streamed reply can be split into multiple spoken segments sharing one response. Adds fakeModel/fakeTransport test doubles for the realtime Model/Transport interfaces, driving streaming assertions deterministically. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): route response audio through emitSpeech (streaming TTS) Replace the inline unary TTS block in the response handler with emitSpeech, which streams a response.output_audio.delta per backend PCM chunk when pipeline.streaming.tts is set and otherwise preserves the single-delta unary behaviour. emitSpeech returns the accumulated base64 audio, stored on the conversation item as before. Transcript and audio-done events stay in the handler so later per-segment streaming can reuse emitSpeech. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): streaming transcription text deltas Add emitTranscription and route commitUtterance through it. With pipeline.streaming.transcription set it streams each transcript fragment as a conversation.item.input_audio_transcription.delta via TranscribeStream then a completed event; otherwise it preserves the single completed-event unary behaviour. Returns the final transcript for response generation. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): pipeline disable_thinking maps to enable_thinking off applyPipelineThinking forces the LLM's ReasoningConfig.DisableReasoning when pipeline.disable_thinking is set, which gRPCPredictOpts turns into the enable_thinking=false backend metadata. Applied at newModel construction on the per-session LLM config copy, so it doesn't leak to other model users and needs no realtime-specific request plumbing. 
Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): speechStreamer for token-streamed LLM->TTS emitSpeech now returns raw PCM (caller base64-encodes) so streamed segments accumulate correctly. speechStreamer consumes streamed LLM tokens: it strips reasoning via the streaming ReasoningExtractor, emits a transcript delta per content fragment, and sentence-pipes content into emitSpeech so each sentence is synthesized as soon as it's ready. Handler wiring (plain-content turns) follows. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): wire streamLLMResponse for token-streamed replies triggerResponseAtTurn takes a streamed path when pipeline.streaming.llm is set, the turn has no tools, and audio is requested: streamLLMResponse announces the assistant item, drives the LLM token callback through a speechStreamer (reasoning-stripped transcript deltas + sentence-piped TTS), and emits the terminal events. Tool turns and non-streaming pipelines keep the existing buffered path unchanged, so this is strictly opt-in. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(realtime): document pipeline streaming + disable_thinking Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(realtime): register pipeline streaming/thinking config fields TestAllFieldsHaveRegistryEntries (core/config/meta) requires every config field to have a meta registry entry. The four new pipeline fields (disable_thinking, streaming.{llm,tts,transcription}) had none, failing tests-linux/tests-apple. Add toggle entries for them. Also handle the os.Remove return in realtime_speech_test.go to satisfy errcheck (golangci-lint). 
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(realtime): always strip reasoning from spoken output disable_thinking maps to ReasoningConfig.DisableReasoning=true on the LLM config, which the backend reads as enable_thinking=false. But the realtime handler reads that SAME config to drive reasoning extraction, and there DisableReasoning=true means "skip stripping". PredictConfig() returns this LLM config, so both the streamed (speechStreamer) and buffered realtime paths stopped stripping <think>…</think> exactly when disable_thinking was on — leaking raw reasoning to the client whenever the model ignored the enable_thinking hint (e.g. lfm2.5). Add spokenReasoningConfig() which clears DisableReasoning for extraction (keeping custom tokens/tag pairs) and route both realtime paths through it. Spoken output now always strips reasoning, independent of the backend suppression hint. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(realtime): clean TTS temp path before read (gosec G304) emitSpeech reads the WAV file the TTS backend wrote. The read moved here from realtime.go, so code-scanning flagged it as a new G304 alert even though the path is backend-controlled (a temp file), not user input. Wrap it in filepath.Clean — a real path normalization that also clears the alert, keeping with the repo's no-#nosec convention. Assisted-by: Claude:claude-opus-4-8 gosec, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(realtime): buffer whole message for TTS, drop sentence segmenter Per review (richiejp): the sentence segmenter pipelined unary TTS by splitting on ASCII .!?/newline, which does nothing for languages without those boundaries (CJK/Thai) — there it already degraded to buffering the whole message anyway. 
Replace it with a uniform model: stream the LLM transcript live, buffer the full message, then synthesize it once. emitSpeech already streams the audio chunks when the backend implements TTSStream and falls back to a single unary delta otherwise, so this is real streaming TTS where supported and a clean whole-message synthesis elsewhere — no per-sentence emulation, no language assumptions. speechStreamer becomes transcriptStreamer (transcript deltas only); the whole-message synthesis moves into streamLLMResponse. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): stream tool-call turns via tokenizer-template autoparser Per review (richiejp): tool-call deltas exist, so streaming should work with tools too. It does — for models that use their tokenizer template. The C++ autoparser then clears reply.Message and delivers content + tool calls via ChatDeltas, so the streamed transcript carries only spoken content (no tool-call JSON leak) and the tool calls are parsed from the final response. - Drop the len(tools)==0 gate; stream when no tools OR use_tokenizer_template (grammar-based function calling still buffers, since its call is emitted as JSON in the token stream and would leak into the transcript). - streamLLMResponse takes tools/toolChoice/toolTurn, reads ChatDelta content in the token callback, parses tool calls from the final ChatDeltas, and creates the assistant content item lazily so a content-less tool turn emits only the tool calls. - Extract emitToolCallItems from the buffered path so both paths finalize tool calls, response.done, and server-side assistant-tool follow-ups identically. 
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): script-aware clause chunking + streamed-reply fixes Opt-in pipeline.streaming.clause_chunking splits the streamed LLM reply into speakable clauses and synthesizes each as soon as it completes, lowering time-to-first-audio instead of buffering the whole message. The splitter is script-aware (rivo/uniseg, pure Go): UAX#29 sentence segmentation handles CJK 。！？ with no whitespace, CJK clause punctuation (，、；：) and Thai/Lao spaces give finer cuts, and a UAX#14 line-break cap bounds an over-long punctuation-less run. Unlike the old ASCII .!?/newline segmenter (dropped in `076dcdbe`) it does not degrade to whole-message buffering for CJK/Thai; scripts needing a dictionary (Khmer/Burmese) stay buffered until a space or end-of-message. Clauses are synthesized synchronously in the token callback (the LLM keeps generating into the gRPC stream meanwhile), so audio still starts mid-generation. Off by default — the whole-message path is unchanged. Also fix the streamed-reply path and the Talk page: - Don't swallow streamed autoparser content as reasoning: the tokenizer-template path already delivers reasoning-free content via ChatDeltas, so prefilling the thinking start token re-tagged it as an unclosed reasoning block, leaving no spoken reply. Disable the prefill on that path; closed tag pairs are still stripped (#9985). - Generate collision-free realtime IDs (16 random bytes) instead of a constant, so per-item bookkeeping (cancel, conversation.item.retrieve) works. - Key the Talk transcript by the server item_id and upsert entries. Realtime events arrive over a WebRTC data channel — outside React's event system — so React defers the setTranscript updaters while synchronous ref writes in handler bodies run first; the old index-tracking ref rendered a duplicate assistant bubble on completion. Upserts by item_id are idempotent and order-independent. 
- Drop the partial assistant bubble on a cancelled response (barge-in): the server discards the interrupted item and sends response.done with status "cancelled"; mirror that in the UI so the regenerated reply isn't rendered as a second assistant message. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Assisted-by: Claude:claude-fable-5 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Richard Palethorpe <io@richiejp.com> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Richard Palethorpe <io@richiejp.com>	2026-06-11 08:43:12 +01:00
pos-ei-don	228a6dfe79	fix(vllm): restore compatibility with vLLM >= 0.22 (get_tokenizer moved to vllm.tokenizers) (#10252) fix(vllm): restore compatibility with vLLM >= 0.22 (get_tokenizer moved) vLLM 0.22 moved get_tokenizer from vllm.transformers_utils.tokenizer to vllm.tokenizers. Since the backend requirements install vllm unpinned, freshly built/installed vllm backends currently fail to start with ModuleNotFoundError: No module named 'vllm.transformers_utils.tokenizer' (surfacing as 'grpc service not ready' when loading a model). Use the same try/except version-compat import pattern already used elsewhere in this file: try the new vllm.tokenizers location first and fall back to the pre-0.22 path. Tested on a DGX Spark (GB10, ARM64) with the cuda13-nvidia-l4t-arm64-vllm backend and vllm 0.22.0: model load, chat completions and tool calls all work with this patch applied. Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-11 09:05:23 +02:00
LocalAI [bot]	51a92b6093	chore: ⬆️ Update antirez/ds4 to `8384adf0f9fa0f3bb342dd925372de778b95b263` (#10242) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-11 00:10:34 +02:00
LocalAI [bot]	b5964d385d	docs: ⬆️ update docs version mudler/LocalAI (#10245) ⬆️ Update docs version mudler/LocalAI Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-11 00:10:10 +02:00
LocalAI [bot]	fba8c9c498	fix(distributed): track in-flight for non-LLM inference methods (VAD, diarize, voice, ...) (#10238) fix(distributed): track in-flight for non-LLM inference methods InFlightTrackingClient only wrapped a subset of the grpc.Backend inference methods (Predict, Embeddings, TTS, AudioTranscription, Detect, Rerank, ...). Methods like VAD were left as embedded passthrough, so track() never ran for them. In distributed mode every model is loaded with in_flight=1 as a reservation; that reservation is only released by the OnFirstComplete callback, which fires after the first tracked inference call completes. A VAD-only model (e.g. silero-vad) never calls a tracked method, so the reservation is never released and in-flight stays pinned at 1 forever - which also blocks the router's idle-eviction logic. Wrap the remaining unary inference methods (VAD, Diarize, Face, Voice, TokenClassify, Score, AudioEncode, AudioDecode, AudioTransform) with the same track()/reconcile() pattern. The three bidi-stream constructors (AudioTransformStream, AudioToAudioStream, Forward) are deliberately left as passthrough - their inference spans the stream lifetime, not the constructor call, so track() there would fire onFirstComplete before any data flows. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> v4.4.0	2026-06-10 16:29:50 +02:00
LocalAI [bot]	6b2badb837	chore: ⬆️ Update CrispStrobe/CrispASR to `c29f6653a516a3001d923944dad8892072cc7334` (#10236) ⬆️ Update CrispStrobe/CrispASR Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-10 16:16:24 +02:00
LocalAI [bot]	8b8506d01a	chore: ⬆️ Update ggml-org/llama.cpp to `039e20a2db9e87b2477c76cc04905f3e1acad77f` (#10223) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-10 12:22:03 +02:00
LocalAI [bot]	6910a0bb48	chore: ⬆️ Update antirez/ds4 to `91bafb5acd5a6cf00b1e55ef68bf40ddd207bee7` (#10234) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-10 12:08:19 +02:00
LocalAI [bot]	cffd03b522	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `e6f8112f3ba126eed3ff5b30cdd08085414a7516` (#10233) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-10 12:07:49 +02:00
LocalAI [bot]	bf448d3794	chore: ⬆️ Update ggml-org/whisper.cpp to `df7638d8229a243af8a4b5a8ae557e0d74e0a0ae` (#10220) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-10 01:16:29 +02:00

6670 Commits