Commit Graph

6641 Commits

Author SHA1 Message Date
Ettore Di Giacinto
c9c6040fe8 feat(dllm): default gallery entry on Q4_K_M; add Q8_0 variant
Q4_K_M (~17 GB, GB10-validated: cosine 0.9862, coherent generation) is
the friendlier default download than the 50 GB BF16; Q8_0 (~27 GB) is
the higher-fidelity middle ground. Both descriptions carry the measured
caveat that BF16 is ~5x faster per denoise step on BF16-native hardware,
with a pointer to fetch it manually when it fits. sha256 values are the
HF LFS oids.

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-11 20:24:26 +00:00
Ettore Di Giacinto
8134d6db37 docs(dllm): record Q4_K_M validation and quantization guidance
Q4_K_M validated on GB10: quality holds (cosine 0.9862, coherent
generation, 19/48 stopper exit) but a forward step is ~5x slower than
BF16 (27.5s vs 5.6s: native BF16 tensor cores vs K-quant MoE dequant).
Guidance: prefer BF16 when it fits; Q4_K_M is the memory-bound option.

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-11 19:22:02 +00:00
Ettore Di Giacinto
ad6d1dbc8b feat(grpc): request cancellation for Go backends via the Cancellable capability
The llama.cpp C++ backend aborts generation when its gRPC context is
cancelled (grpc-server.cpp polls context->IsCancelled() in the result
loops), but Go backends served by pkg/grpc never observed context
cancellation: a disconnected client left the generation running to
completion. Add an optional Cancellable capability; the server registers
context.AfterFunc on the request/stream context (after the Locking block
so queued requests cannot abort the current owner) covering both rich
and legacy paths. dllm implements it: measured cancel latency ~10ms vs
~10s of orphaned generation, and follow-up requests no longer queue
behind cancelled ones (~220ms vs ~9s in the e2e proof).

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-11 17:50:04 +00:00
Ettore Di Giacinto
eb61e1d770 chore(dllm): review fixes - file modes and build-matrix doc accuracy
Drop the stray executable bit from the Go sources and Makefile (the
sibling Go backends commit them 644; only run.sh/package.sh are
executable), and correct two documentation claims found in the final
branch review: cuda13-dllm is built for amd64 only (arm64 CUDA ships as
the l4t flavor), and package.sh is the parakeet-cpp-style stub layout
with no ldd walk.

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-11 17:17:54 +00:00
Ettore Di Giacinto
aba9c4794a docs(dllm): backend documentation and agents topic guide
User docs: dllm section in text-generation (setup, eb_* options table,
n_predict canvas rounding, enable_thinking metadata, honest GB10
throughput numbers). Agents guide: .agents/dllm-backend.md covering the
purego C-ABI contract, serialization rules, template provenance, test
layers, and known limitations.

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-11 17:05:18 +00:00
Ettore Di Giacinto
04d6f66a9a feat(dllm): diffusiongemma gallery entry and e2e coverage
Gallery model diffusiongemma-26b-a4b-it (unsloth BF16 GGUF, sha256
verified against the HF LFS oid) with use_tokenizer_template and an
honest experimental/throughput description. e2e: BACKEND_BINARY-mode
specs boot the real gRPC backend with the tiny fixture model (templated
chat + streaming); real-26B specs are separately env-gated. Adds an
opt-in BACKEND_TEST_SEED knob so random-weight fixture models run the
generic specs deterministically.

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-11 17:05:18 +00:00
Ettore Di Giacinto
52b3b68cea feat(dllm): backend packaging, gallery index, CI matrix
Registers the dllm backend across every surface: backend gallery index
(cpu amd64+arm64 with manifest merge, cuda13, l4t-cuda13 for GB10-class
hardware; no darwin per engine scope), top-level Makefile targets,
bump_deps pin tracking for DLLM_VERSION, and the curated known-backends
list for /backends/known (pref-only: auto-detecting on .gguf would
shadow llama-cpp). Note: image builds and the nightly bump leg stay red
until github.com/mudler/dllm.cpp is published (planned at merge time).

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-11 17:05:18 +00:00
Ettore Di Giacinto
99184809fa feat(dllm): rich gRPC backend with ChatDelta streaming
Implements PredictRich/PredictStreamRich (legacy methods delegate),
TokenizeString, and Load over the purego binding. A single worker
goroutine serializes all C calls per the dllm.cpp one-generate-per-ctx
contract (cancel is the documented exception); an RWMutex guards Free
against in-flight requests. Under use_tokenizer_template the gemma4
renderer and streaming parser own templating and ChatDelta extraction;
raw-prompt mode passes through verbatim. enable_thinking is opt-in via
request metadata (the gemma4 template treats thinking as opt-in).

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-11 16:14:37 +00:00
Ettore Di Giacinto
294c04ae2f feat(dllm): gemma4 streaming parser emitting ChatDeltas
Fragment-safe state machine (content / channel header / thought /
tool-call / done) classifying model output into content,
reasoning_content and tool_calls deltas. Tool-call payload decoder is a
non-partial port of vLLM's gemma4 parser grammar; ~25 of its test cases
are ported with citations, plus a 2-split invariance property over
every byte position. Recursion depth-capped against model-generated
deep nesting; marker constants shared with the renderer.

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-11 15:55:27 +00:00
Ettore Di Giacinto
778f85c2a0 feat(dllm): purego backend scaffold over the dllm.cpp C-ABI
Binds the 9-symbol flat C-ABI of dllm.cpp (DiffusionGemma engine) via
purego: typed wrappers with correct string ownership (malloc'd returns
freed via dllm_capi_free_string, borrowed last_error never freed),
once-allocated stream-callback trampolines, and a gated Ginkgo binding
smoke against the tiny fixture model.

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-11 14:50:39 +00:00
Ettore Di Giacinto
af0db1419c test(http): make the suite listen port configurable
The core/http specs hardcoded 127.0.0.1:9090 in ~70 call sites, so the
pre-commit coverage gate fails on any machine where an unrelated service
holds 9090. Centralize the address in the suite file behind
LOCALAI_TEST_HTTP_PORT (default unchanged: 9090).

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-11 14:28:39 +00:00
LocalAI [bot]
892fc49949 feat(realtime): stream the LLM / TTS / transcription pipeline stages (#10176)
* feat(realtime): pipeline streaming + disable_thinking config

Add a nested pipeline.streaming.{llm,tts,transcription} block plus
pipeline.disable_thinking, with StreamLLM/StreamTTS/StreamTranscription/
ThinkingDisabled helpers. Pointer-bools so unset keeps the unary path;
existing configs are unaffected. Wiring into the realtime handler follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): sentence segmenter for streamed LLM->TTS pipelining

streamSegmenter accumulates streamed LLM tokens and emits complete
sentence/clause segments (terminator+whitespace, or newline) so TTS can
synthesize each segment as it completes instead of waiting for the whole
reply. Pure helper; the streaming handler wiring consumes it next.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): streaming TTS/transcription methods on Model interface

Add TTSStream and TranscribeStream to the realtime Model interface and
implement them on wrappedModel (delegating to backend.ModelTTSStream /
ModelTranscriptionStream) and transcriptOnlyModel. ttsStream adapts the
backend's WAV-framed stream (44-byte header carrying the sample rate, then
PCM) into raw PCM + sample rate for the realtime transports. Handler wiring
that consumes these (flag-gated) follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): emitSpeech with flag-gated streaming TTS

emitSpeech synthesizes a piece of text and forwards audio to the client,
streaming one output_audio.delta per backend PCM chunk when the pipeline
sets streaming.tts, or one delta for the whole utterance otherwise. WebRTC
gets raw PCM (it resamples internally); WebSocket gets base64 PCM at the
session rate. It emits no transcript/audio-done events so a streamed reply
can be split into multiple spoken segments sharing one response.

Adds fakeModel/fakeTransport test doubles for the realtime Model/Transport
interfaces, driving streaming assertions deterministically.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): route response audio through emitSpeech (streaming TTS)

Replace the inline unary TTS block in the response handler with emitSpeech,
which streams a response.output_audio.delta per backend PCM chunk when
pipeline.streaming.tts is set and otherwise preserves the single-delta unary
behaviour. emitSpeech returns the accumulated base64 audio, stored on the
conversation item as before. Transcript and audio-done events stay in the
handler so later per-segment streaming can reuse emitSpeech.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): streaming transcription text deltas

Add emitTranscription and route commitUtterance through it. With
pipeline.streaming.transcription set it streams each transcript fragment as
a conversation.item.input_audio_transcription.delta via TranscribeStream
then a completed event; otherwise it preserves the single completed-event
unary behaviour. Returns the final transcript for response generation.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): pipeline disable_thinking maps to enable_thinking off

applyPipelineThinking forces the LLM's ReasoningConfig.DisableReasoning when
pipeline.disable_thinking is set, which gRPCPredictOpts turns into the
enable_thinking=false backend metadata. Applied at newModel construction on
the per-session LLM config copy, so it doesn't leak to other model users and
needs no realtime-specific request plumbing.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): speechStreamer for token-streamed LLM->TTS

emitSpeech now returns raw PCM (caller base64-encodes) so streamed segments
accumulate correctly. speechStreamer consumes streamed LLM tokens: it strips
reasoning via the streaming ReasoningExtractor, emits a transcript delta per
content fragment, and sentence-pipes content into emitSpeech so each sentence
is synthesized as soon as it's ready. Handler wiring (plain-content turns)
follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): wire streamLLMResponse for token-streamed replies

triggerResponseAtTurn takes a streamed path when pipeline.streaming.llm is
set, the turn has no tools, and audio is requested: streamLLMResponse
announces the assistant item, drives the LLM token callback through a
speechStreamer (reasoning-stripped transcript deltas + sentence-piped TTS),
and emits the terminal events. Tool turns and non-streaming pipelines keep
the existing buffered path unchanged, so this is strictly opt-in.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(realtime): document pipeline streaming + disable_thinking

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(realtime): register pipeline streaming/thinking config fields

TestAllFieldsHaveRegistryEntries (core/config/meta) requires every config
field to have a meta registry entry. The four new pipeline fields
(disable_thinking, streaming.{llm,tts,transcription}) had none, failing
tests-linux/tests-apple. Add toggle entries for them.

Also handle the os.Remove return in realtime_speech_test.go to satisfy
errcheck (golangci-lint).

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(realtime): always strip reasoning from spoken output

disable_thinking maps to ReasoningConfig.DisableReasoning=true on the LLM
config, which the backend reads as enable_thinking=false. But the realtime
handler reads that SAME config to drive reasoning extraction, and there
DisableReasoning=true means "skip stripping". PredictConfig() returns this
LLM config, so both the streamed (speechStreamer) and buffered realtime
paths stopped stripping <think>…</think> exactly when disable_thinking was
on — leaking raw reasoning to the client whenever the model ignored the
enable_thinking hint (e.g. lfm2.5).

Add spokenReasoningConfig() which clears DisableReasoning for extraction
(keeping custom tokens/tag pairs) and route both realtime paths through it.
Spoken output now always strips reasoning, independent of the backend
suppression hint.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(realtime): clean TTS temp path before read (gosec G304)

emitSpeech reads the WAV file the TTS backend wrote. The read moved here
from realtime.go, so code-scanning flagged it as a new G304 alert even
though the path is backend-controlled (a temp file), not user input.
Wrap it in filepath.Clean — a real path normalization that also clears
the alert, keeping with the repo's no-#nosec convention.

Assisted-by: Claude:claude-opus-4-8 gosec, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(realtime): buffer whole message for TTS, drop sentence segmenter

Per review (richiejp): the sentence segmenter pipelined unary TTS by
splitting on ASCII .!?/newline, which does nothing for languages without
those boundaries (CJK/Thai) — there it already degraded to buffering the
whole message anyway.

Replace it with a uniform model: stream the LLM transcript live, buffer the
full message, then synthesize it once. emitSpeech already streams the audio
chunks when the backend implements TTSStream and falls back to a single
unary delta otherwise, so this is real streaming TTS where supported and a
clean whole-message synthesis elsewhere — no per-sentence emulation, no
language assumptions. speechStreamer becomes transcriptStreamer (transcript
deltas only); the whole-message synthesis moves into streamLLMResponse.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): stream tool-call turns via tokenizer-template autoparser

Per review (richiejp): tool-call deltas exist, so streaming should work with
tools too. It does — for models that use their tokenizer template. The C++
autoparser then clears reply.Message and delivers content + tool calls via
ChatDeltas, so the streamed transcript carries only spoken content (no
tool-call JSON leak) and the tool calls are parsed from the final response.

- Drop the len(tools)==0 gate; stream when no tools OR use_tokenizer_template
  (grammar-based function calling still buffers, since its call is emitted as
  JSON in the token stream and would leak into the transcript).
- streamLLMResponse takes tools/toolChoice/toolTurn, reads ChatDelta content
  in the token callback, parses tool calls from the final ChatDeltas, and
  creates the assistant content item lazily so a content-less tool turn emits
  only the tool calls.
- Extract emitToolCallItems from the buffered path so both paths finalize tool
  calls, response.done, and server-side assistant-tool follow-ups identically.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): script-aware clause chunking + streamed-reply fixes

Opt-in pipeline.streaming.clause_chunking splits the streamed LLM reply
into speakable clauses and synthesizes each as soon as it completes,
lowering time-to-first-audio instead of buffering the whole message. The
splitter is script-aware (rivo/uniseg, pure Go): UAX#29 sentence
segmentation handles CJK 。!? with no whitespace, CJK clause
punctuation (,、;:) and Thai/Lao spaces give finer cuts, and a UAX#14
line-break cap bounds an over-long punctuation-less run. Unlike the old
ASCII .!?/newline segmenter (dropped in 076dcdbe) it does not degrade to
whole-message buffering for CJK/Thai; scripts needing a dictionary
(Khmer/Burmese) stay buffered until a space or end-of-message. Clauses
are synthesized synchronously in the token callback (the LLM keeps
generating into the gRPC stream meanwhile), so audio still starts
mid-generation. Off by default — the whole-message path is unchanged.

Also fix the streamed-reply path and the Talk page:

- Don't swallow streamed autoparser content as reasoning: the
  tokenizer-template path already delivers reasoning-free content via
  ChatDeltas, so prefilling the thinking start token re-tagged it as an
  unclosed reasoning block, leaving no spoken reply. Disable the prefill
  on that path; closed tag pairs are still stripped (#9985).

- Generate collision-free realtime IDs (16 random bytes) instead of a
  constant, so per-item bookkeeping (cancel, conversation.item.retrieve)
  works.

- Key the Talk transcript by the server item_id and upsert entries.
  Realtime events arrive over a WebRTC data channel — outside React's
  event system — so React defers the setTranscript updaters while
  synchronous ref writes in handler bodies run first; the old
  index-tracking ref rendered a duplicate assistant bubble on
  completion. Upserts by item_id are idempotent and order-independent.

- Drop the partial assistant bubble on a cancelled response (barge-in):
  the server discards the interrupted item and sends response.done with
  status "cancelled"; mirror that in the UI so the regenerated reply
  isn't rendered as a second assistant message.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Assisted-by: Claude:claude-fable-5 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Richard Palethorpe <io@richiejp.com>
2026-06-11 08:43:12 +01:00
pos-ei-don
228a6dfe79 fix(vllm): restore compatibility with vLLM >= 0.22 (get_tokenizer moved to vllm.tokenizers) (#10252)
fix(vllm): restore compatibility with vLLM >= 0.22 (get_tokenizer moved)

vLLM 0.22 moved get_tokenizer from vllm.transformers_utils.tokenizer
to vllm.tokenizers. Since the backend requirements install vllm
unpinned, freshly built/installed vllm backends currently fail to
start with ModuleNotFoundError: No module named
'vllm.transformers_utils.tokenizer' (surfacing as 'grpc service not
ready' when loading a model).

Use the same try/except version-compat import pattern already used
elsewhere in this file: try the new vllm.tokenizers location first and
fall back to the pre-0.22 path.

Tested on a DGX Spark (GB10, ARM64) with the
cuda13-nvidia-l4t-arm64-vllm backend and vllm 0.22.0: model load, chat
completions and tool calls all work with this patch applied.

Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 09:05:23 +02:00
LocalAI [bot]
51a92b6093 chore: ⬆️ Update antirez/ds4 to 8384adf0f9fa0f3bb342dd925372de778b95b263 (#10242)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-11 00:10:34 +02:00
LocalAI [bot]
b5964d385d docs: ⬆️ update docs version mudler/LocalAI (#10245)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-11 00:10:10 +02:00
LocalAI [bot]
fba8c9c498 fix(distributed): track in-flight for non-LLM inference methods (VAD, diarize, voice, ...) (#10238)
fix(distributed): track in-flight for non-LLM inference methods

InFlightTrackingClient only wrapped a subset of the grpc.Backend
inference methods (Predict, Embeddings, TTS, AudioTranscription, Detect,
Rerank, ...). Methods like VAD were left as embedded passthrough, so
track() never ran for them.

In distributed mode every model is loaded with in_flight=1 as a
reservation; that reservation is only released by the OnFirstComplete
callback, which fires after the first *tracked* inference call completes.
A VAD-only model (e.g. silero-vad) never calls a tracked method, so the
reservation is never released and in-flight stays pinned at 1 forever -
which also blocks the router's idle-eviction logic.

Wrap the remaining unary inference methods (VAD, Diarize, Face*, Voice*,
TokenClassify, Score, AudioEncode, AudioDecode, AudioTransform) with the
same track()/reconcile() pattern. The three bidi-stream constructors
(AudioTransformStream, AudioToAudioStream, Forward) are deliberately left
as passthrough - their inference spans the stream lifetime, not the
constructor call, so track() there would fire onFirstComplete before any
data flows.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
v4.4.0
2026-06-10 16:29:50 +02:00
LocalAI [bot]
6b2badb837 chore: ⬆️ Update CrispStrobe/CrispASR to c29f6653a516a3001d923944dad8892072cc7334 (#10236)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 16:16:24 +02:00
LocalAI [bot]
8b8506d01a chore: ⬆️ Update ggml-org/llama.cpp to 039e20a2db9e87b2477c76cc04905f3e1acad77f (#10223)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 12:22:03 +02:00
LocalAI [bot]
6910a0bb48 chore: ⬆️ Update antirez/ds4 to 91bafb5acd5a6cf00b1e55ef68bf40ddd207bee7 (#10234)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 12:08:19 +02:00
LocalAI [bot]
cffd03b522 chore: ⬆️ Update ikawrakow/ik_llama.cpp to e6f8112f3ba126eed3ff5b30cdd08085414a7516 (#10233)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 12:07:49 +02:00
LocalAI [bot]
bf448d3794 chore: ⬆️ Update ggml-org/whisper.cpp to df7638d8229a243af8a4b5a8ae557e0d74e0a0ae (#10220)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:16:29 +02:00
LocalAI [bot]
1d4a12f7c0 chore: ⬆️ Update CrispStrobe/CrispASR to 97cad527d247edefc904e6c40c4cf5ee78bed055 (#10221)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:16:17 +02:00
LocalAI [bot]
186d62801d chore: ⬆️ Update leejet/stable-diffusion.cpp to 19bdfe22d255d5b4dff39d449318b9bc5ea2317f (#10222)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:16:06 +02:00
LocalAI [bot]
da4ed05429 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 2768b6251548b78b6610e95edad13f888ad95982 (#10219)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:15:54 +02:00
LocalAI [bot]
ec1eea4f45 chore: ⬆️ Update antirez/ds4 to 512d07cb08f234b704b5a5959aa9e2d4c466eeb0 (#10224)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:15:42 +02:00
LocalAI [bot]
b203b32e57 feat(realtime): make WebRTC ICE candidates configurable (#10231)
The /v1/realtime WebRTC handler created the peer connection with a bare
webrtc.Configuration and no SettingEngine, so pion gathered a host ICE
candidate for every local interface. Under Docker host networking that
includes bridge addresses (docker0/veth, 172.x) a remote browser cannot
route to; the call establishes on a good pair and then drops once ICE
consent freshness checks fail on the unreachable candidates.

Add two opt-in knobs, applied via a pion SettingEngine:
- LOCALAI_WEBRTC_NAT_1TO1_IPS: advertise these IPs as the host candidates
  (e.g. the host LAN IP)
- LOCALAI_WEBRTC_ICE_INTERFACES: restrict ICE gathering to these interfaces

Defaults are unchanged (empty => current all-interface behavior).

Assisted-by: Claude:claude-opus-4-8

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-09 22:28:03 +02:00
Ching
48a8ce98aa fix(cli): handle chat output errors (#10229)
Propagate terminal write errors from the chat prompt and explicitly ignore stream close errors during cleanup.

Update chat tests to assert response writer errors so errcheck passes without hiding failed writes.

Tests:
- go test -count=1 ./core/cli/chat
- go test -count=1 ./core/cli

Assisted-by: Codex:GPT-5

Signed-off-by: Ching Kao <0980124jim@gmail.com>
2026-06-09 19:10:24 +02:00
Ching
8344d1c865 feat(cli): add interactive chat mode (#10226)
Add an opt-in `local-ai chat` command for testing chat models directly from the terminal without manually sending curl requests.

The command connects to a running LocalAI server, lists available models through the existing OpenAI-compatible API, streams chat completions, and supports interactive commands such as `/models`, `/model`, `/clear`, and `/exit`.

Keep `local-ai run` focused on the server lifecycle so the web UI, API clients, and multiple chat terminals can coexist against the same server.

Document the new command and terminal workflow in the README and CLI docs.

Tests:
- go test -count=1 ./core/cli/chat
- go test -count=1 ./core/cli

Assisted-by: Codex:GPT-5

Signed-off-by: Ching Kao <0980124jim@gmail.com>
2026-06-09 14:58:44 +00:00
Pete
d2e6b93369 feat(agents): surface KB source citations in RAG responses (#10228)
* dev knowledge.go structure

Signed-off-by: Pete Chen <petechentw@gmail.com>

* feat(agents): append KB source citations to responses

Render structured KB citations as a Sources block after agent responses, linking each source to the existing raw collection entry endpoint.

Keep long-term memory writes on the original model response so citation blocks do not get stored back into the knowledge base.

Tested with: go test ./core/services/agents

Assisted-by: Codex:gpt-5
Signed-off-by: Pete Chen <petechentw@gmail.com>

* Collect KB citations from tool searches

Signed-off-by: Pete Chen <petechentw@gmail.com>

* fix(agents): append KB sources in local chats

Apply the shared KB citation post-processing to standalone LocalAGI chat responses so the React agent chat receives the same clickable Sources block as the native executor path. Also fix the run target to use the current cmd/local-ai entrypoint.

Assisted-by: Codex:gpt-5
Signed-off-by: Pete Chen <petechentw@gmail.com>

---------

Signed-off-by: Pete Chen <petechentw@gmail.com>
Co-authored-by: shihyunhuang <shihyunhuang88@gmail.com>
Co-authored-by: TLoE419 <tloemizuchizu@gmail.com>
Co-authored-by: Ching Kao <0980124jim@gmail.com>
2026-06-09 16:32:56 +02:00
LocalAI [bot]
e1ec03d33f fix(reasoning): stop prefilled <think> from swallowing tag-less answers (#10225)
* fix(reasoning): stop prefilled <think> from swallowing tag-less answers

When a chat template injects the thinking start token into the prompt (so
DetectThinkingStartToken returns e.g. "<think>"), the model's output begins
inside a reasoning block and carries only the closing tag. The non-jinja
autoparser fallback (peg-native "pure content" mode, issue #9985) prepends the
start token so the extractor can pair it with the model's </think>.

But on a COMPLETE response that contains no closing tag, the model answered
directly with no reasoning at all. Prepending the start token there manufactures
an unclosed block that swallows the entire answer into reasoning, leaving the
OpenAI `content` field empty. This breaks short/direct answers — session names,
JSON summaries, any terse completion where the model skips the think block —
which come back with empty content. Regression surfaced by #9991, which added
the defensive prefill extraction to the complete-response paths.

Add reasoning.ExtractReasoningComplete: it only honors a prefilled start token
when the response actually contains the matching closing tag (proof a reasoning
block exists). Genuine reasoning tags already in the content still extract;
tag-less content stays content. Apply it at every complete-response site
(applyAutoparserOverride, realtime, openresponses). The streaming per-token
extractor is intentionally left on ExtractReasoningWithConfig — mid-stream an
as-yet-unclosed block is legitimate and must surface as reasoning deltas.

Also adds reasoning.ClosingTokenForStart and hoists the default reasoning tag
pairs to package scope so both helpers share one source of truth.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(reasoning): cover the enable_thinking=false non-thinking-mode regression

Adds the end-to-end case that actually broke session summaries / auto-titles
and was not covered before: a request with enable_thinking=false against a
<think>-capable model. In non-thinking mode the model emits no reasoning block,
so llama.cpp's autoparser returns ChatDeltas with content set and
reasoning_content empty (verified against stock llama-server: same model with
chat_template_kwargs.enable_thinking=false returns reasoning_content=null,
content="hello"). thinkingStartToken is still "<think>" because it is detected
per-model from the enable_thinking=true render, so the old code prepended it and
swallowed the answer. The test fails without the ExtractReasoningComplete gate.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 09:02:04 +02:00
LocalAI [bot]
9323f4b5ca feat(llama-cpp): video input support (mtmd #24269) (#10216)
* chore(llama-cpp): bump to 8f83d6c for mtmd video input support

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): forward video input to mtmd (template + non-template paths)

Wire request->videos() into grpc-server.cpp mirroring the existing image
and audio handling: a video_data build + non-template files extraction, and
input_video chat chunks on the tokenizer-template path. allow_video is
auto-set at model load by the vendored upstream chat_params.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): add video attachment support to the chat UI

Mirror the image/audio attachment path for video: emit video_url content
parts, accept video/* in the picker, keep video files as base64, show a
film icon badge, and render attached video inline with a <video> player.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(llama-cpp): patch mtmd video stdin double-close (heap crash)

Upstream mtmd video input (ggml-org/llama.cpp#24269) double-fcloses the
ffmpeg/ffprobe stdin FILE: feed_stdin() fclose()s the FILE returned by
subprocess_stdin() (which is sp->stdin_file), then subprocess_destroy()
fclose()s the same pointer again -> heap corruption that aborts the
backend on any base64 input_video request (the CLI --video file path is
unaffected). Vendor a one-line fix (null sp->stdin_file after fclose)
via prepare.sh's patches/ until upstream merges it.

Verified e2e with gemma-4-e2b-it-qat-q4_0: video frames decode via
ffmpeg and the model answers correctly (red clip -> 'Red', blue -> 'Blue').

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(llama-cpp): re-pin to upstream #24316, drop vendored stdin patch

Upstream replaced the ad-hoc video stdin handling with a proper RAII
refactor (ggml-org/llama.cpp#24316, "mtmd: refactor video subproc
handling"), which includes the same `sp->stdin_file = nullptr` guard our
patch added (plus join-before-destroy ordering). Re-pin LLAMA_VERSION to
that branch head and drop patches/0001 - it's now redundant.

Verified e2e with gemma-4-e2b-it-qat-q4_0: no crash, video frames decode
and the model answers correctly (red clip -> "Red", blue -> "Blue").

NOTE: #24316 is not yet merged, so this pins to its branch-head commit
(28ca1e60). Re-pin to the squash-merge commit on master once it lands,
otherwise `git fetch` may lose the commit after the branch is deleted.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-08 23:17:50 +02:00
LocalAI [bot]
c20225fc13 chore: ⬆️ Update CrispStrobe/CrispASR to f7838a306687f22c281d29c250f879a4ab3df2d7 (#10177)
* ⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(crispasr): link crispasr-lib CMake target instead of crispasr

The dependency-bump regeneration of this branch reset CMakeLists.txt to
master and dropped the prior link-target fix, reintroducing the
`cannot find -lcrispasr` failure. Upstream CrispASR (f7838a3) defines the
library as the CMake target `crispasr-lib` (with OUTPUT_NAME crispasr);
there is no target named `crispasr`, so target_link_libraries falls back
to a bare `-lcrispasr` linker flag that cannot be resolved. Point the link
at the real target name.

Verified locally: CPU cmake-configure of the bumped source generates a
gocrispasr link line referencing sources/CrispASR/src/libcrispasr.a with no
dangling -lcrispasr.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-08 16:01:19 +02:00
LocalAI [bot]
337acc4c37 chore: ⬆️ Update antirez/ds4 to c463029c205c2ec8d7ab6c0df4a3f52979091286 (#10189)
* ⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(ds4): link ds4_ssd.o into the backend build

Upstream antirez/ds4 splits the SSD expert-cache into its own ds4_ssd.c
translation unit, whose symbols (ds4_ssd_memory_lock_acquire/release,
ds4_ssd_cache_experts_for_byte_budget, ds4_ssd_auto_cache_plan) are
referenced by ds4.c/ds4_cpu.o. The dependency-bump automation regenerated
this branch from clean master and dropped the prior linkage fix, so the
cpu-ds4 / cublas-ds4 backend builds fail again with undefined references.

Re-apply the ds4_ssd.o linkage GPU-agnostically (mirroring ds4_distributed.o)
in both the backend Makefile (DS4_OBJ_TARGET + the engine-object build rule
for every GPU mode) and CMakeLists.txt (list(APPEND DS4_OBJS ds4_ssd.o)).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-08 11:15:32 +02:00
LocalAI [bot]
618e90cd13 feat(gallery): add Gemma 4 QAT family + MTP speculative-decoding pairs (#10215)
Add the remaining official Google Gemma 4 QAT Q4_0 GGUFs (E2B, E4B,
26B-A4B, 31B) next to the existing 12B entry, each shipping its
multimodal mmproj.

Also add three MTP (Multi-Token Prediction) speculative-decoding bundles
that pair each QAT target with a QAT-matched assistant/drafter head:

  - 12B       <- Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF
  - 26B-A4B   <- boxwrench/gemma-4-qat-mtp-assistant-heads
  - 31B       <- boxwrench/gemma-4-qat-mtp-assistant-heads

The assistant heads use the gemma4_assistant architecture and are not
standalone chat models, so each entry bundles the target + draft and
sets draft_model together with the draft-mtp spec options
(spec_type:draft-mtp / spec_n_max:6 / spec_p_min:0.75), matching
MTPSpecOptions() in core/config/mtp.go. QAT-matched heads raise draft
acceptance substantially over generic non-QAT heads.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-08 10:26:42 +02:00
LocalAI [bot]
92dea961c2 fix: distributed backend reinstall/upgrade UI stuck on 'reinstalling' (#10214)
* fix(galleryop): self-evict terminal ops from OpCache.GetStatus

The processingBackends map (the UI 'reinstalling' spinner source) only cleared
an op when a client polled /api/backends/job/:uid. The Manage-page Reinstall and
Upgrade buttons never poll, so completed installs leaked into processingBackends
forever and the backend card spun 'reinstalling' even though the install had
finished. Evict terminal ops on the list read instead; DeleteUUID already
broadcasts the eviction so peer replicas converge.

Reproduced on a live 5-node distributed cluster: 5 backends sat in
processingBackends with underlying jobs reporting completed:true,progress:100.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): clear pending backend ops behind offline/draining nodes

ListDuePendingBackendOps filters status=healthy, so a backend op queued against
a node that went offline (stale heartbeat) or draining (admin action) was never
retried, aged out, or deleted - it leaked forever and kept the UI operation
spinning. Add DeleteStalePendingBackendOps and run it each reconcile pass:
draining nodes are cleared immediately (model rows already purged), offline
nodes once their heartbeat is older than a grace window (blip protection).

Reproduced on a live cluster: orphaned llama-cpp install rows targeting an
offline (nvidia-thor) and a draining (mac-mini-m4) node sat at attempts=0
indefinitely.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): stream per-node progress during backend upgrade

The install dispatch subscribed to a per-op progress subject and streamed
per-node download ticks; the upgrade dispatch did a bare 15-minute blocking
NATS round-trip with no subscription, so the UI showed progress:0 the whole
time (the 'reinstalling but nothing happens' report on a slow node).

Thread the op ID through BackendManager.UpgradeBackend -> the distributed
manager -> the adapter, and have the adapter subscribe to the per-op progress
subject before the request (extracted into a shared subscribeProgress helper
reused by install/upgrade/force-fallback). The worker's upgradeBackend now
creates the same DebouncedInstallProgressPublisher installBackend uses. An
upgrade is a force-reinstall, so it reuses SubjectNodeBackendInstallProgress
rather than minting a new subject - no new NATS permission, no new
rolling-update compat surface. Reconciler-driven retries pass empty
opID/onProgress and stay on the silent path.

Reproduced on a live cluster: upgrade of llama-cpp-development on agx-orin-slow
sat at progress:0 for 4+ minutes with no per-node feedback.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(galleryop): persist cancellation + periodically reap orphaned ops

Two distributed gaps surfaced when a replica was killed mid-upgrade on a live
cluster, leaving the backend stuck 'processing' in the UI forever:

1. CancelOperation flipped the in-memory status to cancelled and broadcast a
   NATS event but never persisted the terminal status. On the next replica
   restart the still-active row re-hydrated straight back into
   processingBackends and the UI spun again. It now calls store.Cancel(id) so
   the cancel survives a restart.

2. CleanStale (which marks abandoned active ops failed) only ran once on
   startup, so an op orphaned AFTER startup - its owning replica's foreground
   handler goroutine gone - was never reaped until the next restart. Add
   GalleryService.ReapStaleOperations and run it on a 15m ticker (CleanStale
   now returns the reaped count for observability).

Neither is covered by the OpCache self-evict fix: an orphaned op never reaches
Processed, so it would never self-evict.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(review): address self-review findings on the distributed install fixes

Three findings from an adversarial review of this branch:

1. CRITICAL - OpCache.GetStatus crashed under concurrent load. m.Map() returns
   the live internal map by reference, so deleting from it on the read path was
   an unsynchronized write to a map four HTTP handlers poll every ~1s -> a
   'concurrent map writes' fatal. Rewritten to iterate a Keys() snapshot, build
   a fresh result map, and apply evictions via the locked DeleteUUID after the
   loop. Added a -race concurrency regression guard.

2. HIGH - GetStatus evicted failed ops too, hiding them from /api/operations
   and breaking the dismiss-failed-op flow (the panel keeps Error != nil ops so
   the admin can read the error and click Dismiss). Eviction now fires only for
   terminal ops with Error == nil (success/cancelled); failures are retained.

3. MEDIUM - DeleteStalePendingBackendOps missed StatusUnhealthy nodes. A node
   marked unhealthy on a NATS ErrNoResponders never transitions to offline
   (health.go skips re-marking it), so its pending ops leaked exactly like the
   offline case. Unhealthy is now reaped via the same stale-heartbeat grace path
   (a fresh-heartbeat node is recovering and keeps its op).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(review-2): don't evict the still-installing soft-path; don't spin on failed ops

Second review pass found two issues:

1. MEDIUM (Go) - OpCache.GetStatus evicted the ErrWorkerStillInstalling
   soft-path op. That op is deliberately Processed=true with no error to show a
   yellow in-progress state when a worker timed out the NATS round-trip but is
   still installing in the background; the reconciler confirms the real outcome
   later. Evicting it (and broadcasting OpEnd + marking the DB completed) hid an
   install that may still fail. Eviction is now scoped to a clean success
   (progress 100 + 'completed', matching the job-poll's historical condition) or
   a cancellation - the soft-path (progress != 100) and failures are kept.

2. MEDIUM (React) - the Backends gallery card rendered ANY operation as an
   'Installing...' spinner, so a failed op (now intentionally kept in the list
   for the OperationsBar error + Dismiss) spun forever. Exclude errored ops from
   the card spinner, mirroring Models.jsx (isInstalling already excludes
   op.error). The error + Dismiss still surface in the global OperationsBar.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ui): refresh Manage backends table when an operation settles

The Manage backends table fetched installed backends only on mount/after delete
and checked upgrades only on tab activation. After a reinstall/upgrade completed
neither re-ran, so the installed-version cell and the 'update available' badge
stayed stale until the user switched tabs - the op looked like it 'did nothing'.

Watch the operations list (via useOperations) and re-fetch installed backends +
available upgrades whenever the count settles, mirroring the operations.length
watch Backends.jsx already uses. Consolidates the prior tab-activation upgrades
check into the same effect.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-08 10:03:02 +02:00
LocalAI [bot]
2e93186043 chore: ⬆️ Update ggml-org/llama.cpp to 9e3b928fd8c9d14dbf15a8768b9fdd7e5c721d66 (#10210)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-08 09:35:17 +02:00
LocalAI [bot]
d07037e817 chore: ⬆️ Update leejet/stable-diffusion.cpp to b3d56d0ba1bd437886079e339118e8e75bb79ee7 (#10211)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-08 09:03:57 +02:00
LocalAI [bot]
f6cc90d258 chore: ⬆️ Update mudler/parakeet.cpp to e270af73b94c9a5c37ec516230219ed4580e1db6 (#10212)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-07 23:52:44 +02:00
Adira
2c804bef5a fix(config): skip vocab arrays and mmap GGUF headers to speed up startup (#10213)
When the models directory holds many GGUF files, startup parsed every
model's full GGUF — including the tokenizer vocab arrays
(tokenizer.ggml.tokens/scores/merges, often >100k entries) — once per
model while guessing defaults. On slow storage (e.g. a models directory
on a Docker volume) those hundreds of thousands of tiny reads dominate
boot time before the HTTP server comes up.

The default-guessing path and the VRAM metadata reader only consume
scalar metadata and array lengths, never the array contents. Parse with
SkipLargeMetadata (seek past large arrays) and UseMMap (fault in a few
header pages instead of issuing per-element read() syscalls). For a
256k-token vocab this cuts the parse from ~524k read() syscalls to 8.
The mapping is released when ParseGGUFFile returns.

Fixes #9790

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
2026-06-07 23:33:52 +02:00
LocalAI [bot]
6070402477 chore(model gallery): 🤖 add 1 new models via gallery agent (#10209)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-07 22:09:32 +02:00
LocalAI [bot]
67f80a152b fix(mtp): don't auto-enable self-spec MTP for draft-only assistant GGUFs (#10208)
Gemma4 MTP (ggml-org/llama.cpp#23398) registers the prediction head as a
separate `gemma4-assistant` architecture. That assistant GGUF still carries
`<arch>.nextn_predict_layers`, so the architecture-agnostic detection in
HasEmbeddedMTPHead matched it and appended the `spec_type:draft-mtp` defaults.

Unlike the DeepSeek/Qwen embedded-head models, an assistant checkpoint cannot
self-speculate: it is a draft model that requires a paired target context
(`ctx_other`) and throws if loaded alone. Auto-applying the self-spec defaults
to a standalone assistant import therefore produces a broken config.

Guard the detection against draft-only assistant architectures (the `-assistant`
suffix is upstream's naming convention) so importing one no longer yields a
self-speculation config. Two-model target+draft pairing remains expressible
manually via `draft_model:` and is left to a follow-up.


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-07 22:09:02 +02:00
LocalAI [bot]
a7cb587d96 feat(parakeet-cpp): real segment timestamps (NeMo-faithful) (#10207)
* feat(parakeet-cpp): real segment timestamps (NeMo-faithful)

Offline: replace the single synthetic whole-clip segment with multiple
segments grouped exactly like NeMo's get_segment_offsets - a new segment
after sentence-ending punctuation ('. ? !'), each carrying start/end and
its time-window token ids. The optional model option segment_gap_threshold
(NeMo's unit: encoder FRAMES, default 0=off) adds NeMo's silence-gap split,
converted to seconds via the JSON frame_sec the engine now reports.
Per-segment words are still gated behind timestamp_granularities=["word"];
a zero-word document falls back to a single text segment.

Streaming: when libparakeet.so exposes the ABI v4 JSON entry points
(probed), drive parakeet_capi_stream_feed_json / _finalize_json and
accumulate the streamed per-word timestamps into per-utterance segments
(EOU stays the boundary), so streaming FinalResult segments now carry
start/end. Falls back to the text-only feed against an older library.

Pure-Go specs cover splitWordsIntoSegments (punctuation + gap rules, NeMo
elif order, fallback), transcriptResultFromDoc (multi-segment, token
windows, word-granularity gate), and the streaming segmenter.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(audio): document parakeet-cpp segment timestamps + segment_gap_threshold

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(parakeet-cpp): update model-gated specs for multi-segment output

The offline AudioTranscription specs asserted the old single synthetic
segment (Segments HaveLen(1), Segments[0].Text == res.Text). With
NeMo-faithful segmentation a multi-sentence clip now yields multiple
punctuation-delimited segments, so assert the new contract instead:
one-or-more time-ordered segments, each with text and (under word
granularity) per-segment words whose span tracks the segment start/end.
Caught by running the model-gated suite on the dgx (GB10) against the
real tdt_ctc-110m + realtime_eou models.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-07 22:08:24 +02:00
LocalAI [bot]
f7c74ad2da chore: ⬆️ Update ggml-org/llama.cpp to 31e82494c0a3913c919c1027fa70500fbf4c07dd (#10191)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-07 10:43:17 +02:00
LocalAI [bot]
7402d1fd20 chore(turboquant): bump to 7d9715f1 + fix compilation against rebased fork (#10205)
* chore(turboquant): bump TheTom/llama-cpp-turboquant to 7d9715f1

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(turboquant): drop obsolete legacy-spec shim after fork rebased

The TheTom/llama-cpp-turboquant fork (pin c9aa86a) rebased past the
upstream common_params_speculative refactor (ggml-org/llama.cpp
#22397/#22838/#22964), the model_tgt rename (#22838) and get_media_marker
(#21962). The old fork-compat shim forced now-wrong legacy code paths,
breaking the build with errors like 'struct common_params_speculative has
no member named mparams_dft / type' and 'server_context_impl has no member
named model'.

Remove the obsolete LOCALAI_LEGACY_LLAMA_CPP_SPEC branches from the shared
grpc-server.cpp (stock llama-cpp and the modern fork both take the modern
path now), and narrow the one remaining gap (the fork still lacks
common_params::checkpoint_min_step) to a dedicated
LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP guard injected by
patch-grpc-server.sh. The patch script now only adds the turbo2/3/4
KV-cache types and injects that one macro.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(turboquant): HIP-port the fork's CUDA additions (copy2d 3D-peer + cudaEventCreate)

The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs that
ggml's HIP/MUSA shim does not provide, breaking the -gpu-rocm-hipblas-turboquant
build. patches/0001-hip-guard-copy2d-peer-fastpath.patch (applied by
apply-patches.sh) ports them:

- Guard ggml_cuda_copy2d_across_devices's 3D-peer copy fast path with
  #if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) so HIP/MUSA fall through
  to the existing cudaMemcpyAsync staging fallback (HIP genuinely lacks
  cudaMemcpy3DPeerAsync, per the fork's own comment).
- Create the device event in ggml_backend_cuda_device_event_new with the
  HIP-aliased cudaEventCreateWithFlags(.., cudaEventDisableTiming) instead of the
  un-aliased plain cudaEventCreate, matching this file's own usage elsewhere.

CUDA builds are unaffected.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* ci(turboquant): drop the ROCm/hipblas build flavor

The TheTom/llama-cpp-turboquant fork is not ROCm-clean at the current pin:
beyond the CUDA-API gaps already patched (3D-peer copy, cudaEventCreate),
its llama.cpp base fails to compile the flash-attention MMA f16 kernels for
head-dim 640 under HIP (cols_per_warp evaluates to 0 -> division-by-zero /
non-constant static asserts in fattn-mma-f16.cuh). That is a deep
ggml-on-ROCm kernel issue, not something a small fork patch can paper over.

Drop -gpu-rocm-hipblas-turboquant from the build matrix so turboquant still
ships for cpu / cublas / vulkan / sycl. Re-add it once the fork's HIP path
compiles (or upstream ggml fixes the large-head-dim MMA kernels for ROCm).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-07 10:42:06 +02:00
LocalAI [bot]
8c42695ef8 chore: ⬆️ Update ggml-org/whisper.cpp to a8ec021f2750a473ff4a8f3883bc9fdf5feafa84 (#10202)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-07 08:37:42 +02:00
LocalAI [bot]
72e3241431 chore: ⬆️ Update mudler/parakeet.cpp to abd0087dcc92ec5ad1f96f9fd86c49eb26a5ce67 (#10204)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-07 00:37:28 +02:00
LocalAI [bot]
cd2bf95862 fix(docs): use relearn notice shortcode instead of unsupported alert (#10206)
The Hugo relearn theme does not provide an "alert" shortcode, so the
docs deploy failed at the Build site step:

  failed to extract shortcode: template for shortcode "alert" not found
  docs/content/features/distributed-mode.md:136

Convert the warning block to the theme-supported notice shortcode used
everywhere else in the docs.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-07 00:37:12 +02:00
LocalAI [bot]
f64b72dd7d feat: support Ideogram4 in stablediffusion-ggml backend + gallery (#10201)
* feat(stablediffusion-ggml): support Ideogram4 unconditional diffusion model

Bump stable-diffusion.cpp from 1f9ee88 to b9254dd, the upstream commit that
adds Ideogram4 support (leejet/stable-diffusion.cpp#1609). Ideogram4 derives
its classifier-free guidance from a separate unconditional diffusion model,
exposed upstream through the new sd_ctx_params_t.uncond_diffusion_model_path
field.

Wire that field into the gosd wrapper via a new uncond_diffusion_model_path
option. The _path suffix is deliberate: the Go loader only resolves options
whose name contains "path" to an absolute path under the model directory, so
this keeps the option consistent with diffusion_model_path and
high_noise_diffusion_model_path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(gallery): add Ideogram4 stablediffusion-ggml models

Single-file GGUF weights for Ideogram4 are now published
(stduhpf/ideogram-4-gguf), so add the model to the gallery. Ideogram4 is a
text-to-image model with strong, accurate in-image text rendering, driven by
a Qwen3-VL-8B text encoder and real classifier-free guidance from a separate
unconditional diffusion model (the uncond_diffusion_model_path support added
in the preceding commit).

Two index entries, both built on gallery/virtual.yaml with the full config
inlined in overrides (same pattern as the other models, no dedicated template
file):
- ideogram-4-iq4nl-ggml (4-bit, ~11.6GB diffusion)
- ideogram-4-q8_0-ggml  (8-bit, ~20GB diffusion)

Each bundles the diffusion + unconditional GGUF (stduhpf), the
Qwen3-VL-8B-Instruct text encoder (unsloth), and the FLUX.2 VAE (Comfy-Org
mirror, non-gated). cfg_scale is 7 to match the upstream Ideogram4 default,
since it performs real CFG unlike the guidance-distilled Flux/Z-Image models.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-06 22:50:12 +02:00
LocalAI [bot]
03c84cff28 feat(parakeet-cpp): nemotron-3.5-asr multilingual streaming model + request language support (#10199)
* feat(parakeet-cpp): honor request language (multilingual nemotron) on batched + streaming paths

Reads opts.GetLanguage() and threads it through to the new
parakeet_capi_transcribe_pcm_batch_json_lang and parakeet_capi_stream_begin_lang
C-API entry points, both probed with Dlsym so the backend still loads against an
older libparakeet.so (falling back to the non-lang paths, i.e. model default).

parakeet.cpp's batched C-API takes a single target_lang for the whole batch, so
the dispatcher only coalesces same-language requests: a request whose language
differs from the batch leader is held as a single carry-over and becomes the
leader of the next batch, never dropped and never left waiting (including on
shutdown). A new batcher test asserts no dispatched batch is ever mixed-language
and that every submitted request still receives a reply.

Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(gallery): add parakeet-cpp-nemotron-3.5-asr-streaming-0.6b; bump parakeet.cpp pin

Adds the multilingual prompt-conditioned streaming model to the gallery (q8_0
default, OpenMDW-1.1) and bumps the parakeet-cpp backend pin to the parakeet.cpp
commit that ships nemotron support plus batched causal subsampling and the
batched target_lang C-API.

Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-06 13:53:10 +02:00
LocalAI [bot]
9bc69c9e5f chore(model gallery): 🤖 add 1 new models via gallery agent (#10200)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-06 13:52:46 +02:00