Compare commits

..

54 Commits

Author SHA1 Message Date
Richard Palethorpe
e38b1e1d05 feat(realtime): script-aware clause chunking + streamed-reply fixes
Opt-in pipeline.streaming.clause_chunking splits the streamed LLM reply
into speakable clauses and synthesizes each as soon as it completes,
lowering time-to-first-audio instead of buffering the whole message. The
splitter is script-aware (rivo/uniseg, pure Go): UAX#29 sentence
segmentation handles CJK 。!? with no whitespace, CJK clause
punctuation (,、;:) and Thai/Lao spaces give finer cuts, and a UAX#14
line-break cap bounds an over-long punctuation-less run. Unlike the old
ASCII .!?/newline segmenter (dropped in 076dcdbe) it does not degrade to
whole-message buffering for CJK/Thai; scripts needing a dictionary
(Khmer/Burmese) stay buffered until a space or end-of-message. Clauses
are synthesized synchronously in the token callback (the LLM keeps
generating into the gRPC stream meanwhile), so audio still starts
mid-generation. Off by default — the whole-message path is unchanged.

Also fix the streamed-reply path and the Talk page:

- Don't swallow streamed autoparser content as reasoning: the
  tokenizer-template path already delivers reasoning-free content via
  ChatDeltas, so prefilling the thinking start token re-tagged it as an
  unclosed reasoning block, leaving no spoken reply. Disable the prefill
  on that path; closed tag pairs are still stripped (#9985).

- Generate collision-free realtime IDs (16 random bytes) instead of a
  constant, so per-item bookkeeping (cancel, conversation.item.retrieve)
  works.

- Key the Talk transcript by the server item_id and upsert entries.
  Realtime events arrive over a WebRTC data channel — outside React's
  event system — so React defers the setTranscript updaters while
  synchronous ref writes in handler bodies run first; the old
  index-tracking ref rendered a duplicate assistant bubble on
  completion. Upserts by item_id are idempotent and order-independent.

- Drop the partial assistant bubble on a cancelled response (barge-in):
  the server discards the interrupted item and sends response.done with
  status "cancelled"; mirror that in the UI so the regenerated reply
  isn't rendered as a second assistant message.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Assisted-by: Claude:claude-fable-5 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-10 16:07:04 +01:00
Ettore Di Giacinto
e2a0f88bce feat(realtime): stream tool-call turns via tokenizer-template autoparser
Per review (richiejp): tool-call deltas exist, so streaming should work with
tools too. It does — for models that use their tokenizer template. The C++
autoparser then clears reply.Message and delivers content + tool calls via
ChatDeltas, so the streamed transcript carries only spoken content (no
tool-call JSON leak) and the tool calls are parsed from the final response.

- Drop the len(tools)==0 gate; stream when no tools OR use_tokenizer_template
  (grammar-based function calling still buffers, since its call is emitted as
  JSON in the token stream and would leak into the transcript).
- streamLLMResponse takes tools/toolChoice/toolTurn, reads ChatDelta content
  in the token callback, parses tool calls from the final ChatDeltas, and
  creates the assistant content item lazily so a content-less tool turn emits
  only the tool calls.
- Extract emitToolCallItems from the buffered path so both paths finalize tool
  calls, response.done, and server-side assistant-tool follow-ups identically.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:05:10 +01:00
Ettore Di Giacinto
f52944e753 refactor(realtime): buffer whole message for TTS, drop sentence segmenter
Per review (richiejp): the sentence segmenter pipelined unary TTS by
splitting on ASCII .!?/newline, which does nothing for languages without
those boundaries (CJK/Thai) — there it already degraded to buffering the
whole message anyway.

Replace it with a uniform model: stream the LLM transcript live, buffer the
full message, then synthesize it once. emitSpeech already streams the audio
chunks when the backend implements TTSStream and falls back to a single
unary delta otherwise, so this is real streaming TTS where supported and a
clean whole-message synthesis elsewhere — no per-sentence emulation, no
language assumptions. speechStreamer becomes transcriptStreamer (transcript
deltas only); the whole-message synthesis moves into streamLLMResponse.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:05:10 +01:00
Ettore Di Giacinto
c44abe40e8 fix(realtime): clean TTS temp path before read (gosec G304)
emitSpeech reads the WAV file the TTS backend wrote. The read moved here
from realtime.go, so code-scanning flagged it as a new G304 alert even
though the path is backend-controlled (a temp file), not user input.
Wrap it in filepath.Clean — a real path normalization that also clears
the alert, keeping with the repo's no-#nosec convention.

Assisted-by: Claude:claude-opus-4-8 gosec, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:05:10 +01:00
Ettore Di Giacinto
c72c05f3e2 fix(realtime): always strip reasoning from spoken output
disable_thinking maps to ReasoningConfig.DisableReasoning=true on the LLM
config, which the backend reads as enable_thinking=false. But the realtime
handler reads that SAME config to drive reasoning extraction, and there
DisableReasoning=true means "skip stripping". PredictConfig() returns this
LLM config, so both the streamed (speechStreamer) and buffered realtime
paths stopped stripping <think>…</think> exactly when disable_thinking was
on — leaking raw reasoning to the client whenever the model ignored the
enable_thinking hint (e.g. lfm2.5).

Add spokenReasoningConfig() which clears DisableReasoning for extraction
(keeping custom tokens/tag pairs) and route both realtime paths through it.
Spoken output now always strips reasoning, independent of the backend
suppression hint.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:05:10 +01:00
Ettore Di Giacinto
d35c214a5d fix(realtime): register pipeline streaming/thinking config fields
TestAllFieldsHaveRegistryEntries (core/config/meta) requires every config
field to have a meta registry entry. The four new pipeline fields
(disable_thinking, streaming.{llm,tts,transcription}) had none, failing
tests-linux/tests-apple. Add toggle entries for them.

Also handle the os.Remove return in realtime_speech_test.go to satisfy
errcheck (golangci-lint).

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:02:30 +01:00
Ettore Di Giacinto
4aaf9c6892 docs(realtime): document pipeline streaming + disable_thinking
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:02:30 +01:00
Ettore Di Giacinto
72624b513d feat(realtime): wire streamLLMResponse for token-streamed replies
triggerResponseAtTurn takes a streamed path when pipeline.streaming.llm is
set, the turn has no tools, and audio is requested: streamLLMResponse
announces the assistant item, drives the LLM token callback through a
speechStreamer (reasoning-stripped transcript deltas + sentence-piped TTS),
and emits the terminal events. Tool turns and non-streaming pipelines keep
the existing buffered path unchanged, so this is strictly opt-in.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:02:30 +01:00
Ettore Di Giacinto
ccdbe10044 feat(realtime): speechStreamer for token-streamed LLM->TTS
emitSpeech now returns raw PCM (caller base64-encodes) so streamed segments
accumulate correctly. speechStreamer consumes streamed LLM tokens: it strips
reasoning via the streaming ReasoningExtractor, emits a transcript delta per
content fragment, and sentence-pipes content into emitSpeech so each sentence
is synthesized as soon as it's ready. Handler wiring (plain-content turns)
follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:02:30 +01:00
Ettore Di Giacinto
7db5524e12 feat(realtime): pipeline disable_thinking maps to enable_thinking off
applyPipelineThinking forces the LLM's ReasoningConfig.DisableReasoning when
pipeline.disable_thinking is set, which gRPCPredictOpts turns into the
enable_thinking=false backend metadata. Applied at newModel construction on
the per-session LLM config copy, so it doesn't leak to other model users and
needs no realtime-specific request plumbing.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:02:29 +01:00
Ettore Di Giacinto
74ecd99d4a feat(realtime): streaming transcription text deltas
Add emitTranscription and route commitUtterance through it. With
pipeline.streaming.transcription set it streams each transcript fragment as
a conversation.item.input_audio_transcription.delta via TranscribeStream
then a completed event; otherwise it preserves the single completed-event
unary behaviour. Returns the final transcript for response generation.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:02:29 +01:00
Ettore Di Giacinto
121c36f717 feat(realtime): route response audio through emitSpeech (streaming TTS)
Replace the inline unary TTS block in the response handler with emitSpeech,
which streams a response.output_audio.delta per backend PCM chunk when
pipeline.streaming.tts is set and otherwise preserves the single-delta unary
behaviour. emitSpeech returns the accumulated base64 audio, stored on the
conversation item as before. Transcript and audio-done events stay in the
handler so later per-segment streaming can reuse emitSpeech.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:02:29 +01:00
Ettore Di Giacinto
a1487a073a feat(realtime): emitSpeech with flag-gated streaming TTS
emitSpeech synthesizes a piece of text and forwards audio to the client,
streaming one output_audio.delta per backend PCM chunk when the pipeline
sets streaming.tts, or one delta for the whole utterance otherwise. WebRTC
gets raw PCM (it resamples internally); WebSocket gets base64 PCM at the
session rate. It emits no transcript/audio-done events so a streamed reply
can be split into multiple spoken segments sharing one response.

Adds fakeModel/fakeTransport test doubles for the realtime Model/Transport
interfaces, driving streaming assertions deterministically.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:02:29 +01:00
Ettore Di Giacinto
9788e4e2f2 feat(realtime): streaming TTS/transcription methods on Model interface
Add TTSStream and TranscribeStream to the realtime Model interface and
implement them on wrappedModel (delegating to backend.ModelTTSStream /
ModelTranscriptionStream) and transcriptOnlyModel. ttsStream adapts the
backend's WAV-framed stream (44-byte header carrying the sample rate, then
PCM) into raw PCM + sample rate for the realtime transports. Handler wiring
that consumes these (flag-gated) follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:02:29 +01:00
Ettore Di Giacinto
65481973a3 feat(realtime): sentence segmenter for streamed LLM->TTS pipelining
streamSegmenter accumulates streamed LLM tokens and emits complete
sentence/clause segments (terminator+whitespace, or newline) so TTS can
synthesize each segment as it completes instead of waiting for the whole
reply. Pure helper; the streaming handler wiring consumes it next.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:02:29 +01:00
Ettore Di Giacinto
eef81fd189 feat(realtime): pipeline streaming + disable_thinking config
Add a nested pipeline.streaming.{llm,tts,transcription} block plus
pipeline.disable_thinking, with StreamLLM/StreamTTS/StreamTranscription/
ThinkingDisabled helpers. Pointer-bools so unset keeps the unary path;
existing configs are unaffected. Wiring into the realtime handler follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:02:29 +01:00
LocalAI [bot]
fba8c9c498 fix(distributed): track in-flight for non-LLM inference methods (VAD, diarize, voice, ...) (#10238)
fix(distributed): track in-flight for non-LLM inference methods

InFlightTrackingClient only wrapped a subset of the grpc.Backend
inference methods (Predict, Embeddings, TTS, AudioTranscription, Detect,
Rerank, ...). Methods like VAD were left as embedded passthrough, so
track() never ran for them.

In distributed mode every model is loaded with in_flight=1 as a
reservation; that reservation is only released by the OnFirstComplete
callback, which fires after the first *tracked* inference call completes.
A VAD-only model (e.g. silero-vad) never calls a tracked method, so the
reservation is never released and in-flight stays pinned at 1 forever -
which also blocks the router's idle-eviction logic.

Wrap the remaining unary inference methods (VAD, Diarize, Face*, Voice*,
TokenClassify, Score, AudioEncode, AudioDecode, AudioTransform) with the
same track()/reconcile() pattern. The three bidi-stream constructors
(AudioTransformStream, AudioToAudioStream, Forward) are deliberately left
as passthrough - their inference spans the stream lifetime, not the
constructor call, so track() there would fire onFirstComplete before any
data flows.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:29:50 +02:00
LocalAI [bot]
6b2badb837 chore: ⬆️ Update CrispStrobe/CrispASR to c29f6653a516a3001d923944dad8892072cc7334 (#10236)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 16:16:24 +02:00
LocalAI [bot]
8b8506d01a chore: ⬆️ Update ggml-org/llama.cpp to 039e20a2db9e87b2477c76cc04905f3e1acad77f (#10223)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 12:22:03 +02:00
LocalAI [bot]
6910a0bb48 chore: ⬆️ Update antirez/ds4 to 91bafb5acd5a6cf00b1e55ef68bf40ddd207bee7 (#10234)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 12:08:19 +02:00
LocalAI [bot]
cffd03b522 chore: ⬆️ Update ikawrakow/ik_llama.cpp to e6f8112f3ba126eed3ff5b30cdd08085414a7516 (#10233)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 12:07:49 +02:00
LocalAI [bot]
bf448d3794 chore: ⬆️ Update ggml-org/whisper.cpp to df7638d8229a243af8a4b5a8ae557e0d74e0a0ae (#10220)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:16:29 +02:00
LocalAI [bot]
1d4a12f7c0 chore: ⬆️ Update CrispStrobe/CrispASR to 97cad527d247edefc904e6c40c4cf5ee78bed055 (#10221)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:16:17 +02:00
LocalAI [bot]
186d62801d chore: ⬆️ Update leejet/stable-diffusion.cpp to 19bdfe22d255d5b4dff39d449318b9bc5ea2317f (#10222)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:16:06 +02:00
LocalAI [bot]
da4ed05429 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 2768b6251548b78b6610e95edad13f888ad95982 (#10219)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:15:54 +02:00
LocalAI [bot]
ec1eea4f45 chore: ⬆️ Update antirez/ds4 to 512d07cb08f234b704b5a5959aa9e2d4c466eeb0 (#10224)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:15:42 +02:00
LocalAI [bot]
b203b32e57 feat(realtime): make WebRTC ICE candidates configurable (#10231)
The /v1/realtime WebRTC handler created the peer connection with a bare
webrtc.Configuration and no SettingEngine, so pion gathered a host ICE
candidate for every local interface. Under Docker host networking that
includes bridge addresses (docker0/veth, 172.x) a remote browser cannot
route to; the call establishes on a good pair and then drops once ICE
consent freshness checks fail on the unreachable candidates.

Add two opt-in knobs, applied via a pion SettingEngine:
- LOCALAI_WEBRTC_NAT_1TO1_IPS: advertise these IPs as the host candidates
  (e.g. the host LAN IP)
- LOCALAI_WEBRTC_ICE_INTERFACES: restrict ICE gathering to these interfaces

Defaults are unchanged (empty => current all-interface behavior).

Assisted-by: Claude:claude-opus-4-8

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-09 22:28:03 +02:00
Ching
48a8ce98aa fix(cli): handle chat output errors (#10229)
Propagate terminal write errors from the chat prompt and explicitly ignore stream close errors during cleanup.

Update chat tests to assert response writer errors so errcheck passes without hiding failed writes.

Tests:
- go test -count=1 ./core/cli/chat
- go test -count=1 ./core/cli

Assisted-by: Codex:GPT-5

Signed-off-by: Ching Kao <0980124jim@gmail.com>
2026-06-09 19:10:24 +02:00
Ching
8344d1c865 feat(cli): add interactive chat mode (#10226)
Add an opt-in `local-ai chat` command for testing chat models directly from the terminal without manually sending curl requests.

The command connects to a running LocalAI server, lists available models through the existing OpenAI-compatible API, streams chat completions, and supports interactive commands such as `/models`, `/model`, `/clear`, and `/exit`.

Keep `local-ai run` focused on the server lifecycle so the web UI, API clients, and multiple chat terminals can coexist against the same server.

Document the new command and terminal workflow in the README and CLI docs.

Tests:
- go test -count=1 ./core/cli/chat
- go test -count=1 ./core/cli

Assisted-by: Codex:GPT-5

Signed-off-by: Ching Kao <0980124jim@gmail.com>
2026-06-09 14:58:44 +00:00
Pete
d2e6b93369 feat(agents): surface KB source citations in RAG responses (#10228)
* dev knowledge.go structure

Signed-off-by: Pete Chen <petechentw@gmail.com>

* feat(agents): append KB source citations to responses

Render structured KB citations as a Sources block after agent responses, linking each source to the existing raw collection entry endpoint.

Keep long-term memory writes on the original model response so citation blocks do not get stored back into the knowledge base.

Tested with: go test ./core/services/agents

Assisted-by: Codex:gpt-5
Signed-off-by: Pete Chen <petechentw@gmail.com>

* Collect KB citations from tool searches

Signed-off-by: Pete Chen <petechentw@gmail.com>

* fix(agents): append KB sources in local chats

Apply the shared KB citation post-processing to standalone LocalAGI chat responses so the React agent chat receives the same clickable Sources block as the native executor path. Also fix the run target to use the current cmd/local-ai entrypoint.

Assisted-by: Codex:gpt-5
Signed-off-by: Pete Chen <petechentw@gmail.com>

---------

Signed-off-by: Pete Chen <petechentw@gmail.com>
Co-authored-by: shihyunhuang <shihyunhuang88@gmail.com>
Co-authored-by: TLoE419 <tloemizuchizu@gmail.com>
Co-authored-by: Ching Kao <0980124jim@gmail.com>
2026-06-09 16:32:56 +02:00
LocalAI [bot]
e1ec03d33f fix(reasoning): stop prefilled <think> from swallowing tag-less answers (#10225)
* fix(reasoning): stop prefilled <think> from swallowing tag-less answers

When a chat template injects the thinking start token into the prompt (so
DetectThinkingStartToken returns e.g. "<think>"), the model's output begins
inside a reasoning block and carries only the closing tag. The non-jinja
autoparser fallback (peg-native "pure content" mode, issue #9985) prepends the
start token so the extractor can pair it with the model's </think>.

But on a COMPLETE response that contains no closing tag, the model answered
directly with no reasoning at all. Prepending the start token there manufactures
an unclosed block that swallows the entire answer into reasoning, leaving the
OpenAI `content` field empty. This breaks short/direct answers — session names,
JSON summaries, any terse completion where the model skips the think block —
which come back with empty content. Regression surfaced by #9991, which added
the defensive prefill extraction to the complete-response paths.

Add reasoning.ExtractReasoningComplete: it only honors a prefilled start token
when the response actually contains the matching closing tag (proof a reasoning
block exists). Genuine reasoning tags already in the content still extract;
tag-less content stays content. Apply it at every complete-response site
(applyAutoparserOverride, realtime, openresponses). The streaming per-token
extractor is intentionally left on ExtractReasoningWithConfig — mid-stream an
as-yet-unclosed block is legitimate and must surface as reasoning deltas.

Also adds reasoning.ClosingTokenForStart and hoists the default reasoning tag
pairs to package scope so both helpers share one source of truth.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(reasoning): cover the enable_thinking=false non-thinking-mode regression

Adds the end-to-end case that actually broke session summaries / auto-titles
and was not covered before: a request with enable_thinking=false against a
<think>-capable model. In non-thinking mode the model emits no reasoning block,
so llama.cpp's autoparser returns ChatDeltas with content set and
reasoning_content empty (verified against stock llama-server: same model with
chat_template_kwargs.enable_thinking=false returns reasoning_content=null,
content="hello"). thinkingStartToken is still "<think>" because it is detected
per-model from the enable_thinking=true render, so the old code prepended it and
swallowed the answer. The test fails without the ExtractReasoningComplete gate.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 09:02:04 +02:00
LocalAI [bot]
9323f4b5ca feat(llama-cpp): video input support (mtmd #24269) (#10216)
* chore(llama-cpp): bump to 8f83d6c for mtmd video input support

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): forward video input to mtmd (template + non-template paths)

Wire request->videos() into grpc-server.cpp mirroring the existing image
and audio handling: a video_data build + non-template files extraction, and
input_video chat chunks on the tokenizer-template path. allow_video is
auto-set at model load by the vendored upstream chat_params.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): add video attachment support to the chat UI

Mirror the image/audio attachment path for video: emit video_url content
parts, accept video/* in the picker, keep video files as base64, show a
film icon badge, and render attached video inline with a <video> player.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(llama-cpp): patch mtmd video stdin double-close (heap crash)

Upstream mtmd video input (ggml-org/llama.cpp#24269) double-fcloses the
ffmpeg/ffprobe stdin FILE: feed_stdin() fclose()s the FILE returned by
subprocess_stdin() (which is sp->stdin_file), then subprocess_destroy()
fclose()s the same pointer again -> heap corruption that aborts the
backend on any base64 input_video request (the CLI --video file path is
unaffected). Vendor a one-line fix (null sp->stdin_file after fclose)
via prepare.sh's patches/ until upstream merges it.

Verified e2e with gemma-4-e2b-it-qat-q4_0: video frames decode via
ffmpeg and the model answers correctly (red clip -> 'Red', blue -> 'Blue').

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(llama-cpp): re-pin to upstream #24316, drop vendored stdin patch

Upstream replaced the ad-hoc video stdin handling with a proper RAII
refactor (ggml-org/llama.cpp#24316, "mtmd: refactor video subproc
handling"), which includes the same `sp->stdin_file = nullptr` guard our
patch added (plus join-before-destroy ordering). Re-pin LLAMA_VERSION to
that branch head and drop patches/0001 - it's now redundant.

Verified e2e with gemma-4-e2b-it-qat-q4_0: no crash, video frames decode
and the model answers correctly (red clip -> "Red", blue -> "Blue").

NOTE: #24316 is not yet merged, so this pins to its branch-head commit
(28ca1e60). Re-pin to the squash-merge commit on master once it lands,
otherwise `git fetch` may lose the commit after the branch is deleted.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-08 23:17:50 +02:00
LocalAI [bot]
c20225fc13 chore: ⬆️ Update CrispStrobe/CrispASR to f7838a306687f22c281d29c250f879a4ab3df2d7 (#10177)
* ⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(crispasr): link crispasr-lib CMake target instead of crispasr

The dependency-bump regeneration of this branch reset CMakeLists.txt to
master and dropped the prior link-target fix, reintroducing the
`cannot find -lcrispasr` failure. Upstream CrispASR (f7838a3) defines the
library as the CMake target `crispasr-lib` (with OUTPUT_NAME crispasr);
there is no target named `crispasr`, so target_link_libraries falls back
to a bare `-lcrispasr` linker flag that cannot be resolved. Point the link
at the real target name.

Verified locally: CPU cmake-configure of the bumped source generates a
gocrispasr link line referencing sources/CrispASR/src/libcrispasr.a with no
dangling -lcrispasr.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-08 16:01:19 +02:00
LocalAI [bot]
337acc4c37 chore: ⬆️ Update antirez/ds4 to c463029c205c2ec8d7ab6c0df4a3f52979091286 (#10189)
* ⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(ds4): link ds4_ssd.o into the backend build

Upstream antirez/ds4 splits the SSD expert-cache into its own ds4_ssd.c
translation unit, whose symbols (ds4_ssd_memory_lock_acquire/release,
ds4_ssd_cache_experts_for_byte_budget, ds4_ssd_auto_cache_plan) are
referenced by ds4.c/ds4_cpu.o. The dependency-bump automation regenerated
this branch from clean master and dropped the prior linkage fix, so the
cpu-ds4 / cublas-ds4 backend builds fail again with undefined references.

Re-apply the ds4_ssd.o linkage GPU-agnostically (mirroring ds4_distributed.o)
in both the backend Makefile (DS4_OBJ_TARGET + the engine-object build rule
for every GPU mode) and CMakeLists.txt (list(APPEND DS4_OBJS ds4_ssd.o)).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-08 11:15:32 +02:00
LocalAI [bot]
618e90cd13 feat(gallery): add Gemma 4 QAT family + MTP speculative-decoding pairs (#10215)
Add the remaining official Google Gemma 4 QAT Q4_0 GGUFs (E2B, E4B,
26B-A4B, 31B) next to the existing 12B entry, each shipping its
multimodal mmproj.

Also add three MTP (Multi-Token Prediction) speculative-decoding bundles
that pair each QAT target with a QAT-matched assistant/drafter head:

  - 12B       <- Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF
  - 26B-A4B   <- boxwrench/gemma-4-qat-mtp-assistant-heads
  - 31B       <- boxwrench/gemma-4-qat-mtp-assistant-heads

The assistant heads use the gemma4_assistant architecture and are not
standalone chat models, so each entry bundles the target + draft and
sets draft_model together with the draft-mtp spec options
(spec_type:draft-mtp / spec_n_max:6 / spec_p_min:0.75), matching
MTPSpecOptions() in core/config/mtp.go. QAT-matched heads raise draft
acceptance substantially over generic non-QAT heads.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-08 10:26:42 +02:00
LocalAI [bot]
92dea961c2 fix: distributed backend reinstall/upgrade UI stuck on 'reinstalling' (#10214)
* fix(galleryop): self-evict terminal ops from OpCache.GetStatus

The processingBackends map (the UI 'reinstalling' spinner source) only cleared
an op when a client polled /api/backends/job/:uid. The Manage-page Reinstall and
Upgrade buttons never poll, so completed installs leaked into processingBackends
forever and the backend card spun 'reinstalling' even though the install had
finished. Evict terminal ops on the list read instead; DeleteUUID already
broadcasts the eviction so peer replicas converge.

Reproduced on a live 5-node distributed cluster: 5 backends sat in
processingBackends with underlying jobs reporting completed:true,progress:100.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): clear pending backend ops behind offline/draining nodes

ListDuePendingBackendOps filters status=healthy, so a backend op queued against
a node that went offline (stale heartbeat) or draining (admin action) was never
retried, aged out, or deleted - it leaked forever and kept the UI operation
spinning. Add DeleteStalePendingBackendOps and run it each reconcile pass:
draining nodes are cleared immediately (model rows already purged), offline
nodes once their heartbeat is older than a grace window (blip protection).

Reproduced on a live cluster: orphaned llama-cpp install rows targeting an
offline (nvidia-thor) and a draining (mac-mini-m4) node sat at attempts=0
indefinitely.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): stream per-node progress during backend upgrade

The install dispatch subscribed to a per-op progress subject and streamed
per-node download ticks; the upgrade dispatch did a bare 15-minute blocking
NATS round-trip with no subscription, so the UI showed progress:0 the whole
time (the 'reinstalling but nothing happens' report on a slow node).

Thread the op ID through BackendManager.UpgradeBackend -> the distributed
manager -> the adapter, and have the adapter subscribe to the per-op progress
subject before the request (extracted into a shared subscribeProgress helper
reused by install/upgrade/force-fallback). The worker's upgradeBackend now
creates the same DebouncedInstallProgressPublisher installBackend uses. An
upgrade is a force-reinstall, so it reuses SubjectNodeBackendInstallProgress
rather than minting a new subject - no new NATS permission, no new
rolling-update compat surface. Reconciler-driven retries pass empty
opID/onProgress and stay on the silent path.

Reproduced on a live cluster: upgrade of llama-cpp-development on agx-orin-slow
sat at progress:0 for 4+ minutes with no per-node feedback.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(galleryop): persist cancellation + periodically reap orphaned ops

Two distributed gaps surfaced when a replica was killed mid-upgrade on a live
cluster, leaving the backend stuck 'processing' in the UI forever:

1. CancelOperation flipped the in-memory status to cancelled and broadcast a
   NATS event but never persisted the terminal status. On the next replica
   restart the still-active row re-hydrated straight back into
   processingBackends and the UI spun again. It now calls store.Cancel(id) so
   the cancel survives a restart.

2. CleanStale (which marks abandoned active ops failed) only ran once on
   startup, so an op orphaned AFTER startup - its owning replica's foreground
   handler goroutine gone - was never reaped until the next restart. Add
   GalleryService.ReapStaleOperations and run it on a 15m ticker (CleanStale
   now returns the reaped count for observability).

Neither is covered by the OpCache self-evict fix: an orphaned op never reaches
Processed, so it would never self-evict.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(review): address self-review findings on the distributed install fixes

Three findings from an adversarial review of this branch:

1. CRITICAL - OpCache.GetStatus crashed under concurrent load. m.Map() returns
   the live internal map by reference, so deleting from it on the read path was
   an unsynchronized write to a map four HTTP handlers poll every ~1s -> a
   'concurrent map writes' fatal. Rewritten to iterate a Keys() snapshot, build
   a fresh result map, and apply evictions via the locked DeleteUUID after the
   loop. Added a -race concurrency regression guard.

2. HIGH - GetStatus evicted failed ops too, hiding them from /api/operations
   and breaking the dismiss-failed-op flow (the panel keeps Error != nil ops so
   the admin can read the error and click Dismiss). Eviction now fires only for
   terminal ops with Error == nil (success/cancelled); failures are retained.

3. MEDIUM - DeleteStalePendingBackendOps missed StatusUnhealthy nodes. A node
   marked unhealthy on a NATS ErrNoResponders never transitions to offline
   (health.go skips re-marking it), so its pending ops leaked exactly like the
   offline case. Unhealthy is now reaped via the same stale-heartbeat grace path
   (a fresh-heartbeat node is recovering and keeps its op).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(review-2): don't evict the still-installing soft-path; don't spin on failed ops

Second review pass found two issues:

1. MEDIUM (Go) - OpCache.GetStatus evicted the ErrWorkerStillInstalling
   soft-path op. That op is deliberately Processed=true with no error to show a
   yellow in-progress state when a worker timed out the NATS round-trip but is
   still installing in the background; the reconciler confirms the real outcome
   later. Evicting it (and broadcasting OpEnd + marking the DB completed) hid an
   install that may still fail. Eviction is now scoped to a clean success
   (progress 100 + 'completed', matching the job-poll's historical condition) or
   a cancellation - the soft-path (progress != 100) and failures are kept.

2. MEDIUM (React) - the Backends gallery card rendered ANY operation as an
   'Installing...' spinner, so a failed op (now intentionally kept in the list
   for the OperationsBar error + Dismiss) spun forever. Exclude errored ops from
   the card spinner, mirroring Models.jsx (isInstalling already excludes
   op.error). The error + Dismiss still surface in the global OperationsBar.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ui): refresh Manage backends table when an operation settles

The Manage backends table fetched installed backends only on mount/after delete
and checked upgrades only on tab activation. After a reinstall/upgrade completed
neither re-ran, so the installed-version cell and the 'update available' badge
stayed stale until the user switched tabs - the op looked like it 'did nothing'.

Watch the operations list (via useOperations) and re-fetch installed backends +
available upgrades whenever the count settles, mirroring the operations.length
watch Backends.jsx already uses. Consolidates the prior tab-activation upgrades
check into the same effect.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-08 10:03:02 +02:00
LocalAI [bot]
2e93186043 chore: ⬆️ Update ggml-org/llama.cpp to 9e3b928fd8c9d14dbf15a8768b9fdd7e5c721d66 (#10210)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-08 09:35:17 +02:00
LocalAI [bot]
d07037e817 chore: ⬆️ Update leejet/stable-diffusion.cpp to b3d56d0ba1bd437886079e339118e8e75bb79ee7 (#10211)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-08 09:03:57 +02:00
LocalAI [bot]
f6cc90d258 chore: ⬆️ Update mudler/parakeet.cpp to e270af73b94c9a5c37ec516230219ed4580e1db6 (#10212)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-07 23:52:44 +02:00
Adira
2c804bef5a fix(config): skip vocab arrays and mmap GGUF headers to speed up startup (#10213)
When the models directory holds many GGUF files, startup parsed every
model's full GGUF — including the tokenizer vocab arrays
(tokenizer.ggml.tokens/scores/merges, often >100k entries) — once per
model while guessing defaults. On slow storage (e.g. a models directory
on a Docker volume) those hundreds of thousands of tiny reads dominate
boot time before the HTTP server comes up.

The default-guessing path and the VRAM metadata reader only consume
scalar metadata and array lengths, never the array contents. Parse with
SkipLargeMetadata (seek past large arrays) and UseMMap (fault in a few
header pages instead of issuing per-element read() syscalls). For a
256k-token vocab this cuts the parse from ~524k read() syscalls to 8.
The mapping is released when ParseGGUFFile returns.

Fixes #9790

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
2026-06-07 23:33:52 +02:00
LocalAI [bot]
6070402477 chore(model gallery): 🤖 add 1 new models via gallery agent (#10209)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-07 22:09:32 +02:00
LocalAI [bot]
67f80a152b fix(mtp): don't auto-enable self-spec MTP for draft-only assistant GGUFs (#10208)
Gemma4 MTP (ggml-org/llama.cpp#23398) registers the prediction head as a
separate `gemma4-assistant` architecture. That assistant GGUF still carries
`<arch>.nextn_predict_layers`, so the architecture-agnostic detection in
HasEmbeddedMTPHead matched it and appended the `spec_type:draft-mtp` defaults.

Unlike the DeepSeek/Qwen embedded-head models, an assistant checkpoint cannot
self-speculate: it is a draft model that requires a paired target context
(`ctx_other`) and throws if loaded alone. Auto-applying the self-spec defaults
to a standalone assistant import therefore produces a broken config.

Guard the detection against draft-only assistant architectures (the `-assistant`
suffix is upstream's naming convention) so importing one no longer yields a
self-speculation config. Two-model target+draft pairing remains expressible
manually via `draft_model:` and is left to a follow-up.


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-07 22:09:02 +02:00
LocalAI [bot]
a7cb587d96 feat(parakeet-cpp): real segment timestamps (NeMo-faithful) (#10207)
* feat(parakeet-cpp): real segment timestamps (NeMo-faithful)

Offline: replace the single synthetic whole-clip segment with multiple
segments grouped exactly like NeMo's get_segment_offsets - a new segment
after sentence-ending punctuation ('. ? !'), each carrying start/end and
its time-window token ids. The optional model option segment_gap_threshold
(NeMo's unit: encoder FRAMES, default 0=off) adds NeMo's silence-gap split,
converted to seconds via the JSON frame_sec the engine now reports.
Per-segment words are still gated behind timestamp_granularities=["word"];
a zero-word document falls back to a single text segment.

Streaming: when libparakeet.so exposes the ABI v4 JSON entry points
(probed), drive parakeet_capi_stream_feed_json / _finalize_json and
accumulate the streamed per-word timestamps into per-utterance segments
(EOU stays the boundary), so streaming FinalResult segments now carry
start/end. Falls back to the text-only feed against an older library.

Pure-Go specs cover splitWordsIntoSegments (punctuation + gap rules, NeMo
elif order, fallback), transcriptResultFromDoc (multi-segment, token
windows, word-granularity gate), and the streaming segmenter.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(audio): document parakeet-cpp segment timestamps + segment_gap_threshold

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(parakeet-cpp): update model-gated specs for multi-segment output

The offline AudioTranscription specs asserted the old single synthetic
segment (Segments HaveLen(1), Segments[0].Text == res.Text). With
NeMo-faithful segmentation a multi-sentence clip now yields multiple
punctuation-delimited segments, so assert the new contract instead:
one-or-more time-ordered segments, each with text and (under word
granularity) per-segment words whose span tracks the segment start/end.
Caught by running the model-gated suite on the dgx (GB10) against the
real tdt_ctc-110m + realtime_eou models.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-07 22:08:24 +02:00
LocalAI [bot]
f7c74ad2da chore: ⬆️ Update ggml-org/llama.cpp to 31e82494c0a3913c919c1027fa70500fbf4c07dd (#10191)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-07 10:43:17 +02:00
LocalAI [bot]
7402d1fd20 chore(turboquant): bump to 7d9715f1 + fix compilation against rebased fork (#10205)
* chore(turboquant): bump TheTom/llama-cpp-turboquant to 7d9715f1

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(turboquant): drop obsolete legacy-spec shim after fork rebased

The TheTom/llama-cpp-turboquant fork (pin c9aa86a) rebased past the
upstream common_params_speculative refactor (ggml-org/llama.cpp
#22397/#22838/#22964), the model_tgt rename (#22838) and get_media_marker
(#21962). The old fork-compat shim forced now-wrong legacy code paths,
breaking the build with errors like 'struct common_params_speculative has
no member named mparams_dft / type' and 'server_context_impl has no member
named model'.

Remove the obsolete LOCALAI_LEGACY_LLAMA_CPP_SPEC branches from the shared
grpc-server.cpp (stock llama-cpp and the modern fork both take the modern
path now), and narrow the one remaining gap (the fork still lacks
common_params::checkpoint_min_step) to a dedicated
LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP guard injected by
patch-grpc-server.sh. The patch script now only adds the turbo2/3/4
KV-cache types and injects that one macro.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(turboquant): HIP-port the fork's CUDA additions (copy2d 3D-peer + cudaEventCreate)

The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs that
ggml's HIP/MUSA shim does not provide, breaking the -gpu-rocm-hipblas-turboquant
build. patches/0001-hip-guard-copy2d-peer-fastpath.patch (applied by
apply-patches.sh) ports them:

- Guard ggml_cuda_copy2d_across_devices's 3D-peer copy fast path with
  #if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) so HIP/MUSA fall through
  to the existing cudaMemcpyAsync staging fallback (HIP genuinely lacks
  cudaMemcpy3DPeerAsync, per the fork's own comment).
- Create the device event in ggml_backend_cuda_device_event_new with the
  HIP-aliased cudaEventCreateWithFlags(.., cudaEventDisableTiming) instead of the
  un-aliased plain cudaEventCreate, matching this file's own usage elsewhere.

CUDA builds are unaffected.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* ci(turboquant): drop the ROCm/hipblas build flavor

The TheTom/llama-cpp-turboquant fork is not ROCm-clean at the current pin:
beyond the CUDA-API gaps already patched (3D-peer copy, cudaEventCreate),
its llama.cpp base fails to compile the flash-attention MMA f16 kernels for
head-dim 640 under HIP (cols_per_warp evaluates to 0 -> division-by-zero /
non-constant static asserts in fattn-mma-f16.cuh). That is a deep
ggml-on-ROCm kernel issue, not something a small fork patch can paper over.

Drop -gpu-rocm-hipblas-turboquant from the build matrix so turboquant still
ships for cpu / cublas / vulkan / sycl. Re-add it once the fork's HIP path
compiles (or upstream ggml fixes the large-head-dim MMA kernels for ROCm).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-07 10:42:06 +02:00
LocalAI [bot]
8c42695ef8 chore: ⬆️ Update ggml-org/whisper.cpp to a8ec021f2750a473ff4a8f3883bc9fdf5feafa84 (#10202)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-07 08:37:42 +02:00
LocalAI [bot]
72e3241431 chore: ⬆️ Update mudler/parakeet.cpp to abd0087dcc92ec5ad1f96f9fd86c49eb26a5ce67 (#10204)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-07 00:37:28 +02:00
LocalAI [bot]
cd2bf95862 fix(docs): use relearn notice shortcode instead of unsupported alert (#10206)
The Hugo relearn theme does not provide an "alert" shortcode, so the
docs deploy failed at the Build site step:

  failed to extract shortcode: template for shortcode "alert" not found
  docs/content/features/distributed-mode.md:136

Convert the warning block to the theme-supported notice shortcode used
everywhere else in the docs.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-07 00:37:12 +02:00
LocalAI [bot]
f64b72dd7d feat: support Ideogram4 in stablediffusion-ggml backend + gallery (#10201)
* feat(stablediffusion-ggml): support Ideogram4 unconditional diffusion model

Bump stable-diffusion.cpp from 1f9ee88 to b9254dd, the upstream commit that
adds Ideogram4 support (leejet/stable-diffusion.cpp#1609). Ideogram4 derives
its classifier-free guidance from a separate unconditional diffusion model,
exposed upstream through the new sd_ctx_params_t.uncond_diffusion_model_path
field.

Wire that field into the gosd wrapper via a new uncond_diffusion_model_path
option. The _path suffix is deliberate: the Go loader only resolves options
whose name contains "path" to an absolute path under the model directory, so
this keeps the option consistent with diffusion_model_path and
high_noise_diffusion_model_path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(gallery): add Ideogram4 stablediffusion-ggml models

Single-file GGUF weights for Ideogram4 are now published
(stduhpf/ideogram-4-gguf), so add the model to the gallery. Ideogram4 is a
text-to-image model with strong, accurate in-image text rendering, driven by
a Qwen3-VL-8B text encoder and real classifier-free guidance from a separate
unconditional diffusion model (the uncond_diffusion_model_path support added
in the preceding commit).

Two index entries, both built on gallery/virtual.yaml with the full config
inlined in overrides (same pattern as the other models, no dedicated template
file):
- ideogram-4-iq4nl-ggml (4-bit, ~11.6GB diffusion)
- ideogram-4-q8_0-ggml  (8-bit, ~20GB diffusion)

Each bundles the diffusion + unconditional GGUF (stduhpf), the
Qwen3-VL-8B-Instruct text encoder (unsloth), and the FLUX.2 VAE (Comfy-Org
mirror, non-gated). cfg_scale is 7 to match the upstream Ideogram4 default,
since it performs real CFG unlike the guidance-distilled Flux/Z-Image models.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-06 22:50:12 +02:00
LocalAI [bot]
03c84cff28 feat(parakeet-cpp): nemotron-3.5-asr multilingual streaming model + request language support (#10199)
* feat(parakeet-cpp): honor request language (multilingual nemotron) on batched + streaming paths

Reads opts.GetLanguage() and threads it through to the new
parakeet_capi_transcribe_pcm_batch_json_lang and parakeet_capi_stream_begin_lang
C-API entry points, both probed with Dlsym so the backend still loads against an
older libparakeet.so (falling back to the non-lang paths, i.e. model default).

parakeet.cpp's batched C-API takes a single target_lang for the whole batch, so
the dispatcher only coalesces same-language requests: a request whose language
differs from the batch leader is held as a single carry-over and becomes the
leader of the next batch, never dropped and never left waiting (including on
shutdown). A new batcher test asserts no dispatched batch is ever mixed-language
and that every submitted request still receives a reply.

Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(gallery): add parakeet-cpp-nemotron-3.5-asr-streaming-0.6b; bump parakeet.cpp pin

Adds the multilingual prompt-conditioned streaming model to the gallery (q8_0
default, OpenMDW-1.1) and bumps the parakeet-cpp backend pin to the parakeet.cpp
commit that ships nemotron support plus batched causal subsampling and the
batched target_lang C-API.

Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-06 13:53:10 +02:00
LocalAI [bot]
9bc69c9e5f chore(model gallery): 🤖 add 1 new models via gallery agent (#10200)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-06 13:52:46 +02:00
LocalAI [bot]
1e6c9cfd60 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 6b9de3dbaa21ae95ea80638e5ee836795cc48c93 (#10190)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-06 09:42:43 +02:00
LocalAI [bot]
0e6712f734 chore: ⬆️ Update mudler/parakeet.cpp to 843600590f96a31467a5199f827c253f34c110f7 (#10198)
chore(parakeet-cpp): bump pin to banded long-audio attention (843600590)

Update PARAKEET_VERSION to mudler/parakeet.cpp@843600590f
(merge of parakeet.cpp#9). Brings NeMo rel_pos_local_attn banded/Longformer
attention with the chunk-matmul construction: long audio now uses O(T*window)
attention instead of global O(T^2), fixing the encoder OOM on long clips
(~16.6-min clip: 54GB->9.4GB peak, ~4x faster) at NeMo's full [128,128] window.
Short clips are unchanged (global path). No C-ABI change.


Assisted-by: Claude:claude-opus-4-8

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-06 09:25:25 +02:00
LocalAI [bot]
0e4cee9a97 chore: bump LocalAGI + localrecall (fix pgvector hybrid search seqscan, #10186) (#10192)
chore: bump LocalAGI and localrecall (index-backed RRF hybrid search)

Bumps the agent stack to pull in the PostgreSQL hybrid-search fix:

- mudler/localrecall -> v0.6.3-...-9a3b3321a9cd (mudler/LocalRecall#46, merged)
- mudler/LocalAGI    -> ...-14aed1ae4336 (mudler/LocalAGI#477, merged)

localrecall's hybrid search previously sorted on a wrapped scalar
similarity expression, which blinded the planner into a full sequential
scan over every row and exceeded the statement timeout on large
collections, returning an empty result set. It now uses the canonical
Reciprocal Rank Fusion pattern (index-backed candidate retrieval + FULL
OUTER JOIN + weighted RRF).

Fixes #10186

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-06 09:16:59 +02:00
112 changed files with 5869 additions and 533 deletions

View File

@@ -1766,20 +1766,6 @@ include:
dockerfile: "./backend/Dockerfile.llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-turboquant'
builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-rocm-amd64'
runs-on: 'ubuntu-latest'
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "turboquant"
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""

View File

@@ -180,7 +180,7 @@ osx-signed: build
## Run
run: ## run local-ai
CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./
CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./cmd/local-ai
prepare-test: protogen-go build-mock-backend

View File

@@ -149,6 +149,16 @@ local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
local-ai run oci://localai/phi-2:latest
```
To test a running LocalAI server from the terminal, open an interactive chat session from another shell. Inside the prompt, `/models` lists installed models and `/model <name>` switches between them.
```bash
# Terminal 1
local-ai run llama-3.2-1b-instruct:q4_k_m
# Terminal 2
local-ai chat --model llama-3.2-1b-instruct:q4_k_m
```
> **Automatic Backend Detection**: LocalAI automatically detects your GPU capabilities and downloads the appropriate backend. For advanced options, see [GPU Acceleration](https://localai.io/features/gpu-acceleration/).
For more details, see the [Getting Started guide](https://localai.io/basics/getting_started/).

View File

@@ -60,10 +60,12 @@ elseif(DS4_GPU STREQUAL "cpu")
set(DS4_OBJS "${DS4_DIR}/ds4_cpu.o")
endif()
# ds4.c now references ds4_distributed.c (distributed inference was split into
# its own translation unit upstream). It is a single GPU-agnostic object shared
# by every GPU mode, so link it in regardless of DS4_GPU.
# ds4.c now references ds4_distributed.c (distributed inference) and ds4_ssd.c
# (SSD expert-cache), each split into its own translation unit upstream. Both
# are GPU-agnostic objects shared by every GPU mode, so link them in regardless
# of DS4_GPU.
list(APPEND DS4_OBJS "${DS4_DIR}/ds4_distributed.o")
list(APPEND DS4_OBJS "${DS4_DIR}/ds4_ssd.o")
add_executable(${TARGET}
grpc-server.cpp

View File

@@ -1,10 +1,10 @@
# ds4 backend Makefile.
#
# Upstream pin lives below as DS4_VERSION?=477c0e82e2699b35a65fd0a1ed6fe66b41087dfe
# Upstream pin lives below as DS4_VERSION?=91bafb5acd5a6cf00b1e55ef68bf40ddd207bee7
# (.github/bump_deps.sh) can find and update it - matches the
# llama-cpp / ik-llama-cpp / turboquant convention.
DS4_VERSION?=477c0e82e2699b35a65fd0a1ed6fe66b41087dfe
DS4_VERSION?=91bafb5acd5a6cf00b1e55ef68bf40ddd207bee7
DS4_REPO?=https://github.com/antirez/ds4
CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
@@ -18,19 +18,20 @@ UNAME_S := $(shell uname -s)
CMAKE_ARGS ?= -DCMAKE_BUILD_TYPE=Release
# ds4_distributed.o is a GPU-agnostic translation unit that ds4.c/ds4_cpu.o now
# reference (upstream split distributed inference into its own .c). The same
# object is shared by every GPU mode, so it is appended unconditionally below.
# ds4_distributed.o and ds4_ssd.o are GPU-agnostic translation units that
# ds4.c/ds4_cpu.o now reference (upstream split distributed inference and the
# SSD expert-cache into their own .c files). Both objects are shared by every
# GPU mode, so they are appended unconditionally below.
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS += -DDS4_GPU=cuda
DS4_OBJ_TARGET := ds4.o ds4_cuda.o ds4_distributed.o
DS4_OBJ_TARGET := ds4.o ds4_cuda.o ds4_distributed.o ds4_ssd.o
else ifeq ($(UNAME_S),Darwin)
CMAKE_ARGS += -DDS4_GPU=metal
DS4_OBJ_TARGET := ds4.o ds4_metal.o ds4_distributed.o
DS4_OBJ_TARGET := ds4.o ds4_metal.o ds4_distributed.o ds4_ssd.o
else
# CPU reference path (Linux only - macOS CPU path is broken by VM bug per ds4 README).
CMAKE_ARGS += -DDS4_GPU=cpu
DS4_OBJ_TARGET := ds4_cpu.o ds4_distributed.o
DS4_OBJ_TARGET := ds4_cpu.o ds4_distributed.o ds4_ssd.o
endif
ifneq ($(NATIVE),true)
@@ -55,11 +56,11 @@ ds4:
# the right per-platform compile flags (Objective-C/Metal on Darwin, nvcc on Linux+CUDA).
ds4/ds4.o: ds4
ifeq ($(BUILD_TYPE),cublas)
+$(MAKE) -C ds4 ds4.o ds4_cuda.o ds4_distributed.o
+$(MAKE) -C ds4 ds4.o ds4_cuda.o ds4_distributed.o ds4_ssd.o
else ifeq ($(UNAME_S),Darwin)
+$(MAKE) -C ds4 ds4.o ds4_metal.o ds4_distributed.o
+$(MAKE) -C ds4 ds4.o ds4_metal.o ds4_distributed.o ds4_ssd.o
else
+$(MAKE) -C ds4 ds4_cpu.o ds4_distributed.o
+$(MAKE) -C ds4 ds4_cpu.o ds4_distributed.o ds4_ssd.o
endif
grpc-server: ds4/ds4.o

View File

@@ -1,5 +1,5 @@
IK_LLAMA_VERSION?=1520eda980564241434b791ce2bbbd128c4be9ea
IK_LLAMA_VERSION?=e6f8112f3ba126eed3ff5b30cdd08085414a7516
LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
CMAKE_ARGS?=

View File

@@ -1,5 +1,5 @@
LLAMA_VERSION?=7c158fbb4aec1bdc9c81d6ca0e785139f4826fae
LLAMA_VERSION?=039e20a2db9e87b2477c76cc04905f3e1acad77f
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
CMAKE_ARGS?=

View File

@@ -381,6 +381,15 @@ json parse_options(bool streaming, const backend::PredictOptions* predict, const
});
}
// for each video in the request, add the video data
for (int i = 0; i < predict->videos_size(); i++) {
data["video_data"].push_back(json
{
{"id", i},
{"data", predict->videos(i)},
});
}
data["stop"] = predict->stopprompts();
// data["n_probs"] = predict->nprobs();
//TODO: images,
@@ -482,23 +491,13 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
if (!request->draftmodel().empty()) {
params.speculative.draft.mparams.path = request->draftmodel();
// Default to draft type if a draft model is set but no explicit type.
// Upstream (post ggml-org/llama.cpp#22838) made the speculative type a
// vector; the turboquant fork still uses the legacy scalar. The
// LOCALAI_LEGACY_LLAMA_CPP_SPEC macro is injected by
// backend/cpp/turboquant/patch-grpc-server.sh for fork builds only.
// Upstream renamed COMMON_SPECULATIVE_TYPE_DRAFT -> ..._DRAFT_SIMPLE
// in ggml-org/llama.cpp#22964; the fork still uses the old name.
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) {
params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT;
}
#else
// Upstream made the speculative type a vector (ggml-org/llama.cpp#22838)
// and renamed COMMON_SPECULATIVE_TYPE_DRAFT -> ..._DRAFT_SIMPLE (#22964).
const bool no_spec_type = params.speculative.types.empty() ||
(params.speculative.types.size() == 1 && params.speculative.types[0] == COMMON_SPECULATIVE_TYPE_NONE);
if (no_spec_type) {
params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE };
}
#endif
}
// params.model_alias ??
@@ -574,9 +573,10 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
// tokens (0 disables the minimum). Match upstream's default (256). This
// field was renamed from `checkpoint_every_nt` in llama.cpp; the semantics
// also shifted from a fixed cadence to a minimum spacing. The turboquant
// fork branched before the field existed, so skip it on the legacy path
// (LOCALAI_LEGACY_LLAMA_CPP_SPEC is injected by patch-grpc-server.sh).
#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC
// fork still lacks common_params::checkpoint_min_step, so skip it there
// (LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP is injected by
// backend/cpp/turboquant/patch-grpc-server.sh).
#ifndef LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP
params.checkpoint_min_step = 256;
#endif
@@ -752,7 +752,7 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
params.cache_idle_slots = false;
}
#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC
#ifndef LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP
// --- minimum context-checkpoint spacing (upstream -cms / --checkpoint-min-step) ---
// 0 disables the minimum-spacing gate. Old option names (`checkpoint_every_nt`,
// `checkpoint_every_n_tokens`) are kept as aliases for backward compatibility
@@ -906,17 +906,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
// Speculative decoding options
} else if (!strcmp(optname, "spec_type") || !strcmp(optname, "speculative_type")) {
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
// Fork only knows a single scalar `type`. Take the first comma-
// separated value and assign it via the singular helper.
std::string first = optval_str;
const auto comma = first.find(',');
if (comma != std::string::npos) first = first.substr(0, comma);
auto type = common_speculative_type_from_name(first);
if (type != COMMON_SPECULATIVE_TYPE_COUNT) {
params.speculative.type = type;
}
#else
// Upstream switched to a vector of types (comma-separated for multi-type
// chaining via common_speculative_types_from_names). We keep accepting a
// single value here, but also tolerate comma-separated lists.
@@ -945,7 +934,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
if (!parsed.empty()) {
params.speculative.types = parsed;
}
#endif
} else if (!strcmp(optname, "spec_n_max") || !strcmp(optname, "draft_max")) {
if (optval != NULL) {
try { params.speculative.draft.n_max = std::stoi(optval_str); } catch (...) {}
@@ -983,21 +971,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
// shares the target context size. Accept the option for backward
// compatibility but silently ignore it.
// Everything below relies on struct shape introduced in ggml-org/llama.cpp#22838
// (parallel drafting): `ngram_mod`, `ngram_map_k`, `ngram_map_k4v`,
// `ngram_cache`, and the `draft.{cache_type_*, cpuparams*, tensor_buft_overrides}`
// fields. The turboquant fork branched before that, so its build defines
// LOCALAI_LEGACY_LLAMA_CPP_SPEC via patch-grpc-server.sh and these option
// keys become unrecognized (silently dropped, like any unknown opt) for it.
//
// The `#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC` / `#else` split below sits at the
// closing-brace position of the `draft_ctx_size` branch on purpose: in the
// legacy build the chain ends here (the brace closes draft_ctx_size), and in
// the modern build the chain continues with `} else if (...)` instead, so the
// brace count stays balanced under both branches of the preprocessor.
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
}
#else
// --- ngram_mod family (upstream --spec-ngram-mod-*) ---
} else if (!strcmp(optname, "spec_ngram_mod_n_min")) {
if (optval != NULL) {
@@ -1127,7 +1100,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
}
if (!cur.empty()) flush(cur);
}
#endif // LOCALAI_LEGACY_LLAMA_CPP_SPEC — closes the `else`/`#ifdef` opened at draft_ctx_size
}
// Set params.n_parallel from environment variable if not set via options (fallback)
@@ -1177,15 +1149,11 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
params.tensor_buft_overrides.push_back({nullptr, nullptr});
}
}
// The draft tensor_buft_overrides are only populated under the modern
// (post-#22838) layout, whose population code is itself gated by
// LOCALAI_LEGACY_LLAMA_CPP_SPEC above. The turboquant fork lacks
// common_params_speculative::draft entirely, so skip the sentinel there too.
#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC
// Terminate the draft tensor_buft_overrides list with a sentinel, mirroring
// the main-model handling above.
if (!params.speculative.draft.tensor_buft_overrides.empty()) {
params.speculative.draft.tensor_buft_overrides.push_back({nullptr, nullptr});
}
#endif
// TODO: Add yarn
@@ -1544,7 +1512,7 @@ public:
msg_json["role"] = msg.role();
bool is_last_user_msg = (i == last_user_msg_idx);
bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0);
bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0 || request->videos_size() > 0);
// Handle content - can be string, null, or array
// For multimodal content, we'll embed images/audio from separate fields
@@ -1595,6 +1563,16 @@ public:
content_array.push_back(audio_chunk);
}
}
if (request->videos_size() > 0) {
for (int j = 0; j < request->videos_size(); j++) {
json video_chunk;
video_chunk["type"] = "input_video";
json input_video;
input_video["data"] = request->videos(j);
video_chunk["input_video"] = input_video;
content_array.push_back(video_chunk);
}
}
msg_json["content"] = content_array;
} else {
// Use content as-is (already array or not last user message)
@@ -1629,6 +1607,16 @@ public:
content_array.push_back(audio_chunk);
}
}
if (request->videos_size() > 0) {
for (int j = 0; j < request->videos_size(); j++) {
json video_chunk;
video_chunk["type"] = "input_video";
json input_video;
input_video["data"] = request->videos(j);
video_chunk["input_video"] = input_video;
content_array.push_back(video_chunk);
}
}
msg_json["content"] = content_array;
} else if (msg.role() == "tool") {
// Tool role messages must have content field set, even if empty
@@ -2080,6 +2068,16 @@ public:
files.push_back(decoded_data);
}
}
const auto &video_data = data.find("video_data");
if (video_data != data.end() && video_data->is_array())
{
for (const auto &video : *video_data)
{
auto decoded_data = base64_decode(video["data"].get<std::string>());
files.push_back(decoded_data);
}
}
}
const bool has_mtmd = ctx_server.impl->mctx != nullptr;
@@ -2332,7 +2330,7 @@ public:
}
bool is_last_user_msg = (i == last_user_msg_idx);
bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0);
bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0 || request->videos_size() > 0);
// Handle content - can be string, null, or array
// For multimodal content, we'll embed images/audio from separate fields
@@ -2385,6 +2383,16 @@ public:
content_array.push_back(audio_chunk);
}
}
if (request->videos_size() > 0) {
for (int j = 0; j < request->videos_size(); j++) {
json video_chunk;
video_chunk["type"] = "input_video";
json input_video;
input_video["data"] = request->videos(j);
video_chunk["input_video"] = input_video;
content_array.push_back(video_chunk);
}
}
msg_json["content"] = content_array;
} else {
// Use content as-is (already array or not last user message)
@@ -2424,6 +2432,16 @@ public:
content_array.push_back(audio_chunk);
}
}
if (request->videos_size() > 0) {
for (int j = 0; j < request->videos_size(); j++) {
json video_chunk;
video_chunk["type"] = "input_video";
json input_video;
input_video["data"] = request->videos(j);
video_chunk["input_video"] = input_video;
content_array.push_back(video_chunk);
}
}
msg_json["content"] = content_array;
SRV_INF("[CONTENT DEBUG] Predict: Message %d created content array with media\n", i);
} else if (!msg.tool_calls().empty()) {
@@ -2886,6 +2904,16 @@ public:
files.push_back(decoded_data);
}
}
const auto &video_data = data.find("video_data");
if (video_data != data.end() && video_data->is_array())
{
for (const auto &video : *video_data)
{
auto decoded_data = base64_decode(video["data"].get<std::string>());
files.push_back(decoded_data);
}
}
}
// process files

View File

@@ -1,7 +1,7 @@
# Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
# Auto-bumped nightly by .github/workflows/bump_deps.yaml.
TURBOQUANT_VERSION?=5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403
TURBOQUANT_VERSION?=7d9715f1f071fa07c7b2ad3dbfd320b314139e65
LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant
CMAKE_ARGS?=

View File

@@ -4,21 +4,19 @@
#
# 1. Augment the kv_cache_types[] allow-list so `LoadModel` accepts the
# fork-specific `turbo2` / `turbo3` / `turbo4` cache types.
# 2. Replace `get_media_marker()` (added upstream in ggml-org/llama.cpp#21962,
# server-side random per-instance marker) with the legacy "<__media__>"
# literal. The fork branched before that PR, so server-common.cpp has no
# get_media_marker symbol. The fork's mtmd_default_marker() still returns
# "<__media__>", and Go-side tooling falls back to that sentinel when the
# backend does not expose media_marker, so substituting the literal keeps
# behavior identical on the turboquant path.
# 3. Revert the `common_params_speculative` field references to the
# pre-refactor flat layout. Upstream ggml-org/llama.cpp#22397 split the
# struct into nested `draft` / `ngram_simple` / `ngram_mod` / etc. members;
# the turboquant fork branched before that PR and still exposes the flat
# `n_max`, `mparams_dft`, `ngram_size_n`, ... fields. The substitutions
# below map the new nested paths back to the legacy flat names so the
# shared grpc-server.cpp keeps compiling against the fork's common.h.
# Drop this block once the fork rebases past #22397.
# 2. Define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP at the top of the file
# so the grpc-server option parser skips the two references to
# common_params::checkpoint_min_step (the default and the option handler).
# That field does not exist in the fork yet; drop this once it does.
#
# The fork used to lag upstream on the whole common_params_speculative refactor
# (ggml-org/llama.cpp#22397/#22838/#22964), the model_tgt rename (#22838) and
# get_media_marker (#21962), which required a much larger compat shim here
# (flat-field sed renames + a coarse LOCALAI_LEGACY_LLAMA_CPP_SPEC define). The
# fork has since rebased past all of those, so the only remaining gap is
# checkpoint_min_step. If a future bump reintroduces a divergence, add a narrow
# guard in grpc-server.cpp keyed on a fork-specific macro and inject it here
# rather than resurrecting the coarse one.
#
# We patch the *copy* sitting in turboquant-<flavor>-build/, never the original
# under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps compiling
@@ -72,72 +70,20 @@ else
echo "==> KV allow-list patch OK"
fi
if grep -q 'get_media_marker()' "$SRC"; then
echo "==> patching $SRC to replace get_media_marker() with legacy \"<__media__>\" literal"
# Only one call site today (ModelMetadata), but replace all occurrences to
# stay robust if upstream adds more. Use a temp file to avoid relying on
# sed -i portability (the builder image uses GNU sed, but keeping this
# consistent with the awk block above).
sed 's/get_media_marker()/"<__media__>"/g' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> get_media_marker() substitution OK"
# 2. Define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP at the top of the file so
# the grpc-server option parser skips the two references to
# common_params::checkpoint_min_step (the default assignment and the option
# handler). That field does not exist in the fork yet. Drop this block once
# the fork rebases past the bump that added checkpoint_min_step.
if grep -q '^#define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP' "$SRC"; then
echo "==> $SRC already defines LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP, skipping"
else
echo "==> $SRC has no get_media_marker() call, skipping media-marker patch"
fi
if grep -q 'params\.speculative\.draft\.\|params\.speculative\.ngram_simple\.' "$SRC"; then
echo "==> patching $SRC to revert common_params_speculative refs to pre-#22397 flat layout"
# Each substitution is the exact post-refactor path → legacy flat field.
# Order doesn't matter because the source paths are disjoint, but we keep
# the most-specific (mparams.path) first for readability.
sed -E \
-e 's/params\.speculative\.draft\.mparams\.path/params.speculative.mparams_dft.path/g' \
-e 's/params\.speculative\.draft\.n_max/params.speculative.n_max/g' \
-e 's/params\.speculative\.draft\.n_min/params.speculative.n_min/g' \
-e 's/params\.speculative\.draft\.p_min/params.speculative.p_min/g' \
-e 's/params\.speculative\.draft\.p_split/params.speculative.p_split/g' \
-e 's/params\.speculative\.draft\.n_gpu_layers/params.speculative.n_gpu_layers/g' \
-e 's/params\.speculative\.draft\.n_ctx/params.speculative.n_ctx/g' \
-e 's/params\.speculative\.ngram_simple\.size_n/params.speculative.ngram_size_n/g' \
-e 's/params\.speculative\.ngram_simple\.size_m/params.speculative.ngram_size_m/g' \
-e 's/params\.speculative\.ngram_simple\.min_hits/params.speculative.ngram_min_hits/g' \
"$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> speculative field rename OK"
else
echo "==> $SRC has no post-#22397 speculative field refs, skipping spec rename patch"
fi
# 4. Revert the `ctx_server.impl->model_tgt` rename introduced by upstream
# ggml-org/llama.cpp#22838 (parallel drafting). The turboquant fork still
# exposes the field as `model` on `server_context_impl`. The two call sites
# are in the Rerank and ModelMetadata RPC handlers.
if grep -q 'ctx_server\.impl->model_tgt' "$SRC"; then
echo "==> patching $SRC to revert ctx_server.impl->model_tgt -> ctx_server.impl->model"
sed -E 's/ctx_server\.impl->model_tgt/ctx_server.impl->model/g' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> model_tgt rename OK"
else
echo "==> $SRC has no ctx_server.impl->model_tgt refs, skipping model_tgt rename patch"
fi
# 5. Define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top of the file so the
# grpc-server option parser skips the new option-handler blocks (ngram_mod,
# ngram_map_k, ngram_map_k4v, ngram_cache, draft.cache_type_*, draft.cpuparams*,
# draft.tensor_buft_overrides) introduced for the post-#22838 layout, the
# draft.tensor_buft_overrides sentinel termination, and the
# common_params::checkpoint_min_step default/option (added with the
# 35c9b1f3 bump). Those blocks reference struct fields that simply do not
# exist in the fork.
if grep -q '^#define LOCALAI_LEGACY_LLAMA_CPP_SPEC' "$SRC"; then
echo "==> $SRC already defines LOCALAI_LEGACY_LLAMA_CPP_SPEC, skipping"
else
echo "==> patching $SRC to define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top"
# Insert the define before the very first `#include` so it precedes all the
# speculative-decoding code paths.
echo "==> patching $SRC to define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP at the top"
# Insert the define before the very first `#include` so it precedes the
# checkpoint_min_step references.
awk '
!done && /^#include/ {
print "#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1"
print "#define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP 1"
print "// ^ injected by backend/cpp/turboquant/patch-grpc-server.sh"
print ""
done = 1
@@ -145,13 +91,13 @@ else
{ print }
END {
if (!done) {
print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_LEGACY_LLAMA_CPP_SPEC" > "/dev/stderr"
print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP" > "/dev/stderr"
exit 1
}
}
' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> LOCALAI_LEGACY_LLAMA_CPP_SPEC define OK"
echo "==> LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP define OK"
fi
echo "==> all patches applied"

View File

@@ -0,0 +1,55 @@
hip: port the turboquant CUDA additions that ggml's HIP shim doesn't cover
The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs
that ggml's HIP (and MUSA) compatibility layer does not provide, breaking
the -gpu-rocm-hipblas-turboquant build:
1. ggml_cuda_copy2d_across_devices() (host-staged cross-device copy for
split mul_mat output) uses the CUDA 3D-peer copy APIs
cudaMemcpy3DPeerParms / make_cudaPitchedPtr / make_cudaExtent /
cudaMemcpy3DPeerAsync. HIP genuinely does not support these (see the
fork's own comment "HIP does not support cudaMemcpy3DPeerAsync"), so
guard the peer fast path with #if !defined(GGML_USE_HIP) &&
!defined(GGML_USE_MUSA) -- matching how the fork already guards the
same API for the sibling 2D copy -- and fall through to the existing
cudaMemcpyAsync staging fallback below (functionally identical,
slightly slower on multi-GPU ROCm).
2. ggml_backend_cuda_device_event_new() creates its event with plain
cudaEventCreate, which ggml's HIP shim does not alias (it only aliases
cudaEventCreateWithFlags). Use cudaEventCreateWithFlags(...,
cudaEventDisableTiming) -- exactly what the rest of this file already
does (cf. lines ~1034, ~3461) and HIP-safe.
CUDA builds are unaffected. Drop the relevant hunk once the fork HIP-ports
these; apply-patches.sh fails fast if an anchor goes stale.
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 0427e6b..6352e6a 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -1933,6 +1933,7 @@ static cudaError_t ggml_cuda_copy2d_across_devices(
size_t width, size_t height, cudaStream_t dst_stream, cudaStream_t src_stream) {
const auto & info = ggml_cuda_info();
+#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) // 3D-peer copy types unmapped by ggml's HIP/MUSA shim; use staging fallback below
if (info.peer_access[src_device][dst_device]) {
cudaMemcpy3DPeerParms p = {};
p.dstDevice = dst_device;
@@ -1942,6 +1943,7 @@ static cudaError_t ggml_cuda_copy2d_across_devices(
p.extent = make_cudaExtent(width, height, 1);
return cudaMemcpy3DPeerAsync(&p, dst_stream);
}
+#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
// Fallback: stage all rows through a single contiguous pinned buffer
int prev_device = ggml_cuda_get_device();
@@ -5714,7 +5716,7 @@ static ggml_backend_event_t ggml_backend_cuda_device_event_new(ggml_backend_dev_
ggml_cuda_set_device(dev_ctx->device);
cudaEvent_t event;
- CUDA_CHECK(cudaEventCreate(&event));
+ CUDA_CHECK(cudaEventCreateWithFlags(&event, cudaEventDisableTiming));
return new ggml_backend_event {
/* .device = */ dev,

View File

@@ -14,7 +14,7 @@ target_include_directories(gocrispasr PRIVATE
# whisper. crispasr is the referencer; the backend static libs supply the
# per-architecture symbols; ggml is the math/runtime base.
target_link_libraries(gocrispasr PRIVATE
crispasr
crispasr-lib
parakeet canary canary_ctc cohere granite_speech granite_nle
voxtral voxtral4b qwen3_asr qwen3_tts orpheus chatterbox indextts
kokoro voxcpm2_tts m2m100 t5_translate wav2vec2-ggml vibevoice

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# CrispASR version (release tag)
CRISPASR_REPO?=https://github.com/CrispStrobe/CrispASR
CRISPASR_VERSION?=13d54e110e1538e0f0bc3af0680b9ab246cfb48d
CRISPASR_VERSION?=c29f6653a516a3001d923944dad8892072cc7334
SO_TARGET?=libgocrispasr.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF

View File

@@ -1,6 +1,6 @@
# parakeet-cpp backend Makefile.
#
# Upstream pin lives below as PARAKEET_VERSION?=b11fe5bca78ad8b342dd559a43d76df3984bb447
# Upstream pin lives below as PARAKEET_VERSION?=e270af73b94c9a5c37ec516230219ed4580e1db6
# (.github/bump_deps.sh) can find and update it - matches the
# whisper.cpp / ds4 / vibevoice-cpp convention.
#
@@ -15,7 +15,7 @@
# That's what the L0 smoke test uses. The default target below does the
# proper clone-at-pin + cmake build so CI doesn't need a side-checkout.
PARAKEET_VERSION?=b11fe5bca78ad8b342dd559a43d76df3984bb447
PARAKEET_VERSION?=e270af73b94c9a5c37ec516230219ed4580e1db6
PARAKEET_REPO?=https://github.com/mudler/parakeet.cpp
GOCMD?=go

View File

@@ -7,8 +7,12 @@ import "time"
type batchRequest struct {
pcm []float32
decoder int32
tag string
reply chan batchReply
// language is the per-request target locale ("" means the model default).
// parakeet.cpp's batched C-API takes ONE target_lang for the whole batch,
// so the dispatcher only coalesces requests that share a language.
language string
tag string
reply chan batchReply
}
// batchReply carries one per-item JSON object string (an element of the C-API's
@@ -43,13 +47,25 @@ func newBatcher(maxSize int, maxWait time.Duration, runBatch func([]*batchReques
// run is the dispatcher loop: accumulate submitted requests until either maxSize
// is reached or maxWait elapses since the first queued request, then dispatch.
// Exits when stop is closed (draining any partially-filled batch first).
//
// A batch carries ONE language (parakeet.cpp's batched C-API takes a single
// target_lang), so a request whose language differs from the batch leader is
// not coalesced: it is held in carry and becomes the leader of the next batch.
// carry is therefore never dropped and its caller never deadlocks: every batch
// (including a lone carry on stop) is dispatched, and runBatch replies to all.
func (b *batcher) run(stop <-chan struct{}) {
var carry *batchRequest
for {
var first *batchRequest
select {
case first = <-b.submit:
case <-stop:
return
if carry != nil {
// A mismatched request from the previous fill leads this batch.
first, carry = carry, nil
} else {
select {
case first = <-b.submit:
case <-stop:
return
}
}
batch := []*batchRequest{first}
@@ -64,12 +80,22 @@ func (b *batcher) run(stop <-chan struct{}) {
for len(batch) < b.maxSize {
select {
case r := <-b.submit:
if r.language != first.language {
// Different language: carry it to the next batch so this
// batch stays single-language, then dispatch what we have.
carry = r
break fill
}
batch = append(batch, r)
case <-timer.C:
break fill
case <-stop:
timer.Stop()
b.runBatch(batch)
// Don't strand a carried request's caller on shutdown.
if carry != nil {
b.runBatch([]*batchRequest{carry})
}
return
}
}

View File

@@ -105,4 +105,60 @@ var _ = Describe("batcher", func() {
go func() { <-rep }()
Eventually(dispatched, "2s").Should(Receive(Equal(1)))
})
It("never coalesces requests with different languages into one batch", func() {
// parakeet.cpp's batched C-API takes ONE target_lang per batch, so the
// dispatcher must keep every dispatched batch single-language. Submit a
// mix of languages and assert (a) no batch ever carries more than one
// distinct language and (b) every submitted request still gets a reply
// (the mismatched carry-over is never dropped).
var mu sync.Mutex
var langsPerBatch [][]string
run := func(reqs []*batchRequest) {
seen := map[string]struct{}{}
var distinct []string
for _, r := range reqs {
if _, ok := seen[r.language]; !ok {
seen[r.language] = struct{}{}
distinct = append(distinct, r.language)
}
}
mu.Lock()
langsPerBatch = append(langsPerBatch, distinct)
mu.Unlock()
echoReply(reqs)
}
// Large window + size so the fill loop stays open across submits and the
// language constraint (not the timer) is what splits the batches.
b := newBatcher(16, 200*time.Millisecond, run)
stop := make(chan struct{})
go b.run(stop)
defer close(stop)
langs := []string{"en", "en", "de", "de", "en", "fr", "fr"}
const N = 7
var wg sync.WaitGroup
got := make([]string, N)
for i := 0; i < N; i++ {
wg.Add(1)
go func(i int) {
defer wg.Done()
rep := make(chan batchReply, 1)
b.submit <- &batchRequest{tag: string(rune('a' + i)), language: langs[i], reply: rep}
got[i] = (<-rep).json
}(i)
}
wg.Wait()
mu.Lock()
defer mu.Unlock()
// Invariant: every dispatched batch is single-language.
for _, distinct := range langsPerBatch {
Expect(len(distinct)).To(Equal(1), "a batch coalesced more than one language: %v", distinct)
}
// Liveness: every request got a reply (carry-over never stranded).
for i := 0; i < N; i++ {
Expect(got[i]).To(Equal(string(rune('a' + i))))
}
})
})

View File

@@ -48,6 +48,13 @@ var (
// side reads them as const float*/const int*.
CppTranscribePcmBatchJSON func(ctx uintptr, samplesConcat []float32, nSamples []int32, nClips int32, sampleRate int32, decoder int32) uintptr
// CppTranscribePcmBatchJSONLang is the multilingual variant of the batched
// JSON entry point: identical, plus a trailing target_lang. "" (the model
// default, "auto") is passed for non-prompt models, which ignore it; an
// unknown locale on a prompt model returns 0 and sets last_error. Present
// only in newer libparakeet.so; nil falls back to CppTranscribePcmBatchJSON.
CppTranscribePcmBatchJSONLang func(ctx uintptr, samplesConcat []float32, nSamples []int32, nClips int32, sampleRate int32, decoder int32, targetLang string) uintptr
// Cache-aware streaming (RNN-T) entry points. stream_begin returns 0 for
// non-streaming models. feed/finalize return a malloc'd char* (uintptr,
// freed via CppFreeString); feed writes 1 to *eouOut on an <EOU>/<EOB>.
@@ -55,6 +62,18 @@ var (
CppStreamFeed func(s uintptr, pcm []float32, nSamples int32, eouOut unsafe.Pointer) uintptr
CppStreamFinalize func(s uintptr) uintptr
CppStreamFree func(s uintptr)
// CppStreamBeginLang is the multilingual variant of stream_begin: identical,
// plus a trailing target_lang ("" means the model default). Present only in
// newer libparakeet.so; nil falls back to CppStreamBegin.
CppStreamBeginLang func(ctx uintptr, targetLang string) uintptr
// Streaming JSON variants (ABI v4): feed/finalize returning a malloc'd char*
// JSON document {text,eou,frame_sec,words} (uintptr, freed via CppFreeString)
// so streaming segments can carry per-word timestamps. Present only in newer
// libparakeet.so; nil falls back to the text-only CppStreamFeed/Finalize path.
CppStreamFeedJSON func(s uintptr, pcm []float32, nSamples int32) uintptr
CppStreamFinalizeJSON func(s uintptr) uintptr
)
// streamChunkSamples is how much 16 kHz mono PCM we hand to stream_feed per
@@ -72,9 +91,26 @@ const streamChunkSamples = 16000
//
// "start"/"end"/"t" are seconds; "conf" is confidence in (0,1].
type transcriptJSON struct {
Text string `json:"text"`
Words []transcriptWord `json:"words"`
Tokens []transcriptToken `json:"tokens"`
Text string `json:"text"`
FrameSec float64 `json:"frame_sec"`
Words []transcriptWord `json:"words"`
Tokens []transcriptToken `json:"tokens"`
}
// streamFeedJSON mirrors the document returned by
// parakeet_capi_stream_feed_json / parakeet_capi_stream_finalize_json (ABI v4):
//
// {"text":"...","eou":0,"frame_sec":0.080000,
// "words":[{"w":"...","start":0.480,"end":0.640,"conf":0.9100}, ...]}
//
// "text" is the newly-finalized text since the last call; "eou" is 1 when an
// <EOU>/<EOB> fired this feed; "words" are the words finalized this call with
// absolute (stream-relative) start/end seconds.
type streamFeedJSON struct {
Text string `json:"text"`
Eou int `json:"eou"`
FrameSec float64 `json:"frame_sec"`
Words []transcriptWord `json:"words"`
}
type transcriptWord struct {
@@ -103,6 +139,10 @@ type ParakeetCpp struct {
engineMu sync.Mutex // sole guard of the one C engine (dispatcher + streaming)
bat *batcher
batStop chan struct{}
// segmentGapFrames is NeMo's segment_gap_threshold in ENCODER FRAMES (model
// YAML option, default 0=off). When >0 it adds NeMo's silence-gap split on
// top of the punctuation split; converted to seconds via the JSON frame_sec.
segmentGapFrames int
}
// Load is the LocalAI gRPC entry point for LoadModel: it calls
@@ -132,6 +172,11 @@ func (p *ParakeetCpp) Load(opts *pb.ModelOptions) error {
if maxWaitMs < 0 {
maxWaitMs = 0
}
// NeMo's segment_gap_threshold (encoder frames, default 0=off). Off by
// default matches NeMo's default (punctuation-only segments); when set it
// additionally splits segments on inter-word silence (see transcriptResultFromDoc).
p.segmentGapFrames = optInt(opts, "segment_gap_threshold", 0)
if CppTranscribePcmBatchJSON != nil {
p.batStop = make(chan struct{})
p.bat = newBatcher(maxSize, time.Duration(maxWaitMs)*time.Millisecond, p.runBatch)
@@ -187,8 +232,19 @@ func (p *ParakeetCpp) runBatch(reqs []*batchRequest) {
if len(reqs) > 0 {
dec = reqs[0].decoder
}
// All requests in a batch share one language (the batcher coalesces only
// same-language requests), so any element's language describes the batch.
lang := ""
if len(reqs) > 0 {
lang = reqs[0].language
}
p.engineMu.Lock()
cstr := CppTranscribePcmBatchJSON(p.ctxPtr, concat, nSamples, int32(len(reqs)), 16000, dec)
var cstr uintptr
if CppTranscribePcmBatchJSONLang != nil {
cstr = CppTranscribePcmBatchJSONLang(p.ctxPtr, concat, nSamples, int32(len(reqs)), 16000, dec, lang)
} else {
cstr = CppTranscribePcmBatchJSON(p.ctxPtr, concat, nSamples, int32(len(reqs)), 16000, dec)
}
p.engineMu.Unlock()
if cstr == 0 {
err := fmt.Errorf("parakeet-cpp: batch transcribe failed: %s", CppLastError(p.ctxPtr))
@@ -226,8 +282,9 @@ func (p *ParakeetCpp) runBatch(reqs []*batchRequest) {
// OpenAI API, whose default is segment-level); token ids always populate
// Segment.Tokens.
//
// translate/diarize/prompt/temperature/language/threads are not applicable to
// parakeet and are ignored; streaming is handled by AudioTranscriptionStream
// translate/diarize/prompt/temperature/threads are not applicable to parakeet
// and are ignored; language is honored on the batched + streaming paths (see
// opts.GetLanguage() below); streaming is handled by AudioTranscriptionStream
// (L2).
func (p *ParakeetCpp) AudioTranscription(ctx context.Context, opts *pb.TranscriptRequest) (pb.TranscriptResult, error) {
if p.ctxPtr == 0 {
@@ -259,7 +316,7 @@ func (p *ParakeetCpp) AudioTranscription(ctx context.Context, opts *pb.Transcrip
if err := json.Unmarshal([]byte(raw), &doc); err != nil {
return pb.TranscriptResult{}, fmt.Errorf("parakeet-cpp: decode transcript json: %w", err)
}
return transcriptResultFromDoc(doc, opts), nil
return transcriptResultFromDoc(doc, opts, p.segmentGapFrames), nil
}
// Batched path: decode to PCM, submit to the batcher, wait for this request's
@@ -271,7 +328,7 @@ func (p *ParakeetCpp) AudioTranscription(ctx context.Context, opts *pb.Transcrip
}
rep := make(chan batchReply, 1)
select {
case p.bat.submit <- &batchRequest{pcm: pcm, decoder: 0, reply: rep}:
case p.bat.submit <- &batchRequest{pcm: pcm, decoder: 0, language: opts.GetLanguage(), reply: rep}:
case <-ctx.Done():
return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
}
@@ -288,34 +345,169 @@ func (p *ParakeetCpp) AudioTranscription(ctx context.Context, opts *pb.Transcrip
if err := json.Unmarshal([]byte(res.json), &doc); err != nil {
return pb.TranscriptResult{}, fmt.Errorf("parakeet-cpp: decode transcript json: %w", err)
}
return transcriptResultFromDoc(doc, opts), nil
return transcriptResultFromDoc(doc, opts, p.segmentGapFrames), nil
}
// segmentSeparators is NeMo's default segment_seperators (sentence-ending
// punctuation). Splitting on these matches NeMo's default segment timestamps.
var segmentSeparators = []rune{'.', '?', '!'}
// transcriptResultFromDoc maps a decoded transcriptJSON to a TranscriptResult,
// synthesising a single whole-clip segment and attaching word timings only when
// the caller requested word granularity. Shared by the batched and direct paths.
func transcriptResultFromDoc(doc transcriptJSON, opts *pb.TranscriptRequest) pb.TranscriptResult {
// grouping words into NeMo-faithful segments (see splitWordsIntoSegments). The
// optional gapFrames (NeMo's segment_gap_threshold, in encoder FRAMES; 0=off)
// additionally splits on inter-word silence; it is converted to a seconds gap
// with the document's frame_sec. Per-segment word timings are attached only when
// the caller requested word granularity; token ids populate each segment's
// Tokens by time-window membership. Shared by the batched and direct paths.
func transcriptResultFromDoc(doc transcriptJSON, opts *pb.TranscriptRequest, gapFrames int) pb.TranscriptResult {
text := strings.TrimSpace(doc.Text)
words := make([]*pb.TranscriptWord, 0, len(doc.Words))
for _, w := range doc.Words {
words = append(words, &pb.TranscriptWord{Start: secondsToNanos(w.Start), End: secondsToNanos(w.End), Text: w.W})
// Frame-unit gap threshold -> seconds (NeMo segment_gap_threshold). 0 = off.
gapSeconds := 0.0
if gapFrames > 0 {
if doc.FrameSec > 0 {
gapSeconds = float64(gapFrames) * doc.FrameSec
} else {
xlog.Warn("parakeet-cpp: segment_gap_threshold set but libparakeet.so " +
"did not report frame_sec; falling back to punctuation-only segments")
}
}
tokens := make([]int32, 0, len(doc.Tokens))
for _, t := range doc.Tokens {
tokens = append(tokens, t.ID)
groups := splitWordsIntoSegments(doc.Words, segmentSeparators, gapSeconds)
if len(groups) == 0 {
// No words (edge case): single whole-clip text segment.
return pb.TranscriptResult{
Text: text,
Segments: []*pb.TranscriptSegment{{Id: 0, Text: text}},
}
}
var segStart, segEnd int64
if len(words) > 0 {
segStart = words[0].Start
segEnd = words[len(words)-1].End
wantWords := wordsRequested(opts.TimestampGranularities)
segments := make([]*pb.TranscriptSegment, 0, len(groups))
for id, group := range groups {
parts := make([]string, len(group))
for i, gw := range group {
parts[i] = gw.W
}
seg := &pb.TranscriptSegment{
Id: int32(id),
Start: secondsToNanos(group[0].Start),
End: secondsToNanos(group[len(group)-1].End),
Text: strings.TrimSpace(strings.Join(parts, " ")),
Tokens: tokensInWindow(doc.Tokens, group[0].Start, group[len(group)-1].End),
}
if wantWords {
ws := make([]*pb.TranscriptWord, len(group))
for i, gw := range group {
ws[i] = &pb.TranscriptWord{Start: secondsToNanos(gw.Start), End: secondsToNanos(gw.End), Text: gw.W}
}
seg.Words = ws
}
segments = append(segments, seg)
}
seg := &pb.TranscriptSegment{Id: 0, Start: segStart, End: segEnd, Text: text, Tokens: tokens}
if wordsRequested(opts.TimestampGranularities) {
seg.Words = words
}
return pb.TranscriptResult{Text: text, Segments: []*pb.TranscriptSegment{seg}}
return pb.TranscriptResult{Text: text, Segments: segments}
}
// splitWordsIntoSegments groups words into segments exactly as NeMo's
// get_segment_offsets does (nemo/collections/asr/parts/utils/timestamp_utils.py).
// Walking the words, it closes a segment when (1) the gap rule is enabled
// (gapSeconds > 0) and the segment already has words and the gap from the
// previous word's end to this word's start is >= gapSeconds - the current word
// then STARTS a new segment - or, checked only when the gap rule did not apply
// (NeMo's elif), (2) the word ends with (or is) a separator, which closes the
// segment INCLUDING that word. Trailing words flush into a final segment.
// gapSeconds <= 0 disables the gap rule, matching NeMo's default
// segment_gap_threshold=None (punctuation-only segments).
func splitWordsIntoSegments(words []transcriptWord, separators []rune, gapSeconds float64) [][]transcriptWord {
var segments [][]transcriptWord
var cur []transcriptWord
for i, word := range words {
gapActive := gapSeconds > 0 && len(cur) > 0
if gapActive && (word.Start-words[i-1].End) >= gapSeconds {
segments = append(segments, cur)
cur = []transcriptWord{word}
continue
}
if !gapActive && endsWithSeparator(word.W, separators) {
cur = append(cur, word)
segments = append(segments, cur)
cur = nil
continue
}
cur = append(cur, word)
}
if len(cur) > 0 {
segments = append(segments, cur)
}
return segments
}
// endsWithSeparator reports whether w's last rune is in separators (matching
// NeMo's `word[-1] in delims or word in delims`).
func endsWithSeparator(w string, separators []rune) bool {
r := []rune(strings.TrimSpace(w))
if len(r) == 0 {
return false
}
last := r[len(r)-1]
for _, s := range separators {
if last == s {
return true
}
}
return false
}
// tokensInWindow returns the ids of tokens whose timestamp t falls in
// [start, end] (inclusive), assigning each token to the segment that spans its
// time. The last segment's end is the last word end, so the final token is
// included.
func tokensInWindow(tokens []transcriptToken, start, end float64) []int32 {
var ids []int32
for _, t := range tokens {
if t.T >= start && t.T <= end {
ids = append(ids, t.ID)
}
}
return ids
}
// streamSegmenter accumulates streaming words into per-utterance segments. EOU
// is the model's own utterance boundary; each closed segment takes its start/end
// from its first/last accumulated word.
type streamSegmenter struct {
segs []*pb.TranscriptSegment
cur []transcriptWord
nextID int32
}
func (s *streamSegmenter) add(doc streamFeedJSON) {
s.cur = append(s.cur, doc.Words...)
if doc.Eou != 0 {
s.flush()
}
}
func (s *streamSegmenter) flush() {
if len(s.cur) == 0 {
return
}
parts := make([]string, len(s.cur))
for i, w := range s.cur {
parts[i] = w.W
}
s.segs = append(s.segs, &pb.TranscriptSegment{
Id: s.nextID,
Start: secondsToNanos(s.cur[0].Start),
End: secondsToNanos(s.cur[len(s.cur)-1].End),
Text: strings.TrimSpace(strings.Join(parts, " ")),
})
s.nextID++
s.cur = nil
}
func (s *streamSegmenter) segments() []*pb.TranscriptSegment { return s.segs }
// wordsRequested reports whether the caller asked for word-level timestamps.
// The OpenAI transcription API gates word timings behind
// timestamp_granularities[] containing "word" and defaults to segment-level
@@ -361,7 +553,12 @@ func (p *ParakeetCpp) AudioTranscriptionStream(ctx context.Context, opts *pb.Tra
return status.Error(codes.Canceled, "transcription cancelled")
}
stream := CppStreamBegin(p.ctxPtr)
var stream uintptr
if CppStreamBeginLang != nil {
stream = CppStreamBeginLang(p.ctxPtr, opts.GetLanguage())
} else {
stream = CppStreamBegin(p.ctxPtr)
}
if stream == 0 {
// Not a cache-aware streaming model: run a normal offline
// transcription and emit it as one delta + a closing final result.
@@ -390,6 +587,14 @@ func (p *ParakeetCpp) AudioTranscriptionStream(ctx context.Context, opts *pb.Tra
return err
}
// ABI v4: when the streaming JSON entry points are present, drive them so the
// per-utterance segments carry per-word start/end timestamps. Falls through to
// the text-only loop below against an older libparakeet.so. Runs under the
// engineMu already held above.
if CppStreamFeedJSON != nil {
return p.streamJSON(ctx, stream, data, duration, results)
}
var (
full strings.Builder
segText strings.Builder
@@ -466,6 +671,71 @@ func (p *ParakeetCpp) AudioTranscriptionStream(ctx context.Context, opts *pb.Tra
return nil
}
// streamJSON drives the ABI v4 streaming JSON entry points: each feed/finalize
// returns a {text,eou,frame_sec,words} document. The newly-finalized text is
// emitted as a delta (unchanged streaming contract) while words are accumulated
// into per-utterance segments (closed on EOU) so the closing FinalResult carries
// timestamped segments. Runs under engineMu (already held by the caller).
func (p *ParakeetCpp) streamJSON(ctx context.Context, stream uintptr, data []float32,
duration float32, results chan *pb.TranscriptStreamResponse) error {
var (
full strings.Builder
seg streamSegmenter
)
// consume frees the malloc'd char* (a 0 return is an error), parses the JSON,
// emits the delta, and routes words through the segmenter.
consume := func(ret uintptr) error {
if ret == 0 {
msg := CppLastError(p.ctxPtr)
if msg == "" {
msg = "unknown error"
}
return fmt.Errorf("parakeet-cpp: stream feed/finalize failed: %s", msg)
}
raw := goStringFromCPtr(ret)
CppFreeString(ret)
var doc streamFeedJSON
if err := json.Unmarshal([]byte(raw), &doc); err != nil {
return fmt.Errorf("parakeet-cpp: decode stream json: %w", err)
}
if doc.Text != "" {
full.WriteString(doc.Text)
results <- &pb.TranscriptStreamResponse{Delta: doc.Text}
}
seg.add(doc)
return nil
}
for off := 0; off < len(data); off += streamChunkSamples {
if err := ctx.Err(); err != nil {
return status.Error(codes.Canceled, "transcription cancelled")
}
end := min(off+streamChunkSamples, len(data))
chunk := data[off:end]
if err := consume(CppStreamFeedJSON(stream, chunk, int32(len(chunk)))); err != nil {
return err
}
}
if err := consume(CppStreamFinalizeJSON(stream)); err != nil {
return err
}
seg.flush() // close any trailing utterance that never saw an EOU
text := strings.TrimSpace(full.String())
segments := seg.segments()
if len(segments) == 0 && text != "" {
segments = append(segments, &pb.TranscriptSegment{Id: 0, Text: text})
}
results <- &pb.TranscriptStreamResponse{
FinalResult: &pb.TranscriptResult{
Text: text,
Segments: segments,
Duration: duration,
},
}
return nil
}
// decodeWavMono16k converts any input audio to 16 kHz mono PCM and returns the
// float samples plus the clip duration in seconds. Mirrors the whisper
// backend: utils.AudioToWav (ffmpeg) normalises rate/channels, go-audio

View File

@@ -53,6 +53,10 @@ func ensureLibLoaded() {
purego.RegisterLibFunc(&CppStreamFeed, lib, "parakeet_capi_stream_feed")
purego.RegisterLibFunc(&CppStreamFinalize, lib, "parakeet_capi_stream_finalize")
purego.RegisterLibFunc(&CppStreamFree, lib, "parakeet_capi_stream_free")
if sym, err := purego.Dlsym(lib, "parakeet_capi_stream_feed_json"); err == nil && sym != 0 {
purego.RegisterLibFunc(&CppStreamFeedJSON, lib, "parakeet_capi_stream_feed_json")
purego.RegisterLibFunc(&CppStreamFinalizeJSON, lib, "parakeet_capi_stream_finalize_json")
}
purego.RegisterLibFunc(&CppFreeString, lib, "parakeet_capi_free_string")
purego.RegisterLibFunc(&CppLastError, lib, "parakeet_capi_last_error")
})
@@ -107,13 +111,22 @@ var _ = Describe("ParakeetCpp", func() {
Expect(err).ToNot(HaveOccurred())
Expect(strings.TrimSpace(res.Text)).ToNot(BeEmpty(),
"expected non-empty transcript for %s", audioPath)
Expect(res.Segments).To(HaveLen(1),
"synthesises a single whole-clip segment")
Expect(res.Segments[0].Text).To(Equal(res.Text),
"single segment text must equal the top-level text")
// Default (no granularities) is segment-level: no per-word timings.
Expect(res.Segments[0].Words).To(BeEmpty(),
"word timings are opt-in via timestamp_granularities")
// NeMo-faithful segmentation: one or more punctuation-delimited
// segments, each with text and a monotonically-advancing time span.
Expect(res.Segments).ToNot(BeEmpty(), "expected at least one segment")
var prevEnd int64
for i, seg := range res.Segments {
Expect(strings.TrimSpace(seg.Text)).ToNot(BeEmpty(),
"segment %d must have text", i)
Expect(seg.End).To(BeNumerically(">=", seg.Start),
"segment %d end must not precede its start", i)
Expect(seg.Start).To(BeNumerically(">=", prevEnd),
"segments must be in time order")
prevEnd = seg.End
// Default (no granularities) is segment-level: no per-word timings.
Expect(seg.Words).To(BeEmpty(),
"word timings are opt-in via timestamp_granularities")
}
})
It("emits word-level timestamps when granularity=word", func() {
@@ -129,15 +142,28 @@ var _ = Describe("ParakeetCpp", func() {
TimestampGranularities: []string{"word"},
})
Expect(err).ToNot(HaveOccurred())
Expect(res.Segments).To(HaveLen(1))
seg := res.Segments[0]
Expect(seg.Words).ToNot(BeEmpty(),
"expected per-word timestamps with granularity=word")
// Monotonic, non-negative timings spanning the segment.
Expect(seg.Words[0].Start).To(BeNumerically(">=", int64(0)))
Expect(seg.End).To(BeNumerically(">=", seg.Start))
Expect(seg.Words[len(seg.Words)-1].End).To(Equal(seg.End),
"segment end tracks the last word")
Expect(res.Segments).ToNot(BeEmpty())
// With word granularity every segment carries its own words, and each
// segment's span tracks its first/last word; word starts advance
// monotonically across the whole transcript.
totalWords := 0
var prevStart int64 = -1
for i, seg := range res.Segments {
Expect(seg.Words).ToNot(BeEmpty(),
"segment %d must carry per-word timestamps with granularity=word", i)
Expect(seg.Start).To(Equal(seg.Words[0].Start),
"segment %d start tracks its first word", i)
Expect(seg.End).To(Equal(seg.Words[len(seg.Words)-1].End),
"segment %d end tracks its last word", i)
for _, w := range seg.Words {
Expect(w.End).To(BeNumerically(">=", w.Start))
Expect(w.Start).To(BeNumerically(">=", prevStart))
prevStart = w.Start
totalWords++
}
}
Expect(totalWords).To(BeNumerically(">", 0))
Expect(res.Segments[0].Words[0].Start).To(BeNumerically(">=", int64(0)))
})
})

View File

@@ -65,6 +65,25 @@ func main() {
purego.RegisterLibFunc(&CppTranscribePcmBatchJSON, lib, "parakeet_capi_transcribe_pcm_batch_json")
}
// Per-request language variants (multilingual nemotron). Same probe pattern:
// present only in libparakeet.so built with multilingual support, so the
// backend still loads against an older library and falls back to the
// non-lang batched + streaming entry points (model default / "auto").
if sym, err := purego.Dlsym(lib, "parakeet_capi_transcribe_pcm_batch_json_lang"); err == nil && sym != 0 {
purego.RegisterLibFunc(&CppTranscribePcmBatchJSONLang, lib, "parakeet_capi_transcribe_pcm_batch_json_lang")
}
if sym, err := purego.Dlsym(lib, "parakeet_capi_stream_begin_lang"); err == nil && sym != 0 {
purego.RegisterLibFunc(&CppStreamBeginLang, lib, "parakeet_capi_stream_begin_lang")
}
// Streaming JSON entry points (ABI v4): surface per-word timestamps on the
// streaming path. Same probe pattern; absent in older libparakeet.so, where
// the backend falls back to the text-only streaming feed.
if sym, err := purego.Dlsym(lib, "parakeet_capi_stream_feed_json"); err == nil && sym != 0 {
purego.RegisterLibFunc(&CppStreamFeedJSON, lib, "parakeet_capi_stream_feed_json")
purego.RegisterLibFunc(&CppStreamFinalizeJSON, lib, "parakeet_capi_stream_finalize_json")
}
fmt.Fprintf(os.Stderr, "[parakeet-cpp] ABI=%d\n", CppAbiVersion())
flag.Parse()

View File

@@ -0,0 +1,127 @@
package main
import (
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func tw(text string, start, end float64) transcriptWord {
return transcriptWord{W: text, Start: start, End: end}
}
var _ = Describe("splitWordsIntoSegments (NeMo get_segment_offsets parity)", func() {
seps := []rune{'.', '?', '!'}
It("splits on sentence-ending punctuation, including the delimiter word", func() {
words := []transcriptWord{tw("hello", 0, 0.4), tw("world.", 0.4, 0.8), tw("bye", 1.0, 1.3)}
segs := splitWordsIntoSegments(words, seps, 0)
Expect(segs).To(HaveLen(2))
Expect(segs[0]).To(HaveLen(2))
Expect(segs[0][1].W).To(Equal("world."))
Expect(segs[1]).To(HaveLen(1))
Expect(segs[1][0].W).To(Equal("bye"))
})
It("keeps a single segment with no terminal punctuation and gap off", func() {
words := []transcriptWord{tw("a", 0, 0.2), tw("b", 0.2, 0.4), tw("c", 5.0, 5.2)}
segs := splitWordsIntoSegments(words, seps, 0)
Expect(segs).To(HaveLen(1))
})
It("splits on the gap rule when enabled, the gapped word starting the next segment", func() {
words := []transcriptWord{tw("a", 0, 0.2), tw("b", 0.2, 0.4), tw("c", 5.0, 5.2)}
segs := splitWordsIntoSegments(words, seps, 1.0) // c is 4.6s after b
Expect(segs).To(HaveLen(2))
Expect(segs[0]).To(HaveLen(2)) // a b
Expect(segs[1]).To(HaveLen(1)) // c
Expect(segs[1][0].W).To(Equal("c"))
})
It("checks the gap rule before punctuation (NeMo elif order)", func() {
// "b." would terminate, but c is far after it -> gap closes [a b.] at b.
words := []transcriptWord{tw("a", 0, 0.2), tw("b.", 0.2, 0.4), tw("c", 9.0, 9.2)}
segs := splitWordsIntoSegments(words, seps, 1.0)
Expect(segs).To(HaveLen(2))
Expect(segs[0]).To(HaveLen(2))
Expect(segs[1][0].W).To(Equal("c"))
})
It("still splits on punctuation when the gap rule is enabled but does not fire", func() {
words := []transcriptWord{tw("hi.", 0, 0.4), tw("bye", 0.4, 0.8)}
segs := splitWordsIntoSegments(words, seps, 5.0) // gap never reached
Expect(segs).To(HaveLen(2))
Expect(segs[0][0].W).To(Equal("hi."))
})
It("returns nothing for empty input", func() {
Expect(splitWordsIntoSegments(nil, seps, 0)).To(BeEmpty())
})
})
var _ = Describe("transcriptResultFromDoc (multi-segment)", func() {
doc := transcriptJSON{
Text: "hello world. bye now",
FrameSec: 0.08,
Words: []transcriptWord{
{W: "hello", Start: 0.0, End: 0.4},
{W: "world.", Start: 0.4, End: 0.8},
{W: "bye", Start: 1.0, End: 1.3},
{W: "now", Start: 1.3, End: 1.6},
},
Tokens: []transcriptToken{{ID: 1, T: 0.1}, {ID: 2, T: 0.5}, {ID: 3, T: 1.1}, {ID: 4, T: 1.4}},
}
It("emits one segment per punctuation-delimited group with start/end", func() {
res := transcriptResultFromDoc(doc, &pb.TranscriptRequest{}, 0)
Expect(res.Segments).To(HaveLen(2))
Expect(res.Segments[0].Text).To(Equal("hello world."))
Expect(res.Segments[0].Start).To(Equal(int64(0)))
Expect(res.Segments[0].End).To(Equal(secondsToNanos(0.8)))
Expect(res.Segments[1].Text).To(Equal("bye now"))
Expect(res.Segments[1].Start).To(Equal(secondsToNanos(1.0)))
Expect(res.Segments[1].Id).To(Equal(int32(1)))
})
It("assigns tokens to the segment whose time window contains them", func() {
res := transcriptResultFromDoc(doc, &pb.TranscriptRequest{}, 0)
Expect(res.Segments[0].Tokens).To(Equal([]int32{1, 2}))
Expect(res.Segments[1].Tokens).To(Equal([]int32{3, 4}))
})
It("attaches per-segment words only when word granularity requested", func() {
plain := transcriptResultFromDoc(doc, &pb.TranscriptRequest{}, 0)
Expect(plain.Segments[0].Words).To(BeEmpty())
withWords := transcriptResultFromDoc(doc, &pb.TranscriptRequest{TimestampGranularities: []string{"word"}}, 0)
Expect(withWords.Segments[0].Words).To(HaveLen(2))
})
It("falls back to a single text segment when there are no words", func() {
res := transcriptResultFromDoc(transcriptJSON{Text: "hi"}, &pb.TranscriptRequest{}, 0)
Expect(res.Segments).To(HaveLen(1))
Expect(res.Segments[0].Text).To(Equal("hi"))
})
})
var _ = Describe("streaming segment assembly", func() {
It("closes a segment with start/end from its words on EOU", func() {
acc := &streamSegmenter{}
acc.add(streamFeedJSON{Text: "hello world", Eou: 1, Words: []transcriptWord{
{W: "hello", Start: 0.0, End: 0.4}, {W: "world", Start: 0.4, End: 0.9},
}})
segs := acc.segments()
Expect(segs).To(HaveLen(1))
Expect(segs[0].Text).To(Equal("hello world"))
Expect(segs[0].Start).To(Equal(int64(0)))
Expect(segs[0].End).To(Equal(secondsToNanos(0.9)))
})
It("buffers words across feeds until EOU", func() {
acc := &streamSegmenter{}
acc.add(streamFeedJSON{Text: "hi", Eou: 0, Words: []transcriptWord{{W: "hi", Start: 0, End: 0.3}}})
Expect(acc.segments()).To(BeEmpty())
acc.add(streamFeedJSON{Text: "there", Eou: 1, Words: []transcriptWord{{W: "there", Start: 0.3, End: 0.7}}})
Expect(acc.segments()).To(HaveLen(1))
Expect(acc.segments()[0].Text).To(Equal("hi there"))
})
})

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# stablediffusion.cpp (ggml)
STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
STABLEDIFFUSION_GGML_VERSION?=1f9ee88e09c258053fa59d5e05e23dfb10fa0b13
STABLEDIFFUSION_GGML_VERSION?=19bdfe22d255d5b4dff39d449318b9bc5ea2317f
CMAKE_ARGS+=-DGGML_MAX_NAME=128

View File

@@ -386,6 +386,7 @@ int load_model(const char *model, char *model_path, char* options[], int threads
const char *llm_vision_path = "";
const char *diffusion_model_path = stableDiffusionModel;
const char *high_noise_diffusion_model_path = "";
const char *uncond_diffusion_model_path = "";
const char *taesd_path = "";
const char *control_net_path = "";
const char *embedding_dir = "";
@@ -472,6 +473,7 @@ int load_model(const char *model, char *model_path, char* options[], int threads
if (!strcmp(optname, "llm_vision_path")) llm_vision_path = strdup(optval);
if (!strcmp(optname, "diffusion_model_path")) diffusion_model_path = strdup(optval);
if (!strcmp(optname, "high_noise_diffusion_model_path")) high_noise_diffusion_model_path = strdup(optval);
if (!strcmp(optname, "uncond_diffusion_model_path")) uncond_diffusion_model_path = strdup(optval);
if (!strcmp(optname, "taesd_path")) taesd_path = strdup(optval);
if (!strcmp(optname, "control_net_path")) control_net_path = strdup(optval);
if (!strcmp(optname, "embedding_dir")) {
@@ -571,6 +573,7 @@ int load_model(const char *model, char *model_path, char* options[], int threads
ctx_params.llm_vision_path = llm_vision_path;
ctx_params.diffusion_model_path = diffusion_model_path;
ctx_params.high_noise_diffusion_model_path = high_noise_diffusion_model_path;
ctx_params.uncond_diffusion_model_path = uncond_diffusion_model_path;
ctx_params.vae_path = vae_path;
ctx_params.audio_vae_path = audio_vae_path;
ctx_params.embeddings_connectors_path = embeddings_connectors_path;

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# whisper.cpp version
WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
WHISPER_CPP_VERSION?=99613cb720b65036237d44b52f753b51f75c2797
WHISPER_CPP_VERSION?=df7638d8229a243af8a4b5a8ae557e0d74e0a0ae
SO_TARGET?=libgowhisper.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF

View File

@@ -1,4 +1,4 @@
transformers
accelerate
torch==2.7.1+xpu
torch==2.7.1
rerankers[transformers]

View File

@@ -1,4 +1,4 @@
transformers
accelerate
torch==2.7.1+xpu
torch==2.7.1
rerankers[transformers]

View File

@@ -1,5 +1,5 @@
--extra-index-url https://download.pytorch.org/whl/cu130
transformers
accelerate
torch==2.7.1+xpu
torch==2.9.1
rerankers[transformers]

View File

@@ -1,5 +1,5 @@
--extra-index-url https://download.pytorch.org/whl/rocm7.0
transformers
accelerate
torch==2.7.1+xpu
torch==2.10.0+rocm7.0
rerankers[transformers]

View File

@@ -1,4 +1,4 @@
torch==2.7.1+xpu
torch==2.7.1
transformers
accelerate
rerankers[transformers]

View File

@@ -23,9 +23,9 @@ import (
"github.com/mudler/LocalAI/core/services/routing/pii"
"github.com/mudler/LocalAI/core/services/routing/router"
"github.com/mudler/LocalAI/core/services/storage"
"github.com/mudler/LocalAI/pkg/signals"
coreStartup "github.com/mudler/LocalAI/core/startup"
"github.com/mudler/LocalAI/internal"
"github.com/mudler/LocalAI/pkg/signals"
"github.com/mudler/LocalAI/pkg/vram"
"github.com/mudler/LocalAI/pkg/model"
@@ -308,10 +308,31 @@ func New(opts ...config.AppOption) (*Application, error) {
application.galleryService.SetNATSClient(distSvc.Nats)
if distSvc.DistStores != nil && distSvc.DistStores.Gallery != nil {
// Clean up stale in-progress operations from previous crashed instances
if err := distSvc.DistStores.Gallery.CleanStale(30 * time.Minute); err != nil {
if _, err := distSvc.DistStores.Gallery.CleanStale(30 * time.Minute); err != nil {
xlog.Warn("Failed to clean stale gallery operations", "error", err)
}
application.galleryService.SetGalleryStore(distSvc.DistStores.Gallery)
// Reap stale ops periodically, not just at boot: an op orphaned by
// a replica that died mid-install (its foreground handler goroutine
// gone) would otherwise linger "processing" in the UI until the next
// restart. 30m matches the install/upgrade ceiling so a genuinely
// slow op is never reaped out from under itself.
gsvc := application.galleryService
go func() {
ticker := time.NewTicker(15 * time.Minute)
defer ticker.Stop()
for {
select {
case <-options.Context.Done():
return
case <-ticker.C:
if _, err := gsvc.ReapStaleOperations(30 * time.Minute); err != nil {
xlog.Warn("Failed to reap stale gallery operations", "error", err)
}
}
}
}()
}
// Hydrate from the store first so the wildcard subscriber finds an
// already-populated statuses map for any operations still in flight

View File

@@ -214,7 +214,9 @@ func (uc *UpgradeChecker) runCheck(ctx context.Context) {
"from", info.InstalledVersion, "to", info.AvailableVersion)
var err error
if bm != nil {
err = bm.UpgradeBackend(ctx, name, nil)
// Background auto-upgrade: no live admin watching a progress bar,
// so opID is empty and the distributed path skips progress streaming.
err = bm.UpgradeBackend(ctx, "", name, nil)
} else {
err = gallery.UpgradeBackend(ctx, uc.systemState, uc.modelLoader,
uc.galleries, name, nil, uc.appConfig.RequireBackendIntegrity)

30
core/cli/chat/chat.go Normal file
View File

@@ -0,0 +1,30 @@
package chat
import (
"context"
"io"
"strings"
)
type Options struct {
Model string
BaseURL string
APIKey string
In io.Reader
Out io.Writer
}
func Run(ctx context.Context, opts Options) error {
if opts.In == nil {
opts.In = strings.NewReader("")
}
if opts.Out == nil {
opts.Out = io.Discard
}
session, err := newChatSession(ctx, newLocalAIChatClient(opts.BaseURL, opts.APIKey), opts.Model)
if err != nil {
return err
}
return runTerminalChat(ctx, session, opts.In, opts.Out)
}

View File

@@ -0,0 +1,13 @@
package chat
import (
"testing"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func TestChat(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "Chat Suite")
}

172
core/cli/chat/chat_test.go Normal file
View File

@@ -0,0 +1,172 @@
package chat
import (
"bytes"
"encoding/json"
"fmt"
"io"
"net/http"
"net/http/httptest"
"strings"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("Run chat", func() {
It("streams a single chat response", func() {
var capturedModel string
var capturedAuth string
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if r.URL.Path == "/v1/models" {
w.Header().Set("Content-Type", "application/json")
writeResponse(w, `{"object":"list","data":[{"id":"test-model","object":"model"}]}`)
return
}
Expect(r.URL.Path).To(Equal("/v1/chat/completions"))
capturedAuth = r.Header.Get("Authorization")
var body struct {
Model string `json:"model"`
Messages []struct {
Role string `json:"role"`
Content string `json:"content"`
} `json:"messages"`
}
Expect(json.NewDecoder(r.Body).Decode(&body)).To(Succeed())
capturedModel = body.Model
Expect(body.Messages).To(HaveLen(1))
Expect(body.Messages[0].Role).To(Equal("user"))
Expect(body.Messages[0].Content).To(Equal("hello"))
w.Header().Set("Content-Type", "text/event-stream")
writeResponse(w, "data: {\"choices\":[{\"index\":0,\"delta\":{\"content\":\"hi\"}}]}\n\n")
writeResponse(w, "data: {\"choices\":[{\"index\":0,\"delta\":{\"content\":\"!\"}}]}\n\n")
writeResponse(w, "data: [DONE]\n\n")
}))
defer server.Close()
var out bytes.Buffer
err := Run(GinkgoT().Context(), Options{
Model: "test-model",
BaseURL: server.URL + "/v1",
APIKey: "secret",
In: strings.NewReader("hello\n/exit\n"),
Out: &out,
})
Expect(err).ToNot(HaveOccurred())
Expect(capturedModel).To(Equal("test-model"))
Expect(capturedAuth).To(Equal("Bearer secret"))
Expect(out.String()).To(ContainSubstring("assistant: hi!"))
Expect(out.String()).To(ContainSubstring("bye"))
})
It("auto-selects the only available model", func() {
server := chatTestServer([]string{"solo"}, nil)
defer server.Close()
var out bytes.Buffer
err := Run(GinkgoT().Context(), Options{
BaseURL: server.URL + "/v1",
In: strings.NewReader("/exit\n"),
Out: &out,
})
Expect(err).ToNot(HaveOccurred())
Expect(out.String()).To(ContainSubstring("LocalAI chat (solo)"))
})
It("returns an actionable error when no models are installed", func() {
server := chatTestServer(nil, nil)
defer server.Close()
err := Run(GinkgoT().Context(), Options{
BaseURL: server.URL + "/v1",
In: strings.NewReader(""),
})
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("no chat models are installed"))
Expect(err.Error()).To(ContainSubstring("local-ai models install <model>"))
})
It("returns an actionable error when multiple models are available without a selection", func() {
server := chatTestServer([]string{"alpha", "beta"}, nil)
defer server.Close()
err := Run(GinkgoT().Context(), Options{
BaseURL: server.URL + "/v1",
In: strings.NewReader(""),
})
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("multiple models are available"))
Expect(err.Error()).To(ContainSubstring("--model"))
Expect(err.Error()).To(ContainSubstring("alpha"))
Expect(err.Error()).To(ContainSubstring("beta"))
})
It("lists and switches models inside the chat", func() {
requestedModels := []string{}
server := chatTestServer([]string{"alpha", "beta"}, func(model string) {
requestedModels = append(requestedModels, model)
})
defer server.Close()
var out bytes.Buffer
err := Run(GinkgoT().Context(), Options{
Model: "alpha",
BaseURL: server.URL + "/v1",
In: strings.NewReader("/models\n/model beta\nhello\n/exit\n"),
Out: &out,
})
Expect(err).ToNot(HaveOccurred())
Expect(out.String()).To(ContainSubstring("* alpha"))
Expect(out.String()).To(ContainSubstring(" beta"))
Expect(out.String()).To(ContainSubstring("switched to beta; conversation cleared"))
Expect(requestedModels).To(Equal([]string{"beta"}))
})
})
func chatTestServer(models []string, onChat func(model string)) *httptest.Server {
return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
switch r.URL.Path {
case "/v1/models":
w.Header().Set("Content-Type", "application/json")
writeResponse(w, `{"object":"list","data":[`)
for i, model := range models {
if i > 0 {
writeResponse(w, ",")
}
writeResponsef(w, `{"id":%q,"object":"model"}`, model)
}
writeResponse(w, `]}`)
case "/v1/chat/completions":
var body struct {
Model string `json:"model"`
}
Expect(json.NewDecoder(r.Body).Decode(&body)).To(Succeed())
if onChat != nil {
onChat(body.Model)
}
w.Header().Set("Content-Type", "text/event-stream")
writeResponse(w, "data: {\"choices\":[{\"index\":0,\"delta\":{\"content\":\"ok\"}}]}\n\n")
writeResponse(w, "data: [DONE]\n\n")
default:
w.WriteHeader(http.StatusNotFound)
}
}))
}
func writeResponse(w io.Writer, text string) {
_, err := fmt.Fprint(w, text)
Expect(err).ToNot(HaveOccurred())
}
func writeResponsef(w io.Writer, format string, args ...any) {
_, err := fmt.Fprintf(w, format, args...)
Expect(err).ToNot(HaveOccurred())
}

114
core/cli/chat/client.go Normal file
View File

@@ -0,0 +1,114 @@
package chat
import (
"context"
"errors"
"fmt"
"io"
"sort"
"strings"
openai "github.com/sashabaranov/go-openai"
)
type chatClient interface {
ListModels(ctx context.Context) ([]string, error)
StreamChat(ctx context.Context, model string, messages []chatMessage, out io.Writer) (string, error)
}
type localAIChatClient struct {
client *openai.Client
}
func newLocalAIChatClient(baseURL string, apiKey string) *localAIChatClient {
cfg := openai.DefaultConfig(apiKey)
cfg.BaseURL = baseURL
return &localAIChatClient{client: openai.NewClientWithConfig(cfg)}
}
func (c *localAIChatClient) ListModels(ctx context.Context) ([]string, error) {
resp, err := c.client.ListModels(ctx)
if err != nil {
return nil, err
}
models := make([]string, 0, len(resp.Models))
for _, model := range resp.Models {
if model.ID != "" {
models = append(models, model.ID)
}
}
sort.Strings(models)
return models, nil
}
func (c *localAIChatClient) StreamChat(ctx context.Context, model string, messages []chatMessage, out io.Writer) (string, error) {
stream, err := c.client.CreateChatCompletionStream(ctx, openai.ChatCompletionRequest{
Model: model,
Messages: openAIChatMessages(messages),
})
if err != nil {
return "", friendlyChatError(err, model)
}
defer func() {
_ = stream.Close()
}()
var answer strings.Builder
for {
resp, err := stream.Recv()
if errors.Is(err, io.EOF) {
break
}
if err != nil {
return answer.String(), friendlyChatError(err, model)
}
if len(resp.Choices) == 0 {
continue
}
token := resp.Choices[0].Delta.Content
if token == "" {
continue
}
answer.WriteString(token)
if _, err := fmt.Fprint(out, token); err != nil {
return answer.String(), err
}
}
return answer.String(), nil
}
func openAIChatMessages(messages []chatMessage) []openai.ChatCompletionMessage {
converted := make([]openai.ChatCompletionMessage, len(messages))
for i, message := range messages {
converted[i] = openai.ChatCompletionMessage{
Role: message.Role,
Content: message.Content,
}
}
return converted
}
func friendlyChatError(err error, model string) error {
var apiErr *openai.APIError
if errors.As(err, &apiErr) {
switch apiErr.HTTPStatusCode {
case 404:
return fmt.Errorf("model %q is not available. Run `local-ai models list`, install a model with `local-ai models install <model>`, or switch with `/model <name>`", model)
case 403:
return fmt.Errorf("model %q is disabled. Enable it from LocalAI settings or choose another model with `/model <name>`", model)
}
if apiErr.Message != "" {
return errors.New(apiErr.Message)
}
}
msg := err.Error()
if strings.Contains(msg, "model") && strings.Contains(msg, "not found") {
return fmt.Errorf("model %q is not available. Run `local-ai models list`, install a model with `local-ai models install <model>`, or switch with `/model <name>`", model)
}
return err
}

17
core/cli/chat/models.go Normal file
View File

@@ -0,0 +1,17 @@
package chat
import "strings"
func formatChatModelList(models []string, current string) string {
var b strings.Builder
for _, model := range models {
prefix := " "
if model == current {
prefix = "* "
}
b.WriteString(prefix)
b.WriteString(model)
b.WriteByte('\n')
}
return b.String()
}

120
core/cli/chat/session.go Normal file
View File

@@ -0,0 +1,120 @@
package chat
import (
"context"
"errors"
"fmt"
"io"
"strings"
)
const (
chatRoleUser = "user"
chatRoleAssistant = "assistant"
)
type chatMessage struct {
Role string
Content string
}
type chatSession struct {
client chatClient
model string
models []string
messages []chatMessage
}
func newChatSession(ctx context.Context, client chatClient, requestedModel string) (*chatSession, error) {
models, err := client.ListModels(ctx)
if err != nil {
return nil, fmt.Errorf("list models: %w", err)
}
model, err := resolveChatModel(requestedModel, models)
if err != nil {
return nil, err
}
return &chatSession{
client: client,
model: model,
models: models,
}, nil
}
func (s *chatSession) CurrentModel() string {
return s.model
}
func (s *chatSession) Models() []string {
models := make([]string, len(s.models))
copy(models, s.models)
return models
}
func (s *chatSession) Clear() {
s.messages = nil
}
func (s *chatSession) SwitchModel(model string) error {
if !modelExists(s.models, model) {
return fmt.Errorf("model %q is not available. Use /models to see installed models", model)
}
s.model = model
s.Clear()
return nil
}
func (s *chatSession) Send(ctx context.Context, prompt string, out io.Writer) error {
s.messages = append(s.messages, chatMessage{
Role: chatRoleUser,
Content: prompt,
})
answer, err := s.client.StreamChat(ctx, s.model, s.messages, out)
if err != nil {
return err
}
s.messages = append(s.messages, chatMessage{
Role: chatRoleAssistant,
Content: answer,
})
return nil
}
func resolveChatModel(requested string, models []string) (string, error) {
switch {
case requested == "" && len(models) == 0:
return "", errors.New(`no chat models are installed.
Install a model first, for example:
local-ai models list
local-ai models install <model>
local-ai run
Then start a chat session:
local-ai chat --model <model>`)
case requested == "" && len(models) == 1:
return models[0], nil
case requested == "" && len(models) > 1:
var b strings.Builder
b.WriteString("multiple models are available; choose one with --model:\n")
b.WriteString(formatChatModelList(models, ""))
return "", errors.New(b.String())
case !modelExists(models, requested):
return "", fmt.Errorf("model %q is not available. Use `local-ai models list` and `local-ai models install <model>`, or pass an installed model with --model", requested)
default:
return requested, nil
}
}
func modelExists(models []string, name string) bool {
for _, model := range models {
if model == name {
return true
}
}
return false
}

View File

@@ -0,0 +1,56 @@
package chat
import (
"context"
"io"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("Chat session", func() {
It("keeps model switching and message history out of the terminal adapter", func() {
client := &fakeChatClient{
models: []string{"alpha", "beta"},
answer: "pong",
}
session, err := newChatSession(context.Background(), client, "alpha")
Expect(err).ToNot(HaveOccurred())
Expect(session.CurrentModel()).To(Equal("alpha"))
Expect(session.SwitchModel("beta")).To(Succeed())
Expect(session.CurrentModel()).To(Equal("beta"))
Expect(session.Send(context.Background(), "ping", io.Discard)).To(Succeed())
Expect(client.requests).To(HaveLen(1))
Expect(client.requests[0].model).To(Equal("beta"))
Expect(client.requests[0].messages).To(HaveLen(1))
Expect(client.requests[0].messages[0].Content).To(Equal("ping"))
})
})
type fakeChatClient struct {
models []string
answer string
requests []fakeChatRequest
}
type fakeChatRequest struct {
model string
messages []chatMessage
}
func (c *fakeChatClient) ListModels(context.Context) ([]string, error) {
return c.models, nil
}
func (c *fakeChatClient) StreamChat(_ context.Context, model string, messages []chatMessage, out io.Writer) (string, error) {
copied := make([]chatMessage, len(messages))
copy(copied, messages)
c.requests = append(c.requests, fakeChatRequest{model: model, messages: copied})
if _, err := io.WriteString(out, c.answer); err != nil {
return "", err
}
return c.answer, nil
}

93
core/cli/chat/terminal.go Normal file
View File

@@ -0,0 +1,93 @@
package chat
import (
"bufio"
"context"
"fmt"
"io"
"strings"
)
func runTerminalChat(ctx context.Context, session *chatSession, in io.Reader, out io.Writer) error {
scanner := bufio.NewScanner(in)
scanner.Buffer(make([]byte, 0, 64*1024), 4*1024*1024)
if err := writeChat(out, "LocalAI chat (%s)\n", session.CurrentModel()); err != nil {
return err
}
if err := writeChat(out, "Type /exit to quit, /clear to reset the conversation, /models to list models.\n"); err != nil {
return err
}
for {
if err := writeChat(out, "\n> "); err != nil {
return err
}
if !scanner.Scan() {
break
}
prompt := strings.TrimSpace(scanner.Text())
switch prompt {
case "":
continue
case "/bye", "/exit", "/quit":
return writeChat(out, "bye\n")
case "/clear":
session.Clear()
if err := writeChat(out, "conversation cleared\n"); err != nil {
return err
}
continue
case "/models":
if err := printChatModels(out, session.Models(), session.CurrentModel()); err != nil {
return err
}
continue
}
if nextModel, ok := strings.CutPrefix(prompt, "/model "); ok {
nextModel = strings.TrimSpace(nextModel)
if nextModel == "" {
if err := writeChat(out, "usage: /model <name>\n"); err != nil {
return err
}
continue
}
if err := session.SwitchModel(nextModel); err != nil {
if writeErr := writeChat(out, "%s\n", err); writeErr != nil {
return writeErr
}
continue
}
if err := writeChat(out, "switched to %s; conversation cleared\n", session.CurrentModel()); err != nil {
return err
}
continue
}
if err := writeChat(out, "assistant: "); err != nil {
return err
}
if err := session.Send(ctx, prompt, out); err != nil {
return err
}
if err := writeChat(out, "\n"); err != nil {
return err
}
}
return scanner.Err()
}
func printChatModels(out io.Writer, models []string, current string) error {
if len(models) == 0 {
return writeChat(out, "no models installed\n")
}
return writeChat(out, "%s", formatChatModelList(models, current))
}
func writeChat(out io.Writer, format string, args ...any) error {
_, err := fmt.Fprintf(out, format, args...)
return err
}

25
core/cli/chat_cmd.go Normal file
View File

@@ -0,0 +1,25 @@
package cli
import (
"context"
"os"
chatcli "github.com/mudler/LocalAI/core/cli/chat"
cliContext "github.com/mudler/LocalAI/core/cli/context"
)
type ChatCMD struct {
Model string `short:"m" help:"Model name to use. Defaults to the only model returned by the server when exactly one is available"`
Endpoint string `env:"LOCALAI_CHAT_ENDPOINT" default:"http://127.0.0.1:8080" help:"LocalAI server endpoint. The /v1 path is added automatically when omitted"`
APIKey string `env:"LOCALAI_API_KEY,API_KEY" help:"API key to use when the LocalAI server requires authentication"`
}
func (c *ChatCMD) Run(ctx *cliContext.Context) error {
return chatcli.Run(context.Background(), chatcli.Options{
Model: c.Model,
BaseURL: chatAPIBaseURL(c.Endpoint),
APIKey: c.APIKey,
In: os.Stdin,
Out: os.Stdout,
})
}

27
core/cli/chat_cmd_test.go Normal file
View File

@@ -0,0 +1,27 @@
package cli
import (
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("Chat command wiring", func() {
Describe("chatAPIBaseURL", func() {
It("adds /v1 to a root endpoint", func() {
Expect(chatAPIBaseURL("http://127.0.0.1:8080")).To(Equal("http://127.0.0.1:8080/v1"))
})
It("keeps endpoints that already include /v1", func() {
Expect(chatAPIBaseURL("http://127.0.0.1:8080/v1")).To(Equal("http://127.0.0.1:8080/v1"))
Expect(chatAPIBaseURL("http://127.0.0.1:8080/v1/")).To(Equal("http://127.0.0.1:8080/v1"))
})
It("adds a default http scheme", func() {
Expect(chatAPIBaseURL("127.0.0.1:8080")).To(Equal("http://127.0.0.1:8080/v1"))
})
It("preserves non-root paths before /v1", func() {
Expect(chatAPIBaseURL("http://127.0.0.1:8080/localai")).To(Equal("http://127.0.0.1:8080/localai/v1"))
})
})
})

29
core/cli/chat_endpoint.go Normal file
View File

@@ -0,0 +1,29 @@
package cli
import (
"net/url"
"strings"
)
func chatAPIBaseURL(endpoint string) string {
if !strings.Contains(endpoint, "://") {
endpoint = "http://" + endpoint
}
u, err := url.Parse(endpoint)
if err != nil {
return strings.TrimRight(endpoint, "/") + "/v1"
}
path := strings.TrimRight(u.Path, "/")
if path == "" {
u.Path = "/v1"
} else if path != "/v1" && !strings.HasSuffix(path, "/v1") {
u.Path = path + "/v1"
} else {
u.Path = path
}
u.RawQuery = ""
u.Fragment = ""
return u.String()
}

View File

@@ -9,6 +9,7 @@ var CLI struct {
cliContext.Context `embed:""`
Run RunCMD `cmd:"" help:"Run LocalAI, this the default command if no other command is specified. Run 'local-ai run --help' for more information" default:"withargs"`
Chat ChatCMD `cmd:"" help:"Open an interactive chat session against a running LocalAI server"`
Federated FederatedCLI `cmd:"" help:"Run LocalAI in federated mode"`
Models ModelsCMD `cmd:"" help:"Manage LocalAI models and definitions"`
Backends BackendsCMD `cmd:"" help:"Manage LocalAI backends and definitions"`

View File

@@ -30,6 +30,8 @@ type RunCMD struct {
ModelArgs []string `arg:"" optional:"" name:"models" help:"Model configuration URLs to load"`
ExternalBackends []string `env:"LOCALAI_EXTERNAL_BACKENDS,EXTERNAL_BACKENDS" help:"A list of external backends to load from gallery on boot" group:"backends"`
WebRTCNAT1To1IPs []string `env:"LOCALAI_WEBRTC_NAT_1TO1_IPS,WEBRTC_NAT_1TO1_IPS" help:"IPs advertised as the host ICE candidates for /v1/realtime WebRTC instead of every local interface. Set to the reachable host/LAN IP when running under Docker host networking or NAT, where pion otherwise offers unreachable bridge addresses and the connection drops after ICE consent checks fail." group:"api"`
WebRTCICEInterfaces []string `env:"LOCALAI_WEBRTC_ICE_INTERFACES,WEBRTC_ICE_INTERFACES" help:"Restrict /v1/realtime WebRTC ICE candidate gathering to these network interfaces (e.g. eth0), filtering out docker0/veth noise." group:"api"`
BackendsPath string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"backends"`
BackendsSystemPath string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
ModelsPath string `env:"LOCALAI_MODELS_PATH,MODELS_PATH" type:"path" default:"${basepath}/models" help:"Path containing models used for inferencing" group:"storage"`
@@ -225,6 +227,8 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
config.WithApiKeys(r.APIKeys),
config.WithModelsURL(append(r.Models, r.ModelArgs...)...),
config.WithExternalBackends(r.ExternalBackends...),
config.WithWebRTCNAT1To1IPs(r.WebRTCNAT1To1IPs...),
config.WithWebRTCICEInterfaces(r.WebRTCICEInterfaces...),
config.WithOpaqueErrors(r.OpaqueErrors),
config.WithEnforcedPredownloadScans(!r.DisablePredownloadScan),
config.WithSubtleKeyComparison(r.UseSubtleKeyComparison),
@@ -652,12 +656,12 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
// waitForServerReady polls the given address until the HTTP server is
// accepting connections or the context is cancelled.
func waitForServerReady(address string, ctx context.Context) {
// Ensure the address has a host component for dialing.
// Echo accepts ":8080" but net.Dial needs a resolvable host.
host, port, err := net.SplitHostPort(address)
if err == nil && host == "" {
address = "127.0.0.1:" + port
}
ticker := time.NewTicker(250 * time.Millisecond)
defer ticker.Stop()
for {
select {
@@ -665,11 +669,17 @@ func waitForServerReady(address string, ctx context.Context) {
return
default:
}
conn, err := net.DialTimeout("tcp", address, 500*time.Millisecond)
if err == nil {
conn.Close()
return
}
time.Sleep(250 * time.Millisecond)
select {
case <-ctx.Done():
return
case <-ticker.C:
}
}
}

View File

@@ -12,10 +12,19 @@ import (
)
type ApplicationConfig struct {
Context context.Context
ConfigFile string
SystemState *system.SystemState
ExternalBackends []string
Context context.Context
ConfigFile string
SystemState *system.SystemState
ExternalBackends []string
// WebRTCNAT1To1IPs, when set, are advertised as the host ICE candidates for
// /v1/realtime WebRTC instead of every local interface address. Needed when
// the routable address differs from what pion gathers — e.g. Docker host
// networking (where pion also offers unreachable bridge IPs) or NAT.
WebRTCNAT1To1IPs []string
// WebRTCICEInterfaces, when set, restricts ICE candidate gathering to these
// network interfaces (e.g. eth0), filtering out docker0/veth noise.
WebRTCICEInterfaces []string
UploadLimitMB, Threads, ContextSize int
F16 bool
Debug bool
@@ -81,7 +90,6 @@ type ApplicationConfig struct {
// file is mode 0600.
MITMCADir string
// PIIPatternOverrides applies persisted per-id deltas (action,
// disabled) to the live redactor at startup. Loaded from
// runtime_settings.json and applied right after pii.NewRedactor.
@@ -116,11 +124,11 @@ type ApplicationConfig struct {
// --require-backend-integrity / LOCALAI_REQUIRE_BACKEND_INTEGRITY.
RequireBackendIntegrity bool
SingleBackend bool // Deprecated: use MaxActiveBackends = 1 instead
MaxActiveBackends int // Maximum number of active backends (0 = unlimited, 1 = single backend mode)
WatchDogIdle bool
WatchDogBusy bool
WatchDog bool
SingleBackend bool // Deprecated: use MaxActiveBackends = 1 instead
MaxActiveBackends int // Maximum number of active backends (0 = unlimited, 1 = single backend mode)
WatchDogIdle bool
WatchDogBusy bool
WatchDog bool
// Memory Reclaimer settings (works with GPU if available, otherwise RAM)
MemoryReclaimerEnabled bool // Enable memory threshold monitoring
@@ -311,6 +319,18 @@ func WithExternalBackends(backends ...string) AppOption {
}
}
func WithWebRTCNAT1To1IPs(ips ...string) AppOption {
return func(o *ApplicationConfig) {
o.WebRTCNAT1To1IPs = ips
}
}
func WithWebRTCICEInterfaces(interfaces ...string) AppOption {
return func(o *ApplicationConfig) {
o.WebRTCICEInterfaces = interfaces
}
}
func WithMachineTag(tag string) AppOption {
return func(o *ApplicationConfig) {
o.MachineTag = tag
@@ -702,7 +722,6 @@ func WithMITMCADir(dir string) AppOption {
}
}
func WithDynamicConfigDir(dynamicConfigsDir string) AppOption {
return func(o *ApplicationConfig) {
o.DynamicConfigsDir = dynamicConfigsDir

View File

@@ -39,7 +39,21 @@ func llamaCppDefaults(cfg *ModelConfig, modelPath string) {
}
}()
f, err := gguf.ParseGGUFFile(guessPath)
// Startup parses every model's GGUF header to guess defaults. We only need
// scalar metadata (architecture, head/ff counts, chat_template, token IDs,
// MTP head) plus array *lengths* — never the array *contents*. Two options
// keep this cheap, which matters when many models live on slow storage such
// as a Docker volume (see https://github.com/mudler/LocalAI/issues/9790):
//
// - SkipLargeMetadata: seek past large array-valued metadata (the tokenizer
// vocab: tokenizer.ggml.tokens/scores/merges, often >100k entries) instead
// of reading and allocating every element. Lengths stay populated.
// - UseMMap: read the header via a memory map so faulting in a few pages
// replaces hundreds of thousands of tiny read() syscalls (measured ~524k
// -> 8 for a 256k-token vocab), the dominant cost on slow filesystems.
//
// The mapping is released when ParseGGUFFile returns.
f, err := gguf.ParseGGUFFile(guessPath, gguf.UseMMap(), gguf.SkipLargeMetadata())
if err == nil {
guessGGUFFromFile(cfg, f, 0)
}

View File

@@ -1,13 +1,76 @@
package config_test
import (
"bytes"
"encoding/binary"
"os"
"path/filepath"
. "github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/schema"
gguf "github.com/gpustack/gguf-parser-go"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// GGUF metadata value type tags (see github.com/gpustack/gguf-parser-go).
const (
ggufTypeUint32 uint32 = 4
ggufTypeString uint32 = 8
ggufTypeArray uint32 = 9
)
// writeTestGGUF emits a minimal but valid little-endian GGUF v3 header carrying
// the scalar metadata the llama-cpp hook guesses from plus a large string vocab
// array (tokenizer.ggml.tokens). The big array is exactly what SkipLargeMetadata
// + UseMMap are expected to avoid reading element-by-element, so it must survive a
// round-trip through the real hook without corrupting the guessed defaults.
func writeTestGGUF(path, chatTemplate string, vocab int) error {
wStr := func(b *bytes.Buffer, s string) {
binary.Write(b, binary.LittleEndian, uint64(len(s)))
b.WriteString(s)
}
kvStr := func(b *bytes.Buffer, k, v string) {
wStr(b, k)
binary.Write(b, binary.LittleEndian, ggufTypeString)
wStr(b, v)
}
kvU32 := func(b *bytes.Buffer, k string, v uint32) {
wStr(b, k)
binary.Write(b, binary.LittleEndian, ggufTypeUint32)
binary.Write(b, binary.LittleEndian, v)
}
var meta bytes.Buffer
kvStr(&meta, "general.architecture", "llama")
kvStr(&meta, "general.name", "ReproModel")
kvU32(&meta, "llama.context_length", 4096)
kvU32(&meta, "llama.attention.head_count", 32)
kvU32(&meta, "llama.feed_forward_length", 11008)
kvU32(&meta, "llama.block_count", 32)
kvU32(&meta, "tokenizer.ggml.bos_token_id", 1)
kvStr(&meta, "tokenizer.chat_template", chatTemplate)
// large array value — the one the optimization skips reading
wStr(&meta, "tokenizer.ggml.tokens")
binary.Write(&meta, binary.LittleEndian, ggufTypeArray)
binary.Write(&meta, binary.LittleEndian, ggufTypeString)
binary.Write(&meta, binary.LittleEndian, uint64(vocab))
for i := 0; i < vocab; i++ {
wStr(&meta, "token")
}
var out bytes.Buffer
binary.Write(&out, binary.LittleEndian, gguf.GGUFMagicGGUFLe)
binary.Write(&out, binary.LittleEndian, uint32(3)) // version
binary.Write(&out, binary.LittleEndian, uint64(0)) // tensor count
binary.Write(&out, binary.LittleEndian, uint64(9)) // metadata kv count
out.Write(meta.Bytes())
return os.WriteFile(path, out.Bytes(), 0o644)
}
var _ = Describe("Backend hooks and parser defaults", func() {
Context("MatchParserDefaults", func() {
It("matches Qwen3 family", func() {
@@ -137,6 +200,58 @@ var _ = Describe("Backend hooks and parser defaults", func() {
})
})
Context("llamaCppDefaults GGUF guessing", func() {
// Regression coverage for https://github.com/mudler/LocalAI/issues/9790:
// the hook reads GGUF headers with SkipLargeMetadata + UseMMap to avoid
// pulling the whole tokenizer vocab off (slow) disk on every startup. This
// verifies that skipping the vocab array still yields the correct guessed
// defaults from the remaining scalar metadata.
const chatTemplate = "{{ bos_token }}{% for m in messages %}{{ m.content }}{% endfor %}"
It("guesses defaults from a GGUF whose large vocab is skipped", func() {
dir := GinkgoT().TempDir()
modelFile := "repro.gguf"
Expect(writeTestGGUF(filepath.Join(dir, modelFile), chatTemplate, 50000)).To(Succeed())
// A pre-set context size short-circuits the GGUF run-estimate, which
// needs full tensor info this header-only fixture deliberately omits;
// the metadata-reading path the optimization touches is unaffected.
ctxSize := 4096
cfg := &ModelConfig{
Backend: "llama-cpp",
LLMConfig: LLMConfig{ContextSize: &ctxSize},
PredictionOptions: schema.PredictionOptions{
BasicModelRequest: schema.BasicModelRequest{Model: modelFile},
},
}
cfg.SetDefaults(ModelPath(dir))
// chat_template is a scalar string, not part of the skipped array,
// so it must be captured verbatim.
Expect(cfg.GetModelTemplate()).To(Equal(chatTemplate))
// scalar-derived defaults are still applied
Expect(cfg.ContextSize).NotTo(BeNil())
Expect(cfg.NGPULayers).NotTo(BeNil())
Expect(cfg.TemplateConfig.UseTokenizerTemplate).To(BeTrue())
Expect(cfg.KnownUsecaseStrings).To(ContainElement("FLAG_CHAT"))
})
It("falls back to the default context size when the GGUF is unreadable", func() {
dir := GinkgoT().TempDir()
Expect(os.WriteFile(filepath.Join(dir, "bad.gguf"), []byte("not a gguf"), 0o644)).To(Succeed())
cfg := &ModelConfig{
Backend: "llama-cpp",
PredictionOptions: schema.PredictionOptions{
BasicModelRequest: schema.BasicModelRequest{Model: "bad.gguf"},
},
}
cfg.SetDefaults(ModelPath(dir))
Expect(cfg.ContextSize).NotTo(BeNil())
})
})
Context("PromptCacheAll default", func() {
It("defaults to true when omitted from YAML", func() {
cfg := &ModelConfig{}

View File

@@ -308,6 +308,41 @@ func DefaultRegistry() map[string]FieldMetaOverride {
},
Order: 64,
},
"pipeline.disable_thinking": {
Section: "pipeline",
Label: "Disable Thinking",
Description: "Suppress reasoning/thinking output from the pipeline LLM (sets enable_thinking=false on the underlying model). Use for models that emit <think> blocks you don't want spoken or streamed back to the realtime client.",
Component: "toggle",
Order: 65,
},
"pipeline.streaming.llm": {
Section: "pipeline",
Label: "Stream LLM",
Description: "Stream LLM tokens to the realtime client as they are generated instead of waiting for the full response. Emits incremental response.output_audio_transcript.delta / text deltas.",
Component: "toggle",
Order: 66,
},
"pipeline.streaming.tts": {
Section: "pipeline",
Label: "Stream TTS",
Description: "Stream synthesized audio chunks to the realtime client as they are produced (requires a TTS backend that implements TTSStream). Falls back to unary synthesis otherwise.",
Component: "toggle",
Order: 67,
},
"pipeline.streaming.transcription": {
Section: "pipeline",
Label: "Stream Transcription",
Description: "Stream partial transcription text to the realtime client as the STT backend produces it (requires a transcription backend that implements AudioTranscriptionStream). Falls back to unary transcription otherwise.",
Component: "toggle",
Order: 68,
},
"pipeline.streaming.clause_chunking": {
Section: "pipeline",
Label: "Clause Chunking",
Description: "Split the streamed reply into speakable clauses and synthesize each as soon as it completes, instead of buffering the whole message before TTS — lower time-to-first-audio. Script-aware (handles CJK 。!? and Thai/Lao spaces), so it does not whitespace-split. Requires Stream LLM; off buffers the whole message.",
Component: "toggle",
Order: 69,
},
// --- Functions ---
"function.grammar.parallel_calls": {

View File

@@ -499,6 +499,16 @@ type Pipeline struct {
// the pipeline's LLM without editing the LLM model config. Overrides the LLM's
// own reasoning_effort. Unset leaves the LLM model config in charge.
ReasoningEffort string `yaml:"reasoning_effort,omitempty" json:"reasoning_effort,omitempty"`
// Streaming opts each pipeline stage into incremental delivery (LLM tokens,
// TTS audio chunks, transcription text). Unset stages keep the blocking
// unary path, so existing configs are unaffected.
Streaming PipelineStreaming `yaml:"streaming,omitempty" json:"streaming,omitempty"`
// DisableThinking suppresses reasoning/thinking for the pipeline LLM (maps
// to enable_thinking=false backend metadata) without editing the underlying
// LLM model config. Unset leaves the LLM model config in charge.
DisableThinking *bool `yaml:"disable_thinking,omitempty" json:"disable_thinking,omitempty"`
}
// ApplyReasoningEffort resolves the effective reasoning effort — a per-request
@@ -530,6 +540,41 @@ func (c *ModelConfig) ApplyReasoningEffort(requestEffort string) {
}
}
// @Description PipelineStreaming toggles incremental delivery per realtime stage.
type PipelineStreaming struct {
LLM *bool `yaml:"llm,omitempty" json:"llm,omitempty"`
TTS *bool `yaml:"tts,omitempty" json:"tts,omitempty"`
Transcription *bool `yaml:"transcription,omitempty" json:"transcription,omitempty"`
// ClauseChunking splits the streamed LLM reply into speakable clauses and
// synthesizes each as soon as it completes, instead of buffering the whole
// message before TTS. Script-aware (CJK/Thai), so it does not rely on
// whitespace sentence boundaries. Requires LLM streaming; unset buffers the
// whole message (today's default).
ClauseChunking *bool `yaml:"clause_chunking,omitempty" json:"clause_chunking,omitempty"`
}
// StreamLLM reports whether LLM tokens should be streamed for this pipeline.
func (p Pipeline) StreamLLM() bool { return p.Streaming.LLM != nil && *p.Streaming.LLM }
// StreamTTS reports whether TTS audio should be streamed for this pipeline.
func (p Pipeline) StreamTTS() bool { return p.Streaming.TTS != nil && *p.Streaming.TTS }
// StreamTranscription reports whether transcription text should be streamed.
func (p Pipeline) StreamTranscription() bool {
return p.Streaming.Transcription != nil && *p.Streaming.Transcription
}
// ChunkClauses reports whether the streamed reply should be split into
// script-aware clauses and synthesized incrementally rather than buffered whole.
func (p Pipeline) ChunkClauses() bool {
return p.Streaming.ClauseChunking != nil && *p.Streaming.ClauseChunking
}
// ThinkingDisabled reports whether the pipeline forces the LLM's thinking off.
func (p Pipeline) ThinkingDisabled() bool {
return p.DisableThinking != nil && *p.DisableThinking
}
// @Description File configuration for model downloads
type File struct {
Filename string `yaml:"filename,omitempty" json:"filename,omitempty"`

View File

@@ -30,11 +30,26 @@ func MTPSpecOptions() []string {
return out
}
// HasEmbeddedMTPHead reports whether the parsed GGUF declares a Multi-Token
// Prediction head. Detection reads `<arch>.nextn_predict_layers`, which is
// what `gguf_writer.add_nextn_predict_layers(n)` emits in upstream's
// isDraftOnlyAssistantArch reports whether an architecture names a standalone
// MTP *draft* model rather than a self-speculating trunk. Upstream's Gemma4 MTP
// (ggml-org/llama.cpp#23398) registers the head as a separate `gemma4-assistant`
// architecture whose GGUF still carries `nextn_predict_layers`, but which cannot
// run alone: it requires a paired target context (`ctx_other`). Such archs must
// not trigger the embedded-head self-speculation defaults. The `-assistant`
// suffix is upstream's naming convention for these draft-only checkpoints.
func isDraftOnlyAssistantArch(arch string) bool {
return strings.HasSuffix(arch, "-assistant")
}
// HasEmbeddedMTPHead reports whether the parsed GGUF declares a self-speculating
// Multi-Token Prediction head. Detection reads `<arch>.nextn_predict_layers`,
// which is what `gguf_writer.add_nextn_predict_layers(n)` emits in upstream's
// `conversion/qwen.py` MTP mixin. A positive layer count means the head is
// present in the same GGUF as the trunk.
//
// Draft-only assistant architectures (e.g. Gemma4's `gemma4-assistant`) carry
// the same key but are separate draft checkpoints meant to be paired with a
// target model, so they are deliberately excluded here.
func HasEmbeddedMTPHead(f *gguf.GGUFFile) (uint32, bool) {
if f == nil {
return 0, false
@@ -43,6 +58,9 @@ func HasEmbeddedMTPHead(f *gguf.GGUFFile) (uint32, bool) {
if arch == "" {
return 0, false
}
if isDraftOnlyAssistantArch(arch) {
return 0, false
}
v, ok := f.Header.MetadataKV.Get(arch + ".nextn_predict_layers")
if !ok {
return 0, false

View File

@@ -3,10 +3,33 @@ package config_test
import (
. "github.com/mudler/LocalAI/core/config"
gguf "github.com/gpustack/gguf-parser-go"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// ggufWithArch fabricates a minimal in-memory GGUF carrying the given
// `general.architecture` and a positive `<arch>.nextn_predict_layers` count,
// so HasEmbeddedMTPHead can be exercised without a real model file.
func ggufWithArch(arch string, nextn uint32) *gguf.GGUFFile {
return &gguf.GGUFFile{
Header: gguf.GGUFHeader{
MetadataKV: gguf.GGUFMetadataKVs{
{
Key: "general.architecture",
ValueType: gguf.GGUFMetadataValueTypeString,
Value: arch,
},
{
Key: arch + ".nextn_predict_layers",
ValueType: gguf.GGUFMetadataValueTypeUint32,
Value: nextn,
},
},
},
}
}
var _ = Describe("MTP auto-defaults", func() {
Context("MTPSpecOptions", func() {
It("returns the upstream-recommended speculative tuple", func() {
@@ -82,5 +105,20 @@ var _ = Describe("MTP auto-defaults", func() {
Expect(ok).To(BeFalse())
Expect(n).To(BeZero())
})
It("detects a same-GGUF embedded head (DeepSeek/Qwen style)", func() {
n, ok := HasEmbeddedMTPHead(ggufWithArch("qwen3moe", 1))
Expect(ok).To(BeTrue())
Expect(n).To(Equal(uint32(1)))
})
It("ignores a gemma4-assistant draft-only model", func() {
// The assistant GGUF carries nextn_predict_layers but is a separate
// draft model that requires a paired target (ctx_other); it cannot
// self-speculate, so it must not trigger the embedded-head defaults.
n, ok := HasEmbeddedMTPHead(ggufWithArch("gemma4-assistant", 48))
Expect(ok).To(BeFalse())
Expect(n).To(BeZero())
})
})
})

View File

@@ -0,0 +1,57 @@
package config
import (
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"gopkg.in/yaml.v3"
)
// The realtime pipeline can stream each stage (LLM tokens, TTS audio,
// transcription text) and can disable model "thinking" for the LLM. These are
// opt-in per pipeline; everything defaults to off so existing configs keep the
// unary behaviour.
var _ = Describe("Pipeline streaming config", func() {
It("defaults every streaming + thinking helper to false when unset", func() {
var p Pipeline
Expect(p.StreamLLM()).To(BeFalse())
Expect(p.StreamTTS()).To(BeFalse())
Expect(p.StreamTranscription()).To(BeFalse())
Expect(p.ChunkClauses()).To(BeFalse())
Expect(p.ThinkingDisabled()).To(BeFalse())
})
It("parses the nested streaming block and disable_thinking from YAML", func() {
var c ModelConfig
err := yaml.Unmarshal([]byte(`
name: gpt-realtime
pipeline:
llm: my-llm
tts: my-tts
transcription: my-stt
streaming:
llm: true
tts: true
transcription: true
clause_chunking: true
disable_thinking: true
`), &c)
Expect(err).ToNot(HaveOccurred())
Expect(c.Pipeline.StreamLLM()).To(BeTrue())
Expect(c.Pipeline.StreamTTS()).To(BeTrue())
Expect(c.Pipeline.StreamTranscription()).To(BeTrue())
Expect(c.Pipeline.ChunkClauses()).To(BeTrue())
Expect(c.Pipeline.ThinkingDisabled()).To(BeTrue())
})
It("treats an explicit false in the streaming block as disabled", func() {
var c ModelConfig
err := yaml.Unmarshal([]byte(`
name: gpt-realtime
pipeline:
streaming:
tts: false
`), &c)
Expect(err).ToNot(HaveOccurred())
Expect(c.Pipeline.StreamTTS()).To(BeFalse())
})
})

View File

@@ -103,7 +103,12 @@ func applyAutoparserOverride(
// blocks like "<think></think>" that some models emit when reasoning
// is disabled.
if deltaReasoning == "" && deltaContent != "" {
deltaReasoning, deltaContent = reason.ExtractReasoningWithConfig(deltaContent, thinkingStartToken, reasoningConfig)
// Complete-response extraction: only honor a prefilled <think> start
// token when deltaContent actually closes the reasoning block. Without
// it the model answered directly and the whole answer must stay in
// content rather than be swallowed as unclosed reasoning. See
// reason.ExtractReasoningComplete.
deltaReasoning, deltaContent = reason.ExtractReasoningComplete(deltaContent, thinkingStartToken, reasoningConfig)
}
xlog.Debug("[ChatDeltas] non-SSE no-tools: overriding result with C++ autoparser deltas",
"content_len", len(deltaContent), "reasoning_len", len(deltaReasoning))

View File

@@ -186,6 +186,114 @@ var _ = Describe("applyAutoparserOverride", func() {
Expect(result).To(Equal(existing))
})
})
// Regression tests for the prefilled-thinking-token path (thinkingStartToken
// != ""). This is the configuration the gallery qwen3 family runs in: the
// chat template injects <think> into the prompt, so DetectThinkingStartToken
// returns "<think>" and the model's output begins *inside* a reasoning block
// — it emits a closing </think> but no opening tag.
//
// The defensive Go-side fallback prepends the start token so the standard
// extractor can pair it with the model's </think>. But on a *complete*
// response that contains NO closing tag (the model answered directly with no
// reasoning at all), prepending <think> manufactures an unclosed block that
// swallows the entire answer into reasoning, leaving content empty. That is
// the bug: short/direct answers (session names, JSON summaries) come back
// with an empty content field.
Context("autoparser delivered content with empty reasoning and a prefilled thinking token", func() {
const startToken = "<think>"
It("keeps a tag-less direct answer as content instead of swallowing it as reasoning", func() {
// Model answered directly: no <think>, no </think> anywhere.
chatDeltas := []*pb.ChatDelta{
{Content: "hello", ReasoningContent: ""},
}
result := applyAutoparserOverride(chatDeltas, startToken, reason.Config{}, nil)
Expect(result).To(HaveLen(1))
Expect(result[0].Message.Content).ToNot(BeNil())
Expect(*(result[0].Message.Content.(*string))).To(Equal("hello"),
"a complete answer with no closing reasoning tag must stay in content")
Expect(result[0].Message.Reasoning).To(BeNil(),
"no reasoning block was emitted, so Reasoning must not be set")
})
It("keeps a tag-less JSON answer as content (the summary case)", func() {
raw := `{"short":"Tests pass","long":"go test ./... succeeded."}`
chatDeltas := []*pb.ChatDelta{
{Content: raw, ReasoningContent: ""},
}
result := applyAutoparserOverride(chatDeltas, startToken, reason.Config{}, nil)
Expect(result).To(HaveLen(1))
Expect(*(result[0].Message.Content.(*string))).To(Equal(raw))
Expect(result[0].Message.Reasoning).To(BeNil())
})
It("still splits reasoning when the model emits the closing tag (prefill paired with </think>)", func() {
// The legitimate prefill case: <think> was in the prompt, so the
// output carries only the closing tag. The closing tag is the proof
// that a reasoning block exists, so extraction must run.
raw := "The user wants a greeting.\n</think>\n\nHello there!"
chatDeltas := []*pb.ChatDelta{
{Content: raw, ReasoningContent: ""},
}
result := applyAutoparserOverride(chatDeltas, startToken, reason.Config{}, nil)
Expect(result).To(HaveLen(1))
content := *(result[0].Message.Content.(*string))
Expect(content).To(ContainSubstring("Hello there!"))
Expect(content).ToNot(ContainSubstring("</think>"))
Expect(content).ToNot(ContainSubstring("The user wants a greeting"))
Expect(result[0].Message.Reasoning).ToNot(BeNil())
Expect(*result[0].Message.Reasoning).To(ContainSubstring("The user wants a greeting"))
})
It("still splits a fully-tagged <think>…</think> block with a prefill token set", func() {
raw := "<think>Reasoning here.</think>Final answer."
chatDeltas := []*pb.ChatDelta{
{Content: raw, ReasoningContent: ""},
}
result := applyAutoparserOverride(chatDeltas, startToken, reason.Config{}, nil)
Expect(result).To(HaveLen(1))
Expect(*(result[0].Message.Content.(*string))).To(Equal("Final answer."))
Expect(result[0].Message.Reasoning).ToNot(BeNil())
Expect(*result[0].Message.Reasoning).To(ContainSubstring("Reasoning here"))
})
// End-to-end regression for the real production failure: a request with
// enable_thinking=false against a <think>-capable model (qwen3 family).
//
// In non-thinking mode the model emits no reasoning block, so llama.cpp's
// autoparser correctly returns ChatDeltas with Content set and
// ReasoningContent EMPTY (verified against stock llama-server: the same
// model with chat_template_kwargs.enable_thinking=false returns
// reasoning_content=null and content="hello"). But thinkingStartToken is
// detected per-model from the enable_thinking=TRUE render
// (grpc-server renders with enable_thinking=true; DetectThinkingStartToken
// does not evaluate the jinja {% if enable_thinking %} conditional), so it
// is "<think>" even for this non-thinking request. The old code prepended
// it and swallowed the answer. This is the case that broke session
// summaries and auto-titles and was NOT covered before.
It("preserves content for a non-thinking-mode request (enable_thinking=false, empty reasoning_content)", func() {
// What llama.cpp's autoparser actually returns in non-thinking mode.
chatDeltas := []*pb.ChatDelta{
{Content: `{"short":"Go tests passed for internal/session"}`, ReasoningContent: ""},
}
result := applyAutoparserOverride(chatDeltas, startToken, reason.Config{}, nil)
Expect(result).To(HaveLen(1))
Expect(*(result[0].Message.Content.(*string))).To(Equal(`{"short":"Go tests passed for internal/session"}`),
"non-thinking-mode answers must reach the client intact, not be swallowed as reasoning")
Expect(result[0].Message.Reasoning).To(BeNil())
})
})
})
var _ = Describe("mergeToolCallDeltas", func() {

View File

@@ -2,8 +2,10 @@ package openai
import (
"context"
"crypto/rand"
"encoding/base64"
"encoding/binary"
"encoding/hex"
"encoding/json"
"fmt"
"math"
@@ -235,6 +237,12 @@ type Model interface {
Transcribe(ctx context.Context, audio, language string, translate bool, diarize bool, prompt string) (*schema.TranscriptionResult, error)
Predict(ctx context.Context, messages schema.Messages, images, videos, audios []string, tokenCallback func(string, backend.TokenUsage) bool, tools []types.ToolUnion, toolChoice *types.ToolChoiceUnion, logprobs *int, topLogprobs *int, logitBias map[string]float64) (func() (backend.LLMResponse, error), error)
TTS(ctx context.Context, text, voice, language string) (string, *proto.Result, error)
// TTSStream synthesizes speech incrementally, invoking onAudio with raw PCM
// chunks (and the backend sample rate) as they are produced.
TTSStream(ctx context.Context, text, voice, language string, onAudio func(pcm []byte, sampleRate int) error) error
// TranscribeStream transcribes audio incrementally, invoking onDelta for each
// transcript text fragment and returning the final aggregated result.
TranscribeStream(ctx context.Context, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error)
PredictConfig() *config.ModelConfig
}
@@ -1254,27 +1262,15 @@ func commitUtterance(ctx context.Context, utt []byte, session *Session, conv *Co
// TODO: If we have a real any-to-any model then transcription is optional
var transcript string
if session.InputAudioTranscription != nil {
tr, err := session.ModelInterface.Transcribe(ctx, f.Name(), session.InputAudioTranscription.Language, false, false, session.InputAudioTranscription.Prompt)
// emitTranscription streams transcript deltas when
// pipeline.streaming.transcription is set, otherwise emits a single
// completed event; either way it returns the final transcript text.
var err error
transcript, err = emitTranscription(ctx, t, session, generateItemID(), f.Name())
if err != nil {
sendError(t, "transcription_failed", err.Error(), "", "event_TODO")
return
} else if tr == nil {
sendError(t, "transcription_failed", "trancribe result is nil", "", "event_TODO")
return
}
transcript = tr.Text
sendEvent(t, types.ConversationItemInputAudioTranscriptionCompletedEvent{
ServerEventBase: types.ServerEventBase{
EventID: "event_TODO",
},
ItemID: generateItemID(),
// ResponseID: "resp_TODO", // Not needed for transcription completed event
// OutputIndex: 0,
ContentIndex: 0,
Transcript: transcript,
})
} else {
sendNotImplemented(t, "any-to-any models")
return
@@ -1502,6 +1498,26 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
},
})
// Streamed LLM path: when the pipeline opts into LLM streaming, stream the
// transcript to the client as it is generated and synthesize the buffered
// message once. Tool turns are supported only when the model uses its
// tokenizer template: the C++ autoparser then delivers content and tool
// calls via ChatDeltas (clearing the text stream), so the spoken transcript
// never leaks tool-call tokens. Grammar-based function calling emits the
// call as JSON in the token stream, so those turns keep the buffered path.
if config != nil && session.ModelConfig != nil && session.ModelConfig.Pipeline.StreamLLM() {
canStream := len(tools) == 0 || config.TemplateConfig.UseTokenizerTemplate
var respMods []types.Modality
if overrides != nil {
respMods = overrides.OutputModalities
}
if canStream && modalitiesContainAudio(resolveOutputModalities(session.OutputModalities, respMods)) {
if streamLLMResponse(ctx, session, conv, t, responseID, conversationHistory, images, config, tools, toolChoice, toolTurn) {
return
}
}
}
predFunc, err := session.ModelInterface.Predict(ctx, conversationHistory, images, nil, nil, nil, tools, toolChoice, nil, nil, nil)
if err != nil {
sendError(t, "inference_failed", fmt.Sprintf("backend error: %v", err), "", "") // item.Assistant.ID is unknown here
@@ -1579,7 +1595,7 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
// ExtractReasoningWithConfig is a no-op when no tag pair matches,
// so it's safe to apply unconditionally in the no-reasoning branch.
if deltaReasoning == "" && deltaContent != "" {
deltaReasoning, deltaContent = reasoning.ExtractReasoningWithConfig(deltaContent, thinkingStartToken, config.ReasoningConfig)
deltaReasoning, deltaContent = reasoning.ExtractReasoningComplete(deltaContent, thinkingStartToken, spokenReasoningConfig(config.ReasoningConfig))
}
reasoningText = deltaReasoning
responseWithoutReasoning = deltaContent
@@ -1587,7 +1603,7 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
cleanedResponse = deltaContent
toolCalls = deltaToolCalls
} else {
reasoningText, responseWithoutReasoning = reasoning.ExtractReasoningWithConfig(rawResponse, thinkingStartToken, config.ReasoningConfig)
reasoningText, responseWithoutReasoning = reasoning.ExtractReasoningComplete(rawResponse, thinkingStartToken, spokenReasoningConfig(config.ReasoningConfig))
textContent = functions.ParseTextContent(responseWithoutReasoning, config.FunctionsConfig)
cleanedResponse = functions.CleanupLLMResult(responseWithoutReasoning, config.FunctionsConfig)
toolCalls = functions.ParseFunctionCall(cleanedResponse, config.FunctionsConfig)
@@ -1713,64 +1729,7 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
return
}
audioFilePath, res, err := session.ModelInterface.TTS(ctx, finalSpeech, session.Voice, session.InputAudioTranscription.Language)
if err != nil {
if ctx.Err() != nil {
xlog.Debug("TTS cancelled (barge-in)")
sendCancelledResponse()
return
}
xlog.Error("TTS failed", "error", err)
sendError(t, "tts_error", fmt.Sprintf("TTS generation failed: %v", err), "", item.Assistant.ID)
return
}
if !res.Success {
xlog.Error("TTS failed", "message", res.Message)
sendError(t, "tts_error", fmt.Sprintf("TTS generation failed: %s", res.Message), "", item.Assistant.ID)
return
}
defer func() { _ = os.Remove(audioFilePath) }()
audioBytes, err := os.ReadFile(audioFilePath)
if err != nil {
xlog.Error("failed to read TTS file", "error", err)
sendError(t, "tts_error", fmt.Sprintf("Failed to read TTS audio: %v", err), "", item.Assistant.ID)
return
}
// Parse WAV header to get raw PCM and the actual sample rate from the TTS backend.
pcmData, ttsSampleRate := laudio.ParseWAV(audioBytes)
if ttsSampleRate == 0 {
ttsSampleRate = localSampleRate
}
xlog.Debug("TTS audio parsed", "raw_bytes", len(audioBytes), "pcm_bytes", len(pcmData), "sample_rate", ttsSampleRate)
// SendAudio (WebRTC) passes PCM at the TTS sample rate directly to the
// Opus encoder, which resamples to 48kHz internally. This avoids a
// lossy intermediate resample through 16kHz.
// XXX: This is a noop in websocket mode; it's included in the JSON instead
if err := t.SendAudio(ctx, pcmData, ttsSampleRate); err != nil {
if ctx.Err() != nil {
xlog.Debug("Audio playback cancelled (barge-in)")
sendCancelledResponse()
return
}
xlog.Error("failed to send audio via transport", "error", err)
}
// For WebSocket clients, resample to the session's output rate and
// deliver audio as base64 in JSON events. WebRTC clients already
// received audio over the RTP track, so skip the base64 payload.
if !isWebRTC {
wsPCM := pcmData
if ttsSampleRate != session.OutputSampleRate {
samples := sound.BytesToInt16sLE(pcmData)
resampled := sound.ResampleInt16(samples, ttsSampleRate, session.OutputSampleRate)
wsPCM = sound.Int16toBytesLE(resampled)
}
audioString = base64.StdEncoding.EncodeToString(wsPCM)
}
// Transcript of the spoken reply (the audio's text).
sendEvent(t, types.ResponseOutputAudioTranscriptDeltaEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
@@ -1788,15 +1747,26 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
Transcript: finalSpeech,
})
// Synthesize and send the audio. With pipeline.streaming.tts enabled
// emitSpeech forwards a response.output_audio.delta per backend PCM
// chunk as it's produced; otherwise it sends the whole utterance as a
// single delta. The returned PCM is stored (base64) on the item below.
pcmAudio, err := emitSpeech(ctx, t, session, responseID, item.Assistant.ID, finalSpeech)
if err != nil {
if ctx.Err() != nil {
xlog.Debug("TTS cancelled (barge-in)")
sendCancelledResponse()
return
}
xlog.Error("TTS failed", "error", err)
sendError(t, "tts_error", fmt.Sprintf("TTS generation failed: %v", err), "", item.Assistant.ID)
return
}
if !isWebRTC {
audioString = base64.StdEncoding.EncodeToString(pcmAudio)
}
if !isWebRTC {
sendEvent(t, types.ResponseOutputAudioDeltaEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
ItemID: item.Assistant.ID,
OutputIndex: 0,
ContentIndex: 0,
Delta: audioString,
})
sendEvent(t, types.ResponseOutputAudioDoneEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
@@ -1849,17 +1819,27 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
})
}
// Handle Tool Calls. Two paths:
// - LocalAI Assistant tools (session.AssistantExecutor.IsTool) run
// server-side; we append both the call and its output to conv.Items
// and re-trigger a follow-up response so the model can speak the
// result. The client only sees observability events.
// - All other tools follow the standard OpenAI flow: emit
// function_call_arguments.done and wait for the client to send
// conversation.item.create back.
xlog.Debug("About to handle tool calls", "finalToolCallsCount", len(finalToolCalls))
// Emit the parsed tool calls, the terminal response.done, and (for
// server-side assistant tools) the follow-up response. Shared with the
// streamed path so both finalize tool calls identically.
emitToolCallItems(ctx, session, conv, t, responseID, finalToolCalls, finalSpeech != "", toolTurn)
}
// emitToolCallItems emits the realtime function_call items for the parsed tool
// calls, the terminal response.done, and — for server-side LocalAI Assistant
// tools — re-triggers a follow-up response so the model can speak the result.
// hasContent shifts the tool-call output index past the assistant content item
// when the same turn also produced spoken/text content. Two tool paths:
// - LocalAI Assistant tools (session.AssistantExecutor.IsTool) run server-side;
// we append both the call and its output to conv.Items and re-trigger. The
// client only sees observability events.
// - All other tools follow the standard OpenAI flow: emit
// function_call_arguments.done and wait for the client to send
// conversation.item.create back.
func emitToolCallItems(ctx context.Context, session *Session, conv *Conversation, t Transport, responseID string, toolCalls []functions.FuncCallResults, hasContent bool, toolTurn int) {
xlog.Debug("About to handle tool calls", "finalToolCallsCount", len(toolCalls))
executedAssistantTool := false
for i, tc := range finalToolCalls {
for i, tc := range toolCalls {
toolCallID := generateItemID()
callID := "call_" + generateUniqueID() // OpenAI uses call_xyz
@@ -1879,7 +1859,7 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
conv.Lock.Unlock()
outputIndex := i
if finalSpeech != "" {
if hasContent {
outputIndex++
}
@@ -2005,8 +1985,11 @@ func generateItemID() string {
}
func generateUniqueID() string {
// Generate a unique ID string
// For simplicity, use a counter or UUID
// Implement as needed
return "unique_id"
// 16 random bytes, hex-encoded. Must be collision-free: session, item,
// response and call IDs build on this, and the conversation tracks/removes
// items by ID (e.g. cancel() in realtime_stream.go, conversation.item.retrieve).
// A constant would make every ID alias and corrupt that bookkeeping.
var b [16]byte
_, _ = rand.Read(b[:])
return hex.EncodeToString(b[:])
}

View File

@@ -0,0 +1,200 @@
package openai
import (
"strings"
"unicode"
"unicode/utf8"
"github.com/rivo/uniseg"
)
// Default clause-chunker bounds (in runes). minRunes gates only sub-sentence
// (clause-mark / Thai-space) cuts so we don't synthesize tiny choppy fragments;
// full sentences always flush regardless of length. maxRunes caps an
// unterminated run so a long punctuation-less span doesn't buffer unbounded.
const (
defaultClauseMinRunes = 12
defaultClauseMaxRunes = 200
)
// clauseChunker splits streamed LLM content into speakable clauses for
// incremental TTS, in a SCRIPT-AWARE way so it works for languages without
// whitespace word boundaries. It leans on UAX #29 sentence segmentation (which
// natively terminates on CJK 。!? as well as Latin .!?), adds CJK clause
// punctuation (,、;:) and Thai/Lao spaces as finer boundaries, and caps an
// over-long unterminated run via UAX #14 line-break opportunities.
//
// Unlike the old ASCII .!?/newline segmenter (dropped in 076dcdbe), it does not
// degrade to whole-message buffering for CJK (handled natively) or Thai/Lao
// (handled via spaces, which Thai uses at clause/sentence boundaries). Scripts
// that genuinely need a dictionary (Khmer/Myanmar) simply stay buffered until a
// space or end-of-message — no worse than the buffered default.
//
// It is not safe for concurrent use; callers feed it from a single goroutine
// (the LLM token callback).
type clauseChunker struct {
buf strings.Builder
minRunes int
maxRunes int
}
func newClauseChunker(minRunes, maxRunes int) *clauseChunker {
return &clauseChunker{minRunes: minRunes, maxRunes: maxRunes}
}
// push appends streamed content and returns any clauses that are now complete —
// "complete" meaning confirmed by following content, so we never speak a clause
// that the next token might extend. Incomplete trailing text stays buffered.
func (c *clauseChunker) push(text string) []string {
c.buf.WriteString(text)
return c.drain(false)
}
// flush returns the remaining buffered clauses, treating end-of-input as a hard
// boundary, and clears the buffer.
func (c *clauseChunker) flush() []string {
return c.drain(true)
}
func (c *clauseChunker) drain(final bool) []string {
s := c.buf.String()
rest := s
var out []string
for rest != "" {
end, ok := c.nextBoundary(rest, final)
if !ok {
break
}
if seg := strings.TrimSpace(rest[:end]); seg != "" {
out = append(out, seg)
}
rest = rest[end:]
}
// Rewriting the builder reallocates and copies the whole buffer; skip it on
// the common per-token call where no boundary was confirmed.
if len(rest) != len(s) {
c.buf.Reset()
c.buf.WriteString(rest)
}
return out
}
// nextBoundary returns the byte offset just past the first emittable clause in
// s, or ok=false when more input is needed (final=false) and no boundary is
// confirmed yet.
func (c *clauseChunker) nextBoundary(s string, final bool) (int, bool) {
if s == "" {
return 0, false
}
// 1) UAX #29 sentence boundary. When the first sentence is followed by more
// text it is a confirmed complete sentence (handles Latin .!? with
// abbreviation/decimal guards, and CJK 。!? with no whitespace).
sentence, rest, _ := uniseg.FirstSentenceInString(s, -1)
if rest != "" {
// Optionally cut finer inside the sentence at a clause boundary.
if cut, ok := c.firstClauseCut(sentence); ok {
return cut, true
}
return len(sentence), true
}
// 2) Unterminated tail: look for a sub-sentence clause boundary (CJK
// punctuation or a Thai/Lao space) confirmed by following content.
if cut, ok := c.firstClauseCut(s); ok {
return cut, true
}
// 3) Over-long punctuation-less run: force a typographically legal break so
// we don't buffer unbounded (e.g. a long CJK run with no punctuation).
if !final && c.maxRunes > 0 && utf8.RuneCountInString(s) > c.maxRunes {
if cut, ok := lineBreakCut(s, c.maxRunes); ok {
return cut, true
}
}
// 4) End of input: emit whatever remains as the final clause.
if final {
return len(s), true
}
return 0, false
}
// firstClauseCut returns the byte offset just past the first sub-sentence clause
// boundary in s — a CJK clause punctuation mark, or a space following a Thai/Lao
// letter — provided the prefix is at least minRunes long and non-space content
// follows. The boundary mark (and any trailing spaces) stay with the left clause.
func (c *clauseChunker) firstClauseCut(s string) (int, bool) {
var prev rune
runes := 0
for i, r := range s {
boundary := isCJKClausePunct(r) || (unicode.IsSpace(r) && isThaiLao(prev))
if boundary && runes+1 >= c.minRunes {
end := i + utf8.RuneLen(r)
for end < len(s) {
nr, sz := utf8.DecodeRuneInString(s[end:])
if !unicode.IsSpace(nr) {
break
}
end += sz
}
if end < len(s) { // confirmed: real content follows the boundary
return end, true
}
// Boundary sits at the end of the buffer with nothing after it yet —
// wait for the next token to confirm it rather than emit early.
return 0, false
}
prev = r
runes++
}
return 0, false
}
// lineBreakCut walks UAX #14 line segments and returns the byte offset of the
// last legal break opportunity at or before maxRunes. Returns ok=false when the
// run has no internal break opportunity (e.g. a space-less Thai run), leaving it
// buffered.
func lineBreakCut(s string, maxRunes int) (int, bool) {
state := -1
rest := s
consumed := 0
runes := 0
for rest != "" {
seg, rem, _, st := uniseg.FirstLineSegmentInString(rest, state)
state = st
runes += utf8.RuneCountInString(seg)
consumed += len(seg)
rest = rem
if runes >= maxRunes {
if consumed < len(s) {
return consumed, true
}
return 0, false
}
}
return 0, false
}
// isCJKClausePunct reports whether r is a CJK clause-level separator worth a
// soft TTS break. Sentence terminators (。!?) are intentionally excluded — UAX
// #29 sentence segmentation already handles those.
func isCJKClausePunct(r rune) bool {
switch r {
case '', // fullwidth comma
'、', // 、 ideographic comma
'', // fullwidth semicolon
'', // fullwidth colon
'・', // ・ katakana middle dot
'・': // ・ halfwidth katakana middle dot
return true
}
return false
}
// isThaiLao reports whether r is a Thai or Lao letter. Those scripts have no
// inter-word spaces; an ASCII space inside such a run marks a clause/sentence
// boundary, which is the only no-dictionary segmentation signal available.
func isThaiLao(r rune) bool {
return unicode.Is(unicode.Thai, r) || unicode.Is(unicode.Lao, r)
}

View File

@@ -0,0 +1,103 @@
package openai
import (
"strings"
"unicode/utf8"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// clauseChunker splits streamed LLM content into speakable clauses in a
// script-aware way: UAX#29 sentences (Latin .!? and CJK 。!?), CJK clause
// punctuation, and Thai/Lao spaces — never whitespace-splitting CJK.
var _ = Describe("clauseChunker", func() {
Context("Latin sentences", func() {
It("emits a sentence only once following content confirms it is complete", func() {
c := newClauseChunker(12, 200)
Expect(c.push("Hello world. How are you?")).To(Equal([]string{"Hello world."}))
// The trailing sentence is held until flush (the next token might extend it).
Expect(c.flush()).To(Equal([]string{"How are you?"}))
})
It("assembles a sentence across many small tokens", func() {
c := newClauseChunker(12, 200)
var got []string
for _, tok := range []string{"Hello", " world.", " How", " are", " you?"} {
got = append(got, c.push(tok)...)
}
got = append(got, c.flush()...)
Expect(got).To(Equal([]string{"Hello world.", "How are you?"}))
})
It("does not split decimals or abbreviations (UAX#29 SB6)", func() {
c := newClauseChunker(12, 200)
got := c.push("Pi is 3.14 and e is 2.72. Done")
Expect(got).To(Equal([]string{"Pi is 3.14 and e is 2.72."}))
Expect(c.flush()).To(Equal([]string{"Done"}))
})
})
Context("CJK (no whitespace)", func() {
It("splits Chinese on the ideographic full stop", func() {
c := newClauseChunker(12, 200)
Expect(c.push("你好世界。今天天气很好。")).To(Equal([]string{"你好世界。"}))
Expect(c.flush()).To(Equal([]string{"今天天气很好。"}))
})
It("splits Japanese on the ideographic full stop", func() {
c := newClauseChunker(12, 200)
Expect(c.push("こんにちは。元気ですか。")).To(Equal([]string{"こんにちは。"}))
Expect(c.flush()).To(Equal([]string{"元気ですか。"}))
})
It("splits on CJK clause punctuation for lower latency", func() {
c := newClauseChunker(2, 200) // small min so short test clauses cut
Expect(c.push("你好,世界。再见")).To(Equal([]string{"你好,", "世界。"}))
Expect(c.flush()).To(Equal([]string{"再见"}))
})
})
Context("Thai (spaces mark clauses, not words)", func() {
It("splits a Thai run on the inter-clause space", func() {
c := newClauseChunker(2, 200)
Expect(c.push("สวัสดีครับ กินข้าวไหม")).To(Equal([]string{"สวัสดีครับ"}))
Expect(c.flush()).To(Equal([]string{"กินข้าวไหม"}))
})
It("never shatters a space-less Thai run into characters", func() {
c := newClauseChunker(2, 200)
Expect(c.push("สวัสดีครับ")).To(BeEmpty()) // held, no boundary
Expect(c.flush()).To(Equal([]string{"สวัสดีครับ"}))
})
})
Context("length cap (UAX#14 fallback)", func() {
It("force-breaks an over-long punctuation-less CJK run at legal points", func() {
c := newClauseChunker(4, 10) // maxRunes = 10
run := strings.Repeat("字", 25)
got := c.push(run)
got = append(got, c.flush()...)
total := 0
for _, seg := range got {
n := utf8.RuneCountInString(seg)
Expect(n).To(BeNumerically("<=", 10)) // never exceeds the cap
total += n
}
Expect(total).To(Equal(25)) // nothing dropped
Expect(len(got)).To(BeNumerically(">=", 3)) // 10 + 10 + 5
})
})
Context("buffer lifecycle", func() {
It("flush clears the buffer so the chunker is reusable", func() {
c := newClauseChunker(12, 200)
// "First one." is confirmed by the following "Second", so push drains it;
// only the unterminated tail remains for flush.
Expect(c.push("First one. Second")).To(Equal([]string{"First one."}))
Expect(c.flush()).To(Equal([]string{"Second"}))
Expect(c.flush()).To(BeEmpty())
Expect(c.push("Again. More")).To(Equal([]string{"Again."}))
})
})
})

View File

@@ -0,0 +1,138 @@
package openai
import (
"context"
"strings"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/grpc/proto"
)
// fakeTransport records the server events and audio sent to a realtime client
// so streaming behaviour can be asserted without a real WebSocket/WebRTC peer.
// It is not a *WebRTCTransport, so handler code takes the WebSocket path.
type fakeTransport struct {
events []types.ServerEvent
audio []fakeAudioChunk
}
type fakeAudioChunk struct {
pcm []byte
sampleRate int
}
func (f *fakeTransport) SendEvent(e types.ServerEvent) error {
f.events = append(f.events, e)
return nil
}
func (f *fakeTransport) ReadEvent() ([]byte, error) { return nil, nil }
func (f *fakeTransport) SendAudio(_ context.Context, pcm []byte, sampleRate int) error {
f.audio = append(f.audio, fakeAudioChunk{pcm: pcm, sampleRate: sampleRate})
return nil
}
func (f *fakeTransport) Close() error { return nil }
// countEvents returns how many recorded events have the given type.
func (f *fakeTransport) countEvents(et types.ServerEventType) int {
n := 0
for _, e := range f.events {
if e.ServerEventType() == et {
n++
}
}
return n
}
// transcriptDeltaText concatenates the Delta of every recorded transcript
// delta event — i.e. the text streamed to the client as it is generated.
func (f *fakeTransport) transcriptDeltaText() string {
var b strings.Builder
for _, e := range f.events {
if d, ok := e.(types.ResponseOutputAudioTranscriptDeltaEvent); ok {
b.WriteString(d.Delta)
}
}
return b.String()
}
// fakeModel is a configurable Model double. TTSStream replays ttsStreamChunks
// and TranscribeStream replays transcribeDeltas, so the handler's streaming
// paths can be driven deterministically.
type fakeModel struct {
cfg *config.ModelConfig
ttsFile string
ttsStreamChunks [][]byte
ttsStreamRate int
ttsStreamErr error
transcribeDeltas []string
transcribeFinal *schema.TranscriptionResult
// Predict streaming: predictTokens are replayed through the token callback
// (simulating streamed LLM output); predictResp/predictErr are returned by
// the deferred predict function. predictChunkDeltas, when set, are delivered
// per-token via TokenUsage.ChatDeltas to exercise the autoparser path.
predictTokens []string
predictChunkDeltas [][]*proto.ChatDelta
predictResp backend.LLMResponse
predictErr error
}
func (m *fakeModel) VAD(context.Context, *schema.VADRequest) (*schema.VADResponse, error) {
return nil, nil
}
func (m *fakeModel) Transcribe(context.Context, string, string, bool, bool, string) (*schema.TranscriptionResult, error) {
return m.transcribeFinal, nil
}
func (m *fakeModel) Predict(_ context.Context, _ schema.Messages, _, _, _ []string, cb func(string, backend.TokenUsage) bool, _ []types.ToolUnion, _ *types.ToolChoiceUnion, _, _ *int, _ map[string]float64) (func() (backend.LLMResponse, error), error) {
if m.predictErr != nil {
return nil, m.predictErr
}
return func() (backend.LLMResponse, error) {
for i, tok := range m.predictTokens {
if cb == nil {
continue
}
usage := backend.TokenUsage{}
if i < len(m.predictChunkDeltas) {
usage.ChatDeltas = m.predictChunkDeltas[i]
}
cb(tok, usage)
}
return m.predictResp, nil
}, nil
}
func (m *fakeModel) TTS(context.Context, string, string, string) (string, *proto.Result, error) {
return m.ttsFile, &proto.Result{Success: true}, nil
}
func (m *fakeModel) TTSStream(_ context.Context, _, _, _ string, onAudio func(pcm []byte, sampleRate int) error) error {
if m.ttsStreamErr != nil {
return m.ttsStreamErr
}
for _, c := range m.ttsStreamChunks {
if err := onAudio(c, m.ttsStreamRate); err != nil {
return err
}
}
return nil
}
func (m *fakeModel) TranscribeStream(_ context.Context, _, _ string, _, _ bool, _ string, onDelta func(text string)) (*schema.TranscriptionResult, error) {
for _, d := range m.transcribeDeltas {
onDelta(d)
}
return m.transcribeFinal, nil
}
func (m *fakeModel) PredictConfig() *config.ModelConfig { return m.cfg }

View File

@@ -3,6 +3,7 @@ package openai
import (
"context"
"crypto/rand"
"encoding/binary"
"encoding/hex"
"encoding/json"
"fmt"
@@ -87,6 +88,14 @@ func (m *transcriptOnlyModel) TTS(ctx context.Context, text, voice, language str
return "", nil, fmt.Errorf("TTS not supported in transcript-only mode")
}
func (m *transcriptOnlyModel) TTSStream(ctx context.Context, text, voice, language string, onAudio func(pcm []byte, sampleRate int) error) error {
return fmt.Errorf("TTS not supported in transcript-only mode")
}
func (m *transcriptOnlyModel) TranscribeStream(ctx context.Context, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error) {
return transcribeStream(ctx, m.modelLoader, *m.TranscriptionConfig, m.appConfig, audio, language, translate, diarize, prompt, onDelta)
}
func (m *transcriptOnlyModel) PredictConfig() *config.ModelConfig {
return nil
}
@@ -321,10 +330,75 @@ func (m *wrappedModel) TTS(ctx context.Context, text, voice, language string) (s
return backend.ModelTTS(ctx, text, voice, language, "", nil, m.modelLoader, m.appConfig, *m.TTSConfig)
}
func (m *wrappedModel) TTSStream(ctx context.Context, text, voice, language string, onAudio func(pcm []byte, sampleRate int) error) error {
return ttsStream(ctx, m.modelLoader, m.appConfig, *m.TTSConfig, text, voice, language, onAudio)
}
func (m *wrappedModel) TranscribeStream(ctx context.Context, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error) {
return transcribeStream(ctx, m.modelLoader, *m.TranscriptionConfig, m.appConfig, audio, language, translate, diarize, prompt, onDelta)
}
func (m *wrappedModel) PredictConfig() *config.ModelConfig {
return m.LLMConfig
}
// wavStreamHeaderBytes is the size of the WAV header that backend.ModelTTSStream
// emits as its first audio callback; the sample rate lives at byte offset 24.
const wavStreamHeaderBytes = 44
// ttsStream adapts backend.ModelTTSStream (which emits a WAV stream: a 44-byte
// header carrying the sample rate, then raw PCM) to the realtime onAudio
// callback, which wants raw PCM plus the sample rate. The header is buffered
// until complete, the sample rate is read from it, and subsequent bytes are
// forwarded as PCM.
func ttsStream(ctx context.Context, ml *model.ModelLoader, appConfig *config.ApplicationConfig, ttsConfig config.ModelConfig, text, voice, language string, onAudio func(pcm []byte, sampleRate int) error) error {
var header []byte
headerDone := false
sampleRate := 0
return backend.ModelTTSStream(ctx, text, voice, language, "", nil, ml, appConfig, ttsConfig, func(b []byte) error {
if headerDone {
if len(b) == 0 {
return nil
}
return onAudio(b, sampleRate)
}
header = append(header, b...)
if len(header) < wavStreamHeaderBytes {
return nil
}
sampleRate = int(binary.LittleEndian.Uint32(header[24:28]))
headerDone = true
if len(header) > wavStreamHeaderBytes {
return onAudio(header[wavStreamHeaderBytes:], sampleRate)
}
return nil
})
}
// transcribeStream adapts backend.ModelTranscriptionStream to the realtime
// onDelta callback, returning the final aggregated transcription result.
func transcribeStream(ctx context.Context, ml *model.ModelLoader, transcriptionConfig config.ModelConfig, appConfig *config.ApplicationConfig, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error) {
var final *schema.TranscriptionResult
err := backend.ModelTranscriptionStream(ctx, backend.TranscriptionRequest{
Audio: audio,
Language: language,
Translate: translate,
Diarize: diarize,
Prompt: prompt,
}, ml, transcriptionConfig, appConfig, func(chunk backend.TranscriptionStreamChunk) {
if chunk.Delta != "" {
onDelta(chunk.Delta)
}
if chunk.Final != nil {
final = chunk.Final
}
})
if err != nil {
return nil, err
}
return final, nil
}
func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) (Model, *config.ModelConfig, error) {
cfgVAD, err := cl.LoadModelConfigFileByName(pipeline.VAD, ml.ModelPath)
if err != nil {
@@ -454,8 +528,10 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
return nil, fmt.Errorf("failed to validate config: %w", err)
}
// Let the pipeline set the LLM's reasoning effort (cfgLLM is a per-session copy).
// Let the pipeline set the LLM's reasoning effort and force thinking off
// (cfgLLM is a per-session copy). disable_thinking applies after the effort.
applyPipelineReasoning(cfgLLM, *pipeline)
applyPipelineThinking(cfgLLM, *pipeline)
cfgTTS, err := cl.LoadModelConfigFileByName(pipeline.TTS, ml.ModelPath)
if err != nil {

View File

@@ -0,0 +1,102 @@
package openai
import (
"context"
"encoding/base64"
"fmt"
"os"
"path/filepath"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
laudio "github.com/mudler/LocalAI/pkg/audio"
"github.com/mudler/LocalAI/pkg/sound"
)
// emitSpeech synthesizes text and sends the audio to the client. When the
// pipeline opts into TTS streaming it forwards each PCM chunk as its own
// response.output_audio.delta as soon as the backend produces it; otherwise it
// synthesizes the whole utterance and sends it as a single delta.
//
// It deliberately does NOT emit transcript or audio-done events: the caller owns
// those so a streamed reply can be split into several spoken segments that share
// one response/item.
//
// It returns the PCM audio (at the session output rate) accumulated across all
// chunks, which the caller base64-encodes onto the conversation item. For WebRTC
// the audio goes over the RTP track instead, so the returned slice is empty.
func emitSpeech(ctx context.Context, t Transport, session *Session, responseID, itemID, text string) ([]byte, error) {
if text == "" {
return nil, nil
}
_, isWebRTC := t.(*WebRTCTransport)
var wsAudio []byte // PCM at the session output rate, accumulated for the item record
// sendChunk hands one PCM buffer to the transport: WebRTC consumes the raw
// PCM directly (it resamples internally); WebSocket gets base64 PCM at the
// session output rate via a JSON delta event.
sendChunk := func(pcm []byte, sampleRate int) error {
if len(pcm) == 0 {
return nil
}
if err := t.SendAudio(ctx, pcm, sampleRate); err != nil {
return err
}
if isWebRTC {
return nil
}
wsPCM := pcm
if sampleRate != 0 && sampleRate != session.OutputSampleRate {
samples := sound.BytesToInt16sLE(pcm)
resampled := sound.ResampleInt16(samples, sampleRate, session.OutputSampleRate)
wsPCM = sound.Int16toBytesLE(resampled)
}
wsAudio = append(wsAudio, wsPCM...)
return t.SendEvent(types.ResponseOutputAudioDeltaEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
ItemID: itemID,
OutputIndex: 0,
ContentIndex: 0,
Delta: base64.StdEncoding.EncodeToString(wsPCM),
})
}
language := ""
if session.InputAudioTranscription != nil {
language = session.InputAudioTranscription.Language
}
if session.ModelConfig != nil && session.ModelConfig.Pipeline.StreamTTS() {
if err := session.ModelInterface.TTSStream(ctx, text, session.Voice, language, sendChunk); err != nil {
return nil, err
}
return wsAudio, nil
}
// Unary fallback: synthesize the whole utterance to a file, then emit once.
audioFilePath, res, err := session.ModelInterface.TTS(ctx, text, session.Voice, language)
if err != nil {
return nil, err
}
if res != nil && !res.Success {
return nil, fmt.Errorf("tts generation failed: %s", res.Message)
}
defer func() { _ = os.Remove(audioFilePath) }()
// filepath.Clean normalizes the backend-produced temp path before reading
// (also keeps gosec G304 quiet — the path is backend-controlled, not user input).
audioBytes, err := os.ReadFile(filepath.Clean(audioFilePath))
if err != nil {
return nil, fmt.Errorf("read tts audio: %w", err)
}
pcm, sampleRate := laudio.ParseWAV(audioBytes)
if sampleRate == 0 {
sampleRate = session.OutputSampleRate
}
if err := sendChunk(pcm, sampleRate); err != nil {
return nil, err
}
return wsAudio, nil
}

View File

@@ -0,0 +1,70 @@
package openai
import (
"context"
"os"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
laudio "github.com/mudler/LocalAI/pkg/audio"
)
// emitSpeech synthesizes a piece of text and forwards the audio to the client,
// streaming a delta per TTS chunk when the pipeline opts in, or sending the
// whole utterance as one delta otherwise.
var _ = Describe("emitSpeech", func() {
ttsOn := true
streamingSession := func(m Model) *Session {
return &Session{
OutputSampleRate: 24000,
ModelInterface: m,
ModelConfig: &config.ModelConfig{
Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{TTS: &ttsOn}},
},
}
}
It("streams one output_audio.delta per TTS chunk when streaming is enabled", func() {
m := &fakeModel{
ttsStreamChunks: [][]byte{{1, 2}, {3, 4}, {5, 6}},
ttsStreamRate: 24000,
}
t := &fakeTransport{}
audio, err := emitSpeech(context.Background(), t, streamingSession(m), "resp1", "item1", "Hello there.")
Expect(err).ToNot(HaveOccurred())
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioDelta)).To(Equal(3))
// The returned audio is all chunks concatenated (session output rate).
Expect(audio).To(Equal([]byte{1, 2, 3, 4, 5, 6}))
})
It("sends a single output_audio.delta in unary mode", func() {
// A minimal real WAV file for the unary TTS path to read + parse.
f, err := os.CreateTemp("", "emit-*.wav")
Expect(err).ToNot(HaveOccurred())
defer func() { _ = os.Remove(f.Name()) }()
pcm := make([]byte, 320) // 160 samples of silence
hdr := laudio.NewWAVHeader(uint32(len(pcm)))
Expect(hdr.Write(f)).To(Succeed())
_, err = f.Write(pcm)
Expect(err).ToNot(HaveOccurred())
Expect(f.Close()).To(Succeed())
session := &Session{
OutputSampleRate: 24000,
ModelInterface: &fakeModel{ttsFile: f.Name()},
ModelConfig: &config.ModelConfig{}, // streaming off
}
t := &fakeTransport{}
_, err = emitSpeech(context.Background(), t, session, "resp1", "item1", "Hello there.")
Expect(err).ToNot(HaveOccurred())
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioDelta)).To(Equal(1))
})
})

View File

@@ -0,0 +1,315 @@
package openai
import (
"context"
"encoding/base64"
"fmt"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/functions"
"github.com/mudler/LocalAI/pkg/reasoning"
)
// transcriptStreamer turns streamed LLM tokens into the assistant's spoken
// transcript: it strips reasoning incrementally and sends one
// response.output_audio_transcript.delta per content fragment. It does NOT
// synthesize audio — the caller buffers the full message and synthesizes it
// once (streaming the audio chunks when the TTS backend supports TTSStream),
// which works uniformly for streaming and non-streaming TTS and for languages
// without sentence or word boundaries.
type transcriptStreamer struct {
ctx context.Context
t Transport
responseID string
itemID string
extractor *reasoning.ReasoningExtractor
// announce, if set, is invoked once just before the first transcript delta.
// It lets the caller create the assistant item lazily, so a content-less
// tool-call turn never emits a spurious empty assistant item.
announce func()
announced bool
}
func newTranscriptStreamer(ctx context.Context, t Transport, responseID, itemID, thinkingStartToken string, reasoningCfg reasoning.Config) *transcriptStreamer {
return &transcriptStreamer{
ctx: ctx,
t: t,
responseID: responseID,
itemID: itemID,
extractor: reasoning.NewReasoningExtractor(thinkingStartToken, spokenReasoningConfig(reasoningCfg)),
}
}
// onToken handles one streamed unit of model output, sending a transcript delta
// for the new content (reasoning stripped) and returning that content delta so
// the caller can also feed it to the clause chunker. For plain-content models
// the unit is the raw text token; for autoparser tool turns the backend clears
// the text and delivers content via ChatDeltas, so the caller passes that
// content here. Returns "" when the token produced no new spoken content.
func (s *transcriptStreamer) onToken(token string) string {
_, content := s.extractor.ProcessToken(token)
if content == "" {
return ""
}
if !s.announced {
s.announced = true
if s.announce != nil {
s.announce()
}
}
_ = s.t.SendEvent(types.ResponseOutputAudioTranscriptDeltaEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: s.responseID,
ItemID: s.itemID,
OutputIndex: 0,
ContentIndex: 0,
Delta: content,
})
return content
}
// content returns the full transcript so far with reasoning stripped.
func (s *transcriptStreamer) content() string {
return s.extractor.CleanedContent()
}
// streamLLMResponse drives a streamed realtime reply. It streams the assistant
// transcript as the LLM generates, then synthesizes the whole buffered message
// once (streaming the audio chunks when the TTS backend supports it, otherwise a
// single unary delta). Tool calls parsed from the autoparser ChatDeltas are
// emitted after the spoken content. The assistant content item is created lazily
// on the first content delta, so a content-less tool-call turn emits only the
// tool calls. It returns true when it has fully handled the response so the
// caller can return; callers must only invoke it for an audio modality, and with
// tools only when the model uses its tokenizer template (see triggerResponseAtTurn).
func streamLLMResponse(ctx context.Context, session *Session, conv *Conversation, t Transport, responseID string, history schema.Messages, images []string, llmCfg *config.ModelConfig, tools []types.ToolUnion, toolChoice *types.ToolChoiceUnion, toolTurn int) bool {
itemID := generateItemID()
item := types.MessageItemUnion{
Assistant: &types.MessageItemAssistant{
ID: itemID,
Status: types.ItemStatusInProgress,
Content: []types.MessageContentOutput{{Type: types.MessageContentTypeOutputAudio}},
},
}
// announce creates the assistant content item lazily, just before the first
// transcript delta — a tool-only turn never produces content, so it stays out
// of the conversation and the client sees only the tool calls.
announced := false
announce := func() {
announced = true
conv.Lock.Lock()
conv.Items = append(conv.Items, &item)
conv.Lock.Unlock()
sendEvent(t, types.ResponseOutputItemAddedEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
OutputIndex: 0,
Item: item,
})
sendEvent(t, types.ResponseContentPartAddedEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
ItemID: itemID,
OutputIndex: 0,
ContentIndex: 0,
Part: item.Assistant.Content[0],
})
}
cancel := func() {
if announced {
conv.Lock.Lock()
for i := len(conv.Items) - 1; i >= 0; i-- {
if conv.Items[i].Assistant != nil && conv.Items[i].Assistant.ID == itemID {
conv.Items = append(conv.Items[:i], conv.Items[i+1:]...)
break
}
}
conv.Lock.Unlock()
}
sendEvent(t, types.ResponseDoneEvent{
ServerEventBase: types.ServerEventBase{},
Response: types.Response{ID: responseID, Object: "realtime.response", Status: types.ResponseStatusCancelled},
})
}
var template string
if llmCfg.TemplateConfig.UseTokenizerTemplate {
template = llmCfg.GetModelTemplate()
} else {
template = llmCfg.TemplateConfig.Chat
}
thinkingStartToken := reasoning.DetectThinkingStartToken(template, &llmCfg.ReasoningConfig)
// The autoparser (tokenizer-template path) already delivers reasoning-free
// content. Prefilling the thinking start token here would re-tag that clean
// content as an unclosed reasoning block, leaving CleanedContent() empty —
// no spoken reply, no TTS. Disable the prefill; closed tag pairs are still
// stripped (PEG-fallback case, #9985).
reasoningCfg := llmCfg.ReasoningConfig
if llmCfg.TemplateConfig.UseTokenizerTemplate {
disablePrefill := true
reasoningCfg.DisableReasoningTagPrefill = &disablePrefill
}
streamer := newTranscriptStreamer(ctx, t, responseID, itemID, thinkingStartToken, reasoningCfg)
streamer.announce = announce
// Clause chunking (opt-in): synthesize each clause as soon as it completes
// instead of buffering the whole reply. streamedAudio accumulates the PCM
// across clauses for the conversation item record; ttsErr captures the first
// synthesis failure so the token callback can stop the prediction. emitSpeech
// runs synchronously here — the LLM keeps generating into the gRPC stream
// while a clause is synthesized, so audio still starts mid-generation.
var chunker *clauseChunker
if session.ModelConfig != nil && session.ModelConfig.Pipeline.ChunkClauses() {
chunker = newClauseChunker(defaultClauseMinRunes, defaultClauseMaxRunes)
}
var streamedAudio []byte
var ttsErr error
speakClause := func(clause string) error {
a, err := emitSpeech(ctx, t, session, responseID, itemID, clause)
if err != nil {
return err
}
streamedAudio = append(streamedAudio, a...)
return nil
}
// fail reports a mid-stream failure. A cancelled context means the client
// interrupted (barge-in), so roll the turn back instead of erroring.
fail := func(code, msg string, err error) bool {
if ctx.Err() != nil {
cancel()
} else {
sendError(t, code, fmt.Sprintf("%s: %v", msg, err), "", itemID)
}
return true
}
cb := func(token string, usage backend.TokenUsage) bool {
if ctx.Err() != nil {
return false
}
// Plain-content models stream text via the token; autoparser tool turns
// clear the text and deliver content via ChatDeltas, so prefer the latter
// when present. Either way only content reaches the transcript — tool-call
// deltas are parsed from the final response below.
text := token
if len(usage.ChatDeltas) > 0 {
text = functions.ContentFromChatDeltas(usage.ChatDeltas)
}
delta := streamer.onToken(text)
if chunker != nil && delta != "" {
for _, clause := range chunker.push(delta) {
if ttsErr = speakClause(clause); ttsErr != nil {
return false // stop the prediction; reported after predFunc returns
}
}
}
return true
}
predFunc, err := session.ModelInterface.Predict(ctx, history, images, nil, nil, cb, tools, toolChoice, nil, nil, nil)
if err != nil {
sendError(t, "inference_failed", fmt.Sprintf("backend error: %v", err), "", itemID)
return true
}
pred, err := predFunc()
// A clause synthesis failed mid-stream (the callback stopped the prediction);
// report it as a TTS error rather than a prediction error.
if ttsErr != nil {
return fail("tts_error", "TTS generation failed", ttsErr)
}
if err != nil {
return fail("prediction_failed", "backend error", err)
}
if ctx.Err() != nil {
cancel()
return true
}
content := streamer.content()
toolCalls := functions.ToolCallsFromChatDeltas(pred.ChatDeltas)
// Finalize the spoken content item only when the turn produced content. A
// tool-only turn skips this entirely (no empty assistant item).
if content != "" {
if !announced {
announce()
}
// Synthesize the audio. With clause chunking the completed clauses were
// already spoken inside the token callback; flush the trailing clause(s)
// the segmenter was still holding. Otherwise buffer the whole message and
// synthesize it once. emitSpeech streams the audio chunks when the TTS
// backend supports TTSStream, otherwise it sends a single unary delta.
var audio []byte
if chunker != nil {
for _, clause := range chunker.flush() {
if ttsErr = speakClause(clause); ttsErr != nil {
break
}
}
audio = streamedAudio
} else {
audio, ttsErr = emitSpeech(ctx, t, session, responseID, itemID, content)
}
if ttsErr != nil {
return fail("tts_error", "TTS generation failed", ttsErr)
}
_, isWebRTC := t.(*WebRTCTransport)
sendEvent(t, types.ResponseOutputAudioTranscriptDoneEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
ItemID: itemID,
OutputIndex: 0,
ContentIndex: 0,
Transcript: content,
})
if !isWebRTC {
sendEvent(t, types.ResponseOutputAudioDoneEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
ItemID: itemID,
OutputIndex: 0,
ContentIndex: 0,
})
}
conv.Lock.Lock()
item.Assistant.Status = types.ItemStatusCompleted
item.Assistant.Content[0].Transcript = content
if !isWebRTC {
item.Assistant.Content[0].Audio = base64.StdEncoding.EncodeToString(audio)
}
conv.Lock.Unlock()
sendEvent(t, types.ResponseContentPartDoneEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
ItemID: itemID,
OutputIndex: 0,
ContentIndex: 0,
Part: item.Assistant.Content[0],
})
sendEvent(t, types.ResponseOutputItemDoneEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
OutputIndex: 0,
Item: item,
})
}
// Emit any tool calls, the terminal response.done, and (for server-side
// assistant tools) the follow-up turn — shared with the buffered path.
emitToolCallItems(ctx, session, conv, t, responseID, toolCalls, content != "", toolTurn)
return true
}

View File

@@ -0,0 +1,213 @@
package openai
import (
"context"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
"github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/reasoning"
)
// transcriptStreamer turns streamed LLM tokens into incremental transcript
// deltas, stripping reasoning. Audio is synthesized once from the full message
// by the caller, so there is no per-sentence segmentation.
var _ = Describe("transcriptStreamer", func() {
It("emits one transcript delta per content token", func() {
t := &fakeTransport{}
s := newTranscriptStreamer(context.Background(), t, "resp1", "item1", "", reasoning.Config{})
for _, tok := range []string{"Hello", " world.", " Bye"} {
s.onToken(tok)
}
Expect(s.content()).To(Equal("Hello world. Bye"))
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioTranscriptDelta)).To(Equal(3))
Expect(t.transcriptDeltaText()).To(Equal("Hello world. Bye"))
})
It("strips leaked reasoning even when reasoning is disabled (disable_thinking safety net)", func() {
// disable_thinking maps to DisableReasoning=true (enable_thinking=false to
// the backend). If the model emits thinking anyway, the transcript must
// still not leak it: stripping always runs for spoken output.
disable := true
t := &fakeTransport{}
s := newTranscriptStreamer(context.Background(), t, "resp1", "item1", "",
reasoning.Config{DisableReasoning: &disable})
s.onToken("<think>secret plan</think>")
s.onToken("The answer is 42.")
Expect(s.content()).To(Equal("The answer is 42."))
Expect(s.content()).ToNot(ContainSubstring("secret plan"))
Expect(t.transcriptDeltaText()).ToNot(ContainSubstring("secret plan"))
})
It("does not swallow autoparser content when the template has a thinking start token (tokenizer-template path)", func() {
// Regression: with tag prefill on, the detected <think> token is
// prepended to the autoparser's already-clean content, swallowing the
// whole reply (empty transcript → no TTS). streamLLMResponse disables
// the prefill for the tokenizer-template path.
disablePrefill := true
t := &fakeTransport{}
s := newTranscriptStreamer(context.Background(), t, "resp1", "item1", "<think>",
reasoning.Config{DisableReasoningTagPrefill: &disablePrefill})
s.onToken("Hello")
s.onToken(" there.")
Expect(s.content()).To(Equal("Hello there."))
Expect(t.transcriptDeltaText()).To(Equal("Hello there."))
})
It("still strips embedded closed reasoning tags with prefill disabled (PEG-fallback safety, #9985)", func() {
// Disabling prefill must not stop stripping closed <think>…</think>
// pairs the PEG fallback can leave in autoparser content.
disablePrefill := true
t := &fakeTransport{}
s := newTranscriptStreamer(context.Background(), t, "resp1", "item1", "<think>",
reasoning.Config{DisableReasoningTagPrefill: &disablePrefill})
s.onToken("<think>secret</think>")
s.onToken("The answer is 42.")
Expect(s.content()).To(Equal("The answer is 42."))
Expect(t.transcriptDeltaText()).ToNot(ContainSubstring("secret"))
})
})
// streamLLMResponse drives a full streamed realtime turn: live transcript
// deltas while the LLM generates, then the whole message is synthesized once.
var _ = Describe("streamLLMResponse", func() {
It("streams transcript deltas then synthesizes the whole message once", func() {
on := true
m := &fakeModel{
predictTokens: []string{"Hello", " world.", " How are you?"},
predictResp: backend.LLMResponse{Response: "Hello world. How are you?"},
ttsStreamChunks: [][]byte{{9}},
ttsStreamRate: 24000,
}
session := &Session{
OutputSampleRate: 24000,
ModelInterface: m,
ModelConfig: &config.ModelConfig{
Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{LLM: &on, TTS: &on}},
},
}
conv := &Conversation{}
t := &fakeTransport{}
llmCfg := &config.ModelConfig{}
handled := streamLLMResponse(context.Background(), session, conv, t, "resp1", nil, nil, llmCfg, nil, nil, 0)
Expect(handled).To(BeTrue())
// One live transcript delta per streamed token.
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioTranscriptDelta)).To(Equal(3))
// The whole message is synthesized ONCE (not per sentence): a single
// emitSpeech replays the one TTS stream chunk.
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioDelta)).To(Equal(1))
Expect(t.transcriptDeltaText()).To(Equal("Hello world. How are you?"))
})
It("synthesizes each clause as it completes when clause chunking is enabled", func() {
on := true
m := &fakeModel{
predictTokens: []string{"Hello world.", " How are you?"},
predictResp: backend.LLMResponse{Response: "Hello world. How are you?"},
ttsStreamChunks: [][]byte{{9}},
ttsStreamRate: 24000,
}
session := &Session{
OutputSampleRate: 24000,
ModelInterface: m,
ModelConfig: &config.ModelConfig{
Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{LLM: &on, TTS: &on, ClauseChunking: &on}},
},
}
conv := &Conversation{}
t := &fakeTransport{}
llmCfg := &config.ModelConfig{}
handled := streamLLMResponse(context.Background(), session, conv, t, "resp1", nil, nil, llmCfg, nil, nil, 0)
Expect(handled).To(BeTrue())
// Two clauses ("Hello world." mid-stream, "How are you?" on flush) → two
// emitSpeech calls → two audio deltas, vs one for whole-message buffering.
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioDelta)).To(Equal(2))
// The full transcript still streams verbatim.
Expect(t.transcriptDeltaText()).To(Equal("Hello world. How are you?"))
// Exactly one terminal response.done.
Expect(t.countEvents(types.ServerEventTypeResponseDone)).To(Equal(1))
})
It("streams content deltas and emits tool-call items (autoparser tool turn)", func() {
on := true
// Autoparser path: reply.Message is empty; content + tool calls arrive via
// ChatDeltas. Chunk 1 carries content, chunk 2 carries the tool call.
contentDelta := []*proto.ChatDelta{{Content: "Let me check."}}
toolDelta := []*proto.ChatDelta{{ToolCalls: []*proto.ToolCallDelta{{Index: 0, Name: "get_weather", Arguments: `{"city":"Paris"}`}}}}
m := &fakeModel{
predictTokens: []string{"", ""},
predictChunkDeltas: [][]*proto.ChatDelta{contentDelta, toolDelta},
predictResp: backend.LLMResponse{ChatDeltas: append(append([]*proto.ChatDelta{}, contentDelta...), toolDelta...)},
ttsStreamChunks: [][]byte{{9}},
ttsStreamRate: 24000,
}
session := &Session{
OutputSampleRate: 24000,
ModelInterface: m,
ModelConfig: &config.ModelConfig{
Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{LLM: &on, TTS: &on}},
},
}
conv := &Conversation{}
t := &fakeTransport{}
llmCfg := &config.ModelConfig{}
llmCfg.TemplateConfig.UseTokenizerTemplate = true
handled := streamLLMResponse(context.Background(), session, conv, t, "resp1", nil, nil, llmCfg, nil, nil, 0)
Expect(handled).To(BeTrue())
// The spoken content was streamed live.
Expect(t.transcriptDeltaText()).To(Equal("Let me check."))
// The tool call is emitted as a function_call item.
Expect(t.countEvents(types.ServerEventTypeResponseFunctionCallArgumentsDone)).To(Equal(1))
// Exactly one terminal response.done.
Expect(t.countEvents(types.ServerEventTypeResponseDone)).To(Equal(1))
})
It("emits only tool-call items for a content-less tool turn (no empty assistant item)", func() {
on := true
toolDelta := []*proto.ChatDelta{{ToolCalls: []*proto.ToolCallDelta{{Index: 0, Name: "get_weather", Arguments: `{"city":"Rome"}`}}}}
m := &fakeModel{
predictTokens: []string{""},
predictChunkDeltas: [][]*proto.ChatDelta{toolDelta},
predictResp: backend.LLMResponse{ChatDeltas: toolDelta},
}
session := &Session{
OutputSampleRate: 24000,
ModelInterface: m,
ModelConfig: &config.ModelConfig{
Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{LLM: &on, TTS: &on}},
},
}
conv := &Conversation{}
t := &fakeTransport{}
llmCfg := &config.ModelConfig{}
llmCfg.TemplateConfig.UseTokenizerTemplate = true
handled := streamLLMResponse(context.Background(), session, conv, t, "resp1", nil, nil, llmCfg, nil, nil, 0)
Expect(handled).To(BeTrue())
// No content → no transcript deltas and no spurious assistant content item.
Expect(t.transcriptDeltaText()).To(Equal(""))
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioTranscriptDelta)).To(Equal(0))
// The tool call is still emitted.
Expect(t.countEvents(types.ServerEventTypeResponseFunctionCallArgumentsDone)).To(Equal(1))
Expect(t.countEvents(types.ServerEventTypeResponseDone)).To(Equal(1))
})
})

View File

@@ -0,0 +1,33 @@
package openai
import (
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/reasoning"
)
// applyPipelineThinking forces the LLM's reasoning/thinking off when the realtime
// pipeline sets disable_thinking, mapping to the enable_thinking=false backend
// metadata via ReasoningConfig.DisableReasoning. The LLM config passed in is the
// per-session copy returned by the config loader, so this does not affect other
// users of the same model. When the pipeline does not set disable_thinking the
// LLM config is left untouched.
func applyPipelineThinking(llm *config.ModelConfig, pipeline config.Pipeline) {
if llm == nil || !pipeline.ThinkingDisabled() {
return
}
disable := true
llm.ReasoningConfig.DisableReasoning = &disable
}
// spokenReasoningConfig adapts a model's reasoning config for stripping reasoning
// OUT of realtime spoken output. ReasoningConfig.DisableReasoning is overloaded:
// the backend reads it as the "enable_thinking=false" hint (which pipeline
// disable_thinking sets via applyPipelineThinking), but the reasoning extractor
// reads it as "skip stripping, assume there is no reasoning". Honouring the latter
// when extracting for speech would leak raw <think>…</think> whenever the model
// ignores the suppression hint. Spoken output must never contain reasoning, so we
// always strip: clear DisableReasoning while keeping custom tokens/tag pairs.
func spokenReasoningConfig(cfg reasoning.Config) reasoning.Config {
cfg.DisableReasoning = nil
return cfg
}

View File

@@ -0,0 +1,50 @@
package openai
import (
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/reasoning"
)
// applyPipelineThinking lets a realtime pipeline force the LLM's thinking off
// (enable_thinking=false metadata) without editing the LLM model config.
var _ = Describe("applyPipelineThinking", func() {
It("disables reasoning on the LLM config when the pipeline disables thinking", func() {
disable := true
llm := &config.ModelConfig{}
applyPipelineThinking(llm, config.Pipeline{DisableThinking: &disable})
Expect(llm.ReasoningConfig.DisableReasoning).ToNot(BeNil())
Expect(*llm.ReasoningConfig.DisableReasoning).To(BeTrue())
})
It("leaves the LLM config untouched when the pipeline does not set disable_thinking", func() {
llm := &config.ModelConfig{}
applyPipelineThinking(llm, config.Pipeline{})
Expect(llm.ReasoningConfig.DisableReasoning).To(BeNil())
})
})
// spokenReasoningConfig clears DisableReasoning so realtime spoken output always
// strips reasoning, even though disable_thinking sets DisableReasoning=true on the
// LLM config (which the backend reads as enable_thinking=false).
var _ = Describe("spokenReasoningConfig", func() {
It("clears DisableReasoning so the extractor still strips leaked reasoning", func() {
disable := true
out := spokenReasoningConfig(reasoning.Config{DisableReasoning: &disable})
Expect(out.DisableReasoning).To(BeNil())
})
It("preserves the other reasoning settings", func() {
disable := true
out := spokenReasoningConfig(reasoning.Config{
DisableReasoning: &disable,
ThinkingStartTokens: []string{"<reason>"},
TagPairs: []reasoning.TagPair{{Start: "<reason>", End: "</reason>"}},
})
Expect(out.ThinkingStartTokens).To(Equal([]string{"<reason>"}))
Expect(out.TagPairs).To(HaveLen(1))
Expect(out.TagPairs[0].Start).To(Equal("<reason>"))
})
})

View File

@@ -0,0 +1,63 @@
package openai
import (
"context"
"fmt"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
)
// emitTranscription transcribes a committed utterance and emits the transcription
// events for it, returning the final transcript text. With
// pipeline.streaming.transcription enabled it streams each transcript fragment as
// a conversation.item.input_audio_transcription.delta as the backend produces it,
// then a completed event; otherwise it transcribes the whole utterance and emits
// a single completed event. delta and completed events share itemID.
func emitTranscription(ctx context.Context, t Transport, session *Session, itemID, audioPath string) (string, error) {
cfg := session.InputAudioTranscription
if session.ModelConfig != nil && session.ModelConfig.Pipeline.StreamTranscription() {
final, err := session.ModelInterface.TranscribeStream(ctx, audioPath, cfg.Language, false, false, cfg.Prompt, func(delta string) {
_ = t.SendEvent(types.ConversationItemInputAudioTranscriptionDeltaEvent{
ServerEventBase: types.ServerEventBase{EventID: "event_TODO"},
ItemID: itemID,
ContentIndex: 0,
Delta: delta,
})
})
if err != nil {
return "", err
}
transcript := ""
if final != nil {
transcript = final.Text
}
if err := t.SendEvent(types.ConversationItemInputAudioTranscriptionCompletedEvent{
ServerEventBase: types.ServerEventBase{EventID: "event_TODO"},
ItemID: itemID,
ContentIndex: 0,
Transcript: transcript,
}); err != nil {
return "", err
}
return transcript, nil
}
// Unary fallback: transcribe the whole utterance, emit one completed event.
tr, err := session.ModelInterface.Transcribe(ctx, audioPath, cfg.Language, false, false, cfg.Prompt)
if err != nil {
return "", err
}
if tr == nil {
return "", fmt.Errorf("transcribe result is nil")
}
if err := t.SendEvent(types.ConversationItemInputAudioTranscriptionCompletedEvent{
ServerEventBase: types.ServerEventBase{EventID: "event_TODO"},
ItemID: itemID,
ContentIndex: 0,
Transcript: tr.Text,
}); err != nil {
return "", err
}
return tr.Text, nil
}

View File

@@ -0,0 +1,54 @@
package openai
import (
"context"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
"github.com/mudler/LocalAI/core/schema"
)
// emitTranscription transcribes a committed utterance, streaming transcript text
// deltas when the pipeline opts in, and returns the final transcript text.
var _ = Describe("emitTranscription", func() {
It("streams transcription deltas then a completed event when streaming is enabled", func() {
on := true
session := &Session{
InputAudioTranscription: &types.AudioTranscription{},
ModelConfig: &config.ModelConfig{
Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{Transcription: &on}},
},
ModelInterface: &fakeModel{
transcribeDeltas: []string{"Hel", "lo", " world"},
transcribeFinal: &schema.TranscriptionResult{Text: "Hello world"},
},
}
t := &fakeTransport{}
transcript, err := emitTranscription(context.Background(), t, session, "item1", "/tmp/x.wav")
Expect(err).ToNot(HaveOccurred())
Expect(transcript).To(Equal("Hello world"))
Expect(t.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionDelta)).To(Equal(3))
Expect(t.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionCompleted)).To(Equal(1))
})
It("emits a single completed event with no deltas in unary mode", func() {
session := &Session{
InputAudioTranscription: &types.AudioTranscription{},
ModelConfig: &config.ModelConfig{}, // streaming off
ModelInterface: &fakeModel{transcribeFinal: &schema.TranscriptionResult{Text: "Hi"}},
}
t := &fakeTransport{}
transcript, err := emitTranscription(context.Background(), t, session, "item1", "/tmp/x.wav")
Expect(err).ToNot(HaveOccurred())
Expect(transcript).To(Equal("Hi"))
Expect(t.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionDelta)).To(Equal(0))
Expect(t.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionCompleted)).To(Equal(1))
})
})

View File

@@ -48,7 +48,8 @@ func RealtimeCalls(application *application.Application) echo.HandlerFunc {
return c.JSON(http.StatusInternalServerError, map[string]string{"error": "codec registration failed"})
}
api := webrtc.NewAPI(webrtc.WithMediaEngine(m))
se := webRTCSettingEngine(application.ApplicationConfig())
api := webrtc.NewAPI(webrtc.WithMediaEngine(m), webrtc.WithSettingEngine(se))
pc, err := api.NewPeerConnection(webrtc.Configuration{})
if err != nil {

View File

@@ -0,0 +1,47 @@
package openai
import (
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/xlog"
"github.com/pion/webrtc/v4"
)
// webRTCSettingEngine builds the pion SettingEngine for /v1/realtime WebRTC.
//
// With a default (empty) SettingEngine, pion gathers a host ICE candidate for
// every local interface. Under Docker host networking that includes bridge
// addresses (docker0/veth, 172.x) that a remote browser cannot route to; the
// connection often establishes on a good pair and then drops once ICE consent
// checks fail on the unreachable ones. The two opt-in knobs below let an
// operator advertise only the reachable address.
func webRTCSettingEngine(cfg *config.ApplicationConfig) webrtc.SettingEngine {
s := webrtc.SettingEngine{}
if cfg == nil {
return s
}
if len(cfg.WebRTCNAT1To1IPs) > 0 {
s.SetNAT1To1IPs(cfg.WebRTCNAT1To1IPs, webrtc.ICECandidateTypeHost)
xlog.Debug("realtime webrtc: advertising NAT 1:1 host IPs", "ips", cfg.WebRTCNAT1To1IPs)
}
if filter := iceInterfaceFilter(cfg.WebRTCICEInterfaces); filter != nil {
s.SetInterfaceFilter(filter)
xlog.Debug("realtime webrtc: restricting ICE interfaces", "interfaces", cfg.WebRTCICEInterfaces)
}
return s
}
// iceInterfaceFilter returns an interface allow-list predicate for pion, or nil
// when no interfaces are configured (pion's default: gather from all).
func iceInterfaceFilter(allowed []string) func(string) bool {
if len(allowed) == 0 {
return nil
}
set := make(map[string]struct{}, len(allowed))
for _, name := range allowed {
set[name] = struct{}{}
}
return func(iface string) bool {
_, ok := set[iface]
return ok
}
}

View File

@@ -0,0 +1,39 @@
package openai
import (
"github.com/mudler/LocalAI/core/config"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("webRTC ICE settings", func() {
Describe("iceInterfaceFilter", func() {
It("returns nil when no interfaces are configured", func() {
Expect(iceInterfaceFilter(nil)).To(BeNil())
Expect(iceInterfaceFilter([]string{})).To(BeNil())
})
It("admits only the configured interfaces", func() {
f := iceInterfaceFilter([]string{"eth0", "wlan0"})
Expect(f).NotTo(BeNil())
Expect(f("eth0")).To(BeTrue())
Expect(f("wlan0")).To(BeTrue())
Expect(f("docker0")).To(BeFalse())
Expect(f("veth123")).To(BeFalse())
})
})
Describe("webRTCSettingEngine", func() {
It("does not panic on a nil config", func() {
Expect(func() { webRTCSettingEngine(nil) }).NotTo(Panic())
})
It("builds an engine with NAT 1:1 IPs and an interface filter configured", func() {
cfg := &config.ApplicationConfig{
WebRTCNAT1To1IPs: []string{"192.168.1.10"},
WebRTCICEInterfaces: []string{"eth0"},
}
Expect(func() { webRTCSettingEngine(cfg) }).NotTo(Panic())
})
})
})

View File

@@ -1356,7 +1356,7 @@ func handleOpenResponsesNonStream(c echo.Context, responseID string, createdAt i
thinkingStartToken := reason.DetectThinkingStartToken(template, &cfg.ReasoningConfig)
// Extract reasoning from result before cleaning
reasoningContent, cleanedResult := reason.ExtractReasoningWithConfig(result, thinkingStartToken, cfg.ReasoningConfig)
reasoningContent, cleanedResult := reason.ExtractReasoningComplete(result, thinkingStartToken, cfg.ReasoningConfig)
// Parse tool calls if using functions
var outputItems []schema.ORItemField
@@ -1996,7 +1996,7 @@ func handleOpenResponsesStream(c echo.Context, responseID string, createdAt int6
finalCleanedResult = extractor.CleanedContent()
}
if finalReasoning == "" && finalCleanedResult == "" {
finalReasoning, finalCleanedResult = reason.ExtractReasoningWithConfig(result, thinkingStartToken, cfg.ReasoningConfig)
finalReasoning, finalCleanedResult = reason.ExtractReasoningComplete(result, thinkingStartToken, cfg.ReasoningConfig)
}
// Close reasoning item if it exists and wasn't closed yet
@@ -2493,7 +2493,7 @@ func handleOpenResponsesStream(c echo.Context, responseID string, createdAt int6
finalCleanedResult = extractor.CleanedContent()
}
if finalReasoning == "" && finalCleanedResult == "" {
finalReasoning, finalCleanedResult = reason.ExtractReasoningWithConfig(result, thinkingStartToken, cfg.ReasoningConfig)
finalReasoning, finalCleanedResult = reason.ExtractReasoningComplete(result, thinkingStartToken, cfg.ReasoningConfig)
}
// Close reasoning item if it exists and wasn't closed yet

View File

@@ -216,6 +216,12 @@ export function useChat(initialModel = '') {
audio_url: { url: `data:${file.type};base64,${file.base64}` },
})
userFiles.push({ name: file.name, type: 'audio' })
} else if (file.type?.startsWith('video/')) {
messageContent.push({
type: 'video_url',
video_url: { url: `data:${file.type};base64,${file.base64}` },
})
userFiles.push({ name: file.name, type: 'video' })
} else {
// Text/PDF files - append to content
if (file.textContent) {

View File

@@ -506,7 +506,10 @@ export default function Backends() {
<tbody>
{backends.map((b, idx) => {
const op = getBackendOp(b)
const isProcessing = !!op
// A failed op is intentionally kept in the operations list so the
// OperationsBar can surface the error + Dismiss; it must NOT render
// as a perpetual "Installing..." spinner here (mirrors Models.jsx).
const isProcessing = !!op && !op.error
const isExpanded = expandedRow === idx
return (

View File

@@ -265,7 +265,7 @@ function UserMessageContent({ content, files }) {
<div className="chat-message-files">
{files.map((f, i) => (
<span key={i} className="chat-file-inline">
<i className={`fas ${f.type === 'image' ? 'fa-image' : f.type === 'audio' ? 'fa-headphones' : 'fa-file'}`} />
<i className={`fas ${f.type === 'image' ? 'fa-image' : f.type === 'audio' ? 'fa-headphones' : f.type === 'video' ? 'fa-film' : 'fa-file'}`} />
{f.name}
</span>
))}
@@ -274,6 +274,9 @@ function UserMessageContent({ content, files }) {
{Array.isArray(content) && content.filter(c => c.type === 'image_url').map((img, i) => (
<img key={i} src={img.image_url.url} alt="attached" className="chat-inline-image" />
))}
{Array.isArray(content) && content.filter(c => c.type === 'video_url').map((vid, i) => (
<video key={i} src={vid.video_url.url} controls className="chat-inline-video" />
))}
</>
)
}
@@ -711,7 +714,7 @@ export default function Chat() {
for (const file of e.target.files) {
const base64 = await fileToBase64(file)
const entry = { name: file.name, type: file.type, base64 }
if (!file.type.startsWith('image/') && !file.type.startsWith('audio/')) {
if (!file.type.startsWith('image/') && !file.type.startsWith('audio/') && !file.type.startsWith('video/')) {
entry.textContent = await file.text().catch(() => '')
}
newFiles.push(entry)
@@ -1244,7 +1247,7 @@ export default function Chat() {
<div className="chat-files">
{files.map((f, i) => (
<span key={i} className="chat-file-badge">
<i className={`fas ${f.type?.startsWith('image/') ? 'fa-image' : f.type?.startsWith('audio/') ? 'fa-headphones' : 'fa-file'}`} />
<i className={`fas ${f.type?.startsWith('image/') ? 'fa-image' : f.type?.startsWith('audio/') ? 'fa-headphones' : f.type?.startsWith('video/') ? 'fa-film' : 'fa-file'}`} />
{f.name}
<button onClick={() => setFiles(prev => prev.filter((_, idx) => idx !== i))}>
<i className="fas fa-xmark" />
@@ -1343,7 +1346,7 @@ export default function Chat() {
ref={fileInputRef}
type="file"
multiple
accept="image/*,audio/*,application/pdf,.txt,.md,.csv,.json"
accept="image/*,audio/*,video/*,application/pdf,.txt,.md,.csv,.json"
style={{ display: 'none' }}
onChange={handleFileChange}
/>

View File

@@ -12,6 +12,7 @@ import ActionMenu from '../components/ActionMenu'
import ResourceRow, { ChevronCell, IconCell, StopPropagationCell } from '../components/ResourceRow'
import { useModels } from '../hooks/useModels'
import { useGalleryEnrichment } from '../hooks/useGalleryEnrichment'
import { useOperations } from '../hooks/useOperations'
import { backendControlApi, modelsApi, backendsApi, systemApi, nodesApi } from '../utils/api'
import { renderMarkdown } from '../utils/markdown'
import { safeHref } from '../utils/url'
@@ -126,6 +127,7 @@ export default function Manage() {
const [activeTab, setActiveTab] = useState(TABS.some(tab => tab.key === initialTab) ? initialTab : 'models')
const { models, loading: modelsLoading, refetch: refetchModels } = useModels()
const { enrichModel, enrichBackend } = useGalleryEnrichment()
const { operations } = useOperations()
const [loadedModelIds, setLoadedModelIds] = useState(new Set())
const [backends, setBackends] = useState([])
const [backendsLoading, setBackendsLoading] = useState(true)
@@ -258,14 +260,19 @@ export default function Manage() {
return `${m}m ago`
})()
// Fetch available backend upgrades
// Refresh installed backends + available upgrades when the Backends tab opens
// AND whenever a backend operation settles (operations.length changes as a
// reinstall/upgrade completes and drops off the list). Without the op-settle
// refresh the installed-version cell and the "update available" badge stay
// stale after an upgrade until the user switches tabs - the op looks like it
// "did nothing". Mirrors the operations.length watch Backends.jsx uses.
useEffect(() => {
if (activeTab === 'backends') {
backendsApi.checkUpgrades()
.then(data => setUpgrades(data || {}))
.catch(() => {})
}
}, [activeTab])
if (activeTab !== 'backends') return
fetchBackends()
backendsApi.checkUpgrades()
.then(data => setUpgrades(data || {}))
.catch(() => {})
}, [operations.length, activeTab, fetchBackends])
const handleStopModel = (modelName) => {
setConfirmDialog({

View File

@@ -17,6 +17,24 @@ const STATUS_STYLES = {
error: { icon: 'fa-solid fa-circle', color: 'var(--color-error)', bg: 'var(--color-error-light)' },
}
// upsertAssistant merges a streamed transcript fragment into the assistant entry
// identified by the server's item_id, or appends a new entry if none exists yet.
// Keying by item_id (not a mutable index tracked across handler/updater
// boundaries) makes streamed deltas idempotent and order-independent, so React's
// batching of non-React data-channel events cannot produce a duplicate bubble.
// mode 'append' adds to the running text; 'replace' sets the final transcript.
function upsertAssistant(prev, itemId, text, mode) {
// Only assistant entries carry an id, and the streaming entry is almost
// always the newest — search from the tail so per-delta cost stays constant.
const i = prev.findLastIndex(e => e.id === itemId)
if (i === -1) {
return [...prev, { role: 'assistant', id: itemId, text }]
}
const next = [...prev]
next[i] = { ...next[i], text: mode === 'append' ? next[i].text + text : text }
return next
}
export default function Talk() {
const { addToast } = useOutletContext()
const navigate = useNavigate()
@@ -34,7 +52,10 @@ export default function Talk() {
// Transcript
const [transcript, setTranscript] = useState([])
const streamingRef = useRef(null) // tracks the index of the in-progress assistant message
// item_id of the assistant message currently streaming — used only to remove
// its partial bubble when a response is cancelled (barge-in). The transcript
// itself is keyed by item_id via upsertAssistant, not by this ref.
const inProgressIdRef = useRef(null)
// Session settings
const [instructions, setInstructions] = useState(
@@ -227,39 +248,21 @@ export default function Talk() {
break
case 'conversation.item.input_audio_transcription.completed':
if (event.transcript) {
streamingRef.current = null
setTranscript(prev => [...prev, { role: 'user', text: event.transcript }])
}
updateStatus('thinking', 'Generating response...')
break
case 'response.output_audio_transcript.delta':
if (event.delta) {
setTranscript(prev => {
if (streamingRef.current !== null) {
const updated = [...prev]
updated[streamingRef.current] = {
...updated[streamingRef.current],
text: updated[streamingRef.current].text + event.delta,
}
return updated
}
streamingRef.current = prev.length
return [...prev, { role: 'assistant', text: event.delta }]
})
inProgressIdRef.current = event.item_id
setTranscript(prev => upsertAssistant(prev, event.item_id, event.delta, 'append'))
}
break
case 'response.output_audio_transcript.done':
if (event.transcript) {
setTranscript(prev => {
if (streamingRef.current !== null) {
const updated = [...prev]
updated[streamingRef.current] = { ...updated[streamingRef.current], text: event.transcript }
return updated
}
return [...prev, { role: 'assistant', text: event.transcript }]
})
setTranscript(prev => upsertAssistant(prev, event.item_id, event.transcript, 'replace'))
}
streamingRef.current = null
inProgressIdRef.current = null
break
case 'response.output_audio.delta':
updateStatus('speaking', 'Speaking...')
@@ -281,7 +284,7 @@ export default function Talk() {
// Pretty-print JSON for readability; fall back to raw string.
try { preview = JSON.stringify(JSON.parse(preview), null, 2) } catch (_) { /* keep raw */ }
setTranscript(prev => [...prev, { role: 'tool_result', text: preview }])
streamingRef.current = null // tool result ends the current assistant text run
inProgressIdRef.current = null // tool result ends the current assistant text run
}
break
}
@@ -290,9 +293,20 @@ export default function Talk() {
// conversation.item.create + response.create when it's done.
handleFunctionCall(event)
break
case 'response.done':
case 'response.done': {
// A cancelled response (barge-in / interruption) leaves a partial,
// incrementally-streamed assistant bubble behind. The server discards
// the interrupted item from history; mirror that here (remove the
// in-progress assistant entry by item_id) so the regenerated reply
// doesn't show up as a second assistant message.
if (event.response?.status === 'cancelled' && inProgressIdRef.current) {
const id = inProgressIdRef.current
inProgressIdRef.current = null
setTranscript(prev => prev.filter(e => e.id !== id))
}
updateStatus('listening', 'Listening...')
break
}
case 'error':
hasErrorRef.current = true
updateStatus('error', 'Error: ' + (event.error?.message || 'Unknown error'))
@@ -789,7 +803,7 @@ export default function Talk() {
const iconColor = isToolCall || isToolResult ? 'var(--color-text-secondary)'
: isUser ? 'var(--color-primary)' : 'var(--color-accent)'
return (
<div key={i} style={{ display: 'flex', alignItems: 'flex-start', gap: 'var(--spacing-xs)' }}>
<div key={entry.id || i} style={{ display: 'flex', alignItems: 'flex-start', gap: 'var(--spacing-xs)' }}>
<i className={iconClass} style={{ color: iconColor, marginTop: 3, flexShrink: 0, fontSize: '0.75rem' }} />
<p style={{
margin: 0,

View File

@@ -466,10 +466,11 @@ func (s *AgentPoolService) Chat(name, message string) (string, error) {
s.collectAndCopyMetadata(metadata, chatUserID)
}
content := s.appendLocalAGIKBCitations(response.Response, name, message, response.State)
msg := map[string]any{
"id": messageID + "-agent",
"sender": "agent",
"content": response.Response,
"content": content,
"timestamp": time.Now().Format(time.RFC3339),
}
if len(metadata) > 0 {
@@ -489,6 +490,79 @@ func (s *AgentPoolService) Chat(name, message string) (string, error) {
return messageID, nil
}
func (s *AgentPoolService) appendLocalAGIKBCitations(response, agentKey, message string, states []coreTypes.ActionState) string {
if strings.TrimSpace(response) == "" {
return response
}
userID, collection := splitAgentKey(agentKey)
cfg := s.localAGI.pool.GetConfig(agentKey)
if cfg == nil || !cfg.EnableKnowledgeBase {
return response
}
citations := kbCitationsFromActionStates(states)
if len(citations) == 0 && cfg.KBAutoSearch {
maxResults := cfg.KnowledgeBaseResults
if maxResults <= 0 {
maxResults = 5
}
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
kbResult := agents.KBAutoSearchPrompt(ctx, s.apiURL, s.apiKey, collection, message, maxResults, userID)
citations = kbResult.Citations
}
return agents.AppendKBCitations(response, collection, userID, citations)
}
func splitAgentKey(agentKey string) (userID, name string) {
if uid, n, ok := strings.Cut(agentKey, ":"); ok {
return uid, n
}
return "", agentKey
}
func kbCitationsFromActionStates(states []coreTypes.ActionState) []agents.KBCitation {
var citations []agents.KBCitation
for _, state := range states {
citations = append(citations, kbCitationsFromMetadata(state.Metadata)...)
}
return citations
}
func kbCitationsFromMetadata(metadata map[string]any) []agents.KBCitation {
if len(metadata) == 0 {
return nil
}
fileName := metadata["file_name"]
source := metadata["source"]
if fileName == nil && source == nil {
return nil
}
citation := agents.KBCitation{
FileName: metadataString(fileName),
EntryKey: metadataString(source),
}
if citation.FileName == "" && citation.EntryKey == "" {
return nil
}
return []agents.KBCitation{citation}
}
func metadataString(value any) string {
switch v := value.(type) {
case string:
return v
case fmt.Stringer:
return v.String()
default:
return ""
}
}
// userOutputsDir returns the per-user outputs directory, creating it if needed.
// If userID is empty, falls back to the shared outputs directory.
func (s *AgentPoolService) userOutputsDir(userID string) string {

View File

@@ -0,0 +1,127 @@
package agents
import (
"fmt"
"net/url"
"strings"
"sync"
)
type kbCitationList struct {
mu sync.Mutex
citations []KBCitation
}
func (l *kbCitationList) AddKBCitations(citations []KBCitation) {
if len(citations) == 0 {
return
}
l.mu.Lock()
defer l.mu.Unlock()
l.citations = append(l.citations, citations...)
}
func (l *kbCitationList) Citations() []KBCitation {
l.mu.Lock()
defer l.mu.Unlock()
out := make([]KBCitation, len(l.citations))
copy(out, l.citations)
return out
}
// AppendKBCitations appends a markdown Sources block for KB citations.
func AppendKBCitations(response, collection, userID string, citations []KBCitation) string {
if strings.TrimSpace(response) == "" || len(citations) == 0 {
return response
}
var lines []string
seen := make(map[string]struct{})
for _, citation := range citations {
key := strings.TrimSpace(citation.EntryKey)
if key == "" {
key = strings.TrimSpace(citation.FileName)
}
if key == "" {
continue
}
if _, ok := seen[key]; ok {
continue
}
seen[key] = struct{}{}
displayName := kbCitationDisplayName(citation)
if displayName == "" {
continue
}
sourceURL := kbCitationRawFileURL(collection, citation.EntryKey, userID)
number := len(lines) + 1
if sourceURL == "" {
lines = append(lines, fmt.Sprintf("[%d] %s", number, displayName))
continue
}
lines = append(lines, fmt.Sprintf("[%d] [%s](%s)", number, escapeMarkdownLinkText(displayName), sourceURL))
}
if len(lines) == 0 {
return response
}
var sb strings.Builder
sb.WriteString(strings.TrimRight(response, "\n"))
sb.WriteString("\n\nSources:\n")
for _, line := range lines {
sb.WriteString(line)
sb.WriteString("\n")
}
return strings.TrimRight(sb.String(), "\n")
}
func kbCitationDisplayName(citation KBCitation) string {
if fileName := strings.TrimSpace(citation.FileName); fileName != "" {
return fileName
}
segments := strings.Split(strings.Trim(strings.TrimSpace(citation.EntryKey), "/"), "/")
for i := len(segments) - 1; i >= 0; i-- {
if segment := strings.TrimSpace(segments[i]); segment != "" {
return segment
}
}
return ""
}
func kbCitationRawFileURL(collection, entryKey, userID string) string {
collection = strings.TrimSpace(collection)
entryKey = strings.Trim(strings.TrimSpace(entryKey), "/")
if collection == "" || entryKey == "" {
return ""
}
var escapedEntrySegments []string
for _, segment := range strings.Split(entryKey, "/") {
if segment == "" {
continue
}
escapedEntrySegments = append(escapedEntrySegments, url.PathEscape(segment))
}
if len(escapedEntrySegments) == 0 {
return ""
}
sourceURL := "/api/agents/collections/" + url.PathEscape(collection) + "/entries-raw/" + strings.Join(escapedEntrySegments, "/")
if userID != "" {
query := url.Values{}
query.Set("user_id", userID)
sourceURL += "?" + query.Encode()
}
return sourceURL
}
func escapeMarkdownLinkText(text string) string {
text = strings.ReplaceAll(text, `\`, `\\`)
text = strings.ReplaceAll(text, "[", `\[`)
text = strings.ReplaceAll(text, "]", `\]`)
return text
}

View File

@@ -167,10 +167,12 @@ func ExecuteChatWithLLM(ctx context.Context, llm cogito.LLM, cfg *AgentConfig, m
}
}
kbCitations := &kbCitationList{}
if cfg.EnableKnowledgeBase && (kbMode == KBModeAutoSearch || kbMode == KBModeBoth) {
kbResults := KBAutoSearchPrompt(ctx, effectiveURL, effectiveKey, cfg.Name, message, cfg.KnowledgeBaseResults, userID)
if kbResults != "" {
fragment = fragment.AddMessage(cogito.SystemMessageRole, kbResults)
kbResult := KBAutoSearchPrompt(ctx, effectiveURL, effectiveKey, cfg.Name, message, cfg.KnowledgeBaseResults, userID)
if kbResult.Prompt != "" {
fragment = fragment.AddMessage(cogito.SystemMessageRole, kbResult.Prompt)
kbCitations.AddKBCitations(kbResult.Citations)
}
}
@@ -197,7 +199,7 @@ func ExecuteChatWithLLM(ctx context.Context, llm cogito.LLM, cfg *AgentConfig, m
}
cogitoOpts = append(cogitoOpts, cogito.WithTools(
cogito.NewToolDefinition(
KBSearchMemoryTool{APIURL: effectiveURL, APIKey: effectiveKey, Collection: cfg.Name, MaxResults: kbResults, UserID: userID},
KBSearchMemoryTool{APIURL: effectiveURL, APIKey: effectiveKey, Collection: cfg.Name, MaxResults: kbResults, UserID: userID, CitationCollector: kbCitations},
KBSearchMemoryArgs{},
"search_memory",
"Search the knowledge base for relevant information",
@@ -336,6 +338,8 @@ func ExecuteChatWithLLM(ctx context.Context, llm cogito.LLM, cfg *AgentConfig, m
if cfg.StripThinkingTags && response != "" {
response = stripThinkingTags(response)
}
responseForMemory := response
response = AppendKBCitations(response, cfg.Name, userID, kbCitations.Citations())
// Save conversation to KB when long-term memory is enabled.
// Use a detached context: the parent ctx may be cancelled (e.g. in distributed
@@ -344,7 +348,7 @@ func ExecuteChatWithLLM(ctx context.Context, llm cogito.LLM, cfg *AgentConfig, m
go func() {
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
saveConversationToKB(ctx, llm, effectiveURL, effectiveKey, cfg, message, response, userID)
saveConversationToKB(ctx, llm, effectiveURL, effectiveKey, cfg, message, responseForMemory, userID)
}()
}

View File

@@ -2,6 +2,8 @@ package agents
import (
"context"
"net/http"
"net/http/httptest"
"sync"
"sync/atomic"
@@ -36,6 +38,34 @@ func (m *mockLLM) CreateChatCompletion(ctx context.Context, req openai.ChatCompl
}, cogito.LLMUsage{}, nil
}
type toolCallingMockLLM struct {
createResponses []openai.ChatCompletionResponse
askResponse string
callCount atomic.Int32
}
func (m *toolCallingMockLLM) Ask(ctx context.Context, f cogito.Fragment) (cogito.Fragment, error) {
m.callCount.Add(1)
return f.AddMessage(cogito.AssistantMessageRole, m.askResponse), nil
}
func (m *toolCallingMockLLM) CreateChatCompletion(ctx context.Context, req openai.ChatCompletionRequest) (cogito.LLMReply, cogito.LLMUsage, error) {
idx := int(m.callCount.Add(1)) - 1
if idx >= len(m.createResponses) {
return cogito.LLMReply{
ChatCompletionResponse: openai.ChatCompletionResponse{
Choices: []openai.ChatCompletionChoice{{
Message: openai.ChatCompletionMessage{
Role: "assistant",
Content: "No more tools needed.",
},
}},
},
}, cogito.LLMUsage{}, nil
}
return cogito.LLMReply{ChatCompletionResponse: m.createResponses[idx]}, cogito.LLMUsage{}, nil
}
// statusCollector records status callbacks in a thread-safe way.
type statusCollector struct {
mu sync.Mutex
@@ -73,6 +103,74 @@ var _ = DescribeTable("stripThinkingTags",
Entry("adjacent tag pairs", "<thinking>a</thinking><thinking>b</thinking>", ""),
)
var _ = DescribeTable("appendKBCitations",
func(response, collection, userID string, citations []KBCitation, want string) {
Expect(AppendKBCitations(response, collection, userID, citations)).To(Equal(want))
},
Entry("leaves responses without citations unchanged",
"answer",
"agent",
"",
nil,
"answer",
),
Entry("leaves blank responses unchanged",
"",
"agent",
"",
[]KBCitation{{FileName: "source.pdf", EntryKey: "uuid/source.pdf"}},
"",
),
Entry("appends clickable source links",
"answer",
"my-agent",
"",
[]KBCitation{{FileName: "new feature.pdf", EntryKey: "uuid/new feature.pdf"}},
"answer\n\nSources:\n[1] [new feature.pdf](/api/agents/collections/my-agent/entries-raw/uuid/new%20feature.pdf)",
),
Entry("deduplicates citations by entry key",
"answer",
"agent",
"",
[]KBCitation{
{FileName: "first.pdf", EntryKey: "uuid/shared.pdf"},
{FileName: "second.pdf", EntryKey: "uuid/shared.pdf"},
},
"answer\n\nSources:\n[1] [first.pdf](/api/agents/collections/agent/entries-raw/uuid/shared.pdf)",
),
Entry("uses plain text when entry key is missing",
"answer",
"agent",
"",
[]KBCitation{{FileName: "source.pdf"}},
"answer\n\nSources:\n[1] source.pdf",
),
Entry("uses entry basename when filename is missing",
"answer",
"agent",
"",
[]KBCitation{{EntryKey: "uuid/source.pdf"}},
"answer\n\nSources:\n[1] [source.pdf](/api/agents/collections/agent/entries-raw/uuid/source.pdf)",
),
Entry("adds user id query when present",
"answer",
"agent",
"user 1",
[]KBCitation{{FileName: "source.pdf", EntryKey: "uuid/source.pdf"}},
"answer\n\nSources:\n[1] [source.pdf](/api/agents/collections/agent/entries-raw/uuid/source.pdf?user_id=user+1)",
),
Entry("escapes collection, path segments, and markdown link text",
"answer",
"agent one",
"",
[]KBCitation{{FileName: "source [draft].pdf", EntryKey: "uuid/source [draft].pdf"}},
`answer
Sources:
[1] [source \[draft\].pdf](/api/agents/collections/agent%20one/entries-raw/uuid/source%20%5Bdraft%5D.pdf)`,
),
)
var _ = Describe("ExecuteChatWithLLM", func() {
var (
ctx context.Context
@@ -184,6 +282,150 @@ var _ = Describe("ExecuteChatWithLLM", func() {
})
})
Context("knowledge base citations", func() {
It("appends KB sources to the returned response and callback message", func() {
mux := http.NewServeMux()
mux.HandleFunc("/api/agents/collections/kb-agent/search", func(w http.ResponseWriter, r *http.Request) {
Expect(r.URL.Query().Get("user_id")).To(Equal("user-1"))
w.Header().Set("Content-Type", "application/json")
_, _ = w.Write([]byte(`{
"results": [
{
"content": "KB content",
"id": "result-1",
"similarity": 0.99,
"metadata": {
"file_name": "new feature.pdf",
"source": "uuid/new feature.pdf"
}
}
],
"count": 1
}`))
})
server := httptest.NewServer(mux)
defer server.Close()
var msgContent string
cb.OnMessage = func(sender, content, messageID string) {
msgContent = content
}
llm := &mockLLM{response: "agent reply"}
cfg := &AgentConfig{
Name: "kb-agent",
Model: "test-model",
EnableKnowledgeBase: true,
KBMode: KBModeAutoSearch,
}
result, err := ExecuteChatWithLLM(ctx, llm, cfg, "hello", cb, ExecuteChatOpts{
APIURL: server.URL,
UserID: "user-1",
})
Expect(err).ToNot(HaveOccurred())
Expect(result).To(Equal("agent reply\n\nSources:\n[1] [new feature.pdf](/api/agents/collections/kb-agent/entries-raw/uuid/new%20feature.pdf?user_id=user-1)"))
Expect(msgContent).To(Equal(result))
})
It("collects citations from the search_memory tool", func() {
mux := http.NewServeMux()
mux.HandleFunc("/api/agents/collections/kb-agent/search", func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json")
_, _ = w.Write([]byte(`{
"results": [
{
"content": "Tool KB content",
"id": "result-1",
"similarity": 0.99,
"metadata": {
"file_name": "tool source.pdf",
"source": "uuid/tool source.pdf"
}
}
],
"count": 1
}`))
})
server := httptest.NewServer(mux)
defer server.Close()
collector := &kbCitationList{}
tool := KBSearchMemoryTool{
APIURL: server.URL,
Collection: "kb-agent",
CitationCollector: collector,
}
result, _, err := tool.Run(KBSearchMemoryArgs{Query: "hello"})
Expect(err).ToNot(HaveOccurred())
Expect(result).To(ContainSubstring("Tool KB content"))
Expect(collector.Citations()).To(Equal([]KBCitation{{FileName: "tool source.pdf", EntryKey: "uuid/tool source.pdf"}}))
})
It("appends KB sources found through tools-only search_memory calls", func() {
mux := http.NewServeMux()
mux.HandleFunc("/api/agents/collections/kb-agent/search", func(w http.ResponseWriter, r *http.Request) {
Expect(r.URL.Query().Get("user_id")).To(Equal("user-1"))
w.Header().Set("Content-Type", "application/json")
_, _ = w.Write([]byte(`{
"results": [
{
"content": "Tool KB content",
"id": "result-1",
"similarity": 0.99,
"metadata": {
"file_name": "tool source.pdf",
"source": "uuid/tool source.pdf"
}
}
],
"count": 1
}`))
})
server := httptest.NewServer(mux)
defer server.Close()
llm := &toolCallingMockLLM{
askResponse: "agent reply from tool context",
createResponses: []openai.ChatCompletionResponse{
{
Choices: []openai.ChatCompletionChoice{
{
Message: openai.ChatCompletionMessage{
Role: "assistant",
ToolCalls: []openai.ToolCall{
{
ID: "call-1",
Type: openai.ToolTypeFunction,
Function: openai.FunctionCall{
Name: "search_memory",
Arguments: `{"query":"hello"}`,
},
},
},
},
},
},
},
},
}
cfg := &AgentConfig{
Name: "kb-agent",
Model: "test-model",
EnableKnowledgeBase: true,
KBMode: KBModeTools,
}
result, err := ExecuteChatWithLLM(ctx, llm, cfg, "hello", cb, ExecuteChatOpts{
APIURL: server.URL,
UserID: "user-1",
})
Expect(err).ToNot(HaveOccurred())
Expect(result).To(Equal("agent reply from tool context\n\nSources:\n[1] [tool source.pdf](/api/agents/collections/kb-agent/entries-raw/uuid/tool%20source.pdf?user_id=user-1)"))
})
})
Context("context cancellation", func() {
It("returns an error when context is already cancelled", func() {
cancelledCtx, cancel := context.WithCancel(ctx)

View File

@@ -8,6 +8,7 @@ import (
"io"
"mime/multipart"
"net/http"
"net/url"
"strings"
"time"
@@ -17,10 +18,19 @@ import (
"github.com/mudler/LocalAI/pkg/httpclient"
)
// Metadata keys populated by localrecall for every stored chunk. The original
// upload file name lives under file_name (used for display); source holds the
// collection entry key ("<uuid>/<filename>") used to build the raw-file URL.
const (
kbMetadataFileName = "file_name"
kbMetadataSource = "source"
)
// KBSearchResult represents a search result from the knowledge base.
// Field names mirror the collection search endpoint's JSON response.
type KBSearchResult struct {
Content string `json:"content"`
Score float64 `json:"score"`
ID string `json:"id"`
Similarity float64 `json:"similarity"`
Metadata map[string]string `json:"metadata"`
}
@@ -31,22 +41,48 @@ type kbSearchResponse struct {
Count int `json:"count"`
}
// KBAutoSearchPrompt queries the knowledge base with the user's message
// and returns a system prompt block with relevant results.
// KBCitation is a single source document that a KB search drew from. Citations
// travel alongside the prompt as structured data so the consumer (and UI) can
// render clickable source links, independent of what the model writes inline.
type KBCitation struct {
// FileName is the original uploaded file name, for display (e.g. "report.pdf").
FileName string `json:"file_name"`
// EntryKey is the collection entry identifier ("<uuid>/<filename>"), used to
// build the raw-file URL and as the de-duplication key.
EntryKey string `json:"entry_key"`
}
// KBSearchContext is the result of an auto-search against the knowledge base:
// the system-prompt block to feed the model, plus the de-duplicated list of
// source documents the results were drawn from.
type KBSearchContext struct {
Prompt string `json:"prompt"`
Citations []KBCitation `json:"citations"`
}
// KBCitationCollector receives source citations found during KB searches.
type KBCitationCollector interface {
AddKBCitations([]KBCitation)
}
// KBAutoSearchPrompt queries the knowledge base with the user's message and
// returns a KBSearchContext: a system prompt block with the relevant results
// plus the de-duplicated source citations those results came from.
// Uses LocalAI's collection search endpoint via the API.
func KBAutoSearchPrompt(ctx context.Context, apiURL, apiKey, collection, query string, maxResults int, userID string) string {
func KBAutoSearchPrompt(ctx context.Context, apiURL, apiKey, collection, query string, maxResults int, userID string) KBSearchContext {
if collection == "" || query == "" {
return ""
return KBSearchContext{}
}
if maxResults <= 0 {
maxResults = 5
}
// Call LocalAI's collection search API
searchURL := strings.TrimRight(apiURL, "/") + "/api/agents/collections/" + collection + "/search"
searchURL := strings.TrimRight(apiURL, "/") + "/api/agents/collections/" + url.PathEscape(collection) + "/search"
if userID != "" {
searchURL += "?user_id=" + userID
query := url.Values{}
query.Set("user_id", userID)
searchURL += "?" + query.Encode()
}
reqBody, _ := json.Marshal(map[string]any{
"query": query,
@@ -56,7 +92,7 @@ func KBAutoSearchPrompt(ctx context.Context, apiURL, apiKey, collection, query s
req, err := http.NewRequestWithContext(ctx, http.MethodPost, searchURL, strings.NewReader(string(reqBody)))
if err != nil {
xlog.Warn("KB auto-search: failed to create request", "error", err)
return ""
return KBSearchContext{}
}
req.Header.Set("Content-Type", "application/json")
if apiKey != "" {
@@ -66,41 +102,70 @@ func KBAutoSearchPrompt(ctx context.Context, apiURL, apiKey, collection, query s
resp, err := httpclient.New().Do(req)
if err != nil {
xlog.Warn("KB auto-search: request failed", "error", err)
return ""
return KBSearchContext{}
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(resp.Body)
xlog.Warn("KB auto-search: non-200 response", "status", resp.StatusCode, "body", string(body))
return ""
return KBSearchContext{}
}
var searchResp kbSearchResponse
if err := json.NewDecoder(resp.Body).Decode(&searchResp); err != nil {
xlog.Warn("KB auto-search: failed to decode response", "error", err)
return ""
return KBSearchContext{}
}
if len(searchResp.Results) == 0 {
return ""
return KBSearchContext{}
}
// Format results as a system prompt block (same format as LocalAGI)
// Build the system prompt block, labelling each chunk with its source file
// so the model can attribute inline, and collect the structured citations.
var sb strings.Builder
sb.WriteString("Given the user input you have the following in memory:\n")
for i, r := range searchResp.Results {
sb.WriteString(fmt.Sprintf("- %s", r.Content))
if len(r.Metadata) > 0 {
meta, _ := json.Marshal(r.Metadata)
sb.WriteString(fmt.Sprintf(" (%s)", string(meta)))
var citations []KBCitation
seen := make(map[string]struct{})
for _, r := range searchResp.Results {
fileName := r.Metadata[kbMetadataFileName]
source := r.Metadata[kbMetadataSource]
label := fileName
if label == "" {
label = "unknown"
}
if i < len(searchResp.Results)-1 {
sb.WriteString("\n")
sb.WriteString(fmt.Sprintf("[Source: %s]\n%s\n", label, r.Content))
// Citations are de-duplicated per source document: many chunks from the
// same file share one source key, so a file is listed only once. Skip
// results with no source key — they cannot be linked back to a document.
dedupKey := source
if dedupKey == "" {
dedupKey = fileName
}
if dedupKey == "" {
continue
}
if _, ok := seen[dedupKey]; ok {
continue
}
seen[dedupKey] = struct{}{}
citations = append(citations, KBCitation{
FileName: fileName,
EntryKey: source,
})
}
return sb.String()
sb.WriteString("When answering, cite sources using [Source: filename].")
return KBSearchContext{
Prompt: sb.String(),
Citations: citations,
}
}
// KBSearchMemoryArgs defines the arguments for the search_memory tool.
@@ -110,21 +175,25 @@ type KBSearchMemoryArgs struct {
// KBSearchMemoryTool implements the search_memory MCP tool.
type KBSearchMemoryTool struct {
APIURL string
APIKey string
Collection string
MaxResults int
UserID string
APIURL string
APIKey string
Collection string
MaxResults int
UserID string
CitationCollector KBCitationCollector
}
func (t KBSearchMemoryTool) Run(args KBSearchMemoryArgs) (string, any, error) {
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
result := KBAutoSearchPrompt(ctx, t.APIURL, t.APIKey, t.Collection, args.Query, t.MaxResults, t.UserID)
if result == "" {
if result.Prompt == "" {
return "No results found.", nil, nil
}
return result, nil, nil
if t.CitationCollector != nil {
t.CitationCollector.AddKBCitations(result.Citations)
}
return result.Prompt, nil, nil
}
// KBAddMemoryArgs defines the arguments for the add_memory tool.
@@ -156,9 +225,11 @@ func (t KBAddMemoryTool) Run(args KBAddMemoryArgs) (string, any, error) {
// KBStoreContent uploads text content to a collection via the multipart upload API.
func KBStoreContent(ctx context.Context, apiURL, apiKey, collection, content, userID string) error {
uploadURL := strings.TrimRight(apiURL, "/") + "/api/agents/collections/" + collection + "/upload"
uploadURL := strings.TrimRight(apiURL, "/") + "/api/agents/collections/" + url.PathEscape(collection) + "/upload"
if userID != "" {
uploadURL += "?user_id=" + userID
query := url.Values{}
query.Set("user_id", userID)
uploadURL += "?" + query.Encode()
}
// Build multipart form with the text content as a file

View File

@@ -180,18 +180,21 @@ func (s *GalleryStore) Cancel(id string) error {
return s.UpdateStatus(id, "cancelled", "")
}
// CleanStale marks abandoned in-progress operations as failed.
// Should be called on startup to recover from crashed instances that
// left records in pending/downloading/processing state.
func (s *GalleryStore) CleanStale(age time.Duration) error {
// CleanStale marks abandoned in-progress operations as failed and returns the
// number of rows reaped. Called on startup AND periodically to recover from
// crashed/restarted instances that left records in pending/downloading/
// processing state — an op orphaned after startup would otherwise linger
// "processing" until the next restart.
func (s *GalleryStore) CleanStale(age time.Duration) (int64, error) {
cutoff := time.Now().Add(-age)
return s.db.Model(&GalleryOperationRecord{}).
res := s.db.Model(&GalleryOperationRecord{}).
Where("updated_at < ? AND status IN ?", cutoff, activeStatuses).
Updates(map[string]any{
"status": "failed",
"error": "stale operation cleaned up on startup",
"error": "stale operation reaped (abandoned by a crashed or restarted instance)",
"updated_at": time.Now(),
}).Error
})
return res.RowsAffected, res.Error
}
// CleanOld removes operations older than the given duration.

View File

@@ -71,7 +71,7 @@ func (g *GalleryService) backendHandler(op *ManagementOp[gallery.GalleryBackend,
var err error
if op.Upgrade {
err = g.backendManager.UpgradeBackend(ctx, op.GalleryElementName, progressCallback)
err = g.backendManager.UpgradeBackend(ctx, op.ID, op.GalleryElementName, progressCallback)
} else if op.Delete {
err = g.backendManager.DeleteBackend(op.GalleryElementName)
} else {

View File

@@ -0,0 +1,106 @@
package galleryop_test
import (
"context"
"time"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/services/distributed"
"github.com/mudler/LocalAI/core/services/galleryop"
"github.com/mudler/LocalAI/core/services/testutil"
)
// Reproduces "a cancelled/orphaned op resurrects as 'processing' after a pod
// restart". CancelOperation flipped the in-memory status to cancelled and
// broadcast a NATS event, but never persisted the terminal status to the
// gallery store. On the next replica restart the still-"pending" row hydrated
// straight back into processingBackends and the UI spun again. CancelOperation
// must persist the cancellation so it survives a restart.
var _ = Describe("GalleryService.CancelOperation persistence", func() {
It("persists the cancelled status to the gallery store", func() {
db := testutil.SetupTestDB()
store, err := distributed.NewGalleryStore(db)
Expect(err).ToNot(HaveOccurred())
// Seed an in-flight op as if a replica was mid-install.
Expect(store.Create(&distributed.GalleryOperationRecord{
ID: "op-cancel",
GalleryElementName: "llama-cpp-development",
OpType: "backend_install",
Status: "pending",
Progress: 0,
})).To(Succeed())
svc := galleryop.NewGalleryService(&config.ApplicationConfig{}, nil)
svc.SetGalleryStore(store)
// Make the op locally cancellable so CancelOperation proceeds.
svc.StoreCancellation("op-cancel", context.CancelFunc(func() {}))
Expect(svc.CancelOperation("op-cancel")).To(Succeed())
// The persisted row must now be terminal — otherwise it re-hydrates as
// pending on the next restart.
rec, err := store.Get("op-cancel")
Expect(err).ToNot(HaveOccurred())
Expect(rec.Status).To(Equal("cancelled"))
// And a fresh service hydrating from the store must NOT see it as active.
fresh := galleryop.NewGalleryService(&config.ApplicationConfig{}, nil)
fresh.SetGalleryStore(store)
Expect(fresh.Hydrate()).To(Succeed())
Expect(fresh.GetStatus("op-cancel")).To(BeNil(),
"a cancelled op must not hydrate back as active after a restart")
})
})
// Reproduces "an op orphaned by a replica that died mid-flight stays 'pending'
// forever". CleanStale (which marks abandoned active ops failed) only ran once
// on startup, so an op orphaned AFTER startup was never reaped until the next
// restart. The service must reap stale ops on an interval, not just at boot.
var _ = Describe("GalleryService.ReapStaleOperations", func() {
It("marks abandoned active ops terminal once they pass the age cutoff", func() {
db := testutil.SetupTestDB()
store, err := distributed.NewGalleryStore(db)
Expect(err).ToNot(HaveOccurred())
Expect(store.Create(&distributed.GalleryOperationRecord{
ID: "orphan-op",
GalleryElementName: "llama-cpp-development",
OpType: "backend_install",
Status: "pending",
Progress: 0,
})).To(Succeed())
// Force the row's updated_at into the past so it is older than the cutoff.
Expect(db.Exec(
"UPDATE gallery_operations SET updated_at = ? WHERE id = ?",
time.Now().Add(-1*time.Hour), "orphan-op",
).Error).To(Succeed())
// A fresh, still-progressing op must NOT be reaped.
Expect(store.Create(&distributed.GalleryOperationRecord{
ID: "live-op",
GalleryElementName: "vllm-development",
OpType: "backend_install",
Status: "downloading",
Progress: 50,
})).To(Succeed())
svc := galleryop.NewGalleryService(&config.ApplicationConfig{}, nil)
svc.SetGalleryStore(store)
reaped, err := svc.ReapStaleOperations(30 * time.Minute)
Expect(err).ToNot(HaveOccurred())
Expect(reaped).To(Equal(int64(1)))
orphan, err := store.Get("orphan-op")
Expect(err).ToNot(HaveOccurred())
Expect(orphan.Status).To(Equal("failed"))
live, err := store.Get("live-op")
Expect(err).ToNot(HaveOccurred())
Expect(live.Status).To(Equal("downloading"), "a recently-updated op must not be reaped")
})
})

View File

@@ -20,7 +20,7 @@ type BackendManager interface {
InstallBackend(ctx context.Context, op *ManagementOp[gallery.GalleryBackend, any], progressCb ProgressCallback) error
DeleteBackend(name string) error
ListBackends() (gallery.SystemBackends, error)
UpgradeBackend(ctx context.Context, name string, progressCb ProgressCallback) error
UpgradeBackend(ctx context.Context, opID, name string, progressCb ProgressCallback) error
CheckUpgrades(ctx context.Context) (map[string]gallery.UpgradeInfo, error)
// IsDistributed reports whether installs fan out across worker nodes.
// The HTTP layer uses this to refuse hardware-specific (non-meta) installs

View File

@@ -96,7 +96,10 @@ func (b *LocalBackendManager) ListBackends() (gallery.SystemBackends, error) {
return gallery.ListSystemBackends(b.systemState)
}
func (b *LocalBackendManager) UpgradeBackend(ctx context.Context, name string, progressCb ProgressCallback) error {
// UpgradeBackend ignores opID: a single-node install reports progress through
// the local progressCb already; opID only matters for distributed per-node
// streaming (see DistributedBackendManager.UpgradeBackend).
func (b *LocalBackendManager) UpgradeBackend(ctx context.Context, _ string, name string, progressCb ProgressCallback) error {
return gallery.UpgradeBackend(ctx, b.systemState, b.modelLoader, b.backendGalleries, name, progressCb, b.requireBackendIntegrity)
}

View File

@@ -0,0 +1,107 @@
package galleryop_test
import (
"errors"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/services/galleryop"
)
// These specs reproduce the distributed "Reinstall spins forever" bug:
// processingBackends (the UI spinner source) is built from OpCache.GetStatus,
// which historically returned every cached op unconditionally. Cleanup only
// happened when a client polled /api/backends/job/:uid, but the Manage-page
// Reinstall/Upgrade buttons never poll, so a completed install stayed in
// processingBackends forever. GetStatus must self-evict terminal ops.
var _ = Describe("OpCache.GetStatus eviction", func() {
var (
svc *galleryop.GalleryService
cache *galleryop.OpCache
)
BeforeEach(func() {
svc = galleryop.NewGalleryService(&config.ApplicationConfig{}, nil)
cache = galleryop.NewOpCache(svc)
})
It("keeps an op that is still processing", func() {
cache.SetBackend("llama-cpp", "uuid-1")
svc.UpdateStatus("uuid-1", &galleryop.OpStatus{Message: "processing backend: llama-cpp", Progress: 0})
processing, _ := cache.GetStatus()
Expect(processing).To(HaveKeyWithValue("llama-cpp", "uuid-1"))
Expect(cache.Exists("llama-cpp")).To(BeTrue())
})
It("evicts a completed op so it no longer shows as processing", func() {
cache.SetBackend("llama-cpp", "uuid-1")
svc.UpdateStatus("uuid-1", &galleryop.OpStatus{Processed: true, Progress: 100, Message: "completed"})
processing, _ := cache.GetStatus()
Expect(processing).NotTo(HaveKey("llama-cpp"))
Expect(cache.Exists("llama-cpp")).To(BeFalse())
})
It("keeps a failed op so the operations panel can surface the error and offer Dismiss", func() {
cache.SetBackend("piper", "uuid-2")
svc.UpdateStatus("uuid-2", &galleryop.OpStatus{Processed: true, Error: errors.New("boom")})
processing, _ := cache.GetStatus()
Expect(processing).To(HaveKeyWithValue("piper", "uuid-2"))
Expect(cache.Exists("piper")).To(BeTrue())
})
It("keeps the ErrWorkerStillInstalling soft-path op (worker still installing in background)", func() {
// Processed=true with no error but progress != 100 and a non-"completed"
// message: the worker timed out the NATS round-trip but is still installing.
// Evicting it would hide an install that may still fail; the reconciler
// confirms the real outcome later.
cache.SetBackend("vllm-development", "uuid-soft")
svc.UpdateStatus("uuid-soft", &galleryop.OpStatus{
Processed: true,
Message: "backend vllm-development: worker still installing in background; reconciler will confirm completion",
})
processing, _ := cache.GetStatus()
Expect(processing).To(HaveKeyWithValue("vllm-development", "uuid-soft"))
Expect(cache.Exists("vllm-development")).To(BeTrue())
})
It("evicts a cancelled op", func() {
cache.SetBackend("vllm", "uuid-3")
svc.UpdateStatus("uuid-3", &galleryop.OpStatus{Processed: true, Cancelled: true, Message: "cancelled"})
processing, _ := cache.GetStatus()
Expect(processing).NotTo(HaveKey("vllm"))
})
It("does not evict an op with no status yet (queued)", func() {
cache.SetBackend("whisper", "uuid-4")
processing, taskTypes := cache.GetStatus()
Expect(processing).To(HaveKeyWithValue("whisper", "uuid-4"))
Expect(taskTypes).To(HaveKeyWithValue("whisper", "Waiting"))
})
// Regression guard: GetStatus is called concurrently by four HTTP handlers
// (~1s poll). An earlier version evicted by deleting from m.Map() — which
// returns the live internal map by reference — causing a fatal
// "concurrent map writes" crash. Run under -race; must not panic or race.
It("is safe under concurrent GetStatus + Set/complete", func() {
done := make(chan struct{})
go func() {
defer GinkgoRecover()
for i := 0; i < 2000; i++ {
_, _ = cache.GetStatus()
}
close(done)
}()
for i := 0; i < 2000; i++ {
id := "uuid-c"
cache.SetBackend("concurrent-backend", id)
// Half the time mark it completed so GetStatus evicts it.
if i%2 == 0 {
svc.UpdateStatus(id, &galleryop.OpStatus{Processed: true, Progress: 100, Message: "completed"})
}
_, _ = cache.GetStatus()
}
<-done
})
})

View File

@@ -408,12 +408,43 @@ func (m *OpCache) Exists(key string) bool {
}
func (m *OpCache) GetStatus() (map[string]string, map[string]string) {
processingModelsData := m.Map()
taskTypes := map[string]string{}
processingModelsData := map[string]string{}
for k, v := range processingModelsData {
// Iterate a snapshot (Keys() copies) and build a fresh result map. We must
// NOT delete from m.Map() during the range: Map() returns the live internal
// map by reference, so a bare delete here would be an unsynchronized write
// to a map four HTTP handlers read every ~1s — a concurrent-map-write crash.
// Collect evictions and apply them via the locked DeleteUUID after the loop.
var evict []string
for _, k := range m.status.Keys() {
v := m.status.Get(k)
if v == "" {
continue // raced with a concurrent Delete
}
status := m.galleryService.GetStatus(v)
// Terminal ops must not keep showing as "processing". Cleanup was
// previously only triggered by a client polling /api/backends/job/:uid,
// but the Manage-page Reinstall/Upgrade buttons never poll, so completed
// ops leaked into processingBackends forever and the card spun
// "reinstalling" indefinitely. Evict here on the list read (the UI always
// calls this). DeleteUUID broadcasts the eviction so peer replicas converge.
//
// We evict ONLY a clean success (progress 100 + "completed", matching the
// job-poll's historical delete condition) or a cancellation. Deliberately
// NOT evicted:
// - failed ops (Error != nil): kept so /api/operations can surface the
// error and offer Dismiss.
// - the ErrWorkerStillInstalling soft-path (Processed=true, Error=nil,
// progress != 100): the worker is still installing in the background
// and the reconciler confirms the real outcome later — evicting it
// would hide an install that may still fail.
if status != nil && status.Processed &&
((status.Progress == 100 && status.Message == "completed") || status.Cancelled) {
evict = append(evict, v)
continue
}
processingModelsData[k] = v
taskTypes[k] = "Installation"
if status != nil && status.Deletion {
taskTypes[k] = "Deletion"
@@ -422,6 +453,10 @@ func (m *OpCache) GetStatus() (map[string]string, map[string]string) {
}
}
for _, v := range evict {
m.DeleteUUID(v)
}
return processingModelsData, taskTypes
}

View File

@@ -5,6 +5,7 @@ import (
"errors"
"fmt"
"sync"
"time"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/gallery"
@@ -31,9 +32,9 @@ type GalleryService struct {
// natsClient is the wider MessagingClient (Publisher + subscribe methods)
// when wired by the distributed startup path; broadcastSubs holds the
// progress + cancel subscriptions opened by SubscribeBroadcasts.
natsClient messaging.MessagingClient
galleryStore *distributed.GalleryStore
broadcastSubs []messaging.Subscription
natsClient messaging.MessagingClient
galleryStore *distributed.GalleryStore
broadcastSubs []messaging.Subscription
// OnBackendOpCompleted is fired after every successful install/upgrade/delete
// on the backend channel. The Application wires this to UpgradeChecker.TriggerCheck
@@ -274,6 +275,29 @@ func (g *GalleryService) GetAllStatus() map[string]*OpStatus {
return g.statuses
}
// ReapStaleOperations marks abandoned in-progress operations (pending/
// downloading/processing) older than `age` as failed, so an op orphaned by a
// replica that died mid-flight does not linger as "processing" forever. The
// store's CleanStale runs once on startup; this exposes it for periodic
// invocation (a post-startup orphan is otherwise not reaped until the next
// restart). No-op when no gallery store is wired. Returns rows reaped.
func (g *GalleryService) ReapStaleOperations(age time.Duration) (int64, error) {
g.Lock()
store := g.galleryStore
g.Unlock()
if store == nil {
return 0, nil
}
n, err := store.CleanStale(age)
if err != nil {
return 0, err
}
if n > 0 {
xlog.Info("Reaped stale gallery operations", "count", n)
}
return n, nil
}
// CancelOperation cancels an in-progress operation by its ID.
//
// In distributed mode the UI's cancel click may land on a different replica
@@ -295,6 +319,7 @@ func (g *GalleryService) CancelOperation(id string) error {
}
nc := g.natsClient
store := g.galleryStore
if !localExists && nc == nil {
g.Unlock()
@@ -315,6 +340,17 @@ func (g *GalleryService) CancelOperation(id string) error {
}
g.Unlock()
// Persist the terminal status so the cancel survives a restart. Without
// this the row stays in its active state and re-hydrates straight back into
// processingBackends on the next replica boot — the UI spins again on an op
// the admin already cancelled. The peer that broadcasts wins the write; a
// no-op when standalone (store nil).
if store != nil {
if err := store.Cancel(id); err != nil {
xlog.Warn("Failed to persist gallery operation cancellation", "op_id", id, "error", err)
}
}
// I/O and user-provided callback after Unlock — the cancel-wildcard
// subscriber loops back into applyCancel on this same replica, which
// would otherwise deadlock on g.Mutex.

View File

@@ -194,6 +194,14 @@ type BackendUpgradeRequest struct {
// but the field lets future per-replica metadata (e.g. progress reporting
// scoped to a slot) ride the same wire without a v3 type.
ReplicaIndex int32 `json:"replica_index,omitempty"`
// OpID identifies the admin-side operation. When non-empty the worker
// publishes BackendInstallProgressEvent values to
// SubjectNodeBackendInstallProgress(nodeID, OpID) while the force-reinstall
// runs, so the master can stream per-node progress for upgrades exactly as
// it already does for installs (an upgrade IS a force-reinstall, so the
// install-progress subject is reused rather than minting a new one — no new
// NATS permission or rolling-update compat surface). Empty on legacy callers.
OpID string `json:"op_id,omitempty"`
}
// BackendUpgradeReply mirrors BackendInstallReply minus Address — upgrade does

View File

@@ -157,3 +157,82 @@ func (c *InFlightTrackingClient) Rerank(ctx context.Context, in *pb.RerankReques
res, err := c.Backend.Rerank(ctx, in, opts...)
return res, c.reconcile(err)
}
func (c *InFlightTrackingClient) VAD(ctx context.Context, in *pb.VADRequest, opts ...ggrpc.CallOption) (*pb.VADResponse, error) {
defer c.track(ctx)()
res, err := c.Backend.VAD(ctx, in, opts...)
return res, c.reconcile(err)
}
func (c *InFlightTrackingClient) Diarize(ctx context.Context, in *pb.DiarizeRequest, opts ...ggrpc.CallOption) (*pb.DiarizeResponse, error) {
defer c.track(ctx)()
res, err := c.Backend.Diarize(ctx, in, opts...)
return res, c.reconcile(err)
}
func (c *InFlightTrackingClient) FaceVerify(ctx context.Context, in *pb.FaceVerifyRequest, opts ...ggrpc.CallOption) (*pb.FaceVerifyResponse, error) {
defer c.track(ctx)()
res, err := c.Backend.FaceVerify(ctx, in, opts...)
return res, c.reconcile(err)
}
func (c *InFlightTrackingClient) FaceAnalyze(ctx context.Context, in *pb.FaceAnalyzeRequest, opts ...ggrpc.CallOption) (*pb.FaceAnalyzeResponse, error) {
defer c.track(ctx)()
res, err := c.Backend.FaceAnalyze(ctx, in, opts...)
return res, c.reconcile(err)
}
func (c *InFlightTrackingClient) VoiceVerify(ctx context.Context, in *pb.VoiceVerifyRequest, opts ...ggrpc.CallOption) (*pb.VoiceVerifyResponse, error) {
defer c.track(ctx)()
res, err := c.Backend.VoiceVerify(ctx, in, opts...)
return res, c.reconcile(err)
}
func (c *InFlightTrackingClient) VoiceAnalyze(ctx context.Context, in *pb.VoiceAnalyzeRequest, opts ...ggrpc.CallOption) (*pb.VoiceAnalyzeResponse, error) {
defer c.track(ctx)()
res, err := c.Backend.VoiceAnalyze(ctx, in, opts...)
return res, c.reconcile(err)
}
func (c *InFlightTrackingClient) VoiceEmbed(ctx context.Context, in *pb.VoiceEmbedRequest, opts ...ggrpc.CallOption) (*pb.VoiceEmbedResponse, error) {
defer c.track(ctx)()
res, err := c.Backend.VoiceEmbed(ctx, in, opts...)
return res, c.reconcile(err)
}
func (c *InFlightTrackingClient) TokenClassify(ctx context.Context, in *pb.TokenClassifyRequest, opts ...ggrpc.CallOption) (*pb.TokenClassifyResponse, error) {
defer c.track(ctx)()
res, err := c.Backend.TokenClassify(ctx, in, opts...)
return res, c.reconcile(err)
}
func (c *InFlightTrackingClient) Score(ctx context.Context, in *pb.ScoreRequest, opts ...ggrpc.CallOption) (*pb.ScoreResponse, error) {
defer c.track(ctx)()
res, err := c.Backend.Score(ctx, in, opts...)
return res, c.reconcile(err)
}
func (c *InFlightTrackingClient) AudioEncode(ctx context.Context, in *pb.AudioEncodeRequest, opts ...ggrpc.CallOption) (*pb.AudioEncodeResult, error) {
defer c.track(ctx)()
res, err := c.Backend.AudioEncode(ctx, in, opts...)
return res, c.reconcile(err)
}
func (c *InFlightTrackingClient) AudioDecode(ctx context.Context, in *pb.AudioDecodeRequest, opts ...ggrpc.CallOption) (*pb.AudioDecodeResult, error) {
defer c.track(ctx)()
res, err := c.Backend.AudioDecode(ctx, in, opts...)
return res, c.reconcile(err)
}
func (c *InFlightTrackingClient) AudioTransform(ctx context.Context, in *pb.AudioTransformRequest, opts ...ggrpc.CallOption) (*pb.AudioTransformResult, error) {
defer c.track(ctx)()
res, err := c.Backend.AudioTransform(ctx, in, opts...)
return res, c.reconcile(err)
}
// AudioTransformStream, AudioToAudioStream and Forward are deliberately left as
// embedded passthrough: they return a stream client and the inference spans the
// stream's lifetime, not the constructor call. Wrapping the constructor with
// track() would increment and immediately decrement (and fire onFirstComplete)
// before any audio flows. Tracking those correctly needs the done() func tied to
// stream close, which the current Backend interface doesn't surface here.

View File

@@ -304,6 +304,105 @@ var _ = Describe("InFlightTrackingClient", func() {
})
})
Describe("non-LLM inference methods track in-flight", func() {
// silero-vad and friends only ever expose a single non-Predict method.
// If that method isn't wrapped, the load-time reservation released by
// onFirstComplete never fires and in-flight is stuck at 1 forever.
assertTracked := func(call func() error) {
var firstFired int
client.OnFirstComplete(func() { firstFired++ })
err := call()
Expect(err).ToNot(HaveOccurred())
Expect(tracker.increments).To(Equal(1), "method must increment in-flight")
Expect(tracker.decrements).To(Equal(1), "method must decrement in-flight")
Expect(firstFired).To(Equal(1), "method must release the load-time reservation")
}
It("VAD", func() {
assertTracked(func() error {
_, err := client.VAD(context.Background(), &pb.VADRequest{})
return err
})
})
It("Diarize", func() {
assertTracked(func() error {
_, err := client.Diarize(context.Background(), &pb.DiarizeRequest{})
return err
})
})
It("VoiceVerify", func() {
assertTracked(func() error {
_, err := client.VoiceVerify(context.Background(), &pb.VoiceVerifyRequest{})
return err
})
})
It("VoiceAnalyze", func() {
assertTracked(func() error {
_, err := client.VoiceAnalyze(context.Background(), &pb.VoiceAnalyzeRequest{})
return err
})
})
It("VoiceEmbed", func() {
assertTracked(func() error {
_, err := client.VoiceEmbed(context.Background(), &pb.VoiceEmbedRequest{})
return err
})
})
It("FaceVerify", func() {
assertTracked(func() error {
_, err := client.FaceVerify(context.Background(), &pb.FaceVerifyRequest{})
return err
})
})
It("FaceAnalyze", func() {
assertTracked(func() error {
_, err := client.FaceAnalyze(context.Background(), &pb.FaceAnalyzeRequest{})
return err
})
})
It("TokenClassify", func() {
assertTracked(func() error {
_, err := client.TokenClassify(context.Background(), &pb.TokenClassifyRequest{})
return err
})
})
It("Score", func() {
assertTracked(func() error {
_, err := client.Score(context.Background(), &pb.ScoreRequest{})
return err
})
})
It("AudioEncode", func() {
assertTracked(func() error {
_, err := client.AudioEncode(context.Background(), &pb.AudioEncodeRequest{})
return err
})
})
It("AudioDecode", func() {
assertTracked(func() error {
_, err := client.AudioDecode(context.Background(), &pb.AudioDecodeRequest{})
return err
})
})
It("AudioTransform", func() {
assertTracked(func() error {
_, err := client.AudioTransform(context.Background(), &pb.AudioTransformRequest{})
return err
})
})
})
Describe("stale model reload (self-heal)", func() {
It("removes the replica when the backend reports the model is not loaded", func() {
backend.predictErr = fmt.Errorf("parakeet-cpp: model not loaded")

View File

@@ -533,7 +533,7 @@ func (d *DistributedBackendManager) InstallBackend(ctx context.Context, op *gall
// backend.upgrade, we try the legacy backend.install Force=true path so a
// new master + old worker still converges. Drop the fallback once every
// worker in the fleet is on 2026-05-08 or newer.
func (d *DistributedBackendManager) UpgradeBackend(ctx context.Context, name string, progressCb galleryop.ProgressCallback) error {
func (d *DistributedBackendManager) UpgradeBackend(ctx context.Context, opID, name string, progressCb galleryop.ProgressCallback) error {
galleriesJSON, _ := json.Marshal(d.backendGalleries)
installed, err := d.ListBackends()
@@ -549,17 +549,39 @@ func (d *DistributedBackendManager) UpgradeBackend(ctx context.Context, name str
targetNodeIDs[n.NodeID] = true
}
// Empty opID: the caller (galleryop) doesn't thread an op ID into
// UpgradeBackend today, so we can't tag per-node sink writes with the
// right OpStatus key. Until the upgrade path takes a ManagementOp the
// way InstallBackend does, the sink stays no-op here.
result, err := d.enqueueAndDrainBackendOp(ctx, "", OpBackendUpgrade, name, galleriesJSON, targetNodeIDs, func(node BackendNode) error {
reply, err := d.adapter.UpgradeBackend(node.ID, name, string(galleriesJSON), "", "", "", 0)
result, err := d.enqueueAndDrainBackendOp(ctx, opID, OpBackendUpgrade, name, galleriesJSON, targetNodeIDs, func(node BackendNode) error {
// Per-node progress sink: fan each worker download tick into the legacy
// single-bar progressCb and the per-node OpStatus.Nodes view, exactly as
// InstallBackend does. Defined per-node so each closure captures its own
// node.Name. Without this an upgrade blocks opaque at progress 0 for the
// whole 15m round-trip (the original "reinstalling but nothing happens").
onProgress := func(ev messaging.BackendInstallProgressEvent) {
if progressCb != nil {
progressCb(ev.FileName, ev.Current, ev.Total, ev.Percentage)
}
if d.progressSink != nil && opID != "" {
d.progressSink.UpdateNodeProgress(opID, ev.NodeID, galleryop.NodeProgress{
NodeID: ev.NodeID,
NodeName: node.Name,
Status: galleryop.NodeStatusDownloading,
FileName: ev.FileName,
Current: ev.Current,
Total: ev.Total,
Percentage: ev.Percentage,
Phase: ev.Phase,
})
}
}
var onProgressArg func(messaging.BackendInstallProgressEvent)
if progressCb != nil || d.progressSink != nil {
onProgressArg = onProgress
}
reply, err := d.adapter.UpgradeBackend(node.ID, name, string(galleriesJSON), "", "", "", 0, opID, onProgressArg)
if err != nil {
// Rolling-update fallback: an older worker doesn't know
// backend.upgrade. Try the legacy install-with-force path.
if errors.Is(err, nats.ErrNoResponders) {
instReply, instErr := d.adapter.installWithForceFallback(node.ID, name, string(galleriesJSON), "", "", "", 0)
instReply, instErr := d.adapter.installWithForceFallback(node.ID, name, string(galleriesJSON), "", "", "", 0, opID, onProgressArg)
if instErr != nil {
return instErr
}

View File

@@ -317,7 +317,7 @@ func (stubLocalBackendManager) DeleteBackend(_ string) error { return gallery.Er
func (stubLocalBackendManager) ListBackends() (gallery.SystemBackends, error) {
return gallery.SystemBackends{}, nil
}
func (stubLocalBackendManager) UpgradeBackend(_ context.Context, _ string, _ galleryop.ProgressCallback) error {
func (stubLocalBackendManager) UpgradeBackend(_ context.Context, _ string, _ string, _ galleryop.ProgressCallback) error {
return nil
}
func (stubLocalBackendManager) CheckUpgrades(_ context.Context) (map[string]gallery.UpgradeInfo, error) {
@@ -782,7 +782,7 @@ var _ = Describe("DistributedBackendManager", func() {
mc.scriptReply(messaging.SubjectNodeBackendUpgrade(n2.ID),
messaging.BackendUpgradeReply{Success: false, Error: "registry unauthorized"})
err := mgr.UpgradeBackend(ctx, "vllm-development", nil)
err := mgr.UpgradeBackend(ctx, "", "vllm-development", nil)
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("worker-a"))
Expect(err.Error()).To(ContainSubstring("image manifest not found"))
@@ -797,7 +797,7 @@ var _ = Describe("DistributedBackendManager", func() {
scriptInstalled("vllm-development", n1.ID)
mc.scriptReply(messaging.SubjectNodeBackendUpgrade(n1.ID),
messaging.BackendUpgradeReply{Success: true})
Expect(mgr.UpgradeBackend(ctx, "vllm-development", nil)).To(Succeed())
Expect(mgr.UpgradeBackend(ctx, "", "vllm-development", nil)).To(Succeed())
})
})
@@ -819,7 +819,7 @@ var _ = Describe("DistributedBackendManager", func() {
// if the manager attempts it, the scripted-client default returns
// fakeNoRespondersErr and the assertion below fails loudly.
Expect(mgr.UpgradeBackend(ctx, "cpu-insightface-development", nil)).To(Succeed())
Expect(mgr.UpgradeBackend(ctx, "", "cpu-insightface-development", nil)).To(Succeed())
mc.mu.Lock()
defer mc.mu.Unlock()
@@ -835,7 +835,7 @@ var _ = Describe("DistributedBackendManager", func() {
n1 := registerHealthyBackend("worker-a", "10.0.0.1:50051")
scriptNoBackends(n1.ID)
err := mgr.UpgradeBackend(ctx, "vllm-development", nil)
err := mgr.UpgradeBackend(ctx, "", "vllm-development", nil)
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("not installed on any node"))
@@ -865,7 +865,7 @@ var _ = Describe("DistributedBackendManager", func() {
func(req messaging.BackendInstallRequest) bool { return req.Force },
messaging.BackendInstallReply{Success: true, Address: "10.0.0.1:50100"})
Expect(mgr.UpgradeBackend(ctx, "vllm-development", nil)).To(Succeed())
Expect(mgr.UpgradeBackend(ctx, "", "vllm-development", nil)).To(Succeed())
})
It("returns the upgrade error when it is not ErrNoResponders", func() {
@@ -875,7 +875,7 @@ var _ = Describe("DistributedBackendManager", func() {
mc.scriptReply(messaging.SubjectNodeBackendUpgrade(n.ID),
messaging.BackendUpgradeReply{Success: false, Error: "disk full"})
err := mgr.UpgradeBackend(ctx, "vllm-development", nil)
err := mgr.UpgradeBackend(ctx, "", "vllm-development", nil)
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("disk full"))
})

View File

@@ -0,0 +1,135 @@
package nodes
import (
"context"
"runtime"
"time"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/services/testutil"
)
// These specs reproduce the distributed "pending ops behind dead nodes leak
// forever" bug. ListDuePendingBackendOps only returns rows whose node is
// StatusHealthy, so an op queued against a node that goes offline (heartbeat
// stale) or draining (admin action) is never retried, never aged out, and
// never deleted. On a live cluster these rows sat at attempts=0 indefinitely
// and kept the UI operation alive. DeleteStalePendingBackendOps garbage-collects
// them: draining nodes immediately (models already purged), offline nodes only
// after a grace window so a brief heartbeat blip does not nuke in-flight work.
var _ = Describe("DeleteStalePendingBackendOps", func() {
var (
registry *NodeRegistry
ctx context.Context
)
BeforeEach(func() {
if runtime.GOOS == "darwin" {
Skip("testcontainers requires Docker, not available on macOS CI")
}
db := testutil.SetupTestDB()
var err error
registry, err = NewNodeRegistry(db)
Expect(err).ToNot(HaveOccurred())
ctx = context.Background()
})
// registerBackend registers an auto-approved backend node and returns its ID.
registerBackend := func(name, address string) string {
node := &BackendNode{Name: name, NodeType: NodeTypeBackend, Address: address}
Expect(registry.Register(ctx, node, true)).To(Succeed())
fetched, err := registry.GetByName(ctx, name)
Expect(err).ToNot(HaveOccurred())
return fetched.ID
}
// setHeartbeat forces a node's last_heartbeat (Register/MarkOffline leave it
// at "now"; we age it to simulate a node that went silent a while ago).
setHeartbeat := func(nodeID string, t time.Time) {
Expect(registry.db.WithContext(ctx).Model(&BackendNode{}).
Where("id = ?", nodeID).
Update("last_heartbeat", t).Error).To(Succeed())
}
pendingCountFor := func(nodeID string) int64 {
var n int64
Expect(registry.db.WithContext(ctx).Model(&PendingBackendOp{}).
Where("node_id = ?", nodeID).Count(&n).Error).To(Succeed())
return n
}
It("clears ops behind an offline node whose heartbeat is past the grace window", func() {
dead := registerBackend("nvidia-thor", "10.0.0.9:50051")
Expect(registry.UpsertPendingBackendOp(ctx, dead, "llama-cpp-development", OpBackendInstall, nil)).To(Succeed())
Expect(registry.MarkOffline(ctx, dead)).To(Succeed())
setHeartbeat(dead, time.Now().Add(-1*time.Hour))
removed, err := registry.DeleteStalePendingBackendOps(ctx, 10*time.Minute)
Expect(err).ToNot(HaveOccurred())
Expect(removed).To(Equal(int64(1)))
Expect(pendingCountFor(dead)).To(Equal(int64(0)))
})
It("clears ops behind a draining node immediately, even with a fresh heartbeat", func() {
// Mirrors the live mac-mini-m4 case: draining but still heartbeating.
drain := registerBackend("mac-mini-m4", "10.0.0.3:50051")
Expect(registry.UpsertPendingBackendOp(ctx, drain, "llama-cpp-development", OpBackendInstall, nil)).To(Succeed())
Expect(registry.MarkDraining(ctx, drain)).To(Succeed())
setHeartbeat(drain, time.Now()) // fresh heartbeat
removed, err := registry.DeleteStalePendingBackendOps(ctx, 10*time.Minute)
Expect(err).ToNot(HaveOccurred())
Expect(removed).To(Equal(int64(1)))
Expect(pendingCountFor(drain)).To(Equal(int64(0)))
})
It("clears ops behind an unhealthy node with a stale heartbeat (never ages to offline)", func() {
// A node marked unhealthy on a NATS ErrNoResponders never transitions to
// offline, so its ops must be reaped via the same stale-heartbeat path.
sick := registerBackend("agx-orin-sick", "10.0.0.7:50051")
Expect(registry.UpsertPendingBackendOp(ctx, sick, "llama-cpp-development", OpBackendUpgrade, nil)).To(Succeed())
Expect(registry.MarkUnhealthy(ctx, sick)).To(Succeed())
setHeartbeat(sick, time.Now().Add(-1*time.Hour))
removed, err := registry.DeleteStalePendingBackendOps(ctx, 10*time.Minute)
Expect(err).ToNot(HaveOccurred())
Expect(removed).To(Equal(int64(1)))
Expect(pendingCountFor(sick)).To(Equal(int64(0)))
})
It("keeps ops behind an unhealthy node that is still heartbeating (recovering)", func() {
recovering := registerBackend("agx-orin-flap", "10.0.0.8:50051")
Expect(registry.UpsertPendingBackendOp(ctx, recovering, "llama-cpp-development", OpBackendUpgrade, nil)).To(Succeed())
Expect(registry.MarkUnhealthy(ctx, recovering)).To(Succeed())
setHeartbeat(recovering, time.Now()) // fresh heartbeat → recovering
removed, err := registry.DeleteStalePendingBackendOps(ctx, 10*time.Minute)
Expect(err).ToNot(HaveOccurred())
Expect(removed).To(Equal(int64(0)))
Expect(pendingCountFor(recovering)).To(Equal(int64(1)))
})
It("keeps ops behind a node that only just went offline (within grace)", func() {
blip := registerBackend("agx-orin", "10.0.0.4:50051")
Expect(registry.UpsertPendingBackendOp(ctx, blip, "parakeet-cpp-development", OpBackendInstall, nil)).To(Succeed())
Expect(registry.MarkOffline(ctx, blip)).To(Succeed())
setHeartbeat(blip, time.Now().Add(-1*time.Minute)) // gone only 1m, grace 10m
removed, err := registry.DeleteStalePendingBackendOps(ctx, 10*time.Minute)
Expect(err).ToNot(HaveOccurred())
Expect(removed).To(Equal(int64(0)))
Expect(pendingCountFor(blip)).To(Equal(int64(1)))
})
It("keeps ops behind a healthy node", func() {
healthy := registerBackend("dgx-spark", "10.0.0.1:50051")
Expect(registry.UpsertPendingBackendOp(ctx, healthy, "llama-cpp-development", OpBackendUpgrade, nil)).To(Succeed())
removed, err := registry.DeleteStalePendingBackendOps(ctx, 10*time.Minute)
Expect(err).ToNot(HaveOccurred())
Expect(removed).To(Equal(int64(0)))
Expect(pendingCountFor(healthy)).To(Equal(int64(1)))
})
})

View File

@@ -189,6 +189,13 @@ func (rc *ReplicaReconciler) reconcileState(ctx context.Context) {
// passed on nodes that are currently healthy. On success the row is deleted;
// on failure attempts++ and next_retry_at moves out via exponential backoff.
func (rc *ReplicaReconciler) drainPendingBackendOps(ctx context.Context) {
// Garbage-collect ops behind nodes that went offline/draining. These are
// invisible to ListDuePendingBackendOps (which filters status=healthy), so
// without this sweep they leak forever and keep the UI operation spinning.
if _, err := rc.registry.DeleteStalePendingBackendOps(ctx, stalePendingBackendOpGrace); err != nil {
xlog.Warn("Reconciler: failed to clear stale pending backend ops", "error", err)
}
ops, err := rc.registry.ListDuePendingBackendOps(ctx)
if err != nil {
xlog.Warn("Reconciler: failed to list pending backend ops", "error", err)
@@ -223,10 +230,13 @@ func (rc *ReplicaReconciler) drainPendingBackendOps(ctx context.Context) {
// the same worker. Falls back to the legacy backend.install
// Force=true path on nats.ErrNoResponders for old workers that
// don't subscribe to backend.upgrade yet (rolling-update window).
reply, err := rc.adapter.UpgradeBackend(op.NodeID, op.Backend, string(op.Galleries), "", "", "", 0)
// Reconciler retries are background reconciliation with no live
// admin watching a progress bar, so opID/onProgress are empty —
// the adapter skips the progress subscription entirely.
reply, err := rc.adapter.UpgradeBackend(op.NodeID, op.Backend, string(op.Galleries), "", "", "", 0, "", nil)
if err != nil {
if errors.Is(err, nats.ErrNoResponders) {
instReply, instErr := rc.adapter.installWithForceFallback(op.NodeID, op.Backend, string(op.Galleries), "", "", "", 0)
instReply, instErr := rc.adapter.installWithForceFallback(op.NodeID, op.Backend, string(op.Galleries), "", "", "", 0, "", nil)
if instErr != nil {
applyErr = instErr
} else if !instReply.Success {
@@ -293,6 +303,13 @@ func (rc *ReplicaReconciler) drainPendingBackendOps(ctx context.Context) {
// amount of further retrying will help.
const maxPendingBackendOpAttempts = 10
// stalePendingBackendOpGrace is how long a node may be offline before its
// pending backend ops are garbage-collected. Draining nodes are cleared
// immediately regardless of this window (see DeleteStalePendingBackendOps).
// ListDuePendingBackendOps never surfaces ops behind non-healthy nodes, so
// without this sweep they would leak forever and keep the UI op spinning.
const stalePendingBackendOpGrace = 15 * time.Minute
// probeLoadedModels gRPC-health-checks model addresses that the DB says are
// loaded. If a model's backend process is gone (OOM, crash, manual restart)
// we remove the row so ghosts don't linger. Only probes rows older than

View File

@@ -1776,6 +1776,38 @@ func (r *NodeRegistry) DeletePendingBackendOp(ctx context.Context, id uint) erro
return nil
}
// DeleteStalePendingBackendOps garbage-collects pending backend ops whose target
// node can never drain them. ListDuePendingBackendOps only returns rows behind a
// StatusHealthy node, so ops behind a node that went offline or draining are
// otherwise never retried, aged out, or deleted — they leak forever and keep the
// UI operation spinning. Draining nodes are cleared immediately (an explicit
// admin action; their model rows are already purged). Offline nodes are cleared
// only once their last heartbeat is older than `grace`, so a brief heartbeat blip
// does not nuke an install that is still legitimately in flight. Returns the
// number of rows deleted.
func (r *NodeRegistry) DeleteStalePendingBackendOps(ctx context.Context, grace time.Duration) (int64, error) {
cutoff := time.Now().Add(-grace)
// Draining nodes are cleared immediately (admin action; model rows already
// purged). Offline AND unhealthy nodes are cleared only once their heartbeat
// is older than the grace window: a node marked unhealthy on a NATS
// ErrNoResponders never transitions to offline (health.go skips re-marking
// it), so without including unhealthy here its ops would leak exactly like
// the offline case. A node with a fresh heartbeat (last_heartbeat > cutoff)
// is recovering and keeps its op for retry.
res := r.db.WithContext(ctx).
Where(`node_id IN (SELECT id FROM backend_nodes WHERE status = ?)
OR node_id IN (SELECT id FROM backend_nodes WHERE status IN ? AND last_heartbeat <= ?)`,
StatusDraining, []string{StatusOffline, StatusUnhealthy}, cutoff).
Delete(&PendingBackendOp{})
if res.Error != nil {
return 0, fmt.Errorf("deleting stale pending backend ops: %w", res.Error)
}
if res.RowsAffected > 0 {
xlog.Info("Cleared pending backend ops behind non-healthy nodes", "deleted", res.RowsAffected)
}
return res.RowsAffected, nil
}
// RecordPendingBackendOpFailure bumps Attempts, captures the error, and
// pushes NextRetryAt out with exponential backoff capped at 15 minutes.
func (r *NodeRegistry) RecordPendingBackendOpFailure(ctx context.Context, id uint, errMsg string) error {

View File

@@ -365,7 +365,7 @@ func (f *fakeUnloader) InstallBackend(nodeID, backend, modelID, _, _, _, _ strin
return f.installReply, f.installErr
}
func (f *fakeUnloader) UpgradeBackend(nodeID, backend, _, _, _, _ string, replica int) (*messaging.BackendUpgradeReply, error) {
func (f *fakeUnloader) UpgradeBackend(nodeID, backend, _, _, _, _ string, replica int, _ string, _ func(messaging.BackendInstallProgressEvent)) (*messaging.BackendUpgradeReply, error) {
f.mu.Lock()
f.upgradeCalls = append(f.upgradeCalls, upgradeCall{nodeID, backend, replica})
f.mu.Unlock()

View File

@@ -35,7 +35,7 @@ type backendStopRequest struct {
// backend.upgrade subject.
type NodeCommandSender interface {
InstallBackend(nodeID, backendType, modelID, galleriesJSON, uri, name, alias string, replicaIndex int, opID string, onProgress func(messaging.BackendInstallProgressEvent)) (*messaging.BackendInstallReply, error)
UpgradeBackend(nodeID, backendType, galleriesJSON, uri, name, alias string, replicaIndex int) (*messaging.BackendUpgradeReply, error)
UpgradeBackend(nodeID, backendType, galleriesJSON, uri, name, alias string, replicaIndex int, opID string, onProgress func(messaging.BackendInstallProgressEvent)) (*messaging.BackendUpgradeReply, error)
DeleteBackend(nodeID, backendName string) (*messaging.BackendDeleteReply, error)
ListBackends(nodeID string) (*messaging.BackendListReply, error)
StopBackend(nodeID, backend string) error
@@ -127,38 +127,8 @@ func (a *RemoteUnloaderAdapter) InstallBackend(
xlog.Info("Sending NATS backend.install", "nodeID", nodeID, "backend", backendType, "modelID", modelID, "replica", replicaIndex, "opID", opID)
// Subscribe to the per-op progress subject BEFORE publishing the install
// request so we don't miss early events. When onProgress is nil OR opID
// is empty (the reconciler-driven retry path), skip subscription entirely:
// silent installs cost nothing extra.
var sub messaging.Subscription
if onProgress != nil && opID != "" {
progressSubject := messaging.SubjectNodeBackendInstallProgress(nodeID, opID)
s, subErr := a.nats.Subscribe(progressSubject, func(raw []byte) {
var ev messaging.BackendInstallProgressEvent
if err := json.Unmarshal(raw, &ev); err != nil {
xlog.Debug("malformed install progress event", "subject", progressSubject, "error", err)
return
}
// Goroutine guard: a slow onProgress callback must not stall
// the NATS reader thread.
//
// NOTE: events spawn one goroutine each, so ordering at the
// consumer is best-effort. In practice the worker debounces to
// ~250ms which is far larger than goroutine scheduling jitter,
// so reordering is rare. The worker's final Flush() event is
// intended to win as the terminal tick. A future hardening pass
// could add a Seq uint64 field to BackendInstallProgressEvent
// and drop stale-by-seq at the bridge if reordering becomes a
// real UX issue.
go onProgress(ev)
})
if subErr != nil {
xlog.Warn("Failed to subscribe to install progress subject; proceeding without progress streaming",
"subject", progressSubject, "error", subErr)
} else {
sub = s
}
}
// request so we don't miss early events.
sub := a.subscribeProgress(nodeID, opID, onProgress)
reply, err := messaging.RequestJSON[messaging.BackendInstallRequest, messaging.BackendInstallReply](a.nats, subject, messaging.BackendInstallRequest{
Backend: backendType,
@@ -182,18 +152,58 @@ func (a *RemoteUnloaderAdapter) InstallBackend(
return reply, err
}
// subscribeProgress subscribes to the per-op backend-install progress subject
// so the master can stream per-node download ticks while a worker installs or
// upgrades. Returns nil (and subscribes to nothing) when onProgress is nil or
// opID is empty — the reconciler-driven retry path and legacy callers stay
// silent at no cost. Shared by InstallBackend, UpgradeBackend, and the legacy
// force-install fallback: an upgrade is a force-reinstall, so it reuses the
// install-progress subject rather than minting a new one (no new NATS
// permission, no new rolling-update compat surface). Caller must Unsubscribe
// the returned subscription after the request completes.
func (a *RemoteUnloaderAdapter) subscribeProgress(nodeID, opID string, onProgress func(messaging.BackendInstallProgressEvent)) messaging.Subscription {
if onProgress == nil || opID == "" {
return nil
}
progressSubject := messaging.SubjectNodeBackendInstallProgress(nodeID, opID)
s, subErr := a.nats.Subscribe(progressSubject, func(raw []byte) {
var ev messaging.BackendInstallProgressEvent
if err := json.Unmarshal(raw, &ev); err != nil {
xlog.Debug("malformed backend progress event", "subject", progressSubject, "error", err)
return
}
// Goroutine guard: a slow onProgress callback must not stall the NATS
// reader thread. Events spawn one goroutine each, so ordering at the
// consumer is best-effort; the worker debounces to ~250ms which dwarfs
// goroutine scheduling jitter, and its final Flush() is the terminal tick.
go onProgress(ev)
})
if subErr != nil {
xlog.Warn("Failed to subscribe to backend progress subject; proceeding without progress streaming",
"subject", progressSubject, "error", subErr)
return nil
}
return s
}
// UpgradeBackend sends a backend.upgrade request-reply to a worker node.
// The worker stops every live process for this backend, force-reinstalls
// from the gallery (overwriting the on-disk artifact), and replies. The
// next routine InstallBackend call spawns a fresh process with the new
// binary - upgrade itself does not start a process.
//
// When opID is non-empty and onProgress is set, the master subscribes to the
// per-op progress subject before firing the request so a long force-reinstall
// streams per-node download ticks instead of blocking opaque at progress 0.
//
// Timeout: configured via DistributedConfig.BackendUpgradeTimeoutOrDefault
// (default 15m). Real-world worst case observed: 8-10 minutes for large
// CUDA-l4t backend images on Jetson over WiFi.
func (a *RemoteUnloaderAdapter) UpgradeBackend(nodeID, backendType, galleriesJSON, uri, name, alias string, replicaIndex int) (*messaging.BackendUpgradeReply, error) {
func (a *RemoteUnloaderAdapter) UpgradeBackend(nodeID, backendType, galleriesJSON, uri, name, alias string, replicaIndex int, opID string, onProgress func(messaging.BackendInstallProgressEvent)) (*messaging.BackendUpgradeReply, error) {
subject := messaging.SubjectNodeBackendUpgrade(nodeID)
xlog.Info("Sending NATS backend.upgrade", "nodeID", nodeID, "backend", backendType, "replica", replicaIndex)
xlog.Info("Sending NATS backend.upgrade", "nodeID", nodeID, "backend", backendType, "replica", replicaIndex, "opID", opID)
sub := a.subscribeProgress(nodeID, opID, onProgress)
reply, err := messaging.RequestJSON[messaging.BackendUpgradeRequest, messaging.BackendUpgradeReply](a.nats, subject, messaging.BackendUpgradeRequest{
Backend: backendType,
@@ -202,7 +212,13 @@ func (a *RemoteUnloaderAdapter) UpgradeBackend(nodeID, backendType, galleriesJSO
Name: name,
Alias: alias,
ReplicaIndex: int32(replicaIndex),
OpID: opID,
}, a.upgradeTimeout)
if sub != nil {
_ = sub.Unsubscribe()
}
if err != nil && isNATSTimeout(err) {
return nil, fmt.Errorf("%w (subject=%s nodeID=%s backend=%s): %v",
galleryop.ErrWorkerStillInstalling, subject, nodeID, backendType, err)
@@ -216,10 +232,12 @@ func (a *RemoteUnloaderAdapter) UpgradeBackend(nodeID, backendType, galleriesJSO
// doesn't subscribe to the new subject). It re-fires the legacy
// backend.install with Force=true. Drop this once every worker is on
// 2026-05-08 or newer.
func (a *RemoteUnloaderAdapter) installWithForceFallback(nodeID, backendType, galleriesJSON, uri, name, alias string, replicaIndex int) (*messaging.BackendInstallReply, error) {
func (a *RemoteUnloaderAdapter) installWithForceFallback(nodeID, backendType, galleriesJSON, uri, name, alias string, replicaIndex int, opID string, onProgress func(messaging.BackendInstallProgressEvent)) (*messaging.BackendInstallReply, error) {
subject := messaging.SubjectNodeBackendInstall(nodeID)
xlog.Warn("Falling back to legacy backend.install Force=true (old worker)", "nodeID", nodeID, "backend", backendType)
sub := a.subscribeProgress(nodeID, opID, onProgress)
reply, err := messaging.RequestJSON[messaging.BackendInstallRequest, messaging.BackendInstallReply](a.nats, subject, messaging.BackendInstallRequest{
Backend: backendType,
BackendGalleries: galleriesJSON,
@@ -228,7 +246,13 @@ func (a *RemoteUnloaderAdapter) installWithForceFallback(nodeID, backendType, ga
Alias: alias,
ReplicaIndex: int32(replicaIndex),
Force: true,
OpID: opID,
}, a.upgradeTimeout)
if sub != nil {
_ = sub.Unsubscribe()
}
if err != nil && isNATSTimeout(err) {
return nil, fmt.Errorf("%w (subject=%s nodeID=%s backend=%s): %v",
galleryop.ErrWorkerStillInstalling, subject, nodeID, backendType, err)

View File

@@ -282,7 +282,7 @@ var _ = Describe("RemoteUnloaderAdapter timeout configuration", func() {
mc.scriptReply(messaging.SubjectNodeBackendUpgrade("n1"), messaging.BackendUpgradeReply{Success: true})
adapter := NewRemoteUnloaderAdapter(nil, mc, 7*time.Minute, 11*time.Minute)
_, err := adapter.UpgradeBackend("n1", "llama-cpp", "[]", "", "", "", 0)
_, err := adapter.UpgradeBackend("n1", "llama-cpp", "[]", "", "", "", 0, "", nil)
Expect(err).ToNot(HaveOccurred())
Expect(mc.calls).To(HaveLen(1))

View File

@@ -1,6 +1,7 @@
package nodes
import (
"sync"
"time"
. "github.com/onsi/ginkgo/v2"
@@ -18,7 +19,7 @@ var _ = Describe("RemoteUnloaderAdapter.UpgradeBackend", func() {
messaging.BackendUpgradeReply{Success: true})
adapter := NewRemoteUnloaderAdapter(nil, mc, 3*time.Minute, 15*time.Minute)
reply, err := adapter.UpgradeBackend(nodeID, "llama-cpp", `[{"name":"x"}]`, "", "", "", 0)
reply, err := adapter.UpgradeBackend(nodeID, "llama-cpp", `[{"name":"x"}]`, "", "", "", 0, "", nil)
Expect(err).ToNot(HaveOccurred())
Expect(reply.Success).To(BeTrue())
})
@@ -27,7 +28,55 @@ var _ = Describe("RemoteUnloaderAdapter.UpgradeBackend", func() {
mc := newScriptedMessagingClient() // unscripted subject => fakeNoRespondersErr by harness convention
adapter := NewRemoteUnloaderAdapter(nil, mc, 3*time.Minute, 15*time.Minute)
_, err := adapter.UpgradeBackend("missing-node", "llama-cpp", "", "", "", "", 0)
_, err := adapter.UpgradeBackend("missing-node", "llama-cpp", "", "", "", "", 0, "", nil)
Expect(err).To(HaveOccurred())
})
// Reproducer for "upgrade reports progress:0 the whole time" (Bug B). The
// install path streamed per-node download ticks; the upgrade path did a bare
// request→single-reply with no progress subscription, so a long force-reinstall
// blocked opaque. The adapter must subscribe to the per-op progress subject
// (reused from install) BEFORE the request and deliver each tick to onProgress.
It("streams per-node progress ticks during the upgrade", func() {
mc := newScriptedMessagingClient()
nodeID := "node-slow"
opID := "op-upgrade-1"
mc.scriptReply(messaging.SubjectNodeBackendUpgrade(nodeID),
messaging.BackendUpgradeReply{Success: true})
// The worker would publish these while force-reinstalling. The harness
// replays them as soon as the adapter subscribes to the per-op subject.
mc.scheduleProgressPublish(nodeID, opID, []messaging.BackendInstallProgressEvent{
{NodeID: nodeID, FileName: "llama-cpp.tar", Current: "10 MB", Total: "100 MB", Percentage: 10},
{NodeID: nodeID, FileName: "llama-cpp.tar", Current: "100 MB", Total: "100 MB", Percentage: 100},
})
var mu sync.Mutex
var got []messaging.BackendInstallProgressEvent
onProgress := func(ev messaging.BackendInstallProgressEvent) {
mu.Lock()
got = append(got, ev)
mu.Unlock()
}
adapter := NewRemoteUnloaderAdapter(nil, mc, 3*time.Minute, 15*time.Minute)
reply, err := adapter.UpgradeBackend(nodeID, "llama-cpp", `[{"name":"x"}]`, "", "", "", 0, opID, onProgress)
Expect(err).ToNot(HaveOccurred())
Expect(reply.Success).To(BeTrue())
// Confirm it subscribed to the (reused) install-progress subject for this op.
Expect(mc.subscribeCalls()).To(ContainElement(messaging.SubjectNodeBackendInstallProgress(nodeID, opID)))
// Progress events are delivered asynchronously (goroutine-per-event), so
// poll for both and assert on the set — ordering is best-effort by design.
Eventually(func() []float64 {
mu.Lock()
defer mu.Unlock()
pcts := make([]float64, 0, len(got))
for _, e := range got {
pcts = append(pcts, e.Percentage)
}
return pcts
}, 2*time.Second, 20*time.Millisecond).Should(ConsistOf(float64(10), float64(100)))
})
})

Some files were not shown because too many files have changed in this diff Show More