mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-12 02:38:19 -04:00
feat(realtime): stream the LLM / TTS / transcription pipeline stages (#10176)
* feat(realtime): pipeline streaming + disable_thinking config
Add a nested pipeline.streaming.{llm,tts,transcription} block plus
pipeline.disable_thinking, with StreamLLM/StreamTTS/StreamTranscription/
ThinkingDisabled helpers. Pointer-bools so unset keeps the unary path;
existing configs are unaffected. Wiring into the realtime handler follows.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): sentence segmenter for streamed LLM->TTS pipelining
streamSegmenter accumulates streamed LLM tokens and emits complete
sentence/clause segments (terminator+whitespace, or newline) so TTS can
synthesize each segment as it completes instead of waiting for the whole
reply. Pure helper; the streaming handler wiring consumes it next.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): streaming TTS/transcription methods on Model interface
Add TTSStream and TranscribeStream to the realtime Model interface and
implement them on wrappedModel (delegating to backend.ModelTTSStream /
ModelTranscriptionStream) and transcriptOnlyModel. ttsStream adapts the
backend's WAV-framed stream (44-byte header carrying the sample rate, then
PCM) into raw PCM + sample rate for the realtime transports. Handler wiring
that consumes these (flag-gated) follows.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): emitSpeech with flag-gated streaming TTS
emitSpeech synthesizes a piece of text and forwards audio to the client,
streaming one output_audio.delta per backend PCM chunk when the pipeline
sets streaming.tts, or one delta for the whole utterance otherwise. WebRTC
gets raw PCM (it resamples internally); WebSocket gets base64 PCM at the
session rate. It emits no transcript/audio-done events so a streamed reply
can be split into multiple spoken segments sharing one response.
Adds fakeModel/fakeTransport test doubles for the realtime Model/Transport
interfaces, driving streaming assertions deterministically.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): route response audio through emitSpeech (streaming TTS)
Replace the inline unary TTS block in the response handler with emitSpeech,
which streams a response.output_audio.delta per backend PCM chunk when
pipeline.streaming.tts is set and otherwise preserves the single-delta unary
behaviour. emitSpeech returns the accumulated base64 audio, stored on the
conversation item as before. Transcript and audio-done events stay in the
handler so later per-segment streaming can reuse emitSpeech.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): streaming transcription text deltas
Add emitTranscription and route commitUtterance through it. With
pipeline.streaming.transcription set it streams each transcript fragment as
a conversation.item.input_audio_transcription.delta via TranscribeStream
then a completed event; otherwise it preserves the single completed-event
unary behaviour. Returns the final transcript for response generation.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): pipeline disable_thinking maps to enable_thinking off
applyPipelineThinking forces the LLM's ReasoningConfig.DisableReasoning when
pipeline.disable_thinking is set, which gRPCPredictOpts turns into the
enable_thinking=false backend metadata. Applied at newModel construction on
the per-session LLM config copy, so it doesn't leak to other model users and
needs no realtime-specific request plumbing.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): speechStreamer for token-streamed LLM->TTS
emitSpeech now returns raw PCM (caller base64-encodes) so streamed segments
accumulate correctly. speechStreamer consumes streamed LLM tokens: it strips
reasoning via the streaming ReasoningExtractor, emits a transcript delta per
content fragment, and sentence-pipes content into emitSpeech so each sentence
is synthesized as soon as it's ready. Handler wiring (plain-content turns)
follows.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): wire streamLLMResponse for token-streamed replies
triggerResponseAtTurn takes a streamed path when pipeline.streaming.llm is
set, the turn has no tools, and audio is requested: streamLLMResponse
announces the assistant item, drives the LLM token callback through a
speechStreamer (reasoning-stripped transcript deltas + sentence-piped TTS),
and emits the terminal events. Tool turns and non-streaming pipelines keep
the existing buffered path unchanged, so this is strictly opt-in.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(realtime): document pipeline streaming + disable_thinking
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(realtime): register pipeline streaming/thinking config fields
TestAllFieldsHaveRegistryEntries (core/config/meta) requires every config
field to have a meta registry entry. The four new pipeline fields
(disable_thinking, streaming.{llm,tts,transcription}) had none, failing
tests-linux/tests-apple. Add toggle entries for them.
Also handle the os.Remove return in realtime_speech_test.go to satisfy
errcheck (golangci-lint).
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(realtime): always strip reasoning from spoken output
disable_thinking maps to ReasoningConfig.DisableReasoning=true on the LLM
config, which the backend reads as enable_thinking=false. But the realtime
handler reads that SAME config to drive reasoning extraction, and there
DisableReasoning=true means "skip stripping". PredictConfig() returns this
LLM config, so both the streamed (speechStreamer) and buffered realtime
paths stopped stripping <think>…</think> exactly when disable_thinking was
on — leaking raw reasoning to the client whenever the model ignored the
enable_thinking hint (e.g. lfm2.5).
Add spokenReasoningConfig() which clears DisableReasoning for extraction
(keeping custom tokens/tag pairs) and route both realtime paths through it.
Spoken output now always strips reasoning, independent of the backend
suppression hint.
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(realtime): clean TTS temp path before read (gosec G304)
emitSpeech reads the WAV file the TTS backend wrote. The read moved here
from realtime.go, so code-scanning flagged it as a new G304 alert even
though the path is backend-controlled (a temp file), not user input.
Wrap it in filepath.Clean — a real path normalization that also clears
the alert, keeping with the repo's no-#nosec convention.
Assisted-by: Claude:claude-opus-4-8 gosec, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(realtime): buffer whole message for TTS, drop sentence segmenter
Per review (richiejp): the sentence segmenter pipelined unary TTS by
splitting on ASCII .!?/newline, which does nothing for languages without
those boundaries (CJK/Thai) — there it already degraded to buffering the
whole message anyway.
Replace it with a uniform model: stream the LLM transcript live, buffer the
full message, then synthesize it once. emitSpeech already streams the audio
chunks when the backend implements TTSStream and falls back to a single
unary delta otherwise, so this is real streaming TTS where supported and a
clean whole-message synthesis elsewhere — no per-sentence emulation, no
language assumptions. speechStreamer becomes transcriptStreamer (transcript
deltas only); the whole-message synthesis moves into streamLLMResponse.
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): stream tool-call turns via tokenizer-template autoparser
Per review (richiejp): tool-call deltas exist, so streaming should work with
tools too. It does — for models that use their tokenizer template. The C++
autoparser then clears reply.Message and delivers content + tool calls via
ChatDeltas, so the streamed transcript carries only spoken content (no
tool-call JSON leak) and the tool calls are parsed from the final response.
- Drop the len(tools)==0 gate; stream when no tools OR use_tokenizer_template
(grammar-based function calling still buffers, since its call is emitted as
JSON in the token stream and would leak into the transcript).
- streamLLMResponse takes tools/toolChoice/toolTurn, reads ChatDelta content
in the token callback, parses tool calls from the final ChatDeltas, and
creates the assistant content item lazily so a content-less tool turn emits
only the tool calls.
- Extract emitToolCallItems from the buffered path so both paths finalize tool
calls, response.done, and server-side assistant-tool follow-ups identically.
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): script-aware clause chunking + streamed-reply fixes
Opt-in pipeline.streaming.clause_chunking splits the streamed LLM reply
into speakable clauses and synthesizes each as soon as it completes,
lowering time-to-first-audio instead of buffering the whole message. The
splitter is script-aware (rivo/uniseg, pure Go): UAX#29 sentence
segmentation handles CJK 。!? with no whitespace, CJK clause
punctuation (,、;:) and Thai/Lao spaces give finer cuts, and a UAX#14
line-break cap bounds an over-long punctuation-less run. Unlike the old
ASCII .!?/newline segmenter (dropped in 076dcdbe) it does not degrade to
whole-message buffering for CJK/Thai; scripts needing a dictionary
(Khmer/Burmese) stay buffered until a space or end-of-message. Clauses
are synthesized synchronously in the token callback (the LLM keeps
generating into the gRPC stream meanwhile), so audio still starts
mid-generation. Off by default — the whole-message path is unchanged.
Also fix the streamed-reply path and the Talk page:
- Don't swallow streamed autoparser content as reasoning: the
tokenizer-template path already delivers reasoning-free content via
ChatDeltas, so prefilling the thinking start token re-tagged it as an
unclosed reasoning block, leaving no spoken reply. Disable the prefill
on that path; closed tag pairs are still stripped (#9985).
- Generate collision-free realtime IDs (16 random bytes) instead of a
constant, so per-item bookkeeping (cancel, conversation.item.retrieve)
works.
- Key the Talk transcript by the server item_id and upsert entries.
Realtime events arrive over a WebRTC data channel — outside React's
event system — so React defers the setTranscript updaters while
synchronous ref writes in handler bodies run first; the old
index-tracking ref rendered a duplicate assistant bubble on
completion. Upserts by item_id are idempotent and order-independent.
- Drop the partial assistant bubble on a cancelled response (barge-in):
the server discards the interrupted item and sends response.done with
status "cancelled"; mirror that in the UI so the regenerated reply
isn't rendered as a second assistant message.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Assisted-by: Claude:claude-fable-5 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Richard Palethorpe <io@richiejp.com>
This commit is contained in:
@@ -31,6 +31,43 @@ This configuration links the following components:
|
||||
|
||||
Make sure all referenced models (`silero-vad-ggml`, `whisper-large-turbo`, `qwen3-4b`, `tts-1`) are also installed or defined in your LocalAI instance.
|
||||
|
||||
### Streaming the pipeline
|
||||
|
||||
By default each stage runs to completion before the next begins: the whole utterance is transcribed, the full LLM reply is generated, then it is synthesized. Each stage can instead be streamed incrementally, which lowers the time-to-first-audio of a turn:
|
||||
|
||||
```yaml
|
||||
name: gpt-realtime
|
||||
pipeline:
|
||||
vad: silero-vad-ggml
|
||||
transcription: whisper-large-turbo
|
||||
llm: qwen3-4b
|
||||
tts: tts-1
|
||||
streaming:
|
||||
llm: true # stream LLM tokens as transcript deltas
|
||||
tts: true # emit audio deltas per synthesized chunk
|
||||
transcription: true # stream transcript text deltas of the user's speech
|
||||
clause_chunking: true # synthesize each clause as soon as it completes
|
||||
```
|
||||
|
||||
- **streaming.tts**: emit a `response.output_audio.delta` per audio chunk the TTS backend produces (requires a backend that supports streaming synthesis), instead of one delta for the whole utterance. Falls back to a single unary delta otherwise.
|
||||
- **streaming.transcription**: stream `conversation.item.input_audio_transcription.delta` events as the transcript is produced (requires a transcription backend that supports streaming).
|
||||
- **streaming.llm**: stream the LLM reply token-by-token as `response.output_audio_transcript.delta` events. The full reply is buffered and synthesized once it is complete — streamed as audio chunks when `streaming.tts` is enabled (and the TTS backend supports it), otherwise as a single unary delta. Reasoning/thinking is always stripped from the spoken transcript. Tool calls are supported while streaming when the LLM uses its tokenizer template (`use_tokenizer_template: true`): the backend's autoparser then delivers content and tool calls separately, so the spoken transcript never leaks tool-call tokens. Grammar-based function calling keeps the buffered path.
|
||||
- **streaming.clause_chunking**: instead of buffering the whole reply before TTS, split it into speakable clauses and synthesize each as soon as it completes, lowering the time-to-first-audio. The splitter is script-aware: it uses Unicode sentence segmentation (so it handles CJK `。!?` with no whitespace), CJK clause punctuation (`,、;:`), and Thai/Lao spaces — it does **not** rely on whitespace sentence boundaries, so it works for languages such as Chinese, Japanese and Thai where the old per-sentence approach degraded to whole-message buffering. Requires `streaming.llm`; scripts that genuinely need a dictionary (e.g. Khmer, Burmese) simply stay buffered until a space or end-of-message. Off by default.
|
||||
|
||||
All streaming flags are off by default, so existing pipelines are unaffected.
|
||||
|
||||
### Disabling thinking
|
||||
|
||||
For reasoning models, you can force the pipeline LLM's thinking off without editing the LLM model config:
|
||||
|
||||
```yaml
|
||||
pipeline:
|
||||
llm: qwen3-4b
|
||||
disable_thinking: true # maps to enable_thinking=false for the realtime LLM
|
||||
```
|
||||
|
||||
This is applied only to the realtime session's copy of the LLM config, so it does not affect other users of the same model. Leave it unset to use the LLM model config's own reasoning settings.
|
||||
|
||||
## Transports
|
||||
|
||||
The Realtime API supports two transports: **WebSocket** and **WebRTC**.
|
||||
|
||||
Reference in New Issue
Block a user