feat(realtime): stream the LLM / TTS / transcription pipeline stages (#10176)

* feat(realtime): pipeline streaming + disable_thinking config

Add a nested pipeline.streaming.{llm,tts,transcription} block plus
pipeline.disable_thinking, with StreamLLM/StreamTTS/StreamTranscription/
ThinkingDisabled helpers. Pointer-bools so unset keeps the unary path;
existing configs are unaffected. Wiring into the realtime handler follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): sentence segmenter for streamed LLM->TTS pipelining

streamSegmenter accumulates streamed LLM tokens and emits complete
sentence/clause segments (terminator+whitespace, or newline) so TTS can
synthesize each segment as it completes instead of waiting for the whole
reply. Pure helper; the streaming handler wiring consumes it next.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): streaming TTS/transcription methods on Model interface

Add TTSStream and TranscribeStream to the realtime Model interface and
implement them on wrappedModel (delegating to backend.ModelTTSStream /
ModelTranscriptionStream) and transcriptOnlyModel. ttsStream adapts the
backend's WAV-framed stream (44-byte header carrying the sample rate, then
PCM) into raw PCM + sample rate for the realtime transports. Handler wiring
that consumes these (flag-gated) follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): emitSpeech with flag-gated streaming TTS

emitSpeech synthesizes a piece of text and forwards audio to the client,
streaming one output_audio.delta per backend PCM chunk when the pipeline
sets streaming.tts, or one delta for the whole utterance otherwise. WebRTC
gets raw PCM (it resamples internally); WebSocket gets base64 PCM at the
session rate. It emits no transcript/audio-done events so a streamed reply
can be split into multiple spoken segments sharing one response.

Adds fakeModel/fakeTransport test doubles for the realtime Model/Transport
interfaces, driving streaming assertions deterministically.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): route response audio through emitSpeech (streaming TTS)

Replace the inline unary TTS block in the response handler with emitSpeech,
which streams a response.output_audio.delta per backend PCM chunk when
pipeline.streaming.tts is set and otherwise preserves the single-delta unary
behaviour. emitSpeech returns the accumulated base64 audio, stored on the
conversation item as before. Transcript and audio-done events stay in the
handler so later per-segment streaming can reuse emitSpeech.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): streaming transcription text deltas

Add emitTranscription and route commitUtterance through it. With
pipeline.streaming.transcription set it streams each transcript fragment as
a conversation.item.input_audio_transcription.delta via TranscribeStream
then a completed event; otherwise it preserves the single completed-event
unary behaviour. Returns the final transcript for response generation.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): pipeline disable_thinking maps to enable_thinking off

applyPipelineThinking forces the LLM's ReasoningConfig.DisableReasoning when
pipeline.disable_thinking is set, which gRPCPredictOpts turns into the
enable_thinking=false backend metadata. Applied at newModel construction on
the per-session LLM config copy, so it doesn't leak to other model users and
needs no realtime-specific request plumbing.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): speechStreamer for token-streamed LLM->TTS

emitSpeech now returns raw PCM (caller base64-encodes) so streamed segments
accumulate correctly. speechStreamer consumes streamed LLM tokens: it strips
reasoning via the streaming ReasoningExtractor, emits a transcript delta per
content fragment, and sentence-pipes content into emitSpeech so each sentence
is synthesized as soon as it's ready. Handler wiring (plain-content turns)
follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): wire streamLLMResponse for token-streamed replies

triggerResponseAtTurn takes a streamed path when pipeline.streaming.llm is
set, the turn has no tools, and audio is requested: streamLLMResponse
announces the assistant item, drives the LLM token callback through a
speechStreamer (reasoning-stripped transcript deltas + sentence-piped TTS),
and emits the terminal events. Tool turns and non-streaming pipelines keep
the existing buffered path unchanged, so this is strictly opt-in.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(realtime): document pipeline streaming + disable_thinking

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(realtime): register pipeline streaming/thinking config fields

TestAllFieldsHaveRegistryEntries (core/config/meta) requires every config
field to have a meta registry entry. The four new pipeline fields
(disable_thinking, streaming.{llm,tts,transcription}) had none, failing
tests-linux/tests-apple. Add toggle entries for them.

Also handle the os.Remove return in realtime_speech_test.go to satisfy
errcheck (golangci-lint).

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(realtime): always strip reasoning from spoken output

disable_thinking maps to ReasoningConfig.DisableReasoning=true on the LLM
config, which the backend reads as enable_thinking=false. But the realtime
handler reads that SAME config to drive reasoning extraction, and there
DisableReasoning=true means "skip stripping". PredictConfig() returns this
LLM config, so both the streamed (speechStreamer) and buffered realtime
paths stopped stripping <think>…</think> exactly when disable_thinking was
on — leaking raw reasoning to the client whenever the model ignored the
enable_thinking hint (e.g. lfm2.5).

Add spokenReasoningConfig() which clears DisableReasoning for extraction
(keeping custom tokens/tag pairs) and route both realtime paths through it.
Spoken output now always strips reasoning, independent of the backend
suppression hint.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(realtime): clean TTS temp path before read (gosec G304)

emitSpeech reads the WAV file the TTS backend wrote. The read moved here
from realtime.go, so code-scanning flagged it as a new G304 alert even
though the path is backend-controlled (a temp file), not user input.
Wrap it in filepath.Clean — a real path normalization that also clears
the alert, keeping with the repo's no-#nosec convention.

Assisted-by: Claude:claude-opus-4-8 gosec, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(realtime): buffer whole message for TTS, drop sentence segmenter

Per review (richiejp): the sentence segmenter pipelined unary TTS by
splitting on ASCII .!?/newline, which does nothing for languages without
those boundaries (CJK/Thai) — there it already degraded to buffering the
whole message anyway.

Replace it with a uniform model: stream the LLM transcript live, buffer the
full message, then synthesize it once. emitSpeech already streams the audio
chunks when the backend implements TTSStream and falls back to a single
unary delta otherwise, so this is real streaming TTS where supported and a
clean whole-message synthesis elsewhere — no per-sentence emulation, no
language assumptions. speechStreamer becomes transcriptStreamer (transcript
deltas only); the whole-message synthesis moves into streamLLMResponse.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): stream tool-call turns via tokenizer-template autoparser

Per review (richiejp): tool-call deltas exist, so streaming should work with
tools too. It does — for models that use their tokenizer template. The C++
autoparser then clears reply.Message and delivers content + tool calls via
ChatDeltas, so the streamed transcript carries only spoken content (no
tool-call JSON leak) and the tool calls are parsed from the final response.

- Drop the len(tools)==0 gate; stream when no tools OR use_tokenizer_template
  (grammar-based function calling still buffers, since its call is emitted as
  JSON in the token stream and would leak into the transcript).
- streamLLMResponse takes tools/toolChoice/toolTurn, reads ChatDelta content
  in the token callback, parses tool calls from the final ChatDeltas, and
  creates the assistant content item lazily so a content-less tool turn emits
  only the tool calls.
- Extract emitToolCallItems from the buffered path so both paths finalize tool
  calls, response.done, and server-side assistant-tool follow-ups identically.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): script-aware clause chunking + streamed-reply fixes

Opt-in pipeline.streaming.clause_chunking splits the streamed LLM reply
into speakable clauses and synthesizes each as soon as it completes,
lowering time-to-first-audio instead of buffering the whole message. The
splitter is script-aware (rivo/uniseg, pure Go): UAX#29 sentence
segmentation handles CJK 。!? with no whitespace, CJK clause
punctuation (,、;:) and Thai/Lao spaces give finer cuts, and a UAX#14
line-break cap bounds an over-long punctuation-less run. Unlike the old
ASCII .!?/newline segmenter (dropped in 076dcdbe) it does not degrade to
whole-message buffering for CJK/Thai; scripts needing a dictionary
(Khmer/Burmese) stay buffered until a space or end-of-message. Clauses
are synthesized synchronously in the token callback (the LLM keeps
generating into the gRPC stream meanwhile), so audio still starts
mid-generation. Off by default — the whole-message path is unchanged.

Also fix the streamed-reply path and the Talk page:

- Don't swallow streamed autoparser content as reasoning: the
  tokenizer-template path already delivers reasoning-free content via
  ChatDeltas, so prefilling the thinking start token re-tagged it as an
  unclosed reasoning block, leaving no spoken reply. Disable the prefill
  on that path; closed tag pairs are still stripped (#9985).

- Generate collision-free realtime IDs (16 random bytes) instead of a
  constant, so per-item bookkeeping (cancel, conversation.item.retrieve)
  works.

- Key the Talk transcript by the server item_id and upsert entries.
  Realtime events arrive over a WebRTC data channel — outside React's
  event system — so React defers the setTranscript updaters while
  synchronous ref writes in handler bodies run first; the old
  index-tracking ref rendered a duplicate assistant bubble on
  completion. Upserts by item_id are idempotent and order-independent.

- Drop the partial assistant bubble on a cancelled response (barge-in):
  the server discards the interrupted item and sends response.done with
  status "cancelled"; mirror that in the UI so the regenerated reply
  isn't rendered as a second assistant message.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Assisted-by: Claude:claude-fable-5 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Richard Palethorpe <io@richiejp.com>
This commit is contained in:
LocalAI [bot]
2026-06-11 09:43:12 +02:00
committed by GitHub
parent 228a6dfe79
commit 892fc49949
19 changed files with 1716 additions and 128 deletions

View File

@@ -2,8 +2,10 @@ package openai
import (
"context"
"crypto/rand"
"encoding/base64"
"encoding/binary"
"encoding/hex"
"encoding/json"
"fmt"
"math"
@@ -235,6 +237,12 @@ type Model interface {
Transcribe(ctx context.Context, audio, language string, translate bool, diarize bool, prompt string) (*schema.TranscriptionResult, error)
Predict(ctx context.Context, messages schema.Messages, images, videos, audios []string, tokenCallback func(string, backend.TokenUsage) bool, tools []types.ToolUnion, toolChoice *types.ToolChoiceUnion, logprobs *int, topLogprobs *int, logitBias map[string]float64) (func() (backend.LLMResponse, error), error)
TTS(ctx context.Context, text, voice, language string) (string, *proto.Result, error)
// TTSStream synthesizes speech incrementally, invoking onAudio with raw PCM
// chunks (and the backend sample rate) as they are produced.
TTSStream(ctx context.Context, text, voice, language string, onAudio func(pcm []byte, sampleRate int) error) error
// TranscribeStream transcribes audio incrementally, invoking onDelta for each
// transcript text fragment and returning the final aggregated result.
TranscribeStream(ctx context.Context, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error)
PredictConfig() *config.ModelConfig
}
@@ -1254,27 +1262,15 @@ func commitUtterance(ctx context.Context, utt []byte, session *Session, conv *Co
// TODO: If we have a real any-to-any model then transcription is optional
var transcript string
if session.InputAudioTranscription != nil {
tr, err := session.ModelInterface.Transcribe(ctx, f.Name(), session.InputAudioTranscription.Language, false, false, session.InputAudioTranscription.Prompt)
// emitTranscription streams transcript deltas when
// pipeline.streaming.transcription is set, otherwise emits a single
// completed event; either way it returns the final transcript text.
var err error
transcript, err = emitTranscription(ctx, t, session, generateItemID(), f.Name())
if err != nil {
sendError(t, "transcription_failed", err.Error(), "", "event_TODO")
return
} else if tr == nil {
sendError(t, "transcription_failed", "trancribe result is nil", "", "event_TODO")
return
}
transcript = tr.Text
sendEvent(t, types.ConversationItemInputAudioTranscriptionCompletedEvent{
ServerEventBase: types.ServerEventBase{
EventID: "event_TODO",
},
ItemID: generateItemID(),
// ResponseID: "resp_TODO", // Not needed for transcription completed event
// OutputIndex: 0,
ContentIndex: 0,
Transcript: transcript,
})
} else {
sendNotImplemented(t, "any-to-any models")
return
@@ -1502,6 +1498,26 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
},
})
// Streamed LLM path: when the pipeline opts into LLM streaming, stream the
// transcript to the client as it is generated and synthesize the buffered
// message once. Tool turns are supported only when the model uses its
// tokenizer template: the C++ autoparser then delivers content and tool
// calls via ChatDeltas (clearing the text stream), so the spoken transcript
// never leaks tool-call tokens. Grammar-based function calling emits the
// call as JSON in the token stream, so those turns keep the buffered path.
if config != nil && session.ModelConfig != nil && session.ModelConfig.Pipeline.StreamLLM() {
canStream := len(tools) == 0 || config.TemplateConfig.UseTokenizerTemplate
var respMods []types.Modality
if overrides != nil {
respMods = overrides.OutputModalities
}
if canStream && modalitiesContainAudio(resolveOutputModalities(session.OutputModalities, respMods)) {
if streamLLMResponse(ctx, session, conv, t, responseID, conversationHistory, images, config, tools, toolChoice, toolTurn) {
return
}
}
}
predFunc, err := session.ModelInterface.Predict(ctx, conversationHistory, images, nil, nil, nil, tools, toolChoice, nil, nil, nil)
if err != nil {
sendError(t, "inference_failed", fmt.Sprintf("backend error: %v", err), "", "") // item.Assistant.ID is unknown here
@@ -1579,7 +1595,7 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
// ExtractReasoningWithConfig is a no-op when no tag pair matches,
// so it's safe to apply unconditionally in the no-reasoning branch.
if deltaReasoning == "" && deltaContent != "" {
deltaReasoning, deltaContent = reasoning.ExtractReasoningComplete(deltaContent, thinkingStartToken, config.ReasoningConfig)
deltaReasoning, deltaContent = reasoning.ExtractReasoningComplete(deltaContent, thinkingStartToken, spokenReasoningConfig(config.ReasoningConfig))
}
reasoningText = deltaReasoning
responseWithoutReasoning = deltaContent
@@ -1587,7 +1603,7 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
cleanedResponse = deltaContent
toolCalls = deltaToolCalls
} else {
reasoningText, responseWithoutReasoning = reasoning.ExtractReasoningComplete(rawResponse, thinkingStartToken, config.ReasoningConfig)
reasoningText, responseWithoutReasoning = reasoning.ExtractReasoningComplete(rawResponse, thinkingStartToken, spokenReasoningConfig(config.ReasoningConfig))
textContent = functions.ParseTextContent(responseWithoutReasoning, config.FunctionsConfig)
cleanedResponse = functions.CleanupLLMResult(responseWithoutReasoning, config.FunctionsConfig)
toolCalls = functions.ParseFunctionCall(cleanedResponse, config.FunctionsConfig)
@@ -1713,64 +1729,7 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
return
}
audioFilePath, res, err := session.ModelInterface.TTS(ctx, finalSpeech, session.Voice, session.InputAudioTranscription.Language)
if err != nil {
if ctx.Err() != nil {
xlog.Debug("TTS cancelled (barge-in)")
sendCancelledResponse()
return
}
xlog.Error("TTS failed", "error", err)
sendError(t, "tts_error", fmt.Sprintf("TTS generation failed: %v", err), "", item.Assistant.ID)
return
}
if !res.Success {
xlog.Error("TTS failed", "message", res.Message)
sendError(t, "tts_error", fmt.Sprintf("TTS generation failed: %s", res.Message), "", item.Assistant.ID)
return
}
defer func() { _ = os.Remove(audioFilePath) }()
audioBytes, err := os.ReadFile(audioFilePath)
if err != nil {
xlog.Error("failed to read TTS file", "error", err)
sendError(t, "tts_error", fmt.Sprintf("Failed to read TTS audio: %v", err), "", item.Assistant.ID)
return
}
// Parse WAV header to get raw PCM and the actual sample rate from the TTS backend.
pcmData, ttsSampleRate := laudio.ParseWAV(audioBytes)
if ttsSampleRate == 0 {
ttsSampleRate = localSampleRate
}
xlog.Debug("TTS audio parsed", "raw_bytes", len(audioBytes), "pcm_bytes", len(pcmData), "sample_rate", ttsSampleRate)
// SendAudio (WebRTC) passes PCM at the TTS sample rate directly to the
// Opus encoder, which resamples to 48kHz internally. This avoids a
// lossy intermediate resample through 16kHz.
// XXX: This is a noop in websocket mode; it's included in the JSON instead
if err := t.SendAudio(ctx, pcmData, ttsSampleRate); err != nil {
if ctx.Err() != nil {
xlog.Debug("Audio playback cancelled (barge-in)")
sendCancelledResponse()
return
}
xlog.Error("failed to send audio via transport", "error", err)
}
// For WebSocket clients, resample to the session's output rate and
// deliver audio as base64 in JSON events. WebRTC clients already
// received audio over the RTP track, so skip the base64 payload.
if !isWebRTC {
wsPCM := pcmData
if ttsSampleRate != session.OutputSampleRate {
samples := sound.BytesToInt16sLE(pcmData)
resampled := sound.ResampleInt16(samples, ttsSampleRate, session.OutputSampleRate)
wsPCM = sound.Int16toBytesLE(resampled)
}
audioString = base64.StdEncoding.EncodeToString(wsPCM)
}
// Transcript of the spoken reply (the audio's text).
sendEvent(t, types.ResponseOutputAudioTranscriptDeltaEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
@@ -1788,15 +1747,26 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
Transcript: finalSpeech,
})
// Synthesize and send the audio. With pipeline.streaming.tts enabled
// emitSpeech forwards a response.output_audio.delta per backend PCM
// chunk as it's produced; otherwise it sends the whole utterance as a
// single delta. The returned PCM is stored (base64) on the item below.
pcmAudio, err := emitSpeech(ctx, t, session, responseID, item.Assistant.ID, finalSpeech)
if err != nil {
if ctx.Err() != nil {
xlog.Debug("TTS cancelled (barge-in)")
sendCancelledResponse()
return
}
xlog.Error("TTS failed", "error", err)
sendError(t, "tts_error", fmt.Sprintf("TTS generation failed: %v", err), "", item.Assistant.ID)
return
}
if !isWebRTC {
audioString = base64.StdEncoding.EncodeToString(pcmAudio)
}
if !isWebRTC {
sendEvent(t, types.ResponseOutputAudioDeltaEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
ItemID: item.Assistant.ID,
OutputIndex: 0,
ContentIndex: 0,
Delta: audioString,
})
sendEvent(t, types.ResponseOutputAudioDoneEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
@@ -1849,17 +1819,27 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
})
}
// Handle Tool Calls. Two paths:
// - LocalAI Assistant tools (session.AssistantExecutor.IsTool) run
// server-side; we append both the call and its output to conv.Items
// and re-trigger a follow-up response so the model can speak the
// result. The client only sees observability events.
// - All other tools follow the standard OpenAI flow: emit
// function_call_arguments.done and wait for the client to send
// conversation.item.create back.
xlog.Debug("About to handle tool calls", "finalToolCallsCount", len(finalToolCalls))
// Emit the parsed tool calls, the terminal response.done, and (for
// server-side assistant tools) the follow-up response. Shared with the
// streamed path so both finalize tool calls identically.
emitToolCallItems(ctx, session, conv, t, responseID, finalToolCalls, finalSpeech != "", toolTurn)
}
// emitToolCallItems emits the realtime function_call items for the parsed tool
// calls, the terminal response.done, and — for server-side LocalAI Assistant
// tools — re-triggers a follow-up response so the model can speak the result.
// hasContent shifts the tool-call output index past the assistant content item
// when the same turn also produced spoken/text content. Two tool paths:
// - LocalAI Assistant tools (session.AssistantExecutor.IsTool) run server-side;
// we append both the call and its output to conv.Items and re-trigger. The
// client only sees observability events.
// - All other tools follow the standard OpenAI flow: emit
// function_call_arguments.done and wait for the client to send
// conversation.item.create back.
func emitToolCallItems(ctx context.Context, session *Session, conv *Conversation, t Transport, responseID string, toolCalls []functions.FuncCallResults, hasContent bool, toolTurn int) {
xlog.Debug("About to handle tool calls", "finalToolCallsCount", len(toolCalls))
executedAssistantTool := false
for i, tc := range finalToolCalls {
for i, tc := range toolCalls {
toolCallID := generateItemID()
callID := "call_" + generateUniqueID() // OpenAI uses call_xyz
@@ -1879,7 +1859,7 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
conv.Lock.Unlock()
outputIndex := i
if finalSpeech != "" {
if hasContent {
outputIndex++
}
@@ -2005,8 +1985,11 @@ func generateItemID() string {
}
func generateUniqueID() string {
// Generate a unique ID string
// For simplicity, use a counter or UUID
// Implement as needed
return "unique_id"
// 16 random bytes, hex-encoded. Must be collision-free: session, item,
// response and call IDs build on this, and the conversation tracks/removes
// items by ID (e.g. cancel() in realtime_stream.go, conversation.item.retrieve).
// A constant would make every ID alias and corrupt that bookkeeping.
var b [16]byte
_, _ = rand.Read(b[:])
return hex.EncodeToString(b[:])
}

View File

@@ -0,0 +1,200 @@
package openai
import (
"strings"
"unicode"
"unicode/utf8"
"github.com/rivo/uniseg"
)
// Default clause-chunker bounds (in runes). minRunes gates only sub-sentence
// (clause-mark / Thai-space) cuts so we don't synthesize tiny choppy fragments;
// full sentences always flush regardless of length. maxRunes caps an
// unterminated run so a long punctuation-less span doesn't buffer unbounded.
const (
defaultClauseMinRunes = 12
defaultClauseMaxRunes = 200
)
// clauseChunker splits streamed LLM content into speakable clauses for
// incremental TTS, in a SCRIPT-AWARE way so it works for languages without
// whitespace word boundaries. It leans on UAX #29 sentence segmentation (which
// natively terminates on CJK 。!? as well as Latin .!?), adds CJK clause
// punctuation (,、;:) and Thai/Lao spaces as finer boundaries, and caps an
// over-long unterminated run via UAX #14 line-break opportunities.
//
// Unlike the old ASCII .!?/newline segmenter (dropped in 076dcdbe), it does not
// degrade to whole-message buffering for CJK (handled natively) or Thai/Lao
// (handled via spaces, which Thai uses at clause/sentence boundaries). Scripts
// that genuinely need a dictionary (Khmer/Myanmar) simply stay buffered until a
// space or end-of-message — no worse than the buffered default.
//
// It is not safe for concurrent use; callers feed it from a single goroutine
// (the LLM token callback).
type clauseChunker struct {
buf strings.Builder
minRunes int
maxRunes int
}
func newClauseChunker(minRunes, maxRunes int) *clauseChunker {
return &clauseChunker{minRunes: minRunes, maxRunes: maxRunes}
}
// push appends streamed content and returns any clauses that are now complete —
// "complete" meaning confirmed by following content, so we never speak a clause
// that the next token might extend. Incomplete trailing text stays buffered.
func (c *clauseChunker) push(text string) []string {
c.buf.WriteString(text)
return c.drain(false)
}
// flush returns the remaining buffered clauses, treating end-of-input as a hard
// boundary, and clears the buffer.
func (c *clauseChunker) flush() []string {
return c.drain(true)
}
func (c *clauseChunker) drain(final bool) []string {
s := c.buf.String()
rest := s
var out []string
for rest != "" {
end, ok := c.nextBoundary(rest, final)
if !ok {
break
}
if seg := strings.TrimSpace(rest[:end]); seg != "" {
out = append(out, seg)
}
rest = rest[end:]
}
// Rewriting the builder reallocates and copies the whole buffer; skip it on
// the common per-token call where no boundary was confirmed.
if len(rest) != len(s) {
c.buf.Reset()
c.buf.WriteString(rest)
}
return out
}
// nextBoundary returns the byte offset just past the first emittable clause in
// s, or ok=false when more input is needed (final=false) and no boundary is
// confirmed yet.
func (c *clauseChunker) nextBoundary(s string, final bool) (int, bool) {
if s == "" {
return 0, false
}
// 1) UAX #29 sentence boundary. When the first sentence is followed by more
// text it is a confirmed complete sentence (handles Latin .!? with
// abbreviation/decimal guards, and CJK 。!? with no whitespace).
sentence, rest, _ := uniseg.FirstSentenceInString(s, -1)
if rest != "" {
// Optionally cut finer inside the sentence at a clause boundary.
if cut, ok := c.firstClauseCut(sentence); ok {
return cut, true
}
return len(sentence), true
}
// 2) Unterminated tail: look for a sub-sentence clause boundary (CJK
// punctuation or a Thai/Lao space) confirmed by following content.
if cut, ok := c.firstClauseCut(s); ok {
return cut, true
}
// 3) Over-long punctuation-less run: force a typographically legal break so
// we don't buffer unbounded (e.g. a long CJK run with no punctuation).
if !final && c.maxRunes > 0 && utf8.RuneCountInString(s) > c.maxRunes {
if cut, ok := lineBreakCut(s, c.maxRunes); ok {
return cut, true
}
}
// 4) End of input: emit whatever remains as the final clause.
if final {
return len(s), true
}
return 0, false
}
// firstClauseCut returns the byte offset just past the first sub-sentence clause
// boundary in s — a CJK clause punctuation mark, or a space following a Thai/Lao
// letter — provided the prefix is at least minRunes long and non-space content
// follows. The boundary mark (and any trailing spaces) stay with the left clause.
func (c *clauseChunker) firstClauseCut(s string) (int, bool) {
var prev rune
runes := 0
for i, r := range s {
boundary := isCJKClausePunct(r) || (unicode.IsSpace(r) && isThaiLao(prev))
if boundary && runes+1 >= c.minRunes {
end := i + utf8.RuneLen(r)
for end < len(s) {
nr, sz := utf8.DecodeRuneInString(s[end:])
if !unicode.IsSpace(nr) {
break
}
end += sz
}
if end < len(s) { // confirmed: real content follows the boundary
return end, true
}
// Boundary sits at the end of the buffer with nothing after it yet —
// wait for the next token to confirm it rather than emit early.
return 0, false
}
prev = r
runes++
}
return 0, false
}
// lineBreakCut walks UAX #14 line segments and returns the byte offset of the
// last legal break opportunity at or before maxRunes. Returns ok=false when the
// run has no internal break opportunity (e.g. a space-less Thai run), leaving it
// buffered.
func lineBreakCut(s string, maxRunes int) (int, bool) {
state := -1
rest := s
consumed := 0
runes := 0
for rest != "" {
seg, rem, _, st := uniseg.FirstLineSegmentInString(rest, state)
state = st
runes += utf8.RuneCountInString(seg)
consumed += len(seg)
rest = rem
if runes >= maxRunes {
if consumed < len(s) {
return consumed, true
}
return 0, false
}
}
return 0, false
}
// isCJKClausePunct reports whether r is a CJK clause-level separator worth a
// soft TTS break. Sentence terminators (。!?) are intentionally excluded — UAX
// #29 sentence segmentation already handles those.
func isCJKClausePunct(r rune) bool {
switch r {
case '', // fullwidth comma
'、', // 、 ideographic comma
'', // fullwidth semicolon
'', // fullwidth colon
'・', // ・ katakana middle dot
'・': // ・ halfwidth katakana middle dot
return true
}
return false
}
// isThaiLao reports whether r is a Thai or Lao letter. Those scripts have no
// inter-word spaces; an ASCII space inside such a run marks a clause/sentence
// boundary, which is the only no-dictionary segmentation signal available.
func isThaiLao(r rune) bool {
return unicode.Is(unicode.Thai, r) || unicode.Is(unicode.Lao, r)
}

View File

@@ -0,0 +1,103 @@
package openai
import (
"strings"
"unicode/utf8"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// clauseChunker splits streamed LLM content into speakable clauses in a
// script-aware way: UAX#29 sentences (Latin .!? and CJK 。!?), CJK clause
// punctuation, and Thai/Lao spaces — never whitespace-splitting CJK.
var _ = Describe("clauseChunker", func() {
Context("Latin sentences", func() {
It("emits a sentence only once following content confirms it is complete", func() {
c := newClauseChunker(12, 200)
Expect(c.push("Hello world. How are you?")).To(Equal([]string{"Hello world."}))
// The trailing sentence is held until flush (the next token might extend it).
Expect(c.flush()).To(Equal([]string{"How are you?"}))
})
It("assembles a sentence across many small tokens", func() {
c := newClauseChunker(12, 200)
var got []string
for _, tok := range []string{"Hello", " world.", " How", " are", " you?"} {
got = append(got, c.push(tok)...)
}
got = append(got, c.flush()...)
Expect(got).To(Equal([]string{"Hello world.", "How are you?"}))
})
It("does not split decimals or abbreviations (UAX#29 SB6)", func() {
c := newClauseChunker(12, 200)
got := c.push("Pi is 3.14 and e is 2.72. Done")
Expect(got).To(Equal([]string{"Pi is 3.14 and e is 2.72."}))
Expect(c.flush()).To(Equal([]string{"Done"}))
})
})
Context("CJK (no whitespace)", func() {
It("splits Chinese on the ideographic full stop", func() {
c := newClauseChunker(12, 200)
Expect(c.push("你好世界。今天天气很好。")).To(Equal([]string{"你好世界。"}))
Expect(c.flush()).To(Equal([]string{"今天天气很好。"}))
})
It("splits Japanese on the ideographic full stop", func() {
c := newClauseChunker(12, 200)
Expect(c.push("こんにちは。元気ですか。")).To(Equal([]string{"こんにちは。"}))
Expect(c.flush()).To(Equal([]string{"元気ですか。"}))
})
It("splits on CJK clause punctuation for lower latency", func() {
c := newClauseChunker(2, 200) // small min so short test clauses cut
Expect(c.push("你好,世界。再见")).To(Equal([]string{"你好,", "世界。"}))
Expect(c.flush()).To(Equal([]string{"再见"}))
})
})
Context("Thai (spaces mark clauses, not words)", func() {
It("splits a Thai run on the inter-clause space", func() {
c := newClauseChunker(2, 200)
Expect(c.push("สวัสดีครับ กินข้าวไหม")).To(Equal([]string{"สวัสดีครับ"}))
Expect(c.flush()).To(Equal([]string{"กินข้าวไหม"}))
})
It("never shatters a space-less Thai run into characters", func() {
c := newClauseChunker(2, 200)
Expect(c.push("สวัสดีครับ")).To(BeEmpty()) // held, no boundary
Expect(c.flush()).To(Equal([]string{"สวัสดีครับ"}))
})
})
Context("length cap (UAX#14 fallback)", func() {
It("force-breaks an over-long punctuation-less CJK run at legal points", func() {
c := newClauseChunker(4, 10) // maxRunes = 10
run := strings.Repeat("字", 25)
got := c.push(run)
got = append(got, c.flush()...)
total := 0
for _, seg := range got {
n := utf8.RuneCountInString(seg)
Expect(n).To(BeNumerically("<=", 10)) // never exceeds the cap
total += n
}
Expect(total).To(Equal(25)) // nothing dropped
Expect(len(got)).To(BeNumerically(">=", 3)) // 10 + 10 + 5
})
})
Context("buffer lifecycle", func() {
It("flush clears the buffer so the chunker is reusable", func() {
c := newClauseChunker(12, 200)
// "First one." is confirmed by the following "Second", so push drains it;
// only the unterminated tail remains for flush.
Expect(c.push("First one. Second")).To(Equal([]string{"First one."}))
Expect(c.flush()).To(Equal([]string{"Second"}))
Expect(c.flush()).To(BeEmpty())
Expect(c.push("Again. More")).To(Equal([]string{"Again."}))
})
})
})

View File

@@ -0,0 +1,138 @@
package openai
import (
"context"
"strings"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/grpc/proto"
)
// fakeTransport records the server events and audio sent to a realtime client
// so streaming behaviour can be asserted without a real WebSocket/WebRTC peer.
// It is not a *WebRTCTransport, so handler code takes the WebSocket path.
type fakeTransport struct {
events []types.ServerEvent
audio []fakeAudioChunk
}
type fakeAudioChunk struct {
pcm []byte
sampleRate int
}
func (f *fakeTransport) SendEvent(e types.ServerEvent) error {
f.events = append(f.events, e)
return nil
}
func (f *fakeTransport) ReadEvent() ([]byte, error) { return nil, nil }
func (f *fakeTransport) SendAudio(_ context.Context, pcm []byte, sampleRate int) error {
f.audio = append(f.audio, fakeAudioChunk{pcm: pcm, sampleRate: sampleRate})
return nil
}
func (f *fakeTransport) Close() error { return nil }
// countEvents returns how many recorded events have the given type.
func (f *fakeTransport) countEvents(et types.ServerEventType) int {
n := 0
for _, e := range f.events {
if e.ServerEventType() == et {
n++
}
}
return n
}
// transcriptDeltaText concatenates the Delta of every recorded transcript
// delta event — i.e. the text streamed to the client as it is generated.
func (f *fakeTransport) transcriptDeltaText() string {
var b strings.Builder
for _, e := range f.events {
if d, ok := e.(types.ResponseOutputAudioTranscriptDeltaEvent); ok {
b.WriteString(d.Delta)
}
}
return b.String()
}
// fakeModel is a configurable Model double. TTSStream replays ttsStreamChunks
// and TranscribeStream replays transcribeDeltas, so the handler's streaming
// paths can be driven deterministically.
type fakeModel struct {
cfg *config.ModelConfig
ttsFile string
ttsStreamChunks [][]byte
ttsStreamRate int
ttsStreamErr error
transcribeDeltas []string
transcribeFinal *schema.TranscriptionResult
// Predict streaming: predictTokens are replayed through the token callback
// (simulating streamed LLM output); predictResp/predictErr are returned by
// the deferred predict function. predictChunkDeltas, when set, are delivered
// per-token via TokenUsage.ChatDeltas to exercise the autoparser path.
predictTokens []string
predictChunkDeltas [][]*proto.ChatDelta
predictResp backend.LLMResponse
predictErr error
}
func (m *fakeModel) VAD(context.Context, *schema.VADRequest) (*schema.VADResponse, error) {
return nil, nil
}
func (m *fakeModel) Transcribe(context.Context, string, string, bool, bool, string) (*schema.TranscriptionResult, error) {
return m.transcribeFinal, nil
}
func (m *fakeModel) Predict(_ context.Context, _ schema.Messages, _, _, _ []string, cb func(string, backend.TokenUsage) bool, _ []types.ToolUnion, _ *types.ToolChoiceUnion, _, _ *int, _ map[string]float64) (func() (backend.LLMResponse, error), error) {
if m.predictErr != nil {
return nil, m.predictErr
}
return func() (backend.LLMResponse, error) {
for i, tok := range m.predictTokens {
if cb == nil {
continue
}
usage := backend.TokenUsage{}
if i < len(m.predictChunkDeltas) {
usage.ChatDeltas = m.predictChunkDeltas[i]
}
cb(tok, usage)
}
return m.predictResp, nil
}, nil
}
func (m *fakeModel) TTS(context.Context, string, string, string) (string, *proto.Result, error) {
return m.ttsFile, &proto.Result{Success: true}, nil
}
func (m *fakeModel) TTSStream(_ context.Context, _, _, _ string, onAudio func(pcm []byte, sampleRate int) error) error {
if m.ttsStreamErr != nil {
return m.ttsStreamErr
}
for _, c := range m.ttsStreamChunks {
if err := onAudio(c, m.ttsStreamRate); err != nil {
return err
}
}
return nil
}
func (m *fakeModel) TranscribeStream(_ context.Context, _, _ string, _, _ bool, _ string, onDelta func(text string)) (*schema.TranscriptionResult, error) {
for _, d := range m.transcribeDeltas {
onDelta(d)
}
return m.transcribeFinal, nil
}
func (m *fakeModel) PredictConfig() *config.ModelConfig { return m.cfg }

View File

@@ -3,6 +3,7 @@ package openai
import (
"context"
"crypto/rand"
"encoding/binary"
"encoding/hex"
"encoding/json"
"fmt"
@@ -87,6 +88,14 @@ func (m *transcriptOnlyModel) TTS(ctx context.Context, text, voice, language str
return "", nil, fmt.Errorf("TTS not supported in transcript-only mode")
}
func (m *transcriptOnlyModel) TTSStream(ctx context.Context, text, voice, language string, onAudio func(pcm []byte, sampleRate int) error) error {
return fmt.Errorf("TTS not supported in transcript-only mode")
}
func (m *transcriptOnlyModel) TranscribeStream(ctx context.Context, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error) {
return transcribeStream(ctx, m.modelLoader, *m.TranscriptionConfig, m.appConfig, audio, language, translate, diarize, prompt, onDelta)
}
func (m *transcriptOnlyModel) PredictConfig() *config.ModelConfig {
return nil
}
@@ -321,10 +330,75 @@ func (m *wrappedModel) TTS(ctx context.Context, text, voice, language string) (s
return backend.ModelTTS(ctx, text, voice, language, "", nil, m.modelLoader, m.appConfig, *m.TTSConfig)
}
func (m *wrappedModel) TTSStream(ctx context.Context, text, voice, language string, onAudio func(pcm []byte, sampleRate int) error) error {
return ttsStream(ctx, m.modelLoader, m.appConfig, *m.TTSConfig, text, voice, language, onAudio)
}
func (m *wrappedModel) TranscribeStream(ctx context.Context, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error) {
return transcribeStream(ctx, m.modelLoader, *m.TranscriptionConfig, m.appConfig, audio, language, translate, diarize, prompt, onDelta)
}
func (m *wrappedModel) PredictConfig() *config.ModelConfig {
return m.LLMConfig
}
// wavStreamHeaderBytes is the size of the WAV header that backend.ModelTTSStream
// emits as its first audio callback; the sample rate lives at byte offset 24.
const wavStreamHeaderBytes = 44
// ttsStream adapts backend.ModelTTSStream (which emits a WAV stream: a 44-byte
// header carrying the sample rate, then raw PCM) to the realtime onAudio
// callback, which wants raw PCM plus the sample rate. The header is buffered
// until complete, the sample rate is read from it, and subsequent bytes are
// forwarded as PCM.
func ttsStream(ctx context.Context, ml *model.ModelLoader, appConfig *config.ApplicationConfig, ttsConfig config.ModelConfig, text, voice, language string, onAudio func(pcm []byte, sampleRate int) error) error {
var header []byte
headerDone := false
sampleRate := 0
return backend.ModelTTSStream(ctx, text, voice, language, "", nil, ml, appConfig, ttsConfig, func(b []byte) error {
if headerDone {
if len(b) == 0 {
return nil
}
return onAudio(b, sampleRate)
}
header = append(header, b...)
if len(header) < wavStreamHeaderBytes {
return nil
}
sampleRate = int(binary.LittleEndian.Uint32(header[24:28]))
headerDone = true
if len(header) > wavStreamHeaderBytes {
return onAudio(header[wavStreamHeaderBytes:], sampleRate)
}
return nil
})
}
// transcribeStream adapts backend.ModelTranscriptionStream to the realtime
// onDelta callback, returning the final aggregated transcription result.
func transcribeStream(ctx context.Context, ml *model.ModelLoader, transcriptionConfig config.ModelConfig, appConfig *config.ApplicationConfig, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error) {
var final *schema.TranscriptionResult
err := backend.ModelTranscriptionStream(ctx, backend.TranscriptionRequest{
Audio: audio,
Language: language,
Translate: translate,
Diarize: diarize,
Prompt: prompt,
}, ml, transcriptionConfig, appConfig, func(chunk backend.TranscriptionStreamChunk) {
if chunk.Delta != "" {
onDelta(chunk.Delta)
}
if chunk.Final != nil {
final = chunk.Final
}
})
if err != nil {
return nil, err
}
return final, nil
}
func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) (Model, *config.ModelConfig, error) {
cfgVAD, err := cl.LoadModelConfigFileByName(pipeline.VAD, ml.ModelPath)
if err != nil {
@@ -454,8 +528,10 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
return nil, fmt.Errorf("failed to validate config: %w", err)
}
// Let the pipeline set the LLM's reasoning effort (cfgLLM is a per-session copy).
// Let the pipeline set the LLM's reasoning effort and force thinking off
// (cfgLLM is a per-session copy). disable_thinking applies after the effort.
applyPipelineReasoning(cfgLLM, *pipeline)
applyPipelineThinking(cfgLLM, *pipeline)
cfgTTS, err := cl.LoadModelConfigFileByName(pipeline.TTS, ml.ModelPath)
if err != nil {

View File

@@ -0,0 +1,102 @@
package openai
import (
"context"
"encoding/base64"
"fmt"
"os"
"path/filepath"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
laudio "github.com/mudler/LocalAI/pkg/audio"
"github.com/mudler/LocalAI/pkg/sound"
)
// emitSpeech synthesizes text and sends the audio to the client. When the
// pipeline opts into TTS streaming it forwards each PCM chunk as its own
// response.output_audio.delta as soon as the backend produces it; otherwise it
// synthesizes the whole utterance and sends it as a single delta.
//
// It deliberately does NOT emit transcript or audio-done events: the caller owns
// those so a streamed reply can be split into several spoken segments that share
// one response/item.
//
// It returns the PCM audio (at the session output rate) accumulated across all
// chunks, which the caller base64-encodes onto the conversation item. For WebRTC
// the audio goes over the RTP track instead, so the returned slice is empty.
func emitSpeech(ctx context.Context, t Transport, session *Session, responseID, itemID, text string) ([]byte, error) {
if text == "" {
return nil, nil
}
_, isWebRTC := t.(*WebRTCTransport)
var wsAudio []byte // PCM at the session output rate, accumulated for the item record
// sendChunk hands one PCM buffer to the transport: WebRTC consumes the raw
// PCM directly (it resamples internally); WebSocket gets base64 PCM at the
// session output rate via a JSON delta event.
sendChunk := func(pcm []byte, sampleRate int) error {
if len(pcm) == 0 {
return nil
}
if err := t.SendAudio(ctx, pcm, sampleRate); err != nil {
return err
}
if isWebRTC {
return nil
}
wsPCM := pcm
if sampleRate != 0 && sampleRate != session.OutputSampleRate {
samples := sound.BytesToInt16sLE(pcm)
resampled := sound.ResampleInt16(samples, sampleRate, session.OutputSampleRate)
wsPCM = sound.Int16toBytesLE(resampled)
}
wsAudio = append(wsAudio, wsPCM...)
return t.SendEvent(types.ResponseOutputAudioDeltaEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
ItemID: itemID,
OutputIndex: 0,
ContentIndex: 0,
Delta: base64.StdEncoding.EncodeToString(wsPCM),
})
}
language := ""
if session.InputAudioTranscription != nil {
language = session.InputAudioTranscription.Language
}
if session.ModelConfig != nil && session.ModelConfig.Pipeline.StreamTTS() {
if err := session.ModelInterface.TTSStream(ctx, text, session.Voice, language, sendChunk); err != nil {
return nil, err
}
return wsAudio, nil
}
// Unary fallback: synthesize the whole utterance to a file, then emit once.
audioFilePath, res, err := session.ModelInterface.TTS(ctx, text, session.Voice, language)
if err != nil {
return nil, err
}
if res != nil && !res.Success {
return nil, fmt.Errorf("tts generation failed: %s", res.Message)
}
defer func() { _ = os.Remove(audioFilePath) }()
// filepath.Clean normalizes the backend-produced temp path before reading
// (also keeps gosec G304 quiet — the path is backend-controlled, not user input).
audioBytes, err := os.ReadFile(filepath.Clean(audioFilePath))
if err != nil {
return nil, fmt.Errorf("read tts audio: %w", err)
}
pcm, sampleRate := laudio.ParseWAV(audioBytes)
if sampleRate == 0 {
sampleRate = session.OutputSampleRate
}
if err := sendChunk(pcm, sampleRate); err != nil {
return nil, err
}
return wsAudio, nil
}

View File

@@ -0,0 +1,70 @@
package openai
import (
"context"
"os"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
laudio "github.com/mudler/LocalAI/pkg/audio"
)
// emitSpeech synthesizes a piece of text and forwards the audio to the client,
// streaming a delta per TTS chunk when the pipeline opts in, or sending the
// whole utterance as one delta otherwise.
var _ = Describe("emitSpeech", func() {
ttsOn := true
streamingSession := func(m Model) *Session {
return &Session{
OutputSampleRate: 24000,
ModelInterface: m,
ModelConfig: &config.ModelConfig{
Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{TTS: &ttsOn}},
},
}
}
It("streams one output_audio.delta per TTS chunk when streaming is enabled", func() {
m := &fakeModel{
ttsStreamChunks: [][]byte{{1, 2}, {3, 4}, {5, 6}},
ttsStreamRate: 24000,
}
t := &fakeTransport{}
audio, err := emitSpeech(context.Background(), t, streamingSession(m), "resp1", "item1", "Hello there.")
Expect(err).ToNot(HaveOccurred())
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioDelta)).To(Equal(3))
// The returned audio is all chunks concatenated (session output rate).
Expect(audio).To(Equal([]byte{1, 2, 3, 4, 5, 6}))
})
It("sends a single output_audio.delta in unary mode", func() {
// A minimal real WAV file for the unary TTS path to read + parse.
f, err := os.CreateTemp("", "emit-*.wav")
Expect(err).ToNot(HaveOccurred())
defer func() { _ = os.Remove(f.Name()) }()
pcm := make([]byte, 320) // 160 samples of silence
hdr := laudio.NewWAVHeader(uint32(len(pcm)))
Expect(hdr.Write(f)).To(Succeed())
_, err = f.Write(pcm)
Expect(err).ToNot(HaveOccurred())
Expect(f.Close()).To(Succeed())
session := &Session{
OutputSampleRate: 24000,
ModelInterface: &fakeModel{ttsFile: f.Name()},
ModelConfig: &config.ModelConfig{}, // streaming off
}
t := &fakeTransport{}
_, err = emitSpeech(context.Background(), t, session, "resp1", "item1", "Hello there.")
Expect(err).ToNot(HaveOccurred())
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioDelta)).To(Equal(1))
})
})

View File

@@ -0,0 +1,315 @@
package openai
import (
"context"
"encoding/base64"
"fmt"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/functions"
"github.com/mudler/LocalAI/pkg/reasoning"
)
// transcriptStreamer turns streamed LLM tokens into the assistant's spoken
// transcript: it strips reasoning incrementally and sends one
// response.output_audio_transcript.delta per content fragment. It does NOT
// synthesize audio — the caller buffers the full message and synthesizes it
// once (streaming the audio chunks when the TTS backend supports TTSStream),
// which works uniformly for streaming and non-streaming TTS and for languages
// without sentence or word boundaries.
type transcriptStreamer struct {
ctx context.Context
t Transport
responseID string
itemID string
extractor *reasoning.ReasoningExtractor
// announce, if set, is invoked once just before the first transcript delta.
// It lets the caller create the assistant item lazily, so a content-less
// tool-call turn never emits a spurious empty assistant item.
announce func()
announced bool
}
func newTranscriptStreamer(ctx context.Context, t Transport, responseID, itemID, thinkingStartToken string, reasoningCfg reasoning.Config) *transcriptStreamer {
return &transcriptStreamer{
ctx: ctx,
t: t,
responseID: responseID,
itemID: itemID,
extractor: reasoning.NewReasoningExtractor(thinkingStartToken, spokenReasoningConfig(reasoningCfg)),
}
}
// onToken handles one streamed unit of model output, sending a transcript delta
// for the new content (reasoning stripped) and returning that content delta so
// the caller can also feed it to the clause chunker. For plain-content models
// the unit is the raw text token; for autoparser tool turns the backend clears
// the text and delivers content via ChatDeltas, so the caller passes that
// content here. Returns "" when the token produced no new spoken content.
func (s *transcriptStreamer) onToken(token string) string {
_, content := s.extractor.ProcessToken(token)
if content == "" {
return ""
}
if !s.announced {
s.announced = true
if s.announce != nil {
s.announce()
}
}
_ = s.t.SendEvent(types.ResponseOutputAudioTranscriptDeltaEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: s.responseID,
ItemID: s.itemID,
OutputIndex: 0,
ContentIndex: 0,
Delta: content,
})
return content
}
// content returns the full transcript so far with reasoning stripped.
func (s *transcriptStreamer) content() string {
return s.extractor.CleanedContent()
}
// streamLLMResponse drives a streamed realtime reply. It streams the assistant
// transcript as the LLM generates, then synthesizes the whole buffered message
// once (streaming the audio chunks when the TTS backend supports it, otherwise a
// single unary delta). Tool calls parsed from the autoparser ChatDeltas are
// emitted after the spoken content. The assistant content item is created lazily
// on the first content delta, so a content-less tool-call turn emits only the
// tool calls. It returns true when it has fully handled the response so the
// caller can return; callers must only invoke it for an audio modality, and with
// tools only when the model uses its tokenizer template (see triggerResponseAtTurn).
func streamLLMResponse(ctx context.Context, session *Session, conv *Conversation, t Transport, responseID string, history schema.Messages, images []string, llmCfg *config.ModelConfig, tools []types.ToolUnion, toolChoice *types.ToolChoiceUnion, toolTurn int) bool {
itemID := generateItemID()
item := types.MessageItemUnion{
Assistant: &types.MessageItemAssistant{
ID: itemID,
Status: types.ItemStatusInProgress,
Content: []types.MessageContentOutput{{Type: types.MessageContentTypeOutputAudio}},
},
}
// announce creates the assistant content item lazily, just before the first
// transcript delta — a tool-only turn never produces content, so it stays out
// of the conversation and the client sees only the tool calls.
announced := false
announce := func() {
announced = true
conv.Lock.Lock()
conv.Items = append(conv.Items, &item)
conv.Lock.Unlock()
sendEvent(t, types.ResponseOutputItemAddedEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
OutputIndex: 0,
Item: item,
})
sendEvent(t, types.ResponseContentPartAddedEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
ItemID: itemID,
OutputIndex: 0,
ContentIndex: 0,
Part: item.Assistant.Content[0],
})
}
cancel := func() {
if announced {
conv.Lock.Lock()
for i := len(conv.Items) - 1; i >= 0; i-- {
if conv.Items[i].Assistant != nil && conv.Items[i].Assistant.ID == itemID {
conv.Items = append(conv.Items[:i], conv.Items[i+1:]...)
break
}
}
conv.Lock.Unlock()
}
sendEvent(t, types.ResponseDoneEvent{
ServerEventBase: types.ServerEventBase{},
Response: types.Response{ID: responseID, Object: "realtime.response", Status: types.ResponseStatusCancelled},
})
}
var template string
if llmCfg.TemplateConfig.UseTokenizerTemplate {
template = llmCfg.GetModelTemplate()
} else {
template = llmCfg.TemplateConfig.Chat
}
thinkingStartToken := reasoning.DetectThinkingStartToken(template, &llmCfg.ReasoningConfig)
// The autoparser (tokenizer-template path) already delivers reasoning-free
// content. Prefilling the thinking start token here would re-tag that clean
// content as an unclosed reasoning block, leaving CleanedContent() empty —
// no spoken reply, no TTS. Disable the prefill; closed tag pairs are still
// stripped (PEG-fallback case, #9985).
reasoningCfg := llmCfg.ReasoningConfig
if llmCfg.TemplateConfig.UseTokenizerTemplate {
disablePrefill := true
reasoningCfg.DisableReasoningTagPrefill = &disablePrefill
}
streamer := newTranscriptStreamer(ctx, t, responseID, itemID, thinkingStartToken, reasoningCfg)
streamer.announce = announce
// Clause chunking (opt-in): synthesize each clause as soon as it completes
// instead of buffering the whole reply. streamedAudio accumulates the PCM
// across clauses for the conversation item record; ttsErr captures the first
// synthesis failure so the token callback can stop the prediction. emitSpeech
// runs synchronously here — the LLM keeps generating into the gRPC stream
// while a clause is synthesized, so audio still starts mid-generation.
var chunker *clauseChunker
if session.ModelConfig != nil && session.ModelConfig.Pipeline.ChunkClauses() {
chunker = newClauseChunker(defaultClauseMinRunes, defaultClauseMaxRunes)
}
var streamedAudio []byte
var ttsErr error
speakClause := func(clause string) error {
a, err := emitSpeech(ctx, t, session, responseID, itemID, clause)
if err != nil {
return err
}
streamedAudio = append(streamedAudio, a...)
return nil
}
// fail reports a mid-stream failure. A cancelled context means the client
// interrupted (barge-in), so roll the turn back instead of erroring.
fail := func(code, msg string, err error) bool {
if ctx.Err() != nil {
cancel()
} else {
sendError(t, code, fmt.Sprintf("%s: %v", msg, err), "", itemID)
}
return true
}
cb := func(token string, usage backend.TokenUsage) bool {
if ctx.Err() != nil {
return false
}
// Plain-content models stream text via the token; autoparser tool turns
// clear the text and deliver content via ChatDeltas, so prefer the latter
// when present. Either way only content reaches the transcript — tool-call
// deltas are parsed from the final response below.
text := token
if len(usage.ChatDeltas) > 0 {
text = functions.ContentFromChatDeltas(usage.ChatDeltas)
}
delta := streamer.onToken(text)
if chunker != nil && delta != "" {
for _, clause := range chunker.push(delta) {
if ttsErr = speakClause(clause); ttsErr != nil {
return false // stop the prediction; reported after predFunc returns
}
}
}
return true
}
predFunc, err := session.ModelInterface.Predict(ctx, history, images, nil, nil, cb, tools, toolChoice, nil, nil, nil)
if err != nil {
sendError(t, "inference_failed", fmt.Sprintf("backend error: %v", err), "", itemID)
return true
}
pred, err := predFunc()
// A clause synthesis failed mid-stream (the callback stopped the prediction);
// report it as a TTS error rather than a prediction error.
if ttsErr != nil {
return fail("tts_error", "TTS generation failed", ttsErr)
}
if err != nil {
return fail("prediction_failed", "backend error", err)
}
if ctx.Err() != nil {
cancel()
return true
}
content := streamer.content()
toolCalls := functions.ToolCallsFromChatDeltas(pred.ChatDeltas)
// Finalize the spoken content item only when the turn produced content. A
// tool-only turn skips this entirely (no empty assistant item).
if content != "" {
if !announced {
announce()
}
// Synthesize the audio. With clause chunking the completed clauses were
// already spoken inside the token callback; flush the trailing clause(s)
// the segmenter was still holding. Otherwise buffer the whole message and
// synthesize it once. emitSpeech streams the audio chunks when the TTS
// backend supports TTSStream, otherwise it sends a single unary delta.
var audio []byte
if chunker != nil {
for _, clause := range chunker.flush() {
if ttsErr = speakClause(clause); ttsErr != nil {
break
}
}
audio = streamedAudio
} else {
audio, ttsErr = emitSpeech(ctx, t, session, responseID, itemID, content)
}
if ttsErr != nil {
return fail("tts_error", "TTS generation failed", ttsErr)
}
_, isWebRTC := t.(*WebRTCTransport)
sendEvent(t, types.ResponseOutputAudioTranscriptDoneEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
ItemID: itemID,
OutputIndex: 0,
ContentIndex: 0,
Transcript: content,
})
if !isWebRTC {
sendEvent(t, types.ResponseOutputAudioDoneEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
ItemID: itemID,
OutputIndex: 0,
ContentIndex: 0,
})
}
conv.Lock.Lock()
item.Assistant.Status = types.ItemStatusCompleted
item.Assistant.Content[0].Transcript = content
if !isWebRTC {
item.Assistant.Content[0].Audio = base64.StdEncoding.EncodeToString(audio)
}
conv.Lock.Unlock()
sendEvent(t, types.ResponseContentPartDoneEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
ItemID: itemID,
OutputIndex: 0,
ContentIndex: 0,
Part: item.Assistant.Content[0],
})
sendEvent(t, types.ResponseOutputItemDoneEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
OutputIndex: 0,
Item: item,
})
}
// Emit any tool calls, the terminal response.done, and (for server-side
// assistant tools) the follow-up turn — shared with the buffered path.
emitToolCallItems(ctx, session, conv, t, responseID, toolCalls, content != "", toolTurn)
return true
}

View File

@@ -0,0 +1,213 @@
package openai
import (
"context"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
"github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/reasoning"
)
// transcriptStreamer turns streamed LLM tokens into incremental transcript
// deltas, stripping reasoning. Audio is synthesized once from the full message
// by the caller, so there is no per-sentence segmentation.
var _ = Describe("transcriptStreamer", func() {
It("emits one transcript delta per content token", func() {
t := &fakeTransport{}
s := newTranscriptStreamer(context.Background(), t, "resp1", "item1", "", reasoning.Config{})
for _, tok := range []string{"Hello", " world.", " Bye"} {
s.onToken(tok)
}
Expect(s.content()).To(Equal("Hello world. Bye"))
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioTranscriptDelta)).To(Equal(3))
Expect(t.transcriptDeltaText()).To(Equal("Hello world. Bye"))
})
It("strips leaked reasoning even when reasoning is disabled (disable_thinking safety net)", func() {
// disable_thinking maps to DisableReasoning=true (enable_thinking=false to
// the backend). If the model emits thinking anyway, the transcript must
// still not leak it: stripping always runs for spoken output.
disable := true
t := &fakeTransport{}
s := newTranscriptStreamer(context.Background(), t, "resp1", "item1", "",
reasoning.Config{DisableReasoning: &disable})
s.onToken("<think>secret plan</think>")
s.onToken("The answer is 42.")
Expect(s.content()).To(Equal("The answer is 42."))
Expect(s.content()).ToNot(ContainSubstring("secret plan"))
Expect(t.transcriptDeltaText()).ToNot(ContainSubstring("secret plan"))
})
It("does not swallow autoparser content when the template has a thinking start token (tokenizer-template path)", func() {
// Regression: with tag prefill on, the detected <think> token is
// prepended to the autoparser's already-clean content, swallowing the
// whole reply (empty transcript → no TTS). streamLLMResponse disables
// the prefill for the tokenizer-template path.
disablePrefill := true
t := &fakeTransport{}
s := newTranscriptStreamer(context.Background(), t, "resp1", "item1", "<think>",
reasoning.Config{DisableReasoningTagPrefill: &disablePrefill})
s.onToken("Hello")
s.onToken(" there.")
Expect(s.content()).To(Equal("Hello there."))
Expect(t.transcriptDeltaText()).To(Equal("Hello there."))
})
It("still strips embedded closed reasoning tags with prefill disabled (PEG-fallback safety, #9985)", func() {
// Disabling prefill must not stop stripping closed <think>…</think>
// pairs the PEG fallback can leave in autoparser content.
disablePrefill := true
t := &fakeTransport{}
s := newTranscriptStreamer(context.Background(), t, "resp1", "item1", "<think>",
reasoning.Config{DisableReasoningTagPrefill: &disablePrefill})
s.onToken("<think>secret</think>")
s.onToken("The answer is 42.")
Expect(s.content()).To(Equal("The answer is 42."))
Expect(t.transcriptDeltaText()).ToNot(ContainSubstring("secret"))
})
})
// streamLLMResponse drives a full streamed realtime turn: live transcript
// deltas while the LLM generates, then the whole message is synthesized once.
var _ = Describe("streamLLMResponse", func() {
It("streams transcript deltas then synthesizes the whole message once", func() {
on := true
m := &fakeModel{
predictTokens: []string{"Hello", " world.", " How are you?"},
predictResp: backend.LLMResponse{Response: "Hello world. How are you?"},
ttsStreamChunks: [][]byte{{9}},
ttsStreamRate: 24000,
}
session := &Session{
OutputSampleRate: 24000,
ModelInterface: m,
ModelConfig: &config.ModelConfig{
Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{LLM: &on, TTS: &on}},
},
}
conv := &Conversation{}
t := &fakeTransport{}
llmCfg := &config.ModelConfig{}
handled := streamLLMResponse(context.Background(), session, conv, t, "resp1", nil, nil, llmCfg, nil, nil, 0)
Expect(handled).To(BeTrue())
// One live transcript delta per streamed token.
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioTranscriptDelta)).To(Equal(3))
// The whole message is synthesized ONCE (not per sentence): a single
// emitSpeech replays the one TTS stream chunk.
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioDelta)).To(Equal(1))
Expect(t.transcriptDeltaText()).To(Equal("Hello world. How are you?"))
})
It("synthesizes each clause as it completes when clause chunking is enabled", func() {
on := true
m := &fakeModel{
predictTokens: []string{"Hello world.", " How are you?"},
predictResp: backend.LLMResponse{Response: "Hello world. How are you?"},
ttsStreamChunks: [][]byte{{9}},
ttsStreamRate: 24000,
}
session := &Session{
OutputSampleRate: 24000,
ModelInterface: m,
ModelConfig: &config.ModelConfig{
Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{LLM: &on, TTS: &on, ClauseChunking: &on}},
},
}
conv := &Conversation{}
t := &fakeTransport{}
llmCfg := &config.ModelConfig{}
handled := streamLLMResponse(context.Background(), session, conv, t, "resp1", nil, nil, llmCfg, nil, nil, 0)
Expect(handled).To(BeTrue())
// Two clauses ("Hello world." mid-stream, "How are you?" on flush) → two
// emitSpeech calls → two audio deltas, vs one for whole-message buffering.
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioDelta)).To(Equal(2))
// The full transcript still streams verbatim.
Expect(t.transcriptDeltaText()).To(Equal("Hello world. How are you?"))
// Exactly one terminal response.done.
Expect(t.countEvents(types.ServerEventTypeResponseDone)).To(Equal(1))
})
It("streams content deltas and emits tool-call items (autoparser tool turn)", func() {
on := true
// Autoparser path: reply.Message is empty; content + tool calls arrive via
// ChatDeltas. Chunk 1 carries content, chunk 2 carries the tool call.
contentDelta := []*proto.ChatDelta{{Content: "Let me check."}}
toolDelta := []*proto.ChatDelta{{ToolCalls: []*proto.ToolCallDelta{{Index: 0, Name: "get_weather", Arguments: `{"city":"Paris"}`}}}}
m := &fakeModel{
predictTokens: []string{"", ""},
predictChunkDeltas: [][]*proto.ChatDelta{contentDelta, toolDelta},
predictResp: backend.LLMResponse{ChatDeltas: append(append([]*proto.ChatDelta{}, contentDelta...), toolDelta...)},
ttsStreamChunks: [][]byte{{9}},
ttsStreamRate: 24000,
}
session := &Session{
OutputSampleRate: 24000,
ModelInterface: m,
ModelConfig: &config.ModelConfig{
Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{LLM: &on, TTS: &on}},
},
}
conv := &Conversation{}
t := &fakeTransport{}
llmCfg := &config.ModelConfig{}
llmCfg.TemplateConfig.UseTokenizerTemplate = true
handled := streamLLMResponse(context.Background(), session, conv, t, "resp1", nil, nil, llmCfg, nil, nil, 0)
Expect(handled).To(BeTrue())
// The spoken content was streamed live.
Expect(t.transcriptDeltaText()).To(Equal("Let me check."))
// The tool call is emitted as a function_call item.
Expect(t.countEvents(types.ServerEventTypeResponseFunctionCallArgumentsDone)).To(Equal(1))
// Exactly one terminal response.done.
Expect(t.countEvents(types.ServerEventTypeResponseDone)).To(Equal(1))
})
It("emits only tool-call items for a content-less tool turn (no empty assistant item)", func() {
on := true
toolDelta := []*proto.ChatDelta{{ToolCalls: []*proto.ToolCallDelta{{Index: 0, Name: "get_weather", Arguments: `{"city":"Rome"}`}}}}
m := &fakeModel{
predictTokens: []string{""},
predictChunkDeltas: [][]*proto.ChatDelta{toolDelta},
predictResp: backend.LLMResponse{ChatDeltas: toolDelta},
}
session := &Session{
OutputSampleRate: 24000,
ModelInterface: m,
ModelConfig: &config.ModelConfig{
Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{LLM: &on, TTS: &on}},
},
}
conv := &Conversation{}
t := &fakeTransport{}
llmCfg := &config.ModelConfig{}
llmCfg.TemplateConfig.UseTokenizerTemplate = true
handled := streamLLMResponse(context.Background(), session, conv, t, "resp1", nil, nil, llmCfg, nil, nil, 0)
Expect(handled).To(BeTrue())
// No content → no transcript deltas and no spurious assistant content item.
Expect(t.transcriptDeltaText()).To(Equal(""))
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioTranscriptDelta)).To(Equal(0))
// The tool call is still emitted.
Expect(t.countEvents(types.ServerEventTypeResponseFunctionCallArgumentsDone)).To(Equal(1))
Expect(t.countEvents(types.ServerEventTypeResponseDone)).To(Equal(1))
})
})

View File

@@ -0,0 +1,33 @@
package openai
import (
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/reasoning"
)
// applyPipelineThinking forces the LLM's reasoning/thinking off when the realtime
// pipeline sets disable_thinking, mapping to the enable_thinking=false backend
// metadata via ReasoningConfig.DisableReasoning. The LLM config passed in is the
// per-session copy returned by the config loader, so this does not affect other
// users of the same model. When the pipeline does not set disable_thinking the
// LLM config is left untouched.
func applyPipelineThinking(llm *config.ModelConfig, pipeline config.Pipeline) {
if llm == nil || !pipeline.ThinkingDisabled() {
return
}
disable := true
llm.ReasoningConfig.DisableReasoning = &disable
}
// spokenReasoningConfig adapts a model's reasoning config for stripping reasoning
// OUT of realtime spoken output. ReasoningConfig.DisableReasoning is overloaded:
// the backend reads it as the "enable_thinking=false" hint (which pipeline
// disable_thinking sets via applyPipelineThinking), but the reasoning extractor
// reads it as "skip stripping, assume there is no reasoning". Honouring the latter
// when extracting for speech would leak raw <think>…</think> whenever the model
// ignores the suppression hint. Spoken output must never contain reasoning, so we
// always strip: clear DisableReasoning while keeping custom tokens/tag pairs.
func spokenReasoningConfig(cfg reasoning.Config) reasoning.Config {
cfg.DisableReasoning = nil
return cfg
}

View File

@@ -0,0 +1,50 @@
package openai
import (
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/reasoning"
)
// applyPipelineThinking lets a realtime pipeline force the LLM's thinking off
// (enable_thinking=false metadata) without editing the LLM model config.
var _ = Describe("applyPipelineThinking", func() {
It("disables reasoning on the LLM config when the pipeline disables thinking", func() {
disable := true
llm := &config.ModelConfig{}
applyPipelineThinking(llm, config.Pipeline{DisableThinking: &disable})
Expect(llm.ReasoningConfig.DisableReasoning).ToNot(BeNil())
Expect(*llm.ReasoningConfig.DisableReasoning).To(BeTrue())
})
It("leaves the LLM config untouched when the pipeline does not set disable_thinking", func() {
llm := &config.ModelConfig{}
applyPipelineThinking(llm, config.Pipeline{})
Expect(llm.ReasoningConfig.DisableReasoning).To(BeNil())
})
})
// spokenReasoningConfig clears DisableReasoning so realtime spoken output always
// strips reasoning, even though disable_thinking sets DisableReasoning=true on the
// LLM config (which the backend reads as enable_thinking=false).
var _ = Describe("spokenReasoningConfig", func() {
It("clears DisableReasoning so the extractor still strips leaked reasoning", func() {
disable := true
out := spokenReasoningConfig(reasoning.Config{DisableReasoning: &disable})
Expect(out.DisableReasoning).To(BeNil())
})
It("preserves the other reasoning settings", func() {
disable := true
out := spokenReasoningConfig(reasoning.Config{
DisableReasoning: &disable,
ThinkingStartTokens: []string{"<reason>"},
TagPairs: []reasoning.TagPair{{Start: "<reason>", End: "</reason>"}},
})
Expect(out.ThinkingStartTokens).To(Equal([]string{"<reason>"}))
Expect(out.TagPairs).To(HaveLen(1))
Expect(out.TagPairs[0].Start).To(Equal("<reason>"))
})
})

View File

@@ -0,0 +1,63 @@
package openai
import (
"context"
"fmt"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
)
// emitTranscription transcribes a committed utterance and emits the transcription
// events for it, returning the final transcript text. With
// pipeline.streaming.transcription enabled it streams each transcript fragment as
// a conversation.item.input_audio_transcription.delta as the backend produces it,
// then a completed event; otherwise it transcribes the whole utterance and emits
// a single completed event. delta and completed events share itemID.
func emitTranscription(ctx context.Context, t Transport, session *Session, itemID, audioPath string) (string, error) {
cfg := session.InputAudioTranscription
if session.ModelConfig != nil && session.ModelConfig.Pipeline.StreamTranscription() {
final, err := session.ModelInterface.TranscribeStream(ctx, audioPath, cfg.Language, false, false, cfg.Prompt, func(delta string) {
_ = t.SendEvent(types.ConversationItemInputAudioTranscriptionDeltaEvent{
ServerEventBase: types.ServerEventBase{EventID: "event_TODO"},
ItemID: itemID,
ContentIndex: 0,
Delta: delta,
})
})
if err != nil {
return "", err
}
transcript := ""
if final != nil {
transcript = final.Text
}
if err := t.SendEvent(types.ConversationItemInputAudioTranscriptionCompletedEvent{
ServerEventBase: types.ServerEventBase{EventID: "event_TODO"},
ItemID: itemID,
ContentIndex: 0,
Transcript: transcript,
}); err != nil {
return "", err
}
return transcript, nil
}
// Unary fallback: transcribe the whole utterance, emit one completed event.
tr, err := session.ModelInterface.Transcribe(ctx, audioPath, cfg.Language, false, false, cfg.Prompt)
if err != nil {
return "", err
}
if tr == nil {
return "", fmt.Errorf("transcribe result is nil")
}
if err := t.SendEvent(types.ConversationItemInputAudioTranscriptionCompletedEvent{
ServerEventBase: types.ServerEventBase{EventID: "event_TODO"},
ItemID: itemID,
ContentIndex: 0,
Transcript: tr.Text,
}); err != nil {
return "", err
}
return tr.Text, nil
}

View File

@@ -0,0 +1,54 @@
package openai
import (
"context"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
"github.com/mudler/LocalAI/core/schema"
)
// emitTranscription transcribes a committed utterance, streaming transcript text
// deltas when the pipeline opts in, and returns the final transcript text.
var _ = Describe("emitTranscription", func() {
It("streams transcription deltas then a completed event when streaming is enabled", func() {
on := true
session := &Session{
InputAudioTranscription: &types.AudioTranscription{},
ModelConfig: &config.ModelConfig{
Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{Transcription: &on}},
},
ModelInterface: &fakeModel{
transcribeDeltas: []string{"Hel", "lo", " world"},
transcribeFinal: &schema.TranscriptionResult{Text: "Hello world"},
},
}
t := &fakeTransport{}
transcript, err := emitTranscription(context.Background(), t, session, "item1", "/tmp/x.wav")
Expect(err).ToNot(HaveOccurred())
Expect(transcript).To(Equal("Hello world"))
Expect(t.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionDelta)).To(Equal(3))
Expect(t.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionCompleted)).To(Equal(1))
})
It("emits a single completed event with no deltas in unary mode", func() {
session := &Session{
InputAudioTranscription: &types.AudioTranscription{},
ModelConfig: &config.ModelConfig{}, // streaming off
ModelInterface: &fakeModel{transcribeFinal: &schema.TranscriptionResult{Text: "Hi"}},
}
t := &fakeTransport{}
transcript, err := emitTranscription(context.Background(), t, session, "item1", "/tmp/x.wav")
Expect(err).ToNot(HaveOccurred())
Expect(transcript).To(Equal("Hi"))
Expect(t.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionDelta)).To(Equal(0))
Expect(t.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionCompleted)).To(Equal(1))
})
})

View File

@@ -17,6 +17,24 @@ const STATUS_STYLES = {
error: { icon: 'fa-solid fa-circle', color: 'var(--color-error)', bg: 'var(--color-error-light)' },
}
// upsertAssistant merges a streamed transcript fragment into the assistant entry
// identified by the server's item_id, or appends a new entry if none exists yet.
// Keying by item_id (not a mutable index tracked across handler/updater
// boundaries) makes streamed deltas idempotent and order-independent, so React's
// batching of non-React data-channel events cannot produce a duplicate bubble.
// mode 'append' adds to the running text; 'replace' sets the final transcript.
function upsertAssistant(prev, itemId, text, mode) {
// Only assistant entries carry an id, and the streaming entry is almost
// always the newest — search from the tail so per-delta cost stays constant.
const i = prev.findLastIndex(e => e.id === itemId)
if (i === -1) {
return [...prev, { role: 'assistant', id: itemId, text }]
}
const next = [...prev]
next[i] = { ...next[i], text: mode === 'append' ? next[i].text + text : text }
return next
}
export default function Talk() {
const { addToast } = useOutletContext()
const navigate = useNavigate()
@@ -34,7 +52,10 @@ export default function Talk() {
// Transcript
const [transcript, setTranscript] = useState([])
const streamingRef = useRef(null) // tracks the index of the in-progress assistant message
// item_id of the assistant message currently streaming — used only to remove
// its partial bubble when a response is cancelled (barge-in). The transcript
// itself is keyed by item_id via upsertAssistant, not by this ref.
const inProgressIdRef = useRef(null)
// Session settings
const [instructions, setInstructions] = useState(
@@ -227,39 +248,21 @@ export default function Talk() {
break
case 'conversation.item.input_audio_transcription.completed':
if (event.transcript) {
streamingRef.current = null
setTranscript(prev => [...prev, { role: 'user', text: event.transcript }])
}
updateStatus('thinking', 'Generating response...')
break
case 'response.output_audio_transcript.delta':
if (event.delta) {
setTranscript(prev => {
if (streamingRef.current !== null) {
const updated = [...prev]
updated[streamingRef.current] = {
...updated[streamingRef.current],
text: updated[streamingRef.current].text + event.delta,
}
return updated
}
streamingRef.current = prev.length
return [...prev, { role: 'assistant', text: event.delta }]
})
inProgressIdRef.current = event.item_id
setTranscript(prev => upsertAssistant(prev, event.item_id, event.delta, 'append'))
}
break
case 'response.output_audio_transcript.done':
if (event.transcript) {
setTranscript(prev => {
if (streamingRef.current !== null) {
const updated = [...prev]
updated[streamingRef.current] = { ...updated[streamingRef.current], text: event.transcript }
return updated
}
return [...prev, { role: 'assistant', text: event.transcript }]
})
setTranscript(prev => upsertAssistant(prev, event.item_id, event.transcript, 'replace'))
}
streamingRef.current = null
inProgressIdRef.current = null
break
case 'response.output_audio.delta':
updateStatus('speaking', 'Speaking...')
@@ -281,7 +284,7 @@ export default function Talk() {
// Pretty-print JSON for readability; fall back to raw string.
try { preview = JSON.stringify(JSON.parse(preview), null, 2) } catch (_) { /* keep raw */ }
setTranscript(prev => [...prev, { role: 'tool_result', text: preview }])
streamingRef.current = null // tool result ends the current assistant text run
inProgressIdRef.current = null // tool result ends the current assistant text run
}
break
}
@@ -290,9 +293,20 @@ export default function Talk() {
// conversation.item.create + response.create when it's done.
handleFunctionCall(event)
break
case 'response.done':
case 'response.done': {
// A cancelled response (barge-in / interruption) leaves a partial,
// incrementally-streamed assistant bubble behind. The server discards
// the interrupted item from history; mirror that here (remove the
// in-progress assistant entry by item_id) so the regenerated reply
// doesn't show up as a second assistant message.
if (event.response?.status === 'cancelled' && inProgressIdRef.current) {
const id = inProgressIdRef.current
inProgressIdRef.current = null
setTranscript(prev => prev.filter(e => e.id !== id))
}
updateStatus('listening', 'Listening...')
break
}
case 'error':
hasErrorRef.current = true
updateStatus('error', 'Error: ' + (event.error?.message || 'Unknown error'))
@@ -789,7 +803,7 @@ export default function Talk() {
const iconColor = isToolCall || isToolResult ? 'var(--color-text-secondary)'
: isUser ? 'var(--color-primary)' : 'var(--color-accent)'
return (
<div key={i} style={{ display: 'flex', alignItems: 'flex-start', gap: 'var(--spacing-xs)' }}>
<div key={entry.id || i} style={{ display: 'flex', alignItems: 'flex-start', gap: 'var(--spacing-xs)' }}>
<i className={iconClass} style={{ color: iconColor, marginTop: 3, flexShrink: 0, fontSize: '0.75rem' }} />
<p style={{
margin: 0,