test(parakeet-cpp): update model-gated specs for multi-segment output

The offline AudioTranscription specs asserted the old single synthetic segment (Segments HaveLen(1), Segments[0].Text == res.Text). With NeMo-faithful segmentation a multi-sentence clip now yields multiple punctuation-delimited segments, so assert the new contract instead: one-or-more time-ordered segments, each with text and (under word granularity) per-segment words whose span tracks the segment start/end.

Caught by running the model-gated suite on the dgx (GB10) against the real tdt_ctc-110m + realtime_eou models.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
docs(audio): document parakeet-cpp segment timestamps + segment_gap_threshold
2026-06-07 08:16:53 -04:00 · 2026-06-07 10:53:30 +00:00 · 2026-06-07 08:47:12 +00:00 · 2026-06-07 08:47:12 +00:00 · 2026-06-07 08:37:42 +02:00 · 2026-06-07 00:37:28 +02:00
35 changed files with 1093 additions and 1313 deletions
--- a/.github/gallery-agent/main.go
+++ b/.github/gallery-agent/main.go
@@ -3,6 +3,7 @@ package main
 import (
 	"context"
 	"encoding/json"
+	"errors"
 	"fmt"
 	"os"
 	"strconv"
@@ -113,6 +114,17 @@ func main() {
 	fmt.Println("Searching for trending models on HuggingFace...")
 	rawModels, err := client.GetTrending(searchTerm, limit)
 	if err != nil {
+		if errors.Is(err, hfapi.ErrRateLimited) {
+			fmt.Printf("HuggingFace API is rate limited after retries, skipping this run: %v\n", err)
+			writeSummary(AddedModelSummary{
+				SearchTerm:     searchTerm,
+				TotalFound:     0,
+				ModelsAdded:    0,
+				Quantization:   quantization,
+				ProcessingTime: time.Since(startTime).String(),
+			})
+			return
+		}
 		fmt.Fprintf(os.Stderr, "Error fetching models: %v\n", err)
 		os.Exit(1)
 	}
@@ -277,4 +289,3 @@ func truncateString(s string, maxLen int) string {
 	}
 	return s[:maxLen] + "..."
 }
-
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=1520eda980564241434b791ce2bbbd128c4be9ea
+IK_LLAMA_VERSION?=6b9de3dbaa21ae95ea80638e5ee836795cc48c93
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/go/parakeet-cpp/Makefile
+++ b/backend/go/parakeet-cpp/Makefile
@@ -1,6 +1,6 @@
 # parakeet-cpp backend Makefile.
 #
-# Upstream pin lives below as PARAKEET_VERSION?=b11fe5bca78ad8b342dd559a43d76df3984bb447
+# Upstream pin lives below as PARAKEET_VERSION?=abd0087dcc92ec5ad1f96f9fd86c49eb26a5ce67
 # (.github/bump_deps.sh) can find and update it - matches the
 # whisper.cpp / ds4 / vibevoice-cpp convention.
 #
@@ -15,7 +15,7 @@
 # That's what the L0 smoke test uses. The default target below does the
 # proper clone-at-pin + cmake build so CI doesn't need a side-checkout.

-PARAKEET_VERSION?=b11fe5bca78ad8b342dd559a43d76df3984bb447
+PARAKEET_VERSION?=abd0087dcc92ec5ad1f96f9fd86c49eb26a5ce67
 PARAKEET_REPO?=https://github.com/mudler/parakeet.cpp

 GOCMD?=go
--- a/backend/go/parakeet-cpp/batcher.go
+++ b/backend/go/parakeet-cpp/batcher.go
@@ -7,8 +7,12 @@ import "time"
 type batchRequest struct {
 	pcm     []float32
 	decoder int32
-	tag     string
-	reply   chan batchReply
+	// language is the per-request target locale ("" means the model default).
+	// parakeet.cpp's batched C-API takes ONE target_lang for the whole batch,
+	// so the dispatcher only coalesces requests that share a language.
+	language string
+	tag      string
+	reply    chan batchReply
 }

 // batchReply carries one per-item JSON object string (an element of the C-API's
@@ -43,13 +47,25 @@ func newBatcher(maxSize int, maxWait time.Duration, runBatch func([]*batchReques
 // run is the dispatcher loop: accumulate submitted requests until either maxSize
 // is reached or maxWait elapses since the first queued request, then dispatch.
 // Exits when stop is closed (draining any partially-filled batch first).
+//
+// A batch carries ONE language (parakeet.cpp's batched C-API takes a single
+// target_lang), so a request whose language differs from the batch leader is
+// not coalesced: it is held in carry and becomes the leader of the next batch.
+// carry is therefore never dropped and its caller never deadlocks: every batch
+// (including a lone carry on stop) is dispatched, and runBatch replies to all.
 func (b *batcher) run(stop <-chan struct{}) {
+	var carry *batchRequest
 	for {
 		var first *batchRequest
-		select {
-		case first = <-b.submit:
-		case <-stop:
-			return
+		if carry != nil {
+			// A mismatched request from the previous fill leads this batch.
+			first, carry = carry, nil
+		} else {
+			select {
+			case first = <-b.submit:
+			case <-stop:
+				return
+			}
 		}
 		batch := []*batchRequest{first}

@@ -64,12 +80,22 @@ func (b *batcher) run(stop <-chan struct{}) {
 		for len(batch) < b.maxSize {
 			select {
 			case r := <-b.submit:
+				if r.language != first.language {
+					// Different language: carry it to the next batch so this
+					// batch stays single-language, then dispatch what we have.
+					carry = r
+					break fill
+				}
 				batch = append(batch, r)
 			case <-timer.C:
 				break fill
 			case <-stop:
 				timer.Stop()
 				b.runBatch(batch)
+				// Don't strand a carried request's caller on shutdown.
+				if carry != nil {
+					b.runBatch([]*batchRequest{carry})
+				}
 				return
 			}
 		}
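
The single-language rule in the batcher.go hunks above can be illustrated without channels or timers. The following is a minimal synchronous sketch, not the dispatcher itself: `request` and `coalesce` are hypothetical names, and the real loop also closes a batch when `maxWait` elapses or `stop` fires, which this sketch leaves out. It only shows how a request whose language differs from the batch leader closes the current batch and leads the next one.

```go
package main

import "fmt"

// request is a hypothetical stand-in for batchRequest; only the fields the
// sketch needs are kept.
type request struct {
	tag      string
	language string
}

// coalesce splits reqs, in arrival order, into batches of at most maxSize
// requests that all share the leader's language. A request with a different
// language is "carried": it closes the current batch and leads the next one,
// so no request is ever dropped.
func coalesce(reqs []request, maxSize int) [][]request {
	var batches [][]request
	var batch []request
	for _, r := range reqs {
		if len(batch) > 0 && (r.language != batch[0].language || len(batch) >= maxSize) {
			batches = append(batches, batch)
			batch = nil
		}
		batch = append(batch, r)
	}
	if len(batch) > 0 {
		batches = append(batches, batch)
	}
	return batches
}

func main() {
	langs := []string{"en", "en", "de", "de", "en", "fr", "fr"}
	reqs := make([]request, len(langs))
	for i, l := range langs {
		reqs[i] = request{tag: string(rune('a' + i)), language: l}
	}
	for i, b := range coalesce(reqs, 16) {
		fmt.Printf("batch %d: lang=%q size=%d\n", i, b[0].language, len(b))
	}
}
```

With the same language sequence the new spec submits, this in-order version yields four batches (en, de, en, fr). The concurrent dispatcher may see submissions in a different order, which is why the spec asserts only the invariant (one language per batch, every request answered) rather than a specific batch layout.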
--- a/backend/go/parakeet-cpp/batcher_test.go
+++ b/backend/go/parakeet-cpp/batcher_test.go
@@ -105,4 +105,60 @@ var _ = Describe("batcher", func() {
 		go func() { <-rep }()
 		Eventually(dispatched, "2s").Should(Receive(Equal(1)))
 	})
+
+	It("never coalesces requests with different languages into one batch", func() {
+		// parakeet.cpp's batched C-API takes ONE target_lang per batch, so the
+		// dispatcher must keep every dispatched batch single-language. Submit a
+		// mix of languages and assert (a) no batch ever carries more than one
+		// distinct language and (b) every submitted request still gets a reply
+		// (the mismatched carry-over is never dropped).
+		var mu sync.Mutex
+		var langsPerBatch [][]string
+		run := func(reqs []*batchRequest) {
+			seen := map[string]struct{}{}
+			var distinct []string
+			for _, r := range reqs {
+				if _, ok := seen[r.language]; !ok {
+					seen[r.language] = struct{}{}
+					distinct = append(distinct, r.language)
+				}
+			}
+			mu.Lock()
+			langsPerBatch = append(langsPerBatch, distinct)
+			mu.Unlock()
+			echoReply(reqs)
+		}
+		// Large window + size so the fill loop stays open across submits and the
+		// language constraint (not the timer) is what splits the batches.
+		b := newBatcher(16, 200*time.Millisecond, run)
+		stop := make(chan struct{})
+		go b.run(stop)
+		defer close(stop)
+
+		langs := []string{"en", "en", "de", "de", "en", "fr", "fr"}
+		const N = 7
+		var wg sync.WaitGroup
+		got := make([]string, N)
+		for i := 0; i < N; i++ {
+			wg.Add(1)
+			go func(i int) {
+				defer wg.Done()
+				rep := make(chan batchReply, 1)
+				b.submit <- &batchRequest{tag: string(rune('a' + i)), language: langs[i], reply: rep}
+				got[i] = (<-rep).json
+			}(i)
+		}
+		wg.Wait()
+
+		mu.Lock()
+		defer mu.Unlock()
+		// Invariant: every dispatched batch is single-language.
+		for _, distinct := range langsPerBatch {
+			Expect(len(distinct)).To(Equal(1), "a batch coalesced more than one language: %v", distinct)
+		}
+		// Liveness: every request got a reply (carry-over never stranded).
+		for i := 0; i < N; i++ {
+			Expect(got[i]).To(Equal(string(rune('a' + i))))
+		}
+	})
 })
--- a/backend/go/parakeet-cpp/goparakeetcpp.go
+++ b/backend/go/parakeet-cpp/goparakeetcpp.go
@@ -48,6 +48,13 @@ var (
 	// side reads them as const float*/const int*.
 	CppTranscribePcmBatchJSON func(ctx uintptr, samplesConcat []float32, nSamples []int32, nClips int32, sampleRate int32, decoder int32) uintptr

+	// CppTranscribePcmBatchJSONLang is the multilingual variant of the batched
+	// JSON entry point: identical, plus a trailing target_lang. "" (the model
+	// default, "auto") is passed for non-prompt models, which ignore it; an
+	// unknown locale on a prompt model returns 0 and sets last_error. Present
+	// only in newer libparakeet.so; nil falls back to CppTranscribePcmBatchJSON.
+	CppTranscribePcmBatchJSONLang func(ctx uintptr, samplesConcat []float32, nSamples []int32, nClips int32, sampleRate int32, decoder int32, targetLang string) uintptr
+
 	// Cache-aware streaming (RNN-T) entry points. stream_begin returns 0 for
 	// non-streaming models. feed/finalize return a malloc'd char* (uintptr,
 	// freed via CppFreeString); feed writes 1 to *eouOut on an <EOU>/<EOB>.
@@ -55,6 +62,18 @@ var (
 	CppStreamFeed     func(s uintptr, pcm []float32, nSamples int32, eouOut unsafe.Pointer) uintptr
 	CppStreamFinalize func(s uintptr) uintptr
 	CppStreamFree     func(s uintptr)
+
+	// CppStreamBeginLang is the multilingual variant of stream_begin: identical,
+	// plus a trailing target_lang ("" means the model default). Present only in
+	// newer libparakeet.so; nil falls back to CppStreamBegin.
+	CppStreamBeginLang func(ctx uintptr, targetLang string) uintptr
+
+	// Streaming JSON variants (ABI v4): feed/finalize returning a malloc'd char*
+	// JSON document {text,eou,frame_sec,words} (uintptr, freed via CppFreeString)
+	// so streaming segments can carry per-word timestamps. Present only in newer
+	// libparakeet.so; nil falls back to the text-only CppStreamFeed/Finalize path.
+	CppStreamFeedJSON     func(s uintptr, pcm []float32, nSamples int32) uintptr
+	CppStreamFinalizeJSON func(s uintptr) uintptr
 )

 // streamChunkSamples is how much 16 kHz mono PCM we hand to stream_feed per
@@ -72,9 +91,26 @@ const streamChunkSamples = 16000
 //
 // "start"/"end"/"t" are seconds; "conf" is confidence in (0,1].
 type transcriptJSON struct {
-	Text   string            `json:"text"`
-	Words  []transcriptWord  `json:"words"`
-	Tokens []transcriptToken `json:"tokens"`
+	Text     string            `json:"text"`
+	FrameSec float64           `json:"frame_sec"`
+	Words    []transcriptWord  `json:"words"`
+	Tokens   []transcriptToken `json:"tokens"`
+}
+
+// streamFeedJSON mirrors the document returned by
+// parakeet_capi_stream_feed_json / parakeet_capi_stream_finalize_json (ABI v4):
+//
+//	{"text":"...","eou":0,"frame_sec":0.080000,
+//	 "words":[{"w":"...","start":0.480,"end":0.640,"conf":0.9100}, ...]}
+//
+// "text" is the newly-finalized text since the last call; "eou" is 1 when an
+// <EOU>/<EOB> fired this feed; "words" are the words finalized this call with
+// absolute (stream-relative) start/end seconds.
+type streamFeedJSON struct {
+	Text     string           `json:"text"`
+	Eou      int              `json:"eou"`
+	FrameSec float64          `json:"frame_sec"`
+	Words    []transcriptWord `json:"words"`
 }

 type transcriptWord struct {
@@ -103,6 +139,10 @@ type ParakeetCpp struct {
 	engineMu sync.Mutex // sole guard of the one C engine (dispatcher + streaming)
 	bat      *batcher
 	batStop  chan struct{}
+	// segmentGapFrames is NeMo's segment_gap_threshold in ENCODER FRAMES (model
+	// YAML option, default 0=off). When >0 it adds NeMo's silence-gap split on
+	// top of the punctuation split; converted to seconds via the JSON frame_sec.
+	segmentGapFrames int
 }

 // Load is the LocalAI gRPC entry point for LoadModel: it calls
@@ -132,6 +172,11 @@ func (p *ParakeetCpp) Load(opts *pb.ModelOptions) error {
 	if maxWaitMs < 0 {
 		maxWaitMs = 0
 	}
+
+	// NeMo's segment_gap_threshold (encoder frames, default 0=off). Off by
+	// default matches NeMo's default (punctuation-only segments); when set it
+	// additionally splits segments on inter-word silence (see transcriptResultFromDoc).
+	p.segmentGapFrames = optInt(opts, "segment_gap_threshold", 0)
 	if CppTranscribePcmBatchJSON != nil {
 		p.batStop = make(chan struct{})
 		p.bat = newBatcher(maxSize, time.Duration(maxWaitMs)*time.Millisecond, p.runBatch)
@@ -187,8 +232,19 @@ func (p *ParakeetCpp) runBatch(reqs []*batchRequest) {
 	if len(reqs) > 0 {
 		dec = reqs[0].decoder
 	}
+	// All requests in a batch share one language (the batcher coalesces only
+	// same-language requests), so any element's language describes the batch.
+	lang := ""
+	if len(reqs) > 0 {
+		lang = reqs[0].language
+	}
 	p.engineMu.Lock()
-	cstr := CppTranscribePcmBatchJSON(p.ctxPtr, concat, nSamples, int32(len(reqs)), 16000, dec)
+	var cstr uintptr
+	if CppTranscribePcmBatchJSONLang != nil {
+		cstr = CppTranscribePcmBatchJSONLang(p.ctxPtr, concat, nSamples, int32(len(reqs)), 16000, dec, lang)
+	} else {
+		cstr = CppTranscribePcmBatchJSON(p.ctxPtr, concat, nSamples, int32(len(reqs)), 16000, dec)
+	}
 	p.engineMu.Unlock()
 	if cstr == 0 {
 		err := fmt.Errorf("parakeet-cpp: batch transcribe failed: %s", CppLastError(p.ctxPtr))
@@ -226,8 +282,9 @@ func (p *ParakeetCpp) runBatch(reqs []*batchRequest) {
 // OpenAI API, whose default is segment-level); token ids always populate
 // Segment.Tokens.
 //
-// translate/diarize/prompt/temperature/language/threads are not applicable to
-// parakeet and are ignored; streaming is handled by AudioTranscriptionStream
+// translate/diarize/prompt/temperature/threads are not applicable to parakeet
+// and are ignored; language is honored on the batched + streaming paths (see
+// opts.GetLanguage() below); streaming is handled by AudioTranscriptionStream
 // (L2).
 func (p *ParakeetCpp) AudioTranscription(ctx context.Context, opts *pb.TranscriptRequest) (pb.TranscriptResult, error) {
 	if p.ctxPtr == 0 {
@@ -259,7 +316,7 @@ func (p *ParakeetCpp) AudioTranscription(ctx context.Context, opts *pb.Transcrip
 		if err := json.Unmarshal([]byte(raw), &doc); err != nil {
 			return pb.TranscriptResult{}, fmt.Errorf("parakeet-cpp: decode transcript json: %w", err)
 		}
-		return transcriptResultFromDoc(doc, opts), nil
+		return transcriptResultFromDoc(doc, opts, p.segmentGapFrames), nil
 	}

 	// Batched path: decode to PCM, submit to the batcher, wait for this request's
@@ -271,7 +328,7 @@ func (p *ParakeetCpp) AudioTranscription(ctx context.Context, opts *pb.Transcrip
 	}
 	rep := make(chan batchReply, 1)
 	select {
-	case p.bat.submit <- &batchRequest{pcm: pcm, decoder: 0, reply: rep}:
+	case p.bat.submit <- &batchRequest{pcm: pcm, decoder: 0, language: opts.GetLanguage(), reply: rep}:
 	case <-ctx.Done():
 		return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
 	}
@@ -288,34 +345,169 @@ func (p *ParakeetCpp) AudioTranscription(ctx context.Context, opts *pb.Transcrip
 	if err := json.Unmarshal([]byte(res.json), &doc); err != nil {
 		return pb.TranscriptResult{}, fmt.Errorf("parakeet-cpp: decode transcript json: %w", err)
 	}
-	return transcriptResultFromDoc(doc, opts), nil
+	return transcriptResultFromDoc(doc, opts, p.segmentGapFrames), nil
 }

+// segmentSeparators is NeMo's default segment_seperators (sentence-ending
+// punctuation). Splitting on these matches NeMo's default segment timestamps.
+var segmentSeparators = []rune{'.', '?', '!'}
+
 // transcriptResultFromDoc maps a decoded transcriptJSON to a TranscriptResult,
-// synthesising a single whole-clip segment and attaching word timings only when
-// the caller requested word granularity. Shared by the batched and direct paths.
-func transcriptResultFromDoc(doc transcriptJSON, opts *pb.TranscriptRequest) pb.TranscriptResult {
+// grouping words into NeMo-faithful segments (see splitWordsIntoSegments). The
+// optional gapFrames (NeMo's segment_gap_threshold, in encoder FRAMES; 0=off)
+// additionally splits on inter-word silence; it is converted to a seconds gap
+// with the document's frame_sec. Per-segment word timings are attached only when
+// the caller requested word granularity; token ids populate each segment's
+// Tokens by time-window membership. Shared by the batched and direct paths.
+func transcriptResultFromDoc(doc transcriptJSON, opts *pb.TranscriptRequest, gapFrames int) pb.TranscriptResult {
 	text := strings.TrimSpace(doc.Text)
-	words := make([]*pb.TranscriptWord, 0, len(doc.Words))
-	for _, w := range doc.Words {
-		words = append(words, &pb.TranscriptWord{Start: secondsToNanos(w.Start), End: secondsToNanos(w.End), Text: w.W})
+
+	// Frame-unit gap threshold -> seconds (NeMo segment_gap_threshold). 0 = off.
+	gapSeconds := 0.0
+	if gapFrames > 0 {
+		if doc.FrameSec > 0 {
+			gapSeconds = float64(gapFrames) * doc.FrameSec
+		} else {
+			xlog.Warn("parakeet-cpp: segment_gap_threshold set but libparakeet.so " +
+				"did not report frame_sec; falling back to punctuation-only segments")
+		}
 	}
-	tokens := make([]int32, 0, len(doc.Tokens))
-	for _, t := range doc.Tokens {
-		tokens = append(tokens, t.ID)
+
+	groups := splitWordsIntoSegments(doc.Words, segmentSeparators, gapSeconds)
+	if len(groups) == 0 {
+		// No words (edge case): single whole-clip text segment.
+		return pb.TranscriptResult{
+			Text:     text,
+			Segments: []*pb.TranscriptSegment{{Id: 0, Text: text}},
+		}
 	}
-	var segStart, segEnd int64
-	if len(words) > 0 {
-		segStart = words[0].Start
-		segEnd = words[len(words)-1].End
+
+	wantWords := wordsRequested(opts.TimestampGranularities)
+	segments := make([]*pb.TranscriptSegment, 0, len(groups))
+	for id, group := range groups {
+		parts := make([]string, len(group))
+		for i, gw := range group {
+			parts[i] = gw.W
+		}
+		seg := &pb.TranscriptSegment{
+			Id:     int32(id),
+			Start:  secondsToNanos(group[0].Start),
+			End:    secondsToNanos(group[len(group)-1].End),
+			Text:   strings.TrimSpace(strings.Join(parts, " ")),
+			Tokens: tokensInWindow(doc.Tokens, group[0].Start, group[len(group)-1].End),
+		}
+		if wantWords {
+			ws := make([]*pb.TranscriptWord, len(group))
+			for i, gw := range group {
+				ws[i] = &pb.TranscriptWord{Start: secondsToNanos(gw.Start), End: secondsToNanos(gw.End), Text: gw.W}
+			}
+			seg.Words = ws
+		}
+		segments = append(segments, seg)
 	}
-	seg := &pb.TranscriptSegment{Id: 0, Start: segStart, End: segEnd, Text: text, Tokens: tokens}
-	if wordsRequested(opts.TimestampGranularities) {
-		seg.Words = words
-	}
-	return pb.TranscriptResult{Text: text, Segments: []*pb.TranscriptSegment{seg}}
+	return pb.TranscriptResult{Text: text, Segments: segments}
 }

+// splitWordsIntoSegments groups words into segments exactly as NeMo's
+// get_segment_offsets does (nemo/collections/asr/parts/utils/timestamp_utils.py).
+// Walking the words, it closes a segment when (1) the gap rule is enabled
+// (gapSeconds > 0) and the segment already has words and the gap from the
+// previous word's end to this word's start is >= gapSeconds - the current word
+// then STARTS a new segment - or, checked only when the gap rule did not apply
+// (NeMo's elif), (2) the word ends with (or is) a separator, which closes the
+// segment INCLUDING that word. Trailing words flush into a final segment.
+// gapSeconds <= 0 disables the gap rule, matching NeMo's default
+// segment_gap_threshold=None (punctuation-only segments).
+func splitWordsIntoSegments(words []transcriptWord, separators []rune, gapSeconds float64) [][]transcriptWord {
+	var segments [][]transcriptWord
+	var cur []transcriptWord
+	for i, word := range words {
+		gapActive := gapSeconds > 0 && len(cur) > 0
+		if gapActive && (word.Start-words[i-1].End) >= gapSeconds {
+			segments = append(segments, cur)
+			cur = []transcriptWord{word}
+			continue
+		}
+		if !gapActive && endsWithSeparator(word.W, separators) {
+			cur = append(cur, word)
+			segments = append(segments, cur)
+			cur = nil
+			continue
+		}
+		cur = append(cur, word)
+	}
+	if len(cur) > 0 {
+		segments = append(segments, cur)
+	}
+	return segments
+}
+
+// endsWithSeparator reports whether w's last rune is in separators (matching
+// NeMo's `word[-1] in delims or word in delims`).
+func endsWithSeparator(w string, separators []rune) bool {
+	r := []rune(strings.TrimSpace(w))
+	if len(r) == 0 {
+		return false
+	}
+	last := r[len(r)-1]
+	for _, s := range separators {
+		if last == s {
+			return true
+		}
+	}
+	return false
+}
+
+// tokensInWindow returns the ids of tokens whose timestamp t falls in
+// [start, end] (inclusive), assigning each token to the segment that spans its
+// time. The last segment's end is the last word end, so the final token is
+// included.
+func tokensInWindow(tokens []transcriptToken, start, end float64) []int32 {
+	var ids []int32
+	for _, t := range tokens {
+		if t.T >= start && t.T <= end {
+			ids = append(ids, t.ID)
+		}
+	}
+	return ids
+}
+
+// streamSegmenter accumulates streaming words into per-utterance segments. EOU
+// is the model's own utterance boundary; each closed segment takes its start/end
+// from its first/last accumulated word.
+type streamSegmenter struct {
+	segs   []*pb.TranscriptSegment
+	cur    []transcriptWord
+	nextID int32
+}
+
+func (s *streamSegmenter) add(doc streamFeedJSON) {
+	s.cur = append(s.cur, doc.Words...)
+	if doc.Eou != 0 {
+		s.flush()
+	}
+}
+
+func (s *streamSegmenter) flush() {
+	if len(s.cur) == 0 {
+		return
+	}
+	parts := make([]string, len(s.cur))
+	for i, w := range s.cur {
+		parts[i] = w.W
+	}
+	s.segs = append(s.segs, &pb.TranscriptSegment{
+		Id:    s.nextID,
+		Start: secondsToNanos(s.cur[0].Start),
+		End:   secondsToNanos(s.cur[len(s.cur)-1].End),
+		Text:  strings.TrimSpace(strings.Join(parts, " ")),
+	})
+	s.nextID++
+	s.cur = nil
+}
+
+func (s *streamSegmenter) segments() []*pb.TranscriptSegment { return s.segs }
+
 // wordsRequested reports whether the caller asked for word-level timestamps.
 // The OpenAI transcription API gates word timings behind
 // timestamp_granularities[] containing "word" and defaults to segment-level
@@ -361,7 +553,12 @@ func (p *ParakeetCpp) AudioTranscriptionStream(ctx context.Context, opts *pb.Tra
 		return status.Error(codes.Canceled, "transcription cancelled")
 	}

-	stream := CppStreamBegin(p.ctxPtr)
+	var stream uintptr
+	if CppStreamBeginLang != nil {
+		stream = CppStreamBeginLang(p.ctxPtr, opts.GetLanguage())
+	} else {
+		stream = CppStreamBegin(p.ctxPtr)
+	}
 	if stream == 0 {
 		// Not a cache-aware streaming model: run a normal offline
 		// transcription and emit it as one delta + a closing final result.
@@ -390,6 +587,14 @@ func (p *ParakeetCpp) AudioTranscriptionStream(ctx context.Context, opts *pb.Tra
 		return err
 	}

+	// ABI v4: when the streaming JSON entry points are present, drive them so the
+	// per-utterance segments carry per-word start/end timestamps. Falls through to
+	// the text-only loop below against an older libparakeet.so. Runs under the
+	// engineMu already held above.
+	if CppStreamFeedJSON != nil {
+		return p.streamJSON(ctx, stream, data, duration, results)
+	}
+
 	var (
 		full     strings.Builder
 		segText  strings.Builder
@@ -466,6 +671,71 @@ func (p *ParakeetCpp) AudioTranscriptionStream(ctx context.Context, opts *pb.Tra
 	return nil
 }

+// streamJSON drives the ABI v4 streaming JSON entry points: each feed/finalize
+// returns a {text,eou,frame_sec,words} document. The newly-finalized text is
+// emitted as a delta (unchanged streaming contract) while words are accumulated
+// into per-utterance segments (closed on EOU) so the closing FinalResult carries
+// timestamped segments. Runs under engineMu (already held by the caller).
+func (p *ParakeetCpp) streamJSON(ctx context.Context, stream uintptr, data []float32,
+	duration float32, results chan *pb.TranscriptStreamResponse) error {
+	var (
+		full strings.Builder
+		seg  streamSegmenter
+	)
+	// consume frees the malloc'd char* (a 0 return is an error), parses the JSON,
+	// emits the delta, and routes words through the segmenter.
+	consume := func(ret uintptr) error {
+		if ret == 0 {
+			msg := CppLastError(p.ctxPtr)
+			if msg == "" {
+				msg = "unknown error"
+			}
+			return fmt.Errorf("parakeet-cpp: stream feed/finalize failed: %s", msg)
+		}
+		raw := goStringFromCPtr(ret)
+		CppFreeString(ret)
+		var doc streamFeedJSON
+		if err := json.Unmarshal([]byte(raw), &doc); err != nil {
+			return fmt.Errorf("parakeet-cpp: decode stream json: %w", err)
+		}
+		if doc.Text != "" {
+			full.WriteString(doc.Text)
+			results <- &pb.TranscriptStreamResponse{Delta: doc.Text}
+		}
+		seg.add(doc)
+		return nil
+	}
+
+	for off := 0; off < len(data); off += streamChunkSamples {
+		if err := ctx.Err(); err != nil {
+			return status.Error(codes.Canceled, "transcription cancelled")
+		}
+		end := min(off+streamChunkSamples, len(data))
+		chunk := data[off:end]
+		if err := consume(CppStreamFeedJSON(stream, chunk, int32(len(chunk)))); err != nil {
+			return err
+		}
+	}
+	if err := consume(CppStreamFinalizeJSON(stream)); err != nil {
+		return err
+	}
+	seg.flush() // close any trailing utterance that never saw an EOU
+
+	text := strings.TrimSpace(full.String())
+	segments := seg.segments()
+	if len(segments) == 0 && text != "" {
+		segments = append(segments, &pb.TranscriptSegment{Id: 0, Text: text})
+	}
+	results <- &pb.TranscriptStreamResponse{
+		FinalResult: &pb.TranscriptResult{
+			Text:     text,
+			Segments: segments,
+			Duration: duration,
+		},
+	}
+	return nil
+}
+
 // decodeWavMono16k converts any input audio to 16 kHz mono PCM and returns the
 // float samples plus the clip duration in seconds. Mirrors the whisper
 // backend: utils.AudioToWav (ffmpeg) normalises rate/channels, go-audio
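
As a worked example of the segmentation added in goparakeetcpp.go, the sketch below condenses the split rule and the frames-to-seconds conversion into a standalone program. It is an illustration under assumptions rather than the backend code: `word`, `gapSeconds`, `endsSentence`, and `split` are hypothetical simplified names, the separators are hard-coded to `.`, `?`, `!`, and the sample words and the 0.08 s frame length are made-up inputs.

```go
package main

import (
	"fmt"
	"strings"
)

// word is a simplified stand-in for transcriptWord.
type word struct {
	W          string
	Start, End float64
}

// gapSeconds converts a frame-unit threshold into seconds; 0 disables the rule.
func gapSeconds(gapFrames int, frameSec float64) float64 {
	if gapFrames <= 0 || frameSec <= 0 {
		return 0
	}
	return float64(gapFrames) * frameSec
}

// endsSentence reports whether w ends with sentence-ending punctuation.
func endsSentence(w string) bool {
	w = strings.TrimSpace(w)
	for _, sep := range []string{".", "?", "!"} {
		if strings.HasSuffix(w, sep) {
			return true
		}
	}
	return false
}

// split mirrors the rule order in the diff: the gap rule is consulted first
// (only when enabled and the segment already has words); punctuation is
// checked only when the gap rule was not consulted.
func split(words []word, gap float64) [][]word {
	var segs [][]word
	var cur []word
	for i, w := range words {
		gapActive := gap > 0 && len(cur) > 0
		switch {
		case gapActive && w.Start-words[i-1].End >= gap:
			segs = append(segs, cur)
			cur = []word{w} // the gapped word starts the next segment
		case !gapActive && endsSentence(w.W):
			segs = append(segs, append(cur, w)) // the delimiter word is included
			cur = nil
		default:
			cur = append(cur, w)
		}
	}
	if len(cur) > 0 {
		segs = append(segs, cur)
	}
	return segs
}

func main() {
	words := []word{
		{"hello", 0.0, 0.4}, {"world.", 0.4, 0.8},
		{"bye", 1.0, 1.3}, {"now", 3.0, 3.3},
	}
	for _, gapFrames := range []int{0, 10} {
		fmt.Printf("segment_gap_threshold=%d:\n", gapFrames)
		for _, seg := range split(words, gapSeconds(gapFrames, 0.08)) {
			texts := make([]string, len(seg))
			for i, w := range seg {
				texts[i] = w.W
			}
			fmt.Printf("  [%.1f, %.1f] %s\n", seg[0].Start, seg[len(seg)-1].End, strings.Join(texts, " "))
		}
	}
}
```

With the threshold off, the sample splits on punctuation into "hello world." and "bye now". With a threshold of 10 frames (about 0.8 s at the assumed frame length), "world." no longer closes its segment because punctuation is skipped while the gap rule is being consulted, so the result is "hello world. bye" followed by "now", which starts a new segment after the 1.7 s silence.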
--- a/backend/go/parakeet-cpp/goparakeetcpp_test.go
+++ b/backend/go/parakeet-cpp/goparakeetcpp_test.go
@@ -53,6 +53,10 @@ func ensureLibLoaded() {
 		purego.RegisterLibFunc(&CppStreamFeed, lib, "parakeet_capi_stream_feed")
 		purego.RegisterLibFunc(&CppStreamFinalize, lib, "parakeet_capi_stream_finalize")
 		purego.RegisterLibFunc(&CppStreamFree, lib, "parakeet_capi_stream_free")
+		if sym, err := purego.Dlsym(lib, "parakeet_capi_stream_feed_json"); err == nil && sym != 0 {
+			purego.RegisterLibFunc(&CppStreamFeedJSON, lib, "parakeet_capi_stream_feed_json")
+			purego.RegisterLibFunc(&CppStreamFinalizeJSON, lib, "parakeet_capi_stream_finalize_json")
+		}
 		purego.RegisterLibFunc(&CppFreeString, lib, "parakeet_capi_free_string")
 		purego.RegisterLibFunc(&CppLastError, lib, "parakeet_capi_last_error")
 	})
@@ -107,13 +111,22 @@ var _ = Describe("ParakeetCpp", func() {
 			Expect(err).ToNot(HaveOccurred())
 			Expect(strings.TrimSpace(res.Text)).ToNot(BeEmpty(),
 				"expected non-empty transcript for %s", audioPath)
-			Expect(res.Segments).To(HaveLen(1),
-				"synthesises a single whole-clip segment")
-			Expect(res.Segments[0].Text).To(Equal(res.Text),
-				"single segment text must equal the top-level text")
-			// Default (no granularities) is segment-level: no per-word timings.
-			Expect(res.Segments[0].Words).To(BeEmpty(),
-				"word timings are opt-in via timestamp_granularities")
+			// NeMo-faithful segmentation: one or more punctuation-delimited
+			// segments, each with text and a monotonically-advancing time span.
+			Expect(res.Segments).ToNot(BeEmpty(), "expected at least one segment")
+			var prevEnd int64
+			for i, seg := range res.Segments {
+				Expect(strings.TrimSpace(seg.Text)).ToNot(BeEmpty(),
+					"segment %d must have text", i)
+				Expect(seg.End).To(BeNumerically(">=", seg.Start),
+					"segment %d end must not precede its start", i)
+				Expect(seg.Start).To(BeNumerically(">=", prevEnd),
+					"segments must be in time order")
+				prevEnd = seg.End
+				// Default (no granularities) is segment-level: no per-word timings.
+				Expect(seg.Words).To(BeEmpty(),
+					"word timings are opt-in via timestamp_granularities")
+			}
 		})

 		It("emits word-level timestamps when granularity=word", func() {
@@ -129,15 +142,28 @@ var _ = Describe("ParakeetCpp", func() {
 				TimestampGranularities: []string{"word"},
 			})
 			Expect(err).ToNot(HaveOccurred())
-			Expect(res.Segments).To(HaveLen(1))
-			seg := res.Segments[0]
-			Expect(seg.Words).ToNot(BeEmpty(),
-				"expected per-word timestamps with granularity=word")
-			// Monotonic, non-negative timings spanning the segment.
-			Expect(seg.Words[0].Start).To(BeNumerically(">=", int64(0)))
-			Expect(seg.End).To(BeNumerically(">=", seg.Start))
-			Expect(seg.Words[len(seg.Words)-1].End).To(Equal(seg.End),
-				"segment end tracks the last word")
+			Expect(res.Segments).ToNot(BeEmpty())
+			// With word granularity every segment carries its own words, and each
+			// segment's span tracks its first/last word; word starts advance
+			// monotonically across the whole transcript.
+			totalWords := 0
+			var prevStart int64 = -1
+			for i, seg := range res.Segments {
+				Expect(seg.Words).ToNot(BeEmpty(),
+					"segment %d must carry per-word timestamps with granularity=word", i)
+				Expect(seg.Start).To(Equal(seg.Words[0].Start),
+					"segment %d start tracks its first word", i)
+				Expect(seg.End).To(Equal(seg.Words[len(seg.Words)-1].End),
+					"segment %d end tracks its last word", i)
+				for _, w := range seg.Words {
+					Expect(w.End).To(BeNumerically(">=", w.Start))
+					Expect(w.Start).To(BeNumerically(">=", prevStart))
+					prevStart = w.Start
+					totalWords++
+				}
+			}
+			Expect(totalWords).To(BeNumerically(">", 0))
+			Expect(res.Segments[0].Words[0].Start).To(BeNumerically(">=", int64(0)))
 		})
 	})

--- a/backend/go/parakeet-cpp/main.go
+++ b/backend/go/parakeet-cpp/main.go
@@ -65,6 +65,25 @@ func main() {
 		purego.RegisterLibFunc(&CppTranscribePcmBatchJSON, lib, "parakeet_capi_transcribe_pcm_batch_json")
 	}

+	// Per-request language variants (multilingual nemotron). Same probe pattern:
+	// present only in libparakeet.so built with multilingual support, so the
+	// backend still loads against an older library and falls back to the
+	// non-lang batched + streaming entry points (model default / "auto").
+	if sym, err := purego.Dlsym(lib, "parakeet_capi_transcribe_pcm_batch_json_lang"); err == nil && sym != 0 {
+		purego.RegisterLibFunc(&CppTranscribePcmBatchJSONLang, lib, "parakeet_capi_transcribe_pcm_batch_json_lang")
+	}
+	if sym, err := purego.Dlsym(lib, "parakeet_capi_stream_begin_lang"); err == nil && sym != 0 {
+		purego.RegisterLibFunc(&CppStreamBeginLang, lib, "parakeet_capi_stream_begin_lang")
+	}
+
+	// Streaming JSON entry points (ABI v4): surface per-word timestamps on the
+	// streaming path. Same probe pattern; absent in older libparakeet.so, where
+	// the backend falls back to the text-only streaming feed.
+	if sym, err := purego.Dlsym(lib, "parakeet_capi_stream_feed_json"); err == nil && sym != 0 {
+		purego.RegisterLibFunc(&CppStreamFeedJSON, lib, "parakeet_capi_stream_feed_json")
+		purego.RegisterLibFunc(&CppStreamFinalizeJSON, lib, "parakeet_capi_stream_finalize_json")
+	}
+
 	fmt.Fprintf(os.Stderr, "[parakeet-cpp] ABI=%d\n", CppAbiVersion())

 	flag.Parse()
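
The probe-then-fall-back pattern used for the optional entry points can be sketched without loading a real library. In this hedged illustration the purego `Dlsym` probe is replaced by a plain map lookup, and `bind`, `begin`, `streamBegin`, and `streamBeginLang` are hypothetical names. The point is only that an optional function variable left nil signals "symbol absent" and callers branch on it.

```go
package main

import "fmt"

// Function variables standing in for entry points bound at load time. In the
// diff they are bound with purego; here they are plain Go closures.
var (
	streamBegin     func() string            // assumed always present
	streamBeginLang func(lang string) string // nil when the symbol is absent
)

// bind simulates the probe: exported lists the symbols a hypothetical library
// provides. The optional entry point is bound only when it is present.
func bind(exported map[string]bool) {
	streamBegin = func() string { return "stream(default)" }
	streamBeginLang = nil
	if exported["stream_begin_lang"] {
		streamBeginLang = func(lang string) string { return "stream(" + lang + ")" }
	}
}

// begin prefers the language-aware entry point and falls back to the
// language-agnostic one when the optional symbol was not bound.
func begin(lang string) string {
	if streamBeginLang != nil {
		return streamBeginLang(lang)
	}
	return streamBegin()
}

func main() {
	bind(map[string]bool{"stream_begin_lang": true})
	fmt.Println(begin("de")) // newer library: language is honored

	bind(map[string]bool{})
	fmt.Println(begin("de")) // older library: falls back to the model default
}
```

Keeping the fallback at the call site, rather than binding a shim under the optional name, leaves it visible in the calling code which capability is actually in use.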
--- a/backend/go/parakeet-cpp/segments_test.go
+++ b/backend/go/parakeet-cpp/segments_test.go
@@ -0,0 +1,127 @@
+package main
+
+import (
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+func tw(text string, start, end float64) transcriptWord {
+	return transcriptWord{W: text, Start: start, End: end}
+}
+
+var _ = Describe("splitWordsIntoSegments (NeMo get_segment_offsets parity)", func() {
+	seps := []rune{'.', '?', '!'}
+
+	It("splits on sentence-ending punctuation, including the delimiter word", func() {
+		words := []transcriptWord{tw("hello", 0, 0.4), tw("world.", 0.4, 0.8), tw("bye", 1.0, 1.3)}
+		segs := splitWordsIntoSegments(words, seps, 0)
+		Expect(segs).To(HaveLen(2))
+		Expect(segs[0]).To(HaveLen(2))
+		Expect(segs[0][1].W).To(Equal("world."))
+		Expect(segs[1]).To(HaveLen(1))
+		Expect(segs[1][0].W).To(Equal("bye"))
+	})
+
+	It("keeps a single segment with no terminal punctuation and gap off", func() {
+		words := []transcriptWord{tw("a", 0, 0.2), tw("b", 0.2, 0.4), tw("c", 5.0, 5.2)}
+		segs := splitWordsIntoSegments(words, seps, 0)
+		Expect(segs).To(HaveLen(1))
+	})
+
+	It("splits on the gap rule when enabled, the gapped word starting the next segment", func() {
+		words := []transcriptWord{tw("a", 0, 0.2), tw("b", 0.2, 0.4), tw("c", 5.0, 5.2)}
+		segs := splitWordsIntoSegments(words, seps, 1.0) // c is 4.6s after b
+		Expect(segs).To(HaveLen(2))
+		Expect(segs[0]).To(HaveLen(2)) // a b
+		Expect(segs[1]).To(HaveLen(1)) // c
+		Expect(segs[1][0].W).To(Equal("c"))
+	})
+
+	It("checks the gap rule before punctuation (NeMo elif order)", func() {
+		// "b." would terminate, but c is far after it -> gap closes [a b.] at b.
+		words := []transcriptWord{tw("a", 0, 0.2), tw("b.", 0.2, 0.4), tw("c", 9.0, 9.2)}
+		segs := splitWordsIntoSegments(words, seps, 1.0)
+		Expect(segs).To(HaveLen(2))
+		Expect(segs[0]).To(HaveLen(2))
+		Expect(segs[1][0].W).To(Equal("c"))
+	})
+
+	It("still splits on punctuation when the gap rule is enabled but does not fire", func() {
+		words := []transcriptWord{tw("hi.", 0, 0.4), tw("bye", 0.4, 0.8)}
+		segs := splitWordsIntoSegments(words, seps, 5.0) // gap never reached
+		Expect(segs).To(HaveLen(2))
+		Expect(segs[0][0].W).To(Equal("hi."))
+	})
+
+	It("returns nothing for empty input", func() {
+		Expect(splitWordsIntoSegments(nil, seps, 0)).To(BeEmpty())
+	})
+})
+
+var _ = Describe("transcriptResultFromDoc (multi-segment)", func() {
+	doc := transcriptJSON{
+		Text:     "hello world. bye now",
+		FrameSec: 0.08,
+		Words: []transcriptWord{
+			{W: "hello", Start: 0.0, End: 0.4},
+			{W: "world.", Start: 0.4, End: 0.8},
+			{W: "bye", Start: 1.0, End: 1.3},
+			{W: "now", Start: 1.3, End: 1.6},
+		},
+		Tokens: []transcriptToken{{ID: 1, T: 0.1}, {ID: 2, T: 0.5}, {ID: 3, T: 1.1}, {ID: 4, T: 1.4}},
+	}
+
+	It("emits one segment per punctuation-delimited group with start/end", func() {
+		res := transcriptResultFromDoc(doc, &pb.TranscriptRequest{}, 0)
+		Expect(res.Segments).To(HaveLen(2))
+		Expect(res.Segments[0].Text).To(Equal("hello world."))
+		Expect(res.Segments[0].Start).To(Equal(int64(0)))
+		Expect(res.Segments[0].End).To(Equal(secondsToNanos(0.8)))
+		Expect(res.Segments[1].Text).To(Equal("bye now"))
+		Expect(res.Segments[1].Start).To(Equal(secondsToNanos(1.0)))
+		Expect(res.Segments[1].Id).To(Equal(int32(1)))
+	})
+
+	It("assigns tokens to the segment whose time window contains them", func() {
+		res := transcriptResultFromDoc(doc, &pb.TranscriptRequest{}, 0)
+		Expect(res.Segments[0].Tokens).To(Equal([]int32{1, 2}))
+		Expect(res.Segments[1].Tokens).To(Equal([]int32{3, 4}))
+	})
+
+	It("attaches per-segment words only when word granularity requested", func() {
+		plain := transcriptResultFromDoc(doc, &pb.TranscriptRequest{}, 0)
+		Expect(plain.Segments[0].Words).To(BeEmpty())
+		withWords := transcriptResultFromDoc(doc, &pb.TranscriptRequest{TimestampGranularities: []string{"word"}}, 0)
+		Expect(withWords.Segments[0].Words).To(HaveLen(2))
+	})
+
+	It("falls back to a single text segment when there are no words", func() {
+		res := transcriptResultFromDoc(transcriptJSON{Text: "hi"}, &pb.TranscriptRequest{}, 0)
+		Expect(res.Segments).To(HaveLen(1))
+		Expect(res.Segments[0].Text).To(Equal("hi"))
+	})
+})
+
+var _ = Describe("streaming segment assembly", func() {
+	It("closes a segment with start/end from its words on EOU", func() {
+		acc := &streamSegmenter{}
+		acc.add(streamFeedJSON{Text: "hello world", Eou: 1, Words: []transcriptWord{
+			{W: "hello", Start: 0.0, End: 0.4}, {W: "world", Start: 0.4, End: 0.9},
+		}})
+		segs := acc.segments()
+		Expect(segs).To(HaveLen(1))
+		Expect(segs[0].Text).To(Equal("hello world"))
+		Expect(segs[0].Start).To(Equal(int64(0)))
+		Expect(segs[0].End).To(Equal(secondsToNanos(0.9)))
+	})
+
+	It("buffers words across feeds until EOU", func() {
+		acc := &streamSegmenter{}
+		acc.add(streamFeedJSON{Text: "hi", Eou: 0, Words: []transcriptWord{{W: "hi", Start: 0, End: 0.3}}})
+		Expect(acc.segments()).To(BeEmpty())
+		acc.add(streamFeedJSON{Text: "there", Eou: 1, Words: []transcriptWord{{W: "there", Start: 0.3, End: 0.7}}})
+		Expect(acc.segments()).To(HaveLen(1))
+		Expect(acc.segments()[0].Text).To(Equal("hi there"))
+	})
+})
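The specs above pin down the segmentation contract without showing the implementation. The following is a minimal sketch of one splitter that satisfies those cases, assuming the gap rule is tested before the punctuation rule and that a threshold of 0 disables it. The names `word`, `splitWords`, `endsWithAny`, and `text` are hypothetical stand-ins, not the backend's actual `transcriptWord` / `splitWordsIntoSegments` code, which may differ in detail.

```go
package main

import (
	"fmt"
	"strings"
)

// word is a hypothetical stand-in for the backend's transcriptWord.
type word struct {
	W          string
	Start, End float64 // seconds
}

// endsWithAny reports whether the last rune of s is one of seps.
func endsWithAny(s string, seps []rune) bool {
	r := []rune(s)
	if len(r) == 0 {
		return false
	}
	last := r[len(r)-1]
	for _, sep := range seps {
		if last == sep {
			return true
		}
	}
	return false
}

// splitWords groups words into segments (assumed behaviour, see lead-in).
// When gap > 0 and the current segment is non-empty, a word that starts at
// least gap seconds after the previous word ended closes the current segment
// and begins the next one. Otherwise a word ending in a separator closes the
// segment it belongs to, delimiter word included.
func splitWords(words []word, seps []rune, gap float64) [][]word {
	var segs [][]word
	var cur []word
	for i, w := range words {
		// Gap rule first: len(cur) > 0 guarantees i >= 1.
		if gap > 0 && len(cur) > 0 && w.Start-words[i-1].End >= gap {
			segs = append(segs, cur)
			cur = nil
		}
		cur = append(cur, w)
		// Punctuation rule: the delimiter word stays in the closing segment.
		if endsWithAny(w.W, seps) {
			segs = append(segs, cur)
			cur = nil
		}
	}
	if len(cur) > 0 {
		segs = append(segs, cur) // trailing words without a terminator
	}
	return segs
}

// text joins a segment's words with single spaces.
func text(seg []word) string {
	parts := make([]string, len(seg))
	for i, w := range seg {
		parts[i] = w.W
	}
	return strings.Join(parts, " ")
}

func main() {
	seps := []rune{'.', '?', '!'}
	words := []word{{"hello", 0, 0.4}, {"world.", 0.4, 0.8}, {"bye", 1.0, 1.3}}
	for i, seg := range splitWords(words, seps, 0) {
		fmt.Printf("%d [%.1f-%.1f] %s\n", i, seg[0].Start, seg[len(seg)-1].End, text(seg))
	}
}
```

A segment's start and end can then be taken from its first word's start and last word's end, which matches what the `transcriptResultFromDoc` specs assert for `Segments[0].End` and `Segments[1].Start`.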
--- a/backend/go/stablediffusion-ggml/Makefile
+++ b/backend/go/stablediffusion-ggml/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # stablediffusion.cpp (ggml)
 STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
-STABLEDIFFUSION_GGML_VERSION?=1f9ee88e09c258053fa59d5e05e23dfb10fa0b13
+STABLEDIFFUSION_GGML_VERSION?=b9254dda0d10b91ee6f17fb7f4420097dd29824b

 CMAKE_ARGS+=-DGGML_MAX_NAME=128

--- a/backend/go/stablediffusion-ggml/cpp/gosd.cpp
+++ b/backend/go/stablediffusion-ggml/cpp/gosd.cpp
@@ -386,6 +386,7 @@ int load_model(const char *model, char *model_path, char* options[], int threads
    const char *llm_vision_path = "";
    const char *diffusion_model_path = stableDiffusionModel;
    const char *high_noise_diffusion_model_path = "";
+    const char *uncond_diffusion_model_path = "";
    const char *taesd_path  = "";
    const char *control_net_path = "";
    const char *embedding_dir = "";
@@ -472,6 +473,7 @@ int load_model(const char *model, char *model_path, char* options[], int threads
        if (!strcmp(optname, "llm_vision_path")) llm_vision_path = strdup(optval);
        if (!strcmp(optname, "diffusion_model_path")) diffusion_model_path = strdup(optval);
        if (!strcmp(optname, "high_noise_diffusion_model_path")) high_noise_diffusion_model_path = strdup(optval);
+        if (!strcmp(optname, "uncond_diffusion_model_path")) uncond_diffusion_model_path = strdup(optval);
        if (!strcmp(optname, "taesd_path")) taesd_path = strdup(optval);
        if (!strcmp(optname, "control_net_path")) control_net_path = strdup(optval);
        if (!strcmp(optname, "embedding_dir")) {
@@ -571,6 +573,7 @@ int load_model(const char *model, char *model_path, char* options[], int threads
    ctx_params.llm_vision_path = llm_vision_path;
    ctx_params.diffusion_model_path = diffusion_model_path;
    ctx_params.high_noise_diffusion_model_path = high_noise_diffusion_model_path;
+    ctx_params.uncond_diffusion_model_path = uncond_diffusion_model_path;
    ctx_params.vae_path = vae_path;
    ctx_params.audio_vae_path = audio_vae_path;
    ctx_params.embeddings_connectors_path = embeddings_connectors_path;
--- a/backend/go/whisper/Makefile
+++ b/backend/go/whisper/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # whisper.cpp version
 WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
-WHISPER_CPP_VERSION?=99613cb720b65036237d44b52f753b51f75c2797
+WHISPER_CPP_VERSION?=a8ec021f2750a473ff4a8f3883bc9fdf5feafa84
 SO_TARGET?=libgowhisper.so

 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
--- a/backend/python/vllm/requirements-cublas13-after.txt
+++ b/backend/python/vllm/requirements-cublas13-after.txt
@@ -3,5 +3,5 @@
 # on a cu130 host. Pull the cu130-flavoured wheel from vLLM's per-tag index
 # instead — the cublas13 case in install.sh adds --index-strategy=unsafe-best-match
 # so uv consults this index alongside PyPI.
--extra-index-url https://wheels.vllm.ai/0.22.0/cu130
-vllm==0.22.0
+--extra-index-url https://wheels.vllm.ai/0.22.1/cu130
+vllm==0.22.1
--- a/core/config/meta/registry.go
+++ b/core/config/meta/registry.go
@@ -308,34 +308,6 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			},
 			Order: 64,
 		},
-		"pipeline.disable_thinking": {
-			Section:     "pipeline",
-			Label:       "Disable Thinking",
-			Description: "Suppress reasoning/thinking output from the pipeline LLM (sets enable_thinking=false on the underlying model). Use for models that emit <think> blocks you don't want spoken or streamed back to the realtime client.",
-			Component:   "toggle",
-			Order:       65,
-		},
-		"pipeline.streaming.llm": {
-			Section:     "pipeline",
-			Label:       "Stream LLM",
-			Description: "Stream LLM tokens to the realtime client as they are generated instead of waiting for the full response. Emits incremental response.output_audio_transcript.delta / text deltas.",
-			Component:   "toggle",
-			Order:       66,
-		},
-		"pipeline.streaming.tts": {
-			Section:     "pipeline",
-			Label:       "Stream TTS",
-			Description: "Stream synthesized audio chunks to the realtime client as they are produced (requires a TTS backend that implements TTSStream). Falls back to unary synthesis otherwise.",
-			Component:   "toggle",
-			Order:       67,
-		},
-		"pipeline.streaming.transcription": {
-			Section:     "pipeline",
-			Label:       "Stream Transcription",
-			Description: "Stream partial transcription text to the realtime client as the STT backend produces it (requires a transcription backend that implements AudioTranscriptionStream). Falls back to unary transcription otherwise.",
-			Component:   "toggle",
-			Order:       68,
-		},

 		// --- Functions ---
 		"function.grammar.parallel_calls": {
--- a/core/config/model_config.go
+++ b/core/config/model_config.go
@@ -499,16 +499,6 @@ type Pipeline struct {
 	// the pipeline's LLM without editing the LLM model config. Overrides the LLM's
 	// own reasoning_effort. Unset leaves the LLM model config in charge.
 	ReasoningEffort string `yaml:"reasoning_effort,omitempty" json:"reasoning_effort,omitempty"`
-
-	// Streaming opts each pipeline stage into incremental delivery (LLM tokens,
-	// TTS audio chunks, transcription text). Unset stages keep the blocking
-	// unary path, so existing configs are unaffected.
-	Streaming PipelineStreaming `yaml:"streaming,omitempty" json:"streaming,omitempty"`
-
-	// DisableThinking suppresses reasoning/thinking for the pipeline LLM (maps
-	// to enable_thinking=false backend metadata) without editing the underlying
-	// LLM model config. Unset leaves the LLM model config in charge.
-	DisableThinking *bool `yaml:"disable_thinking,omitempty" json:"disable_thinking,omitempty"`
 }

 // ApplyReasoningEffort resolves the effective reasoning effort — a per-request
@@ -540,29 +530,6 @@ func (c *ModelConfig) ApplyReasoningEffort(requestEffort string) {
 	}
 }

-// @Description PipelineStreaming toggles incremental delivery per realtime stage.
-type PipelineStreaming struct {
-	LLM           *bool `yaml:"llm,omitempty" json:"llm,omitempty"`
-	TTS           *bool `yaml:"tts,omitempty" json:"tts,omitempty"`
-	Transcription *bool `yaml:"transcription,omitempty" json:"transcription,omitempty"`
-}
-
-// StreamLLM reports whether LLM tokens should be streamed for this pipeline.
-func (p Pipeline) StreamLLM() bool { return p.Streaming.LLM != nil && *p.Streaming.LLM }
-
-// StreamTTS reports whether TTS audio should be streamed for this pipeline.
-func (p Pipeline) StreamTTS() bool { return p.Streaming.TTS != nil && *p.Streaming.TTS }
-
-// StreamTranscription reports whether transcription text should be streamed.
-func (p Pipeline) StreamTranscription() bool {
-	return p.Streaming.Transcription != nil && *p.Streaming.Transcription
-}
-
-// ThinkingDisabled reports whether the pipeline forces the LLM's thinking off.
-func (p Pipeline) ThinkingDisabled() bool {
-	return p.DisableThinking != nil && *p.DisableThinking
-}
-
 // @Description File configuration for model downloads
 type File struct {
 	Filename string         `yaml:"filename,omitempty" json:"filename,omitempty"`
--- a/core/config/pipeline_streaming_test.go
+++ b/core/config/pipeline_streaming_test.go
@@ -1,54 +0,0 @@
-package config
-
-import (
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-	"gopkg.in/yaml.v3"
-)
-
-// The realtime pipeline can stream each stage (LLM tokens, TTS audio,
-// transcription text) and can disable model "thinking" for the LLM. These are
-// opt-in per pipeline; everything defaults to off so existing configs keep the
-// unary behaviour.
-var _ = Describe("Pipeline streaming config", func() {
-	It("defaults every streaming + thinking helper to false when unset", func() {
-		var p Pipeline
-		Expect(p.StreamLLM()).To(BeFalse())
-		Expect(p.StreamTTS()).To(BeFalse())
-		Expect(p.StreamTranscription()).To(BeFalse())
-		Expect(p.ThinkingDisabled()).To(BeFalse())
-	})
-
-	It("parses the nested streaming block and disable_thinking from YAML", func() {
-		var c ModelConfig
-		err := yaml.Unmarshal([]byte(`
-name: gpt-realtime
-pipeline:
-  llm: my-llm
-  tts: my-tts
-  transcription: my-stt
-  streaming:
-    llm: true
-    tts: true
-    transcription: true
-  disable_thinking: true
-`), &c)
-		Expect(err).ToNot(HaveOccurred())
-		Expect(c.Pipeline.StreamLLM()).To(BeTrue())
-		Expect(c.Pipeline.StreamTTS()).To(BeTrue())
-		Expect(c.Pipeline.StreamTranscription()).To(BeTrue())
-		Expect(c.Pipeline.ThinkingDisabled()).To(BeTrue())
-	})
-
-	It("treats an explicit false in the streaming block as disabled", func() {
-		var c ModelConfig
-		err := yaml.Unmarshal([]byte(`
-name: gpt-realtime
-pipeline:
-  streaming:
-    tts: false
-`), &c)
-		Expect(err).ToNot(HaveOccurred())
-		Expect(c.Pipeline.StreamTTS()).To(BeFalse())
-	})
-})
--- a/core/http/endpoints/openai/realtime.go
+++ b/core/http/endpoints/openai/realtime.go
@@ -235,12 +235,6 @@ type Model interface {
 	Transcribe(ctx context.Context, audio, language string, translate bool, diarize bool, prompt string) (*schema.TranscriptionResult, error)
 	Predict(ctx context.Context, messages schema.Messages, images, videos, audios []string, tokenCallback func(string, backend.TokenUsage) bool, tools []types.ToolUnion, toolChoice *types.ToolChoiceUnion, logprobs *int, topLogprobs *int, logitBias map[string]float64) (func() (backend.LLMResponse, error), error)
 	TTS(ctx context.Context, text, voice, language string) (string, *proto.Result, error)
-	// TTSStream synthesizes speech incrementally, invoking onAudio with raw PCM
-	// chunks (and the backend sample rate) as they are produced.
-	TTSStream(ctx context.Context, text, voice, language string, onAudio func(pcm []byte, sampleRate int) error) error
-	// TranscribeStream transcribes audio incrementally, invoking onDelta for each
-	// transcript text fragment and returning the final aggregated result.
-	TranscribeStream(ctx context.Context, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error)
 	PredictConfig() *config.ModelConfig
 }

@@ -1260,15 +1254,27 @@ func commitUtterance(ctx context.Context, utt []byte, session *Session, conv *Co
 	// TODO: If we have a real any-to-any model then transcription is optional
 	var transcript string
 	if session.InputAudioTranscription != nil {
-		// emitTranscription streams transcript deltas when
-		// pipeline.streaming.transcription is set, otherwise emits a single
-		// completed event; either way it returns the final transcript text.
-		var err error
-		transcript, err = emitTranscription(ctx, t, session, generateItemID(), f.Name())
+		tr, err := session.ModelInterface.Transcribe(ctx, f.Name(), session.InputAudioTranscription.Language, false, false, session.InputAudioTranscription.Prompt)
 		if err != nil {
 			sendError(t, "transcription_failed", err.Error(), "", "event_TODO")
 			return
+		} else if tr == nil {
+			sendError(t, "transcription_failed", "transcribe result is nil", "", "event_TODO")
+			return
 		}
+
+		transcript = tr.Text
+		sendEvent(t, types.ConversationItemInputAudioTranscriptionCompletedEvent{
+			ServerEventBase: types.ServerEventBase{
+				EventID: "event_TODO",
+			},
+
+			ItemID: generateItemID(),
+			// ResponseID:   "resp_TODO", // Not needed for transcription completed event
+			// OutputIndex:  0,
+			ContentIndex: 0,
+			Transcript:   transcript,
+		})
 	} else {
 		sendNotImplemented(t, "any-to-any models")
 		return
@@ -1496,26 +1502,6 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
 		},
 	})

-	// Streamed LLM path: when the pipeline opts into LLM streaming, stream the
-	// transcript to the client as it is generated and synthesize the buffered
-	// message once. Tool turns are supported only when the model uses its
-	// tokenizer template: the C++ autoparser then delivers content and tool
-	// calls via ChatDeltas (clearing the text stream), so the spoken transcript
-	// never leaks tool-call tokens. Grammar-based function calling emits the
-	// call as JSON in the token stream, so those turns keep the buffered path.
-	if config != nil && session.ModelConfig != nil && session.ModelConfig.Pipeline.StreamLLM() {
-		canStream := len(tools) == 0 || config.TemplateConfig.UseTokenizerTemplate
-		var respMods []types.Modality
-		if overrides != nil {
-			respMods = overrides.OutputModalities
-		}
-		if canStream && modalitiesContainAudio(resolveOutputModalities(session.OutputModalities, respMods)) {
-			if streamLLMResponse(ctx, session, conv, t, responseID, conversationHistory, images, config, tools, toolChoice, toolTurn) {
-				return
-			}
-		}
-	}
-
 	predFunc, err := session.ModelInterface.Predict(ctx, conversationHistory, images, nil, nil, nil, tools, toolChoice, nil, nil, nil)
 	if err != nil {
 		sendError(t, "inference_failed", fmt.Sprintf("backend error: %v", err), "", "") // item.Assistant.ID is unknown here
@@ -1593,7 +1579,7 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
 		// ExtractReasoningWithConfig is a no-op when no tag pair matches,
 		// so it's safe to apply unconditionally in the no-reasoning branch.
 		if deltaReasoning == "" && deltaContent != "" {
-			deltaReasoning, deltaContent = reasoning.ExtractReasoningWithConfig(deltaContent, thinkingStartToken, spokenReasoningConfig(config.ReasoningConfig))
+			deltaReasoning, deltaContent = reasoning.ExtractReasoningWithConfig(deltaContent, thinkingStartToken, config.ReasoningConfig)
 		}
 		reasoningText = deltaReasoning
 		responseWithoutReasoning = deltaContent
@@ -1601,7 +1587,7 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
 		cleanedResponse = deltaContent
 		toolCalls = deltaToolCalls
 	} else {
-		reasoningText, responseWithoutReasoning = reasoning.ExtractReasoningWithConfig(rawResponse, thinkingStartToken, spokenReasoningConfig(config.ReasoningConfig))
+		reasoningText, responseWithoutReasoning = reasoning.ExtractReasoningWithConfig(rawResponse, thinkingStartToken, config.ReasoningConfig)
 		textContent = functions.ParseTextContent(responseWithoutReasoning, config.FunctionsConfig)
 		cleanedResponse = functions.CleanupLLMResult(responseWithoutReasoning, config.FunctionsConfig)
 		toolCalls = functions.ParseFunctionCall(cleanedResponse, config.FunctionsConfig)
@@ -1727,7 +1713,64 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
 				return
 			}

-			// Transcript of the spoken reply (the audio's text).
+			audioFilePath, res, err := session.ModelInterface.TTS(ctx, finalSpeech, session.Voice, session.InputAudioTranscription.Language)
+			if err != nil {
+				if ctx.Err() != nil {
+					xlog.Debug("TTS cancelled (barge-in)")
+					sendCancelledResponse()
+					return
+				}
+				xlog.Error("TTS failed", "error", err)
+				sendError(t, "tts_error", fmt.Sprintf("TTS generation failed: %v", err), "", item.Assistant.ID)
+				return
+			}
+			if !res.Success {
+				xlog.Error("TTS failed", "message", res.Message)
+				sendError(t, "tts_error", fmt.Sprintf("TTS generation failed: %s", res.Message), "", item.Assistant.ID)
+				return
+			}
+			defer func() { _ = os.Remove(audioFilePath) }()
+
+			audioBytes, err := os.ReadFile(audioFilePath)
+			if err != nil {
+				xlog.Error("failed to read TTS file", "error", err)
+				sendError(t, "tts_error", fmt.Sprintf("Failed to read TTS audio: %v", err), "", item.Assistant.ID)
+				return
+			}
+
+			// Parse WAV header to get raw PCM and the actual sample rate from the TTS backend.
+			pcmData, ttsSampleRate := laudio.ParseWAV(audioBytes)
+			if ttsSampleRate == 0 {
+				ttsSampleRate = localSampleRate
+			}
+			xlog.Debug("TTS audio parsed", "raw_bytes", len(audioBytes), "pcm_bytes", len(pcmData), "sample_rate", ttsSampleRate)
+
+			// SendAudio (WebRTC) passes PCM at the TTS sample rate directly to the
+			// Opus encoder, which resamples to 48kHz internally. This avoids a
+			// lossy intermediate resample through 16kHz.
+			// XXX: This is a noop in websocket mode; it's included in the JSON instead
+			if err := t.SendAudio(ctx, pcmData, ttsSampleRate); err != nil {
+				if ctx.Err() != nil {
+					xlog.Debug("Audio playback cancelled (barge-in)")
+					sendCancelledResponse()
+					return
+				}
+				xlog.Error("failed to send audio via transport", "error", err)
+			}
+
+			// For WebSocket clients, resample to the session's output rate and
+			// deliver audio as base64 in JSON events. WebRTC clients already
+			// received audio over the RTP track, so skip the base64 payload.
+			if !isWebRTC {
+				wsPCM := pcmData
+				if ttsSampleRate != session.OutputSampleRate {
+					samples := sound.BytesToInt16sLE(pcmData)
+					resampled := sound.ResampleInt16(samples, ttsSampleRate, session.OutputSampleRate)
+					wsPCM = sound.Int16toBytesLE(resampled)
+				}
+				audioString = base64.StdEncoding.EncodeToString(wsPCM)
+			}
+
 			sendEvent(t, types.ResponseOutputAudioTranscriptDeltaEvent{
 				ServerEventBase: types.ServerEventBase{},
 				ResponseID:      responseID,
@@ -1745,26 +1788,15 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
 				Transcript:      finalSpeech,
 			})

-			// Synthesize and send the audio. With pipeline.streaming.tts enabled
-			// emitSpeech forwards a response.output_audio.delta per backend PCM
-			// chunk as it's produced; otherwise it sends the whole utterance as a
-			// single delta. The returned PCM is stored (base64) on the item below.
-			pcmAudio, err := emitSpeech(ctx, t, session, responseID, item.Assistant.ID, finalSpeech)
-			if err != nil {
-				if ctx.Err() != nil {
-					xlog.Debug("TTS cancelled (barge-in)")
-					sendCancelledResponse()
-					return
-				}
-				xlog.Error("TTS failed", "error", err)
-				sendError(t, "tts_error", fmt.Sprintf("TTS generation failed: %v", err), "", item.Assistant.ID)
-				return
-			}
-			if !isWebRTC {
-				audioString = base64.StdEncoding.EncodeToString(pcmAudio)
-			}
-
 			if !isWebRTC {
+				sendEvent(t, types.ResponseOutputAudioDeltaEvent{
+					ServerEventBase: types.ServerEventBase{},
+					ResponseID:      responseID,
+					ItemID:          item.Assistant.ID,
+					OutputIndex:     0,
+					ContentIndex:    0,
+					Delta:           audioString,
+				})
 				sendEvent(t, types.ResponseOutputAudioDoneEvent{
 					ServerEventBase: types.ServerEventBase{},
 					ResponseID:      responseID,
@@ -1817,27 +1849,17 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
 		})
 	}

-	// Emit the parsed tool calls, the terminal response.done, and (for
-	// server-side assistant tools) the follow-up response. Shared with the
-	// streamed path so both finalize tool calls identically.
-	emitToolCallItems(ctx, session, conv, t, responseID, finalToolCalls, finalSpeech != "", toolTurn)
-}
-
-// emitToolCallItems emits the realtime function_call items for the parsed tool
-// calls, the terminal response.done, and — for server-side LocalAI Assistant
-// tools — re-triggers a follow-up response so the model can speak the result.
-// hasContent shifts the tool-call output index past the assistant content item
-// when the same turn also produced spoken/text content. Two tool paths:
-//   - LocalAI Assistant tools (session.AssistantExecutor.IsTool) run server-side;
-//     we append both the call and its output to conv.Items and re-trigger. The
-//     client only sees observability events.
-//   - All other tools follow the standard OpenAI flow: emit
-//     function_call_arguments.done and wait for the client to send
-//     conversation.item.create back.
-func emitToolCallItems(ctx context.Context, session *Session, conv *Conversation, t Transport, responseID string, toolCalls []functions.FuncCallResults, hasContent bool, toolTurn int) {
-	xlog.Debug("About to handle tool calls", "finalToolCallsCount", len(toolCalls))
+	// Handle Tool Calls. Two paths:
+	//   - LocalAI Assistant tools (session.AssistantExecutor.IsTool) run
+	//     server-side; we append both the call and its output to conv.Items
+	//     and re-trigger a follow-up response so the model can speak the
+	//     result. The client only sees observability events.
+	//   - All other tools follow the standard OpenAI flow: emit
+	//     function_call_arguments.done and wait for the client to send
+	//     conversation.item.create back.
+	xlog.Debug("About to handle tool calls", "finalToolCallsCount", len(finalToolCalls))
 	executedAssistantTool := false
-	for i, tc := range toolCalls {
+	for i, tc := range finalToolCalls {
 		toolCallID := generateItemID()
 		callID := "call_" + generateUniqueID() // OpenAI uses call_xyz

@@ -1857,7 +1879,7 @@ func emitToolCallItems(ctx context.Context, session *Session, conv *Conversation
 		conv.Lock.Unlock()

 		outputIndex := i
-		if hasContent {
+		if finalSpeech != "" {
 			outputIndex++
 		}

--- a/core/http/endpoints/openai/realtime_doubles_test.go
+++ b/core/http/endpoints/openai/realtime_doubles_test.go
@@ -1,138 +0,0 @@
-package openai
-
-import (
-	"context"
-	"strings"
-
-	"github.com/mudler/LocalAI/core/backend"
-	"github.com/mudler/LocalAI/core/config"
-	"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
-	"github.com/mudler/LocalAI/core/schema"
-	"github.com/mudler/LocalAI/pkg/grpc/proto"
-)
-
-// fakeTransport records the server events and audio sent to a realtime client
-// so streaming behaviour can be asserted without a real WebSocket/WebRTC peer.
-// It is not a *WebRTCTransport, so handler code takes the WebSocket path.
-type fakeTransport struct {
-	events []types.ServerEvent
-	audio  []fakeAudioChunk
-}
-
-type fakeAudioChunk struct {
-	pcm        []byte
-	sampleRate int
-}
-
-func (f *fakeTransport) SendEvent(e types.ServerEvent) error {
-	f.events = append(f.events, e)
-	return nil
-}
-
-func (f *fakeTransport) ReadEvent() ([]byte, error) { return nil, nil }
-
-func (f *fakeTransport) SendAudio(_ context.Context, pcm []byte, sampleRate int) error {
-	f.audio = append(f.audio, fakeAudioChunk{pcm: pcm, sampleRate: sampleRate})
-	return nil
-}
-
-func (f *fakeTransport) Close() error { return nil }
-
-// countEvents returns how many recorded events have the given type.
-func (f *fakeTransport) countEvents(et types.ServerEventType) int {
-	n := 0
-	for _, e := range f.events {
-		if e.ServerEventType() == et {
-			n++
-		}
-	}
-	return n
-}
-
-// transcriptDeltaText concatenates the Delta of every recorded transcript
-// delta event — i.e. the text streamed to the client as it is generated.
-func (f *fakeTransport) transcriptDeltaText() string {
-	var b strings.Builder
-	for _, e := range f.events {
-		if d, ok := e.(types.ResponseOutputAudioTranscriptDeltaEvent); ok {
-			b.WriteString(d.Delta)
-		}
-	}
-	return b.String()
-}
-
-// fakeModel is a configurable Model double. TTSStream replays ttsStreamChunks
-// and TranscribeStream replays transcribeDeltas, so the handler's streaming
-// paths can be driven deterministically.
-type fakeModel struct {
-	cfg *config.ModelConfig
-
-	ttsFile         string
-	ttsStreamChunks [][]byte
-	ttsStreamRate   int
-	ttsStreamErr    error
-
-	transcribeDeltas []string
-	transcribeFinal  *schema.TranscriptionResult
-
-	// Predict streaming: predictTokens are replayed through the token callback
-	// (simulating streamed LLM output); predictResp/predictErr are returned by
-	// the deferred predict function. predictChunkDeltas, when set, are delivered
-	// per-token via TokenUsage.ChatDeltas to exercise the autoparser path.
-	predictTokens      []string
-	predictChunkDeltas [][]*proto.ChatDelta
-	predictResp        backend.LLMResponse
-	predictErr         error
-}
-
-func (m *fakeModel) VAD(context.Context, *schema.VADRequest) (*schema.VADResponse, error) {
-	return nil, nil
-}
-
-func (m *fakeModel) Transcribe(context.Context, string, string, bool, bool, string) (*schema.TranscriptionResult, error) {
-	return m.transcribeFinal, nil
-}
-
-func (m *fakeModel) Predict(_ context.Context, _ schema.Messages, _, _, _ []string, cb func(string, backend.TokenUsage) bool, _ []types.ToolUnion, _ *types.ToolChoiceUnion, _, _ *int, _ map[string]float64) (func() (backend.LLMResponse, error), error) {
-	if m.predictErr != nil {
-		return nil, m.predictErr
-	}
-	return func() (backend.LLMResponse, error) {
-		for i, tok := range m.predictTokens {
-			if cb == nil {
-				continue
-			}
-			usage := backend.TokenUsage{}
-			if i < len(m.predictChunkDeltas) {
-				usage.ChatDeltas = m.predictChunkDeltas[i]
-			}
-			cb(tok, usage)
-		}
-		return m.predictResp, nil
-	}, nil
-}
-
-func (m *fakeModel) TTS(context.Context, string, string, string) (string, *proto.Result, error) {
-	return m.ttsFile, &proto.Result{Success: true}, nil
-}
-
-func (m *fakeModel) TTSStream(_ context.Context, _, _, _ string, onAudio func(pcm []byte, sampleRate int) error) error {
-	if m.ttsStreamErr != nil {
-		return m.ttsStreamErr
-	}
-	for _, c := range m.ttsStreamChunks {
-		if err := onAudio(c, m.ttsStreamRate); err != nil {
-			return err
-		}
-	}
-	return nil
-}
-
-func (m *fakeModel) TranscribeStream(_ context.Context, _, _ string, _, _ bool, _ string, onDelta func(text string)) (*schema.TranscriptionResult, error) {
-	for _, d := range m.transcribeDeltas {
-		onDelta(d)
-	}
-	return m.transcribeFinal, nil
-}
-
-func (m *fakeModel) PredictConfig() *config.ModelConfig { return m.cfg }
--- a/core/http/endpoints/openai/realtime_model.go
+++ b/core/http/endpoints/openai/realtime_model.go
@@ -3,7 +3,6 @@ package openai
 import (
 	"context"
 	"crypto/rand"
-	"encoding/binary"
 	"encoding/hex"
 	"encoding/json"
 	"fmt"
@@ -88,14 +87,6 @@ func (m *transcriptOnlyModel) TTS(ctx context.Context, text, voice, language str
 	return "", nil, fmt.Errorf("TTS not supported in transcript-only mode")
 }

-func (m *transcriptOnlyModel) TTSStream(ctx context.Context, text, voice, language string, onAudio func(pcm []byte, sampleRate int) error) error {
-	return fmt.Errorf("TTS not supported in transcript-only mode")
-}
-
-func (m *transcriptOnlyModel) TranscribeStream(ctx context.Context, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error) {
-	return transcribeStream(ctx, m.modelLoader, *m.TranscriptionConfig, m.appConfig, audio, language, translate, diarize, prompt, onDelta)
-}
-
 func (m *transcriptOnlyModel) PredictConfig() *config.ModelConfig {
 	return nil
 }
@@ -330,75 +321,10 @@ func (m *wrappedModel) TTS(ctx context.Context, text, voice, language string) (s
 	return backend.ModelTTS(ctx, text, voice, language, "", nil, m.modelLoader, m.appConfig, *m.TTSConfig)
 }

-func (m *wrappedModel) TTSStream(ctx context.Context, text, voice, language string, onAudio func(pcm []byte, sampleRate int) error) error {
-	return ttsStream(ctx, m.modelLoader, m.appConfig, *m.TTSConfig, text, voice, language, onAudio)
-}
-
-func (m *wrappedModel) TranscribeStream(ctx context.Context, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error) {
-	return transcribeStream(ctx, m.modelLoader, *m.TranscriptionConfig, m.appConfig, audio, language, translate, diarize, prompt, onDelta)
-}
-
 func (m *wrappedModel) PredictConfig() *config.ModelConfig {
 	return m.LLMConfig
 }

-// wavStreamHeaderBytes is the size of the WAV header that backend.ModelTTSStream
-// emits as its first audio callback; the sample rate lives at byte offset 24.
-const wavStreamHeaderBytes = 44
-
-// ttsStream adapts backend.ModelTTSStream (which emits a WAV stream: a 44-byte
-// header carrying the sample rate, then raw PCM) to the realtime onAudio
-// callback, which wants raw PCM plus the sample rate. The header is buffered
-// until complete, the sample rate is read from it, and subsequent bytes are
-// forwarded as PCM.
-func ttsStream(ctx context.Context, ml *model.ModelLoader, appConfig *config.ApplicationConfig, ttsConfig config.ModelConfig, text, voice, language string, onAudio func(pcm []byte, sampleRate int) error) error {
-	var header []byte
-	headerDone := false
-	sampleRate := 0
-	return backend.ModelTTSStream(ctx, text, voice, language, "", nil, ml, appConfig, ttsConfig, func(b []byte) error {
-		if headerDone {
-			if len(b) == 0 {
-				return nil
-			}
-			return onAudio(b, sampleRate)
-		}
-		header = append(header, b...)
-		if len(header) < wavStreamHeaderBytes {
-			return nil
-		}
-		sampleRate = int(binary.LittleEndian.Uint32(header[24:28]))
-		headerDone = true
-		if len(header) > wavStreamHeaderBytes {
-			return onAudio(header[wavStreamHeaderBytes:], sampleRate)
-		}
-		return nil
-	})
-}
-
-// transcribeStream adapts backend.ModelTranscriptionStream to the realtime
-// onDelta callback, returning the final aggregated transcription result.
-func transcribeStream(ctx context.Context, ml *model.ModelLoader, transcriptionConfig config.ModelConfig, appConfig *config.ApplicationConfig, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error) {
-	var final *schema.TranscriptionResult
-	err := backend.ModelTranscriptionStream(ctx, backend.TranscriptionRequest{
-		Audio:     audio,
-		Language:  language,
-		Translate: translate,
-		Diarize:   diarize,
-		Prompt:    prompt,
-	}, ml, transcriptionConfig, appConfig, func(chunk backend.TranscriptionStreamChunk) {
-		if chunk.Delta != "" {
-			onDelta(chunk.Delta)
-		}
-		if chunk.Final != nil {
-			final = chunk.Final
-		}
-	})
-	if err != nil {
-		return nil, err
-	}
-	return final, nil
-}
-
 func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) (Model, *config.ModelConfig, error) {
 	cfgVAD, err := cl.LoadModelConfigFileByName(pipeline.VAD, ml.ModelPath)
 	if err != nil {
@@ -528,10 +454,8 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
 		return nil, fmt.Errorf("failed to validate config: %w", err)
 	}

-	// Let the pipeline set the LLM's reasoning effort and force thinking off
-	// (cfgLLM is a per-session copy). disable_thinking applies after the effort.
+	// Let the pipeline set the LLM's reasoning effort (cfgLLM is a per-session copy).
 	applyPipelineReasoning(cfgLLM, *pipeline)
-	applyPipelineThinking(cfgLLM, *pipeline)

 	cfgTTS, err := cl.LoadModelConfigFileByName(pipeline.TTS, ml.ModelPath)
 	if err != nil {
--- a/core/http/endpoints/openai/realtime_speech.go
+++ b/core/http/endpoints/openai/realtime_speech.go
@@ -1,102 +0,0 @@
-package openai
-
-import (
-	"context"
-	"encoding/base64"
-	"fmt"
-	"os"
-	"path/filepath"
-
-	"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
-	laudio "github.com/mudler/LocalAI/pkg/audio"
-	"github.com/mudler/LocalAI/pkg/sound"
-)
-
-// emitSpeech synthesizes text and sends the audio to the client. When the
-// pipeline opts into TTS streaming it forwards each PCM chunk as its own
-// response.output_audio.delta as soon as the backend produces it; otherwise it
-// synthesizes the whole utterance and sends it as a single delta.
-//
-// It deliberately does NOT emit transcript or audio-done events: the caller owns
-// those so a streamed reply can be split into several spoken segments that share
-// one response/item.
-//
-// It returns the PCM audio (at the session output rate) accumulated across all
-// chunks, which the caller base64-encodes onto the conversation item. For WebRTC
-// the audio goes over the RTP track instead, so the returned slice is empty.
-func emitSpeech(ctx context.Context, t Transport, session *Session, responseID, itemID, text string) ([]byte, error) {
-	if text == "" {
-		return nil, nil
-	}
-
-	_, isWebRTC := t.(*WebRTCTransport)
-
-	var wsAudio []byte // PCM at the session output rate, accumulated for the item record
-
-	// sendChunk hands one PCM buffer to the transport: WebRTC consumes the raw
-	// PCM directly (it resamples internally); WebSocket gets base64 PCM at the
-	// session output rate via a JSON delta event.
-	sendChunk := func(pcm []byte, sampleRate int) error {
-		if len(pcm) == 0 {
-			return nil
-		}
-		if err := t.SendAudio(ctx, pcm, sampleRate); err != nil {
-			return err
-		}
-		if isWebRTC {
-			return nil
-		}
-		wsPCM := pcm
-		if sampleRate != 0 && sampleRate != session.OutputSampleRate {
-			samples := sound.BytesToInt16sLE(pcm)
-			resampled := sound.ResampleInt16(samples, sampleRate, session.OutputSampleRate)
-			wsPCM = sound.Int16toBytesLE(resampled)
-		}
-		wsAudio = append(wsAudio, wsPCM...)
-		return t.SendEvent(types.ResponseOutputAudioDeltaEvent{
-			ServerEventBase: types.ServerEventBase{},
-			ResponseID:      responseID,
-			ItemID:          itemID,
-			OutputIndex:     0,
-			ContentIndex:    0,
-			Delta:           base64.StdEncoding.EncodeToString(wsPCM),
-		})
-	}
-
-	language := ""
-	if session.InputAudioTranscription != nil {
-		language = session.InputAudioTranscription.Language
-	}
-
-	if session.ModelConfig != nil && session.ModelConfig.Pipeline.StreamTTS() {
-		if err := session.ModelInterface.TTSStream(ctx, text, session.Voice, language, sendChunk); err != nil {
-			return nil, err
-		}
-		return wsAudio, nil
-	}
-
-	// Unary fallback: synthesize the whole utterance to a file, then emit once.
-	audioFilePath, res, err := session.ModelInterface.TTS(ctx, text, session.Voice, language)
-	if err != nil {
-		return nil, err
-	}
-	if res != nil && !res.Success {
-		return nil, fmt.Errorf("tts generation failed: %s", res.Message)
-	}
-	defer func() { _ = os.Remove(audioFilePath) }()
-
-	// filepath.Clean normalizes the backend-produced temp path before reading
-	// (also keeps gosec G304 quiet — the path is backend-controlled, not user input).
-	audioBytes, err := os.ReadFile(filepath.Clean(audioFilePath))
-	if err != nil {
-		return nil, fmt.Errorf("read tts audio: %w", err)
-	}
-	pcm, sampleRate := laudio.ParseWAV(audioBytes)
-	if sampleRate == 0 {
-		sampleRate = session.OutputSampleRate
-	}
-	if err := sendChunk(pcm, sampleRate); err != nil {
-		return nil, err
-	}
-	return wsAudio, nil
-}
--- a/core/http/endpoints/openai/realtime_speech_test.go
+++ b/core/http/endpoints/openai/realtime_speech_test.go
@@ -1,70 +0,0 @@
-package openai
-
-import (
-	"context"
-	"os"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-
-	"github.com/mudler/LocalAI/core/config"
-	"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
-	laudio "github.com/mudler/LocalAI/pkg/audio"
-)
-
-// emitSpeech synthesizes a piece of text and forwards the audio to the client,
-// streaming a delta per TTS chunk when the pipeline opts in, or sending the
-// whole utterance as one delta otherwise.
-var _ = Describe("emitSpeech", func() {
-	ttsOn := true
-
-	streamingSession := func(m Model) *Session {
-		return &Session{
-			OutputSampleRate: 24000,
-			ModelInterface:   m,
-			ModelConfig: &config.ModelConfig{
-				Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{TTS: &ttsOn}},
-			},
-		}
-	}
-
-	It("streams one output_audio.delta per TTS chunk when streaming is enabled", func() {
-		m := &fakeModel{
-			ttsStreamChunks: [][]byte{{1, 2}, {3, 4}, {5, 6}},
-			ttsStreamRate:   24000,
-		}
-		t := &fakeTransport{}
-
-		audio, err := emitSpeech(context.Background(), t, streamingSession(m), "resp1", "item1", "Hello there.")
-
-		Expect(err).ToNot(HaveOccurred())
-		Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioDelta)).To(Equal(3))
-		// The returned audio is all chunks concatenated (session output rate).
-		Expect(audio).To(Equal([]byte{1, 2, 3, 4, 5, 6}))
-	})
-
-	It("sends a single output_audio.delta in unary mode", func() {
-		// A minimal real WAV file for the unary TTS path to read + parse.
-		f, err := os.CreateTemp("", "emit-*.wav")
-		Expect(err).ToNot(HaveOccurred())
-		defer func() { _ = os.Remove(f.Name()) }()
-		pcm := make([]byte, 320) // 160 samples of silence
-		hdr := laudio.NewWAVHeader(uint32(len(pcm)))
-		Expect(hdr.Write(f)).To(Succeed())
-		_, err = f.Write(pcm)
-		Expect(err).ToNot(HaveOccurred())
-		Expect(f.Close()).To(Succeed())
-
-		session := &Session{
-			OutputSampleRate: 24000,
-			ModelInterface:   &fakeModel{ttsFile: f.Name()},
-			ModelConfig:      &config.ModelConfig{}, // streaming off
-		}
-		t := &fakeTransport{}
-
-		_, err = emitSpeech(context.Background(), t, session, "resp1", "item1", "Hello there.")
-
-		Expect(err).ToNot(HaveOccurred())
-		Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioDelta)).To(Equal(1))
-	})
-})
--- a/core/http/endpoints/openai/realtime_stream.go
+++ b/core/http/endpoints/openai/realtime_stream.go
@@ -1,253 +0,0 @@
-package openai
-
-import (
-	"context"
-	"encoding/base64"
-	"fmt"
-
-	"github.com/mudler/LocalAI/core/backend"
-	"github.com/mudler/LocalAI/core/config"
-	"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
-	"github.com/mudler/LocalAI/core/schema"
-	"github.com/mudler/LocalAI/pkg/functions"
-	"github.com/mudler/LocalAI/pkg/reasoning"
-)
-
-// transcriptStreamer turns streamed LLM tokens into the assistant's spoken
-// transcript: it strips reasoning incrementally and sends one
-// response.output_audio_transcript.delta per content fragment. It does NOT
-// synthesize audio — the caller buffers the full message and synthesizes it
-// once (streaming the audio chunks when the TTS backend supports TTSStream),
-// which works uniformly for streaming and non-streaming TTS and for languages
-// without sentence or word boundaries.
-type transcriptStreamer struct {
-	ctx        context.Context
-	t          Transport
-	responseID string
-	itemID     string
-	extractor  *reasoning.ReasoningExtractor
-
-	// announce, if set, is invoked once just before the first transcript delta.
-	// It lets the caller create the assistant item lazily, so a content-less
-	// tool-call turn never emits a spurious empty assistant item.
-	announce  func()
-	announced bool
-}
-
-func newTranscriptStreamer(ctx context.Context, t Transport, responseID, itemID, thinkingStartToken string, reasoningCfg reasoning.Config) *transcriptStreamer {
-	return &transcriptStreamer{
-		ctx:        ctx,
-		t:          t,
-		responseID: responseID,
-		itemID:     itemID,
-		extractor:  reasoning.NewReasoningExtractor(thinkingStartToken, spokenReasoningConfig(reasoningCfg)),
-	}
-}
-
-// onToken handles one streamed unit of model output, sending a transcript delta
-// for the new content (reasoning stripped). For plain-content models the unit is
-// the raw text token; for autoparser tool turns the backend clears the text and
-// delivers content via ChatDeltas, so the caller passes that content here.
-func (s *transcriptStreamer) onToken(token string) {
-	_, content := s.extractor.ProcessToken(token)
-	if content == "" {
-		return
-	}
-	if !s.announced {
-		s.announced = true
-		if s.announce != nil {
-			s.announce()
-		}
-	}
-	_ = s.t.SendEvent(types.ResponseOutputAudioTranscriptDeltaEvent{
-		ServerEventBase: types.ServerEventBase{},
-		ResponseID:      s.responseID,
-		ItemID:          s.itemID,
-		OutputIndex:     0,
-		ContentIndex:    0,
-		Delta:           content,
-	})
-}
-
-// content returns the full transcript so far with reasoning stripped.
-func (s *transcriptStreamer) content() string {
-	return s.extractor.CleanedContent()
-}
-
-// streamLLMResponse drives a streamed realtime reply. It streams the assistant
-// transcript as the LLM generates, then synthesizes the whole buffered message
-// once (streaming the audio chunks when the TTS backend supports it, otherwise a
-// single unary delta). Tool calls parsed from the autoparser ChatDeltas are
-// emitted after the spoken content. The assistant content item is created lazily
-// on the first content delta, so a content-less tool-call turn emits only the
-// tool calls. It returns true when it has fully handled the response so the
-// caller can return; callers must only invoke it for an audio modality, and with
-// tools only when the model uses its tokenizer template (see triggerResponseAtTurn).
-func streamLLMResponse(ctx context.Context, session *Session, conv *Conversation, t Transport, responseID string, history schema.Messages, images []string, llmCfg *config.ModelConfig, tools []types.ToolUnion, toolChoice *types.ToolChoiceUnion, toolTurn int) bool {
-	itemID := generateItemID()
-	item := types.MessageItemUnion{
-		Assistant: &types.MessageItemAssistant{
-			ID:      itemID,
-			Status:  types.ItemStatusInProgress,
-			Content: []types.MessageContentOutput{{Type: types.MessageContentTypeOutputAudio}},
-		},
-	}
-
-	// announce creates the assistant content item lazily, just before the first
-	// transcript delta — a tool-only turn never produces content, so it stays out
-	// of the conversation and the client sees only the tool calls.
-	announced := false
-	announce := func() {
-		announced = true
-		conv.Lock.Lock()
-		conv.Items = append(conv.Items, &item)
-		conv.Lock.Unlock()
-		sendEvent(t, types.ResponseOutputItemAddedEvent{
-			ServerEventBase: types.ServerEventBase{},
-			ResponseID:      responseID,
-			OutputIndex:     0,
-			Item:            item,
-		})
-		sendEvent(t, types.ResponseContentPartAddedEvent{
-			ServerEventBase: types.ServerEventBase{},
-			ResponseID:      responseID,
-			ItemID:          itemID,
-			OutputIndex:     0,
-			ContentIndex:    0,
-			Part:            item.Assistant.Content[0],
-		})
-	}
-
-	cancel := func() {
-		if announced {
-			conv.Lock.Lock()
-			for i := len(conv.Items) - 1; i >= 0; i-- {
-				if conv.Items[i].Assistant != nil && conv.Items[i].Assistant.ID == itemID {
-					conv.Items = append(conv.Items[:i], conv.Items[i+1:]...)
-					break
-				}
-			}
-			conv.Lock.Unlock()
-		}
-		sendEvent(t, types.ResponseDoneEvent{
-			ServerEventBase: types.ServerEventBase{},
-			Response:        types.Response{ID: responseID, Object: "realtime.response", Status: types.ResponseStatusCancelled},
-		})
-	}
-
-	var template string
-	if llmCfg.TemplateConfig.UseTokenizerTemplate {
-		template = llmCfg.GetModelTemplate()
-	} else {
-		template = llmCfg.TemplateConfig.Chat
-	}
-	thinkingStartToken := reasoning.DetectThinkingStartToken(template, &llmCfg.ReasoningConfig)
-
-	streamer := newTranscriptStreamer(ctx, t, responseID, itemID, thinkingStartToken, llmCfg.ReasoningConfig)
-	streamer.announce = announce
-	cb := func(token string, usage backend.TokenUsage) bool {
-		if ctx.Err() != nil {
-			return false
-		}
-		// Plain-content models stream text via the token; autoparser tool turns
-		// clear the text and deliver content via ChatDeltas, so prefer the latter
-		// when present. Either way only content reaches the transcript — tool-call
-		// deltas are parsed from the final response below.
-		text := token
-		if len(usage.ChatDeltas) > 0 {
-			text = functions.ContentFromChatDeltas(usage.ChatDeltas)
-		}
-		streamer.onToken(text)
-		return true
-	}
-
-	predFunc, err := session.ModelInterface.Predict(ctx, history, images, nil, nil, cb, tools, toolChoice, nil, nil, nil)
-	if err != nil {
-		sendError(t, "inference_failed", fmt.Sprintf("backend error: %v", err), "", itemID)
-		return true
-	}
-	pred, err := predFunc()
-	if err != nil {
-		if ctx.Err() != nil {
-			cancel()
-			return true
-		}
-		sendError(t, "prediction_failed", fmt.Sprintf("backend error: %v", err), "", itemID)
-		return true
-	}
-	if ctx.Err() != nil {
-		cancel()
-		return true
-	}
-
-	content := streamer.content()
-	toolCalls := functions.ToolCallsFromChatDeltas(pred.ChatDeltas)
-
-	// Finalize the spoken content item only when the turn produced content. A
-	// tool-only turn skips this entirely (no empty assistant item).
-	if content != "" {
-		if !announced {
-			announce()
-		}
-		// Buffer the whole message, then synthesize it once. emitSpeech streams
-		// the audio chunks when the TTS backend supports TTSStream, otherwise it
-		// sends a single unary delta — no per-sentence segmentation either way.
-		audio, err := emitSpeech(ctx, t, session, responseID, itemID, content)
-		if err != nil {
-			if ctx.Err() != nil {
-				cancel()
-				return true
-			}
-			sendError(t, "tts_error", fmt.Sprintf("TTS generation failed: %v", err), "", itemID)
-			return true
-		}
-
-		_, isWebRTC := t.(*WebRTCTransport)
-
-		sendEvent(t, types.ResponseOutputAudioTranscriptDoneEvent{
-			ServerEventBase: types.ServerEventBase{},
-			ResponseID:      responseID,
-			ItemID:          itemID,
-			OutputIndex:     0,
-			ContentIndex:    0,
-			Transcript:      content,
-		})
-		if !isWebRTC {
-			sendEvent(t, types.ResponseOutputAudioDoneEvent{
-				ServerEventBase: types.ServerEventBase{},
-				ResponseID:      responseID,
-				ItemID:          itemID,
-				OutputIndex:     0,
-				ContentIndex:    0,
-			})
-		}
-
-		conv.Lock.Lock()
-		item.Assistant.Status = types.ItemStatusCompleted
-		item.Assistant.Content[0].Transcript = content
-		if !isWebRTC {
-			item.Assistant.Content[0].Audio = base64.StdEncoding.EncodeToString(audio)
-		}
-		conv.Lock.Unlock()
-
-		sendEvent(t, types.ResponseContentPartDoneEvent{
-			ServerEventBase: types.ServerEventBase{},
-			ResponseID:      responseID,
-			ItemID:          itemID,
-			OutputIndex:     0,
-			ContentIndex:    0,
-			Part:            item.Assistant.Content[0],
-		})
-		sendEvent(t, types.ResponseOutputItemDoneEvent{
-			ServerEventBase: types.ServerEventBase{},
-			ResponseID:      responseID,
-			OutputIndex:     0,
-			Item:            item,
-		})
-	}
-
-	// Emit any tool calls, the terminal response.done, and (for server-side
-	// assistant tools) the follow-up turn — shared with the buffered path.
-	emitToolCallItems(ctx, session, conv, t, responseID, toolCalls, content != "", toolTurn)
-	return true
-}
--- a/core/http/endpoints/openai/realtime_stream_test.go
+++ b/core/http/endpoints/openai/realtime_stream_test.go
@@ -1,150 +0,0 @@
-package openai
-
-import (
-	"context"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-
-	"github.com/mudler/LocalAI/core/backend"
-	"github.com/mudler/LocalAI/core/config"
-	"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
-	"github.com/mudler/LocalAI/pkg/grpc/proto"
-	"github.com/mudler/LocalAI/pkg/reasoning"
-)
-
-// transcriptStreamer turns streamed LLM tokens into incremental transcript
-// deltas, stripping reasoning. Audio is synthesized once from the full message
-// by the caller, so there is no per-sentence segmentation.
-var _ = Describe("transcriptStreamer", func() {
-	It("emits one transcript delta per content token", func() {
-		t := &fakeTransport{}
-		s := newTranscriptStreamer(context.Background(), t, "resp1", "item1", "", reasoning.Config{})
-
-		for _, tok := range []string{"Hello", " world.", " Bye"} {
-			s.onToken(tok)
-		}
-
-		Expect(s.content()).To(Equal("Hello world. Bye"))
-		Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioTranscriptDelta)).To(Equal(3))
-		Expect(t.transcriptDeltaText()).To(Equal("Hello world. Bye"))
-	})
-
-	It("strips leaked reasoning even when reasoning is disabled (disable_thinking safety net)", func() {
-		// disable_thinking maps to DisableReasoning=true (enable_thinking=false to
-		// the backend). If the model emits thinking anyway, the transcript must
-		// still not leak it: stripping always runs for spoken output.
-		disable := true
-		t := &fakeTransport{}
-		s := newTranscriptStreamer(context.Background(), t, "resp1", "item1", "",
-			reasoning.Config{DisableReasoning: &disable})
-
-		s.onToken("<think>secret plan</think>")
-		s.onToken("The answer is 42.")
-
-		Expect(s.content()).To(Equal("The answer is 42."))
-		Expect(s.content()).ToNot(ContainSubstring("secret plan"))
-		Expect(t.transcriptDeltaText()).ToNot(ContainSubstring("secret plan"))
-	})
-})
-
-// streamLLMResponse drives a full streamed realtime turn: live transcript
-// deltas while the LLM generates, then the whole message is synthesized once.
-var _ = Describe("streamLLMResponse", func() {
-	It("streams transcript deltas then synthesizes the whole message once", func() {
-		on := true
-		m := &fakeModel{
-			predictTokens:   []string{"Hello", " world.", " How are you?"},
-			predictResp:     backend.LLMResponse{Response: "Hello world. How are you?"},
-			ttsStreamChunks: [][]byte{{9}},
-			ttsStreamRate:   24000,
-		}
-		session := &Session{
-			OutputSampleRate: 24000,
-			ModelInterface:   m,
-			ModelConfig: &config.ModelConfig{
-				Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{LLM: &on, TTS: &on}},
-			},
-		}
-		conv := &Conversation{}
-		t := &fakeTransport{}
-		llmCfg := &config.ModelConfig{}
-
-		handled := streamLLMResponse(context.Background(), session, conv, t, "resp1", nil, nil, llmCfg, nil, nil, 0)
-
-		Expect(handled).To(BeTrue())
-		// One live transcript delta per streamed token.
-		Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioTranscriptDelta)).To(Equal(3))
-		// The whole message is synthesized ONCE (not per sentence): a single
-		// emitSpeech replays the one TTS stream chunk.
-		Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioDelta)).To(Equal(1))
-		Expect(t.transcriptDeltaText()).To(Equal("Hello world. How are you?"))
-	})
-
-	It("streams content deltas and emits tool-call items (autoparser tool turn)", func() {
-		on := true
-		// Autoparser path: reply.Message is empty; content + tool calls arrive via
-		// ChatDeltas. Chunk 1 carries content, chunk 2 carries the tool call.
-		contentDelta := []*proto.ChatDelta{{Content: "Let me check."}}
-		toolDelta := []*proto.ChatDelta{{ToolCalls: []*proto.ToolCallDelta{{Index: 0, Name: "get_weather", Arguments: `{"city":"Paris"}`}}}}
-		m := &fakeModel{
-			predictTokens:      []string{"", ""},
-			predictChunkDeltas: [][]*proto.ChatDelta{contentDelta, toolDelta},
-			predictResp:        backend.LLMResponse{ChatDeltas: append(append([]*proto.ChatDelta{}, contentDelta...), toolDelta...)},
-			ttsStreamChunks:    [][]byte{{9}},
-			ttsStreamRate:      24000,
-		}
-		session := &Session{
-			OutputSampleRate: 24000,
-			ModelInterface:   m,
-			ModelConfig: &config.ModelConfig{
-				Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{LLM: &on, TTS: &on}},
-			},
-		}
-		conv := &Conversation{}
-		t := &fakeTransport{}
-		llmCfg := &config.ModelConfig{}
-		llmCfg.TemplateConfig.UseTokenizerTemplate = true
-
-		handled := streamLLMResponse(context.Background(), session, conv, t, "resp1", nil, nil, llmCfg, nil, nil, 0)
-
-		Expect(handled).To(BeTrue())
-		// The spoken content was streamed live.
-		Expect(t.transcriptDeltaText()).To(Equal("Let me check."))
-		// The tool call is emitted as a function_call item.
-		Expect(t.countEvents(types.ServerEventTypeResponseFunctionCallArgumentsDone)).To(Equal(1))
-		// Exactly one terminal response.done.
-		Expect(t.countEvents(types.ServerEventTypeResponseDone)).To(Equal(1))
-	})
-
-	It("emits only tool-call items for a content-less tool turn (no empty assistant item)", func() {
-		on := true
-		toolDelta := []*proto.ChatDelta{{ToolCalls: []*proto.ToolCallDelta{{Index: 0, Name: "get_weather", Arguments: `{"city":"Rome"}`}}}}
-		m := &fakeModel{
-			predictTokens:      []string{""},
-			predictChunkDeltas: [][]*proto.ChatDelta{toolDelta},
-			predictResp:        backend.LLMResponse{ChatDeltas: toolDelta},
-		}
-		session := &Session{
-			OutputSampleRate: 24000,
-			ModelInterface:   m,
-			ModelConfig: &config.ModelConfig{
-				Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{LLM: &on, TTS: &on}},
-			},
-		}
-		conv := &Conversation{}
-		t := &fakeTransport{}
-		llmCfg := &config.ModelConfig{}
-		llmCfg.TemplateConfig.UseTokenizerTemplate = true
-
-		handled := streamLLMResponse(context.Background(), session, conv, t, "resp1", nil, nil, llmCfg, nil, nil, 0)
-
-		Expect(handled).To(BeTrue())
-		// No content → no transcript deltas and no spurious assistant content item.
-		Expect(t.transcriptDeltaText()).To(Equal(""))
-		Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioTranscriptDelta)).To(Equal(0))
-		// The tool call is still emitted.
-		Expect(t.countEvents(types.ServerEventTypeResponseFunctionCallArgumentsDone)).To(Equal(1))
-		Expect(t.countEvents(types.ServerEventTypeResponseDone)).To(Equal(1))
-	})
-})
--- a/core/http/endpoints/openai/realtime_thinking.go
+++ b/core/http/endpoints/openai/realtime_thinking.go
@@ -1,33 +0,0 @@
-package openai
-
-import (
-	"github.com/mudler/LocalAI/core/config"
-	"github.com/mudler/LocalAI/pkg/reasoning"
-)
-
-// applyPipelineThinking forces the LLM's reasoning/thinking off when the realtime
-// pipeline sets disable_thinking, mapping to the enable_thinking=false backend
-// metadata via ReasoningConfig.DisableReasoning. The LLM config passed in is the
-// per-session copy returned by the config loader, so this does not affect other
-// users of the same model. When the pipeline does not set disable_thinking the
-// LLM config is left untouched.
-func applyPipelineThinking(llm *config.ModelConfig, pipeline config.Pipeline) {
-	if llm == nil || !pipeline.ThinkingDisabled() {
-		return
-	}
-	disable := true
-	llm.ReasoningConfig.DisableReasoning = &disable
-}
-
-// spokenReasoningConfig adapts a model's reasoning config for stripping reasoning
-// OUT of realtime spoken output. ReasoningConfig.DisableReasoning is overloaded:
-// the backend reads it as the "enable_thinking=false" hint (which pipeline
-// disable_thinking sets via applyPipelineThinking), but the reasoning extractor
-// reads it as "skip stripping, assume there is no reasoning". Honouring the latter
-// when extracting for speech would leak raw <think>…</think> whenever the model
-// ignores the suppression hint. Spoken output must never contain reasoning, so we
-// always strip: clear DisableReasoning while keeping custom tokens/tag pairs.
-func spokenReasoningConfig(cfg reasoning.Config) reasoning.Config {
-	cfg.DisableReasoning = nil
-	return cfg
-}
--- a/core/http/endpoints/openai/realtime_thinking_test.go
+++ b/core/http/endpoints/openai/realtime_thinking_test.go
@@ -1,50 +0,0 @@
-package openai
-
-import (
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-
-	"github.com/mudler/LocalAI/core/config"
-	"github.com/mudler/LocalAI/pkg/reasoning"
-)
-
-// applyPipelineThinking lets a realtime pipeline force the LLM's thinking off
-// (enable_thinking=false metadata) without editing the LLM model config.
-var _ = Describe("applyPipelineThinking", func() {
-	It("disables reasoning on the LLM config when the pipeline disables thinking", func() {
-		disable := true
-		llm := &config.ModelConfig{}
-		applyPipelineThinking(llm, config.Pipeline{DisableThinking: &disable})
-		Expect(llm.ReasoningConfig.DisableReasoning).ToNot(BeNil())
-		Expect(*llm.ReasoningConfig.DisableReasoning).To(BeTrue())
-	})
-
-	It("leaves the LLM config untouched when the pipeline does not set disable_thinking", func() {
-		llm := &config.ModelConfig{}
-		applyPipelineThinking(llm, config.Pipeline{})
-		Expect(llm.ReasoningConfig.DisableReasoning).To(BeNil())
-	})
-})
-
-// spokenReasoningConfig clears DisableReasoning so realtime spoken output always
-// strips reasoning, even though disable_thinking sets DisableReasoning=true on the
-// LLM config (which the backend reads as enable_thinking=false).
-var _ = Describe("spokenReasoningConfig", func() {
-	It("clears DisableReasoning so the extractor still strips leaked reasoning", func() {
-		disable := true
-		out := spokenReasoningConfig(reasoning.Config{DisableReasoning: &disable})
-		Expect(out.DisableReasoning).To(BeNil())
-	})
-
-	It("preserves the other reasoning settings", func() {
-		disable := true
-		out := spokenReasoningConfig(reasoning.Config{
-			DisableReasoning:    &disable,
-			ThinkingStartTokens: []string{"<reason>"},
-			TagPairs:            []reasoning.TagPair{{Start: "<reason>", End: "</reason>"}},
-		})
-		Expect(out.ThinkingStartTokens).To(Equal([]string{"<reason>"}))
-		Expect(out.TagPairs).To(HaveLen(1))
-		Expect(out.TagPairs[0].Start).To(Equal("<reason>"))
-	})
-})
--- a/core/http/endpoints/openai/realtime_transcription.go
+++ b/core/http/endpoints/openai/realtime_transcription.go
@@ -1,63 +0,0 @@
-package openai
-
-import (
-	"context"
-	"fmt"
-
-	"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
-)
-
-// emitTranscription transcribes a committed utterance and emits the transcription
-// events for it, returning the final transcript text. With
-// pipeline.streaming.transcription enabled it streams each transcript fragment as
-// a conversation.item.input_audio_transcription.delta as the backend produces it,
-// then a completed event; otherwise it transcribes the whole utterance and emits
-// a single completed event. delta and completed events share itemID.
-func emitTranscription(ctx context.Context, t Transport, session *Session, itemID, audioPath string) (string, error) {
-	cfg := session.InputAudioTranscription
-
-	if session.ModelConfig != nil && session.ModelConfig.Pipeline.StreamTranscription() {
-		final, err := session.ModelInterface.TranscribeStream(ctx, audioPath, cfg.Language, false, false, cfg.Prompt, func(delta string) {
-			_ = t.SendEvent(types.ConversationItemInputAudioTranscriptionDeltaEvent{
-				ServerEventBase: types.ServerEventBase{EventID: "event_TODO"},
-				ItemID:          itemID,
-				ContentIndex:    0,
-				Delta:           delta,
-			})
-		})
-		if err != nil {
-			return "", err
-		}
-		transcript := ""
-		if final != nil {
-			transcript = final.Text
-		}
-		if err := t.SendEvent(types.ConversationItemInputAudioTranscriptionCompletedEvent{
-			ServerEventBase: types.ServerEventBase{EventID: "event_TODO"},
-			ItemID:          itemID,
-			ContentIndex:    0,
-			Transcript:      transcript,
-		}); err != nil {
-			return "", err
-		}
-		return transcript, nil
-	}
-
-	// Unary fallback: transcribe the whole utterance, emit one completed event.
-	tr, err := session.ModelInterface.Transcribe(ctx, audioPath, cfg.Language, false, false, cfg.Prompt)
-	if err != nil {
-		return "", err
-	}
-	if tr == nil {
-		return "", fmt.Errorf("transcribe result is nil")
-	}
-	if err := t.SendEvent(types.ConversationItemInputAudioTranscriptionCompletedEvent{
-		ServerEventBase: types.ServerEventBase{EventID: "event_TODO"},
-		ItemID:          itemID,
-		ContentIndex:    0,
-		Transcript:      tr.Text,
-	}); err != nil {
-		return "", err
-	}
-	return tr.Text, nil
-}
--- a/core/http/endpoints/openai/realtime_transcription_test.go
+++ b/core/http/endpoints/openai/realtime_transcription_test.go
@@ -1,54 +0,0 @@
-package openai
-
-import (
-	"context"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-
-	"github.com/mudler/LocalAI/core/config"
-	"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
-	"github.com/mudler/LocalAI/core/schema"
-)
-
-// emitTranscription transcribes a committed utterance, streaming transcript text
-// deltas when the pipeline opts in, and returns the final transcript text.
-var _ = Describe("emitTranscription", func() {
-	It("streams transcription deltas then a completed event when streaming is enabled", func() {
-		on := true
-		session := &Session{
-			InputAudioTranscription: &types.AudioTranscription{},
-			ModelConfig: &config.ModelConfig{
-				Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{Transcription: &on}},
-			},
-			ModelInterface: &fakeModel{
-				transcribeDeltas: []string{"Hel", "lo", " world"},
-				transcribeFinal:  &schema.TranscriptionResult{Text: "Hello world"},
-			},
-		}
-		t := &fakeTransport{}
-
-		transcript, err := emitTranscription(context.Background(), t, session, "item1", "/tmp/x.wav")
-
-		Expect(err).ToNot(HaveOccurred())
-		Expect(transcript).To(Equal("Hello world"))
-		Expect(t.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionDelta)).To(Equal(3))
-		Expect(t.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionCompleted)).To(Equal(1))
-	})
-
-	It("emits a single completed event with no deltas in unary mode", func() {
-		session := &Session{
-			InputAudioTranscription: &types.AudioTranscription{},
-			ModelConfig:             &config.ModelConfig{}, // streaming off
-			ModelInterface:          &fakeModel{transcribeFinal: &schema.TranscriptionResult{Text: "Hi"}},
-		}
-		t := &fakeTransport{}
-
-		transcript, err := emitTranscription(context.Background(), t, session, "item1", "/tmp/x.wav")
-
-		Expect(err).ToNot(HaveOccurred())
-		Expect(transcript).To(Equal("Hi"))
-		Expect(t.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionDelta)).To(Equal(0))
-		Expect(t.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionCompleted)).To(Equal(1))
-	})
-})
--- a/docs/content/features/audio-to-text.md
+++ b/docs/content/features/audio-to-text.md
@@ -187,6 +187,21 @@ curl http://localhost:8080/v1/audio/transcriptions \

 For real-time use, load a cache-aware streaming model (e.g. `realtime_eou_120m-v1-*.gguf`) and pass `-F stream=true`. Deltas are emitted as the audio is decoded, with end-of-utterance events closing each segment.

+### Segment timestamps
+
+Transcriptions are split into segments the same way NVIDIA NeMo does: a new segment starts after sentence-ending punctuation (`.`, `?`, `!`), and each segment carries `start`/`end` times. This punctuation-only segmentation is the default and needs no configuration. In streaming mode, each end-of-utterance event closes a segment, and that segment also carries `start`/`end` times.
+
+To also split on silence, set `segment_gap_threshold`. It mirrors NeMo's option of the same name, is measured in **encoder frames**, and is off by default (`0`). When set, a gap between two consecutive words wider than the threshold also starts a new segment. The value stays in frames to match NeMo exactly; the backend converts it to seconds using the model's frame stride (`frame_sec`, reported by the engine):
+
+```yaml
+name: parakeet-110m
+backend: parakeet-cpp
+parameters:
+  model: tdt_ctc-110m-f16.gguf
+options:
+- segment_gap_threshold:12   # split on silence > 12 encoder frames (default 0 = off, punctuation-only)
+```
+
 ### Dynamic batching

 The backend can coalesce concurrent transcription requests into a single batched engine call, which improves throughput on GPU when many requests arrive at once. Batching is **off by default** (`batch_max_size:1`, one request at a time); raise it to opt in. Two `options:` knobs control it:
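The segmentation rules documented in the hunk above (split after `.`, `?`, `!`, plus an optional silence-gap split given in encoder frames) can be sketched in a few lines of Go. This is a minimal illustration, not the backend's implementation: the `Word` and `Segment` types, the `splitSegments` helper, and the `frameSec` value of 0.08 are all hypothetical, and the sketch does not try to reproduce NeMo's exact branch ordering.

```go
package main

import (
	"fmt"
	"strings"
)

// Word is a hypothetical word-with-timestamps record (times in seconds).
type Word struct {
	Text       string
	Start, End float64
}

// Segment is a hypothetical group of consecutive words.
type Segment struct {
	Text       string
	Start, End float64
	Words      []Word
}

// endsSentence reports whether a word ends with sentence-ending punctuation.
func endsSentence(text string) bool {
	return strings.HasSuffix(text, ".") ||
		strings.HasSuffix(text, "?") ||
		strings.HasSuffix(text, "!")
}

// splitSegments closes a segment after each sentence-ending word. When
// gapFrames > 0 it also starts a new segment if the silence before a word
// is wider than gapFrames*frameSec seconds. gapFrames == 0 means
// punctuation-only segmentation.
func splitSegments(words []Word, gapFrames int, frameSec float64) []Segment {
	var segs []Segment
	var cur []Word
	flush := func() {
		if len(cur) == 0 {
			return
		}
		texts := make([]string, len(cur))
		for i, w := range cur {
			texts[i] = w.Text
		}
		segs = append(segs, Segment{
			Text:  strings.Join(texts, " "),
			Start: cur[0].Start,
			End:   cur[len(cur)-1].End,
			Words: cur,
		})
		cur = nil
	}
	gapSec := float64(gapFrames) * frameSec // frames converted to seconds
	for _, w := range words {
		if gapFrames > 0 && len(cur) > 0 && w.Start-cur[len(cur)-1].End > gapSec {
			flush() // silence gap: the current word opens a new segment
		}
		cur = append(cur, w)
		if endsSentence(w.Text) {
			flush() // punctuation: the current word closes the segment
		}
	}
	flush() // trailing words without closing punctuation
	return segs
}

func main() {
	words := []Word{
		{"Hello", 0.0, 0.4}, {"world.", 0.5, 0.9},
		{"How", 1.0, 1.2}, {"are", 1.3, 1.5}, {"you?", 3.0, 3.4},
	}
	// frameSec = 0.08 is an assumed stride; the real value comes from the engine.
	for _, s := range splitSegments(words, 12, 0.08) {
		fmt.Printf("[%.2f - %.2f] %s\n", s.Start, s.End, s.Text)
	}
}
```

With the sample words, punctuation-only mode (`gapFrames` of 0) yields two segments, while a threshold of 12 frames at the assumed stride (about 0.96 s) also splits "How are" from "you?" because of the 1.5 s pause between them.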
--- a/docs/content/features/distributed-mode.md
+++ b/docs/content/features/distributed-mode.md
@@ -133,9 +133,9 @@ When S3 is not configured, model files are transferred directly from the fronten

 For high-throughput or very large model files, S3 can be more efficient since it avoids streaming through the frontend.

-{{% alert icon="⚠️" color="warning" %}}
+{{% notice warning %}}
 The worker HTTP file transfer server is authenticated by `LOCALAI_REGISTRATION_TOKEN`. If the token is **empty**, the server **fails open** — anyone who can reach the port gets read/write access to the worker's models/staging/data directories (a remote model-poisoning / exfiltration vector). The worker logs a loud warning at startup in this case. Always set `LOCALAI_REGISTRATION_TOKEN` in distributed mode, and set `LOCALAI_DISTRIBUTED_REQUIRE_AUTH=true` (frontend **and** workers) to make a missing token *or* missing NATS credentials a hard startup error rather than a silent fail-open. Firewall the file-transfer port (gRPC base − 1) so only the frontend can reach it.
-{{% /alert %}}
+{{% /notice %}}

 ### Watching Backend Installs

--- a/docs/content/features/openai-realtime.md
+++ b/docs/content/features/openai-realtime.md
@@ -31,41 +31,6 @@ This configuration links the following components:

 Make sure all referenced models (`silero-vad-ggml`, `whisper-large-turbo`, `qwen3-4b`, `tts-1`) are also installed or defined in your LocalAI instance.

-### Streaming the pipeline
-
-By default each stage runs to completion before the next begins: the whole utterance is transcribed, the full LLM reply is generated, then it is synthesized. Each stage can instead be streamed incrementally, which lowers the time-to-first-audio of a turn:
-
-```yaml
-name: gpt-realtime
-pipeline:
-  vad: silero-vad-ggml
-  transcription: whisper-large-turbo
-  llm: qwen3-4b
-  tts: tts-1
-  streaming:
-    llm: true            # stream LLM tokens as transcript deltas
-    tts: true            # emit audio deltas per synthesized chunk
-    transcription: true  # stream transcript text deltas of the user's speech
-```
-
- **streaming.tts**: emit a `response.output_audio.delta` per audio chunk the TTS backend produces (requires a backend that supports streaming synthesis), instead of one delta for the whole utterance. Falls back to a single unary delta otherwise.
- **streaming.transcription**: stream `conversation.item.input_audio_transcription.delta` events as the transcript is produced (requires a transcription backend that supports streaming).
- **streaming.llm**: stream the LLM reply token-by-token as `response.output_audio_transcript.delta` events. The full reply is buffered and synthesized once it is complete — streamed as audio chunks when `streaming.tts` is enabled (and the TTS backend supports it), otherwise as a single unary delta. Reasoning/thinking is always stripped from the spoken transcript. Tool calls are supported while streaming when the LLM uses its tokenizer template (`use_tokenizer_template: true`): the backend's autoparser then delivers content and tool calls separately, so the spoken transcript never leaks tool-call tokens. Grammar-based function calling keeps the buffered path.
-
-All streaming flags are off by default, so existing pipelines are unaffected.
-
-### Disabling thinking
-
-For reasoning models, you can force the pipeline LLM's thinking off without editing the LLM model config:
-
-```yaml
-pipeline:
-  llm: qwen3-4b
-  disable_thinking: true   # maps to enable_thinking=false for the realtime LLM
-```
-
-This is applied only to the realtime session's copy of the LLM config, so it does not affect other users of the same model. Leave it unset to use the LLM model config's own reasoning settings.
-
 ## Transports

 The Realtime API supports two transports: **WebSocket** and **WebRTC**.
--- a/gallery/index.yaml
+++ b/gallery/index.yaml
@@ -1,4 +1,57 @@
 ---
+- name: "gemma-4-12b-it-qat-q4_0"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-gguf
+  description: |
+    Hugging Face |
+    GitHub |
+    Launch Blog |
+    Documentation
+
+    License: Apache 2.0 | Authors: Google DeepMind
+
+    > [!Note]
+    > This model card is for the new versions of the Gemma 4 family optimized with Quantization-Aware Training (QAT), which allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model.
+    > Four versions of the QAT checkpoints are available:
+    > * **Unquantized QAT checkpoints** (Q4_0): Half-precision weights extracted from the QAT pipeline, ideal for custom downstream compilation and research. Available for Gemma 4 E2B, E4B, 12B, 26B A4B, and 31B, and their drafter models.
+    > * **GGUF** (Q4_0): Ready-to-deploy formats for broad ecosystem compatibility. Available for Gemma 4 E2B, E4B, 12B, 26B A4B, and 31B.
+    > * **Mobile-optimized** (wNa8o8): A custom schema engineered explicitly for mobile hardware efficiency. It features targeted 2-bit decoding layers, optimized KV caches, and static activations to maximize VRAM savings. Available for Gemma 4 E2B and E4B.
+    > * **Compressed Tensors** (w4a16): QAT checkpoints serialized in the compressed-tensors format for native, optimized inference with vLLM. Available for Gemma 4 E2B, E4B, 12B
+
+    ...
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+  icon: https://ai.google.dev/gemma/images/gemma4_banner.png
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    mmproj: llama-cpp/mmproj/gemma-4-12B-it-qat-q4_0-gguf/mmproj-gemma-4-12b-it-qat-q4_0.gguf
+    options:
+      - use_jinja:true
+    parameters:
+      min_p: 0
+      model: llama-cpp/models/gemma-4-12B-it-qat-q4_0-gguf/gemma-4-12b-it-qat-q4_0.gguf
+      repeat_penalty: 1
+      temperature: 1
+      top_k: 64
+      top_p: 0.95
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/gemma-4-12B-it-qat-q4_0-gguf/gemma-4-12b-it-qat-q4_0.gguf
+      sha256: faff1a63667fac17ac5e777f47114688fcefea96e220e211aaa8d62c2c4561f1
+      uri: https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-gguf/resolve/main/gemma-4-12b-it-qat-q4_0.gguf
+    - filename: llama-cpp/mmproj/gemma-4-12B-it-qat-q4_0-gguf/mmproj-gemma-4-12b-it-qat-q4_0.gguf
+      sha256: e70b0e5cd80323d5d588b4ed06780356b7b1ba03995a4b8164c6ae9db0ff5989
+      uri: https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-gguf/resolve/main/mmproj-gemma-4-12b-it-qat-q4_0.gguf
 - name: "step-3.7-flash"
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
  urls:
@@ -26112,6 +26165,106 @@
    - filename: ae.safetensors
      sha256: afc8e28272cd15db3919bacdb6918ce9c1ed22e96cb12c4d5ed0fba823529e38
      uri: https://huggingface.co/ChuckMcSneed/FLUX.1-dev/resolve/main/ae.safetensors
+- name: ideogram-4-iq4nl-ggml
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/ideogram-ai/ideogram-4-fp8
+    - https://huggingface.co/stduhpf/ideogram-4-gguf
+  description: |
+    Ideogram 4 is a text-to-image diffusion model known for state-of-the-art prompt adherence and exceptional, accurate text rendering inside images. It is driven by a Qwen3-VL-8B text encoder and performs real classifier-free guidance from a separate unconditional diffusion model.
+
+    This is the iQ4_NL (4-bit) quantization, a good balance of quality and footprint (~5.8GB diffusion + ~5.8GB unconditional). The bundle also pulls the Qwen3-VL-8B-Instruct text encoder and the FLUX.2 VAE. Quantized GGUF weights by stduhpf for use with stable-diffusion.cpp.
+  license: ideogram-non-commercial-model-agreement
+  tags:
+    - ideogram
+    - ideogram4
+    - text-to-image
+    - image-generation
+    - gguf
+    - quantized
+    - 8b
+    - diffusion
+  last_checked: "2026-06-06"
+  overrides:
+    backend: stablediffusion-ggml
+    step: 25
+    # Ideogram4 runs real classifier-free guidance from a separate
+    # unconditional diffusion model, so it needs a CFG scale > 1 (unlike the
+    # guidance-distilled Flux / Z-Image models). 7 matches the upstream
+    # stable-diffusion.cpp default used in the Ideogram4 example.
+    cfg_scale: 7
+    options:
+      - diffusion_model
+      - uncond_diffusion_model_path:ideogram4_unconditional-iQ4_NL.gguf
+      - llm_path:Qwen3-VL-8B-Instruct-Q4_K_M.gguf
+      - vae_path:flux2-vae.safetensors
+      - sampler:euler
+      - offload_params_to_cpu:true
+    parameters:
+      model: ideogram4-iQ4_NL.gguf
+  files:
+    - filename: ideogram4-iQ4_NL.gguf
+      sha256: 578502024f23e8e988e0cb297201f1ac88dddad5706726ad222d918727e0211d
+      uri: huggingface://stduhpf/ideogram-4-gguf/ideogram4-iQ4_NL.gguf
+    - filename: ideogram4_unconditional-iQ4_NL.gguf
+      sha256: 4140e58c6818dac8221fa590a6814246b5336bb23246fbbb96b9048e887f47cf
+      uri: huggingface://stduhpf/ideogram-4-gguf/ideogram4_unconditional-iQ4_NL.gguf
+    - filename: Qwen3-VL-8B-Instruct-Q4_K_M.gguf
+      sha256: 108e7ff92b78eefd3db4741885104acba514255c11b617d3c7b197a5f46efe89
+      uri: huggingface://unsloth/Qwen3-VL-8B-Instruct-GGUF/Qwen3-VL-8B-Instruct-Q4_K_M.gguf
+    - filename: flux2-vae.safetensors
+      sha256: 868fe7b343cc8f3a19dbcfcafbc3d5f888802be3f89bd81b65b3621a066ce8f3
+      uri: https://huggingface.co/Comfy-Org/Ideogram-4/resolve/main/vae/flux2-vae.safetensors
+- name: ideogram-4-q8_0-ggml
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/ideogram-ai/ideogram-4-fp8
+    - https://huggingface.co/stduhpf/ideogram-4-gguf
+  description: |
+    Ideogram 4 is a text-to-image diffusion model known for state-of-the-art prompt adherence and exceptional, accurate text rendering inside images. It is driven by a Qwen3-VL-8B text encoder and performs real classifier-free guidance from a separate unconditional diffusion model.
+
+    This is the Q8_0 (8-bit) quantization for highest quality (~10.1GB diffusion + ~10.1GB unconditional). The bundle also pulls the Qwen3-VL-8B-Instruct text encoder and the FLUX.2 VAE. Quantized GGUF weights by stduhpf for use with stable-diffusion.cpp.
+  license: ideogram-non-commercial-model-agreement
+  tags:
+    - ideogram
+    - ideogram4
+    - text-to-image
+    - image-generation
+    - gguf
+    - quantized
+    - 8b
+    - diffusion
+  last_checked: "2026-06-06"
+  overrides:
+    backend: stablediffusion-ggml
+    step: 25
+    # Ideogram4 runs real classifier-free guidance from a separate
+    # unconditional diffusion model, so it needs a CFG scale > 1 (unlike the
+    # guidance-distilled Flux / Z-Image models). 7 matches the upstream
+    # stable-diffusion.cpp default used in the Ideogram4 example.
+    cfg_scale: 7
+    options:
+      - diffusion_model
+      - uncond_diffusion_model_path:ideogram4_unconditional-Q8_0.gguf
+      - llm_path:Qwen3-VL-8B-Instruct-Q4_K_M.gguf
+      - vae_path:flux2-vae.safetensors
+      - sampler:euler
+      - offload_params_to_cpu:true
+    parameters:
+      model: ideogram4-Q8_0.gguf
+  files:
+    - filename: ideogram4-Q8_0.gguf
+      sha256: feb6cae997927ba0e339bf6ef64b14df9353064f60805d53f84c592643addcfd
+      uri: huggingface://stduhpf/ideogram-4-gguf/ideogram4-Q8_0.gguf
+    - filename: ideogram4_unconditional-Q8_0.gguf
+      sha256: 9261d1473d328aa7edbe1b3fa48a9b9bd2e19fe78439fe6a293af1016c63debd
+      uri: huggingface://stduhpf/ideogram-4-gguf/ideogram4_unconditional-Q8_0.gguf
+    - filename: Qwen3-VL-8B-Instruct-Q4_K_M.gguf
+      sha256: 108e7ff92b78eefd3db4741885104acba514255c11b617d3c7b197a5f46efe89
+      uri: huggingface://unsloth/Qwen3-VL-8B-Instruct-GGUF/Qwen3-VL-8B-Instruct-Q4_K_M.gguf
+    - filename: flux2-vae.safetensors
+      sha256: 868fe7b343cc8f3a19dbcfcafbc3d5f888802be3f89bd81b65b3621a066ce8f3
+      uri: https://huggingface.co/Comfy-Org/Ideogram-4/resolve/main/vae/flux2-vae.safetensors
 - name: whisper-1
  url: github:mudler/LocalAI/gallery/whisper-base.yaml@master
  urls:
@@ -31887,6 +32040,41 @@
    - filename: parakeet-cpp/tdt_ctc-1.1b-f16.gguf
      uri: huggingface://mudler/parakeet-cpp-gguf/tdt_ctc-1.1b-f16.gguf
      sha256: cd53f64eefac2623a12f2f118ef50b56622dc3012f42c815c6adf0d08292f387
+- name: parakeet-cpp-nemotron-3.5-asr-streaming-0.6b
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/mudler/parakeet-cpp-gguf
+    - https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b
+    - https://github.com/mudler/parakeet.cpp
+  description: |
+    Multilingual (40+ locales), prompt-conditioned, cache-aware streaming FastConformer RNN-T, 0.6B.
+    Q8_0 GGUF for the parakeet-cpp backend (C++/ggml port of NVIDIA NeMo). Byte-identical to NeMo at
+    WER 0 offline and streaming, about 2.5x faster than NeMo on CPU with no GPU. Select a language with
+    the request "language" field (for example en, de, es, ja-JP), or leave it empty for automatic
+    detection. License OpenMDW-1.1.
+  license: other
+  tags:
+    - parakeet
+    - parakeet-cpp
+    - nemotron
+    - asr
+    - speech-recognition
+    - stt
+    - multilingual
+    - streaming
+    - gguf
+    - ggml
+  overrides:
+    backend: parakeet-cpp
+    known_usecases:
+      - transcript
+    name: parakeet-cpp-nemotron-3.5-asr-streaming-0.6b
+    parameters:
+      model: parakeet-cpp/nemotron-3.5-asr-streaming-0.6b-q8_0.gguf
+  files:
+    - filename: parakeet-cpp/nemotron-3.5-asr-streaming-0.6b-q8_0.gguf
+      uri: huggingface://mudler/parakeet-cpp-gguf/nemotron-3.5-asr-streaming-0.6b-q8_0.gguf
+      sha256: ba2f13eccd4a5245be728f77e6149bd6a4fdcdd133ff2e08ac6005bcef7a99f1
 - name: parakeet-crispasr
  url: github:mudler/LocalAI/gallery/virtual.yaml@master
  urls:
--- a/go.mod
+++ b/go.mod
@@ -219,8 +219,8 @@ require (
 	github.com/kevinburke/ssh_config v1.2.0 // indirect
 	github.com/labstack/gommon v0.4.2 // indirect
 	github.com/mschoch/smat v0.2.0 // indirect
-	github.com/mudler/LocalAGI v0.0.0-20260508125235-37810d918a87
-	github.com/mudler/localrecall v0.6.1-0.20260507074622-a7724fef6f81 // indirect
+	github.com/mudler/LocalAGI v0.0.0-20260606071251-14aed1ae4336
+	github.com/mudler/localrecall v0.6.3-0.20260606070048-9a3b3321a9cd // indirect
 	github.com/mudler/skillserver v0.0.7-0.20260520220837-a7317cbf9145
 	github.com/olekukonko/tablewriter v0.0.5 // indirect
 	github.com/oxffaa/gopher-parse-sitemap v0.0.0-20191021113419-005d2eb1def4 // indirect
--- a/go.sum
+++ b/go.sum
@@ -966,8 +966,8 @@ github.com/mr-tron/base58 v1.3.0 h1:K6Y13R2h+dku0wOqKtecgRnBUBPrZzLZy5aIj8lCcJI=
 github.com/mr-tron/base58 v1.3.0/go.mod h1:2BuubE67DCSWwVfx37JWNG8emOC0sHEU4/HpcYgCLX8=
 github.com/mschoch/smat v0.2.0 h1:8imxQsjDm8yFEAVBe7azKmKSgzSkZXDuKkSq9374khM=
 github.com/mschoch/smat v0.2.0/go.mod h1:kc9mz7DoBKqDyiRL7VZN8KvXQMWeTaVnttLRXOlotKw=
-github.com/mudler/LocalAGI v0.0.0-20260508125235-37810d918a87 h1:az+2umaD/sT1rRvI3WZHWXjzdJVJHxcyxp0SNYbqlFk=
-github.com/mudler/LocalAGI v0.0.0-20260508125235-37810d918a87/go.mod h1:x77p9W1zKZr+W+UcEwg8/qdp00p4XXOI69wE7WlXZc0=
+github.com/mudler/LocalAGI v0.0.0-20260606071251-14aed1ae4336 h1:iKBkSnpisOvMVxFoYsAObvAuOqXBakRPMD0PWxWG5EE=
+github.com/mudler/LocalAGI v0.0.0-20260606071251-14aed1ae4336/go.mod h1:U+g6u8mF2wQxhkdBl3dr8G4db1cv3n7KTKmraoJ7D0c=
 github.com/mudler/cogito v0.9.5-0.20260315222927-63abdec7189b h1:A74T2Lauvg61KodYqsjTYDY05kPLcW+efVZjd23dghU=
 github.com/mudler/cogito v0.9.5-0.20260315222927-63abdec7189b/go.mod h1:6sfja3lcu2nWRzEc0wwqGNu/eCG3EWgij+8s7xyUeQ4=
 github.com/mudler/edgevpn v0.34.0 h1:qDrD/rCPFY/FdURbXudIZWihVKY4VOX3nMn3CcbeQEU=
@@ -976,8 +976,8 @@ github.com/mudler/go-piper v0.0.0-20241023091659-2494246fd9fc h1:RxwneJl1VgvikiX
 github.com/mudler/go-piper v0.0.0-20241023091659-2494246fd9fc/go.mod h1:O7SwdSWMilAWhBZMK9N9Y/oBDyMMzshE3ju8Xkexwig=
 github.com/mudler/go-processmanager v0.1.1 h1:c/1NRZOZpW8HuFv9RhBG57nQu1oDMRomEHedwBFMlrw=
 github.com/mudler/go-processmanager v0.1.1/go.mod h1:h6kmHUZeafr+k5hRYpGLMzJFH4hItHffgpRo2QIkP+o=
-github.com/mudler/localrecall v0.6.1-0.20260507074622-a7724fef6f81 h1:8D9NJ/ikhsJCxUwbdzIzadw6RqDrW+L0FPqpQQSeux8=
-github.com/mudler/localrecall v0.6.1-0.20260507074622-a7724fef6f81/go.mod h1:28k5n19raUrkuwXkacdNsBlj8yuSnGhpT16tu+2+4dU=
+github.com/mudler/localrecall v0.6.3-0.20260606070048-9a3b3321a9cd h1:trn9D5UHAE6zdRyD2uX04W1tLSslAwozVwcyNTd72Ak=
+github.com/mudler/localrecall v0.6.3-0.20260606070048-9a3b3321a9cd/go.mod h1:28k5n19raUrkuwXkacdNsBlj8yuSnGhpT16tu+2+4dU=
 github.com/mudler/memory v0.0.0-20260406210934-424c1ecf2cf8 h1:Ry8RiWy8fZ6Ff4E7dPmjRsBrnHOnPeOOj2LhCgyjQu0=
 github.com/mudler/memory v0.0.0-20260406210934-424c1ecf2cf8/go.mod h1:EA8Ashhd56o32qN7ouPKFSRUs/Z+LrRCF4v6R2Oarm8=
 github.com/mudler/skillserver v0.0.7-0.20260520220837-a7317cbf9145 h1:z59tA3IDYPt71nzH1jpxeaA1LuDw8aZfpTQFNU43Zb8=
--- a/pkg/huggingface-api/client.go
+++ b/pkg/huggingface-api/client.go
@@ -2,6 +2,7 @@ package hfapi

 import (
 	"encoding/json"
+	"errors"
 	"fmt"
 	"io"
 	"net/http"
@@ -10,6 +11,7 @@ import (
 	"sort"
 	"strconv"
 	"strings"
+	"time"

 	"github.com/mudler/LocalAI/pkg/httpclient"
 )
@@ -88,57 +90,128 @@ type SearchParams struct {

 // Client represents a Hugging Face API client
 type Client struct {
-	baseURL string
-	client  *http.Client
+	baseURL      string
+	client       *http.Client
+	maxRetries   int
+	retryBackoff time.Duration
+	maxBackoff   time.Duration
+	sleepFn      func(time.Duration)
 }

+var ErrRateLimited = errors.New("huggingface API rate limited")
+
 // NewClient creates a new Hugging Face API client
 func NewClient() *Client {
 	return &Client{
-		baseURL: "https://huggingface.co/api/models",
-		client:  httpclient.New(httpclient.WithFollowRedirects()),
+		baseURL:      "https://huggingface.co/api/models",
+		client:       httpclient.New(httpclient.WithFollowRedirects()),
+		maxRetries:   5,
+		retryBackoff: 1 * time.Second,
+		maxBackoff:   30 * time.Second,
+		sleepFn:      time.Sleep,
 	}
 }

 // SearchModels searches for models using the Hugging Face API
 func (c *Client) SearchModels(params SearchParams) ([]Model, error) {
-	req, err := http.NewRequest("GET", c.baseURL, nil)
-	if err != nil {
-		return nil, fmt.Errorf("failed to create request: %w", err)
+	for attempt := 1; attempt <= c.maxRetries; attempt++ {
+		req, err := http.NewRequest("GET", c.baseURL, nil)
+		if err != nil {
+			return nil, fmt.Errorf("failed to create request: %w", err)
+		}
+
+		// Add query parameters
+		q := req.URL.Query()
+		q.Add("sort", params.Sort)
+		q.Add("direction", fmt.Sprintf("%d", params.Direction))
+		q.Add("limit", fmt.Sprintf("%d", params.Limit))
+		q.Add("search", params.Search)
+		req.URL.RawQuery = q.Encode()
+
+		resp, err := c.client.Do(req)
+		if err != nil {
+			if attempt < c.maxRetries {
+				c.sleepFn(c.exponentialBackoff(attempt))
+				continue
+			}
+			return nil, fmt.Errorf("failed to make request: %w", err)
+		}
+
+		if resp.StatusCode != http.StatusOK {
+			if err := resp.Body.Close(); err != nil {
+				return nil, fmt.Errorf("failed to close response body: %w", err)
+			}
+			if c.isRetryableStatus(resp.StatusCode) && attempt < c.maxRetries {
+				c.sleepFn(c.retryDelay(resp, attempt))
+				continue
+			}
+			if resp.StatusCode == http.StatusTooManyRequests {
+				return nil, fmt.Errorf("%w: failed to fetch models. Status code: %d", ErrRateLimited, resp.StatusCode)
+			}
+			return nil, fmt.Errorf("failed to fetch models. Status code: %d", resp.StatusCode)
+		}
+
+		// Read the response body
+		body, err := io.ReadAll(resp.Body)
+		closeErr := resp.Body.Close()
+		if err != nil {
+			return nil, fmt.Errorf("failed to read response body: %w", err)
+		}
+		if closeErr != nil {
+			return nil, fmt.Errorf("failed to close response body: %w", closeErr)
+		}
+
+		// Parse the JSON response
+		var models []Model
+		if err := json.Unmarshal(body, &models); err != nil {
+			return nil, fmt.Errorf("failed to parse JSON response: %w", err)
+		}
+
+		return models, nil
 	}

-	// Add query parameters
-	q := req.URL.Query()
-	q.Add("sort", params.Sort)
-	q.Add("direction", fmt.Sprintf("%d", params.Direction))
-	q.Add("limit", fmt.Sprintf("%d", params.Limit))
-	q.Add("search", params.Search)
-	req.URL.RawQuery = q.Encode()
+	return nil, fmt.Errorf("%w: failed to fetch models. Status code: %d", ErrRateLimited, http.StatusTooManyRequests)
+}

-	// Make the HTTP request
-	resp, err := c.client.Do(req)
-	if err != nil {
-		return nil, fmt.Errorf("failed to make request: %w", err)
-	}
-	defer resp.Body.Close()
+func (c *Client) isRetryableStatus(code int) bool {
+	return code == http.StatusTooManyRequests || (code >= http.StatusInternalServerError && code <= http.StatusNetworkAuthenticationRequired)
+}

-	if resp.StatusCode != http.StatusOK {
-		return nil, fmt.Errorf("failed to fetch models. Status code: %d", resp.StatusCode)
+func (c *Client) retryDelay(resp *http.Response, attempt int) time.Duration {
+	if retryAfter := strings.TrimSpace(resp.Header.Get("Retry-After")); retryAfter != "" {
+		if seconds, err := strconv.Atoi(retryAfter); err == nil && seconds > 0 {
+			delay := time.Duration(seconds) * time.Second
+			if delay > c.maxBackoff {
+				return c.maxBackoff
+			}
+			return delay
+		}
+		if at, err := http.ParseTime(retryAfter); err == nil {
+			delay := time.Until(at)
+			if delay > 0 {
+				if delay > c.maxBackoff {
+					return c.maxBackoff
+				}
+				return delay
+			}
+		}
 	}

-	// Read the response body
-	body, err := io.ReadAll(resp.Body)
-	if err != nil {
-		return nil, fmt.Errorf("failed to read response body: %w", err)
-	}
+	return c.exponentialBackoff(attempt)
+}

-	// Parse the JSON response
-	var models []Model
-	if err := json.Unmarshal(body, &models); err != nil {
-		return nil, fmt.Errorf("failed to parse JSON response: %w", err)
+func (c *Client) exponentialBackoff(attempt int) time.Duration {
+	delay := c.retryBackoff
+	for i := 1; i < attempt; i++ {
+		delay *= 2
+		if delay >= c.maxBackoff {
+			return c.maxBackoff
+		}
 	}
-
-	return models, nil
+	if delay > c.maxBackoff {
+		return c.maxBackoff
+	}
+	return delay
 }

 // GetLatest fetches the latest GGUF models
--- a/pkg/huggingface-api/client_test.go
+++ b/pkg/huggingface-api/client_test.go
@@ -1,10 +1,12 @@
 package hfapi_test

 import (
+	"errors"
 	"fmt"
 	"net/http"
 	"net/http/httptest"
 	"strings"
+	"time"

 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
@@ -185,6 +187,87 @@ var _ = Describe("HuggingFace API Client", func() {
 			Expect(err.Error()).To(ContainSubstring("failed to parse JSON response"))
 			Expect(models).To(BeNil())
 		})
+
+		It("should retry 429 responses and honor Retry-After", func() {
+			attempts := 0
+			server = httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+				attempts++
+				if attempts == 1 {
+					w.Header().Set("Retry-After", "1")
+					w.WriteHeader(http.StatusTooManyRequests)
+					return
+				}
+				w.Header().Set("Content-Type", "application/json")
+				w.WriteHeader(http.StatusOK)
+				_, err := w.Write([]byte("[]"))
+				Expect(err).ToNot(HaveOccurred())
+			}))
+			client.SetBaseURL(server.URL)
+
+			params := hfapi.SearchParams{
+				Sort:      "lastModified",
+				Direction: -1,
+				Limit:     30,
+				Search:    "GGUF",
+			}
+
+			start := time.Now()
+			models, err := client.SearchModels(params)
+			elapsed := time.Since(start)
+
+			Expect(err).ToNot(HaveOccurred())
+			Expect(models).To(HaveLen(0))
+			Expect(attempts).To(Equal(2))
+			Expect(elapsed).To(BeNumerically(">=", 900*time.Millisecond))
+		})
+
+		It("should fail fast on non-retryable 4xx responses", func() {
+			attempts := 0
+			server = httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+				attempts++
+				w.WriteHeader(http.StatusBadRequest)
+			}))
+			client.SetBaseURL(server.URL)
+
+			params := hfapi.SearchParams{
+				Sort:      "lastModified",
+				Direction: -1,
+				Limit:     30,
+				Search:    "GGUF",
+			}
+
+			start := time.Now()
+			models, err := client.SearchModels(params)
+			elapsed := time.Since(start)
+
+			Expect(err).To(HaveOccurred())
+			Expect(err.Error()).To(ContainSubstring("Status code: 400"))
+			Expect(models).To(BeNil())
+			Expect(attempts).To(Equal(1))
+			Expect(elapsed).To(BeNumerically("<", 500*time.Millisecond))
+		})
+
+		It("should return ErrRateLimited when 429 persists after retries", func() {
+			server = httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+				w.Header().Set("Retry-After", "1")
+				w.WriteHeader(http.StatusTooManyRequests)
+			}))
+			client.SetBaseURL(server.URL)
+
+			params := hfapi.SearchParams{
+				Sort:      "trendingScore",
+				Direction: -1,
+				Limit:     15,
+				Search:    "GGUF",
+			}
+
+			models, err := client.SearchModels(params)
+
+			Expect(err).To(HaveOccurred())
+			Expect(errors.Is(err, hfapi.ErrRateLimited)).To(BeTrue())
+			Expect(err.Error()).To(ContainSubstring("Status code: 429"))
+			Expect(models).To(BeNil())
+		})
 	})

 	Context("when getting latest GGUF models", func() {
Author	SHA1	Message	Date
Ettore Di Giacinto	a21c79c953	test(parakeet-cpp): update model-gated specs for multi-segment output The offline AudioTranscription specs asserted the old single synthetic segment (Segments HaveLen(1), Segments[0].Text == res.Text). With NeMo-faithful segmentation a multi-sentence clip now yields multiple punctuation-delimited segments, so assert the new contract instead: one-or-more time-ordered segments, each with text and (under word granularity) per-segment words whose span tracks the segment start/end. Caught by running the model-gated suite on the dgx (GB10) against the real tdt_ctc-110m + realtime_eou models. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-07 10:53:30 +00:00
Ettore Di Giacinto	dd04a9b80e	docs(audio): document parakeet-cpp segment timestamps + segment_gap_threshold Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-07 08:47:12 +00:00
Ettore Di Giacinto	071872bb53	feat(parakeet-cpp): real segment timestamps (NeMo-faithful) Offline: replace the single synthetic whole-clip segment with multiple segments grouped exactly like NeMo's get_segment_offsets - a new segment after sentence-ending punctuation ('. ? !'), each carrying start/end and its time-window token ids. The optional model option segment_gap_threshold (NeMo's unit: encoder FRAMES, default 0=off) adds NeMo's silence-gap split, converted to seconds via the JSON frame_sec the engine now reports. Per-segment words are still gated behind timestamp_granularities=["word"]; a zero-word document falls back to a single text segment. Streaming: when libparakeet.so exposes the ABI v4 JSON entry points (probed), drive parakeet_capi_stream_feed_json / _finalize_json and accumulate the streamed per-word timestamps into per-utterance segments (EOU stays the boundary), so streaming FinalResult segments now carry start/end. Falls back to the text-only feed against an older library. Pure-Go specs cover splitWordsIntoSegments (punctuation + gap rules, NeMo elif order, fallback), transcriptResultFromDoc (multi-segment, token windows, word-granularity gate), and the streaming segmenter. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-07 08:47:12 +00:00
LocalAI [bot]	8c42695ef8	chore: ⬆️ Update ggml-org/whisper.cpp to `a8ec021f2750a473ff4a8f3883bc9fdf5feafa84` (#10202 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-07 08:37:42 +02:00
LocalAI [bot]	72e3241431	chore: ⬆️ Update mudler/parakeet.cpp to `abd0087dcc92ec5ad1f96f9fd86c49eb26a5ce67` (#10204 ) ⬆️ Update mudler/parakeet.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-07 00:37:28 +02:00
LocalAI [bot]	cd2bf95862	fix(docs): use relearn notice shortcode instead of unsupported alert (#10206 ) The Hugo relearn theme does not provide an "alert" shortcode, so the docs deploy failed at the Build site step: failed to extract shortcode: template for shortcode "alert" not found docs/content/features/distributed-mode.md:136 Convert the warning block to the theme-supported notice shortcode used everywhere else in the docs. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-07 00:37:12 +02:00
LocalAI [bot]	f64b72dd7d	feat: support Ideogram4 in stablediffusion-ggml backend + gallery (#10201) * feat(stablediffusion-ggml): support Ideogram4 unconditional diffusion model Bump stable-diffusion.cpp from 1f9ee88 to b9254dd, the upstream commit that adds Ideogram4 support (leejet/stable-diffusion.cpp#1609). Ideogram4 derives its classifier-free guidance from a separate unconditional diffusion model, exposed upstream through the new sd_ctx_params_t.uncond_diffusion_model_path field. Wire that field into the gosd wrapper via a new uncond_diffusion_model_path option. The _path suffix is deliberate: the Go loader only resolves options whose name contains "path" to an absolute path under the model directory, so this keeps the option consistent with diffusion_model_path and high_noise_diffusion_model_path. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * feat(gallery): add Ideogram4 stablediffusion-ggml models Single-file GGUF weights for Ideogram4 are now published (stduhpf/ideogram-4-gguf), so add the model to the gallery. Ideogram4 is a text-to-image model with strong, accurate in-image text rendering, driven by a Qwen3-VL-8B text encoder and real classifier-free guidance from a separate unconditional diffusion model (the uncond_diffusion_model_path support added in the preceding commit). Two index entries, both built on gallery/virtual.yaml with the full config inlined in overrides (same pattern as the other models, no dedicated template file): - ideogram-4-iq4nl-ggml (4-bit, ~11.6GB diffusion) - ideogram-4-q8_0-ggml (8-bit, ~20GB diffusion) Each bundles the diffusion + unconditional GGUF (stduhpf), the Qwen3-VL-8B-Instruct text encoder (unsloth), and the FLUX.2 VAE (Comfy-Org mirror, non-gated). cfg_scale is 7 to match the upstream Ideogram4 default, since it performs real CFG unlike the guidance-distilled Flux/Z-Image models. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-06 22:50:12 +02:00
LocalAI [bot]	03c84cff28	feat(parakeet-cpp): nemotron-3.5-asr multilingual streaming model + request language support (#10199) * feat(parakeet-cpp): honor request language (multilingual nemotron) on batched + streaming paths Reads opts.GetLanguage() and threads it through to the new parakeet_capi_transcribe_pcm_batch_json_lang and parakeet_capi_stream_begin_lang C-API entry points, both probed with Dlsym so the backend still loads against an older libparakeet.so (falling back to the non-lang paths, i.e. model default). parakeet.cpp's batched C-API takes a single target_lang for the whole batch, so the dispatcher only coalesces same-language requests: a request whose language differs from the batch leader is held as a single carry-over and becomes the leader of the next batch, never dropped and never left waiting (including on shutdown). A new batcher test asserts no dispatched batch is ever mixed-language and that every submitted request still receives a reply. Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(gallery): add parakeet-cpp-nemotron-3.5-asr-streaming-0.6b; bump parakeet.cpp pin Adds the multilingual prompt-conditioned streaming model to the gallery (q8_0 default, OpenMDW-1.1) and bumps the parakeet-cpp backend pin to the parakeet.cpp commit that ships nemotron support plus batched causal subsampling and the batched target_lang C-API. Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-06 13:53:10 +02:00
LocalAI [bot]	9bc69c9e5f	chore(model gallery): 🤖 add 1 new models via gallery agent (#10200) chore(model gallery): 🤖 add new models via gallery agent Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-06 13:52:46 +02:00
LocalAI [bot]	1e6c9cfd60	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `6b9de3dbaa21ae95ea80638e5ee836795cc48c93` (#10190) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-06 09:42:43 +02:00
LocalAI [bot]	0e6712f734	chore: ⬆️ Update mudler/parakeet.cpp to `843600590f96a31467a5199f827c253f34c110f7` (#10198) chore(parakeet-cpp): bump pin to banded long-audio attention (843600590) Update PARAKEET_VERSION to mudler/parakeet.cpp@843600590f (merge of parakeet.cpp#9). Brings NeMo rel_pos_local_attn banded/Longformer attention with the chunk-matmul construction: long audio now uses O(T*window) attention instead of global O(T^2), fixing the encoder OOM on long clips (~16.6-min clip: 54GB->9.4GB peak, ~4x faster) at NeMo's full [128,128] window. Short clips are unchanged (global path). No C-ABI change. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-06 09:25:25 +02:00
LocalAI [bot]	0e4cee9a97	chore: bump LocalAGI + localrecall (fix pgvector hybrid search seqscan, #10186) (#10192) chore: bump LocalAGI and localrecall (index-backed RRF hybrid search) Bumps the agent stack to pull in the PostgreSQL hybrid-search fix: - mudler/localrecall -> v0.6.3-...-9a3b3321a9cd (mudler/LocalRecall#46, merged) - mudler/LocalAGI -> ...-14aed1ae4336 (mudler/LocalAGI#477, merged) localrecall's hybrid search previously sorted on a wrapped scalar similarity expression, which blinded the planner into a full sequential scan over every row and exceeded the statement timeout on large collections, returning an empty result set. It now uses the canonical Reciprocal Rank Fusion pattern (index-backed candidate retrieval + FULL OUTER JOIN + weighted RRF). Fixes #10186 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-06 09:16:59 +02:00
Copilot	352b7ec604	Harden gallery-agent Hugging Face fetches against transient rate limiting (#10187) * Initial plan * fix: retry HuggingFace trending fetch on transient rate limits * fix: handle body close/write errors in huggingface retry paths --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>	2026-06-05 23:43:06 +02:00
LocalAI [bot]	ba706422fb	chore: ⬆️ Update vllm-project/vllm cu130 wheel to `0.22.1` (#10188) ⬆️ Update vllm-project/vllm cu130 wheel Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-05 23:42:50 +02:00