mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-01 11:56:57 -04:00
* feat(realtime): EOU-driven semantic_vad turn detection

Add a `semantic_vad` turn-detection mode to the realtime API that feeds the transcription model live and decides "the user finished speaking" from the `<EOU>` end-of-utterance token rather than from silence alone. When EOU fires the turn commits immediately (~0.3s); otherwise it falls back to an eagerness-scaled silence threshold (low/med/high = 8/4/2s).

Plumbing, bottom to top:

- proto: `AudioTranscriptionLive` bidirectional RPC (config-first oneof, mono float PCM @16k, ready-ack / Unimplemented degrade signal) plus `TranscriptResult.eou` for the unary retranscribe gate.
- pkg/grpc: client/server/base/embed scaffolding for the bidi stream, modeled on AudioTransformStream; release stream conns on terminal Recv.
- parakeet-cpp: live transcription RPC with per-C-call engine locking (one live stream per turn, finalize+free at commit); bump parakeet.cpp to ABI v5: incremental StreamingMel (no more quadratic per-feed mel recompute that delayed EOU on long turns) and the <EOU>/<EOB> split; strip the literal <EOU>/<EOB> from offline text and set Eou.
- core/backend: LiveTranscriptionSession wrapper + pipeline `turn_detection:` config block (type/eagerness/retranscribe).
- realtime: semantic_vad integration: live input captions streamed as transcription deltas while the user speaks, EOU-immediate commit with eagerness fallback, optional retranscribe gate (batch re-decode must also end in <EOU> to confirm), clause synthesis off the LLM token callback, and per-turn live-transcription / model_load telemetry.
- UI: show the realtime pipeline components as a vertical list.

Docs and tests included; opt-in via the pipeline YAML or per-session `session.update`. Non-streaming STT backends degrade to silence-only.

Assisted-by: Claude Code:claude-opus-4-8 [Read] [Edit] [Write] [Bash]
Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(realtime): explicit formally-verified state machines + parakeet streaming driver

The realtime API had several implicit state machines whose state was inferred from scattered booleans, channels, and five separate mutexes, leaving illegal/inconsistent states reachable. Make them explicit and keep the implementation in step with a formal design; rework the parakeet streaming backend along the same lines.

Realtime state machines (M1-M5). Each is a sealed sum-type State/Event/Effect with a total, pure Next(state,event)->(state,[]effect) behind a single-writer Coordinator:

M1 conncoord connection lifecycle: VAD toggle + once-only teardown (replaces vadServerStarted + a `done` channel closed from two sites).
M2 turncoord turn detection: collapses speechStarted and the live-stream "turn open" flag into one state, so discardTurn can no longer desync them and suppress the next onset.
M3 respcoord response coordination: serializes the dual-writer start/cancel so at most one response is live; one response.done per response.create.
M4 compactcoord conversation compaction: single-flight (replaces the `compacting atomic.Bool` CAS).
M5 ttscoord TTS pipeline: open->closing->closed, idempotent wait(), rejects enqueue-after-close (was a silent drop).

The Coordinator/Sink/Next plumbing (only the sealed types and Next differed per machine) is extracted once into core/http/endpoints/openai/coordinator as a generic Coordinator[S,E,F]; each machine keeps its public API via type aliases, so no sink, call-site, or test moved.

Hierarchy. session_lifecycle.fizz models M1 as the parent region with its children (M2/M3/M4) as one statechart and asserts ChildrenDieWithParent (conn torn => all children terminal, none start after teardown). respcoord and compactcoord gain an absorbing Terminated state + Shutdown event; conncoord's teardown drives the children terminal. This closes a compaction teardown gap: a fire-and-forget compaction could outlive a torn session; compactionSink now takes a session-scoped cancellable context + WaitGroup and joins the in-flight summarize+evict on shutdown.

Formal verification. formal-verification/ holds one authoritative FizzBee spec per machine plus the composition spec, each with an always-assertion and a documented one-line edit that makes the checker fail (verified non-vacuous). scripts/realtime-conformance.sh is fail-closed: all Go conformance suites under -race AND a model-check of every .fizz spec; a missing FizzBee is a hard error (only the loud REALTIME_CONFORMANCE_SKIP_FIZZBEE=1 bypasses it, never in CI). FizzBee is pinned by sha256 and installed via scripts/install-fizzbee.sh into .tools/ (gitignored). Wired as make test-realtime-conformance, a CI workflow, and a pre-commit path filter. Go conformance tests are Ginkgo/Gomega (per the repo's forbidigo lint): transition tables + fixed-seed property walks + concurrent/-race specs, no rapid dependency. Design map: docs/design/realtime-state-machines.md.

Parakeet streaming backend. The same treatment applied to the parakeet-cpp streaming paths:

- AudioTranscriptionStream returns codes.Unimplemented for non-streaming models instead of decoding offline and emitting it as one delta + final. A client that asked for streaming learns the model cannot stream rather than receiving a batch result shaped like a stream. New grpcerrors.StreamTranscriptionUnsupported carries that signal; the HTTP /v1/audio/transcriptions stream path surfaces it as an SSE error event. Mirrors AudioTranscriptionLive, which already did this.
- utteranceBoundary (boundary.go): a single definition of the end-of-utterance latch, replacing three open-coded finalEou toggles. Modelled as a two-valued type so illegal states are unrepresentable.
- Shared decode driver (driver.go): streamFeedResult (one per-feed event) + feedChunk (hides the ABI v4 JSON vs text-only split) + feedSlices + flushTail. The feed loop is written once.
- AudioTranscriptionLive becomes a bidi adapter: it streams the per-feed {delta,eou,eob,words} the realtime turn detector consumes and a terminal FinalResult carrying only Text. Segments/duration/eou are offline-only and no longer produced (nor read) on the live path; liveTraceState drops the terminal eou and keeps the per-feed eou_events count.
- AudioTranscriptionStream + streamJSON merge into one driver-based function; streamSegmenter is generalized to the unified event with a text-only fallback that preserves the legacy (no-words) library's per-utterance segmentation.

Verified: build/vet/gofumpt clean, golangci-lint 0 issues, all coordinator and parakeet packages under -race, the fail-closed conformance gate green, and make test-realtime (12 e2e WS+WebRTC).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
155 lines
6.1 KiB
Go
package main

import (
	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func tw(text string, start, end float64) transcriptWord {
	return transcriptWord{W: text, Start: start, End: end}
}

var _ = Describe("splitWordsIntoSegments (NeMo get_segment_offsets parity)", func() {
	seps := []rune{'.', '?', '!'}

	It("splits on sentence-ending punctuation, including the delimiter word", func() {
		words := []transcriptWord{tw("hello", 0, 0.4), tw("world.", 0.4, 0.8), tw("bye", 1.0, 1.3)}
		segs := splitWordsIntoSegments(words, seps, 0)
		Expect(segs).To(HaveLen(2))
		Expect(segs[0]).To(HaveLen(2))
		Expect(segs[0][1].W).To(Equal("world."))
		Expect(segs[1]).To(HaveLen(1))
		Expect(segs[1][0].W).To(Equal("bye"))
	})

	It("keeps a single segment with no terminal punctuation and gap off", func() {
		words := []transcriptWord{tw("a", 0, 0.2), tw("b", 0.2, 0.4), tw("c", 5.0, 5.2)}
		segs := splitWordsIntoSegments(words, seps, 0)
		Expect(segs).To(HaveLen(1))
	})

	It("splits on the gap rule when enabled, the gapped word starting the next segment", func() {
		words := []transcriptWord{tw("a", 0, 0.2), tw("b", 0.2, 0.4), tw("c", 5.0, 5.2)}
		segs := splitWordsIntoSegments(words, seps, 1.0) // c is 4.6s after b
		Expect(segs).To(HaveLen(2))
		Expect(segs[0]).To(HaveLen(2)) // a b
		Expect(segs[1]).To(HaveLen(1)) // c
		Expect(segs[1][0].W).To(Equal("c"))
	})

	It("checks the gap rule before punctuation (NeMo elif order)", func() {
		// "b." would terminate, but c is far after it -> gap closes [a b.] at b.
		words := []transcriptWord{tw("a", 0, 0.2), tw("b.", 0.2, 0.4), tw("c", 9.0, 9.2)}
		segs := splitWordsIntoSegments(words, seps, 1.0)
		Expect(segs).To(HaveLen(2))
		Expect(segs[0]).To(HaveLen(2))
		Expect(segs[1][0].W).To(Equal("c"))
	})

	It("still splits on punctuation when the gap rule is enabled but does not fire", func() {
		words := []transcriptWord{tw("hi.", 0, 0.4), tw("bye", 0.4, 0.8)}
		segs := splitWordsIntoSegments(words, seps, 5.0) // gap never reached
		Expect(segs).To(HaveLen(2))
		Expect(segs[0][0].W).To(Equal("hi."))
	})

	It("returns nothing for empty input", func() {
		Expect(splitWordsIntoSegments(nil, seps, 0)).To(BeEmpty())
	})
})

var _ = Describe("transcriptResultFromDoc (multi-segment)", func() {
	doc := transcriptJSON{
		Text:     "hello world. bye now",
		FrameSec: 0.08,
		Words: []transcriptWord{
			{W: "hello", Start: 0.0, End: 0.4},
			{W: "world.", Start: 0.4, End: 0.8},
			{W: "bye", Start: 1.0, End: 1.3},
			{W: "now", Start: 1.3, End: 1.6},
		},
		Tokens: []transcriptToken{{ID: 1, T: 0.1}, {ID: 2, T: 0.5}, {ID: 3, T: 1.1}, {ID: 4, T: 1.4}},
	}

	It("emits one segment per punctuation-delimited group with start/end", func() {
		res := transcriptResultFromDoc(doc, &pb.TranscriptRequest{}, 0)
		Expect(res.Segments).To(HaveLen(2))
		Expect(res.Segments[0].Text).To(Equal("hello world."))
		Expect(res.Segments[0].Start).To(Equal(int64(0)))
		Expect(res.Segments[0].End).To(Equal(secondsToNanos(0.8)))
		Expect(res.Segments[1].Text).To(Equal("bye now"))
		Expect(res.Segments[1].Start).To(Equal(secondsToNanos(1.0)))
		Expect(res.Segments[1].Id).To(Equal(int32(1)))
	})

	It("assigns tokens to the segment whose time window contains them", func() {
		res := transcriptResultFromDoc(doc, &pb.TranscriptRequest{}, 0)
		Expect(res.Segments[0].Tokens).To(Equal([]int32{1, 2}))
		Expect(res.Segments[1].Tokens).To(Equal([]int32{3, 4}))
	})

	It("attaches per-segment words only when word granularity requested", func() {
		plain := transcriptResultFromDoc(doc, &pb.TranscriptRequest{}, 0)
		Expect(plain.Segments[0].Words).To(BeEmpty())
		withWords := transcriptResultFromDoc(doc, &pb.TranscriptRequest{TimestampGranularities: []string{"word"}}, 0)
		Expect(withWords.Segments[0].Words).To(HaveLen(2))
	})

	It("falls back to a single text segment when there are no words", func() {
		res := transcriptResultFromDoc(transcriptJSON{Text: "hi"}, &pb.TranscriptRequest{}, 0)
		Expect(res.Segments).To(HaveLen(1))
		Expect(res.Segments[0].Text).To(Equal("hi"))
	})
})

var _ = Describe("streaming segment assembly", func() {
	It("closes a segment with start/end from its words on EOU", func() {
		acc := &streamSegmenter{}
		acc.add(streamFeedResult{Delta: "hello world", Eou: true, Words: []transcriptWord{
			{W: "hello", Start: 0.0, End: 0.4}, {W: "world", Start: 0.4, End: 0.9},
		}})
		segs := acc.segments()
		Expect(segs).To(HaveLen(1))
		Expect(segs[0].Text).To(Equal("hello world"))
		Expect(segs[0].Start).To(Equal(int64(0)))
		Expect(segs[0].End).To(Equal(secondsToNanos(0.9)))
	})

	It("buffers words across feeds until EOU", func() {
		acc := &streamSegmenter{}
		acc.add(streamFeedResult{Delta: "hi", Words: []transcriptWord{{W: "hi", Start: 0, End: 0.3}}})
		Expect(acc.segments()).To(BeEmpty())
		acc.add(streamFeedResult{Delta: "there", Eou: true, Words: []transcriptWord{{W: "there", Start: 0.3, End: 0.7}}})
		Expect(acc.segments()).To(HaveLen(1))
		Expect(acc.segments()[0].Text).To(Equal("hi there"))
	})

	// ABI v5 split <EOB> (backchannel) out of the "eou" flag into its own "eob"
	// field; a backchannel must still close the segment as it did in v4.
	It("closes a segment on EOB (backchannel) too", func() {
		acc := &streamSegmenter{}
		acc.add(streamFeedResult{Delta: "uh huh", Eob: true, Words: []transcriptWord{
			{W: "uh", Start: 0.0, End: 0.2}, {W: "huh", Start: 0.2, End: 0.5},
		}})
		segs := acc.segments()
		Expect(segs).To(HaveLen(1))
		Expect(segs[0].Text).To(Equal("uh huh"))
		Expect(segs[0].End).To(Equal(secondsToNanos(0.5)))
	})

	// Older text-only libparakeet.so: no per-word timings, so a segment is cut
	// from the delta text on each <EOU>/<EOB> (no timestamps), one per utterance.
	It("falls back to text segments when the feed carries no words", func() {
		acc := &streamSegmenter{}
		acc.add(streamFeedResult{Delta: "first turn", Eou: true})
		acc.add(streamFeedResult{Delta: "second turn", Eou: true})
		segs := acc.segments()
		Expect(segs).To(HaveLen(2))
		Expect(segs[0].Text).To(Equal("first turn"))
		Expect(segs[1].Text).To(Equal("second turn"))
		Expect(segs[0].Start).To(Equal(int64(0)), "no per-word timing on the text path")
		Expect(segs[0].End).To(Equal(int64(0)))
	})
})
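For readers without the implementation at hand, the segmentation behaviour the first spec group pins down (the gap rule checked before punctuation, the delimiter word staying in the segment it closes, and a gap of 0 disabling the gap rule) might be satisfied by something like the standalone sketch below. It is not the repository's `splitWordsIntoSegments` and may differ from NeMo's `get_segment_offsets` in edge cases; the `word` type, `splitSegments`, and `endsWithSep` are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// word is a hypothetical stand-in for the per-word timing record used in the
// specs above (text plus start/end in seconds).
type word struct {
	Text       string
	Start, End float64
}

// endsWithSep reports whether the last rune of s is one of seps.
func endsWithSep(s string, seps []rune) bool {
	r, size := utf8.DecodeLastRuneInString(s)
	if size == 0 {
		return false // empty word never terminates a segment
	}
	for _, sep := range seps {
		if r == sep {
			return true
		}
	}
	return false
}

// splitSegments groups words into segments. For each word it first applies
// the gap rule (when gap > 0 and a segment is open, a pause of at least gap
// seconds closes that segment and the word starts the next one), then the
// punctuation rule (a word ending in a separator closes its own segment).
func splitSegments(words []word, seps []rune, gap float64) [][]word {
	var segs [][]word
	var cur []word
	for i, w := range words {
		if gap > 0 && len(cur) > 0 && w.Start-words[i-1].End >= gap {
			segs = append(segs, cur)
			cur = nil
		}
		cur = append(cur, w)
		if endsWithSep(w.Text, seps) {
			segs = append(segs, cur)
			cur = nil
		}
	}
	if len(cur) > 0 {
		segs = append(segs, cur) // trailing words without a terminator
	}
	return segs
}

func main() {
	seps := []rune{'.', '?', '!'}
	words := []word{
		{Text: "a", Start: 0, End: 0.2},
		{Text: "b", Start: 0.2, End: 0.4},
		{Text: "c", Start: 5.0, End: 5.2},
	}
	fmt.Println(len(splitSegments(words, seps, 0)))   // 1: gap rule off
	fmt.Println(len(splitSegments(words, seps, 1.0))) // 2: c starts a new segment
}
```

Setting `cur` to nil after closing a segment, rather than truncating it, means each stored segment owns its own backing array and later appends cannot overwrite it.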