* feat(realtime): EOU-driven semantic_vad turn detection Add a `semantic_vad` turn-detection mode to the realtime API that feeds the transcription model live and decides "the user finished speaking" from the `<EOU>` end-of-utterance token rather than from silence alone. When EOU fires the turn commits immediately (~0.3s); otherwise it falls back to an eagerness-scaled silence threshold (low/med/high = 8/4/2s). Plumbing, bottom to top: - proto: `AudioTranscriptionLive` bidirectional RPC (config-first oneof, mono float PCM @16k, ready-ack / Unimplemented degrade signal) plus `TranscriptResult.eou` for the unary retranscribe gate. - pkg/grpc: client/server/base/embed scaffolding for the bidi stream, modeled on AudioTransformStream; release stream conns on terminal Recv. - parakeet-cpp: live transcription RPC with per-C-call engine locking (one live stream per turn, finalize+free at commit); bump parakeet.cpp to ABI v5 — incremental StreamingMel (no more quadratic per-feed mel recompute that delayed EOU on long turns) and the <EOU>/<EOB> split; strip the literal <EOU>/<EOB> from offline text and set Eou. - core/backend: LiveTranscriptionSession wrapper + pipeline `turn_detection:` config block (type/eagerness/retranscribe). - realtime: semantic_vad integration — live input captions streamed as transcription deltas while the user speaks, EOU-immediate commit with eagerness fallback, optional retranscribe gate (batch re-decode must also end in <EOU> to confirm), clause synthesis off the LLM token callback, and per-turn live-transcription / model_load telemetry. - UI: show the realtime pipeline components as a vertical list. Docs and tests included; opt-in via the pipeline YAML or per-session `session.update`. Non-streaming STT backends degrade to silence-only. Assisted-by: Claude Code:claude-opus-4-8 [Read] [Edit] [Write] [Bash] Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(realtime): explicit formally-verified state machines + parakeet streaming driver The realtime API had several implicit state machines whose state was inferred from scattered booleans, channels, and five separate mutexes, leaving illegal/inconsistent states reachable. Make them explicit and keep the implementation in step with a formal design; rework the parakeet streaming backend along the same lines. Realtime state machines (M1-M5). Each is a sealed sum-type State/Event/Effect with a total, pure Next(state,event)->(state,[]effect) behind a single-writer Coordinator: M1 conncoord connection lifecycle: VAD toggle + once-only teardown (replaces vadServerStarted + a `done` channel closed from two sites). M2 turncoord turn detection: collapses speechStarted and the live-stream "turn open" flag into one state, so discardTurn can no longer desync them and suppress the next onset. M3 respcoord response coordination: serializes the dual-writer start/cancel so at most one response is live; one response.done per response.create. M4 compactcoord conversation compaction: single-flight (replaces the `compacting atomic.Bool` CAS). M5 ttscoord TTS pipeline: open->closing->closed, idempotent wait(), rejects enqueue-after-close (was a silent drop). The Coordinator/Sink/Next plumbing — only the sealed types and Next differed per machine — is extracted once into core/http/endpoints/openai/coordinator as a generic Coordinator[S,E,F]; each machine keeps its public API via type aliases, so no sink, call-site, or test moved. Hierarchy. session_lifecycle.fizz models M1 as the parent region with its children (M2/M3/M4) as one statechart and asserts ChildrenDieWithParent (conn torn => all children terminal, none start after teardown). respcoord and compactcoord gain an absorbing Terminated state + Shutdown event; conncoord's teardown drives the children terminal. This closes a compaction teardown gap: a fire-and-forget compaction could outlive a torn session — compactionSink now takes a session-scoped cancellable context + WaitGroup and joins the in-flight summarize+evict on shutdown. Formal verification. formal-verification/ holds one authoritative FizzBee spec per machine plus the composition spec, each with an always-assertion and a documented one-line edit that makes the checker fail (verified non-vacuous). scripts/realtime-conformance.sh is fail-closed: all Go conformance suites under -race AND a model-check of every .fizz spec; a missing FizzBee is a hard error (only the loud REALTIME_CONFORMANCE_SKIP_FIZZBEE=1 bypasses it, never in CI). FizzBee is pinned by sha256 and installed via scripts/install-fizzbee.sh into .tools/ (gitignored). Wired as make test-realtime-conformance, a CI workflow, and a pre-commit path filter. Go conformance tests are Ginkgo/Gomega (per the repo's forbidigo lint): transition tables + fixed-seed property walks + concurrent/-race specs, no rapid dependency. Design map: docs/design/realtime-state-machines.md. Parakeet streaming backend. The same treatment applied to the parakeet-cpp streaming paths: - AudioTranscriptionStream returns codes.Unimplemented for non-streaming models instead of decoding offline and emitting it as one delta + final. A client that asked for streaming learns the model cannot stream rather than receiving a batch result shaped like a stream. New grpcerrors.StreamTranscriptionUnsupported carries that signal; the HTTP /v1/audio/transcriptions stream path surfaces it as an SSE error event. Mirrors AudioTranscriptionLive, which already did this. - utteranceBoundary (boundary.go): a single definition of the end-of-utterance latch, replacing three open-coded finalEou toggles. Modelled as a two-valued type so illegal states are unrepresentable. - Shared decode driver (driver.go): streamFeedResult (one per-feed event) + feedChunk (hides the ABI v4 JSON vs text-only split) + feedSlices + flushTail. The feed loop is written once. - AudioTranscriptionLive becomes a bidi adapter: it streams the per-feed {delta,eou,eob,words} the realtime turn detector consumes and a terminal FinalResult carrying only Text. Segments/duration/eou are offline-only and no longer produced (nor read) on the live path; liveTraceState drops the terminal eou and keeps the per-feed eou_events count. - AudioTranscriptionStream + streamJSON merge into one driver-based function; streamSegmenter is generalized to the unified event with a text-only fallback that preserves the legacy (no-words) library's per-utterance segmentation. Verified: build/vet/gofumpt clean, golangci-lint 0 issues, all coordinator and parakeet packages under -race, the fail-closed conformance gate green, and make test-realtime (12 e2e WS+WebRTC). Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>
39 KiB
Realtime API state machines — map & re-architecture research
Status: research / design (compaction phase). No code changes implied yet.
The realtime API (core/http/endpoints/openai/realtime*.go) grew feature-by-feature
(server_vad → semantic_vad/EOU, streaming pipeline, tool turns, compaction, voice
gate, sound detection, WebRTC). The result is several implicit state machines
whose states and transitions are scattered across goroutine-local variables, shared
Session/Conversation fields under five different mutexes, raw channels, and
context cancellation. State is inferred from variable combinations rather than
stored; several illegal/inconsistent states are reachable.
This document (1) inventories the implicit machines, (2) catalogues the cross-cutting failure modes, (3) researches how to re-implement them explicitly and verifiably, and (4) lists the invariants a correct implementation must guarantee.
All line numbers are against the current feat/realtime-semantic-vad-eou branch and
will drift; treat them as anchors.
Part 1 — Inventory of the implicit state machines
There is no state/status field anywhere in Session or Conversation. Every
machine below is reconstructed from variable combinations.
M1. Connection / transport lifecycle
Two transports implement one Transport interface; their lifecycles differ sharply.
- WebSocket (
realtime_transport_ws.go): essentially stateless — a*websocket.Connplus a writesync.Mutex. No send queue, no send goroutine, no closed flag. "Closed" =ReadEventreturns an error. - WebRTC (
realtime_transport_webrtc.go): an explicit-ish machine built from raw channels —dcReady(closed bydcDone sync.OnceFunc),closed(closed bycloseDone sync.OnceFuncfrom eitherOnConnectionStateChangeorClose()),flushed,sessionCh(cap 1),inEvents/outEvents(cap 256), plus asendLoopgoroutine and RTP counters underrtpMu.
Conceptual states (connecting → data-channel-open → session-created → active → closing → closed) are not stored; the only persisted membership state is the
sessions[sessionID] map entry (exists realtime.go:631→:1009). session-created
and session-updated are events, not states.
Teardown order (realtime.go:989-1010): cancelActiveResponse → close(decodeDone)
→ close(done) (if VAD running) → close(soundWindowDone) → wg.Wait() →
delete(sessions,…). Then, WebRTC only, defer transport.Close() → closeDone() →
<-flushed → pc.Close().
M2. Audio-input / turn-detection (server_vad + semantic_vad + EOU)
One handleVAD goroutine (realtime.go:1322) on a 300 ms ticker. Mode is
re-evaluated every tick under sessionLock (:1350-1357) so it can flip mid-turn.
- server_vad states are encoded by the goroutine-local
speechStarted bool(:1337) plus silence measured (not timed) asaudioLength - segEndTime > silenceThresholdrecomputed each tick (:1461). States: idle → inspecting → speech-detected → awaiting-commit → committing → transcribing/responding. "Holdback" is a byte count (noSpeechHoldbackSec*rate*2), not a timer. - semantic_vad adds the
liveTurnStatestruct (realtime_semantic_vad.go):live(nil = closed),unavailable(sticky degrade → behaves as server_vad),eouAtSec,parts,itemID(allocated at turn open so captions can stream),deltasSent. Extra states: closed, open/streaming-ASR, EOU-pending, EOU-fallback (dynamic silence threshold 0 s when EOU pending, else eagerness 8/4/2 s), retranscribe-gate, EOU-rejected, finished, discarded. The one cross-goroutine edge: the backend recv callback pushes ontoevents(buffered 64, non-blocking — drops on overflow,:116-117);drainEventsreads it on the tick. - Voice gate (
realtime_voicegate.go) runs inside the commit goroutine: resolving → authorized/rejected, with a stickyvoiceVerified(undergateMu) forwhen:first.
M3. Response lifecycle (+ synchronous tool-turn recursion)
A response is "active" iff Session.activeResponseDone is non-nil and unclosed
(responseMu, :172). One goroutine owns it; its lifetime == that channel's. State
is observable only through the response.* event stream and ItemStatus* on the
assistant item. Logical states: idle → starting → generating-text →
generating-audio → tool-call-pending → tool-executing → awaiting-next-tool-turn →
cancelling → done(completed|cancelled) | failed.
- Cancellation is cooperative at discrete checkpoints (
ctx.Err()at:2172,2364,2394,realtime_stream.go:193,202,241,259). - The tool loop is synchronous recursion on the same goroutine, bounded by
maxAssistantToolTurns = 10; each level mints a freshresponseIDand emits a fullresponse.created … response.done{Completed}cycle — so one user turn can emit severalresponse.done{Completed}events under different IDs. - Terminal events are not exactly-once: failed paths
returnwith noresponse.done; cancelled paths emitdone{Cancelled}; the completed terminal is unconditional at the tail ofemitToolCallItems.
M4. Conversation / compaction
Conversation: Items + Memory (rolling summary) under Lock; compacting atomic.Bool. States: normal ↔ compacting. Compaction (realtime_compaction.go)
snapshots overflow under Lock, summarizes unlocked, re-locks and commits guarded
by an optimistic head-prefixMatches check. It is launched only by turn-0
triggerResponse (:1963), off the response path — so a long agentic turn
(recursion calls triggerResponseAtTurn directly) can append many tool items and
never compact until the next user turn (compaction starvation).
M5. Streaming sub-machines (transcription, chunker, TTS)
Backend LLM/TTS/transcription streams are synchronous callback recv loops on the caller's goroutine — no internal goroutines/channels. The only true concurrent FSM is:
- TTS pipeline (
realtime_tts_pipeline.go): one worker goroutine, an unbounded mutex-guardedqueue, a coalescedwakechan (cap 1), aclosedflag, adonechan closed once by the worker'sdefer, a lock-freefailed atomic.Bool, and worker-ownedaudio/firstErrthat are safe to read only afterwait()joins viadone. Idempotentwait(); deferredwait()backstop guarantees no worker leak. - Chunker (
realtime_chunker.go): a pure single-buffer FSM (buffering ↔ emitting,flush= hard boundary). No concurrency guard — correctness depends entirely onpush/flushbeing called from one goroutine (the LLM recv loop). On cancel the flush is skipped, so the buffered partial clause is intentionally dropped. - Transcription (
realtime_transcription.go): stateless straight-line function; "streaming" is just repeated synchronous callbacks.
Part 2 — Cross-cutting failure modes (why it's a mess)
-
Shared mutable
Sessionconfig with inconsistent locking (the core problem).updateSession/updateTransSessionmutateVoice,Instructions,Tools,OutputModalities,ModelConfig,ModelInterface, sample rates, and the sharedInputAudioTranscriptionpointer undersessionLock. But in-flight response/speech/transcription goroutines read those same fields without any lock (realtime_speech.go:72-79,realtime_stream.go:228, semantic_vad:110). ReloadingModelInterfacemid-response is a data race against a running Predict/TTS/Transcribe, and the swapped-out model is dropped without Close.sessionLockactually guards the globalsessionsmap; it only mutually excludes the handful of other sites that happen to also take it (handleVAD tick, the commit branch). Response goroutines never take it. -
Two writers of the active-response pair.
startResponse/cancelActiveResponseare called from both the main read loop (:836,973,981,990) and the VAD goroutine (barge-in:1429, end-of-speech:1543).responseMuguards only the field swap; the<-donewait is outside the lock. A read-loopResponseCreateracing a VADspeech_stoppedcan have both read the same prior pair, both overwrite, and briefly leave two live response goroutines both appending toconv.Items. The "never overlapping" guarantee holds only under the unstated assumption that responses are driven from a single goroutine — which is false. -
State is inferred, not stored. Whether a response is active, whether a turn is open, whether audio is being buffered — all are derived from combinations of booleans, nil-checks, channel state, and
contexterror. No single source of truth; no place to assert an invariant. -
Reachable inconsistent states. e.g. after a semantic-VAD
discardTurn,speechStartedstays true whileltsis closed, so they disagree and the next onset suppressesSpeechStarted. Mid-stream cancel leaves the client having seenoutput_item.added/content_part.addedwith no matching…done.events-channel overflow silently drops an EOU, degrading EOU-pending to the 2–8 s fallback. -
Lifecycle/ownership gaps.
decodeOpusLoopis a barego(not inwg) and can run afterdelete(sessions,…).handleIncomingAudioTrack(pionOnTrackgoroutine) has no shutdown signal — it appends toOpusFramesuntilReadRTPerrors, unjoined bywg. WebRTCoutEventsenqueued before the DC opens are lost on early failure. -
The
done-channel/vadServerStartedtoggle dance. A singledonelocal (:655) is reassigned to a fresh channel on each VAD start (:662) and closed at toggle-off (:670) and teardown (:999). Safe today only because one goroutine owns it — one variable name meaning different channels over time is a structural fragility, not an explicit lifecycle.
Part 3 — Research: explicit, verifiable re-implementation
The goal the user stated: transitions cannot lead to an inconsistent state, and we can verify that. Four layered techniques, from architecture down to runtime.
3.1 Architecture: single-writer session actor (share by communicating)
The root cause of (1) and (2) is shared mutable state across goroutines. The most effective, idiomatic-Go fix is to give each session one owning goroutine that holds all session state with no locks, and have every other goroutine communicate with it over channels:
┌────────── inbound events ──────────┐
transport ─┤ client events (ReadEvent) │
VAD ─┤ vad: speech_started/stopped, EOU ├─► session actor ──► outbound
model I/O ─┤ llm/tts/asr results, errors │ (owns ALL state, events
timers ─┤ ticks, deadlines │ single goroutine)
└────────────────────────────────────┘
- All state mutation happens in one place;
sessionLock,responseMu,gateMu,AudioBufferLock,OpusFramesLock,Conversation.Lockcollapse into "the actor owns it." Worker goroutines (Predict/TTS/ASR, opus decode, RTP read) become stateless effects that take an immutable snapshot in and send results back as events. ModelInterfacereload becomes an event the actor sequences relative to responses (e.g. drain/cancel the active response first), eliminating the mid-call swap race.- Cancellation stays
context-based but the actor is the only thing that starts/stops responses, killing the dual-writer race (2).
This is the actor / CSP model. It does not by itself prove correctness — that's what 3.2–3.4 add — but it makes the state centralized and explicit, which is the precondition for verification.
3.2 Make illegal states unrepresentable (type-level)
Inside the actor, model each machine as an explicit state with a pure transition
function next(state, event) (state, []effect, error):
- Represent states as a Go sealed sum type (interface with an unexported marker
method, one struct per state carrying only that state's data) so e.g.
EOU-pendingdata cannot be accessed whileClosed. This is the Go equivalent of an ADT and is the single biggest lever for "inconsistent state unrepresentable." - The transition function is total and pure (no I/O, no goroutines): it returns the next state plus a list of effects (send event, start Predict, arm timer) that the actor executes. Pure transition functions are trivially unit-testable and property-testable.
- An unexpected
(state, event)pair returns an explicit error / stays put and logs — never a silent half-transition.
The four machines are hierarchical (a statechart): Connection ⊃ Turn(M2) and Response(M3) ⊃ Tool-turn; Conversation(M4) and the TTS sub-machine(M5) are largely orthogonal regions. Model them as nested states rather than one flat enum.
Library options (all guard logic, none give concurrency safety — that's 3.1's job):
qmuntal/stateless— declarative, hierarchical, guard/entry/exit actions; closest fit.looplab/fsm— simpler, flat, event-callback based.- Hand-rolled transition tables — most control, no dep; recommended here given the
hierarchy and the desire to keep transitions auditable.
go.modcurrently pulls no FSM lib.
3.3 Design-time formal verification (prove the protocol)
Before/while coding, model the protocol (not the Go) in a model checker to prove the hard concurrency properties exhaustively:
- FizzBee (the adopted tool) to specify the actor's event/state space and check: no
two concurrent active responses; barge-in + ResponseCancel + speech_stopped
interleavings never deadlock or drop a turn; every
response.createdis eventually followed by exactly one terminal; teardown joins all goroutines. The cancel/startResponse/barge-in interplay (failure mode 2) is exactly the kind of liveness/safety property model checkers exist for. - Keep the spec small and focused on the M2↔M3 boundary (turn detection ↔ response), which is where the real races live.
3.4 Implementation-time & runtime verification
- Exhaustive table-driven transition tests: since transitions are a pure function,
enumerate
(state × event)and assert the result for every cell, including the illegal cells (assert they error / no-op). This is the practical stand-in for a proof that "no transition leads to inconsistent state." - Property-based testing: feed random event sequences into the actor and assert
global invariants hold after every step (Part 4). This catches reachable-bad-state
bugs the example tests miss. (Implemented as Ginkgo/Gomega seeded random-walk specs
— see Part 6.2 for why not
rapid.) - Race detector under load: run the property tests with
-race; with 3.1 there should be zero shared mutable state, so-racecleanliness becomes a meaningful signal rather than noise. - Runtime invariant assertions + structured transition logging: log every
state --event--> statewith the session ID; assert invariants in dev builds. Replace today's silent degradations (dropped EOU, suppressed SpeechStarted) with explicit, observable transitions.
3.5 Recommended path for LocalAI
- Specify the M2↔M3 protocol in FizzBee; nail the cancel/barge-in invariants.
- Introduce a per-session actor (3.1) that owns existing state behind the current
Transportinterface — incremental, keeps the event types. - Replace each implicit machine with an explicit sealed-state transition function (3.2), one at a time: Response first (highest-risk dual-writer), then Turn/VAD, then Connection, then leave TTS/Chunker/Compaction (already mostly self-contained) for last.
- Land the table-driven + property-based test suites alongside each machine; gate on
-race.
Part 4 — Invariants a correct implementation must guarantee
These are the "cannot reach inconsistent state" properties to encode as assertions, property-test oracles, and FizzBee invariants:
- At most one active response per session at any instant (no overlapping response
goroutines; no two appenders to
conv.Itemsfrom response logic). - Exactly one terminal per
response.created: every emittedresponse.createdis followed by exactly one ofresponse.done{completed|cancelled}or a defined failure terminal — never zero, never two. (Decide whether agentic tool turns are one response or many; make it explicit either way.) - No
response.*content events after that response's terminal. Nooutput_item.added/content_part.addedwithout a matching…done(even on cancel). - Turn/response coupling:
speechStarted⟺ a live turn is open; barge-in cancels the active response before a new turn's commit starts. - No config field is read by a worker while being mutated (reload is sequenced against in-flight work; a response uses an immutable snapshot of model/voice/tools).
- Audio buffer monotonic & consistent: commit/clear/append/VAD-drop never lose or
double-consume bytes;
clearresets all turn state (includinglts). - No dropped control events: an EOU/Final is never silently lost (no overflow-drop on a bounded channel that changes turn outcome).
- Clean teardown: every spawned goroutine (incl.
decodeOpusLoop,handleIncomingAudioTrack) is signalled and joined before the session is deleted; no sends after transport close. - Compaction safety & liveness: compaction never races a reader into a torn
Items; and it actually runs when the trigger is exceeded, including inside long agentic turns. - Idempotent close: every channel/resource closed exactly once on every path.
Implementation status
- M3 (response coordination) — first vertical slice landed. Explicit machine in
core/http/endpoints/openai/respcoord/(sealedState/Event/Effectsum types, a total pureNext, a single-writerCoordinator); transition-table + Ginkgo/Gomega seeded-property + concurrent conformance tests (green under-race); a deterministic characterization test pinning the legacy dual-writer race. Authoritative spec:formal-verification/response_lifecycle.fizz. Gate:scripts/realtime-conformance.sh(Go layer always; FizzBee when pinned) wired asmake test-realtime-conformanceand.github/workflows/realtime-conformance.yml. Seeformal-verification/README.md. - Gate is fail-closed and pinned (done).
fizzbee.sha256pins all four platforms; the gate hard-fails without FizzBee; CI installs+caches the verified binary with no skip; pre-commit runs the gate onrespcoord/**orformal-verification/**changes. - M3 wired into the live session (done).
realtime_respcoord.goaddsresponseSink(therespcoord.Coordinator+ a goroutine-spawning effect sink) toSession. The legacystartResponse/cancelActiveResponseand the dual-writeractiveResponse*/responseMufields are gone; all six call sites (manual commit,response.create, VAD speech-stopped,response.cancel, barge-in, teardown) route through it. Barge-in/cancel are now non-blocking (removes the legacy ~300 ms VAD stall); teardown stops input goroutines, then cancels +wait()s all response goroutines before deleting the session.EmitTerminalis a no-op for now (the response body still emits its ownresponse.done) — coordination is fixed without changing wire behavior. Verified: builds,go vetclean, all 300 openai specs pass under-race, andmake test-realtime(the mock-backend realtime e2e suite, 12 specs over WS + WebRTC) passes. - Single authoritative terminal + populated Output/Usage (done). One
response.createdand oneresponse.doneperresponse.create, even across the server-side agentic tool loop (which is now internal turns of one response, not one terminal each). AliveResponseaccumulator threads throughtriggerResponse→triggerResponseAtTurn→emitToolCallItems/streamLLMResponse, collecting output items as they complete and summing token usage;triggerResponseemits the one terminal (completed/cancelled; failed still emits none, matching legacy) withOutput+Usagefilled in (both were always empty before). Verified: 301 openai specs under-race(incl. a newtriggerResponseterminal test) +make test-realtime. Design note: emission is hoisted totriggerResponse(the body owns it) rather than the coordinator'sEmitTerminaleffect — at cancel/supersede time the coordinator doesn't yet have the body's partial Output, so the body, which does, is the natural emitter. The coordinator still guarantees one body run perresponse.create, so "exactly one terminal" holds transitively;EmitTerminalremains the spec's logical marker (no-op in the sink). - M2 (turn detection) — model + spec landed AND wired into the live session.
Explicit machine in
core/http/endpoints/openai/turncoord/(sealedState=Idle | Speaking{Turn},Event=Onset | Silence | Abort{Reason},Effect=BargeIn | OpenTurn | EmitSpeechStarted | EmitSpeechStopped | CommitTurn | DiscardTurn, a total pureNext, a single-writerCoordinator); transition-table + Ginkgo/Gomega seeded-property + concurrent conformance tests (green under-race). The fix it encodes: "speech detected" and "a turn is open" — the two legacy variables (speechStartedandlts.open()) that adiscardTurncould desync (failure mode 4) — become ONE state, so the next-onset suppression bug is unrepresentable. Authoritative spec:formal-verification/turn_lifecycle.fizz, with analways assertion Coupled(speech ⟺ turn-open), verified non-vacuous (deletingself.speech = 0inAbortmakes the checker reportCoupledviolated). The gate (scripts/realtime-conformance.sh, pre-commit, CI) coversturncoordand the spec. Wired (done):realtime_turncoord.goaddsturnSink(theturncoord.Coordinator+ a loop-local effect sink) tohandleVAD. The legacyspeechStartedbool is gone; onset/no-speech-clear/commit/teardown route throughcoord.Apply(Onset|Abort{NoSpeech}|Silence|Abort{Teardown}). The turn id is minted at onset and carried by the coordinator to the committed event (so it matches the live captions);liveTurnState.openTurnnow takes that id instead of minting its own. A semantic→server mode switch mid-turn is deliberately NOT an abort (it only drops the orphaned live stream and lets the turn continue under server_vad), so it stays inline. Verified: builds,go vet/gofmt/golangci-lint clean, all openai specs under-race, andmake test-realtime(12 e2e specs over WS + WebRTC) pass. - M1 (connection lifecycle) — model + spec landed AND wired. Explicit machine
in
core/http/endpoints/openai/conncoord/(sealedState=Live{VADRunning} | Torn,Event=SetVAD | Close,Effect=StartVAD | StopVAD | Teardown, a total pureNext, a single-writerCoordinator); transition-table + Ginkgo/Gomega seeded-property + concurrent conformance tests (green under-race). It replaces the legacyvadServerStartedbool + thedonechannel reassigned on every turn-detection toggle and closed from two sites (failure mode 6): the coordinator owns whether the VAD goroutine runs, so its done channel is closed exactly once and never resurrected after teardown;Closemoves toTorn, which absorbs every later event so teardown runs exactly once even from multiple exit paths (invariants #8, #10). Spec:formal-verification/conn_lifecycle.fizz(always assertion TeardownOnce+NoRunAfterTorn), verified non-vacuous (deletingself.torn = 1inClosefailsTeardownOnce). Wired (done):realtime_conncoord.goaddsconnSink; the handler's setup/toggleVAD/teardown now route throughconn.setVAD(...)/conn.close(); thedone/vadServerStartedlocals and the manual ordered-teardown block are gone (the Teardown effect performs that sequence). Verified: builds, vet/gofmt/golangci-lint clean, openai specs under-race,make test-realtime(12 e2e WS+WebRTC), full conformance gate green (3 Go packages + 3 fizz specs PASSED). - M4 (conversation compaction) — model + spec landed AND wired. Explicit
machine in
core/http/endpoints/openai/compactcoord/(sealedState=Idle | Running,Event=Trigger | Finished,Effect=StartCompaction, a total pureNext, a single-writerCoordinator); transition-table + Ginkgo/Gomega seeded-property + concurrent (effect-spawns-work-reports-Finished) conformance tests (green under-race). It makes the legacycompacting atomic.Boolsingle-flight guard explicit: aTriggerwhileRunningis dropped (not superseded — compaction is idempotent work on the same overflow), so at most one summarize+evict runs per conversation (invariant #9). Spec:formal-verification/compaction.fizz(always assertion SingleFlight), verified non-vacuous (deleting theif self.active == 0guard failsSingleFlight). Wired (done):realtime_compactcoord.goaddscompactionSink; theConversation.compacting atomic.Boolis replaced byConversation.compaction *compactionSink(built at conversation creation with the summarize+evict run closure);maybeCompactnow callsconv.compaction.trigger(). The summarizer resolution +compact()stay in the sink's spawned goroutine (off the response path);compact()itself (snapshot/summarize-unlocked/optimistic-commit) is unchanged. Verified: builds, vet/gofmt/golangci-lint clean, openai specs under-race,make test-realtime(12 e2e), full conformance gate green (4 Go packages + 4 fizz specs PASSED). - M5 (TTS pipeline lifecycle) — model + spec landed AND wired. Explicit
machine in
core/http/endpoints/openai/ttscoord/(sealedState=Open | Closing | Closed,Event=Close | WorkerExited,Effect=Wake, a total pureNext, a single-writerCoordinator); transition-table + Ginkgo/Gomega seeded-property + two-writer conformance tests (green under-race). It is a genuine two-writer machine (producerClosefromwait()vs workerWorkerExited); it makes the legacyclosed boollifecycle explicit and monotonic, fixes the latent enqueue-after-close silent drop (enqueue is now gated onOpen), and guarantees idempotentwait()(one wake / one worker join). The poisonfailedlatch stays a lock-freeatomic.Bool(orthogonal, read per clause on the worker's hot path). Spec:formal-verification/tts_pipeline.fizz(always assertion WakeOnce+Monotonic), verified non-vacuous (deleting theif self.phase == 0guard inClosefailsWakeOnce). Wired (done):realtime_tts_pipeline.go'sttsPipelineembeds the coordinator (and is its effect sink —Wake→signal());closed boolis gone; the worker checksclosing()and raisesWorkerExitedon drain,enqueuerejects once notOpen,wait()raisesClose. The wake/done channel mechanics are unchanged. Verified: builds, vet/gofmt/golangci-lint clean, openai specs under-race,make test-realtime(12 e2e), full conformance gate green (5 Go packages + 5 fizz specs PASSED). - All five mapped machines (M1–M5) are now explicit, wired, and verified. The
realtime-conformance gate model-checks all
.fizzspecs and runs all five Go conformance suites under-race, fail-closed. - The machines form a hierarchy, and that relationship is now modeled and
enforced. M1 (connection) is the parent region; when it tears down, every child
must be terminal. Previously this was only an imperative side effect of
conncoord's teardown ordering, with a real gap (M4 compaction was fire-and-forget and could outlive the torn session). Now:formal-verification/session_lifecycle.fizzis a composition spec that models conn + its direct children (vad/M2, resp/M3, compaction/M4) as one statechart and assertsChildrenDieWithParent(conn torn ⟹ all children terminal) plus "no child starts after teardown". Its non-vacuity reproduces the exact M4 gap (drop the compaction-terminate line → assertion fails).respcoord(M3) andcompactcoord(M4) gained an absorbingTerminatedstate + aShutdownevent, so a response/compaction cannot start after teardown (structural "no resurrection").conncoord'sTeardowneffect now explicitly drives the children terminal: stop+join the VAD goroutine (M2),respSink.shutdown()(M3 → Terminated, joins response goroutines and their M5 pipelines), andcompaction.shutdown()for every conversation (M4: cancel the in-flight summary via a session-scoped context, then join — closing the gap).compactnow takes acontextso teardown can bound the join. M2's terminal is realized by the goroutine join and M5's by its existingClosed; the persistent coordinators (M3/M4) carry the explicitTerminatedstate.
Part 5 — Library vs hand-rolled (Go ecosystem, verified 2026-06)
Researched against live GitHub/pkg.go.dev data. Verdict: hand-roll a typed transition
table over sealed sum-type states for the per-connection machines. No Go library gives
the two properties we most want — compile-time-illegal states and a pure
next(state,event)->(state,[]effect,error); every library models states as
string/int/any and fires side-effecting callbacks mid-transition. And since the
actor (Part 3.1) drives everything from one goroutine, the libraries' main value-add —
internal locking — is dead weight.
Library landscape:
| Option | Stars / status | Hierarchy | Typed states | Illegal-transition | Viz | Fit |
|---|---|---|---|---|---|---|
| hand-rolled table + sealed sum types | — | DIY (parent field / nested switch) | yes (sealed iface) | explicit default: |
~30 LOC Mermaid emitter | best |
| qmuntal/stateless (port of .NET Stateless) | 1.36k, v1.8.0 2026-02, maintained | yes (substates, guards, entry/exit, internal/ignored) | any |
error + OnUnhandledTrigger + PermittedTriggers |
DOT | best library fallback if hierarchy grows |
| looplab/fsm | 3.4k, v1.0.3 2025-05, maintained | flat | strings | typed errors | DOT+Mermaid | only for flat machines wanting free diagrams |
| cocoonspace/fsm | 89, dormant 2021 | flat | int | bool no-op |
— | lock-free but dead; DIY beats it |
| true Harel statecharts (gstate, statechartx) | ≤10, <1yr, single-author | parallel+history | varies | varies | varies | only if we truly need parallel regions; unproven |
| Temporal / Cadence | large, maintained | n/a | n/a | n/a | n/a | overkill — external cluster+DB, durable replay, wrong latency class |
Decision: hand-roll; keep qmuntal/stateless as the fallback if one machine grows deep
hierarchy/guards faster than we want to hand-maintain (its error-on-illegal-trigger and
PermittedTriggers() are the most useful library features for our "reject illegal
transitions" requirement, at the cost of any-typed states). Add a tiny Mermaid emitter
over the hand-rolled table so we keep the visualization the libraries advertise.
Part 6 — Formal design tied to code, and making it authoritative
The user requirement: the formal design is authoritative — a coding agent should be unable to silently change implementation behavior without it being caught against the spec; the default path is "update the spec and re-verify," not "edit the code and ignore the spec." This is a conformance + enforcement problem, in three layers.
6.1 The source of truth & design-time check
Write the concurrency-critical core — the M2↔M3 boundary (turn detection ↔ response: barge-in, ResponseCancel, speech_stopped, the dual-writer race) — as a FizzBee spec and model-check it in CI. Keep the spec small and focused on M2↔M3; that is where the real safety/liveness properties (Part 4 invariants 1–4) live. (FizzBee is the adopted model checker — see Part 6.4.)
6.2 The conformance bridge (code ↔ spec)
The honest finding: design-time model checking is well-supported; the Go conformance bridge is thin everywhere and needs per-spec glue. Two layers, adopted together:
- FizzBee MBT — the authoritative layer. The
.fizzspec is model-checked, andfizz mbt-scaffold --lang gogenerates Go interfaces + ago testharness; you implement adapters mapping model actions→code andStateGetter→state. Conformance runs as plaingo test— the cleanest CI fit. Risk: pre-1.0, essentially one maintainer (pin a version + sha256, vendor examples). - Ginkgo/Gomega seeded property tests — the Go-native floor. A small Go model
(the test's
open/registeredshadow) is the oracle; a fixed-seed random walk drives random event sequences against theCoordinator, asserting the Part-4 invariants after each step / per seed. It checks the implementation against a Go oracle — it complements, but does not replace, the FizzBee check of the design. (We originally speccedpgregory.net/rapidhere for its(*T).Repeatdriver and automatic shrinking, but LocalAI mandates Ginkgo/Gomega for all tests — itsforbidigolint forbids stdlibtestingassertions — andrapid.Checkneeds a concrete*testing.T/*rapid.Tthat cannot run inside a GinkgoIt. Rather than weaken the lint gate with an exclusion, the property layer is hand-rolled seeded walks: fixed seeds make every failure reproducible, at the cost ofrapid's automatic shrinking.rapidis consequently not a direct dependency.)
These compose: model-check the design (6.1) for "the design is right"; conformance-test
the code (6.2) for "the code matches the design." Add go test -race (with -cpu=1,2,4,
repeated runs) over the stateful tests for interleaving-bug discovery, and Go native
fuzzing over the same harness for coverage-guided sequence exploration + a committable
regression corpus. (testing/quick is frozen — do not use.)
There is no viable single-source-of-truth codegen (one spec compiled into both the runtime Go and the model) for retrofitting existing Go — the candidates are research-grade and greenfield-only. Our practical substitute is the CI gate below plus a single Go transition table that emits both the diagram and the test action set.
6.3 Enforcement — making the design un-ignorable for agents
Structural enforcement, leveraging this repo's existing non-bypassable gate culture
(pre-commit + monotonic ratchets; --no-verify is forbidden, baselines never lowered):
- Add a
realtime-conformancegate to the pre-commit/CI pipeline that runs (a) the model check (6.1) and (b) the conformance bridge (6.2). A behavior change that does not conform turns the gate red; the only green paths are make the code conform or update the spec — and updating the spec re-triggers the model check, so an illegal design is rejected too. This is the actual mechanism that makes "update the design and verify" the default rather than optional. - Treat the spec as a ratchet artifact like coverage: the gate must not be weakened, the spec not deleted, the build tag not silently disabled.
- Write an
.agents/realtime-state-machines.mdguide (indexed fromCLAUDE.md) stating the spec is the source of truth: change the spec first, re-run the gate, then implement. The doc is secondary; the gate is what enforces it.
6.4 Decided stack
- Implementation: hand-rolled sealed-state transition functions + single-writer actor (Parts 3.1–3.2).
- Design-time + conformance: FizzBee (decided).
.fizzspec is model-checked, andfizz's Go MBT generator (mbt/generator/templates/go→ interfaces/adapters/test; driven via a gRPC plugin inmbt/lib/go) produces ago testconformance harness whose adapters map model actions → our actor andStateGetter→ our state. Go is a first-class MBT target (Go + Rust are the only two). Verified 2026-06: Apache-2.0, v0.5.2, prebuilt linux/macos×x86/arm binaries, ships Claude Code skills (/fizz-spec|check|debug|mbt) for the spec-authoring loop. - Go-native layer: Ginkgo/Gomega seeded property tests run alongside — they
check the implementation, complementing (not substituting for) the FizzBee check
of the design. Skipping FizzBee is NOT "degrading to the Go layer": the design
authority would be gone. The gate is therefore fail-closed (see Enforcement).
(Originally specced as
rapid; switched to Ginkgo/Gomega to satisfy LocalAI's Ginkgo-onlyforbidigolint without weakening that gate — see Part 6.2.) - Enforcement: the
realtime-conformancepre-commit/CI gate +.agents/guide (Part 6.3).
FizzBee risk mitigations (decided):
- The gate is fail-closed: a missing FizzBee is a hard failure, never a silent skip.
The only bypass is the explicit, loud
REALTIME_CONFORMANCE_SKIP_FIZZBEE=1(local only; CI never sets it; pre-commit runs the gate on anyrespcoord/**orformal-verification/**change so a pure.fizzedit still re-verifies). - CI pins the FizzBee release binary by version + sha256 (
formal-verification/fizzbee.sha256, all four platforms, digests from the GitHub release; installer verifies before extract, CI caches it). Not go-gettable:pkg/modelcheckerimports the Bazel-internalfizz/protowith no committed.pb.go, so a plaingo getwon't build — hence the pinned binary. - Keep the
.fizzmodel portable (no exotic features) so it stays re-expressible in another model checker if FizzBee is ever abandoned — lock-in is at the tooling layer only, not the design.
Open questions (decide before implementing)
- Scope of the actor refactor: full single-writer per session, or incrementally migrate one machine at a time behind the existing locks? (Suggest: M3 response coordination first — it has the load-bearing dual-writer bug.)
Resolved: FSM library vs hand-rolled → hand-rolled sealed-state tables, qmuntal/stateless fallback (Part 5). Conformance bridge → FizzBee (model-check + Go MBT) with a Ginkgo/Gomega seeded-property Go-native floor as hedge (Part 6.4). Single-source-of-truth codegen (PGo/MPCal) → not viable (research-grade, greenfield-only); substitute is the CI conformance gate (Part 6.3).
Agentic turn semantics → invariant #2 is one response.done per response.create
(OpenAI-faithful); the server-side AssistantExecutor tool loop becomes internal
sub-states of a single response rather than emitting one terminal per turn. Verified safe
in-tree: the current response.done carries only {id, object, status} (Output/Usage
never populated), the React UI (Talk.jsx:330) reads only status, every unit test
already asserts ResponseDone == 1 for tool turns, no test expects multiplicity, and the
server-side recursion is untested. Collapsing also fixes a latent "Listening…" flicker
mid-agentic-loop. The client-driven tool loop (fresh response.create per round-trip)
legitimately keeps one terminal each — unaffected. Follow-up: actually populate Output +
Usage in the single terminal (currently always empty).