LocalAI/docs/content/features/openai-realtime.md
Richard Palethorpe 5d0c43ec6e feat(realtime): Semantic VAD EOU token (#10444)
* feat(realtime): EOU-driven semantic_vad turn detection

Add a `semantic_vad` turn-detection mode to the realtime API that feeds
the transcription model live and decides "the user finished speaking"
from the `<EOU>` end-of-utterance token rather than from silence alone.
When EOU fires the turn commits immediately (~0.3s); otherwise it falls
back to an eagerness-scaled silence threshold (low/med/high = 8/4/2s).

Plumbing, bottom to top:

- proto: `AudioTranscriptionLive` bidirectional RPC (config-first oneof,
  mono float PCM @16k, ready-ack / Unimplemented degrade signal) plus
  `TranscriptResult.eou` for the unary retranscribe gate.
- pkg/grpc: client/server/base/embed scaffolding for the bidi stream,
  modeled on AudioTransformStream; release stream conns on terminal Recv.
- parakeet-cpp: live transcription RPC with per-C-call engine locking
  (one live stream per turn, finalize+free at commit); bump parakeet.cpp
  to ABI v5 — incremental StreamingMel (no more quadratic per-feed mel
  recompute that delayed EOU on long turns) and the <EOU>/<EOB> split;
  strip the literal <EOU>/<EOB> from offline text and set Eou.
- core/backend: LiveTranscriptionSession wrapper + pipeline
  `turn_detection:` config block (type/eagerness/retranscribe).
- realtime: semantic_vad integration — live input captions streamed as
  transcription deltas while the user speaks, EOU-immediate commit with
  eagerness fallback, optional retranscribe gate (batch re-decode must
  also end in <EOU> to confirm), clause synthesis off the LLM token
  callback, and per-turn live-transcription / model_load telemetry.
- UI: show the realtime pipeline components as a vertical list.

Docs and tests included; opt-in via the pipeline YAML or per-session
`session.update`. Non-streaming STT backends degrade to silence-only.

Assisted-by: Claude Code:claude-opus-4-8 [Read] [Edit] [Write] [Bash]
Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(realtime): explicit formally-verified state machines + parakeet streaming driver

The realtime API had several implicit state machines whose state was inferred
from scattered booleans, channels, and five separate mutexes, leaving
illegal/inconsistent states reachable. Make them explicit and keep the
implementation in step with a formal design; rework the parakeet streaming
backend along the same lines.

Realtime state machines (M1-M5). Each is a sealed sum-type State/Event/Effect
with a total, pure Next(state,event)->(state,[]effect) behind a single-writer
Coordinator:

  M1 conncoord    connection lifecycle: VAD toggle + once-only teardown
                  (replaces vadServerStarted + a `done` channel closed from
                  two sites).
  M2 turncoord    turn detection: collapses speechStarted and the live-stream
                  "turn open" flag into one state, so discardTurn can no longer
                  desync them and suppress the next onset.
  M3 respcoord    response coordination: serializes the dual-writer
                  start/cancel so at most one response is live; one
                  response.done per response.create.
  M4 compactcoord conversation compaction: single-flight (replaces the
                  `compacting atomic.Bool` CAS).
  M5 ttscoord     TTS pipeline: open->closing->closed, idempotent wait(),
                  rejects enqueue-after-close (was a silent drop).

The Coordinator/Sink/Next plumbing — only the sealed types and Next differed
per machine — is extracted once into core/http/endpoints/openai/coordinator as
a generic Coordinator[S,E,F]; each machine keeps its public API via type
aliases, so no sink, call-site, or test moved.

Hierarchy. session_lifecycle.fizz models M1 as the parent region with its
children (M2/M3/M4) as one statechart and asserts ChildrenDieWithParent (conn
torn => all children terminal, none start after teardown). respcoord and
compactcoord gain an absorbing Terminated state + Shutdown event; conncoord's
teardown drives the children terminal. This closes a compaction teardown gap: a
fire-and-forget compaction could outlive a torn session — compactionSink now
takes a session-scoped cancellable context + WaitGroup and joins the in-flight
summarize+evict on shutdown.

Formal verification. formal-verification/ holds one authoritative FizzBee spec
per machine plus the composition spec, each with an always-assertion and a
documented one-line edit that makes the checker fail (verified non-vacuous).
scripts/realtime-conformance.sh is fail-closed: all Go conformance suites under
-race AND a model-check of every .fizz spec; a missing FizzBee is a hard error
(only the loud REALTIME_CONFORMANCE_SKIP_FIZZBEE=1 bypasses it, never in CI).
FizzBee is pinned by sha256 and installed via scripts/install-fizzbee.sh into
.tools/ (gitignored). Wired as make test-realtime-conformance, a CI workflow,
and a pre-commit path filter. Go conformance tests are Ginkgo/Gomega (per the
repo's forbidigo lint): transition tables + fixed-seed property walks +
concurrent/-race specs, no rapid dependency. Design map:
docs/design/realtime-state-machines.md.

Parakeet streaming backend. The same treatment applied to the parakeet-cpp
streaming paths:
- AudioTranscriptionStream returns codes.Unimplemented for non-streaming models
  instead of decoding offline and emitting it as one delta + final. A client
  that asked for streaming learns the model cannot stream rather than receiving
  a batch result shaped like a stream. New grpcerrors.StreamTranscriptionUnsupported
  carries that signal; the HTTP /v1/audio/transcriptions stream path surfaces it
  as an SSE error event. Mirrors AudioTranscriptionLive, which already did this.
- utteranceBoundary (boundary.go): a single definition of the end-of-utterance
  latch, replacing three open-coded finalEou toggles. Modelled as a two-valued
  type so illegal states are unrepresentable.
- Shared decode driver (driver.go): streamFeedResult (one per-feed event) +
  feedChunk (hides the ABI v4 JSON vs text-only split) + feedSlices + flushTail.
  The feed loop is written once.
- AudioTranscriptionLive becomes a bidi adapter: it streams the per-feed
  {delta,eou,eob,words} the realtime turn detector consumes and a terminal
  FinalResult carrying only Text. Segments/duration/eou are offline-only and no
  longer produced (nor read) on the live path; liveTraceState drops the terminal
  eou and keeps the per-feed eou_events count.
- AudioTranscriptionStream + streamJSON merge into one driver-based function;
  streamSegmenter is generalized to the unified event with a text-only fallback
  that preserves the legacy (no-words) library's per-utterance segmentation.

Verified: build/vet/gofumpt clean, golangci-lint 0 issues, all coordinator and
parakeet packages under -race, the fail-closed conformance gate green, and
make test-realtime (12 e2e WS+WebRTC).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-30 09:01:22 +02:00



title: "Realtime API"
weight: 60

The realtime voice loop: VAD to STT to LLM to TTS, over WebSocket or WebRTC

LocalAI supports the OpenAI Realtime API, which enables low-latency, multi-modal conversations (voice and text) over WebSocket or WebRTC.

To use the Realtime API, you need to configure a pipeline model that defines the components for Voice Activity Detection (VAD), Transcription (STT), Language Model (LLM), and Text-to-Speech (TTS).

Configuration

Create a model configuration file (e.g., gpt-realtime.yaml) in your models directory. For a complete reference of configuration options, see [Model Configuration]({{%relref "advanced/model-configuration" %}}).

name: gpt-realtime
pipeline:
  vad: silero-vad-ggml
  transcription: whisper-large-turbo
  llm: qwen3-4b
  tts: tts-1

This configuration links the following components:

  • vad: The Voice Activity Detection model (e.g., silero-vad-ggml) to detect when the user is speaking.
  • transcription: The Speech-to-Text model (e.g., whisper-large-turbo) to transcribe user audio.
  • llm: The Large Language Model (e.g., qwen3-4b) to generate responses.
  • tts: The Text-to-Speech model (e.g., tts-1) to synthesize the audio response.

Make sure all referenced models (silero-vad-ggml, whisper-large-turbo, qwen3-4b, tts-1) are also installed or defined in your LocalAI instance.

Streaming the pipeline

By default each stage runs to completion before the next begins: the whole utterance is transcribed, the full LLM reply is generated, then it is synthesized. Each stage can instead be streamed incrementally, which lowers the time-to-first-audio of a turn:

name: gpt-realtime
pipeline:
  vad: silero-vad-ggml
  transcription: whisper-large-turbo
  llm: qwen3-4b
  tts: tts-1
  streaming:
    llm: true             # stream LLM tokens as transcript deltas
    tts: true             # emit audio deltas per synthesized chunk
    transcription: true   # stream transcript text deltas of the user's speech
    clause_chunking: true # synthesize each clause as soon as it completes
  • streaming.tts: emit a response.output_audio.delta per audio chunk the TTS backend produces (requires a backend that supports streaming synthesis), instead of one delta for the whole utterance. Falls back to a single unary delta otherwise.
  • streaming.transcription: stream conversation.item.input_audio_transcription.delta events as the transcript is produced (requires a transcription backend that supports streaming).
  • streaming.llm: stream the LLM reply token-by-token as response.output_audio_transcript.delta events. The full reply is buffered and synthesized once it is complete — streamed as audio chunks when streaming.tts is enabled (and the TTS backend supports it), otherwise as a single unary delta. Reasoning/thinking is always stripped from the spoken transcript. Tool calls are supported while streaming when the LLM uses its tokenizer template (use_tokenizer_template: true): the backend's autoparser then delivers content and tool calls separately, so the spoken transcript never leaks tool-call tokens. Grammar-based function calling keeps the buffered path.
  • streaming.clause_chunking: instead of buffering the whole reply before TTS, split it into speakable clauses and synthesize each as soon as it completes, lowering the time-to-first-audio. The splitter is script-aware: it uses Unicode sentence segmentation (so it handles CJK 。！？ with no whitespace), CJK clause punctuation (，、；：), and Thai/Lao spaces. It does not rely on whitespace sentence boundaries, so it works for languages such as Chinese, Japanese and Thai where the old per-sentence approach degraded to whole-message buffering. Requires streaming.llm; scripts that genuinely need a dictionary (e.g. Khmer, Burmese) simply stay buffered until a space or end-of-message. Off by default.

All streaming flags are off by default, so existing pipelines are unaffected.
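
The clause-by-clause hand-off that streaming.clause_chunking describes can be pictured with a deliberately simplified sketch. It is not LocalAI's splitter: it swaps Unicode sentence segmentation for a fixed punctuation set, ignores Thai/Lao spacing, and will split inside things like decimals. The ClauseChunker class and its method names are hypothetical.

```python
# Simplified stand-in for clause chunking: split on a fixed punctuation set.
# The real splitter uses Unicode sentence segmentation; this one does not.
CLAUSE_END = set(".!?;:,") | set("。！？，、；：")


class ClauseChunker:
    """Buffers streamed LLM tokens and releases each clause once it completes."""

    def __init__(self) -> None:
        self._buf = ""

    def feed(self, token: str) -> list[str]:
        """Add one token; return any clauses that are now complete."""
        self._buf += token
        clauses, start = [], 0
        for i, ch in enumerate(self._buf):
            if ch in CLAUSE_END:
                clause = self._buf[start:i + 1].strip()
                if clause:
                    clauses.append(clause)
                start = i + 1
        self._buf = self._buf[start:]  # keep the unfinished tail
        return clauses

    def flush(self) -> list[str]:
        """End of message: release whatever is still buffered."""
        rest, self._buf = self._buf.strip(), ""
        return [rest] if rest else []


chunker = ClauseChunker()
spoken = []
for token in ["Hello", ", wor", "ld. How", " are you"]:
    spoken.extend(chunker.feed(token))  # each clause could go to TTS here
spoken.extend(chunker.flush())
```

Each clause is available for synthesis as soon as its closing punctuation arrives, which is where the time-to-first-audio saving comes from.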

Turn detection

Turn detection decides when the user has finished speaking and the pipeline should respond. Two modes are supported, matching the OpenAI session schema:

  • server_vad (default): silence-based. The VAD model watches the audio and the turn commits after silence_duration_ms (default 500 ms) of silence. Simple and model-agnostic, but a fixed silence window must trade interrupting mid-sentence pauses against sluggish responses.
  • semantic_vad: model-driven. The transcription model itself signals end-of-utterance and the silence window becomes dynamic: short right after the model emits its end-of-utterance token, much longer when it does not — so pausing to think no longer gets cut off, while finished sentences get a fast response.

semantic_vad requires a transcription model that emits an end-of-utterance token over a cache-aware streaming decode — currently parakeet-cpp-realtime_eou_120m-v1 (the model is trained to distinguish "paused, expecting a reply" from "paused mid-thought"). The realtime pipeline feeds it the microphone audio live while the user speaks. With any other transcription backend the session degrades gracefully to silence-only detection using the eagerness timeout below (a warning is logged once). The model also emits a distinct end-of-backchannel token (<EOB>) for short acknowledgments like "uh-huh": those are transcribed but never treated as the user yielding the turn.

Sessions can opt in via session.update (turn_detection: {"type": "semantic_vad", "eagerness": "medium"}), or the pipeline can set a server-side default so clients need no changes:

name: gpt-realtime
pipeline:
  vad: silero-vad-ggml
  transcription: parakeet-cpp-realtime_eou_120m-v1
  llm: qwen3-4b
  tts: tts-1
  turn_detection:
    type: semantic_vad   # default for sessions on this model (server_vad if unset)
    eagerness: medium    # low | medium | high | auto (auto == medium)
    retranscribe: false  # see below

A client session.update still overrides type and eagerness per session.
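
A client-side opt-in could be built roughly as follows. This is a sketch, not a reference client: the helper name is hypothetical, and placing turn_detection directly under a "session" key is an assumption based on the OpenAI session schema rather than something this page states.

```python
import json


def semantic_vad_update(eagerness: str = "medium") -> str:
    """Serialize a session.update that opts this session into semantic_vad."""
    if eagerness not in {"low", "medium", "high", "auto"}:
        raise ValueError(f"unknown eagerness: {eagerness!r}")
    event = {
        "type": "session.update",
        # Assumption: turn_detection sits directly under "session".
        "session": {
            "turn_detection": {"type": "semantic_vad", "eagerness": eagerness},
        },
    }
    return json.dumps(event)


payload = semantic_vad_update("high")  # send this as a text frame
```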

Eagerness sets the fallback silence window used when no end-of-utterance token was seen (the model missed it, or the user genuinely trails off): low waits 8 s, medium/auto 4 s, high 2 s — the same max-timeout semantics OpenAI documents. After the token is seen, the turn commits on the next VAD tick (~300 ms).
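
The timing rule above can be written down as a small decision function. This is an illustrative model only: the function names are hypothetical, and treating the post-token commit as a fixed 0.3 s window is a simplification of "the next VAD tick".

```python
# Fallback silence windows in seconds, taken from the text above.
FALLBACK_SILENCE_S = {"low": 8.0, "medium": 4.0, "auto": 4.0, "high": 2.0}
EOU_COMMIT_S = 0.3  # approximation of one VAD tick (~300 ms)


def silence_window(eagerness: str, eou_seen: bool) -> float:
    """Silence required before the turn commits."""
    return EOU_COMMIT_S if eou_seen else FALLBACK_SILENCE_S[eagerness]


def should_commit(silence_s: float, eagerness: str, eou_seen: bool) -> bool:
    """True once the observed silence reaches the current window."""
    return silence_s >= silence_window(eagerness, eou_seen)
```

With eagerness low, one second of silence commits the turn if the end-of-utterance token was seen, and keeps listening if it was not.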

Live captions: while the user speaks, semantic_vad streams conversation.item.input_audio_transcription.delta events under the item id the commit will later reuse, so clients can render the words as they are recognized. The completed event at commit carries the authoritative transcript and replaces the partial text (with retranscribe: true it may differ from the captions); a turn discarded before commit emits conversation.item.input_audio_transcription.failed so clients can retract its captions.
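
A client could track those three events along these lines. The CaptionBoard class is hypothetical, and the field names item_id, delta and transcript are assumed to follow the OpenAI Realtime event schema.

```python
PREFIX = "conversation.item.input_audio_transcription."


class CaptionBoard:
    """Keeps the caption text currently shown for each conversation item."""

    def __init__(self) -> None:
        self.captions: dict[str, str] = {}

    def handle(self, event: dict) -> None:
        kind = event.get("type", "")
        if not kind.startswith(PREFIX):
            return  # not a transcription event
        item_id = event["item_id"]
        suffix = kind[len(PREFIX):]
        if suffix == "delta":
            # Partial text: append to what is already displayed.
            self.captions[item_id] = self.captions.get(item_id, "") + event["delta"]
        elif suffix == "completed":
            # Authoritative transcript replaces the partial captions.
            self.captions[item_id] = event["transcript"]
        elif suffix == "failed":
            # Turn discarded before commit: retract its captions.
            self.captions.pop(item_id, None)
```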

retranscribe (server-side only, semantic_vad only) cross-checks the streaming decode against a batch decode at commit time:

  • false (default): the transcript accumulated from the live stream is used as-is — the model runs once per utterance and the LLM starts immediately at commit.
  • true: the committed audio is re-transcribed offline. If the batch decode also ends with the end-of-utterance token the turn proceeds (using the batch transcript); if it does not, the commit is cancelled and the session keeps listening — treating the streaming token as a false positive. Both transcripts are compared and logged, which makes this mode a useful diagnostic for how well the streaming and batch decodes align, at the cost of one extra decode per turn.
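
The two retranscribe settings reduce to the following decision. This is a sketch under assumptions: resolve_commit and the batch_decode callback (returning the batch transcript and whether it ended in the end-of-utterance token) are hypothetical names, not LocalAI functions.

```python
from typing import Callable, Optional


def resolve_commit(
    live_text: str,
    retranscribe: bool,
    batch_decode: Callable[[], tuple[str, bool]],
) -> Optional[str]:
    """Return the transcript to hand to the LLM, or None to keep listening."""
    if not retranscribe:
        return live_text  # live transcript used as-is, no second decode
    batch_text, batch_eou = batch_decode()  # one extra decode per turn
    # The batch decode must also end in the end-of-utterance token;
    # otherwise the streaming token is treated as a false positive.
    return batch_text if batch_eou else None
```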

Disabling thinking

For reasoning models, you can force the pipeline LLM's thinking off without editing the LLM model config:

pipeline:
  llm: qwen3-4b
  disable_thinking: true   # maps to enable_thinking=false for the realtime LLM

This is applied only to the realtime session's copy of the LLM config, so it does not affect other users of the same model. Leave it unset to use the LLM model config's own reasoning settings.

Conversation compaction (long sessions on CPU)

By default a realtime session feeds only the last max_history_items turns to the LLM; older turns are dropped and forgotten. On CPU, long calls also grow expensive as the prompt fills with verbatim history. Enable compaction to instead fold older turns into a rolling summary, so long calls stay cheap without losing earlier context.

Compaction works with two numbers:

  • max_history_items is the live window — the recent turns kept verbatim in the prompt.
  • compaction.trigger_items is the high-water mark — let the buffer grow to here, then summarize the overflow (everything above max_history_items) into a rolling memory and evict it. It must be greater than max_history_items; if it is not, it is clamped up.

The gap between the two controls how often summarization runs: a summary call fires roughly every (trigger_items - max_history_items) turns (here, about every 6 turns).

pipeline:
  max_history_items: 6        # live window — recent turns kept verbatim
  compaction:
    enabled: true
    trigger_items: 12         # summarize overflow back down to max_history_items
    summary_model: ""         # optional: a small model for the summary (CPU); default = pipeline LLM
    max_summary_tokens: 512
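
The cadence implied by these two numbers can be checked with a small simulation. It is a model under stated assumptions, not the server's code: one conversation item is appended per step, the function name is hypothetical, and instead of clamping an invalid trigger_items the way the server does, the sketch simply rejects it.

```python
def summarization_points(
    max_history_items: int, trigger_items: int, total_items: int
) -> list[int]:
    """Return the item counts (1-based) at which a summary call would fire."""
    if trigger_items <= max_history_items:
        raise ValueError("trigger_items must exceed max_history_items")
    fired, buffered = [], 0
    for n in range(1, total_items + 1):
        buffered += 1  # a new item joins the buffer
        if buffered >= trigger_items:
            fired.append(n)
            # Overflow is folded into the summary and evicted,
            # leaving only the live window.
            buffered = max_history_items
    return fired


points = summarization_points(6, 12, 30)
```

With the values from the configuration above, the first summary fires once 12 items have accumulated and then again after every 6 new items.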

{{% notice tip %}} On CPU, set summary_model to a small, fast model so compaction never competes with the conversation LLM for compute. Left empty, the pipeline's own LLM produces the summary. {{% /notice %}}

Clients can also manage history directly via the now-supported conversation.item.delete, conversation.item.truncate, and input_audio_buffer.clear realtime events.

Transports

The Realtime API supports two transports: WebSocket and WebRTC.

WebSocket

Connect to the WebSocket endpoint:

ws://localhost:8080/v1/realtime?model=gpt-realtime

Audio is sent and received as raw PCM in the WebSocket messages, following the OpenAI Realtime API protocol.
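
As a rough illustration, a client might build the endpoint URL and an audio message like this. Both helper names are hypothetical, and the message shape is an assumption: it follows the OpenAI convention of base64-encoded 16-bit little-endian PCM inside an input_audio_buffer.append event, which this page does not spell out.

```python
import base64
import json
import struct
from urllib.parse import urlencode


def realtime_url(host: str, model: str) -> str:
    """Build the WebSocket endpoint for a given pipeline model."""
    return f"ws://{host}/v1/realtime?{urlencode({'model': model})}"


def audio_append_event(samples: list[int]) -> str:
    """Wrap 16-bit PCM samples in an input_audio_buffer.append event."""
    pcm = struct.pack(f"<{len(samples)}h", *samples)  # little-endian int16
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    })


url = realtime_url("localhost:8080", "gpt-realtime")
```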

WebRTC

The WebRTC transport enables browser-based voice conversations with lower latency. Connect by POSTing an SDP offer to the REST endpoint:

POST http://localhost:8080/v1/realtime?model=gpt-realtime
Content-Type: application/sdp

<SDP offer body>

The response contains the SDP answer to complete the WebRTC handshake.
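
The signalling step can be sketched with the Python standard library. The helper name is hypothetical and the offer string would come from whatever WebRTC stack the client uses; only the request construction is shown, so nothing is sent until the request is opened.

```python
import urllib.request
from urllib.parse import urlencode


def sdp_offer_request(base_url: str, model: str, offer_sdp: str) -> urllib.request.Request:
    """Build the POST that carries the SDP offer to the realtime endpoint."""
    return urllib.request.Request(
        f"{base_url}/v1/realtime?{urlencode({'model': model})}",
        data=offer_sdp.encode("utf-8"),
        headers={"Content-Type": "application/sdp"},
        method="POST",
    )


req = sdp_offer_request("http://localhost:8080", "gpt-realtime", "v=0\r\n")
# To complete the handshake against a running server:
#   with urllib.request.urlopen(req) as resp:
#       answer_sdp = resp.read().decode("utf-8")
```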

Opus backend requirement

WebRTC uses the Opus audio codec for encoding and decoding audio on RTP tracks. The opus backend must be installed for WebRTC to work. Install it from the model gallery:

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{"id": "opus"}'

Or set the EXTERNAL_GRPC_BACKENDS environment variable if running a local build:

EXTERNAL_GRPC_BACKENDS=opus:/path/to/backend/go/opus/opus

The opus backend is loaded automatically when a WebRTC session starts. It does not require any model configuration file — just the backend binary.

WebRTC behind Docker host networking or NAT

By default pion gathers a host ICE candidate for every local interface. Under Docker host networking that includes bridge addresses (docker0/veth, 172.x) that a remote browser cannot route to: the call typically connects on a good candidate and then drops a few seconds later when ICE consent checks fail on the unreachable ones. Two settings let you advertise only the reachable address:

# Advertise these IPs as the host ICE candidates (e.g. the host's LAN IP)
LOCALAI_WEBRTC_NAT_1TO1_IPS=192.168.1.10

# ...or restrict ICE gathering to specific interfaces
LOCALAI_WEBRTC_ICE_INTERFACES=eth0

{{% notice tip %}} For a browser on another LAN machine talking to LocalAI in a host-networked container, set LOCALAI_WEBRTC_NAT_1TO1_IPS to the host's LAN IP. This is the most reliable fix for WebRTC connections that establish and then drop. {{% /notice %}}

Protocol

The API follows the OpenAI Realtime API protocol for handling sessions, audio buffers, and conversation items.

Gating a realtime pipeline with voice recognition

A pipeline realtime model can require speaker verification before it responds. Add a voice_recognition block under pipeline. When present, each committed utterance is verified against authorized speakers; unauthorized utterances are dropped before the LLM runs (no LLM call, no tool execution, no TTS). The session stays open.

The same block also drives two optional, independent behaviors: an authorization gate (enforce) and speaker surfacing/personalization (identity). Set enforce: false to keep recognizing the speaker without ever rejecting a turn.

name: my-realtime
pipeline:
  vad: silero-vad
  transcription: whisper
  llm: qwen
  tts: kokoro
  voice_recognition:
    model: speaker-recognition   # the speaker-recognition backend model
    mode: identify               # "identify" (registry) or "verify" (references)
    threshold: 0.25              # cosine distance; <= passes
    enforce: true                # authorization gate (default true)
    when: every                  # "every" (default) or "first"
    on_reject: drop_event        # "drop_event" (default) or "drop_silent"
    anti_spoofing: false         # optional liveness check (verify mode)

    # identify mode: authorized registry identities (multiple persons)
    allow:
      names: ["alice", "bob"]    # match registered speaker names
      labels: ["family"]         # OR any identity carrying this label
      # empty allow = any registered speaker within threshold passes

    # verify mode: reference speakers (multiple persons)
    references:
      - name: alice
        audio: /models/voices/alice.wav
      - name: bob
        audio: /models/voices/bob.wav
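
The gate's pass/fail rule for identify mode, as described by the threshold and allow settings, can be sketched as below. The function is hypothetical and treats a speaker's labels as a plain set of strings, which is an assumption about the registry's data shape.

```python
def is_authorized(
    distance: float,
    name: str,
    labels: set[str],
    threshold: float = 0.25,
    allow_names: frozenset[str] = frozenset(),
    allow_labels: frozenset[str] = frozenset(),
) -> bool:
    """Decide whether the best registry match may drive the pipeline."""
    if distance > threshold:
        return False  # cosine distance must be <= threshold to pass
    if not allow_names and not allow_labels:
        return True  # empty allow: any registered speaker within threshold
    # Otherwise the name must be listed OR the identity must carry a listed label.
    return name in allow_names or bool(labels & allow_labels)
```

For example, with allow names alice and bob and label family, a speaker named carol at distance 0.1 passes only if she carries the family label.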

Identifying speakers without gating

To recognize who is speaking and surface it to the client and the LLM without ever rejecting a turn, set enforce: false and add an identity block. The identity block works with or without the gate; when it is set, the speaker is resolved on every turn even if when: first.

name: my-realtime
pipeline:
  vad: silero-vad
  transcription: whisper
  llm: qwen
  tts: kokoro
  voice_recognition:
    model: speaker-recognition
    mode: identify
    threshold: 0.25
    # Authorization gate. Defaults to enforcing (rejects unauthorized speakers).
    # Set enforce:false to identify the speaker WITHOUT rejecting anyone.
    enforce: false
    when: every
    # Surface the recognized speaker to the client and the LLM. Works with or
    # without enforce; when set, identity is resolved on every turn even if
    # when:first.
    identity:
      announce: true            # emit the conversation.item.speaker event
      announce_unknown: false   # also emit it when there is no confident match
      personalize: true         # tell the LLM who is speaking
      inject_name: true         # set the per-message OpenAI name field
      inject_system_note: true  # append a "current speaker" line to the system message
      note_unknown: false       # append a "speaker is unknown" note when unidentified

| Field | Meaning |
|---|---|
| model | Speaker-recognition backend model name. |
| mode | identify matches against speakers registered via /v1/voice/register; verify matches against the references audios. |
| threshold | Maximum cosine distance that still counts as a match (default ~0.25). |
| enforce | Authorization gate. true (or omitted) rejects unauthorized speakers (the gating behavior above). false resolves and surfaces the speaker without ever dropping a turn. |
| when | every verifies each utterance; first verifies once then trusts the session. When an identity block is set, the speaker is still resolved on every turn even with first. |
| on_reject | drop_event drops and emits a speaker_not_authorized error event; drop_silent drops quietly. |
| anti_spoofing | Verify mode only: runs the backend liveness check (slower). |
| allow.names / allow.labels | identify mode: which registry identities are authorized. Empty = any registered speaker. |
| references | verify mode: authorized reference speakers; the utterance passes if it matches any. |
| identity.announce | Emit the conversation.item.speaker event to the client (see below). |
| identity.announce_unknown | Also emit that event when there is no confident match. By default the event is emitted only on a match. |
| identity.personalize | Inform the LLM who is speaking. |
| identity.inject_name | Set the per-message OpenAI name field on each user turn. |
| identity.inject_system_note | Append a "The current speaker is <Name>." line to the system message. |
| identity.note_unknown | When unidentified, append "The current speaker is unknown." (lets the model ask who it is talking to). |

identify mode requires the voice registry (speakers registered through /v1/voice/register). verify mode needs no registry: reference audios are embedded once at model load.

The conversation.item.speaker event

When identity.announce is enabled, the server emits a conversation.item.speaker event after the user conversation item, naming the recognized speaker:

{
  "type": "conversation.item.speaker",
  "item_id": "item_abc",
  "speaker": { "name": "Jeremy", "id": "spk_1", "labels": { "role": "owner" }, "confidence": 92.0, "distance": 0.1, "matched": true }
}

confidence is a 0-100 score, distance is the cosine distance, and matched is true when a confident match was found. labels carries any labels attached to the registered speaker (identify mode); it is omitted when the speaker has none. The name and id fields are omitted when empty. By default the event is emitted only on a match; set identity.announce_unknown: true to also emit it (with matched: false) when no speaker is identified.

This event is a LocalAI extension to the OpenAI Realtime API and is server-emitted only. Standard OpenAI Realtime clients ignore event types they do not recognize, so enabling it is non-breaking.
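
A client that wants to use the event could handle it along these lines. The function name and the fallback strings are hypothetical; the field handling follows the description above, with name and id treated as optional.

```python
import json
from typing import Optional


def describe_speaker(raw: str) -> Optional[str]:
    """Summarize a conversation.item.speaker event; None for other events."""
    event = json.loads(raw)
    if event.get("type") != "conversation.item.speaker":
        return None  # unrelated event types are ignored
    speaker = event["speaker"]
    if not speaker.get("matched"):
        return "unknown speaker"
    # name and id are omitted when empty, so fall back in order.
    name = speaker.get("name") or speaker.get("id") or "unnamed speaker"
    return f"{name} ({speaker['confidence']:.0f}% confidence)"


sample = json.dumps({
    "type": "conversation.item.speaker",
    "item_id": "item_abc",
    "speaker": {"name": "Jeremy", "id": "spk_1", "labels": {"role": "owner"},
                "confidence": 92.0, "distance": 0.1, "matched": True},
})
summary = describe_speaker(sample)
```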

Examples

  • Realtime voice assistant demo (Go): a minimal Go client for the Realtime (WebSocket) API with a full talk-back voice loop and an example tool call. Ships a docker compose setup that brings up a realtime-capable LocalAI for you.
  • Realtime voice assistant example (Python): thin-client architecture (Silero VAD on the client, heavy lifting on LocalAI), suited to running the client on a Raspberry Pi.