LocalAI/docs/content/features/openai-realtime.md
Ettore Di Giacinto d05d83ff36 feat(realtime): stream tool-call turns via tokenizer-template autoparser
Per review (richiejp): tool-call deltas exist, so streaming should work with
tools too. It does — for models that use their tokenizer template. The C++
autoparser then clears reply.Message and delivers content + tool calls via
ChatDeltas, so the streamed transcript carries only spoken content (no
tool-call JSON leak) and the tool calls are parsed from the final response.

- Drop the len(tools)==0 gate; stream when no tools OR use_tokenizer_template
  (grammar-based function calling still buffers, since its call is emitted as
  JSON in the token stream and would leak into the transcript).
- streamLLMResponse takes tools/toolChoice/toolTurn, reads ChatDelta content
  in the token callback, parses tool calls from the final ChatDeltas, and
  creates the assistant content item lazily so a content-less tool turn emits
  only the tool calls.
- Extract emitToolCallItems from the buffered path so both paths finalize tool
  calls, response.done, and server-side assistant-tool follow-ups identically.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 14:03:36 +00:00


title: "Realtime API"
weight: 60

The realtime voice loop: VAD to STT to LLM to TTS, over WebSocket or WebRTC

LocalAI supports the OpenAI Realtime API, which enables low-latency, multi-modal conversations (voice and text) over WebSocket or WebRTC.

To use the Realtime API, you need to configure a pipeline model that defines the components for Voice Activity Detection (VAD), Transcription (STT), Language Model (LLM), and Text-to-Speech (TTS).

Configuration

Create a model configuration file (e.g., gpt-realtime.yaml) in your models directory. For a complete reference of configuration options, see [Model Configuration]({{%relref "advanced/model-configuration" %}}).

name: gpt-realtime
pipeline:
  vad: silero-vad-ggml
  transcription: whisper-large-turbo
  llm: qwen3-4b
  tts: tts-1

This configuration links the following components:

  • vad: The Voice Activity Detection model (e.g., silero-vad-ggml) to detect when the user is speaking.
  • transcription: The Speech-to-Text model (e.g., whisper-large-turbo) to transcribe user audio.
  • llm: The Large Language Model (e.g., qwen3-4b) to generate responses.
  • tts: The Text-to-Speech model (e.g., tts-1) to synthesize the audio response.

Make sure all referenced models (silero-vad-ggml, whisper-large-turbo, qwen3-4b, tts-1) are also installed or defined in your LocalAI instance.
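
As a quick sanity check before starting a session, the requirement above can be sketched in Python. This is only an illustration: `missing_pipeline_models` and `PIPELINE_STAGES` are hypothetical names, and the set of installed models is assumed to be supplied by the caller (for example, from your instance's model list).

```python
# Hypothetical helper: not part of LocalAI, just a sketch of the check.
PIPELINE_STAGES = ("vad", "transcription", "llm", "tts")


def missing_pipeline_models(pipeline, installed):
    """Return {stage: model_name} for stages that are unset or not installed."""
    missing = {}
    for stage in PIPELINE_STAGES:
        name = pipeline.get(stage)
        if not name or name not in installed:
            missing[stage] = name
    return missing


# The pipeline section from gpt-realtime.yaml, written as a plain dict.
pipeline = {
    "vad": "silero-vad-ggml",
    "transcription": "whisper-large-turbo",
    "llm": "qwen3-4b",
    "tts": "tts-1",
}

# Assumed example: the TTS model has not been installed yet.
installed = {"silero-vad-ggml", "whisper-large-turbo", "qwen3-4b"}

print(missing_pipeline_models(pipeline, installed))  # {'tts': 'tts-1'}
```

An empty result means every stage points at a model the instance knows about.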

Streaming the pipeline

By default each stage runs to completion before the next begins: the whole utterance is transcribed, the full LLM reply is generated, then it is synthesized. Each stage can instead be streamed incrementally, which lowers the time-to-first-audio of a turn:

name: gpt-realtime
pipeline:
  vad: silero-vad-ggml
  transcription: whisper-large-turbo
  llm: qwen3-4b
  tts: tts-1
  streaming:
    llm: true            # stream LLM tokens as transcript deltas
    tts: true            # emit audio deltas per synthesized chunk
    transcription: true  # stream transcript text deltas of the user's speech

  • streaming.tts: emit one response.output_audio.delta per audio chunk the TTS backend produces, instead of one delta for the whole utterance. Requires a backend that supports streaming synthesis; otherwise it falls back to a single unary delta.
  • streaming.transcription: stream conversation.item.input_audio_transcription.delta events as the transcript is produced (requires a transcription backend that supports streaming).
  • streaming.llm: stream the LLM reply token-by-token as response.output_audio_transcript.delta events. Only the transcript text is streamed incrementally; the reply is still buffered and synthesized once it is complete, as audio chunks when streaming.tts is enabled (and the TTS backend supports it), otherwise as a single unary delta. Reasoning/thinking is always stripped from the spoken transcript. Tool calls are supported while streaming when the LLM uses its tokenizer template (use_tokenizer_template: true): the backend's autoparser then delivers content and tool calls separately, so tool-call tokens never leak into the spoken transcript. Grammar-based function calling keeps the buffered path, because its call is emitted as JSON in the token stream and would otherwise leak into the transcript.

All streaming flags are off by default, so existing pipelines are unaffected.
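
The rule for when an LLM turn is streamed rather than buffered can be summarized as a small predicate. The Python below is a sketch of the behavior described above, not the actual server code; the function and parameter names are hypothetical.

```python
def llm_reply_is_streamed(streaming_llm, has_tools, use_tokenizer_template):
    """Sketch of the documented rule for choosing the streamed LLM path.

    Streaming applies only when streaming.llm is on, and then only if the
    turn has no tools or the model uses its tokenizer template. Grammar-based
    function calling (tools without the tokenizer template) stays buffered.
    """
    if not streaming_llm:
        return False
    return (not has_tools) or use_tokenizer_template


# Plain chat turn with streaming.llm enabled: streamed.
print(llm_reply_is_streamed(True, has_tools=False, use_tokenizer_template=False))  # True
# Tool turn with grammar-based function calling: buffered.
print(llm_reply_is_streamed(True, has_tools=True, use_tokenizer_template=False))   # False
# Tool turn with use_tokenizer_template: true: streamed.
print(llm_reply_is_streamed(True, has_tools=True, use_tokenizer_template=True))    # True
```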

Disabling thinking

For reasoning models, you can force the pipeline LLM's thinking off without editing the LLM model config:

pipeline:
  llm: qwen3-4b
  disable_thinking: true   # maps to enable_thinking=false for the realtime LLM

This is applied only to the realtime session's copy of the LLM config, so it does not affect other users of the same model. Leave it unset to use the LLM model config's own reasoning settings.
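
The per-session behavior can be pictured as a copy-then-override step. The sketch below is illustrative only: it models the LLM config as a plain dict with an `enable_thinking` key, and `session_llm_config` is a hypothetical name. It also assumes that only an explicit `disable_thinking: true` overrides the model's own setting.

```python
import copy


def session_llm_config(llm_config, disable_thinking=None):
    """Return a per-session copy of the LLM config (hypothetical dict layout)."""
    # Deep copy so the shared model config is never mutated.
    session_cfg = copy.deepcopy(llm_config)
    if disable_thinking:
        # Maps disable_thinking: true to enable_thinking=false.
        session_cfg["enable_thinking"] = False
    # When unset, the model config's own reasoning settings are kept.
    return session_cfg


shared = {"name": "qwen3-4b", "enable_thinking": True}
realtime = session_llm_config(shared, disable_thinking=True)

print(realtime["enable_thinking"])  # False
print(shared["enable_thinking"])    # True
```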

Transports

The Realtime API supports two transports: WebSocket and WebRTC.

WebSocket

Connect to the WebSocket endpoint:

ws://localhost:8080/v1/realtime?model=gpt-realtime

Audio is sent and received as raw PCM in the WebSocket messages, following the OpenAI Realtime API protocol.
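
In the OpenAI Realtime protocol, client audio travels as base64-encoded PCM inside JSON `input_audio_buffer.append` events. Assuming LocalAI mirrors that framing, a client could build the endpoint URL and an audio event roughly as follows. The helper names are hypothetical, and the actual socket handling (with a WebSocket client library of your choice) is left out.

```python
import base64
import json
from urllib.parse import urlencode


def realtime_ws_url(host, model):
    """Build the WebSocket endpoint URL for a given pipeline model."""
    return f"ws://{host}/v1/realtime?{urlencode({'model': model})}"


def audio_append_event(pcm_chunk):
    """Wrap a chunk of PCM bytes in an input_audio_buffer.append event.

    Assumption: audio is base64-encoded in the "audio" field, as in the
    OpenAI Realtime protocol.
    """
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })


print(realtime_ws_url("localhost:8080", "gpt-realtime"))
# ws://localhost:8080/v1/realtime?model=gpt-realtime
```

Each string returned by `audio_append_event` would be sent as one text message on the open connection.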

WebRTC

The WebRTC transport enables browser-based voice conversations with lower latency. Connect by POSTing an SDP offer to the REST endpoint:

POST http://localhost:8080/v1/realtime?model=gpt-realtime
Content-Type: application/sdp

<SDP offer body>

The response contains the SDP answer to complete the WebRTC handshake.
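
Outside the browser, the same exchange can be sketched with Python's standard library. This is a minimal illustration under stated assumptions: the function names are hypothetical, the offer string is a placeholder rather than a valid SDP, and a real client would obtain the offer from its WebRTC stack.

```python
import urllib.request
from urllib.parse import urlencode


def build_sdp_offer_request(base_url, model, sdp_offer):
    """Build the POST request that carries the SDP offer."""
    url = f"{base_url}/v1/realtime?{urlencode({'model': model})}"
    return urllib.request.Request(
        url,
        data=sdp_offer.encode("utf-8"),
        headers={"Content-Type": "application/sdp"},
        method="POST",
    )


def exchange_sdp(request):
    """Send the offer and return the SDP answer from the response body."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Placeholder offer; nothing is sent until exchange_sdp(req) is called.
req = build_sdp_offer_request("http://localhost:8080", "gpt-realtime", "v=0\r\n")
print(req.get_method(), req.full_url)
# POST http://localhost:8080/v1/realtime?model=gpt-realtime
```

The string returned by `exchange_sdp` is what the client would set as its remote description.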

Opus backend requirement

WebRTC uses the Opus audio codec for encoding and decoding audio on RTP tracks. The opus backend must be installed for WebRTC to work. Install it from the model gallery:

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{"id": "opus"}'

Or set the EXTERNAL_GRPC_BACKENDS environment variable if running a local build:

EXTERNAL_GRPC_BACKENDS=opus:/path/to/backend/go/opus/opus

The opus backend is loaded automatically when a WebRTC session starts. It does not require a model configuration file, only the backend binary.

Protocol

The API follows the OpenAI Realtime API protocol for handling sessions, audio buffers, and conversation items.
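
On the client side, server events arrive as JSON objects with a `type` field. As a rough sketch, the assistant's transcript can be reassembled from the streamed `response.output_audio_transcript.delta` events. The `delta` field name follows the OpenAI protocol and is assumed to apply here; `collect_transcript` and the sample events are made up for illustration.

```python
import json

TRANSCRIPT_DELTA = "response.output_audio_transcript.delta"


def collect_transcript(raw_events):
    """Concatenate transcript deltas from a sequence of JSON event strings."""
    parts = []
    for raw in raw_events:
        event = json.loads(raw)
        # Audio deltas and lifecycle events are ignored here.
        if event.get("type") == TRANSCRIPT_DELTA:
            parts.append(event.get("delta", ""))
    return "".join(parts)


# Made-up sample of what a streamed turn might look like.
sample_events = [
    json.dumps({"type": TRANSCRIPT_DELTA, "delta": "Hello"}),
    json.dumps({"type": "response.output_audio.delta", "delta": "AAAA"}),
    json.dumps({"type": TRANSCRIPT_DELTA, "delta": " there"}),
    json.dumps({"type": "response.done"}),
]

print(collect_transcript(sample_events))  # Hello there
```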