LocalAI/docs/content/features/openai-realtime.md
LocalAI [bot] 7e59a5c7c5 docs: architecture & feature diagrams (blueprint style) (#10137)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-02 18:43:22 +02:00


title: "Realtime API"
weight: 60

The realtime voice loop: VAD to STT to LLM to TTS, over WebSocket or WebRTC

LocalAI supports the OpenAI Realtime API, which enables low-latency, multi-modal conversations (voice and text) over WebSocket or WebRTC.

To use the Realtime API, you need to configure a pipeline model that defines the components for Voice Activity Detection (VAD), Transcription (STT), Language Model (LLM), and Text-to-Speech (TTS).

Configuration

Create a model configuration file (e.g., gpt-realtime.yaml) in your models directory. For a complete reference of configuration options, see [Model Configuration]({{%relref "advanced/model-configuration" %}}).

name: gpt-realtime
pipeline:
  vad: silero-vad-ggml
  transcription: whisper-large-turbo
  llm: qwen3-4b
  tts: tts-1

This configuration links the following components:

  • vad: The Voice Activity Detection model (e.g., silero-vad-ggml) to detect when the user is speaking.
  • transcription: The Speech-to-Text model (e.g., whisper-large-turbo) to transcribe user audio.
  • llm: The Large Language Model (e.g., qwen3-4b) to generate responses.
  • tts: The Text-to-Speech model (e.g., tts-1) to synthesize the audio response.

Make sure all referenced models (silero-vad-ggml, whisper-large-turbo, qwen3-4b, tts-1) are also installed or defined in your LocalAI instance.
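As a quick sanity check of the requirement above, the following minimal sketch tests that each pipeline stage names a model that is available. The helper `missing_pipeline_models` and the `installed` set are hypothetical and not part of LocalAI. The config is written as a plain dict that mirrors the YAML example, so no YAML parser is needed.

```python
# Hypothetical helper, not part of LocalAI. The four stage names are the
# pipeline keys from the gpt-realtime.yaml example above.
REQUIRED_STAGES = ("vad", "transcription", "llm", "tts")


def missing_pipeline_models(config, installed):
    """Return (stage, model) pairs that are unset or not in `installed`."""
    pipeline = config.get("pipeline") or {}
    problems = []
    for stage in REQUIRED_STAGES:
        model = pipeline.get(stage)
        if not model or model not in installed:
            problems.append((stage, model))
    return problems


# Same content as the YAML example, expressed as a dict.
config = {
    "name": "gpt-realtime",
    "pipeline": {
        "vad": "silero-vad-ggml",
        "transcription": "whisper-large-turbo",
        "llm": "qwen3-4b",
        "tts": "tts-1",
    },
}

# Assumption: a set of installed model names that you collected yourself.
installed = {"silero-vad-ggml", "whisper-large-turbo", "qwen3-4b"}

print(missing_pipeline_models(config, installed))  # [('tts', 'tts-1')]
```

An empty list means every stage points at a known model. Here `tts-1` is reported because it is absent from the assumed `installed` set.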

Transports

The Realtime API supports two transports: WebSocket and WebRTC.

WebSocket

Connect to the WebSocket endpoint:

ws://localhost:8080/v1/realtime?model=gpt-realtime

Audio is sent and received as raw PCM carried inside the WebSocket messages, following the OpenAI Realtime API protocol (in that protocol, audio chunks travel base64-encoded inside JSON events).
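To make the message shape concrete, here is a minimal sketch that builds the endpoint URL and one audio message. It assumes LocalAI accepts the `input_audio_buffer.append` event as defined by the OpenAI Realtime API protocol, where PCM bytes are base64-encoded in an `audio` field. The helper names are hypothetical, and the sketch only constructs strings, so it can run without a server.

```python
import base64
import json
from urllib.parse import urlencode


def realtime_ws_url(host="localhost:8080", model="gpt-realtime"):
    """Build the WebSocket URL shown above for a given pipeline model."""
    return f"ws://{host}/v1/realtime?{urlencode({'model': model})}"


def audio_append_event(pcm_bytes):
    """Wrap a chunk of raw PCM bytes in an OpenAI-style append event.

    Assumption: LocalAI handles this event the way the OpenAI Realtime
    API protocol describes it (base64 audio inside a JSON message).
    """
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })


print(realtime_ws_url())
# Four zero bytes stand in for a real chunk of captured audio.
print(audio_append_event(b"\x00\x00\x00\x00"))
```

The resulting JSON string is what a WebSocket client library of your choice would send as a text message once the connection is open.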

WebRTC

The WebRTC transport enables browser-based voice conversations with lower latency. Connect by POSTing an SDP offer to the REST endpoint:

POST http://localhost:8080/v1/realtime?model=gpt-realtime
Content-Type: application/sdp

<SDP offer body>

The response contains the SDP answer to complete the WebRTC handshake.
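The handshake request above can be sketched with the Python standard library. The helper `sdp_offer_request` is hypothetical, and the offer string is a placeholder: a real offer comes from a WebRTC stack such as a browser's `RTCPeerConnection`. The sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def sdp_offer_request(offer_sdp, base_url="http://localhost:8080",
                      model="gpt-realtime"):
    """Build the POST request that carries an SDP offer to LocalAI."""
    url = f"{base_url}/v1/realtime?{urlencode({'model': model})}"
    return Request(
        url,
        data=offer_sdp.encode("utf-8"),
        headers={"Content-Type": "application/sdp"},
        method="POST",
    )


# Placeholder offer: a real one is generated by your WebRTC library.
req = sdp_offer_request("v=0\r\n")
print(req.get_method(), req.full_url)

# With a running LocalAI instance and a real offer, the answer could be
# read like this (left commented out so the sketch runs offline):
# from urllib.request import urlopen
# with urlopen(req) as resp:
#     answer_sdp = resp.read().decode("utf-8")
```

The SDP answer returned in the response body is then passed back to the WebRTC stack as the remote description.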

Opus backend requirement

WebRTC uses the Opus audio codec for encoding and decoding audio on RTP tracks. The opus backend must be installed for WebRTC to work. Install it from the model gallery:

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{"id": "opus"}'

Or set the EXTERNAL_GRPC_BACKENDS environment variable if running a local build:

EXTERNAL_GRPC_BACKENDS=opus:/path/to/backend/go/opus/opus

The opus backend is loaded automatically when a WebRTC session starts. It does not require a model configuration file, only the backend binary.

Protocol

The API follows the OpenAI Realtime API protocol for handling sessions, audio buffers, and conversation items.