+++
disableToc = false
title = "Audio Transform"
weight = 17
url = "/features/audio-transform/"
+++
The audio-transform endpoints take audio in and emit audio out, optionally conditioned on a second reference audio signal. The category is generic by design — concrete operations include joint acoustic echo cancellation + noise suppression + dereverberation (LocalVQE), voice conversion (reference = target speaker), pitch shifting, audio super-resolution, and so on.
The first shipping backend is LocalVQE, a 1.3 M-parameter GGML-based model that performs joint AEC + noise suppression + dereverberation on 16 kHz mono speech, ~9.6× realtime on a desktop CPU. It is a derivative of the Microsoft DeepVQE paper.
## The mental model
Every audio-transform request carries:

- `audio`: the primary input file (required).
- `reference`: an auxiliary signal whose meaning is backend-specific (optional).
  - For echo cancellation: the loopback / far-end signal played through the speakers.
  - For voice conversion: the target speaker's reference clip.
  - For pitch / style transfer: a tonal or style reference.
  - When omitted, the backend treats it as silence and degrades gracefully (LocalVQE, for example, does denoise + dereverb only when ref is empty).
- `params`: a generic `key=value` map forwarded to the backend.
  - LocalVQE keys: `noise_gate=true|false`, `noise_gate_threshold_dbfs=<float>`.

This shape mirrors WebRTC's ProcessStream(near) / ProcessReverseStream(far)
APM API, NVIDIA Maxine's NvAFX_Run paired-stream signature, and the ICASSP
AEC challenge 2-channel WAV convention.
## Batch endpoint
POST /audio/transformations (alias POST /audio/transform) — multipart
form-data, returns audio bytes.

| Field | Type | Required | Notes |
|---|---|---|---|
| `model` | string | yes | Audio-transform model id (e.g. `localvqe`) |
| `audio` | file | yes | Primary input audio |
| `reference` | file | no | Optional auxiliary signal |
| `response_format` | string | no | `wav` (default), `mp3`, `ogg`, `flac` |
| `sample_rate` | int | no | Desired output sample rate |
| `params[<key>]` | string | no | Repeated; forwarded to backend |

Example (LocalVQE: cancel echo, suppress noise, gate residual):

    curl -X POST http://localhost:8080/audio/transformations \
      -F model=localvqe \
      -F audio=@mic.wav \
      -F reference=@loopback.wav \
      -F 'params[noise_gate]=true' \
      -F 'params[noise_gate_threshold_dbfs]=-50' \
      -o enhanced.wav

When reference is omitted, LocalVQE zero-fills the reference channel and
the operation reduces to noise suppression + dereverberation.
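
The non-file form fields from the table above can also be assembled programmatically. The following is a minimal sketch, not part of LocalAI: `build_form_fields` is a hypothetical helper name, and it only builds the field mapping (the `params[<key>]` naming is taken from the table), leaving the HTTP call to whichever client you use.

```python
from typing import Dict, Optional


def build_form_fields(
    model: str,
    params: Optional[Dict[str, str]] = None,
    response_format: Optional[str] = None,
    sample_rate: Optional[int] = None,
) -> Dict[str, str]:
    """Build the non-file multipart fields for POST /audio/transformations.

    Hypothetical helper for illustration; the audio and reference files are
    attached separately by the HTTP client.
    """
    fields = {"model": model}
    if response_format is not None:
        fields["response_format"] = response_format
    if sample_rate is not None:
        fields["sample_rate"] = str(sample_rate)
    # Backend parameters travel as params[<key>] form fields.
    for key, value in (params or {}).items():
        fields[f"params[{key}]"] = value
    return fields


# Same parameters as the curl example above.
fields = build_form_fields(
    "localvqe",
    params={"noise_gate": "true", "noise_gate_threshold_dbfs": "-50"},
)
```

With a client such as the third-party `requests` package (assumed to be installed), the mapping could be sent as `requests.post(url, data=fields, files={"audio": open("mic.wav", "rb")})`, adding a `reference` entry to `files` when an auxiliary signal is available.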
## Streaming endpoint
GET /audio/transformations/stream — bidirectional WebSocket. The first
client message is a JSON envelope; subsequent client messages are binary
PCM frames; server emits binary PCM frames at the same cadence.
### Wire format
Client → server (text frame, first):

    {
      "type": "session.update",
      "model": "localvqe",
      "sample_format": "S16_LE",
      "sample_rate": 16000,
      "frame_samples": 256,
      "params": { "noise_gate": "true" }
    }

sample_format is S16_LE (16-bit signed little-endian) or F32_LE (32-bit
float little-endian, [-1, 1]). frame_samples defaults to the backend's
preferred hop length (256 = 16 ms for LocalVQE).
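
As a rough illustration, the opening envelope can be built and validated before it is sent as the first text frame. This is a sketch under assumptions: `session_update` is a hypothetical helper, and leaving `frame_samples` out of the message to get the backend default is an assumption about how the default is selected.

```python
import json
from typing import Dict, Optional

# The two sample formats named in the wire-format description.
SAMPLE_FORMATS = ("S16_LE", "F32_LE")


def session_update(
    model: str,
    sample_format: str = "S16_LE",
    sample_rate: int = 16000,
    frame_samples: Optional[int] = None,
    params: Optional[Dict[str, str]] = None,
) -> str:
    """Serialize the first client text frame (hypothetical helper)."""
    if sample_format not in SAMPLE_FORMATS:
        raise ValueError(f"unsupported sample_format: {sample_format!r}")
    message = {
        "type": "session.update",
        "model": model,
        "sample_format": sample_format,
        "sample_rate": sample_rate,
    }
    # Assumption: omitting frame_samples lets the backend pick its hop length.
    if frame_samples is not None:
        message["frame_samples"] = frame_samples
    if params:
        message["params"] = params
    return json.dumps(message)


envelope = session_update(
    "localvqe", frame_samples=256, params={"noise_gate": "true"}
)
```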
Client → server (binary frames, subsequent): interleaved stereo PCM,
channel 0 = audio (mic), channel 1 = reference. Frame size:
frame_samples × 2 channels × sample_size. For S16_LE at 256 samples that
is 1024 bytes per frame; for F32_LE it is 2048 bytes. If the reference is
silent (no auxiliary signal), send zeros on channel 1.
Server → client (binary frames): mono PCM in the same format,
frame_samples × sample_size bytes (512 bytes for S16_LE, 1024 for F32_LE).
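
The frame sizes above follow directly from the sample width, and the interleaving can be done with the standard `struct` module. The sketch below is illustrative only; the helper names are hypothetical, and it assumes the caller already has one hop of samples per channel.

```python
import struct
from typing import List, Optional, Sequence

SAMPLE_SIZE = {"S16_LE": 2, "F32_LE": 4}      # bytes per sample
STRUCT_CODE = {"S16_LE": "h", "F32_LE": "f"}  # struct format characters


def client_frame_bytes(frame_samples: int, sample_format: str) -> int:
    """Stereo client frame: frame_samples x 2 channels x sample_size."""
    return frame_samples * 2 * SAMPLE_SIZE[sample_format]


def server_frame_bytes(frame_samples: int, sample_format: str) -> int:
    """Mono server frame: frame_samples x sample_size."""
    return frame_samples * SAMPLE_SIZE[sample_format]


def pack_client_frame(
    audio: Sequence[float],
    reference: Optional[Sequence[float]] = None,
    sample_format: str = "S16_LE",
) -> bytes:
    """Interleave mic (channel 0) and reference (channel 1) into one frame."""
    if reference is None:
        # No auxiliary signal: send zeros on channel 1.
        reference = [0] * len(audio)
    if len(reference) != len(audio):
        raise ValueError("audio and reference must have the same length")
    interleaved = [s for pair in zip(audio, reference) for s in pair]
    fmt = "<%d%s" % (len(interleaved), STRUCT_CODE[sample_format])
    return struct.pack(fmt, *interleaved)


def unpack_server_frame(data: bytes, sample_format: str = "S16_LE") -> List:
    """Decode a mono server frame back into a list of samples."""
    count = len(data) // SAMPLE_SIZE[sample_format]
    fmt = "<%d%s" % (count, STRUCT_CODE[sample_format])
    return list(struct.unpack(fmt, data))
```

For `S16_LE` the samples must be integers in the signed 16-bit range; for `F32_LE` they are floats in [-1, 1].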
Mid-stream control (text frame): another session.update resets the
streaming state when its reset field is true; a session.close text frame
ends the session cleanly.
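
A minimal sketch of the two control frames follows. The `reset` field and the `session.close` name come from the description above; the exact JSON shape, including whether `session.close` carries only a `type` field and whether a reset must repeat the other session fields, is an assumption here.

```python
import json


def reset_message() -> str:
    """Text frame asking the server to reset its streaming state."""
    return json.dumps({"type": "session.update", "reset": True})


def close_message() -> str:
    """Text frame ending the session (assumed to need only a type field)."""
    return json.dumps({"type": "session.close"})
```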
### Latency
LocalVQE has 16 ms algorithmic latency (one hop). At runtime the per-frame CPU cost depends on the model: ~1.6 ms for the compact 1.3 M models (v1.1/v1.2, ~9.7× realtime) and ~3.3 ms for the wider v1.3 4.8 M model (~4.7× realtime) on a 4-thread modern desktop, leaving the rest of the budget for network and downstream playback.
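
The budget is simple arithmetic: one hop of 256 samples at 16 kHz lasts 16 ms, and whatever the model does not spend on CPU is left for the network and playback. The sketch below uses the approximate per-frame costs quoted above as inputs; the function names are hypothetical.

```python
def hop_ms(frame_samples: int, sample_rate: int) -> float:
    """Duration of one hop in milliseconds."""
    return 1000.0 * frame_samples / sample_rate


def headroom_ms(
    frame_samples: int, sample_rate: int, cpu_ms_per_frame: float
) -> float:
    """Time left in each hop after the model has processed the frame."""
    return hop_ms(frame_samples, sample_rate) - cpu_ms_per_frame


hop = hop_ms(256, 16000)                      # one LocalVQE hop
compact_left = headroom_ms(256, 16000, 1.6)   # compact 1.3 M models
wide_left = headroom_ms(256, 16000, 3.3)      # v1.3 4.8 M model
```

With those inputs the headroom comes out at roughly 14.4 ms per hop for the compact models and roughly 12.7 ms for the wider one.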
## Backend-specific tuning (LocalVQE)

| `params[<key>]` | Type | Default | Effect |
|---|---|---|---|
| `noise_gate` | bool | `false` | Enable post-OLA RMS-based residual-echo gate |
| `noise_gate_threshold_dbfs` | float | `-45.0` | Gate threshold in dBFS; frames below are zeroed |

The gate is most useful in far-end-only / silent-near-end stretches where the
model's residual would otherwise sound like buffering or amplified noise floor.
A reasonable starting point is -50 dBFS.
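
To make the threshold concrete, here is a rough sketch of an RMS gate on float samples in [-1, 1]. It illustrates the general technique only and is not LocalVQE's implementation, which may differ (for example by smoothing or holding the gate across frames); the function names are hypothetical.

```python
import math
from typing import List, Sequence


def rms_dbfs(frame: Sequence[float]) -> float:
    """RMS level of a frame in dBFS, taking full scale as amplitude 1.0."""
    if not frame:
        return float("-inf")
    mean_square = sum(s * s for s in frame) / len(frame)
    if mean_square == 0.0:
        return float("-inf")
    # 10*log10(mean square) equals 20*log10(RMS).
    return 10.0 * math.log10(mean_square)


def gate_frame(frame: Sequence[float], threshold_dbfs: float = -45.0) -> List[float]:
    """Zero the frame if its RMS level falls below the threshold."""
    if rms_dbfs(frame) < threshold_dbfs:
        return [0.0] * len(frame)
    return list(frame)
```

A constant residual at amplitude 0.001 sits at about -60 dBFS and is zeroed with a -50 dBFS threshold, while speech-level content at amplitude 0.1 (about -20 dBFS) passes through unchanged.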
## Configuring a model
LocalVQE ships several weight releases in the gallery: localvqe-v1.3-4.8m
(current default — best quality), localvqe-v1.2-1.3m and localvqe-v1.1-1.3m
(compact, ~¼ the per-hop cost — good for low-core or power-constrained hosts).
All share the same backend and request API; only the model filename differs.

    name: localvqe
    backend: localvqe
    parameters:
      model: localvqe-v1.3-4.8M-f32.gguf
    # Backend-specific defaults can be set in Options[]; per-request
    # params[*] form fields override.
    #
    # `backend` and `device` route through the upstream localvqe options
    # builder so you can force a non-default GGML backend (e.g. `Vulkan`) or
    # pin to a specific GPU index. Leave both unset to keep the CPU default.
    options:
      - noise_gate=true
      - noise_gate_threshold_dbfs=-50
      # - backend=Vulkan
      # - device=0

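
The precedence described in the config comments (model-level `options` as defaults, per-request `params[*]` overriding them) can be pictured as a simple dictionary merge. This is an illustrative sketch with hypothetical helper names, not the code LocalAI uses.

```python
from typing import Dict, Iterable, Optional


def parse_options(options: Iterable[str]) -> Dict[str, str]:
    """Turn a list of key=value strings into a dictionary."""
    parsed: Dict[str, str] = {}
    for item in options:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        parsed[key.strip()] = value.strip()
    return parsed


def effective_params(
    options: Iterable[str],
    request_params: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Model-level defaults first, then per-request values on top."""
    merged = parse_options(options)
    merged.update(request_params or {})
    return merged


defaults = ["noise_gate=true", "noise_gate_threshold_dbfs=-50"]
merged = effective_params(defaults, {"noise_gate_threshold_dbfs": "-40"})
```

Here the request keeps the configured `noise_gate=true` but replaces the threshold with its own `-40`.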
## See also
- [Text to Audio (TTS)]({{< relref "text-to-audio.md" >}})
- [Audio to Text]({{< relref "audio-to-text.md" >}})
- LocalVQE upstream
- DeepVQE paper (Indenbom et al., Interspeech 2023)
