LocalAI/docs/content/features/audio-transform.md

+++
disableToc = false
title = "Audio Transform"
weight = 17
url = "/features/audio-transform/"
+++

[Diagram: audio transform. Two inputs (mic plus reference) become one cleaned output; interleaved stereo on the wire.]

The audio-transform endpoints take audio in and emit audio out, optionally conditioned on a second reference audio signal. The category is generic by design — concrete operations include joint acoustic echo cancellation + noise suppression + dereverberation (LocalVQE), voice conversion (reference = target speaker), pitch shifting, audio super-resolution, and so on.

The first shipping backend is LocalVQE, a 1.3 M-parameter GGML-based model that performs joint AEC + noise suppression + dereverberation on 16 kHz mono speech at ~9.6× realtime on a desktop CPU. It is derived from Microsoft's DeepVQE paper.

The mental model

Every audio-transform request carries:

  • audio — the primary input file (required).
  • reference — an auxiliary signal whose meaning is backend-specific (optional).
    • For echo cancellation: the loopback / far-end signal played through the speakers.
    • For voice conversion: the target speaker's reference clip.
    • For pitch / style transfer: a tonal or style reference.
    • When omitted, the backend treats it as silence and degrades gracefully (LocalVQE, for example, performs only denoising + dereverberation when reference is empty).
  • params — a generic key=value map forwarded to the backend.
    • LocalVQE keys: noise_gate=true|false, noise_gate_threshold_dbfs=<float>.

This shape mirrors WebRTC's ProcessStream(near) / ProcessReverseStream(far) APM API, NVIDIA Maxine's NvAFX_Run paired-stream signature, and the ICASSP AEC challenge 2-channel WAV convention.

Batch endpoint

POST /audio/transformations (alias POST /audio/transform) — multipart form-data, returns audio bytes.

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| model | string | yes | Audio-transform model id (e.g. localvqe) |
| audio | file | yes | Primary input audio |
| reference | file | no | Optional auxiliary signal |
| response_format | string | no | wav (default), mp3, ogg, flac |
| sample_rate | int | no | Desired output sample rate |
| params[<key>] | string | no | Repeated; forwarded to backend |

Example (LocalVQE: cancel echo, suppress noise, gate residual):

curl -X POST http://localhost:8080/audio/transformations \
  -F model=localvqe \
  -F audio=@mic.wav \
  -F reference=@loopback.wav \
  -F 'params[noise_gate]=true' \
  -F 'params[noise_gate_threshold_dbfs]=-50' \
  -o enhanced.wav

When reference is omitted, LocalVQE zero-fills the reference channel and the operation reduces to noise suppression + dereverberation.
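The non-file form fields of the curl example can also be assembled programmatically. The following is a minimal sketch, not LocalAI code: `build_form_fields` is a hypothetical helper that expands a params map into the repeated `params[<key>]` fields listed above. The `audio` and `reference` file parts would be attached separately by whichever HTTP client is used.

```python
def build_form_fields(model, params=None, response_format=None, sample_rate=None):
    """Return the non-file multipart fields as a list of (name, value) pairs.

    Hypothetical helper for illustration; field names follow the table above.
    """
    fields = [("model", model)]
    if response_format is not None:
        fields.append(("response_format", response_format))
    if sample_rate is not None:
        fields.append(("sample_rate", str(sample_rate)))
    for key, value in (params or {}).items():
        # Booleans are sent as the lowercase strings used in the curl example.
        if isinstance(value, bool):
            value = "true" if value else "false"
        fields.append((f"params[{key}]", str(value)))
    return fields


fields = build_form_fields(
    "localvqe",
    params={"noise_gate": True, "noise_gate_threshold_dbfs": -50},
)
```

Keeping the fields as a list of pairs rather than a dict preserves the order and allows repeated names, which is what a multipart encoder expects.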

Streaming endpoint

GET /audio/transformations/stream: bidirectional WebSocket. The first client message is a JSON envelope sent as a text frame; subsequent client messages are binary PCM frames; the server emits binary PCM frames at the same cadence.

Wire format

Client → server (text frame, first):

{
  "type": "session.update",
  "model": "localvqe",
  "sample_format": "S16_LE",
  "sample_rate": 16000,
  "frame_samples": 256,
  "params": { "noise_gate": "true" }
}

sample_format is S16_LE (16-bit signed little-endian) or F32_LE (32-bit float little-endian, [-1, 1]). frame_samples defaults to the backend's preferred hop length (256 = 16 ms for LocalVQE).
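As an illustration, the envelope can be built and sanity-checked before opening the socket. This is a sketch under assumptions: `session_update` and `frame_duration_ms` are hypothetical helper names, and the defaults simply copy the LocalVQE values from the example above.

```python
import json

VALID_FORMATS = ("S16_LE", "F32_LE")


def session_update(model, sample_format="S16_LE", sample_rate=16000,
                   frame_samples=256, params=None):
    """Serialize the first text frame of a streaming session (illustrative)."""
    if sample_format not in VALID_FORMATS:
        raise ValueError(f"unsupported sample_format: {sample_format}")
    envelope = {
        "type": "session.update",
        "model": model,
        "sample_format": sample_format,
        "sample_rate": sample_rate,
        "frame_samples": frame_samples,
        "params": params or {},
    }
    return json.dumps(envelope)


def frame_duration_ms(frame_samples, sample_rate):
    """Duration of one hop in milliseconds."""
    return 1000.0 * frame_samples / sample_rate


message = session_update("localvqe", params={"noise_gate": "true"})
hop_ms = frame_duration_ms(256, 16000)  # 256 samples at 16 kHz is 16 ms
```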

Client → server (binary frames, subsequent): interleaved stereo PCM, channel 0 = audio (mic), channel 1 = reference. Frame size: frame_samples × 2 channels × sample_size. For S16_LE at 256 samples that is 1024 bytes per frame; for F32_LE it is 2048 bytes. If the reference is silent (no auxiliary signal), send zeros on channel 1.

Server → client (binary frames): mono PCM in the same format, frame_samples × sample_size bytes (512 bytes for S16_LE, 1024 for F32_LE).
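The frame layout above can be sketched with the standard `struct` module. The function names are hypothetical; the code only demonstrates the interleaving and the byte counts quoted in the text, assuming integer samples for S16_LE and float samples for F32_LE.

```python
import struct

# struct type codes for the two supported sample formats.
SAMPLE_CODES = {"S16_LE": "h", "F32_LE": "f"}


def pack_client_frame(mic, reference=None, sample_format="S16_LE"):
    """Interleave mic (channel 0) and reference (channel 1) into one frame."""
    code = SAMPLE_CODES[sample_format]
    if reference is None:
        # No auxiliary signal: send zeros on channel 1.
        reference = [0] * len(mic)
    if len(reference) != len(mic):
        raise ValueError("mic and reference must have the same length")
    interleaved = [sample for pair in zip(mic, reference) for sample in pair]
    return struct.pack(f"<{len(interleaved)}{code}", *interleaved)


def unpack_server_frame(data, sample_format="S16_LE"):
    """Decode a mono frame returned by the server into a list of samples."""
    code = SAMPLE_CODES[sample_format]
    count = len(data) // struct.calcsize("<" + code)
    return list(struct.unpack(f"<{count}{code}", data))


s16_frame = pack_client_frame([0] * 256)                            # 1024 bytes
f32_frame = pack_client_frame([0.0] * 256, sample_format="F32_LE")  # 2048 bytes
```

The `<` prefix forces little-endian byte order regardless of the host, which matches the `_LE` suffix of both formats.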

Mid-stream control (text frame): another session.update resets the streaming state when its reset field is true; a session.close text frame ends the session cleanly.

Latency

LocalVQE has 16 ms algorithmic latency (one hop). At runtime the per-frame CPU cost depends on the model: ~1.6 ms for the compact 1.3 M models (v1.1/v1.2, ~9.7× realtime) and ~3.3 ms for the wider v1.3 4.8 M model (~4.7× realtime) on a 4-thread modern desktop, leaving the rest of the budget for network and downstream playback.
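The budget arithmetic can be worked through directly. This sketch uses only the per-frame costs quoted above; `realtime_factor` and `headroom_ms` are hypothetical names. With the rounded costs it gives roughly 10× and 4.8×, in the same range as the quoted ~9.7× and ~4.7× (presumably the quoted per-frame costs are rounded).

```python
HOP_MS = 1000.0 * 256 / 16000  # one 256-sample hop at 16 kHz = 16 ms


def realtime_factor(cpu_ms_per_frame, hop_ms=HOP_MS):
    """How many times faster than realtime a frame is processed."""
    return hop_ms / cpu_ms_per_frame


def headroom_ms(cpu_ms_per_frame, hop_ms=HOP_MS):
    """Time left in each hop for network transfer and playback."""
    return hop_ms - cpu_ms_per_frame


for name, cpu_ms in (("1.3 M", 1.6), ("4.8 M", 3.3)):
    print(f"{name}: {realtime_factor(cpu_ms):.1f}x realtime, "
          f"{headroom_ms(cpu_ms):.1f} ms headroom per hop")
```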

Backend-specific tuning (LocalVQE)

| params[<key>] | Type | Default | Effect |
| --- | --- | --- | --- |
| noise_gate | bool | false | Enable post-OLA RMS-based residual-echo gate |
| noise_gate_threshold_dbfs | float | -45.0 | Gate threshold in dBFS; frames below are zeroed |

The gate is most useful in far-end-only / silent-near-end stretches where the model's residual would otherwise sound like buffering artifacts or an amplified noise floor. A reasonable starting point is -50 dBFS, slightly below the -45 dBFS default.
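To make the gate's behaviour concrete, here is a minimal sketch of an RMS-based gate. It is not LocalVQE's implementation. It assumes float samples in [-1, 1], measures dBFS as RMS relative to a full-scale amplitude of 1.0, and uses the hypothetical names `frame_dbfs` and `apply_gate`.

```python
import math


def frame_dbfs(frame):
    """RMS level of a frame in dBFS, with full scale taken as amplitude 1.0."""
    if not frame:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def apply_gate(frame, threshold_dbfs=-45.0):
    """Zero the whole frame when its RMS level falls below the threshold."""
    if frame_dbfs(frame) < threshold_dbfs:
        return [0.0] * len(frame)
    return list(frame)


quiet = apply_gate([0.001] * 256, threshold_dbfs=-50.0)  # about -60 dBFS: zeroed
loud = apply_gate([0.1] * 256, threshold_dbfs=-50.0)     # about -20 dBFS: kept
```

Lowering the threshold (for example from -45 to -50 dBFS) makes the gate less aggressive, since fewer frames fall below it.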

Configuring a model

LocalVQE ships several weight releases in the gallery: localvqe-v1.3-4.8m (current default, best quality), and localvqe-v1.2-1.3m and localvqe-v1.1-1.3m (compact, roughly ¼ the parameters and about half the per-hop CPU cost going by the latency figures above; good for low-core or power-constrained hosts). All share the same backend and request API; only the model filename differs.

name: localvqe
backend: localvqe
parameters:
  model: localvqe-v1.3-4.8M-f32.gguf

# Backend-specific defaults can be set in Options[]; per-request
# params[*] form fields override.
#
# `backend` and `device` route through the upstream localvqe options
# builder so you can force a non-default GGML backend (e.g. `Vulkan`) or
# pin to a specific GPU index. Leave both unset to keep the CPU default.
options:
- noise_gate=true
- noise_gate_threshold_dbfs=-50
# - backend=Vulkan
# - device=0
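The precedence described in the config comments (model-level options as defaults, per-request params as overrides) can be sketched as a simple merge. This illustrates the stated rule only, not LocalAI's actual merging code; `parse_options` and `effective_params` are hypothetical names.

```python
def parse_options(options):
    """Turn a list of "key=value" strings into a dict of string values."""
    parsed = {}
    for item in options:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {item!r}")
        parsed[key.strip()] = value.strip()
    return parsed


def effective_params(model_options, request_params):
    """Model options act as defaults; request params win on conflict."""
    merged = parse_options(model_options)
    merged.update(request_params)
    return merged


params = effective_params(
    ["noise_gate=true", "noise_gate_threshold_dbfs=-50"],
    {"noise_gate_threshold_dbfs": "-40"},
)
```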

See also