+++
disableToc = false
title = "Audio Transform"
weight = 17
url = "/features/audio-transform/"
+++
![Audio transform: two inputs (mic plus reference) become one cleaned output; interleaved-stereo on the wire](/images/diagrams/audio-transform-io.png)

The audio-transform endpoints take **audio in** and emit **audio out**, optionally
conditioned on a second reference audio signal. The category is generic by
design: concrete operations include joint **acoustic echo cancellation +
noise suppression + dereverberation** (LocalVQE), voice conversion (reference
= target speaker), pitch shifting, and audio super-resolution.

The first shipping backend is [LocalVQE](https://github.com/localai-org/LocalVQE),
a GGML-based model that performs joint AEC + noise suppression
+ dereverberation on 16 kHz mono speech. Its compact 1.3 M-parameter weights run
at roughly 9.7× realtime on a desktop CPU; the wider 4.8 M-parameter v1.3 weights
run at roughly 4.7×. The architecture is derived from Microsoft's DeepVQE paper.
## The mental model
Every audio-transform request carries:
- **`audio`**: the primary input file (required).
- **`reference`**: an auxiliary signal whose meaning is backend-specific (optional).
  - For echo cancellation: the loopback / far-end signal played through the speakers.
  - For voice conversion: the target speaker's reference clip.
  - For pitch / style transfer: a tonal or style reference.
  - When omitted, the backend treats it as silence and degrades gracefully (LocalVQE,
    for example, performs only noise suppression + dereverberation when the
    reference is empty).
- **`params`**: a generic `key=value` map forwarded to the backend.
  - LocalVQE keys: `noise_gate=true|false`, `noise_gate_threshold_dbfs=<float>`.

This shape mirrors WebRTC's `ProcessStream(near)` / `ProcessReverseStream(far)`
APM API, NVIDIA Maxine's `NvAFX_Run` paired-stream signature, and the ICASSP
AEC challenge 2-channel WAV convention.
## Batch endpoint
`POST /audio/transformations` (alias `POST /audio/transform`) accepts multipart
form-data and returns audio bytes.

| Field | Type | Required | Notes |
|---|---|---|---|
| `model` | string | yes | Audio-transform model id (e.g. `localvqe`) |
| `audio` | file | yes | Primary input audio |
| `reference` | file | no | Optional auxiliary signal |
| `response_format` | string | no | `wav` (default), `mp3`, `ogg`, `flac` |
| `sample_rate` | int | no | Desired output sample rate |
| `params[<key>]` | string | no | Repeated; forwarded to backend |

Example (LocalVQE: cancel echo, suppress noise, gate residual):
```bash
curl -X POST http://localhost:8080/audio/transformations \
  -F model=localvqe \
  -F audio=@mic.wav \
  -F reference=@loopback.wav \
  -F 'params[noise_gate]=true' \
  -F 'params[noise_gate_threshold_dbfs]=-50' \
  -o enhanced.wav
```
When `reference` is omitted, LocalVQE zero-fills the reference channel and
the operation reduces to noise suppression + dereverberation.
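
As a minimal sketch of the form layout above, the helper below flattens a dict of backend parameters into the `params[<key>]` field names from the table. `build_form_fields` is a hypothetical name, not part of LocalAI, and rendering booleans as `true` / `false` is an assumption based on the LocalVQE keys listed earlier.

```python
from typing import Dict, List, Tuple, Union

ParamValue = Union[str, int, float, bool]


def build_form_fields(
    model: str,
    params: Dict[str, ParamValue],
    response_format: str = "wav",
) -> List[Tuple[str, str]]:
    """Return (field name, value) pairs for the non-file parts of the form."""
    fields = [("model", model), ("response_format", response_format)]
    for key, value in params.items():
        # bool is checked first because it is a subclass of int in Python.
        if isinstance(value, bool):
            text = "true" if value else "false"
        else:
            text = str(value)
        fields.append((f"params[{key}]", text))
    return fields


fields = build_form_fields(
    "localvqe",
    {"noise_gate": True, "noise_gate_threshold_dbfs": -50},
)
for name, value in fields:
    print(f"{name}={value}")
```

The resulting pairs can be handed to any multipart-capable HTTP client as plain form fields, with `audio` and (optionally) `reference` attached as file parts.
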
## Streaming endpoint
`GET /audio/transformations/stream` — bidirectional WebSocket. The first
client message is a JSON envelope; subsequent client messages are binary
PCM frames; server emits binary PCM frames at the same cadence.
### Wire format
**Client → server** (text frame, first):
```json
{
  "type": "session.update",
  "model": "localvqe",
  "sample_format": "S16_LE",
  "sample_rate": 16000,
  "frame_samples": 256,
  "params": { "noise_gate": "true" }
}
```
`sample_format` is `S16_LE` (16-bit signed little-endian) or `F32_LE` (32-bit
float little-endian, samples in [-1, 1]). `frame_samples` defaults to the backend's
preferred hop length (256 samples = 16 ms at 16 kHz for LocalVQE).

**Client → server** (binary frames, subsequent): interleaved stereo PCM,
channel 0 = audio (mic), channel 1 = reference. Frame size:
`frame_samples × 2 channels × sample_size`. For `S16_LE` at 256 samples that
is 1024 bytes per frame; for `F32_LE` it is 2048 bytes. If the reference is
silent (no auxiliary signal), send zeros on channel 1.

**Server → client** (binary frames): mono PCM in the same format,
`frame_samples × sample_size` bytes (512 bytes for `S16_LE`, 1024 for `F32_LE`).
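
The frame sizes above can be checked with a short sketch that packs one interleaved client frame and unpacks one mono server frame using Python's `struct` module. The helper names (`frame_bytes`, `pack_client_frame`, `unpack_server_frame`) are hypothetical and only illustrate the byte layout described here; they are not part of LocalAI.

```python
import struct
from typing import Optional, Sequence, Tuple

SAMPLE_SIZE = {"S16_LE": 2, "F32_LE": 4}      # bytes per sample
STRUCT_CODE = {"S16_LE": "h", "F32_LE": "f"}  # struct codes, used with "<" (little-endian)


def frame_bytes(sample_format: str, frame_samples: int = 256) -> Tuple[int, int]:
    """Return (client frame size, server frame size) in bytes."""
    size = SAMPLE_SIZE[sample_format]
    return frame_samples * 2 * size, frame_samples * size


def pack_client_frame(
    mic: Sequence[float],
    reference: Optional[Sequence[float]] = None,
    sample_format: str = "S16_LE",
) -> bytes:
    """Interleave mic (channel 0) and reference (channel 1) into one frame."""
    if reference is None:
        # No auxiliary signal: send zeros on channel 1.
        reference = [0] * len(mic)
    if len(reference) != len(mic):
        raise ValueError("mic and reference must have the same length")
    interleaved = []
    for m, r in zip(mic, reference):
        interleaved.append(m)
        interleaved.append(r)
    code = STRUCT_CODE[sample_format]
    return struct.pack(f"<{len(interleaved)}{code}", *interleaved)


def unpack_server_frame(frame: bytes, sample_format: str = "S16_LE") -> tuple:
    """Decode one mono frame returned by the server."""
    count = len(frame) // SAMPLE_SIZE[sample_format]
    return struct.unpack(f"<{count}{STRUCT_CODE[sample_format]}", frame)


print(frame_bytes("S16_LE"), frame_bytes("F32_LE"))
print(len(pack_client_frame([0] * 256)))
```
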

**Mid-stream control** (text frame): another `session.update` resets the
streaming state when its `reset` field is true; a `session.close` text frame
ends the session cleanly.
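
A minimal sketch of the text frames, assuming the `session.close` message uses the same `type` key as `session.update` and that a mid-stream update repeats the session fields (neither is confirmed by this page). `session_update` and `session_close` are hypothetical helper names.

```python
import json
from typing import Dict, Optional


def session_update(
    model: str,
    sample_format: str = "S16_LE",
    sample_rate: int = 16000,
    frame_samples: int = 256,
    params: Optional[Dict[str, str]] = None,
    reset: bool = False,
) -> str:
    """Build the JSON text frame that opens (or resets) a session."""
    message = {
        "type": "session.update",
        "model": model,
        "sample_format": sample_format,
        "sample_rate": sample_rate,
        "frame_samples": frame_samples,
    }
    if params:
        message["params"] = dict(params)
    if reset:
        # Only sent mid-stream, to reset the streaming state.
        message["reset"] = True
    return json.dumps(message)


def session_close() -> str:
    """Build the text frame that ends the session (assumed shape)."""
    return json.dumps({"type": "session.close"})


print(session_update("localvqe", params={"noise_gate": "true"}))
print(session_update("localvqe", reset=True))
print(session_close())
```
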
### Latency
LocalVQE has 16 ms algorithmic latency (one hop). At runtime the per-frame CPU
cost depends on the model: ~1.6 ms for the compact 1.3 M models (v1.1/v1.2,
~9.7× realtime) and ~3.3 ms for the wider v1.3 4.8 M model (~4.7× realtime) on
a 4-thread modern desktop, leaving the rest of the budget for network and
downstream playback.
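
The budget arithmetic can be worked through with the rounded figures quoted above. This is only a sketch: with the rounded per-frame costs it yields about 10× and 4.8× realtime, close to the quoted ~9.7× and ~4.7×, which presumably come from unrounded measurements.

```python
SAMPLE_RATE_HZ = 16_000
HOP_SAMPLES = 256


def hop_ms(hop_samples: int = HOP_SAMPLES, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Duration of one hop in milliseconds."""
    return 1000.0 * hop_samples / sample_rate


def realtime_factor(cpu_ms_per_frame: float) -> float:
    """How many times faster than realtime one frame is processed."""
    return hop_ms() / cpu_ms_per_frame


def headroom_ms(cpu_ms_per_frame: float) -> float:
    """Time left in each hop for network and downstream playback."""
    return hop_ms() - cpu_ms_per_frame


# Per-frame CPU costs are the rounded figures from the text above.
for label, cost in (("1.3 M weights", 1.6), ("4.8 M weights", 3.3)):
    print(
        f"{label}: {realtime_factor(cost):.1f}x realtime, "
        f"{headroom_ms(cost):.1f} ms headroom per {hop_ms():.0f} ms hop"
    )
```
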
## Backend-specific tuning (LocalVQE)
| `params[<key>]` | Type | Default | Effect |
|---|---|---|---|
| `noise_gate` | bool | `false` | Enable the RMS-based residual-echo gate applied after overlap-add (OLA) |
| `noise_gate_threshold_dbfs` | float | `-45.0` | Gate threshold in dBFS; frames below it are zeroed |

The gate is most useful in far-end-only / silent-near-end stretches where the
model's residual would otherwise sound like buffering or an amplified noise floor.
A reasonable starting point is `-50` dBFS.
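
To illustrate what an RMS gate with a dBFS threshold does, here is a minimal sketch operating on float samples in [-1, 1]. It is not LocalVQE's implementation: the function names are hypothetical, and the dBFS reference (an RMS of 1.0 equals 0 dBFS) is an assumption.

```python
import math
from typing import List, Sequence


def rms_dbfs(frame: Sequence[float]) -> float:
    """RMS level of a frame in dB relative to full scale (RMS 1.0 = 0 dBFS)."""
    if not frame:
        return float("-inf")
    mean_square = sum(s * s for s in frame) / len(frame)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def gate_frame(frame: Sequence[float], threshold_dbfs: float = -45.0) -> List[float]:
    """Zero the whole frame when its RMS level falls below the threshold."""
    if rms_dbfs(frame) < threshold_dbfs:
        return [0.0] * len(frame)
    return list(frame)


quiet = [0.001] * 256   # about -60 dBFS, below the default threshold
speech = [0.1] * 256    # about -20 dBFS, above it
print(rms_dbfs(quiet), gate_frame(quiet)[:4])
print(rms_dbfs(speech), gate_frame(speech)[:4])
```

Lowering the threshold (for example to `-50`) zeroes fewer frames, which trades more audible residual for less risk of clipping quiet speech.
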
## Configuring a model
LocalVQE ships several weight releases in the gallery: `localvqe-v1.3-4.8m`
(current default, best quality), plus `localvqe-v1.2-1.3m` and `localvqe-v1.1-1.3m`
(compact, roughly a quarter of the parameters and about half the per-frame CPU
cost going by the latency figures above; good for low-core or power-constrained hosts).
All share the same backend and request API; only the `model` filename differs.
```yaml
name: localvqe
backend: localvqe
parameters:
  model: localvqe-v1.3-4.8M-f32.gguf
# Backend-specific defaults can be set in Options[]; per-request
# params[*] form fields override.
#
# `backend` and `device` route through the upstream localvqe options
# builder so you can force a non-default GGML backend (e.g. `Vulkan`) or
# pin to a specific GPU index. Leave both unset to keep the CPU default.
options:
  - noise_gate=true
  - noise_gate_threshold_dbfs=-50
  # - backend=Vulkan
  # - device=0
```
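
The override rule in the comments above (config `options` act as defaults, per-request `params[*]` win) can be sketched as a simple dictionary merge. This is an illustration of the described precedence, not LocalAI's actual code; `parse_options` and `effective_params` are hypothetical names, and skipping entries without `=` is an assumption.

```python
from typing import Dict, List


def parse_options(options: List[str]) -> Dict[str, str]:
    """Turn `key=value` entries from the model config into a dict."""
    parsed: Dict[str, str] = {}
    for entry in options:
        key, sep, value = entry.partition("=")
        if not sep:
            continue  # assumption: entries without '=' are ignored
        parsed[key.strip()] = value.strip()
    return parsed


def effective_params(options: List[str], request_params: Dict[str, str]) -> Dict[str, str]:
    """Config options are defaults; per-request params override them."""
    merged = parse_options(options)
    merged.update(request_params)
    return merged


config_options = ["noise_gate=true", "noise_gate_threshold_dbfs=-50"]
print(effective_params(config_options, {"noise_gate_threshold_dbfs": "-40"}))
```
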
## See also
- [Text to Audio (TTS)]({{< relref "text-to-audio.md" >}})
- [Audio to Text]({{< relref "audio-to-text.md" >}})
- [LocalVQE upstream](https://github.com/localai-org/LocalVQE)
- [DeepVQE paper (Indenbom et al., Interspeech 2023)](https://arxiv.org/abs/2306.03177)