LocalAI/docs/content/features/audio-transform.md
Richard Palethorpe bb033b16a9 feat: add LocalVQE backend and audio transformations UI (#9640)
feat(audio-transform): add LocalVQE backend, bidi gRPC RPC, Studio UI

Introduce a generic "audio transform" capability for any audio-in / audio-out
operation (echo cancellation, noise suppression, dereverberation, voice
conversion, etc.) and ship LocalVQE as the first backend implementation.

Backend protocol:
- Two new gRPC RPCs in backend.proto: unary AudioTransform for batch and
  bidirectional AudioTransformStream for low-latency frame-by-frame use.
  This is the first bidi stream in the proto; per-frame unary at LocalVQE's
  16 ms hop would be RTT-bound. Wire it through pkg/grpc/{client,server,
  embed,interface,base} with paired-channel ergonomics.

LocalVQE backend (backend/go/localvqe/):
- Go-Purego wrapper around upstream liblocalvqe.so. CMake builds the upstream
  shared lib + its libggml-cpu-*.so runtime variants directly — no MODULE
  wrapper needed because LocalVQE handles CPU feature selection internally
  via GGML_BACKEND_DL.
- Sets GGML_NTHREADS from opts.Threads (or runtime.NumCPU()-1) — without it
  LocalVQE runs single-threaded at ~1× realtime instead of the documented
  ~9.6×.
- Reference-length policy: zero-pad short refs, truncate long ones (the
  trailing portion can't have leaked into a mic that wasn't recording).
- Ginkgo test suite (9 always-on specs + 2 model-gated).

HTTP layer:
- POST /audio/transformations (alias /audio/transform): multipart batch
  endpoint, accepts audio + optional reference + params[*]=v form fields.
  Persists inputs alongside the output in GeneratedContentDir/audio so the
  React UI history can replay past (audio, reference, output) triples.
- GET /audio/transformations/stream: WebSocket bidi, 16 ms PCM frames
  (interleaved stereo mic+ref in, mono out). JSON session.update envelope
  for config; constants hoisted in core/schema/audio_transform.go.
- ffmpeg-based input normalisation to 16 kHz mono s16 WAV via the existing
  utils.AudioToWav (with passthrough fast-path), so the user can upload any
  format / rate without seeing the model's strict 16 kHz constraint.
- BackendTraceAudioTransform integration so /api/backend-traces and the
  Traces UI light up with audio_snippet base64 and timing.
- Routes registered under routes/localai.go (LocalAI extension; OpenAI has
  no /audio/transformations endpoint), traced via TraceMiddleware.

Auth + capability + importer:
- FLAG_AUDIO_TRANSFORM (model_config.go), FeatureAudioTransform (default-on,
  in APIFeatures), three RouteFeatureRegistry rows.
- localvqe added to knownPrefOnlyBackends with modality "audio-transform".
- Gallery entry localvqe-v1-1.3m (sha256-pinned, hosted on
  huggingface.co/LocalAI-io/LocalVQE).

React UI:
- New /app/transform page surfaced via a dedicated "Enhance" sidebar
  section (sibling of Tools / Biometrics) — the page is enhancement, not
  generation, so it lives outside Studio. Two AudioInput components
  (Upload + Record tabs, drag-drop, mic capture).
- Echo-test button: records mic while playing the loaded reference through
  the speakers — the mic naturally picks up speaker bleed, giving a real
  (mic, ref) pair for AEC testing without leaving the UI.
- Reusable WaveformPlayer (canvas peaks + click-to-seek + audio controls)
  and useAudioPeaks hook (shared module-scoped AudioContext to avoid
  hitting browser context limits with three players on one page); migrated
  TTS, Sound, Traces audio blocks to use it.
- Past runs saved in localStorage via useMediaHistory('audio-transform') —
  the history entry stores all three URLs so clicking re-renders the full
  triple, not just the output.

Build + e2e:
- 11 matrix entries removed from .github/workflows/backend.yml (CUDA, ROCm,
  SYCL, Metal, L4T): upstream supports only CPU + Vulkan, so we ship those
  two and let GPU-class hardware route through Vulkan in the gallery
  capabilities map.
- tests-localvqe-grpc-transform job in test-extra.yml (gated on
  detect-changes.outputs.localvqe).
- New audio_transform capability + 4 specs in tests/e2e-backends.
- Playwright spec suite in core/http/react-ui/e2e/audio-transform.spec.js
  (8 specs covering tabs, file upload, multipart shape, history, errors).

Docs:
- New docs/content/features/audio-transform.md covering the (audio,
  reference) mental model, batch + WebSocket wire formats, LocalVQE param
  keys, and a YAML config example. Cross-links from text-to-audio and
  audio-to-text feature pages.

Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent TaskCreate]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-04 22:07:11 +02:00

+++
disableToc = false
title = "Audio Transform"
weight = 17
url = "/features/audio-transform/"
+++
The audio-transform endpoints take **audio in** and emit **audio out**, optionally
conditioned on a second reference audio signal. The category is generic by
design — concrete operations include joint **acoustic echo cancellation +
noise suppression + dereverberation** (LocalVQE), voice conversion (reference
= target speaker), pitch shifting, audio super-resolution, and so on.

The first shipping backend is [LocalVQE](https://github.com/localai-org/LocalVQE),
a 1.3 M-parameter GGML-based model that performs joint AEC + noise suppression
+ dereverberation on 16 kHz mono speech, ~9.6× realtime on a desktop CPU. It
is derived from the architecture described in Microsoft's DeepVQE paper.
## The mental model
Every audio-transform request carries:
- **`audio`** — the primary input file (required).
- **`reference`** — an auxiliary signal whose meaning is backend-specific (optional).
  - For echo cancellation: the loopback / far-end signal played through the speakers.
  - For voice conversion: the target speaker's reference clip.
  - For pitch / style transfer: a tonal or style reference.
  - When omitted, the backend treats it as silence and degrades gracefully (LocalVQE,
    for example, does denoise + dereverb only when the reference is empty).
- **`params`** — a generic `key=value` map forwarded to the backend.
  - LocalVQE keys: `noise_gate=true|false`, `noise_gate_threshold_dbfs=<float>`.

This shape mirrors WebRTC's `ProcessStream(near)` / `ProcessReverseStream(far)`
APM API, NVIDIA Maxine's `NvAFX_Run` paired-stream signature, and the ICASSP
AEC challenge 2-channel WAV convention.
## Batch endpoint
`POST /audio/transformations` (alias `POST /audio/transform`) — multipart
form-data, returns audio bytes.

| Field | Type | Required | Notes |
|---|---|---|---|
| `model` | string | yes | Audio-transform model id (e.g. `localvqe`) |
| `audio` | file | yes | Primary input audio |
| `reference` | file | no | Optional auxiliary signal |
| `response_format` | string | no | `wav` (default), `mp3`, `ogg`, `flac` |
| `sample_rate` | int | no | Desired output sample rate |
| `params[<key>]` | string | no | Repeated; forwarded to backend |

Example (LocalVQE: cancel echo, suppress noise, gate residual):
```bash
curl -X POST http://localhost:8080/audio/transformations \
  -F model=localvqe \
  -F audio=@mic.wav \
  -F reference=@loopback.wav \
  -F 'params[noise_gate]=true' \
  -F 'params[noise_gate_threshold_dbfs]=-50' \
  -o enhanced.wav
```
When `reference` is omitted, LocalVQE zero-fills the reference channel and
the operation reduces to noise suppression + dereverberation.
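
For programmatic use, the same call can be made from any HTTP client. Below is a
minimal sketch using Python's `requests` package; the host, port, and file names
are placeholders, and the optional fields follow the table above.

```python
# Batch-endpoint sketch: assumes `pip install requests` and LocalAI on localhost:8080.
import requests

with open("mic.wav", "rb") as audio, open("loopback.wav", "rb") as reference:
    resp = requests.post(
        "http://localhost:8080/audio/transformations",
        files={"audio": audio, "reference": reference},  # drop "reference" for denoise-only
        data={
            "model": "localvqe",
            "response_format": "wav",                    # or mp3 / ogg / flac
            "params[noise_gate]": "true",
            "params[noise_gate_threshold_dbfs]": "-50",
        },
        timeout=300,
    )
resp.raise_for_status()

with open("enhanced.wav", "wb") as out:
    out.write(resp.content)                              # response body is the audio bytes
```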
## Streaming endpoint
`GET /audio/transformations/stream` — bidirectional WebSocket. The first
client message is a JSON envelope; subsequent client messages are binary
PCM frames; the server emits binary PCM frames at the same cadence.
### Wire format
**Client → server** (text frame, first):
```json
{
"type": "session.update",
"model": "localvqe",
"sample_format": "S16_LE",
"sample_rate": 16000,
"frame_samples": 256,
"params": { "noise_gate": "true" }
}
```
`sample_format` is `S16_LE` (16-bit signed little-endian) or `F32_LE` (32-bit
float little-endian, [-1, 1]). `frame_samples` defaults to the backend's
preferred hop length (256 = 16 ms for LocalVQE).

**Client → server** (binary frames, subsequent): interleaved stereo PCM,
channel 0 = audio (mic), channel 1 = reference. Frame size:
`frame_samples × 2 channels × sample_size`. For `S16_LE` at 256 samples that
is 1024 bytes per frame; for `F32_LE` it is 2048 bytes. If the reference is
silent (no auxiliary signal), send zeros on channel 1.
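
As a concrete illustration of the client frame layout, the sketch below interleaves
one mic frame and one reference frame into a single `S16_LE` binary message (NumPy
is used here only for convenience; it is not required by the API).

```python
# Frame-layout sketch for S16_LE at the default 256-sample hop.
import numpy as np

FRAME_SAMPLES = 256                                    # 16 ms at 16 kHz
mic = np.zeros(FRAME_SAMPLES, dtype=np.int16)          # channel 0: primary audio
ref = np.zeros(FRAME_SAMPLES, dtype=np.int16)          # channel 1: reference (zeros if absent)

stereo = np.empty(FRAME_SAMPLES * 2, dtype=np.int16)   # interleave mic[0], ref[0], mic[1], ...
stereo[0::2] = mic
stereo[1::2] = ref

frame = stereo.astype("<i2").tobytes()                 # force little-endian (S16_LE)
assert len(frame) == FRAME_SAMPLES * 2 * 2             # 1024 bytes per client frame
# The matching server reply is mono: FRAME_SAMPLES * 2 = 512 bytes per frame.
```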
**Server → client** (binary frames): mono PCM in the same format,
`frame_samples × sample_size` bytes (512 bytes for `S16_LE`, 1024 for `F32_LE`).

**Mid-stream control** (text frame): another `session.update` resets the
streaming state when its `reset` field is true; a `session.close` text frame
ends the session cleanly.
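
Putting the pieces together, here is a sketch of a full streaming session using the
Python `websockets` package. The host and port are assumptions, the input is synthetic
silence, and the loop relies on the server returning one mono frame per input frame as
described above.

```python
# Streaming-session sketch: session.update, N binary frames in, N mono frames out,
# then session.close. Assumes `pip install websockets` and LocalAI on localhost:8080.
import asyncio
import json
import websockets

URI = "ws://localhost:8080/audio/transformations/stream"
FRAME_SAMPLES = 256
FRAME_BYTES_IN = FRAME_SAMPLES * 2 * 2     # interleaved stereo S16_LE: 1024 bytes

async def main():
    async with websockets.connect(URI) as ws:
        # 1. Configure the session (text frame).
        await ws.send(json.dumps({
            "type": "session.update",
            "model": "localvqe",
            "sample_format": "S16_LE",
            "sample_rate": 16000,
            "frame_samples": FRAME_SAMPLES,
            "params": {"noise_gate": "true"},
        }))

        # 2. Stream roughly one second of silence (62 frames) and collect the output.
        enhanced = bytearray()
        for _ in range(62):
            await ws.send(bytes(FRAME_BYTES_IN))   # binary frame: ch0 = mic, ch1 = reference
            enhanced.extend(await ws.recv())       # binary frame: mono, 512 bytes

        # 3. End the session cleanly (text frame).
        await ws.send(json.dumps({"type": "session.close"}))

    print(f"received {len(enhanced)} bytes of enhanced mono PCM")

asyncio.run(main())
```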
### Latency
LocalVQE has 16 ms of algorithmic latency (one hop). At runtime it spends
~1.66 ms of CPU time per 16 ms frame on a modern desktop, which is where the
~9.6× realtime figure comes from and leaves the rest of the budget for network
and downstream playback.
## Backend-specific tuning (LocalVQE)
| `params[<key>]` | Type | Default | Effect |
|---|---|---|---|
| `noise_gate` | bool | `false` | Enable post-OLA RMS-based residual-echo gate |
| `noise_gate_threshold_dbfs` | float | `-45.0` | Gate threshold in dBFS; frames below are zeroed |

The gate is most useful in far-end-only / silent-near-end stretches where the
model's residual would otherwise sound like buffering or amplified noise floor.
A reasonable starting point is `-50` dBFS.
## Configuring a model
```yaml
name: localvqe
backend: localvqe
parameters:
  model: localvqe-v1.1-1.3M-f32.gguf
# Backend-specific defaults can be set in Options[]; per-request
# params[*] form fields override.
#
# `backend` and `device` route through the upstream localvqe options
# builder so you can force a non-default GGML backend (e.g. `Vulkan`) or
# pin to a specific GPU index. Leave both unset to keep the CPU default.
options:
  - noise_gate=true
  - noise_gate_threshold_dbfs=-50
  # - backend=Vulkan
  # - device=0
```
## See also
- [Text to Audio (TTS)]({{< relref "tts.md" >}})
- [Audio to Text]({{< relref "audio-to-text.md" >}})
- [LocalVQE upstream](https://github.com/localai-org/LocalVQE)
- [DeepVQE paper (Indenbom et al., Interspeech 2023)](https://arxiv.org/abs/2306.03177)