feat: add LocalVQE backend and audio transformations UI (#9640)

feat(audio-transform): add LocalVQE backend, bidi gRPC RPC, Studio UI

Introduce a generic "audio transform" capability for any audio-in / audio-out operation (echo cancellation, noise suppression, dereverberation, voice conversion, etc.) and ship LocalVQE as the first backend implementation.

Backend protocol:
- Two new gRPC RPCs in backend.proto: unary AudioTransform for batch and bidirectional AudioTransformStream for low-latency frame-by-frame use. This is the first bidi stream in the proto; per-frame unary at LocalVQE's 16 ms hop would be RTT-bound. Wire it through pkg/grpc/{client,server,embed,interface,base} with paired-channel ergonomics.

LocalVQE backend (backend/go/localvqe/):
- Go-Purego wrapper around upstream liblocalvqe.so. CMake builds the upstream shared lib + its libggml-cpu-*.so runtime variants directly; no MODULE wrapper is needed because LocalVQE handles CPU feature selection internally via GGML_BACKEND_DL.
- Sets GGML_NTHREADS from opts.Threads (or runtime.NumCPU()-1); without it LocalVQE runs single-threaded at ~1× realtime instead of the documented ~9.6×.
- Reference-length policy: zero-pad short refs, truncate long ones (the trailing portion can't have leaked into a mic that wasn't recording).
- Ginkgo test suite (9 always-on specs + 2 model-gated).

HTTP layer:
- POST /audio/transformations (alias /audio/transform): multipart batch endpoint, accepts audio + optional reference + params[*]=v form fields. Persists inputs alongside the output in GeneratedContentDir/audio so the React UI history can replay past (audio, reference, output) triples.
- GET /audio/transformations/stream: WebSocket bidi, 16 ms PCM frames (interleaved stereo mic+ref in, mono out). JSON session.update envelope for config; constants hoisted in core/schema/audio_transform.go.
- ffmpeg-based input normalisation to 16 kHz mono s16 WAV via the existing utils.AudioToWav (with passthrough fast-path), so the user can upload any format / rate without seeing the model's strict 16 kHz constraint.
- BackendTraceAudioTransform integration so /api/backend-traces and the Traces UI light up with audio_snippet base64 and timing.
- Routes registered under routes/localai.go (LocalAI extension; OpenAI has no /audio/transformations endpoint), traced via TraceMiddleware.

Auth + capability + importer:
- FLAG_AUDIO_TRANSFORM (model_config.go), FeatureAudioTransform (default-on, in APIFeatures), three RouteFeatureRegistry rows.
- localvqe added to knownPrefOnlyBackends with modality "audio-transform".
- Gallery entry localvqe-v1-1.3m (sha256-pinned, hosted on huggingface.co/LocalAI-io/LocalVQE).

React UI:
- New /app/transform page surfaced via a dedicated "Enhance" sidebar section (sibling of Tools / Biometrics); the page is enhancement, not generation, so it lives outside Studio. Two AudioInput components (Upload + Record tabs, drag-drop, mic capture).
- Echo-test button: records mic while playing the loaded reference through the speakers; the mic naturally picks up speaker bleed, giving a real (mic, ref) pair for AEC testing without leaving the UI.
- Reusable WaveformPlayer (canvas peaks + click-to-seek + audio controls) and useAudioPeaks hook (shared module-scoped AudioContext to avoid hitting browser context limits with three players on one page); migrated TTS, Sound, Traces audio blocks to use it.
- Past runs saved in localStorage via useMediaHistory('audio-transform'); the history entry stores all three URLs so clicking re-renders the full triple, not just the output.

Build + e2e:
- 11 matrix entries removed from .github/workflows/backend.yml (CUDA, ROCm, SYCL, Metal, L4T): upstream supports only CPU + Vulkan, so we ship those two and let GPU-class hardware route through Vulkan in the gallery capabilities map.
- tests-localvqe-grpc-transform job in test-extra.yml (gated on detect-changes.outputs.localvqe).
- New audio_transform capability + 4 specs in tests/e2e-backends.
- Playwright spec suite in core/http/react-ui/e2e/audio-transform.spec.js (8 specs covering tabs, file upload, multipart shape, history, errors).

Docs:
- New docs/content/features/audio-transform.md covering the (audio, reference) mental model, batch + WebSocket wire formats, LocalVQE param keys, and a YAML config example. Cross-links from text-to-audio and audio-to-text feature pages.

Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent TaskCreate]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-18 21:45:01 -04:00 · 2026-05-04 21:07:11 +01:00
parent de83b72bb7
commit bb033b16a9
59 changed files with 3923 additions and 86 deletions
--- a/docs/content/features/audio-to-text.md
+++ b/docs/content/features/audio-to-text.md
@@ -154,3 +154,7 @@ curl http://localhost:8080/v1/audio/transcriptions \
  -F file="@jfk.wav" \
  -F model="qwen3-asr"
 ```
+
+## See also
+
+- [Audio Transform]({{< relref "audio-transform.md" >}}) — clean up the audio (echo cancellation, noise suppression, dereverberation) before passing it to a transcription model.
--- a/docs/content/features/audio-transform.md
+++ b/docs/content/features/audio-transform.md
@@ -0,0 +1,147 @@
++++
+disableToc = false
+title = "Audio Transform"
+weight = 17
+url = "/features/audio-transform/"
++++
+
+The audio-transform endpoints take **audio in** and emit **audio out**, optionally
+conditioned on a second reference audio signal. The category is generic by
+design — concrete operations include joint **acoustic echo cancellation +
+noise suppression + dereverberation** (LocalVQE), voice conversion (reference
+= target speaker), pitch shifting, audio super-resolution, and so on.
+
+The first shipping backend is [LocalVQE](https://github.com/localai-org/LocalVQE),
+a 1.3 M-parameter GGML-based model that performs joint AEC + noise suppression
++ dereverberation on 16 kHz mono speech at ~9.6× realtime on a desktop CPU. It
+is derived from the architecture described in Microsoft's DeepVQE paper.
+
+## The mental model
+
+Every audio-transform request carries:
+
+- **`audio`** — the primary input file (required).
+- **`reference`** — an auxiliary signal whose meaning is backend-specific (optional).
+  - For echo cancellation: the loopback / far-end signal played through the speakers.
+  - For voice conversion: the target speaker's reference clip.
+  - For pitch / style transfer: a tonal or style reference.
+  - When omitted, the backend treats it as silence and degrades gracefully (LocalVQE,
+    for example, performs only denoising + dereverberation when the reference is empty).
+- **`params`** — a generic `key=value` map forwarded to the backend.
+  - LocalVQE keys: `noise_gate=true|false`, `noise_gate_threshold_dbfs=<float>`.
+
+This shape mirrors WebRTC's `ProcessStream(near)` / `ProcessReverseStream(far)`
+APM API, NVIDIA Maxine's `NvAFX_Run` paired-stream signature, and the ICASSP
+AEC challenge 2-channel WAV convention.
+
+## Batch endpoint
+
+`POST /audio/transformations` (alias `POST /audio/transform`) — multipart
+form-data, returns audio bytes.
+
+| Field | Type | Required | Notes |
+|---|---|---|---|
+| `model` | string | yes | Audio-transform model id (e.g. `localvqe`) |
+| `audio` | file   | yes | Primary input audio |
+| `reference` | file | no | Optional auxiliary signal |
+| `response_format` | string | no | `wav` (default), `mp3`, `ogg`, `flac` |
+| `sample_rate` | int | no | Desired output sample rate |
+| `params[<key>]` | string | no | Repeated; forwarded to backend |
+
+Example (LocalVQE: cancel echo, suppress noise, gate residual):
+
+```bash
+curl -X POST http://localhost:8080/audio/transformations \
+  -F model=localvqe \
+  -F audio=@mic.wav \
+  -F reference=@loopback.wav \
+  -F 'params[noise_gate]=true' \
+  -F 'params[noise_gate_threshold_dbfs]=-50' \
+  -o enhanced.wav
+```
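The `params[<key>]` naming convention from the table above can be sketched as a small helper that flattens a parameter map into multipart text fields. This is only an illustration: `build_form_fields` is a hypothetical name, and the choice to serialise booleans as `true` / `false` is an assumption based on the curl example. The `audio` and `reference` file parts would be attached separately by whatever HTTP client is used.

```python
def build_form_fields(model, params=None, response_format=None, sample_rate=None):
    """Return the text fields of the multipart form as (name, value) tuples.

    Hypothetical helper: file parts (audio, reference) are not handled here.
    """
    fields = [("model", model)]
    if response_format is not None:
        fields.append(("response_format", response_format))
    if sample_rate is not None:
        fields.append(("sample_rate", str(sample_rate)))
    for key, value in (params or {}).items():
        # Assumption: booleans travel as the strings "true" / "false".
        if isinstance(value, bool):
            value = "true" if value else "false"
        fields.append(("params[%s]" % key, str(value)))
    return fields


fields = build_form_fields(
    "localvqe",
    params={"noise_gate": True, "noise_gate_threshold_dbfs": -50},
)
```

The resulting list carries the same text fields as the curl example, so it can be compared against a captured request when debugging a client.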
+
+When `reference` is omitted, LocalVQE zero-fills the reference channel and
+the operation reduces to noise suppression + dereverberation.
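The commit message for this change describes the reference-length policy as "zero-pad short refs, truncate long ones". A minimal sketch of that policy on plain sample lists follows; `align_reference` is a hypothetical name and this is not the backend's actual Go code.

```python
def align_reference(reference, audio_len):
    """Return a reference signal with exactly audio_len samples.

    A missing reference is treated as silence; a short one is zero-padded;
    a long one is truncated to the length of the primary audio.
    """
    reference = list(reference or [])
    if len(reference) >= audio_len:
        return reference[:audio_len]
    return reference + [0] * (audio_len - len(reference))


padded = align_reference([1, 2, 3], 5)
truncated = align_reference([1, 2, 3, 4], 2)
silent = align_reference(None, 3)
```

Truncation is safe for echo cancellation because reference audio played after the microphone stopped recording cannot appear in the microphone signal.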
+
+## Streaming endpoint
+
+`GET /audio/transformations/stream` — bidirectional WebSocket. The first
+client message is a JSON envelope; subsequent client messages are binary
+PCM frames; the server emits binary PCM frames at the same cadence.
+
+### Wire format
+
+**Client → server** (text frame, first):
+
+```json
+{
+  "type": "session.update",
+  "model": "localvqe",
+  "sample_format": "S16_LE",
+  "sample_rate": 16000,
+  "frame_samples": 256,
+  "params": { "noise_gate": "true" }
+}
+```
+
+`sample_format` is `S16_LE` (16-bit signed little-endian) or `F32_LE` (32-bit
+float little-endian, [-1, 1]). `frame_samples` defaults to the backend's
+preferred hop length (256 = 16 ms for LocalVQE).
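As a rough illustration of the envelope above, the following sketch builds the first text frame with the standard `json` module. The function name `session_update` and the client-side validation are assumptions for this example; the server's own validation rules are not described here.

```python
import json

ALLOWED_FORMATS = ("S16_LE", "F32_LE")


def session_update(model, sample_format="S16_LE", sample_rate=16000,
                   frame_samples=256, params=None):
    """Serialise a session.update envelope as a JSON string (hypothetical helper)."""
    if sample_format not in ALLOWED_FORMATS:
        raise ValueError("unsupported sample_format: %r" % sample_format)
    message = {
        "type": "session.update",
        "model": model,
        "sample_format": sample_format,
        "sample_rate": sample_rate,
        "frame_samples": frame_samples,
    }
    if params:
        # The documented example sends param values as strings.
        message["params"] = {key: str(value) for key, value in params.items()}
    return json.dumps(message)


envelope = session_update("localvqe", params={"noise_gate": "true"})
```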
+
+**Client → server** (binary frames, subsequent): interleaved stereo PCM,
+channel 0 = audio (mic), channel 1 = reference. Frame size:
+`frame_samples × 2 channels × sample_size`. For `S16_LE` at 256 samples that
+is 1024 bytes per frame; for `F32_LE` it is 2048 bytes. If the reference is
+silent (no auxiliary signal), send zeros on channel 1.
+
+**Server → client** (binary frames): mono PCM in the same format,
+`frame_samples × sample_size` bytes (512 bytes for `S16_LE`, 1024 for `F32_LE`).
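The frame sizes quoted above follow directly from `frame_samples × channels × sample_size`. The sketch below reproduces that arithmetic and shows one way to interleave mic and reference samples into a stereo `S16_LE` frame using the standard `struct` module. The helper names are hypothetical, and only the `S16_LE` case is packed here.

```python
import struct

SAMPLE_SIZE = {"S16_LE": 2, "F32_LE": 4}  # bytes per sample


def frame_bytes(frame_samples, sample_format, channels):
    """Size in bytes of one PCM frame."""
    return frame_samples * channels * SAMPLE_SIZE[sample_format]


def interleave_s16(mic, ref):
    """Pack equal-length int16 sample lists as stereo: channel 0 = mic, channel 1 = ref."""
    if len(mic) != len(ref):
        raise ValueError("mic and ref must have the same number of samples")
    samples = [s for pair in zip(mic, ref) for s in pair]
    return struct.pack("<%dh" % len(samples), *samples)


def unpack_mono_s16(payload):
    """Unpack a mono S16_LE server frame into a list of ints."""
    return list(struct.unpack("<%dh" % (len(payload) // 2), payload))


# No reference available: send zeros on channel 1.
client_frame = interleave_s16([0] * 256, [0] * 256)
```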
+
+**Mid-stream control** (text frame): another `session.update` resets the
+streaming state when its `reset` field is true; a `session.close` text frame
+ends the session cleanly.
+
+### Latency
+
+LocalVQE has 16 ms of algorithmic latency (one hop). At runtime, each frame
+takes ~1.66 ms of CPU time on a modern desktop, leaving the rest of the budget
+for network and downstream playback.
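The latency figures can be cross-checked with a short calculation. This sketch assumes the ~9.6× realtime factor quoted earlier and that CPU cost is spread evenly across frames; the variable names are illustrative only.

```python
SAMPLE_RATE_HZ = 16000
FRAME_SAMPLES = 256
REALTIME_FACTOR = 9.6  # documented throughput on a desktop CPU

# Duration of one hop in milliseconds.
hop_ms = FRAME_SAMPLES * 1000 / SAMPLE_RATE_HZ

# CPU time per frame if processing runs REALTIME_FACTOR times faster than realtime.
cpu_ms_per_frame = hop_ms / REALTIME_FACTOR

# Time left in each hop for network transfer and playback.
headroom_ms = hop_ms - cpu_ms_per_frame
```

Under these assumptions the per-frame CPU cost comes out at roughly 1.67 ms, in line with the ~1.66 ms figure, which leaves a little over 14 ms of each hop for everything else.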
+
+## Backend-specific tuning (LocalVQE)
+
+| `params[<key>]` | Type | Default | Effect |
+|---|---|---|---|
+| `noise_gate` | bool | `false` | Enable post-OLA RMS-based residual-echo gate |
+| `noise_gate_threshold_dbfs` | float | `-45.0` | Gate threshold in dBFS; frames below are zeroed |
+
+The gate is most useful in far-end-only / silent-near-end stretches where the
+model's residual would otherwise sound like buffering artifacts or an amplified
+noise floor. A reasonable starting point is `-50` dBFS.
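To make the threshold concrete, here is a minimal sketch of a generic RMS gate over float samples in [-1, 1]. It illustrates the general technique only and is not LocalVQE's implementation; the function names and the convention that an RMS of 1.0 equals 0 dBFS are assumptions.

```python
import math


def rms_dbfs(frame):
    """RMS level of a frame in dBFS, assuming an RMS of 1.0 maps to 0 dBFS."""
    if not frame:
        return float("-inf")
    mean_square = sum(s * s for s in frame) / len(frame)
    if mean_square == 0:
        return float("-inf")
    return 10 * math.log10(mean_square)


def gate_frame(frame, threshold_dbfs=-45.0):
    """Zero the whole frame when its RMS level is below the threshold."""
    if rms_dbfs(frame) < threshold_dbfs:
        return [0.0] * len(frame)
    return list(frame)


quiet = gate_frame([0.001] * 256, threshold_dbfs=-50.0)  # about -60 dBFS, gated
speech = gate_frame([0.1] * 256, threshold_dbfs=-50.0)   # about -20 dBFS, kept
```

Lowering the threshold from the default makes the gate less aggressive: fewer frames fall below it, so quiet speech is less likely to be zeroed, at the cost of letting more low-level residual through.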
+
+## Configuring a model
+
+```yaml
+name: localvqe
+backend: localvqe
+parameters:
+  model: localvqe-v1.1-1.3M-f32.gguf
+
+# Backend-specific defaults can be set in `options`; per-request
+# params[*] form fields override them.
+#
+# `backend` and `device` route through the upstream localvqe options
+# builder so you can force a non-default GGML backend (e.g. `Vulkan`) or
+# pin to a specific GPU index. Leave both unset to keep the CPU default.
+options:
+- noise_gate=true
+- noise_gate_threshold_dbfs=-50
+# - backend=Vulkan
+# - device=0
+```
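The precedence described in the config comments (model-level `options` as defaults, per-request `params[*]` as overrides) can be sketched as a simple merge. This is an illustration of the stated behaviour, not LocalAI's code; `parse_options` and `effective_params` are hypothetical names.

```python
def parse_options(options):
    """Turn ["key=value", ...] entries into a dict, skipping entries without '='."""
    parsed = {}
    for option in options:
        key, sep, value = option.partition("=")
        if not sep:
            continue
        parsed[key.strip()] = value.strip()
    return parsed


def effective_params(model_options, request_params):
    """Model-level defaults first, then per-request params on top."""
    merged = parse_options(model_options)
    merged.update(request_params)
    return merged


merged = effective_params(
    ["noise_gate=true", "noise_gate_threshold_dbfs=-50"],
    {"noise_gate_threshold_dbfs": "-40"},
)
```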
+
+## See also
+
+- [Text to Audio (TTS)]({{< relref "tts.md" >}})
+- [Audio to Text]({{< relref "audio-to-text.md" >}})
+- [LocalVQE upstream](https://github.com/localai-org/LocalVQE)
+- [DeepVQE paper (Indenbom et al., Interspeech 2023)](https://arxiv.org/abs/2306.03177)
--- a/docs/content/whats-new.md
+++ b/docs/content/whats-new.md
@@ -12,6 +12,7 @@ You can see the release notes [here](https://github.com/mudler/LocalAI/releases)

 ## 2026 Highlights

+- **April 2026**: [Audio Transform](/features/audio-transform/) — generic audio-in / audio-out endpoint with optional reference signal. First implementation: [LocalVQE](https://github.com/localai-org/LocalVQE) C++ backend (joint AEC + noise suppression + dereverberation, DeepVQE-style). Both batch (`POST /audio/transformations`) and bidirectional WebSocket streaming (`/audio/transformations/stream`). Studio "Transform" tab with synchronized waveform players for input / reference / output.
 - **April 2026**: [Face recognition backend](/features/face-recognition/) — `insightface`-powered 1:1 verification, 1:N identification, face embedding, face detection, and demographic analysis. Ships both a non-commercial `buffalo_l` model and an Apache 2.0 OpenCV Zoo alternative.

 ## 2024 Highlights