LocalAI/docs/content/features/audio-transform.md
Richard Palethorpe bb033b16a9 feat: add LocalVQE backend and audio transformations UI (#9640)
feat(audio-transform): add LocalVQE backend, bidi gRPC RPC, Studio UI

Introduce a generic "audio transform" capability for any audio-in / audio-out
operation (echo cancellation, noise suppression, dereverberation, voice
conversion, etc.) and ship LocalVQE as the first backend implementation.

Backend protocol:
- Two new gRPC RPCs in backend.proto: unary AudioTransform for batch and
  bidirectional AudioTransformStream for low-latency frame-by-frame use.
  This is the first bidi stream in the proto; per-frame unary at LocalVQE's
  16 ms hop would be RTT-bound. Wire it through pkg/grpc/{client,server,
  embed,interface,base} with paired-channel ergonomics.

LocalVQE backend (backend/go/localvqe/):
- Go-Purego wrapper around upstream liblocalvqe.so. CMake builds the upstream
  shared lib + its libggml-cpu-*.so runtime variants directly — no MODULE
  wrapper needed because LocalVQE handles CPU feature selection internally
  via GGML_BACKEND_DL.
- Sets GGML_NTHREADS from opts.Threads (or runtime.NumCPU()-1) — without it
  LocalVQE runs single-threaded at ~1× realtime instead of the documented
  ~9.6×.
- Reference-length policy: zero-pad short refs, truncate long ones (the
  trailing portion can't have leaked into a mic that wasn't recording).
- Ginkgo test suite (9 always-on specs + 2 model-gated).

HTTP layer:
- POST /audio/transformations (alias /audio/transform): multipart batch
  endpoint, accepts audio + optional reference + params[*]=v form fields.
  Persists inputs alongside the output in GeneratedContentDir/audio so the
  React UI history can replay past (audio, reference, output) triples.
- GET /audio/transformations/stream: WebSocket bidi, 16 ms PCM frames
  (interleaved stereo mic+ref in, mono out). JSON session.update envelope
  for config; constants hoisted in core/schema/audio_transform.go.
- ffmpeg-based input normalisation to 16 kHz mono s16 WAV via the existing
  utils.AudioToWav (with passthrough fast-path), so the user can upload any
  format / rate without seeing the model's strict 16 kHz constraint.
- BackendTraceAudioTransform integration so /api/backend-traces and the
  Traces UI light up with audio_snippet base64 and timing.
- Routes registered under routes/localai.go (LocalAI extension; OpenAI has
  no /audio/transformations endpoint), traced via TraceMiddleware.

Auth + capability + importer:
- FLAG_AUDIO_TRANSFORM (model_config.go), FeatureAudioTransform (default-on,
  in APIFeatures), three RouteFeatureRegistry rows.
- localvqe added to knownPrefOnlyBackends with modality "audio-transform".
- Gallery entry localvqe-v1-1.3m (sha256-pinned, hosted on
  huggingface.co/LocalAI-io/LocalVQE).

React UI:
- New /app/transform page surfaced via a dedicated "Enhance" sidebar
  section (sibling of Tools / Biometrics) — the page is enhancement, not
  generation, so it lives outside Studio. Two AudioInput components
  (Upload + Record tabs, drag-drop, mic capture).
- Echo-test button: records mic while playing the loaded reference through
  the speakers — the mic naturally picks up speaker bleed, giving a real
  (mic, ref) pair for AEC testing without leaving the UI.
- Reusable WaveformPlayer (canvas peaks + click-to-seek + audio controls)
  and useAudioPeaks hook (shared module-scoped AudioContext to avoid
  hitting browser context limits with three players on one page); migrated
  TTS, Sound, Traces audio blocks to use it.
- Past runs saved in localStorage via useMediaHistory('audio-transform') —
  the history entry stores all three URLs so clicking re-renders the full
  triple, not just the output.

Build + e2e:
- 11 matrix entries removed from .github/workflows/backend.yml (CUDA, ROCm,
  SYCL, Metal, L4T): upstream supports only CPU + Vulkan, so we ship those
  two and let GPU-class hardware route through Vulkan in the gallery
  capabilities map.
- tests-localvqe-grpc-transform job in test-extra.yml (gated on
  detect-changes.outputs.localvqe).
- New audio_transform capability + 4 specs in tests/e2e-backends.
- Playwright spec suite in core/http/react-ui/e2e/audio-transform.spec.js
  (8 specs covering tabs, file upload, multipart shape, history, errors).

Docs:
- New docs/content/features/audio-transform.md covering the (audio,
  reference) mental model, batch + WebSocket wire formats, LocalVQE param
  keys, and a YAML config example. Cross-links from text-to-audio and
  audio-to-text feature pages.

Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent TaskCreate]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-04 22:07:11 +02:00

+++
disableToc = false
title = "Audio Transform"
weight = 17
url = "/features/audio-transform/"
+++

The audio-transform endpoints take audio in and emit audio out, optionally conditioned on a second reference audio signal. The category is generic by design — concrete operations include joint acoustic echo cancellation + noise suppression + dereverberation (LocalVQE), voice conversion (reference = target speaker), pitch shifting, audio super-resolution, and so on.

The first shipping backend is LocalVQE, a 1.3 M-parameter GGML-based model that performs joint AEC + noise suppression + dereverberation on 16 kHz mono speech, ~9.6× realtime on a desktop CPU. It is a derivative of the Microsoft DeepVQE paper.

The mental model

Every audio-transform request carries:

  • audio — the primary input file (required).
  • reference — an auxiliary signal whose meaning is backend-specific (optional).
    • For echo cancellation: the loopback / far-end signal played through the speakers.
    • For voice conversion: the target speaker's reference clip.
    • For pitch / style transfer: a tonal or style reference.
    • When omitted, the backend treats it as silence and degrades gracefully (LocalVQE, for example, does denoise + dereverb only when ref is empty).
  • params — a generic key=value map forwarded to the backend.
    • LocalVQE keys: noise_gate=true|false, noise_gate_threshold_dbfs=<float>.

This shape mirrors WebRTC's ProcessStream(near) / ProcessReverseStream(far) APM API, NVIDIA Maxine's NvAFX_Run paired-stream signature, and the ICASSP AEC challenge 2-channel WAV convention.
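The optional-reference rule can be sketched in a few lines. The helper below is hypothetical (`align_reference` is not a LocalAI function) and assumes samples are plain Python sequences. It follows the policy described for LocalVQE: a missing reference becomes silence, a short one is zero-padded, and a long one is truncated to the length of the primary audio.

```python
from typing import Optional, Sequence


def align_reference(audio: Sequence[float],
                    reference: Optional[Sequence[float]]) -> list:
    """Return a reference signal with exactly len(audio) samples.

    Illustrative only: the real backend does this internally.
    """
    n = len(audio)
    if reference is None:
        # No auxiliary signal: treat the reference as silence.
        return [0.0] * n
    # Truncate a long reference, then zero-pad a short one.
    ref = list(reference[:n])
    ref.extend([0.0] * (n - len(ref)))
    return ref
```

Other backends may give the reference a different meaning (a target-speaker clip, for instance), in which case length alignment like this would not apply.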

Batch endpoint

POST /audio/transformations (alias POST /audio/transform) — multipart form-data, returns audio bytes.

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| model | string | yes | Audio-transform model id (e.g. localvqe) |
| audio | file | yes | Primary input audio |
| reference | file | no | Optional auxiliary signal |
| response_format | string | no | wav (default), mp3, ogg, flac |
| sample_rate | int | no | Desired output sample rate |
| params[&lt;key&gt;] | string | no | Repeated; forwarded to backend |

Example (LocalVQE: cancel echo, suppress noise, gate residual):

curl -X POST http://localhost:8080/audio/transformations \
  -F model=localvqe \
  -F audio=@mic.wav \
  -F reference=@loopback.wav \
  -F 'params[noise_gate]=true' \
  -F 'params[noise_gate_threshold_dbfs]=-50' \
  -o enhanced.wav

When reference is omitted, LocalVQE zero-fills the reference channel and the operation reduces to noise suppression + dereverberation.
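For clients that build the request programmatically, the non-file form fields can be assembled as below. This is a minimal sketch: `build_form_fields` is a hypothetical helper, and the only thing taken from the endpoint description is the field naming, in particular that each backend parameter travels as its own `params[<key>]` field.

```python
from typing import Mapping, Optional


def build_form_fields(model: str,
                      params: Optional[Mapping[str, str]] = None,
                      response_format: Optional[str] = None,
                      sample_rate: Optional[int] = None) -> dict:
    """Build the non-file form fields for POST /audio/transformations."""
    fields = {"model": model}
    if response_format is not None:
        fields["response_format"] = response_format
    if sample_rate is not None:
        fields["sample_rate"] = str(sample_rate)
    # Each backend parameter becomes a separate params[<key>] field.
    for key, value in (params or {}).items():
        fields[f"params[{key}]"] = value
    return fields
```

The resulting dict can be passed as the form data of any multipart-capable HTTP client, with `audio` and (optionally) `reference` attached as file parts.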

Streaming endpoint

GET /audio/transformations/stream — bidirectional WebSocket. The first client message is a JSON envelope; subsequent client messages are binary PCM frames; server emits binary PCM frames at the same cadence.

Wire format

Client → server (text frame, first):

{
  "type": "session.update",
  "model": "localvqe",
  "sample_format": "S16_LE",
  "sample_rate": 16000,
  "frame_samples": 256,
  "params": { "noise_gate": "true" }
}

sample_format is S16_LE (16-bit signed little-endian) or F32_LE (32-bit float little-endian, [-1, 1]). frame_samples defaults to the backend's preferred hop length (256 = 16 ms for LocalVQE).
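A client might produce this envelope with a small helper such as the one below. It is a sketch only: `session_update` and `hop_ms` are hypothetical names, and the helper always sends `frame_samples` explicitly even though the server is described as defaulting it.

```python
import json
from typing import Optional


def session_update(model: str,
                   sample_format: str = "S16_LE",
                   sample_rate: int = 16000,
                   frame_samples: int = 256,
                   params: Optional[dict] = None) -> str:
    """Serialise the first text frame of a streaming session."""
    if sample_format not in ("S16_LE", "F32_LE"):
        raise ValueError(f"unsupported sample_format: {sample_format}")
    envelope = {
        "type": "session.update",
        "model": model,
        "sample_format": sample_format,
        "sample_rate": sample_rate,
        "frame_samples": frame_samples,
        "params": params or {},
    }
    return json.dumps(envelope)


def hop_ms(frame_samples: int, sample_rate: int) -> float:
    """Duration of one frame in milliseconds."""
    return 1000.0 * frame_samples / sample_rate
```

With the LocalVQE defaults, `hop_ms(256, 16000)` gives the 16 ms hop quoted above.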

Client → server (binary frames, subsequent): interleaved stereo PCM, channel 0 = audio (mic), channel 1 = reference. Frame size: frame_samples × 2 channels × sample_size. For S16_LE at 256 samples that is 1024 bytes per frame; for F32_LE it is 2048 bytes. If the reference is silent (no auxiliary signal), send zeros on channel 1.
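The interleaving can be illustrated with the standard `struct` module. The function name `pack_s16le_frame` is hypothetical, and the sketch covers only the S16_LE case with samples given as Python integers.

```python
import struct
from typing import Optional, Sequence


def pack_s16le_frame(mic: Sequence[int],
                     ref: Optional[Sequence[int]] = None) -> bytes:
    """Interleave mic (channel 0) and reference (channel 1) as S16_LE."""
    if ref is None:
        # No auxiliary signal: send zeros on channel 1.
        ref = [0] * len(mic)
    if len(ref) != len(mic):
        raise ValueError("mic and ref must have the same number of samples")
    # Sample order on the wire: mic[0], ref[0], mic[1], ref[1], ...
    interleaved = [s for pair in zip(mic, ref) for s in pair]
    return struct.pack(f"<{len(interleaved)}h", *interleaved)
```

A 256-sample mic frame packed this way yields the 1024-byte payload stated above.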

Server → client (binary frames): mono PCM in the same format, frame_samples × sample_size bytes (512 bytes for S16_LE, 1024 for F32_LE).
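The frame sizes on both directions follow from one formula, and a returned mono frame can be decoded the same way it was encoded. The names below (`frame_bytes`, `unpack_mono_frame`) are hypothetical and the code is a sketch rather than part of any LocalAI client.

```python
import struct

SAMPLE_SIZE = {"S16_LE": 2, "F32_LE": 4}      # bytes per sample
STRUCT_CODE = {"S16_LE": "h", "F32_LE": "f"}  # struct format characters


def frame_bytes(frame_samples: int, sample_format: str, channels: int) -> int:
    """Size of one binary frame: samples x channels x sample size."""
    return frame_samples * channels * SAMPLE_SIZE[sample_format]


def unpack_mono_frame(data: bytes, sample_format: str) -> list:
    """Decode a server frame (mono, little-endian) into a list of samples."""
    size = SAMPLE_SIZE[sample_format]
    if len(data) % size:
        raise ValueError("frame length is not a whole number of samples")
    count = len(data) // size
    return list(struct.unpack(f"<{count}{STRUCT_CODE[sample_format]}", data))
```

Client frames use `channels=2` and server frames use `channels=1`, which reproduces the 1024/2048 and 512/1024 byte figures.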

Mid-stream control (text frame): another session.update resets the streaming state when its reset field is true; a session.close text frame ends the session cleanly.
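The two control messages could be produced as follows. This sketch makes two assumptions that the description above does not confirm: that a reset needs no fields beyond `type` and `reset`, and that `session.close` is carried in the same `type` key as `session.update`. The function names are hypothetical.

```python
import json


def reset_message() -> str:
    """Text frame asking the server to reset its streaming state.

    Assumption: no other fields are required alongside "reset".
    """
    return json.dumps({"type": "session.update", "reset": True})


def close_message() -> str:
    """Text frame ending the session cleanly.

    Assumption: the message name goes in the "type" key.
    """
    return json.dumps({"type": "session.close"})
```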

Latency

LocalVQE has 16 ms algorithmic latency (one hop). At runtime, each frame takes ~1.66 ms of CPU time on a modern desktop, leaving the rest of the budget for network and downstream playback.
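The two figures are consistent with the ~9.6× realtime claim, as a quick calculation shows. The variable names are illustrative and the inputs are the numbers quoted in this section.

```python
HOP_MS = 16.0            # algorithmic latency: one 256-sample hop at 16 kHz
CPU_MS_PER_FRAME = 1.66  # approximate processing time per frame

# How many times faster than realtime the model runs.
realtime_factor = HOP_MS / CPU_MS_PER_FRAME

# Time left in each hop for network transfer and playback.
headroom_ms = HOP_MS - CPU_MS_PER_FRAME

print(f"realtime factor: {realtime_factor:.1f}x")
print(f"headroom per frame: {headroom_ms:.2f} ms")
```

The realtime factor comes out at roughly 9.6, with about 14.3 ms of each hop left over.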

Backend-specific tuning (LocalVQE)

| params[&lt;key&gt;] | Type | Default | Effect |
| --- | --- | --- | --- |
| noise_gate | bool | false | Enable post-OLA RMS-based residual-echo gate |
| noise_gate_threshold_dbfs | float | -45.0 | Gate threshold in dBFS; frames below are zeroed |

The gate is most useful in far-end-only / silent-near-end stretches where the model's residual would otherwise sound like buffering or amplified noise floor. A reasonable starting point is -50 dBFS.
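To make the threshold concrete, here is a minimal sketch of an RMS gate of the kind described. It is not LocalVQE's implementation: it assumes float samples in [-1, 1], operates on one frame at a time, and uses hypothetical function names.

```python
import math
from typing import Sequence


def rms_dbfs(frame: Sequence[float]) -> float:
    """RMS level of a frame in dBFS (0 dBFS = full-scale RMS of 1.0)."""
    if not frame:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(rms) if rms > 0.0 else float("-inf")


def gate_frame(frame: Sequence[float], threshold_dbfs: float = -45.0) -> list:
    """Zero the frame if its RMS level falls below the threshold."""
    if rms_dbfs(frame) < threshold_dbfs:
        return [0.0] * len(frame)
    return list(frame)
```

A frame of constant amplitude 0.001 sits at about -60 dBFS and would be zeroed at either -45 or -50, while one at amplitude 0.1 (about -20 dBFS) passes through unchanged.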

Configuring a model

name: localvqe
backend: localvqe
parameters:
  model: localvqe-v1.1-1.3M-f32.gguf

# Backend-specific defaults can be set in Options[]; per-request
# params[*] form fields override.
#
# `backend` and `device` route through the upstream localvqe options
# builder so you can force a non-default GGML backend (e.g. `Vulkan`) or
# pin to a specific GPU index. Leave both unset to keep the CPU default.
options:
- noise_gate=true
- noise_gate_threshold_dbfs=-50
# - backend=Vulkan
# - device=0
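The precedence rule in the comments above (config `options` supply defaults, per-request `params[*]` override) can be illustrated as a simple merge. This is a sketch of the stated behaviour, not LocalAI's code, and both function names are hypothetical.

```python
from typing import Iterable, Mapping


def parse_options(options: Iterable[str]) -> dict:
    """Turn ["key=value", ...] entries into a dict of strings."""
    parsed = {}
    for opt in options:
        key, sep, value = opt.partition("=")
        if not sep:
            raise ValueError(f"option is not key=value: {opt!r}")
        parsed[key.strip()] = value.strip()
    return parsed


def effective_params(options: Iterable[str],
                     request_params: Mapping[str, str]) -> dict:
    """Config options act as defaults; request params win on conflict."""
    merged = parse_options(options)
    merged.update(request_params)
    return merged
```

With the config above, a request carrying `params[noise_gate_threshold_dbfs]=-40` would run with the gate still enabled but the threshold at -40.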

See also