LocalAI/docs/content/features/audio-transform.md
Richard Palethorpe 718223f33b feat(localvqe/audio): v1.3 release and add spectrograms to audio transform UI (#10113)
* chore(localvqe): update backend to v1.3, add v1.2/v1.3 gallery models

Bump the LocalVQE backend pin 72bfb4c6 -> b0f0378a, which adds the v1.2
(1.3 M) and v1.3 (4.8 M) GGUF SHA-256s to the upstream released-models
allowlist (and the arch_version=3 loader) so both load without
LOCALVQE_ALLOW_UNHASHED.

Add gallery entries for localvqe-v1.2-1.3m and localvqe-v1.3-4.8m
(SHA-256 verified against the downloaded weights) and update the
audio-transform docs to make v1.3 the current default while noting the
compact v1.1/v1.2 alternatives.

Assisted-by: Claude:claude-opus-4-8 Claude-Code
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(flake): add ffmpeg-headless to the dev shell

pkg/utils/ffmpeg_test.go shells out to the `ffmpeg` CLI, and the
pre-commit gate runs those tests via `make test-coverage`. Without
ffmpeg in the dev shell the gate fails with "executable file not found
in $PATH". The headless build provides the CLI without GUI/X deps.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(localvqe): parse WAV by walking RIFF sub-chunks

Walk the RIFF chunk list instead of assuming the canonical 44-byte
header layout. Real inputs (browser-recorded clips, ffmpeg output with
an 18/40-byte extensible `fmt ` chunk or trailing LIST/INFO metadata)
would otherwise splice header/metadata bytes into the PCM stream as an
audible impulse. Honour the `data` chunk size and validate that both
`fmt ` and `data` chunks are present.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(security-headers): allow blob: in connect-src for waveform fetch

The waveform renderer XHRs/fetches a freshly-created blob: object URL
(e.g. an uploaded or enhanced clip before it has a server URL). XHR/fetch
of blob: is governed by connect-src, not media-src, so it was blocked by
the CSP. Add blob: to connect-src.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(react-ui): add input/output spectrogram view to AudioTransform

The transform page only showed time-domain amplitude waveforms, so you
could see how loud a clip was but not which frequencies the model
touched. Add a time x frequency spectrogram heatmap and render the input
and output spectrums side by side, so it's visible which bands the
enhancement attenuates (bright input bands that go dark in the output).

Computed client-side via a Hann-windowed STFT over both clips (a small
dependency-free radix-2 FFT), defaulting to the LocalVQE 512/256 frame
geometry. This shows the net input->output spectral change; the model's
internal gain mask is not exposed by the backend.

- src/utils/fft.js            radix-2 FFT
- src/hooks/useSpectrogram.js decode + STFT -> normalised dB magnitude grid
- src/components/audio/Spectrogram.jsx  canvas heatmap (magma colormap)
- AudioTransform.jsx          dual-spectrogram panel + CSS
- e2e spec + UI coverage baseline bump (38.29 -> 39.0; measured ~39.4-40.2)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(react-ui): make UI coverage deterministic, tighten the gate

UI e2e line coverage swung ~1pp run-to-run (39.1% <-> 40.2%), which forced
a loose 0.8pp tolerance on the monotonic gate — a band wide enough to let
a real ~300-line regression through silently. The swing was a bug, not
inherent jitter: the 'Create Agent navigates' spec ended on the URL
assertion, so AgentCreate.jsx's ~400 lines were collected only when its
render happened to beat the coverage teardown.

Wait for the page to actually render (assert its heading) so those lines
are covered every run. With the race gone, repeated runs land within
~0.013pp of each other, so:

- tighten UI_COVERAGE_TOLERANCE 0.8 -> 0.1 (noise floor, not a drift band)
- set the baseline to the real, reliably-achieved value (39.0 -> 39.86)

Localised by running the V8-coverage suite repeatedly and diffing per-file
line coverage; AgentCreate.jsx was the sole ~1pp flipper.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-31 23:56:46 +02:00


+++
disableToc = false
title = "Audio Transform"
weight = 17
url = "/features/audio-transform/"
+++

The audio-transform endpoints take audio in and emit audio out, optionally conditioned on a second reference audio signal. The category is generic by design — concrete operations include joint acoustic echo cancellation + noise suppression + dereverberation (LocalVQE), voice conversion (reference = target speaker), pitch shifting, audio super-resolution, and so on.

The first shipping backend is LocalVQE, a 1.3 M-parameter GGML-based model that performs joint AEC + noise suppression + dereverberation on 16 kHz mono speech, ~9.6× realtime on a desktop CPU. It is a derivative of the Microsoft DeepVQE paper.

The mental model

Every audio-transform request carries:

  • audio — the primary input file (required).
  • reference — an auxiliary signal whose meaning is backend-specific (optional).
    • For echo cancellation: the loopback / far-end signal played through the speakers.
    • For voice conversion: the target speaker's reference clip.
    • For pitch / style transfer: a tonal or style reference.
    • When omitted, the backend treats it as silence and degrades gracefully (LocalVQE, for example, performs only denoising + dereverberation when the reference is empty).
  • params — a generic key=value map forwarded to the backend.
    • LocalVQE keys: noise_gate=true|false, noise_gate_threshold_dbfs=<float>.

This shape mirrors WebRTC's ProcessStream(near) / ProcessReverseStream(far) APM API, NVIDIA Maxine's NvAFX_Run paired-stream signature, and the ICASSP AEC challenge 2-channel WAV convention.

Batch endpoint

POST /audio/transformations (alias POST /audio/transform) — multipart form-data, returns audio bytes.

| Field | Type | Required | Notes |
|---|---|---|---|
| model | string | yes | Audio-transform model id (e.g. localvqe) |
| audio | file | yes | Primary input audio |
| reference | file | no | Optional auxiliary signal |
| response_format | string | no | wav (default), mp3, ogg, flac |
| sample_rate | int | no | Desired output sample rate |
| params[<key>] | string | no | Repeated; forwarded to backend |

Example (LocalVQE: cancel echo, suppress noise, gate residual):

curl -X POST http://localhost:8080/audio/transformations \
  -F model=localvqe \
  -F audio=@mic.wav \
  -F reference=@loopback.wav \
  -F 'params[noise_gate]=true' \
  -F 'params[noise_gate_threshold_dbfs]=-50' \
  -o enhanced.wav

When reference is omitted, LocalVQE zero-fills the reference channel and the operation reduces to noise suppression + dereverberation.
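The same request can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper names `encode_multipart` and `build_transform_request` and the placeholder filenames `audio.wav` / `reference.wav` are illustrative assumptions, not part of LocalAI. It shows how each `params[<key>]` entry becomes its own form field and how the `reference` part is simply left out when no auxiliary signal is available.

```python
import urllib.request
import uuid


def encode_multipart(fields, files):
    """Encode text fields and file parts as multipart/form-data.

    fields: list of (name, value) string pairs
    files:  list of (name, filename, payload_bytes) triples
    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields:
        chunks.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode("utf-8")
        )
    for name, filename, payload in files:
        chunks.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"; '
             f'filename="{filename}"\r\n'
             "Content-Type: application/octet-stream\r\n\r\n").encode("utf-8")
        )
        chunks.append(payload)
        chunks.append(b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def build_transform_request(base_url, model, audio, reference=None, params=None):
    """Build (but do not send) a POST to /audio/transformations."""
    fields = [("model", model)]
    # Each backend parameter is sent as its own params[<key>] form field.
    for key, value in (params or {}).items():
        fields.append((f"params[{key}]", str(value)))
    # Filenames are placeholders chosen for this sketch.
    files = [("audio", "audio.wav", audio)]
    if reference is not None:
        files.append(("reference", "reference.wav", reference))
    body, content_type = encode_multipart(fields, files)
    return urllib.request.Request(
        f"{base_url}/audio/transformations",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


# Sending the request (requires a running server), for example:
#   req = build_transform_request("http://localhost:8080", "localvqe",
#                                 open("mic.wav", "rb").read(),
#                                 params={"noise_gate": "true"})
#   with urllib.request.urlopen(req) as resp:
#       open("enhanced.wav", "wb").write(resp.read())
```

Parameter values are passed as strings (`"true"`, not Python's `True`), matching the `key=value` form used in the curl example.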

Streaming endpoint

GET /audio/transformations/stream opens a bidirectional WebSocket. The first client message is a JSON envelope; subsequent client messages are binary PCM frames; the server emits binary PCM frames at the same cadence.

Wire format

Client → server (text frame, first):

{
  "type": "session.update",
  "model": "localvqe",
  "sample_format": "S16_LE",
  "sample_rate": 16000,
  "frame_samples": 256,
  "params": { "noise_gate": "true" }
}

sample_format is S16_LE (16-bit signed little-endian) or F32_LE (32-bit float little-endian, [-1, 1]). frame_samples defaults to the backend's preferred hop length (256 = 16 ms for LocalVQE).

Client → server (binary frames, subsequent): interleaved stereo PCM, channel 0 = audio (mic), channel 1 = reference. Frame size: frame_samples × 2 channels × sample_size. For S16_LE at 256 samples that is 1024 bytes per frame; for F32_LE it is 2048 bytes. If the reference is silent (no auxiliary signal), send zeros on channel 1.

Server → client (binary frames): mono PCM in the same format, frame_samples × sample_size bytes (512 bytes for S16_LE, 1024 for F32_LE).
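The frame layout above can be sketched with Python's `struct` module. The helper names `pack_client_frame` and `unpack_server_frame` are hypothetical; the sketch only illustrates the interleaving (mic sample, then reference sample, repeated) and the resulting byte counts, and it zero-fills channel 1 when no reference is supplied.

```python
import struct

# Bytes per sample and struct type code for each wire format.
SAMPLE_SIZE = {"S16_LE": 2, "F32_LE": 4}
STRUCT_CODE = {"S16_LE": "h", "F32_LE": "f"}


def pack_client_frame(mic, reference=None, sample_format="S16_LE"):
    """Interleave mic (channel 0) and reference (channel 1) into one frame."""
    n = len(mic)
    if reference is None:
        reference = [0] * n  # silent reference: zeros on channel 1
    if len(reference) != n:
        raise ValueError("mic and reference must have the same length")
    interleaved = [sample for pair in zip(mic, reference) for sample in pair]
    # "<" selects little-endian with no padding.
    return struct.pack(f"<{2 * n}{STRUCT_CODE[sample_format]}", *interleaved)


def unpack_server_frame(payload, sample_format="S16_LE"):
    """Decode a mono server frame into a list of samples."""
    size = SAMPLE_SIZE[sample_format]
    if len(payload) % size:
        raise ValueError("payload is not a whole number of samples")
    count = len(payload) // size
    return list(struct.unpack(f"<{count}{STRUCT_CODE[sample_format]}", payload))
```

With 256 samples per frame, `pack_client_frame` yields 1024 bytes for S16_LE and 2048 bytes for F32_LE, matching the sizes stated above.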

Mid-stream control (text frame): another session.update resets the streaming state when its reset field is true; a session.close text frame ends the session cleanly.
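A client might build the text frames as below. This is a sketch under assumptions: the function names are hypothetical, and it assumes a mid-stream reset repeats the session fields alongside `"reset": true` and that `session.close` needs no fields beyond `type`; the text above confirms only the `type` values and the `reset` field.

```python
import json

VALID_SAMPLE_FORMATS = {"S16_LE", "F32_LE"}


def session_update(model, sample_format="S16_LE", sample_rate=16000,
                   frame_samples=256, params=None, reset=False):
    """Serialise a session.update envelope as a JSON text frame."""
    if sample_format not in VALID_SAMPLE_FORMATS:
        raise ValueError(f"unsupported sample_format: {sample_format}")
    message = {
        "type": "session.update",
        "model": model,
        "sample_format": sample_format,
        "sample_rate": sample_rate,
        "frame_samples": frame_samples,
    }
    if params:
        message["params"] = dict(params)
    if reset:
        # Assumption: reset travels in the same envelope as the other fields.
        message["reset"] = True
    return json.dumps(message)


def session_close():
    """Serialise the text frame that ends the session cleanly."""
    return json.dumps({"type": "session.close"})
```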

Latency

LocalVQE has 16 ms algorithmic latency (one hop). At runtime the per-frame CPU cost depends on the model: ~1.6 ms for the compact 1.3 M models (v1.1/v1.2, ~9.7× realtime) and ~3.3 ms for the wider v1.3 4.8 M model (~4.7× realtime) on a 4-thread modern desktop, leaving the rest of the budget for network and downstream playback.
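The arithmetic behind these figures is worked through below. The function names are illustrative. Because the per-frame costs quoted above are rounded, the computed factors (about 10× and 4.8×) land near, not exactly on, the quoted ~9.7× and ~4.7×.

```python
SAMPLE_RATE_HZ = 16_000
HOP_SAMPLES = 256


def hop_ms(hop_samples=HOP_SAMPLES, sample_rate_hz=SAMPLE_RATE_HZ):
    """Duration of one hop in milliseconds (the algorithmic latency)."""
    return 1000.0 * hop_samples / sample_rate_hz


def realtime_factor(cpu_ms_per_frame):
    """How many times faster than realtime a frame is processed."""
    return hop_ms() / cpu_ms_per_frame


def headroom_ms(cpu_ms_per_frame):
    """Budget left per hop for network and downstream playback."""
    return hop_ms() - cpu_ms_per_frame


for label, cost in (("1.3 M models", 1.6), ("v1.3 4.8 M", 3.3)):
    print(f"{label}: {realtime_factor(cost):.1f}x realtime, "
          f"{headroom_ms(cost):.1f} ms headroom per {hop_ms():.0f} ms hop")
```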

Backend-specific tuning (LocalVQE)

| params[<key>] | Type | Default | Effect |
|---|---|---|---|
| noise_gate | bool | false | Enable post-OLA RMS-based residual-echo gate |
| noise_gate_threshold_dbfs | float | -45.0 | Gate threshold in dBFS; frames below are zeroed |

The gate is most useful in far-end-only / silent-near-end stretches, where the model's residual would otherwise sound like buffering artifacts or an amplified noise floor. A reasonable starting point is -50 dBFS.
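To make the threshold concrete, here is a minimal sketch of an RMS gate in dBFS. It is an illustration of the general technique, not LocalVQE's actual implementation: it assumes float samples where full scale is 1.0, and the function names are hypothetical.

```python
import math


def rms_dbfs(frame):
    """RMS level of a frame in dBFS, assuming full scale is 1.0."""
    if not frame:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(rms) if rms > 0.0 else float("-inf")


def apply_noise_gate(frame, threshold_dbfs=-45.0):
    """Zero the frame when its RMS level falls below the threshold."""
    if rms_dbfs(frame) < threshold_dbfs:
        return [0.0] * len(frame)
    return list(frame)
```

A frame with constant amplitude 0.001 sits at about -60 dBFS and is zeroed at either -45 or -50 dBFS, while a frame at amplitude 0.1 (about -20 dBFS) passes through unchanged.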

Configuring a model

LocalVQE ships several weight releases in the gallery: localvqe-v1.3-4.8m (current default, best quality), and localvqe-v1.2-1.3m and localvqe-v1.1-1.3m (compact, ~¼ the parameters and roughly half the per-hop CPU cost according to the Latency figures above; good for low-core or power-constrained hosts). All share the same backend and request API; only the model filename differs.

name: localvqe
backend: localvqe
parameters:
  model: localvqe-v1.3-4.8M-f32.gguf

# Backend-specific defaults can be set in Options[]; per-request
# params[*] form fields override.
#
# `backend` and `device` route through the upstream localvqe options
# builder so you can force a non-default GGML backend (e.g. `Vulkan`) or
# pin to a specific GPU index. Leave both unset to keep the CPU default.
options:
- noise_gate=true
- noise_gate_threshold_dbfs=-50
# - backend=Vulkan
# - device=0
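The precedence described in the comments (model-level `options` as defaults, per-request `params[*]` fields overriding them) can be pictured with the sketch below. It is a hypothetical illustration of that rule only; the function names are invented here and this is not how LocalAI or the backend actually merges the values.

```python
def parse_options(options):
    """Turn ["key=value", ...] entries into a dict of strings."""
    parsed = {}
    for item in options:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {item!r}")
        parsed[key.strip()] = value.strip()
    return parsed


def effective_params(options, request_params):
    """Model-level defaults first, then per-request overrides on top."""
    merged = parse_options(options)
    merged.update(request_params)
    return merged
```

For example, with the options above and a request carrying `params[noise_gate_threshold_dbfs]=-40`, the gate stays enabled and the threshold becomes -40.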

See also