+++
disableToc = false
title = "Audio Transform"
weight = 17
url = "/features/audio-transform/"
+++
The audio-transform endpoints take audio in and emit audio out, optionally conditioned on a second reference audio signal. The category is generic by design — concrete operations include joint acoustic echo cancellation + noise suppression + dereverberation (LocalVQE), voice conversion (reference = target speaker), pitch shifting, audio super-resolution, and so on.
The first shipping backend is LocalVQE, a 1.3 M-parameter GGML-based model that performs joint AEC + noise suppression + dereverberation on 16 kHz mono speech at ~9.6× realtime on a desktop CPU. It is a derivative of the Microsoft DeepVQE paper.
The mental model
Every audio-transform request carries:
- `audio`: the primary input file (required).
- `reference`: an auxiliary signal whose meaning is backend-specific (optional).
  - For echo cancellation: the loopback / far-end signal played through the speakers.
  - For voice conversion: the target speaker's reference clip.
  - For pitch / style transfer: a tonal or style reference.
  - When omitted, the backend treats it as silence and degrades gracefully (LocalVQE, for example, performs only denoising and dereverberation when the reference is empty).
- `params`: a generic `key=value` map forwarded to the backend.
  - LocalVQE keys: `noise_gate=true|false`, `noise_gate_threshold_dbfs=<float>`.
This shape mirrors WebRTC's ProcessStream(near) / ProcessReverseStream(far)
APM API, NVIDIA Maxine's NvAFX_Run paired-stream signature, and the ICASSP
AEC challenge 2-channel WAV convention.
Batch endpoint
POST /audio/transformations (alias POST /audio/transform) — multipart
form-data, returns audio bytes.
| Field | Type | Required | Notes |
|---|---|---|---|
| `model` | string | yes | Audio-transform model id (e.g. `localvqe`) |
| `audio` | file | yes | Primary input audio |
| `reference` | file | no | Optional auxiliary signal |
| `response_format` | string | no | `wav` (default), `mp3`, `ogg`, `flac` |
| `sample_rate` | int | no | Desired output sample rate |
| `params[<key>]` | string | no | Repeated; forwarded to backend |
Example (LocalVQE: cancel echo, suppress noise, gate residual):

    curl -X POST http://localhost:8080/audio/transformations \
      -F model=localvqe \
      -F audio=@mic.wav \
      -F reference=@loopback.wav \
      -F 'params[noise_gate]=true' \
      -F 'params[noise_gate_threshold_dbfs]=-50' \
      -o enhanced.wav

When reference is omitted, LocalVQE zero-fills the reference channel and
the operation reduces to noise suppression + dereverberation.
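The `params[<key>]` convention from the table above can be sketched as a small helper that flattens a Python dict into the text fields of the multipart form. This is only an illustration: `build_form_fields` is a hypothetical name, not part of any client library, and writing booleans in lowercase is an assumption based on the `noise_gate=true|false` form shown earlier.

```python
def build_form_fields(model, params=None, response_format=None, sample_rate=None):
    """Return the text (non-file) multipart fields as (name, value) pairs."""
    fields = [("model", model)]
    if response_format is not None:
        fields.append(("response_format", response_format))
    if sample_rate is not None:
        fields.append(("sample_rate", str(sample_rate)))
    for key, value in (params or {}).items():
        # Assumption: booleans are sent as lowercase "true"/"false".
        if isinstance(value, bool):
            value = "true" if value else "false"
        # Each backend parameter becomes its own params[<key>] field.
        fields.append((f"params[{key}]", str(value)))
    return fields


fields = build_form_fields(
    "localvqe",
    params={"noise_gate": True, "noise_gate_threshold_dbfs": -50},
)
for name, value in fields:
    print(f"-F '{name}={value}'")
```

The printed lines match the text fields of the curl example; the `audio` and `reference` files would be attached separately by whatever HTTP client is in use.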
Streaming endpoint
GET /audio/transformations/stream — bidirectional WebSocket. The first
client message is a JSON envelope; subsequent client messages are binary
PCM frames; server emits binary PCM frames at the same cadence.
Wire format
Client → server (text frame, first):

    {
      "type": "session.update",
      "model": "localvqe",
      "sample_format": "S16_LE",
      "sample_rate": 16000,
      "frame_samples": 256,
      "params": { "noise_gate": "true" }
    }

sample_format is S16_LE (16-bit signed little-endian) or F32_LE (32-bit
float little-endian, [-1, 1]). frame_samples defaults to the backend's
preferred hop length (256 = 16 ms for LocalVQE).
Client → server (binary frames, subsequent): interleaved stereo PCM,
channel 0 = audio (mic), channel 1 = reference. Frame size:
frame_samples × 2 channels × sample_size. For S16_LE at 256 samples that
is 1024 bytes per frame; for F32_LE it is 2048 bytes. If the reference is
silent (no auxiliary signal), send zeros on channel 1.
Server → client (binary frames): mono PCM in the same format,
frame_samples × sample_size bytes (512 bytes for S16_LE, 1024 for F32_LE).
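The frame layout described above can be checked with a short sketch that interleaves one hop of mic and reference samples and confirms the byte counts. The helper names (`pack_client_frame`, `server_frame_bytes`) are hypothetical and exist only for this illustration; the layout itself follows the description above.

```python
import struct

FRAME_SAMPLES = 256  # LocalVQE hop length, per the text above


def pack_client_frame(mic, reference=None, sample_format="S16_LE"):
    """Interleave mic (channel 0) and reference (channel 1) into one binary frame."""
    if reference is None:
        reference = [0] * len(mic)  # silent reference: zeros on channel 1
    if len(mic) != len(reference):
        raise ValueError("mic and reference must have the same number of samples")
    code = {"S16_LE": "h", "F32_LE": "f"}[sample_format]
    # Sample order on the wire: mic[0], ref[0], mic[1], ref[1], ...
    interleaved = [s for pair in zip(mic, reference) for s in pair]
    return struct.pack(f"<{len(interleaved)}{code}", *interleaved)


def server_frame_bytes(frame_samples, sample_format):
    """Size in bytes of one mono frame returned by the server."""
    return frame_samples * {"S16_LE": 2, "F32_LE": 4}[sample_format]


s16 = pack_client_frame([0] * FRAME_SAMPLES)
f32 = pack_client_frame([0.0] * FRAME_SAMPLES, sample_format="F32_LE")
print(len(s16), len(f32))  # 1024 2048
print(server_frame_bytes(FRAME_SAMPLES, "S16_LE"),
      server_frame_bytes(FRAME_SAMPLES, "F32_LE"))  # 512 1024
```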
Mid-stream control (text frame): another session.update resets the
streaming state when its reset field is true; a session.close text frame
ends the session cleanly.
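As a rough sketch, the text-frame control messages could be built as follows. Several details here are assumptions rather than documented behaviour: that `reset` is a JSON boolean, that a mid-stream `session.update` repeats the full envelope, and that `session.close` carries no fields beyond `type`. The function names are hypothetical.

```python
import json


def session_update(model, sample_format="S16_LE", sample_rate=16000,
                   frame_samples=256, params=None, reset=False):
    """Serialise a session.update envelope as a JSON text frame."""
    msg = {
        "type": "session.update",
        "model": model,
        "sample_format": sample_format,
        "sample_rate": sample_rate,
        "frame_samples": frame_samples,
        "params": params or {},
    }
    if reset:
        msg["reset"] = True  # assumed to be a JSON boolean
    return json.dumps(msg)


def session_close():
    """Serialise a session.close message (assumed to have no other fields)."""
    return json.dumps({"type": "session.close"})


print(session_update("localvqe", params={"noise_gate": "true"}))
print(session_update("localvqe", reset=True))
print(session_close())
```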
Latency
LocalVQE has 16 ms algorithmic latency (one hop). At runtime the per-frame CPU cost depends on the model: ~1.6 ms for the compact 1.3 M models (v1.1/v1.2, ~9.7× realtime) and ~3.3 ms for the wider v1.3 4.8 M model (~4.7× realtime) on a 4-thread modern desktop, leaving the rest of the budget for network and downstream playback.
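The latency budget can be worked through with a few lines of arithmetic. The hop duration follows from 256 samples at 16 kHz; the per-frame CPU costs are the rounded figures quoted above, so the computed realtime factors come out slightly higher than the quoted ~9.7× and ~4.7×, which presumably derive from unrounded measurements.

```python
SAMPLE_RATE_HZ = 16_000
HOP_SAMPLES = 256

# One hop of audio: 256 samples at 16 kHz = 16 ms.
hop_ms = HOP_SAMPLES * 1000 / SAMPLE_RATE_HZ


def headroom_ms(cpu_ms_per_frame):
    """Time left in each hop for network and playback after inference."""
    return hop_ms - cpu_ms_per_frame


def realtime_factor(cpu_ms_per_frame):
    """How many times faster than realtime the model processes audio."""
    return hop_ms / cpu_ms_per_frame


for label, cpu_ms in [("1.3 M (v1.1/v1.2)", 1.6), ("4.8 M (v1.3)", 3.3)]:
    print(f"{label}: headroom {headroom_ms(cpu_ms):.1f} ms, "
          f"realtime factor {realtime_factor(cpu_ms):.1f}x")
```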
Backend-specific tuning (LocalVQE)
| `params[<key>]` | Type | Default | Effect |
|---|---|---|---|
| `noise_gate` | bool | `false` | Enable post-OLA RMS-based residual-echo gate |
| `noise_gate_threshold_dbfs` | float | `-45.0` | Gate threshold in dBFS; frames below are zeroed |
The gate is most useful in far-end-only / silent-near-end stretches where the
model's residual would otherwise sound like buffering or amplified noise floor.
A reasonable starting point is -50 dBFS.
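To make the threshold concrete, here is a minimal sketch of an RMS gate on float samples where full scale is 1.0. It illustrates the general technique only and is not the LocalVQE backend's implementation; the function names and the per-frame (rather than smoothed) decision are assumptions.

```python
import math


def frame_dbfs(frame):
    """RMS level of a non-empty float frame (full scale = 1.0) in dBFS."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return -math.inf if rms == 0.0 else 20.0 * math.log10(rms)


def apply_gate(frame, threshold_dbfs=-45.0):
    """Zero the whole frame when its RMS level is below the threshold."""
    if not frame:
        return []
    if frame_dbfs(frame) < threshold_dbfs:
        return [0.0] * len(frame)
    return list(frame)


quiet = [0.001] * 256  # about -60 dBFS: below a -50 dBFS threshold
loud = [0.1] * 256     # about -20 dBFS: above it
print(apply_gate(quiet, -50.0) == [0.0] * 256)  # True
print(apply_gate(loud, -50.0) == loud)          # True
```

Lowering the threshold (for example from -45 to -50 dBFS) lets quieter frames through, which trades a little more residual for less risk of clipping soft speech.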
Configuring a model
LocalVQE ships several weight releases in the gallery: `localvqe-v1.3-4.8m`
(current default, best quality), plus `localvqe-v1.2-1.3m` and `localvqe-v1.1-1.3m`
(compact: about a quarter of the parameters and, going by the latency figures above, roughly half the per-hop CPU cost; good for low-core or power-constrained hosts).
All share the same backend and request API; only the model filename differs.

    name: localvqe
    backend: localvqe
    parameters:
      model: localvqe-v1.3-4.8M-f32.gguf
    # Backend-specific defaults can be set in Options[]; per-request
    # params[*] form fields override.
    #
    # `backend` and `device` route through the upstream localvqe options
    # builder so you can force a non-default GGML backend (e.g. `Vulkan`) or
    # pin to a specific GPU index. Leave both unset to keep the CPU default.
    options:
      - noise_gate=true
      - noise_gate_threshold_dbfs=-50
      # - backend=Vulkan
      # - device=0

See also
- [Text to Audio (TTS)]({{< relref "text-to-audio.md" >}})
- [Audio to Text]({{< relref "audio-to-text.md" >}})
- LocalVQE upstream
- DeepVQE paper (Indenbom et al., Interspeech 2023)