mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-05 15:26:14 -04:00
155 lines
5.9 KiB
Markdown
+++
disableToc = false
title = "Audio Transform"
weight = 17
url = "/features/audio-transform/"
+++

The audio-transform endpoints take **audio in** and emit **audio out**, optionally
conditioned on a second reference audio signal. The category is generic by
design: concrete operations include joint **acoustic echo cancellation +
noise suppression + dereverberation** (LocalVQE), voice conversion (reference
= target speaker), pitch shifting, audio super-resolution, and so on.

The first shipping backend is [LocalVQE](https://github.com/localai-org/LocalVQE),
a small GGML-based model (1.3 M parameters in the compact v1.1/v1.2 releases,
4.8 M in v1.3) that performs joint AEC + noise suppression + dereverberation
on 16 kHz mono speech. The compact releases run at roughly 9.6× realtime on a
desktop CPU. LocalVQE is derived from Microsoft's DeepVQE paper.

## The mental model

Every audio-transform request carries:

- **`audio`**: the primary input file (required).
- **`reference`**: an auxiliary signal whose meaning is backend-specific (optional).
  - For echo cancellation: the loopback / far-end signal played through the speakers.
  - For voice conversion: the target speaker's reference clip.
  - For pitch / style transfer: a tonal or style reference.
  - When omitted, the backend treats it as silence and degrades gracefully (LocalVQE,
    for example, performs only noise suppression + dereverberation when the
    reference is empty).
- **`params`**: a generic `key=value` map forwarded to the backend.
  - LocalVQE keys: `noise_gate=true|false`, `noise_gate_threshold_dbfs=<float>`.

This shape mirrors WebRTC's `ProcessStream(near)` / `ProcessReverseStream(far)`
APM API, NVIDIA Maxine's `NvAFX_Run` paired-stream signature, and the ICASSP
AEC challenge 2-channel WAV convention.

## Batch endpoint

`POST /audio/transformations` (alias `POST /audio/transform`): multipart
form-data, returns audio bytes.

| Field | Type | Required | Notes |
|---|---|---|---|
| `model` | string | yes | Audio-transform model id (e.g. `localvqe`) |
| `audio` | file | yes | Primary input audio |
| `reference` | file | no | Optional auxiliary signal |
| `response_format` | string | no | `wav` (default), `mp3`, `ogg`, `flac` |
| `sample_rate` | int | no | Desired output sample rate |
| `params[<key>]` | string | no | Repeated; forwarded to backend |

Example (LocalVQE: cancel echo, suppress noise, gate residual):

```bash
curl -X POST http://localhost:8080/audio/transformations \
  -F model=localvqe \
  -F audio=@mic.wav \
  -F reference=@loopback.wav \
  -F 'params[noise_gate]=true' \
  -F 'params[noise_gate_threshold_dbfs]=-50' \
  -o enhanced.wav
```

When `reference` is omitted, LocalVQE zero-fills the reference channel and
the operation reduces to noise suppression + dereverberation.

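For clients that build the request in code, the `params[<key>]` convention from the field table can be sketched in Python. The helper names `flatten_params` and `post_transform`, and the choice of the third-party `requests` library, are illustrative assumptions rather than part of LocalAI; the endpoint path, field names, and parameter keys are the ones documented in this section.

```python
from contextlib import ExitStack


def flatten_params(params):
    """Turn {"noise_gate": True} into [("params[noise_gate]", "true")] form fields."""
    fields = []
    for key, value in params.items():
        if isinstance(value, bool):
            # Booleans are sent as lowercase strings, as in the curl example.
            value = "true" if value else "false"
        fields.append((f"params[{key}]", str(value)))
    return fields


def post_transform(base_url, audio_path, reference_path=None,
                   model="localvqe", params=None):
    """POST one clip to /audio/transformations and return the audio bytes."""
    import requests  # third-party; imported lazily so flatten_params has no dependency

    data = [("model", model)] + flatten_params(params or {})
    with ExitStack() as stack:
        files = {"audio": stack.enter_context(open(audio_path, "rb"))}
        if reference_path is not None:
            # Optional auxiliary signal (e.g. the loopback for echo cancellation).
            files["reference"] = stack.enter_context(open(reference_path, "rb"))
        resp = requests.post(f"{base_url}/audio/transformations",
                             data=data, files=files)
    resp.raise_for_status()
    return resp.content
```

Under these assumptions, `post_transform("http://localhost:8080", "mic.wav", "loopback.wav", params={"noise_gate": True, "noise_gate_threshold_dbfs": -50})` would submit the same fields as the `curl` example.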
## Streaming endpoint

`GET /audio/transformations/stream`: bidirectional WebSocket. The first
client message is a JSON envelope; subsequent client messages are binary
PCM frames; the server emits binary PCM frames at the same cadence.

### Wire format

**Client → server** (text frame, first):

```json
{
  "type": "session.update",
  "model": "localvqe",
  "sample_format": "S16_LE",
  "sample_rate": 16000,
  "frame_samples": 256,
  "params": { "noise_gate": "true" }
}
```

`sample_format` is `S16_LE` (16-bit signed little-endian) or `F32_LE` (32-bit
float little-endian, range [-1, 1]). `frame_samples` defaults to the backend's
preferred hop length (256 samples = 16 ms at 16 kHz for LocalVQE).

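The envelope can also be assembled in code. The sketch below builds the `session.update` text frame with Python's standard `json` module. The function name `session_update` is hypothetical, and sending `params` values as strings follows the JSON example rather than a documented requirement.

```python
import json

SAMPLE_FORMATS = ("S16_LE", "F32_LE")  # the two formats listed for this endpoint


def session_update(model, sample_format="S16_LE", sample_rate=16000,
                   frame_samples=None, params=None):
    """Return the first text frame of a streaming session as a JSON string."""
    if sample_format not in SAMPLE_FORMATS:
        raise ValueError(f"unsupported sample_format: {sample_format!r}")
    envelope = {
        "type": "session.update",
        "model": model,
        "sample_format": sample_format,
        "sample_rate": sample_rate,
    }
    if frame_samples is not None:
        # Leave the field out to accept the backend's preferred hop length.
        envelope["frame_samples"] = frame_samples
    if params:
        # Assumption: values are stringified, booleans as "true" / "false".
        envelope["params"] = {
            key: ("true" if value else "false") if isinstance(value, bool) else str(value)
            for key, value in params.items()
        }
    return json.dumps(envelope)
```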
**Client → server** (binary frames, subsequent): interleaved stereo PCM,
channel 0 = audio (mic), channel 1 = reference. Frame size:
`frame_samples × 2 channels × sample_size`. For `S16_LE` at 256 samples that
is 1024 bytes per frame; for `F32_LE` it is 2048 bytes. If the reference is
silent (no auxiliary signal), send zeros on channel 1.

**Server → client** (binary frames): mono PCM in the same format,
`frame_samples × sample_size` bytes (512 bytes for `S16_LE`, 1024 for `F32_LE`).

**Mid-stream control** (text frame): another `session.update` resets the
streaming state when its `reset` field is true; a `session.close` text frame
ends the session cleanly.

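The frame sizes in this section follow directly from `frame_samples × channels × sample_size`. The Python sketch below, using only the standard library, checks that arithmetic and interleaves one mic hop with one reference hop. The helper names are hypothetical. It assumes the input is float samples in [-1, 1]; the scale factor of 32767 for `S16_LE` is a common convention, not something the endpoint mandates.

```python
import struct

SAMPLE_SIZE = {"S16_LE": 2, "F32_LE": 4}      # bytes per sample
STRUCT_CODE = {"S16_LE": "h", "F32_LE": "f"}  # struct codes, packed little-endian


def frame_bytes(sample_format, frame_samples=256, channels=2):
    """Size in bytes of one binary frame."""
    return frame_samples * channels * SAMPLE_SIZE[sample_format]


def pack_client_frame(mic, reference=None, sample_format="S16_LE"):
    """Interleave mic (channel 0) and reference (channel 1) into one frame."""
    if reference is None:
        reference = [0.0] * len(mic)  # no auxiliary signal: zeros on channel 1
    if len(reference) != len(mic):
        raise ValueError("mic and reference must have the same length")
    interleaved = []
    for m, r in zip(mic, reference):
        interleaved.extend((m, r))
    if sample_format == "S16_LE":
        # Scale floats to 16-bit integers and clamp to the valid range.
        interleaved = [max(-32768, min(32767, round(x * 32767))) for x in interleaved]
    code = STRUCT_CODE[sample_format]
    return struct.pack(f"<{len(interleaved)}{code}", *interleaved)


def unpack_server_frame(payload, sample_format="S16_LE"):
    """Decode a mono frame received from the server into a list of samples."""
    count = len(payload) // SAMPLE_SIZE[sample_format]
    return list(struct.unpack(f"<{count}{STRUCT_CODE[sample_format]}", payload))
```

Passing `channels=1` to `frame_bytes` gives the server-to-client sizes, since the server replies with mono PCM.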
### Latency

LocalVQE has 16 ms algorithmic latency (one hop). At runtime the per-frame CPU
cost depends on the model: ~1.6 ms for the compact 1.3 M models (v1.1/v1.2,
~9.7× realtime) and ~3.3 ms for the wider v1.3 4.8 M model (~4.7× realtime) on
a 4-thread modern desktop, leaving the rest of the budget for network and
downstream playback.

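These figures are related by simple arithmetic: a hop of `frame_samples` at `sample_rate` lasts `frame_samples / sample_rate` seconds, and the realtime factor is the hop duration divided by the per-frame CPU time. A short Python check follows. The function names are illustrative, and because the quoted CPU costs are rounded, the computed factors agree with the quoted ones only approximately.

```python
def hop_ms(frame_samples=256, sample_rate=16000):
    """Duration of one hop in milliseconds."""
    return 1000.0 * frame_samples / sample_rate


def realtime_factor(cpu_ms, frame_samples=256, sample_rate=16000):
    """How many times faster than realtime one frame is processed."""
    return hop_ms(frame_samples, sample_rate) / cpu_ms


def headroom_ms(cpu_ms, frame_samples=256, sample_rate=16000):
    """Time left in each hop after inference, for network and playback."""
    return hop_ms(frame_samples, sample_rate) - cpu_ms


if __name__ == "__main__":
    # Approximate per-frame costs quoted above for a 4-thread desktop.
    for label, cpu in (("1.3 M (v1.1/v1.2)", 1.6), ("4.8 M (v1.3)", 3.3)):
        print(f"{label}: {realtime_factor(cpu):.1f}x realtime, "
              f"{headroom_ms(cpu):.1f} ms headroom per hop")
```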
## Backend-specific tuning (LocalVQE)

| `params[<key>]` | Type | Default | Effect |
|---|---|---|---|
| `noise_gate` | bool | `false` | Enable post-OLA RMS-based residual-echo gate |
| `noise_gate_threshold_dbfs` | float | `-45.0` | Gate threshold in dBFS; frames below it are zeroed |

The gate is most useful in far-end-only / silent-near-end stretches where the
model's residual would otherwise sound like buffering or an amplified noise floor.
A reasonable starting point is `-50` dBFS.

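To make the threshold concrete, the following Python sketch shows what an RMS gate with a dBFS threshold does to one frame of float samples in [-1, 1]. It illustrates the general technique only and is not LocalVQE's implementation; details such as how a frame exactly at the threshold is treated are assumptions.

```python
import math


def rms_dbfs(frame):
    """RMS level of a frame in dBFS (0 dBFS = full-scale RMS of 1.0)."""
    if not frame:
        return float("-inf")
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(rms) if rms > 0.0 else float("-inf")


def noise_gate(frame, threshold_dbfs=-45.0):
    """Zero the whole frame when its RMS level falls below the threshold."""
    if rms_dbfs(frame) < threshold_dbfs:
        return [0.0] * len(frame)
    return list(frame)
```

A frame of constant amplitude 0.001 sits near -60 dBFS and would be zeroed at either `-45` or `-50`, while a frame at amplitude 0.1 (about -20 dBFS) passes through unchanged.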
## Configuring a model

LocalVQE ships several weight releases in the gallery: `localvqe-v1.3-4.8m`
(current default, best quality), and `localvqe-v1.2-1.3m` and `localvqe-v1.1-1.3m`
(compact, ~¼ the per-hop cost; good for low-core or power-constrained hosts).
All share the same backend and request API; only the `model` filename differs.

```yaml
name: localvqe
backend: localvqe
parameters:
  model: localvqe-v1.3-4.8M-f32.gguf

# Backend-specific defaults can be set in Options[]; per-request
# params[*] form fields override.
#
# `backend` and `device` route through the upstream localvqe options
# builder so you can force a non-default GGML backend (e.g. `Vulkan`) or
# pin to a specific GPU index. Leave both unset to keep the CPU default.
options:
  - noise_gate=true
  - noise_gate_threshold_dbfs=-50
  # - backend=Vulkan
  # - device=0
```

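The precedence described in the YAML comments (model-level `options` as defaults, per-request `params[*]` as overrides) amounts to a dictionary merge. A hypothetical Python sketch of that rule follows; `parse_options` and `effective_params` are illustrative names, not LocalAI functions, and the real backend may validate or convert values differently.

```python
def parse_options(options):
    """Parse ["key=value", ...] entries from the model config into a dict."""
    parsed = {}
    for entry in options:
        key, sep, value = entry.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {entry!r}")
        parsed[key.strip()] = value.strip()
    return parsed


def effective_params(model_options, request_params):
    """Model-level defaults first, then per-request params[*] override them."""
    merged = parse_options(model_options)
    merged.update(request_params)
    return merged
```

With the configuration shown in this section, a request that sends `params[noise_gate_threshold_dbfs]=-40` would keep `noise_gate=true` from the model file and use `-40` as the threshold.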
## See also

- [Text to Audio (TTS)]({{< relref "text-to-audio.md" >}})
- [Audio to Text]({{< relref "audio-to-text.md" >}})
- [LocalVQE upstream](https://github.com/localai-org/LocalVQE)
- [DeepVQE paper (Indenbom et al., Interspeech 2023)](https://arxiv.org/abs/2306.03177)