LocalAI/docs/content/features/audio-transform.md


+++
disableToc = false
title = "Audio Transform"
weight = 17
url = "/features/audio-transform/"
+++

The audio-transform endpoints take audio in and emit audio out, optionally conditioned on a second reference audio signal. The category is generic by design: concrete operations include joint acoustic echo cancellation + noise suppression + dereverberation (LocalVQE), voice conversion (where the reference is the target speaker), pitch shifting, and audio super-resolution.

The first shipping backend is LocalVQE, a 1.3 M-parameter GGML-based model that performs joint AEC + noise suppression + dereverberation on 16 kHz mono speech, ~9.6× realtime on a desktop CPU. It is a derivative of the Microsoft DeepVQE paper.

The mental model

Every audio-transform request carries:

  • audio — the primary input file (required).
  • reference — an auxiliary signal whose meaning is backend-specific (optional).
    • For echo cancellation: the loopback / far-end signal played through the speakers.
    • For voice conversion: the target speaker's reference clip.
    • For pitch / style transfer: a tonal or style reference.
    • When omitted, the backend treats it as silence and degrades gracefully (LocalVQE, for example, performs only noise suppression and dereverberation when reference is empty).
  • params — a generic key=value map forwarded to the backend.
    • LocalVQE keys: noise_gate=true|false, noise_gate_threshold_dbfs=<float>.

This shape mirrors WebRTC's ProcessStream(near) / ProcessReverseStream(far) APM API, NVIDIA Maxine's NvAFX_Run paired-stream signature, and the ICASSP AEC challenge 2-channel WAV convention.

Batch endpoint

POST /audio/transformations (alias POST /audio/transform) — multipart form-data, returns audio bytes.

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| model | string | yes | Audio-transform model id (e.g. localvqe) |
| audio | file | yes | Primary input audio |
| reference | file | no | Optional auxiliary signal |
| response_format | string | no | wav (default), mp3, ogg, flac |
| sample_rate | int | no | Desired output sample rate |
| params[<key>] | string | no | Repeated; forwarded to backend |

Example (LocalVQE: cancel echo, suppress noise, gate residual):

curl -X POST http://localhost:8080/audio/transformations \
  -F model=localvqe \
  -F audio=@mic.wav \
  -F reference=@loopback.wav \
  -F 'params[noise_gate]=true' \
  -F 'params[noise_gate_threshold_dbfs]=-50' \
  -o enhanced.wav

When reference is omitted, LocalVQE zero-fills the reference channel and the operation reduces to noise suppression + dereverberation.
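
The form layout above can also be assembled programmatically. The following is a minimal sketch in Python, assuming a hypothetical helper named `build_form_fields` (it is not part of LocalAI); it only builds the non-file fields and shows how each backend parameter maps to its own `params[<key>]` entry.

```python
from typing import Dict, List, Optional, Tuple


def build_form_fields(
    model: str,
    params: Optional[Dict[str, str]] = None,
    response_format: Optional[str] = None,
    sample_rate: Optional[int] = None,
) -> List[Tuple[str, str]]:
    """Build the non-file form fields for POST /audio/transformations.

    Hypothetical helper for illustration: the audio and reference files
    are attached separately by whatever HTTP client sends the request.
    """
    fields = [("model", model)]
    if response_format is not None:
        fields.append(("response_format", response_format))
    if sample_rate is not None:
        fields.append(("sample_rate", str(sample_rate)))
    # Each backend parameter becomes its own params[<key>] form field.
    for key, value in (params or {}).items():
        fields.append((f"params[{key}]", str(value)))
    return fields


fields = build_form_fields(
    "localvqe",
    params={"noise_gate": "true", "noise_gate_threshold_dbfs": "-50"},
)
for name, value in fields:
    print(f"{name}={value}")
```

With a client that supports multipart uploads, a list like this would be sent as the form data alongside the `audio` and (optionally) `reference` file parts, mirroring the `-F` flags in the curl example.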

Streaming endpoint

GET /audio/transformations/stream opens a bidirectional WebSocket. The first client message is a JSON envelope; subsequent client messages are binary PCM frames; the server emits binary PCM frames at the same cadence.

Wire format

Client → server (text frame, first):

{
  "type": "session.update",
  "model": "localvqe",
  "sample_format": "S16_LE",
  "sample_rate": 16000,
  "frame_samples": 256,
  "params": { "noise_gate": "true" }
}

sample_format is S16_LE (16-bit signed little-endian) or F32_LE (32-bit float little-endian, [-1, 1]). frame_samples defaults to the backend's preferred hop length (256 = 16 ms for LocalVQE).
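
As an illustration, the opening text frame could be serialized as follows. This is a sketch only: `session_update` is a hypothetical helper, and leaving `frame_samples` out of the JSON to get the backend default is an assumption based on the description above.

```python
import json
from typing import Dict, Optional


def session_update(
    model: str,
    sample_format: str = "S16_LE",
    sample_rate: int = 16000,
    frame_samples: Optional[int] = None,
    params: Optional[Dict[str, str]] = None,
) -> str:
    """Serialize the first text frame of a streaming session.

    Hypothetical helper; field names follow the envelope shown above.
    """
    if sample_format not in ("S16_LE", "F32_LE"):
        raise ValueError(f"unsupported sample_format: {sample_format}")
    envelope = {
        "type": "session.update",
        "model": model,
        "sample_format": sample_format,
        "sample_rate": sample_rate,
    }
    # Assumption: omitting frame_samples selects the backend's preferred hop.
    if frame_samples is not None:
        envelope["frame_samples"] = frame_samples
    if params:
        envelope["params"] = dict(params)
    return json.dumps(envelope)


message = session_update("localvqe", frame_samples=256, params={"noise_gate": "true"})
print(message)
```

Validating `sample_format` on the client side is a design choice in this sketch; it surfaces a typo before the socket is opened rather than after the server rejects the session.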

Client → server (binary frames, subsequent): interleaved stereo PCM, channel 0 = audio (mic), channel 1 = reference. Frame size: frame_samples × 2 channels × sample_size. For S16_LE at 256 samples that is 1024 bytes per frame; for F32_LE it is 2048 bytes. If the reference is silent (no auxiliary signal), send zeros on channel 1.

Server → client (binary frames): mono PCM in the same format, frame_samples × sample_size bytes (512 bytes for S16_LE, 1024 for F32_LE).
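
The frame sizes quoted above follow directly from `frame_samples × channels × sample_size`. The sketch below reproduces that arithmetic and shows one way to interleave a mic hop with its reference hop using Python's standard `struct` module; `frame_bytes` and `pack_client_frame` are hypothetical names chosen for this example.

```python
import struct
from typing import Optional, Sequence

SAMPLE_SIZE = {"S16_LE": 2, "F32_LE": 4}      # bytes per sample
STRUCT_CODE = {"S16_LE": "h", "F32_LE": "f"}  # struct type codes


def frame_bytes(frame_samples: int, sample_format: str, channels: int) -> int:
    """Size in bytes of one PCM frame."""
    return frame_samples * channels * SAMPLE_SIZE[sample_format]


def pack_client_frame(
    mic: Sequence[float],
    ref: Optional[Sequence[float]] = None,
    sample_format: str = "S16_LE",
) -> bytes:
    """Interleave mic (channel 0) and reference (channel 1) into one frame.

    A missing reference is sent as zeros on channel 1. For S16_LE the
    samples must already be integers in the 16-bit range.
    """
    if ref is None:
        ref = [0] * len(mic)
    if len(mic) != len(ref):
        raise ValueError("mic and reference must have the same length")
    interleaved = [sample for pair in zip(mic, ref) for sample in pair]
    # "<" selects little-endian byte order with no padding.
    layout = f"<{len(interleaved)}{STRUCT_CODE[sample_format]}"
    return struct.pack(layout, *interleaved)


silent_hop = [0] * 256
print(len(pack_client_frame(silent_hop)))            # client frame, S16_LE
print(len(pack_client_frame(silent_hop, None, "F32_LE")))
print(frame_bytes(256, "S16_LE", channels=1))        # server frame, S16_LE
```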

Mid-stream control (text frame): another session.update resets the streaming state when its reset field is true; a session.close text frame ends the session cleanly.

Latency

LocalVQE has 16 ms algorithmic latency (one hop). At runtime it needs ~1.66 ms of CPU time per frame on a modern desktop, leaving the rest of the 16 ms budget for network and downstream playback.
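
The latency figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this page (256-sample hop, 16 kHz, ~1.66 ms of CPU per frame); the variable names are illustrative.

```python
SAMPLE_RATE_HZ = 16_000
FRAME_SAMPLES = 256
CPU_MS_PER_FRAME = 1.66  # approximate figure quoted for a modern desktop

# Duration of audio covered by one hop.
hop_ms = 1000.0 * FRAME_SAMPLES / SAMPLE_RATE_HZ
# How many times faster than realtime the model processes audio.
realtime_factor = hop_ms / CPU_MS_PER_FRAME
# Time left in each hop for network transfer and playback.
headroom_ms = hop_ms - CPU_MS_PER_FRAME

print(f"hop: {hop_ms:.2f} ms")
print(f"realtime factor: {realtime_factor:.1f}x")
print(f"headroom per frame: {headroom_ms:.2f} ms")
```

This prints a 16.00 ms hop, a realtime factor of 9.6x, and 14.34 ms of headroom per frame, which is consistent with the ~9.6× realtime figure given for LocalVQE.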

Backend-specific tuning (LocalVQE)

| params[<key>] | Type | Default | Effect |
| --- | --- | --- | --- |
| noise_gate | bool | false | Enable post-OLA RMS-based residual-echo gate |
| noise_gate_threshold_dbfs | float | -45.0 | Gate threshold in dBFS; frames below are zeroed |

The gate is most useful in far-end-only / silent-near-end stretches where the model's residual would otherwise sound like buffering or amplified noise floor. A reasonable starting point is -50 dBFS.
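
To make the gate's behavior concrete, here is a minimal sketch of an RMS-based gate on float samples. It is an illustration of the general technique, not LocalVQE's implementation; it assumes 0 dBFS corresponds to an RMS of 1.0, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def rms_dbfs(frame: Sequence[float]) -> float:
    """RMS level of a float frame in dBFS (assumes full scale = 1.0)."""
    if not frame:
        return float("-inf")
    mean_square = sum(s * s for s in frame) / len(frame)
    if mean_square == 0.0:
        return float("-inf")
    # 10*log10(mean square) equals 20*log10(RMS).
    return 10.0 * math.log10(mean_square)


def gate_frame(frame: Sequence[float], threshold_dbfs: float = -45.0) -> List[float]:
    """Zero the whole frame when its RMS level is below the threshold."""
    if rms_dbfs(frame) < threshold_dbfs:
        return [0.0] * len(frame)
    return list(frame)


residual = [0.004] * 256  # roughly -48 dBFS
print(gate_frame(residual, threshold_dbfs=-45.0)[:2])  # gated to zeros
print(gate_frame(residual, threshold_dbfs=-50.0)[:2])  # passes through
```

Lowering the threshold from -45 to -50 dBFS lets quieter frames through, so a more negative value gates less aggressively.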

Configuring a model

name: localvqe
backend: localvqe
parameters:
  model: localvqe-v1.1-1.3M-f32.gguf

# Backend-specific defaults can be set in Options[]; per-request
# params[*] form fields override.
#
# `backend` and `device` route through the upstream localvqe options
# builder so you can force a non-default GGML backend (e.g. `Vulkan`) or
# pin to a specific GPU index. Leave both unset to keep the CPU default.
options:
- noise_gate=true
- noise_gate_threshold_dbfs=-50
# - backend=Vulkan
# - device=0
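
The precedence described in the comments above (model-level `options` as defaults, per-request `params[*]` as overrides) can be sketched as a simple merge. This is an illustration of the rule only, not LocalAI's code; `parse_options` and `effective_params` are hypothetical names.

```python
from typing import Dict, Iterable, Optional


def parse_options(options: Iterable[str]) -> Dict[str, str]:
    """Parse key=value option strings into a dictionary."""
    parsed: Dict[str, str] = {}
    for entry in options:
        key, sep, value = entry.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got: {entry!r}")
        parsed[key.strip()] = value.strip()
    return parsed


def effective_params(
    options: Iterable[str],
    request_params: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Start from model-level defaults, then apply per-request overrides."""
    merged = parse_options(options)
    merged.update(request_params or {})
    return merged


defaults = ["noise_gate=true", "noise_gate_threshold_dbfs=-50"]
print(effective_params(defaults, {"noise_gate_threshold_dbfs": "-40"}))
```

In this example the request keeps `noise_gate=true` from the model configuration while replacing the threshold with its own value.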

See also