mirror of
https://github.com/mudler/LocalAI.git
synced 2026-05-17 13:10:23 -04:00
* fix(docs): correct broken Hugo relrefs The Hugo build has been failing on master since the relevant pages landed: - text-generation.md:720 referenced `/docs/features/distributed-mode`, but Hugo `relref` paths are relative to the content root, not the rendered URL. Drop the `/docs/` prefix so the lookup matches the existing `features/...` form used elsewhere in the file. - audio-transform.md:144 referenced `tts.md`; the actual page is `text-to-audio.md`. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(kokoros): stub Diarize and AudioTransform Backend trait methods The recent backend.proto additions (Diarize, AudioTransform, AudioTransformStream) extended the gRPC Backend trait, breaking kokoros-grpc compilation with E0046 because the Rust implementation hadn't picked up the new methods. Add Unimplemented stubs matching the existing pattern for non-applicable RPCs in this TTS-only backend. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(vibevoice-cpp): track upstream ABI + wire 1.5B voice cloning Two recent commits in mudler/vibevoice.cpp reshaped the vv_capi_tts signature without a corresponding bump on the LocalAI side: 3bd759c "1.5b: unify into a single tts entry point" inserted a ref_audio_path parameter between voice_path and dst_wav_path. ad856bd "1.5b: multi-speaker dialog support" promoted that to a (const char* const* ref_audio_paths, int n_ref_audio_paths) pair for per-speaker conditioning. Because purego resolves symbols by name and not by signature, the build kept linking; at runtime the misaligned arguments turned the TTS->ASR closed-loop test into a SIGSEGV inside cgo. Track HEAD explicitly and bring the bridge in line with it: * Update the CppTTS purego binding to the 9-arg form. purego marshals []*byte as a **char by handing the C side the underlying array address; nil/empty maps to NULL, which matches the C contract for "no reference audio" on the realtime-0.5B path. * Add a `ref_audio` gallery option (comma-separated, repeatable) that the 1.5B path consumes for runtime voice cloning. Multiple entries are interpreted as one WAV per speaker (Speaker 0..n-1). * TTSRequest.Voice now routes by extension/shape: `.wav` or a comma-separated list goes to ref_audio_paths; anything else stays on voice_path (realtime-0.5B's pre-baked voice gguf). * Pin VIBEVOICE_CPP_VERSION to ad856bd and wire the Makefile into the existing bump_deps matrix so future upstream rolls land as reviewable PRs instead of a silent CI break. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(vibevoice-cpp): use ModelOptions.AudioPath for 1.5B ref audio Use the existing audio_path field from ModelOptions (already plumbed through config_file's `audio_path:` YAML and consumed by other audio backends like kokoros) instead of inventing a custom `ref_audio:` Options[] string. Multi-speaker setups stay on a single comma- separated value. No behavior change beyond the gallery key name; per-call routing via TTSRequest.Voice is unchanged. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
148 lines
5.4 KiB
Markdown
148 lines
5.4 KiB
Markdown
+++
|
||
disableToc = false
|
||
title = "Audio Transform"
|
||
weight = 17
|
||
url = "/features/audio-transform/"
|
||
+++
|
||
|
||
The audio-transform endpoints take **audio in** and emit **audio out**, optionally
|
||
conditioned on a second reference audio signal. The category is generic by
|
||
design — concrete operations include joint **acoustic echo cancellation +
|
||
noise suppression + dereverberation** (LocalVQE), voice conversion (reference
|
||
= target speaker), pitch shifting, audio super-resolution, and so on.
|
||
|
||
The first shipping backend is [LocalVQE](https://github.com/localai-org/LocalVQE),
|
||
a 1.3 M-parameter GGML-based model that performs joint AEC + noise suppression
|
||
+ dereverberation on 16 kHz mono speech, ~9.6× realtime on a desktop CPU. It
|
||
is a derivative of the Microsoft DeepVQE paper.
|
||
|
||
## The mental model
|
||
|
||
Every audio-transform request carries:
|
||
|
||
- **`audio`** — the primary input file (required).
|
||
- **`reference`** — an auxiliary signal whose meaning is backend-specific (optional).
|
||
- For echo cancellation: the loopback / far-end signal played through the speakers.
|
||
- For voice conversion: the target speaker's reference clip.
|
||
- For pitch / style transfer: a tonal or style reference.
|
||
- When omitted, the backend treats it as silence and degrades gracefully (LocalVQE,
|
||
for example, does denoise + dereverb only when ref is empty).
|
||
- **`params`** — a generic `key=value` map forwarded to the backend.
|
||
- LocalVQE keys: `noise_gate=true|false`, `noise_gate_threshold_dbfs=<float>`.
|
||
|
||
This shape mirrors WebRTC's `ProcessStream(near)` / `ProcessReverseStream(far)`
|
||
APM API, NVIDIA Maxine's `NvAFX_Run` paired-stream signature, and the ICASSP
|
||
AEC challenge 2-channel WAV convention.
|
||
|
||
## Batch endpoint
|
||
|
||
`POST /audio/transformations` (alias `POST /audio/transform`) — multipart
|
||
form-data, returns audio bytes.
|
||
|
||
| Field | Type | Required | Notes |
|
||
|---|---|---|---|
|
||
| `model` | string | yes | Audio-transform model id (e.g. `localvqe`) |
|
||
| `audio` | file | yes | Primary input audio |
|
||
| `reference` | file | no | Optional auxiliary signal |
|
||
| `response_format` | string | no | `wav` (default), `mp3`, `ogg`, `flac` |
|
||
| `sample_rate` | int | no | Desired output sample rate |
|
||
| `params[<key>]` | string | no | Repeated; forwarded to backend |
|
||
|
||
Example (LocalVQE: cancel echo, suppress noise, gate residual):
|
||
|
||
```bash
|
||
curl -X POST http://localhost:8080/audio/transformations \
|
||
-F model=localvqe \
|
||
-F audio=@mic.wav \
|
||
-F reference=@loopback.wav \
|
||
-F 'params[noise_gate]=true' \
|
||
-F 'params[noise_gate_threshold_dbfs]=-50' \
|
||
-o enhanced.wav
|
||
```
|
||
|
||
When `reference` is omitted, LocalVQE zero-fills the reference channel and
|
||
the operation reduces to noise suppression + dereverberation.
|
||
|
||
## Streaming endpoint
|
||
|
||
`GET /audio/transformations/stream` — bidirectional WebSocket. The first
|
||
client message is a JSON envelope; subsequent client messages are binary
|
||
PCM frames; server emits binary PCM frames at the same cadence.
|
||
|
||
### Wire format
|
||
|
||
**Client → server** (text frame, first):
|
||
|
||
```json
|
||
{
|
||
"type": "session.update",
|
||
"model": "localvqe",
|
||
"sample_format": "S16_LE",
|
||
"sample_rate": 16000,
|
||
"frame_samples": 256,
|
||
"params": { "noise_gate": "true" }
|
||
}
|
||
```
|
||
|
||
`sample_format` is `S16_LE` (16-bit signed little-endian) or `F32_LE` (32-bit
|
||
float little-endian, [-1, 1]). `frame_samples` defaults to the backend's
|
||
preferred hop length (256 = 16 ms for LocalVQE).
|
||
|
||
**Client → server** (binary frames, subsequent): interleaved stereo PCM,
|
||
channel 0 = audio (mic), channel 1 = reference. Frame size:
|
||
`frame_samples × 2 channels × sample_size`. For `S16_LE` at 256 samples that
|
||
is 1024 bytes per frame; for `F32_LE` it is 2048 bytes. If the reference is
|
||
silent (no auxiliary signal), send zeros on channel 1.
|
||
|
||
**Server → client** (binary frames): mono PCM in the same format,
|
||
`frame_samples × sample_size` bytes (512 bytes for `S16_LE`, 1024 for `F32_LE`).
|
||
|
||
**Mid-stream control** (text frame): another `session.update` resets the
|
||
streaming state when its `reset` field is true; a `session.close` text frame
|
||
ends the session cleanly.
|
||
|
||
### Latency
|
||
|
||
LocalVQE has 16 ms algorithmic latency (one hop). At runtime, ~1.66 ms of CPU
|
||
time per frame on a modern desktop, leaving the rest of the budget for
|
||
network and downstream playback.
|
||
|
||
## Backend-specific tuning (LocalVQE)
|
||
|
||
| `params[<key>]` | Type | Default | Effect |
|
||
|---|---|---|---|
|
||
| `noise_gate` | bool | `false` | Enable post-OLA RMS-based residual-echo gate |
|
||
| `noise_gate_threshold_dbfs` | float | `-45.0` | Gate threshold in dBFS; frames below are zeroed |
|
||
|
||
The gate is most useful in far-end-only / silent-near-end stretches where the
|
||
model's residual would otherwise sound like buffering or amplified noise floor.
|
||
A reasonable starting point is `-50` dBFS.
|
||
|
||
## Configuring a model
|
||
|
||
```yaml
|
||
name: localvqe
|
||
backend: localvqe
|
||
parameters:
|
||
model: localvqe-v1.1-1.3M-f32.gguf
|
||
|
||
# Backend-specific defaults can be set in Options[]; per-request
|
||
# params[*] form fields override.
|
||
#
|
||
# `backend` and `device` route through the upstream localvqe options
|
||
# builder so you can force a non-default GGML backend (e.g. `Vulkan`) or
|
||
# pin to a specific GPU index. Leave both unset to keep the CPU default.
|
||
options:
|
||
- noise_gate=true
|
||
- noise_gate_threshold_dbfs=-50
|
||
# - backend=Vulkan
|
||
# - device=0
|
||
```
|
||
|
||
## See also
|
||
|
||
- [Text to Audio (TTS)]({{< relref "text-to-audio.md" >}})
|
||
- [Audio to Text]({{< relref "audio-to-text.md" >}})
|
||
- [LocalVQE upstream](https://github.com/localai-org/LocalVQE)
|
||
- [DeepVQE paper (Indenbom et al., Interspeech 2023)](https://arxiv.org/abs/2306.03177)
|