Files
LocalAI/backend/go/vibevoice-cpp/govibevoicecpp.go
Ettore Di Giacinto fe6eb57082 feat(vibevoice-cpp): add purego TTS+ASR backend (#9610)
* feat(vibevoice-cpp): add purego TTS+ASR backend

Wire up Microsoft VibeVoice via the vibevoice.cpp C ABI as a new
purego-based Go backend that serves both Backend.TTS and
Backend.AudioTranscription from a single gRPC binary. Mirrors the
qwen3-tts-cpp / sherpa-onnx pattern so the variant matrix
(cpu/cuda12/cuda13/metal/rocm/sycl-f16/f32/vulkan/l4t) and the
e2e-backends gRPC harness reuse existing infrastructure.

- backend/go/vibevoice-cpp/ - Makefile, CMakeLists, purego shim, gRPC
  Backend with model-dir auto-detection, closed-loop TTS->ASR smoke test
- backend/index.yaml - &vibevoicecpp meta + 18 image entries
- Makefile - .NOTPARALLEL, BACKEND_VIBEVOICE_CPP, docker-build wiring,
  test-extra-backend-vibevoice-cpp-{tts,transcription} e2e wrappers
- .github/workflows/backend.yml - matrix entries for all variants
- .github/workflows/test-extra.yml - per-backend smoke + 2 gRPC e2e jobs

* feat(vibevoice-cpp): drop hardcoded glob detection, add gallery entries

Refactor backend Load() to follow the standard Options[] convention
used by sherpa-onnx and the rest of the multi-role backends:
ModelFile is the primary gguf, supplementary paths come through
opts.Options[] as key=value (or key:value for Make-target compat),
resolved against opts.ModelPath. type=asr/tts decides the role of
ModelFile when neither tts_model nor asr_model is set explicitly.

Add gallery/index.yaml entries:
- vibevoice-cpp     - realtime 0.5B Q8_0 TTS + tokenizer + Carter voice
- vibevoice-cpp-asr - long-form ASR Q8_0 + tokenizer

Both pull from huggingface://mudler/vibevoice.cpp-models with sha256
verification. parameters.model + Options[] paths are siblings under
{models_dir} per the qwen3-tts-cpp convention.

Update Makefile e2e wrappers to pass BACKEND_TEST_OPTIONS comma+colon
style, and tighten the per-backend Go closed-loop test to use the
explicit Options API.

* fix(vibevoice-cpp): force whole-archive link so vv_capi_* exports survive

libvibevoice is a STATIC archive linked into the MODULE library.
Without --whole-archive (or -force_load on Apple, /WHOLEARCHIVE on
MSVC), the linker garbage-collects symbols not referenced from this
translation unit - which means dlopen+RegisterLibFunc panics with
'undefined symbol: vv_capi_load' at backend startup, since purego
looks them up by name and our cpp/govibevoicecpp.cpp doesn't call
them directly.

* test(vibevoice-cpp): rewrite suite with Ginkgo v2

Match the convention used by backend/go/sherpa-onnx/backend_test.go.
The suite now covers backend semantics that don't need purego (Locking,
empty-ModelFile rejection, TTS/ASR-without-loaded-model errors) on top
of the gRPC lifecycle specs (Health, Load, closed-loop TTS->ASR).
Model-dependent specs Skip() when VIBEVOICE_MODEL_DIR is unset, so
`go test ./backend/go/vibevoice-cpp/` is green on a clean checkout
and runs the heavyweight closed-loop spec when test.sh has staged
the bundle.

* fix(vibevoice-cpp): implement TTSStream + AudioTranscriptionStream

The gRPC server's stream handlers (pkg/grpc/server.go) spawn a
goroutine that ranges over a chan; the only thing closing that chan
is the backend's own *Stream method. With the default Base stub
returning 'unimplemented' and never touching the chan, the server
goroutine hangs forever and the client hits DeadlineExceeded - which
is exactly what the e2e harness saw in the test-extra-backend-vibevoice-cpp-tts
matrix run.

TTSStream synthesizes via vv_capi_tts to a tempfile, then emits a
streaming WAV header (chunk sizes 0xFFFFFFFF so HTTP clients can
start playback before the full PCM lands) followed by the PCM body
in 64 KB slices. The header + >=2 PCM frames satisfy the harness's
'expected >=2 chunks' assertion and give a real progressive stream.

AudioTranscriptionStream runs the offline transcription, emits each
segment as a delta, and closes with a final_result whose Text equals
the concatenated deltas (the harness asserts those match).

Two new Ginkgo specs guard the close-channel-on-error path so the
deadline-exceeded regression can't come back silently.

* fix(vibevoice-cpp): silence errcheck on cleanup paths

Lint flagged six unchecked Close()/Remove()/RemoveAll() calls along
purely-cleanup deferred paths. Wrap each in '_ = ...' (or a closure
for defers that take args) - matches what the rest of the LocalAI
backend/go/* tree already does for these callsites.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(vibevoice-cpp): closed-loop slot fill + modelRoot-relative path resolution

Two bugs the test-extra-backend-vibevoice-cpp-* CI matrix surfaced:

1. Closed-loop Load with ModelFile=tts.gguf + Options[asr_model=...] left
   v.ttsModel empty, because the default-fill block only ran when BOTH
   slots were empty. vv_capi_load then got tts="" + a voice and the
   C side rejected it with rc=-3 'TTS model required to load a voice'.
   Fix: ModelFile fills the *primary* role-slot (decided by 'type=' in
   Options, defaulting to tts) independently of the secondary, so
   ModelFile + asr_model resolves to both.

2. resolvePath stat'd CWD before falling back to relTo. With LocalAI
   launched from a directory that happens to contain a same-named
   file, supplementary Options[] paths could leak away from the
   models dir. Drop the CWD probe entirely - relative paths now
   *always* join onto opts.ModelPath (the gallery convention).

New Ginkgo coverage:
  * 'ModelFile slot resolution' (4 specs) - asr_model+ModelFile, type=asr,
    explicit tts_model override, key:value variant.
  * 'resolvePath (relative-to-modelRoot)' (5 specs) - join, abs passthrough,
    empty input, empty relTo, and the CWD-trap regression test.
  * 'Load resolves relative Options paths against opts.ModelPath' - end-
    to-end gallery layout round-trip.

Verified locally: 19/19 specs pass (with model bundle, including the
closed-loop TTS->ASR; without bundle, 17 pass + 2 model-dependent skip).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(vibevoice-cpp): use gallery convention in closed-loop spec

The 'loads the realtime TTS model' / closed-loop specs were passing
already-prefixed paths into Options[]:

    Options: ['tokenizer=' + filepath.Join(modelDir, 'tokenizer.gguf')]

Combined with no ModelPath set on the request, the backend's
modelRoot fell back to filepath.Dir(ModelFile) = modelDir, then
resolvePath joined the prefixed Options path on top of it -
producing 'vibevoice-models/vibevoice-models/tokenizer.gguf' when
the CI's VIBEVOICE_MODEL_DIR is the relative './vibevoice-models'.

The fix is to mirror the gallery contract LocalAI core actually
sends in production: ModelPath is the models root (absolute),
ModelFile is a name *under* it, every Options[] path is relative
to ModelPath. Uses filepath.Base() to get bare filenames.

Verified locally with both VIBEVOICE_MODEL_DIR=/tmp/vv-bundle (abs)
and VIBEVOICE_MODEL_DIR=vibevoice-models (the relative shape that
broke CI). Both: 19/19 specs pass, ~55-60s.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(vibevoice-cpp): switch ASR to Q4_K + bump transcription timeout

The Q8_0 ASR gguf is ~14 GB - too big to fit alongside the runner
image, the docker build cache, and the test artifacts on a free
ubuntu-latest GHA runner; 'test-extra-backend-vibevoice-cpp-transcription'
was getting SIGTERM'd at 90 min before the model could finish loading.

Switch to Q4_K (~10 GB on disk, slightly faster CPU decode) for:
  * the e2e harness Make target
  * the gallery 'vibevoice-cpp-asr' entry (parameters + files block)
  * the per-backend test.sh auto-download list

Bump tests-vibevoice-cpp-grpc-transcription's timeout-minutes from
90 to 150 - even with Q4_K, the 30 s JFK clip on a CPU runner needs
runway above the previous 90 min cap.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(vibevoice-cpp): drop transcription gRPC e2e job - too heavy for free runners

The vibevoice ASR is a 7B-parameter model. Even on Q4_K (~10 GB on
disk) a single 30 s transcription saturates the per-test 30 min
timeout in the e2e-backends harness on a 4-core ubuntu-latest, and
the 10 GB download + Docker layer + working space leaves no headroom
on the runner's free disk. Two attempts in CI got SIGTERM'd at the
LoadModel boundary - the bottleneck isn't tunable from the workflow
side without a paid-tier runner.

The per-backend tests-vibevoice-cpp job already runs the same
AudioTranscription path via a closed-loop TTS->ASR Ginkgo spec - same
gRPC contract, same model, single process - so the standalone
tests-vibevoice-cpp-grpc-transcription job was redundant on top of
the disk/CPU pressure.

The Makefile target test-extra-backend-vibevoice-cpp-transcription
stays for local invocation on workstations that can afford it -
useful when developing the streaming codepaths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(vibevoice-cpp): restore transcription gRPC e2e on bigger-runner

Switch tests-vibevoice-cpp-grpc-transcription from ubuntu-latest to
the self-hosted 'bigger-runner' label that GPU image builds in
backend.yml use, plus the documented Free-disk-space prep step (purge
dotnet / ghc / android / CodeQL caches) the disabled vllm/sglang
entries in this file describe. That gives the 7B-param Q4_K ASR
model the disk + CPU runway it needs.

Keep timeout-minutes: 150 - even on a beefier runner the 30 s JFK
decode plus 10 GB download has to fit comfortably.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(vibevoice-cpp): apt-get install make on bigger-runner before transcription e2e

bigger-runner is a self-hosted bare runner without the standard
ubuntu image's preinstalled build tools, so the previous job died at
the very first command with 'make: command not found' (exit 127).
Add the Dependencies step that the disabled vllm/sglang entries in
this file already document - apt-get installs make + build-essential
+ curl + unzip + ca-certificates + git + tar before the make target
runs. Mirrors how every other 'runs-on: bigger-runner' entry in
backend.yml prepares the runner.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-29 22:22:14 +02:00

388 lines
12 KiB
Go

package main
import (
"encoding/json"
"fmt"
"os"
"path/filepath"
"strings"
laudio "github.com/mudler/LocalAI/pkg/audio"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
)
// vibevoice.cpp synthesizes 24 kHz mono 16-bit PCM. Hardcoded - the
// model itself is fixed-rate; if the upstream ever changes this we'll
// pick it up via vv_capi_version().
const vibevoiceSampleRate = uint32(24000)
// purego-bound entry points from libgovibevoicecpp.
var (
CppLoad func(ttsModel, asrModel, tokenizer, voice string, threads int32) int32
CppTTS func(text, voicePath, dstWav string,
nSteps int32, cfgScale float32, maxSpeechFrames int32, seed uint32) int32
CppASR func(srcWav string, outJSON []byte, capacity uint64,
maxNewTokens int32) int32
CppUnload func()
CppVersion func() string
)
// VibevoiceCpp speaks gRPC against vibevoice.cpp's flat C ABI. The
// engine is a single global, so we serialize calls through SingleThread.
type VibevoiceCpp struct {
base.SingleThread
threads int
// modelRoot is the directory we use to resolve relative paths
// from Options[] and per-call overrides (TTSRequest.Voice).
// Source of truth: opts.ModelPath; falls back to the dir of
// the primary ModelFile when ModelPath is empty.
modelRoot string
ttsModel string
asrModel string
tokenizer string
voice string
}
// resolvePath joins a relative path onto `relTo`. The gallery
// convention is that Options[] carry paths relative to the LocalAI
// models dir (opts.ModelPath), so anything not absolute is treated
// as a sibling of the primary ModelFile - never CWD. Empty / already-
// absolute / no-relTo inputs pass through unchanged.
func resolvePath(p, relTo string) string {
if p == "" || filepath.IsAbs(p) || relTo == "" {
return p
}
return filepath.Join(relTo, p)
}
// parseOptions reads opts.Options[] and pulls out the per-role
// overrides documented in the gallery entries. Accepts both "key=value"
// (gallery YAML style) and "key:value" (Make-target / env-var style).
func (v *VibevoiceCpp) parseOptions(opts []string, relTo string) string {
role := ""
for _, raw := range opts {
k, val, ok := strings.Cut(raw, "=")
if !ok {
k, val, ok = strings.Cut(raw, ":")
if !ok {
continue
}
}
key := strings.TrimSpace(k)
val = strings.TrimSpace(val)
switch key {
case "type":
role = strings.ToLower(val)
case "tokenizer":
v.tokenizer = resolvePath(val, relTo)
case "voice":
v.voice = resolvePath(val, relTo)
case "tts_model":
v.ttsModel = resolvePath(val, relTo)
case "asr_model":
v.asrModel = resolvePath(val, relTo)
}
}
return role
}
func (v *VibevoiceCpp) Load(opts *pb.ModelOptions) error {
if opts.ModelFile == "" {
return fmt.Errorf("vibevoice-cpp: ModelFile is required")
}
modelFile := opts.ModelFile
if !filepath.IsAbs(modelFile) && opts.ModelPath != "" {
modelFile = filepath.Join(opts.ModelPath, modelFile)
}
// ModelPath is the LocalAI core's models root, propagated over
// gRPC. Use it as the resolution base for Options[] (and later
// for TTSRequest.Voice) so gallery entries can reference paths
// like "tokenizer=tokenizer.gguf" and have them resolved
// against the same root the core used to drop the files.
v.modelRoot = opts.ModelPath
if v.modelRoot == "" {
v.modelRoot = filepath.Dir(modelFile)
}
role := v.parseOptions(opts.Options, v.modelRoot)
// ModelFile fills the "primary" role-slot determined by `type=`
// in Options (defaults to tts). The other slot stays exactly as
// Options set it - so a closed-loop config with ModelFile=tts.gguf
// + Options[asr_model=asr.gguf] resolves correctly to both slots,
// and an explicit `tts_model=` / `asr_model=` always wins over
// ModelFile for its own slot.
primaryIsASR := false
switch role {
case "asr", "transcript", "stt", "speech-to-text":
primaryIsASR = true
}
if primaryIsASR {
if v.asrModel == "" {
v.asrModel = modelFile
}
} else if v.ttsModel == "" {
v.ttsModel = modelFile
}
if v.ttsModel == "" && v.asrModel == "" {
return fmt.Errorf("vibevoice-cpp: no TTS or ASR model resolved from ModelFile=%q + options", opts.ModelFile)
}
if v.tokenizer == "" {
return fmt.Errorf("vibevoice-cpp: tokenizer is required - pass options: [tokenizer=<path>]")
}
threads := int(opts.Threads)
if threads <= 0 {
threads = 4
}
v.threads = threads
fmt.Fprintf(os.Stderr,
"[vibevoice-cpp] Loading: tts=%q asr=%q tokenizer=%q voice=%q threads=%d\n",
v.ttsModel, v.asrModel, v.tokenizer, v.voice, threads)
if rc := CppLoad(v.ttsModel, v.asrModel, v.tokenizer, v.voice, int32(threads)); rc != 0 {
return fmt.Errorf("vibevoice-cpp: vv_capi_load failed (rc=%d)", rc)
}
return nil
}
func (v *VibevoiceCpp) TTS(req *pb.TTSRequest) error {
if v.ttsModel == "" {
return fmt.Errorf("vibevoice-cpp: TTS requested but no realtime model was loaded")
}
text := req.Text
dst := req.Dst
if text == "" || dst == "" {
return fmt.Errorf("vibevoice-cpp: TTS requires both text and dst")
}
// req.Voice may be a bare filename (e.g. "voice-en-Emma.gguf") or an
// absolute path. Resolve via the same modelRoot Load() used for
// Options[] so a swap-voice request mirrors the gallery's layout.
voice := resolvePath(req.Voice, v.modelRoot)
if req.Language != nil && *req.Language != "" {
fmt.Fprintf(os.Stderr,
"[vibevoice-cpp] note: TTSRequest.language=%q ignored - vibevoice picks language from the voice prompt\n",
*req.Language)
}
const (
defaultSteps = 20
defaultMaxFrames = 200
)
defaultCfg := float32(1.3)
if rc := CppTTS(text, voice, dst,
int32(defaultSteps), defaultCfg, int32(defaultMaxFrames), 0); rc != 0 {
return fmt.Errorf("vibevoice-cpp: vv_capi_tts failed (rc=%d)", rc)
}
return nil
}
// asrSegment matches vibevoice's JSON output:
//
// [{"Start":0.0,"End":2.8,"Speaker":0,"Content":"…"}, ...]
type asrSegment struct {
Start float64 `json:"Start"`
End float64 `json:"End"`
Speaker int `json:"Speaker"`
Content string `json:"Content"`
}
// callASR invokes vv_capi_asr with a buffer that grows on demand.
// vv_capi_asr returns: >0 bytes written, 0 no transcript, <0 error or
// -required_size. We honor the resize protocol once before giving up.
func (v *VibevoiceCpp) callASR(srcWav string, maxNewTokens int32) (string, error) {
const startCap = 256 * 1024
buf := make([]byte, startCap)
rc := CppASR(srcWav, buf, uint64(len(buf)), maxNewTokens)
if rc < 0 {
need := -int(rc)
if need > 0 && need < (16<<20) && need > len(buf) {
buf = make([]byte, need+64)
rc = CppASR(srcWav, buf, uint64(len(buf)), maxNewTokens)
}
}
if rc < 0 {
return "", fmt.Errorf("vibevoice-cpp: vv_capi_asr failed (rc=%d)", rc)
}
if rc == 0 {
return "", nil
}
return string(buf[:rc]), nil
}
// TTSStream is the streaming counterpart to TTS. vibevoice's C ABI is
// file-only (vv_capi_tts writes a complete WAV), so we synthesize to
// a tempfile, then emit a streaming-WAV header followed by the PCM
// body in chunks. The main reason this exists at all is the gRPC
// server wrapper (pkg/grpc/server.go:TTSStream) blocks on a channel
// that only this method can close - if we leave the default Base
// stub in place, every TTSStream call hangs until the client
// deadline.
func (v *VibevoiceCpp) TTSStream(req *pb.TTSRequest, results chan []byte) error {
defer close(results)
if v.ttsModel == "" {
return fmt.Errorf("vibevoice-cpp: TTSStream requested but no realtime model was loaded")
}
if req.Text == "" {
return fmt.Errorf("vibevoice-cpp: TTSStream requires text")
}
tmp, err := os.CreateTemp("", "vibevoice-cpp-stream-*.wav")
if err != nil {
return fmt.Errorf("vibevoice-cpp: tempfile: %w", err)
}
dst := tmp.Name()
_ = tmp.Close()
defer func() { _ = os.Remove(dst) }()
if err := v.TTS(&pb.TTSRequest{
Text: req.Text,
Voice: req.Voice,
Dst: dst,
Language: req.Language,
}); err != nil {
return err
}
wav, err := os.ReadFile(dst)
if err != nil {
return fmt.Errorf("vibevoice-cpp: read tempfile: %w", err)
}
// Streaming WAV header: declare 0xFFFFFFFF for chunk sizes so HTTP
// clients can start playback before they see the full PCM.
const streamingSize = 0xFFFFFFFF
hdr := laudio.NewWAVHeaderWithRate(streamingSize, vibevoiceSampleRate)
hdr.ChunkSize = streamingSize
hdrBuf := make([]byte, 0, laudio.WAVHeaderSize)
w := newByteWriter(&hdrBuf)
if err := hdr.Write(w); err != nil {
return fmt.Errorf("vibevoice-cpp: write WAV header: %w", err)
}
results <- hdrBuf
// PCM body: send in ~64 KB slices so the client gets multiple
// reply chunks (e2e harness asserts >=2 frames).
pcm := laudio.StripWAVHeader(wav)
const chunkBytes = 64 * 1024
for off := 0; off < len(pcm); off += chunkBytes {
end := off + chunkBytes
if end > len(pcm) {
end = len(pcm)
}
chunk := make([]byte, end-off)
copy(chunk, pcm[off:end])
results <- chunk
}
return nil
}
// byteWriter adapts a *[]byte to io.Writer so we can hand it to
// laudio.WAVHeader.Write without allocating a bytes.Buffer.
type byteWriter struct{ buf *[]byte }
func newByteWriter(b *[]byte) *byteWriter { return &byteWriter{buf: b} }
func (w *byteWriter) Write(p []byte) (int, error) {
*w.buf = append(*w.buf, p...)
return len(p), nil
}
func (v *VibevoiceCpp) AudioTranscription(req *pb.TranscriptRequest) (pb.TranscriptResult, error) {
if v.asrModel == "" {
return pb.TranscriptResult{}, fmt.Errorf("vibevoice-cpp: AudioTranscription requested but no ASR model was loaded")
}
if req.Dst == "" {
return pb.TranscriptResult{}, fmt.Errorf("vibevoice-cpp: TranscriptRequest.dst (audio path) is required")
}
out, err := v.callASR(req.Dst, 0)
if err != nil {
return pb.TranscriptResult{}, err
}
if out == "" {
return pb.TranscriptResult{}, nil
}
var segs []asrSegment
if err := json.Unmarshal([]byte(out), &segs); err != nil {
fmt.Fprintf(os.Stderr,
"[vibevoice-cpp] WARNING: vv_capi_asr returned non-JSON, falling back to single segment: %v\n", err)
return pb.TranscriptResult{
Segments: []*pb.TranscriptSegment{{Id: 0, Text: strings.TrimSpace(out)}},
Text: strings.TrimSpace(out),
}, nil
}
segments := make([]*pb.TranscriptSegment, 0, len(segs))
parts := make([]string, 0, len(segs))
var duration float32
for i, s := range segs {
// LocalAI's whisper backend uses int64 100ns ticks for
// Start/End (seconds * 1e7); follow the same convention so
// consumers can mix vibevoice and whisper transcripts.
segments = append(segments, &pb.TranscriptSegment{
Id: int32(i),
Text: s.Content,
Start: int64(s.Start * 1e7),
End: int64(s.End * 1e7),
Speaker: fmt.Sprintf("%d", s.Speaker),
})
parts = append(parts, strings.TrimSpace(s.Content))
if float32(s.End) > duration {
duration = float32(s.End)
}
}
return pb.TranscriptResult{
Segments: segments,
Text: strings.TrimSpace(strings.Join(parts, " ")),
Duration: duration,
}, nil
}
// AudioTranscriptionStream wraps AudioTranscription so the streaming
// gRPC endpoint (server.go:AudioTranscriptionStream) sees its channel
// close and the client doesn't sit waiting until deadline. vibevoice's
// ASR doesn't expose token-level streaming - vv_capi_asr decodes the
// whole audio and returns a JSON segment list - so we run the offline
// transcription, emit each segment's content as a delta, then close
// with a final_result whose Text equals the concatenated deltas (the
// e2e harness asserts those match).
func (v *VibevoiceCpp) AudioTranscriptionStream(req *pb.TranscriptRequest, results chan *pb.TranscriptStreamResponse) error {
defer close(results)
res, err := v.AudioTranscription(req)
if err != nil {
return err
}
var assembled strings.Builder
for _, seg := range res.Segments {
if seg == nil {
continue
}
txt := strings.TrimSpace(seg.Text)
if txt == "" {
continue
}
delta := txt
if assembled.Len() > 0 {
delta = " " + txt
}
results <- &pb.TranscriptStreamResponse{Delta: delta}
assembled.WriteString(delta)
}
final := pb.TranscriptResult{
Segments: res.Segments,
Duration: res.Duration,
Language: res.Language,
Text: assembled.String(),
}
results <- &pb.TranscriptStreamResponse{FinalResult: &final}
return nil
}