From fe6eb570821889bf22f29313fa64705eb6c4874c Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Wed, 29 Apr 2026 22:22:14 +0200
Subject: [PATCH] feat(vibevoice-cpp): add purego TTS+ASR backend (#9610)

* feat(vibevoice-cpp): add purego TTS+ASR backend

Wire up Microsoft VibeVoice via the vibevoice.cpp C ABI as a new
purego-based Go backend that serves both Backend.TTS and
Backend.AudioTranscription from a single gRPC binary. Mirrors the
qwen3-tts-cpp / sherpa-onnx pattern so the variant matrix
(cpu/cuda12/cuda13/metal/rocm/sycl-f16/f32/vulkan/l4t) and the
e2e-backends gRPC harness reuse existing infrastructure.

- backend/go/vibevoice-cpp/ - Makefile, CMakeLists, purego shim, gRPC
  Backend with model-dir auto-detection, closed-loop TTS->ASR smoke test
- backend/index.yaml - &vibevoicecpp meta + 18 image entries
- Makefile - .NOTPARALLEL, BACKEND_VIBEVOICE_CPP, docker-build wiring,
  test-extra-backend-vibevoice-cpp-{tts,transcription} e2e wrappers
- .github/workflows/backend.yml - matrix entries for all variants
- .github/workflows/test-extra.yml - per-backend smoke + 2 gRPC e2e jobs

* feat(vibevoice-cpp): drop hardcoded glob detection, add gallery entries

Refactor backend Load() to follow the standard Options[] convention used
by sherpa-onnx and the rest of the multi-role backends: ModelFile is the
primary gguf, supplementary paths come through opts.Options[] as
key=value (or key:value for Make-target compat), resolved against
opts.ModelPath. type=asr/tts decides the role of ModelFile when neither
tts_model nor asr_model is set explicitly.

Add gallery/index.yaml entries:
- vibevoice-cpp - realtime 0.5B Q8_0 TTS + tokenizer + Carter voice
- vibevoice-cpp-asr - long-form ASR Q8_0 + tokenizer

Both pull from huggingface://mudler/vibevoice.cpp-models with sha256
verification. parameters.model + Options[] paths are siblings under
{models_dir} per the qwen3-tts-cpp convention.

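The Options[] convention described above (key=value or key:value entries, with supplementary paths resolved against opts.ModelPath) could look roughly like the sketch below. This is an illustration only, not the backend's actual code: splitOption, resolveUnder and parseOptions are hypothetical names, and treating every key other than type as a path is an assumption made for the example.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// splitOption splits "key=value" or "key:value" at the first separator.
// Hypothetical helper: the real backend may parse options differently.
func splitOption(opt string) (key, value string, ok bool) {
	i := strings.IndexAny(opt, "=:")
	if i <= 0 {
		return "", "", false // no separator, or empty key
	}
	return strings.TrimSpace(opt[:i]), strings.TrimSpace(opt[i+1:]), true
}

// resolveUnder joins a relative path onto modelRoot. Empty and absolute
// paths are returned unchanged. It never probes the current directory.
func resolveUnder(p, modelRoot string) string {
	if p == "" || filepath.IsAbs(p) {
		return p
	}
	return filepath.Join(modelRoot, p)
}

// parseOptions turns Options[] entries into a map. Assumption for this
// sketch: "type" is a plain value, every other key carries a file path.
func parseOptions(options []string, modelRoot string) map[string]string {
	out := map[string]string{}
	for _, o := range options {
		k, v, ok := splitOption(o)
		if !ok {
			continue // skip malformed entries
		}
		if k == "type" {
			out[k] = v
			continue
		}
		out[k] = resolveUnder(v, modelRoot)
	}
	return out
}

func main() {
	opts := parseOptions(
		[]string{"type=asr", "tokenizer:tokenizer.gguf", "voice=/abs/voice.gguf"},
		"/models",
	)
	fmt.Println(opts["type"], opts["tokenizer"], opts["voice"])
}
```

Splitting at the first separator only means a value may itself contain '=' or ':' characters; the key is what must stay free of them.
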
Update Makefile e2e wrappers to pass BACKEND_TEST_OPTIONS comma+colon
style, and tighten the per-backend Go closed-loop test to use the
explicit Options API.

* fix(vibevoice-cpp): force whole-archive link so vv_capi_* exports survive

libvibevoice is a STATIC archive linked into the MODULE library. Without
--whole-archive (or -force_load on Apple, /WHOLEARCHIVE on MSVC), the
linker garbage-collects symbols not referenced from this translation
unit - which means dlopen+RegisterLibFunc panics with 'undefined symbol:
vv_capi_load' at backend startup, since purego looks them up by name and
our cpp/govibevoicecpp.cpp doesn't call them directly.

* test(vibevoice-cpp): rewrite suite with Ginkgo v2

Match the convention used by backend/go/sherpa-onnx/backend_test.go. The
suite now covers backend semantics that don't need purego (Locking,
empty-ModelFile rejection, TTS/ASR-without-loaded-model errors) on top
of the gRPC lifecycle specs (Health, Load, closed-loop TTS->ASR).
Model-dependent specs Skip() when VIBEVOICE_MODEL_DIR is unset, so
`go test ./backend/go/vibevoice-cpp/` is green on a clean checkout and
runs the heavyweight closed-loop spec when test.sh has staged the
bundle.

* fix(vibevoice-cpp): implement TTSStream + AudioTranscriptionStream

The gRPC server's stream handlers (pkg/grpc/server.go) spawn a goroutine
that ranges over a chan; the only thing closing that chan is the
backend's own *Stream method. With the default Base stub returning
'unimplemented' and never touching the chan, the server goroutine hangs
forever and the client hits DeadlineExceeded - which is exactly what the
e2e harness saw in the test-extra-backend-vibevoice-cpp-tts matrix run.

TTSStream synthesizes via vv_capi_tts to a tempfile, then emits a
streaming WAV header (chunk sizes 0xFFFFFFFF so HTTP clients can start
playback before the full PCM lands) followed by the PCM body in 64 KB
slices.
The header + >=2 PCM frames satisfy the harness's 'expected >=2 chunks'
assertion and give a real progressive stream.

AudioTranscriptionStream runs the offline transcription, emits each
segment as a delta, and closes with a final_result whose Text equals the
concatenated deltas (the harness asserts those match).

Two new Ginkgo specs guard the close-channel-on-error path so the
deadline-exceeded regression can't come back silently.

* fix(vibevoice-cpp): silence errcheck on cleanup paths

Lint flagged six unchecked Close()/Remove()/RemoveAll() calls along
purely-cleanup deferred paths. Wrap each in '_ = ...' (or a closure for
defers that take args) - matches what the rest of the LocalAI
backend/go/* tree already does for these callsites.

Signed-off-by: Ettore Di Giacinto

* fix(vibevoice-cpp): closed-loop slot fill + modelRoot-relative path resolution

Two bugs the test-extra-backend-vibevoice-cpp-* CI matrix surfaced:

1. Closed-loop Load with ModelFile=tts.gguf + Options[asr_model=...]
   left v.ttsModel empty, because the default-fill block only ran when
   BOTH slots were empty. vv_capi_load then got tts="" + a voice and the
   C side rejected it with rc=-3 'TTS model required to load a voice'.
   Fix: ModelFile fills the *primary* role-slot (decided by 'type=' in
   Options, defaulting to tts) independently of the secondary, so
   ModelFile + asr_model resolves to both.

2. resolvePath stat'd CWD before falling back to relTo. With LocalAI
   launched from a directory that happens to contain a same-named file,
   supplementary Options[] paths could leak away from the models dir.
   Drop the CWD probe entirely - relative paths now *always* join onto
   opts.ModelPath (the gallery convention).

New Ginkgo coverage:
  * 'ModelFile slot resolution' (4 specs) - asr_model+ModelFile,
    type=asr, explicit tts_model override, key:value variant.
  * 'resolvePath (relative-to-modelRoot)' (5 specs) - join, abs
    passthrough, empty input, empty relTo, and the CWD-trap regression
    test.
  * 'Load resolves relative Options paths against opts.ModelPath' -
    end-to-end gallery layout round-trip.

Verified locally: 19/19 specs pass (with model bundle, including the
closed-loop TTS->ASR; without bundle, 17 pass + 2 model-dependent skip).

Signed-off-by: Ettore Di Giacinto

* test(vibevoice-cpp): use gallery convention in closed-loop spec

The 'loads the realtime TTS model' / closed-loop specs were passing
already-prefixed paths into Options[]:

    Options: ['tokenizer=' + filepath.Join(modelDir, 'tokenizer.gguf')]

Combined with no ModelPath set on the request, the backend's modelRoot
fell back to filepath.Dir(ModelFile) = modelDir, then resolvePath joined
the prefixed Options path on top of it - producing
'vibevoice-models/vibevoice-models/tokenizer.gguf' when the CI's
VIBEVOICE_MODEL_DIR is the relative './vibevoice-models'.

The fix is to mirror the gallery contract LocalAI core actually sends in
production: ModelPath is the models root (absolute), ModelFile is a name
*under* it, every Options[] path is relative to ModelPath. Uses
filepath.Base() to get bare filenames.

Verified locally with both VIBEVOICE_MODEL_DIR=/tmp/vv-bundle (abs) and
VIBEVOICE_MODEL_DIR=vibevoice-models (the relative shape that broke CI).
Both: 19/19 specs pass, ~55-60s.

Signed-off-by: Ettore Di Giacinto

* ci(vibevoice-cpp): switch ASR to Q4_K + bump transcription timeout

The Q8_0 ASR gguf is ~14 GB - too big to fit alongside the runner image,
the docker build cache, and the test artifacts on a free ubuntu-latest
GHA runner; 'test-extra-backend-vibevoice-cpp-transcription' was getting
SIGTERM'd at 90 min before the model could finish loading.

Switch to Q4_K (~10 GB on disk, slightly faster CPU decode) for:
  * the e2e harness Make target
  * the gallery 'vibevoice-cpp-asr' entry (parameters + files block)
  * the per-backend test.sh auto-download list

Bump tests-vibevoice-cpp-grpc-transcription's timeout-minutes from 90 to
150 - even with Q4_K, the 30 s JFK clip on a CPU runner needs runway
above the previous 90 min cap.

Signed-off-by: Ettore Di Giacinto

* ci(vibevoice-cpp): drop transcription gRPC e2e job - too heavy for free runners

The vibevoice ASR is a 7B-parameter model. Even on Q4_K (~10 GB on disk)
a single 30 s transcription saturates the per-test 30 min timeout in the
e2e-backends harness on a 4-core ubuntu-latest, and the 10 GB download +
Docker layer + working space leaves no headroom on the runner's free
disk. Two attempts in CI got SIGTERM'd at the LoadModel boundary - the
bottleneck isn't tunable from the workflow side without a paid-tier
runner.

The per-backend tests-vibevoice-cpp job already runs the same
AudioTranscription path via a closed-loop TTS->ASR Ginkgo spec - same
gRPC contract, same model, single process - so the standalone
tests-vibevoice-cpp-grpc-transcription job was redundant on top of the
disk/CPU pressure.

The Makefile target test-extra-backend-vibevoice-cpp-transcription stays
for local invocation on workstations that can afford it - useful when
developing the streaming codepaths.

Signed-off-by: Ettore Di Giacinto

* ci(vibevoice-cpp): restore transcription gRPC e2e on bigger-runner

Switch tests-vibevoice-cpp-grpc-transcription from ubuntu-latest to the
self-hosted 'bigger-runner' label that GPU image builds in backend.yml
use, plus the documented Free-disk-space prep step (purge dotnet / ghc /
android / CodeQL caches) the disabled vllm/sglang entries in this file
describe. That gives the 7B-param Q4_K ASR model the disk + CPU runway
it needs.

Keep timeout-minutes: 150 - even on a beefier runner the 30 s JFK decode
plus 10 GB download has to fit comfortably.

Signed-off-by: Ettore Di Giacinto

* ci(vibevoice-cpp): apt-get install make on bigger-runner before transcription e2e

bigger-runner is a self-hosted bare runner without the standard ubuntu
image's preinstalled build tools, so the previous job died at the very
first command with 'make: command not found' (exit 127).

Add the Dependencies step that the disabled vllm/sglang entries in this
file already document - apt-get installs make + build-essential + curl +
unzip + ca-certificates + git + tar before the make target runs. Mirrors
how every other 'runs-on: bigger-runner' entry in backend.yml prepares
the runner.

Signed-off-by: Ettore Di Giacinto

---------

Signed-off-by: Ettore Di Giacinto
---
 .github/workflows/backend.yml                 | 122 ++++++
 .github/workflows/test-extra.yml              |  92 +++++
 Makefile                                      |  32 +-
 backend/go/vibevoice-cpp/CMakeLists.txt       |  71 ++++
 backend/go/vibevoice-cpp/Makefile             | 128 ++++++
 .../go/vibevoice-cpp/cpp/govibevoicecpp.cpp   |  41 ++
 backend/go/vibevoice-cpp/cpp/govibevoicecpp.h |   7 +
 backend/go/vibevoice-cpp/govibevoicecpp.go    | 387 ++++++++++++++++++
 backend/go/vibevoice-cpp/main.go              |  49 +++
 backend/go/vibevoice-cpp/package.sh           |  58 +++
 backend/go/vibevoice-cpp/run.sh               |  49 +++
 backend/go/vibevoice-cpp/test.sh              |  74 ++++
 backend/go/vibevoice-cpp/vibevoicecpp_test.go | 382 +++++++++++++++++
 backend/index.yaml                            | 129 ++++++
 gallery/index.yaml                            |  77 ++++
 15 files changed, 1696 insertions(+), 2 deletions(-)
 create mode 100644 backend/go/vibevoice-cpp/CMakeLists.txt
 create mode 100644 backend/go/vibevoice-cpp/Makefile
 create mode 100644 backend/go/vibevoice-cpp/cpp/govibevoicecpp.cpp
 create mode 100644 backend/go/vibevoice-cpp/cpp/govibevoicecpp.h
 create mode 100644 backend/go/vibevoice-cpp/govibevoicecpp.go
 create mode 100644 backend/go/vibevoice-cpp/main.go
 create mode 100755 backend/go/vibevoice-cpp/package.sh
 create mode 100755 backend/go/vibevoice-cpp/run.sh
 create mode 100755 backend/go/vibevoice-cpp/test.sh
 create mode 100644
backend/go/vibevoice-cpp/vibevoicecpp_test.go diff --git a/.github/workflows/backend.yml b/.github/workflows/backend.yml index c3347e4ff..ceb268864 100644 --- a/.github/workflows/backend.yml +++ b/.github/workflows/backend.yml @@ -698,6 +698,19 @@ jobs: dockerfile: "./backend/Dockerfile.golang" context: "./" ubuntu-version: '2404' + - build-type: 'cublas' + cuda-major-version: "12" + cuda-minor-version: "8" + platforms: 'linux/amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-nvidia-cuda-12-vibevoice-cpp' + runs-on: 'ubuntu-latest' + base-image: "ubuntu:24.04" + skip-drivers: 'false' + backend: "vibevoice-cpp" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' - build-type: 'cublas' cuda-major-version: "12" cuda-minor-version: "8" @@ -1440,6 +1453,19 @@ jobs: dockerfile: "./backend/Dockerfile.golang" context: "./" ubuntu-version: '2404' + - build-type: 'cublas' + cuda-major-version: "13" + cuda-minor-version: "0" + platforms: 'linux/amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-nvidia-cuda-13-vibevoice-cpp' + runs-on: 'ubuntu-latest' + base-image: "ubuntu:24.04" + skip-drivers: 'false' + backend: "vibevoice-cpp" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' - build-type: 'cublas' cuda-major-version: "13" cuda-minor-version: "0" @@ -1466,6 +1492,19 @@ jobs: backend: "qwen3-tts-cpp" dockerfile: "./backend/Dockerfile.golang" context: "./" + - build-type: 'cublas' + cuda-major-version: "13" + cuda-minor-version: "0" + platforms: 'linux/arm64' + skip-drivers: 'false' + tag-latest: 'auto' + tag-suffix: '-nvidia-l4t-cuda-13-arm64-vibevoice-cpp' + base-image: "ubuntu:24.04" + ubuntu-version: '2404' + runs-on: 'ubuntu-24.04-arm' + backend: "vibevoice-cpp" + dockerfile: "./backend/Dockerfile.golang" + context: "./" - build-type: 'cublas' cuda-major-version: "13" cuda-minor-version: "0" @@ -2633,6 +2672,85 @@ jobs: dockerfile: "./backend/Dockerfile.golang" context: "./" ubuntu-version: '2404' + # 
vibevoice-cpp + - build-type: '' + cuda-major-version: "" + cuda-minor-version: "" + platforms: 'linux/amd64,linux/arm64' + tag-latest: 'auto' + tag-suffix: '-cpu-vibevoice-cpp' + runs-on: 'ubuntu-latest' + base-image: "ubuntu:24.04" + skip-drivers: 'false' + backend: "vibevoice-cpp" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' + - build-type: 'sycl_f32' + cuda-major-version: "" + cuda-minor-version: "" + platforms: 'linux/amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-intel-sycl-f32-vibevoice-cpp' + runs-on: 'ubuntu-latest' + base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04" + skip-drivers: 'false' + backend: "vibevoice-cpp" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' + - build-type: 'sycl_f16' + cuda-major-version: "" + cuda-minor-version: "" + platforms: 'linux/amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-intel-sycl-f16-vibevoice-cpp' + runs-on: 'ubuntu-latest' + base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04" + skip-drivers: 'false' + backend: "vibevoice-cpp" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' + - build-type: 'vulkan' + cuda-major-version: "" + cuda-minor-version: "" + platforms: 'linux/amd64,linux/arm64' + tag-latest: 'auto' + tag-suffix: '-gpu-vulkan-vibevoice-cpp' + runs-on: 'ubuntu-latest' + base-image: "ubuntu:24.04" + skip-drivers: 'false' + backend: "vibevoice-cpp" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' + - build-type: 'cublas' + cuda-major-version: "12" + cuda-minor-version: "0" + platforms: 'linux/arm64' + skip-drivers: 'false' + tag-latest: 'auto' + tag-suffix: '-nvidia-l4t-arm64-vibevoice-cpp' + base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0" + runs-on: 'ubuntu-24.04-arm' + backend: "vibevoice-cpp" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2204' + - build-type: 'hipblas' + cuda-major-version: "" + 
cuda-minor-version: "" + platforms: 'linux/amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-rocm-hipblas-vibevoice-cpp' + base-image: "rocm/dev-ubuntu-24.04:6.4.4" + runs-on: 'ubuntu-latest' + skip-drivers: 'false' + backend: "vibevoice-cpp" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' # voxtral - build-type: '' cuda-major-version: "" @@ -3027,6 +3145,10 @@ jobs: tag-suffix: "-metal-darwin-arm64-qwen3-tts-cpp" build-type: "metal" lang: "go" + - backend: "vibevoice-cpp" + tag-suffix: "-metal-darwin-arm64-vibevoice-cpp" + build-type: "metal" + lang: "go" - backend: "voxtral" tag-suffix: "-metal-darwin-arm64-voxtral" build-type: "metal" diff --git a/.github/workflows/test-extra.yml b/.github/workflows/test-extra.yml index d37f77e01..aa8a02c49 100644 --- a/.github/workflows/test-extra.yml +++ b/.github/workflows/test-extra.yml @@ -36,6 +36,7 @@ jobs: sglang: ${{ steps.detect.outputs.sglang }} acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }} qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }} + vibevoice-cpp: ${{ steps.detect.outputs.vibevoice-cpp }} voxtral: ${{ steps.detect.outputs.voxtral }} kokoros: ${{ steps.detect.outputs.kokoros }} insightface: ${{ steps.detect.outputs.insightface }} @@ -792,6 +793,97 @@ jobs: - name: Test qwen3-tts-cpp run: | make --jobs=5 --output-sync=target -C backend/go/qwen3-tts-cpp test + # Per-backend smoke for vibevoice-cpp: builds the .so + Go binary and + # runs `make -C backend/go/vibevoice-cpp test`. test.sh auto-downloads + # the published mudler/vibevoice.cpp-models bundle (TTS Q8_0 + ASR Q4_K + # + tokenizer + voice) and runs the closed-loop TTS → ASR Go test. 
+ tests-vibevoice-cpp: + needs: detect-changes + if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true' + runs-on: ubuntu-latest + timeout-minutes: 90 + steps: + - name: Clone + uses: actions/checkout@v6 + with: + submodules: true + - name: Dependencies + run: | + sudo apt-get update + sudo apt-get install -y build-essential cmake curl libopenblas-dev ffmpeg + - name: Setup Go + uses: actions/setup-go@v5 + - name: Display Go version + run: go version + - name: Proto Dependencies + run: | + curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \ + unzip -j -d /usr/local/bin protoc.zip bin/protoc && \ + rm protoc.zip + go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2 + go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af + PATH="$PATH:$HOME/go/bin" make protogen-go + - name: Build vibevoice-cpp + run: | + make --jobs=5 --output-sync=target -C backend/go/vibevoice-cpp + - name: Test vibevoice-cpp + run: | + make --jobs=5 --output-sync=target -C backend/go/vibevoice-cpp test + # End-to-end TTS via the e2e-backends gRPC harness. Builds the + # vibevoice-cpp Docker image and drives Backend/TTS against it with a + # real LocalAI gRPC client. + tests-vibevoice-cpp-grpc-tts: + needs: detect-changes + if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true' + runs-on: ubuntu-latest + timeout-minutes: 90 + steps: + - name: Clone + uses: actions/checkout@v6 + with: + submodules: true + - name: Setup Go + uses: actions/setup-go@v5 + with: + go-version: '1.25.4' + - name: Build vibevoice-cpp backend image and run TTS gRPC e2e tests + run: | + make test-extra-backend-vibevoice-cpp-tts + # End-to-end transcription via the e2e-backends gRPC harness. 
The + # vibevoice ASR is a 7B-param model (Q4_K weights ~10 GB on disk) + # and the JFK 30 s decode is too heavy for a free 4-core + # ubuntu-latest pool runner - two CI attempts got SIGTERM'd during + # LoadModel, before the test could even progress. Use the + # self-hosted 'bigger-runner' label (same one the GPU image builds + # in backend.yml use) and the documented dotnet/ghc/android cache + # purge to clear ~10-20 GB of headroom for the model + Docker + # image + working dir. + tests-vibevoice-cpp-grpc-transcription: + needs: detect-changes + if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true' + runs-on: bigger-runner + timeout-minutes: 150 + steps: + - name: Clone + uses: actions/checkout@v6 + with: + submodules: true + - name: Dependencies + run: | + sudo apt-get update + sudo apt-get install -y --no-install-recommends \ + make build-essential curl unzip ca-certificates git tar + - name: Setup Go + uses: actions/setup-go@v5 + with: + go-version: '1.25.4' + - name: Free disk space + run: | + sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true + df -h + - name: Build vibevoice-cpp backend image and run ASR gRPC e2e tests + run: | + make test-extra-backend-vibevoice-cpp-transcription tests-voxtral: needs: detect-changes if: needs.detect-changes.outputs.voxtral == 'true' || needs.detect-changes.outputs.run-all == 'true' diff --git a/Makefile b/Makefile index f9c2d87ef..c7b536fa2 100644 --- a/Makefile +++ b/Makefile @@ -1,5 +1,5 @@ # Disable parallel execution for backend builds -.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin 
backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad backends/sherpa-onnx +.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/tinygrad backends/sherpa-onnx GOCMD=go GOTEST=$(GOCMD) test @@ -833,6 +833,32 @@ test-extra-backend-sherpa-onnx-tts: docker-build-sherpa-onnx BACKEND_TEST_CAPS=health,load,tts \ $(MAKE) test-extra-backend +## VibeVoice TTS via the vibevoice-cpp backend. 
ModelFile is the +## realtime gguf; the supplementary tokenizer + voice prompt land +## alongside it under the harness's models dir and are wired through +## via the standard Options[] convention (tokenizer=, voice=). +test-extra-backend-vibevoice-cpp-tts: docker-build-vibevoice-cpp + BACKEND_IMAGE=local-ai-backend:vibevoice-cpp \ + BACKEND_TEST_MODEL_URL='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/vibevoice-realtime-0.5B-q8_0.gguf#vibevoice-realtime-0.5B-q8_0.gguf' \ + BACKEND_TEST_EXTRA_FILES='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/tokenizer.gguf#tokenizer.gguf|https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/voice-en-Carter_man.gguf#voice-en-Carter_man.gguf' \ + BACKEND_TEST_OPTIONS=tokenizer:tokenizer.gguf,voice:voice-en-Carter_man.gguf \ + BACKEND_TEST_CAPS=health,load,tts \ + $(MAKE) test-extra-backend + +## VibeVoice ASR (long-form, with diarization). type=asr tells the +## backend's Load() to slot ModelFile into the asr_model role; the +## tokenizer is supplied via Options[]. Uses the Q4_K quant (~10 GB) +## rather than Q8_0 (~14 GB) so the bundle fits inside ubuntu-latest's +## post-image disk budget. +test-extra-backend-vibevoice-cpp-transcription: docker-build-vibevoice-cpp + BACKEND_IMAGE=local-ai-backend:vibevoice-cpp \ + BACKEND_TEST_MODEL_URL='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/vibevoice-asr-q4_k.gguf#vibevoice-asr-q4_k.gguf' \ + BACKEND_TEST_EXTRA_FILES='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/tokenizer.gguf#tokenizer.gguf' \ + BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \ + BACKEND_TEST_OPTIONS=type:asr,tokenizer:tokenizer.gguf \ + BACKEND_TEST_CAPS=health,load,transcription \ + $(MAKE) test-extra-backend + ## sglang mirrors the vllm setup: HuggingFace model id, same tiny Qwen, ## tool-call extraction via sglang's native qwen parser. 
CPU builds use ## sglang's upstream pyproject_cpu.toml recipe (see backend/python/sglang/install.sh). @@ -969,6 +995,7 @@ BACKEND_WHISPER = whisper|golang|.|false|true BACKEND_VOXTRAL = voxtral|golang|.|false|true BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true +BACKEND_VIBEVOICE_CPP = vibevoice-cpp|golang|.|false|true BACKEND_OPUS = opus|golang|.|false|true BACKEND_SHERPA_ONNX = sherpa-onnx|golang|.|false|true @@ -1075,6 +1102,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_WHISPERX))) $(eval $(call generate-docker-build-target,$(BACKEND_ACE_STEP))) $(eval $(call generate-docker-build-target,$(BACKEND_ACESTEP_CPP))) $(eval $(call generate-docker-build-target,$(BACKEND_QWEN3_TTS_CPP))) +$(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE_CPP))) $(eval $(call generate-docker-build-target,$(BACKEND_MLX))) $(eval $(call generate-docker-build-target,$(BACKEND_MLX_VLM))) $(eval $(call generate-docker-build-target,$(BACKEND_MLX_DISTRIBUTED))) @@ -1089,7 +1117,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX))) docker-save-%: backend-images docker save local-ai-backend:$* -o backend-images/$*.tar -docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp 
docker-build-qwen3-tts-cpp docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx +docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx ######################################################## ### Mock Backend for E2E Tests diff --git a/backend/go/vibevoice-cpp/CMakeLists.txt b/backend/go/vibevoice-cpp/CMakeLists.txt new file mode 100644 index 000000000..dde8807fe --- /dev/null +++ b/backend/go/vibevoice-cpp/CMakeLists.txt @@ -0,0 +1,71 @@ +cmake_minimum_required(VERSION 3.18) +project(govibevoicecpp LANGUAGES C CXX) +set(CMAKE_POSITION_INDEPENDENT_CODE ON) +set(CMAKE_EXPORT_COMPILE_COMMANDS ON) + +set(VIBEVOICE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/sources/vibevoice.cpp) + +# Override upstream's CMAKE_CUDA_ARCHITECTURES before add_subdirectory. +if(NOT DEFINED CMAKE_CUDA_ARCHITECTURES) + set(CMAKE_CUDA_ARCHITECTURES "75-virtual;80-virtual;86-real;89-real") +endif() + +# Force-disable upstream tests/examples — we only need libvibevoice. 
+set(VIBEVOICE_BUILD_TESTS OFF CACHE BOOL "" FORCE) +set(VIBEVOICE_BUILD_EXAMPLES OFF CACHE BOOL "" FORCE) +set(VIBEVOICE_BUILD_SERVER OFF CACHE BOOL "" FORCE) + +# vibevoice.cpp's top-level CMakeLists already adds third_party/ggml as a +# subdirectory — no need to add it explicitly here, just include the +# whole project. +add_subdirectory(${VIBEVOICE_DIR} vibevoice EXCLUDE_FROM_ALL) + +add_library(govibevoicecpp MODULE cpp/govibevoicecpp.cpp) + +# libvibevoice is STATIC; without --whole-archive the linker GCs the +# vv_capi_* symbols (purego dlopens them by name, nothing in our +# translation unit references them). Force the static archive's +# entire contents into the MODULE so dlsym finds vv_capi_load etc. +if(APPLE) + target_link_libraries(govibevoicecpp PRIVATE -Wl,-force_load $<TARGET_FILE:vibevoice>) +elseif(MSVC) + target_link_libraries(govibevoicecpp PRIVATE vibevoice) + set_property(TARGET govibevoicecpp APPEND PROPERTY LINK_FLAGS "/WHOLEARCHIVE:vibevoice") +else() + target_link_libraries(govibevoicecpp PRIVATE + -Wl,--whole-archive vibevoice -Wl,--no-whole-archive) +endif() + +target_include_directories(govibevoicecpp PRIVATE ${VIBEVOICE_DIR}/include) +target_include_directories(govibevoicecpp SYSTEM PRIVATE ${VIBEVOICE_DIR}/third_party/ggml/include) + +# Link GPU backends if available — vibevoice's own CMake already links +# these to the libvibevoice STATIC library, but we re-link them on the +# MODULE so resolved symbols include all backend kernels. 
+foreach(backend blas cuda metal vulkan) + if(TARGET ggml-${backend}) + target_link_libraries(govibevoicecpp PRIVATE ggml-${backend}) + string(TOUPPER ${backend} BACKEND_UPPER) + target_compile_definitions(govibevoicecpp PRIVATE VIBEVOICE_HAVE_${BACKEND_UPPER}) + if(backend STREQUAL "cuda") + find_package(CUDAToolkit QUIET) + if(CUDAToolkit_FOUND) + target_link_libraries(govibevoicecpp PRIVATE CUDA::cudart) + endif() + endif() + endif() +endforeach() + +if(MSVC) + target_compile_options(govibevoicecpp PRIVATE /W4 /wd4100 /wd4505) +else() + target_compile_options(govibevoicecpp PRIVATE -Wall -Wextra -Wshadow + -Wno-unused-parameter -Wno-unused-function -Wno-sign-conversion) +endif() + +if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0) + target_link_libraries(govibevoicecpp PRIVATE stdc++fs) +endif() + +set_property(TARGET govibevoicecpp PROPERTY CXX_STANDARD 17) +set_target_properties(govibevoicecpp PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}) diff --git a/backend/go/vibevoice-cpp/Makefile b/backend/go/vibevoice-cpp/Makefile new file mode 100644 index 000000000..67eeebbca --- /dev/null +++ b/backend/go/vibevoice-cpp/Makefile @@ -0,0 +1,128 @@ +CMAKE_ARGS?= +BUILD_TYPE?= +NATIVE?=false + +GOCMD?=go +GO_TAGS?= +JOBS?=$(shell nproc --ignore=1) + +# vibevoice.cpp version +VIBEVOICE_REPO?=https://github.com/mudler/vibevoice.cpp +VIBEVOICE_CPP_VERSION?=master +SO_TARGET?=libgovibevoicecpp.so + +CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF +CMAKE_ARGS+=-DVIBEVOICE_BUILD_TESTS=OFF +CMAKE_ARGS+=-DVIBEVOICE_BUILD_EXAMPLES=OFF + +ifeq ($(NATIVE),false) + CMAKE_ARGS+=-DGGML_NATIVE=OFF +endif + +ifeq ($(BUILD_TYPE),cublas) + CMAKE_ARGS+=-DGGML_CUDA=ON -DVIBEVOICE_GGML_CUDA=ON +else ifeq ($(BUILD_TYPE),openblas) + CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS +else ifeq ($(BUILD_TYPE),clblas) + CMAKE_ARGS+=-DGGML_CLBLAST=ON -DCLBlast_DIR=/some/path +else ifeq ($(BUILD_TYPE),hipblas) + CMAKE_ARGS+=-DGGML_HIPBLAS=ON 
-DVIBEVOICE_GGML_HIPBLAS=ON +else ifeq ($(BUILD_TYPE),vulkan) + CMAKE_ARGS+=-DGGML_VULKAN=ON -DVIBEVOICE_GGML_VULKAN=ON +else ifeq ($(OS),Darwin) + ifneq ($(BUILD_TYPE),metal) + CMAKE_ARGS+=-DGGML_METAL=OFF + else + CMAKE_ARGS+=-DGGML_METAL=ON -DVIBEVOICE_GGML_METAL=ON + CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON + endif +endif + +ifeq ($(BUILD_TYPE),sycl_f16) + CMAKE_ARGS+=-DGGML_SYCL=ON \ + -DCMAKE_C_COMPILER=icx \ + -DCMAKE_CXX_COMPILER=icpx \ + -DGGML_SYCL_F16=ON +endif + +ifeq ($(BUILD_TYPE),sycl_f32) + CMAKE_ARGS+=-DGGML_SYCL=ON \ + -DCMAKE_C_COMPILER=icx \ + -DCMAKE_CXX_COMPILER=icpx +endif + +sources/vibevoice.cpp: + mkdir -p sources/vibevoice.cpp + cd sources/vibevoice.cpp && \ + git init && \ + git remote add origin $(VIBEVOICE_REPO) && \ + git fetch origin && \ + git checkout $(VIBEVOICE_CPP_VERSION) && \ + git submodule update --init --recursive --depth 1 --single-branch + +# Detect OS +UNAME_S := $(shell uname -s) + +# Only build CPU variants on Linux +ifeq ($(UNAME_S),Linux) + VARIANT_TARGETS = libgovibevoicecpp-avx.so libgovibevoicecpp-avx2.so libgovibevoicecpp-avx512.so libgovibevoicecpp-fallback.so +else + # On non-Linux (e.g., Darwin), build only fallback variant + VARIANT_TARGETS = libgovibevoicecpp-fallback.so +endif + +vibevoice-cpp: main.go govibevoicecpp.go $(VARIANT_TARGETS) + CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o vibevoice-cpp ./ + +package: vibevoice-cpp + bash package.sh + +build: package + +clean: purge + rm -rf libgovibevoicecpp*.so package sources/vibevoice.cpp vibevoice-cpp + +purge: + rm -rf build* + +# Variants must build sequentially +.NOTPARALLEL: + +# Build all variants (Linux only) +ifeq ($(UNAME_S),Linux) +libgovibevoicecpp-avx.so: sources/vibevoice.cpp + $(info ${GREEN}I vibevoice-cpp build info:avx${RESET}) + SO_TARGET=libgovibevoicecpp-avx.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgovibevoicecpp-custom + rm -rf 
build-libgovibevoicecpp-avx.so + +libgovibevoicecpp-avx2.so: sources/vibevoice.cpp + $(info ${GREEN}I vibevoice-cpp build info:avx2${RESET}) + SO_TARGET=libgovibevoicecpp-avx2.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgovibevoicecpp-custom + rm -rf build-libgovibevoicecpp-avx2.so + +libgovibevoicecpp-avx512.so: sources/vibevoice.cpp + $(info ${GREEN}I vibevoice-cpp build info:avx512${RESET}) + SO_TARGET=libgovibevoicecpp-avx512.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgovibevoicecpp-custom + rm -rf build-libgovibevoicecpp-avx512.so +endif + +# Build fallback variant (all platforms) +libgovibevoicecpp-fallback.so: sources/vibevoice.cpp + $(info ${GREEN}I vibevoice-cpp build info:fallback${RESET}) + SO_TARGET=libgovibevoicecpp-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgovibevoicecpp-custom + rm -rf build-libgovibevoicecpp-fallback.so + +libgovibevoicecpp-custom: CMakeLists.txt cpp/govibevoicecpp.cpp cpp/govibevoicecpp.h + mkdir -p build-$(SO_TARGET) && \ + cd build-$(SO_TARGET) && \ + cmake .. $(CMAKE_ARGS) && \ + cmake --build . --config Release -j$(JOBS) --target govibevoicecpp && \ + cd .. && \ + mv build-$(SO_TARGET)/libgovibevoicecpp.so ./$(SO_TARGET) + +test: vibevoice-cpp + @echo "Running vibevoice-cpp tests..." + bash test.sh + @echo "vibevoice-cpp tests completed." + +all: vibevoice-cpp package diff --git a/backend/go/vibevoice-cpp/cpp/govibevoicecpp.cpp b/backend/go/vibevoice-cpp/cpp/govibevoicecpp.cpp new file mode 100644 index 000000000..23934fd51 --- /dev/null +++ b/backend/go/vibevoice-cpp/cpp/govibevoicecpp.cpp @@ -0,0 +1,41 @@ +// vibevoice.cpp ships its purego-friendly ABI in vibevoice_capi.h. 
+// This translation unit is intentionally tiny: pulling in the header +// (and linking libvibevoice PRIVATE in CMake) is enough to make the +// vv_capi_* symbols visible from the produced MODULE library. +// +// We do install a ggml log redirect so backend logs land on the gRPC +// server's stderr — same pattern as backend/go/qwen3-tts-cpp/cpp/. + +#include "govibevoicecpp.h" + +#include "ggml.h" +#include "ggml-backend.h" + +#include <cstdio> + +namespace { + +void govibevoice_log_cb(enum ggml_log_level level, const char* msg, void* /*ud*/) { + if (!msg) return; + const char* tag = "?????"; + switch (level) { + case GGML_LOG_LEVEL_DEBUG: tag = "DEBUG"; break; + case GGML_LOG_LEVEL_INFO: tag = "INFO"; break; + case GGML_LOG_LEVEL_WARN: tag = "WARN"; break; + case GGML_LOG_LEVEL_ERROR: tag = "ERROR"; break; + default: break; + } + std::fprintf(stderr, "[%-5s] %s", tag, msg); + std::fflush(stderr); +} + +struct LogInstaller { + LogInstaller() { + ggml_log_set(govibevoice_log_cb, nullptr); + ggml_backend_load_all(); + } +}; + +LogInstaller g_install; + +} // namespace diff --git a/backend/go/vibevoice-cpp/cpp/govibevoicecpp.h b/backend/go/vibevoice-cpp/cpp/govibevoicecpp.h new file mode 100644 index 000000000..224cb3eaa --- /dev/null +++ b/backend/go/vibevoice-cpp/cpp/govibevoicecpp.h @@ -0,0 +1,7 @@ +#pragma once + +// Re-exports the vibevoice.cpp flat C ABI so this MODULE library +// resolves the same symbols that purego.RegisterLibFunc looks up by +// name. The actual definitions live in libvibevoice (linked PRIVATE). 
+ +#include "vibevoice_capi.h" diff --git a/backend/go/vibevoice-cpp/govibevoicecpp.go b/backend/go/vibevoice-cpp/govibevoicecpp.go new file mode 100644 index 000000000..516ffed51 --- /dev/null +++ b/backend/go/vibevoice-cpp/govibevoicecpp.go @@ -0,0 +1,387 @@ +package main + +import ( + "encoding/json" + "fmt" + "os" + "path/filepath" + "strings" + + laudio "github.com/mudler/LocalAI/pkg/audio" + "github.com/mudler/LocalAI/pkg/grpc/base" + pb "github.com/mudler/LocalAI/pkg/grpc/proto" +) + +// vibevoice.cpp synthesizes 24 kHz mono 16-bit PCM. Hardcoded - the +// model itself is fixed-rate; if the upstream ever changes this we'll +// pick it up via vv_capi_version(). +const vibevoiceSampleRate = uint32(24000) + +// purego-bound entry points from libgovibevoicecpp. +var ( + CppLoad func(ttsModel, asrModel, tokenizer, voice string, threads int32) int32 + CppTTS func(text, voicePath, dstWav string, + nSteps int32, cfgScale float32, maxSpeechFrames int32, seed uint32) int32 + CppASR func(srcWav string, outJSON []byte, capacity uint64, + maxNewTokens int32) int32 + CppUnload func() + CppVersion func() string +) + +// VibevoiceCpp speaks gRPC against vibevoice.cpp's flat C ABI. The +// engine is a single global, so we serialize calls through SingleThread. +type VibevoiceCpp struct { + base.SingleThread + threads int + + // modelRoot is the directory we use to resolve relative paths + // from Options[] and per-call overrides (TTSRequest.Voice). + // Source of truth: opts.ModelPath; falls back to the dir of + // the primary ModelFile when ModelPath is empty. + modelRoot string + + ttsModel string + asrModel string + tokenizer string + voice string +} + +// resolvePath joins a relative path onto `relTo`. The gallery +// convention is that Options[] carry paths relative to the LocalAI +// models dir (opts.ModelPath), so anything not absolute is treated +// as a sibling of the primary ModelFile - never CWD. 
Empty / already- +// absolute / no-relTo inputs pass through unchanged. +func resolvePath(p, relTo string) string { + if p == "" || filepath.IsAbs(p) || relTo == "" { + return p + } + return filepath.Join(relTo, p) +} + +// parseOptions reads opts.Options[] and pulls out the per-role +// overrides documented in the gallery entries. Accepts both "key=value" +// (gallery YAML style) and "key:value" (Make-target / env-var style). +func (v *VibevoiceCpp) parseOptions(opts []string, relTo string) string { + role := "" + for _, raw := range opts { + k, val, ok := strings.Cut(raw, "=") + if !ok { + k, val, ok = strings.Cut(raw, ":") + if !ok { + continue + } + } + key := strings.TrimSpace(k) + val = strings.TrimSpace(val) + switch key { + case "type": + role = strings.ToLower(val) + case "tokenizer": + v.tokenizer = resolvePath(val, relTo) + case "voice": + v.voice = resolvePath(val, relTo) + case "tts_model": + v.ttsModel = resolvePath(val, relTo) + case "asr_model": + v.asrModel = resolvePath(val, relTo) + } + } + return role +} + +func (v *VibevoiceCpp) Load(opts *pb.ModelOptions) error { + if opts.ModelFile == "" { + return fmt.Errorf("vibevoice-cpp: ModelFile is required") + } + modelFile := opts.ModelFile + if !filepath.IsAbs(modelFile) && opts.ModelPath != "" { + modelFile = filepath.Join(opts.ModelPath, modelFile) + } + + // ModelPath is the LocalAI core's models root, propagated over + // gRPC. Use it as the resolution base for Options[] (and later + // for TTSRequest.Voice) so gallery entries can reference paths + // like "tokenizer=tokenizer.gguf" and have them resolved + // against the same root the core used to drop the files. + v.modelRoot = opts.ModelPath + if v.modelRoot == "" { + v.modelRoot = filepath.Dir(modelFile) + } + role := v.parseOptions(opts.Options, v.modelRoot) + + // ModelFile fills the "primary" role-slot determined by `type=` + // in Options (defaults to tts). 
The other slot stays exactly as + // Options set it - so a closed-loop config with ModelFile=tts.gguf + // + Options[asr_model=asr.gguf] resolves correctly to both slots, + // and an explicit `tts_model=` / `asr_model=` always wins over + // ModelFile for its own slot. + primaryIsASR := false + switch role { + case "asr", "transcript", "stt", "speech-to-text": + primaryIsASR = true + } + if primaryIsASR { + if v.asrModel == "" { + v.asrModel = modelFile + } + } else if v.ttsModel == "" { + v.ttsModel = modelFile + } + + if v.ttsModel == "" && v.asrModel == "" { + return fmt.Errorf("vibevoice-cpp: no TTS or ASR model resolved from ModelFile=%q + options", opts.ModelFile) + } + if v.tokenizer == "" { + return fmt.Errorf("vibevoice-cpp: tokenizer is required - pass options: [tokenizer=]") + } + + threads := int(opts.Threads) + if threads <= 0 { + threads = 4 + } + v.threads = threads + + fmt.Fprintf(os.Stderr, + "[vibevoice-cpp] Loading: tts=%q asr=%q tokenizer=%q voice=%q threads=%d\n", + v.ttsModel, v.asrModel, v.tokenizer, v.voice, threads) + + if rc := CppLoad(v.ttsModel, v.asrModel, v.tokenizer, v.voice, int32(threads)); rc != 0 { + return fmt.Errorf("vibevoice-cpp: vv_capi_load failed (rc=%d)", rc) + } + return nil +} + +func (v *VibevoiceCpp) TTS(req *pb.TTSRequest) error { + if v.ttsModel == "" { + return fmt.Errorf("vibevoice-cpp: TTS requested but no realtime model was loaded") + } + text := req.Text + dst := req.Dst + if text == "" || dst == "" { + return fmt.Errorf("vibevoice-cpp: TTS requires both text and dst") + } + + // req.Voice may be a bare filename (e.g. "voice-en-Emma.gguf") or an + // absolute path. Resolve via the same modelRoot Load() used for + // Options[] so a swap-voice request mirrors the gallery's layout. 
+ voice := resolvePath(req.Voice, v.modelRoot) + + if req.Language != nil && *req.Language != "" { + fmt.Fprintf(os.Stderr, + "[vibevoice-cpp] note: TTSRequest.language=%q ignored - vibevoice picks language from the voice prompt\n", + *req.Language) + } + + const ( + defaultSteps = 20 + defaultMaxFrames = 200 + ) + defaultCfg := float32(1.3) + if rc := CppTTS(text, voice, dst, + int32(defaultSteps), defaultCfg, int32(defaultMaxFrames), 0); rc != 0 { + return fmt.Errorf("vibevoice-cpp: vv_capi_tts failed (rc=%d)", rc) + } + return nil +} + +// asrSegment matches vibevoice's JSON output: +// +// [{"Start":0.0,"End":2.8,"Speaker":0,"Content":"…"}, ...] +type asrSegment struct { + Start float64 `json:"Start"` + End float64 `json:"End"` + Speaker int `json:"Speaker"` + Content string `json:"Content"` +} + +// callASR invokes vv_capi_asr with a buffer that grows on demand. +// vv_capi_asr returns: >0 bytes written, 0 no transcript, <0 error or +// -required_size. We honor the resize protocol once before giving up. +func (v *VibevoiceCpp) callASR(srcWav string, maxNewTokens int32) (string, error) { + const startCap = 256 * 1024 + buf := make([]byte, startCap) + rc := CppASR(srcWav, buf, uint64(len(buf)), maxNewTokens) + if rc < 0 { + need := -int(rc) + if need > 0 && need < (16<<20) && need > len(buf) { + buf = make([]byte, need+64) + rc = CppASR(srcWav, buf, uint64(len(buf)), maxNewTokens) + } + } + if rc < 0 { + return "", fmt.Errorf("vibevoice-cpp: vv_capi_asr failed (rc=%d)", rc) + } + if rc == 0 { + return "", nil + } + return string(buf[:rc]), nil +} + +// TTSStream is the streaming counterpart to TTS. vibevoice's C ABI is +// file-only (vv_capi_tts writes a complete WAV), so we synthesize to +// a tempfile, then emit a streaming-WAV header followed by the PCM +// body in chunks. 
The main reason this exists at all is the gRPC +// server wrapper (pkg/grpc/server.go:TTSStream) blocks on a channel +// that only this method can close - if we leave the default Base +// stub in place, every TTSStream call hangs until the client +// deadline. +func (v *VibevoiceCpp) TTSStream(req *pb.TTSRequest, results chan []byte) error { + defer close(results) + if v.ttsModel == "" { + return fmt.Errorf("vibevoice-cpp: TTSStream requested but no realtime model was loaded") + } + if req.Text == "" { + return fmt.Errorf("vibevoice-cpp: TTSStream requires text") + } + + tmp, err := os.CreateTemp("", "vibevoice-cpp-stream-*.wav") + if err != nil { + return fmt.Errorf("vibevoice-cpp: tempfile: %w", err) + } + dst := tmp.Name() + _ = tmp.Close() + defer func() { _ = os.Remove(dst) }() + + if err := v.TTS(&pb.TTSRequest{ + Text: req.Text, + Voice: req.Voice, + Dst: dst, + Language: req.Language, + }); err != nil { + return err + } + + wav, err := os.ReadFile(dst) + if err != nil { + return fmt.Errorf("vibevoice-cpp: read tempfile: %w", err) + } + + // Streaming WAV header: declare 0xFFFFFFFF for chunk sizes so HTTP + // clients can start playback before they see the full PCM. + const streamingSize = 0xFFFFFFFF + hdr := laudio.NewWAVHeaderWithRate(streamingSize, vibevoiceSampleRate) + hdr.ChunkSize = streamingSize + hdrBuf := make([]byte, 0, laudio.WAVHeaderSize) + w := newByteWriter(&hdrBuf) + if err := hdr.Write(w); err != nil { + return fmt.Errorf("vibevoice-cpp: write WAV header: %w", err) + } + results <- hdrBuf + + // PCM body: send in ~64 KB slices so the client gets multiple + // reply chunks (e2e harness asserts >=2 frames). 
+ pcm := laudio.StripWAVHeader(wav) + const chunkBytes = 64 * 1024 + for off := 0; off < len(pcm); off += chunkBytes { + end := off + chunkBytes + if end > len(pcm) { + end = len(pcm) + } + chunk := make([]byte, end-off) + copy(chunk, pcm[off:end]) + results <- chunk + } + return nil +} + +// byteWriter adapts a *[]byte to io.Writer so we can hand it to +// laudio.WAVHeader.Write without allocating a bytes.Buffer. +type byteWriter struct{ buf *[]byte } + +func newByteWriter(b *[]byte) *byteWriter { return &byteWriter{buf: b} } +func (w *byteWriter) Write(p []byte) (int, error) { + *w.buf = append(*w.buf, p...) + return len(p), nil +} + +func (v *VibevoiceCpp) AudioTranscription(req *pb.TranscriptRequest) (pb.TranscriptResult, error) { + if v.asrModel == "" { + return pb.TranscriptResult{}, fmt.Errorf("vibevoice-cpp: AudioTranscription requested but no ASR model was loaded") + } + if req.Dst == "" { + return pb.TranscriptResult{}, fmt.Errorf("vibevoice-cpp: TranscriptRequest.dst (audio path) is required") + } + + out, err := v.callASR(req.Dst, 0) + if err != nil { + return pb.TranscriptResult{}, err + } + if out == "" { + return pb.TranscriptResult{}, nil + } + + var segs []asrSegment + if err := json.Unmarshal([]byte(out), &segs); err != nil { + fmt.Fprintf(os.Stderr, + "[vibevoice-cpp] WARNING: vv_capi_asr returned non-JSON, falling back to single segment: %v\n", err) + return pb.TranscriptResult{ + Segments: []*pb.TranscriptSegment{{Id: 0, Text: strings.TrimSpace(out)}}, + Text: strings.TrimSpace(out), + }, nil + } + + segments := make([]*pb.TranscriptSegment, 0, len(segs)) + parts := make([]string, 0, len(segs)) + var duration float32 + for i, s := range segs { + // LocalAI's whisper backend uses int64 100ns ticks for + // Start/End (seconds * 1e7); follow the same convention so + // consumers can mix vibevoice and whisper transcripts. 
+ segments = append(segments, &pb.TranscriptSegment{ + Id: int32(i), + Text: s.Content, + Start: int64(s.Start * 1e7), + End: int64(s.End * 1e7), + Speaker: fmt.Sprintf("%d", s.Speaker), + }) + parts = append(parts, strings.TrimSpace(s.Content)) + if float32(s.End) > duration { + duration = float32(s.End) + } + } + return pb.TranscriptResult{ + Segments: segments, + Text: strings.TrimSpace(strings.Join(parts, " ")), + Duration: duration, + }, nil +} + +// AudioTranscriptionStream wraps AudioTranscription so the streaming +// gRPC endpoint (server.go:AudioTranscriptionStream) sees its channel +// close and the client doesn't sit waiting until deadline. vibevoice's +// ASR doesn't expose token-level streaming - vv_capi_asr decodes the +// whole audio and returns a JSON segment list - so we run the offline +// transcription, emit each segment's content as a delta, then close +// with a final_result whose Text equals the concatenated deltas (the +// e2e harness asserts those match). +func (v *VibevoiceCpp) AudioTranscriptionStream(req *pb.TranscriptRequest, results chan *pb.TranscriptStreamResponse) error { + defer close(results) + res, err := v.AudioTranscription(req) + if err != nil { + return err + } + var assembled strings.Builder + for _, seg := range res.Segments { + if seg == nil { + continue + } + txt := strings.TrimSpace(seg.Text) + if txt == "" { + continue + } + delta := txt + if assembled.Len() > 0 { + delta = " " + txt + } + results <- &pb.TranscriptStreamResponse{Delta: delta} + assembled.WriteString(delta) + } + final := pb.TranscriptResult{ + Segments: res.Segments, + Duration: res.Duration, + Language: res.Language, + Text: assembled.String(), + } + results <- &pb.TranscriptStreamResponse{FinalResult: &final} + return nil +} diff --git a/backend/go/vibevoice-cpp/main.go b/backend/go/vibevoice-cpp/main.go new file mode 100644 index 000000000..dd1f1ba43 --- /dev/null +++ b/backend/go/vibevoice-cpp/main.go @@ -0,0 +1,49 @@ +package main + +// Started 
internally by LocalAI - one gRPC server per loaded model. +import ( + "flag" + "os" + + "github.com/ebitengine/purego" + grpc "github.com/mudler/LocalAI/pkg/grpc" +) + +var ( + addr = flag.String("addr", "localhost:50051", "the address to connect to") +) + +type LibFuncs struct { + FuncPtr any + Name string +} + +func main() { + libName := os.Getenv("VIBEVOICECPP_LIBRARY") + if libName == "" { + libName = "./libgovibevoicecpp-fallback.so" + } + + lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL) + if err != nil { + panic(err) + } + + libFuncs := []LibFuncs{ + {&CppLoad, "vv_capi_load"}, + {&CppTTS, "vv_capi_tts"}, + {&CppASR, "vv_capi_asr"}, + {&CppUnload, "vv_capi_unload"}, + {&CppVersion, "vv_capi_version"}, + } + + for _, lf := range libFuncs { + purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name) + } + + flag.Parse() + + if err := grpc.StartServer(*addr, &VibevoiceCpp{}); err != nil { + panic(err) + } +} diff --git a/backend/go/vibevoice-cpp/package.sh b/backend/go/vibevoice-cpp/package.sh new file mode 100755 index 000000000..88010846f --- /dev/null +++ b/backend/go/vibevoice-cpp/package.sh @@ -0,0 +1,58 @@ +#!/bin/bash + +# Bundle the vibevoice-cpp binary, the per-variant .so files, and the +# runtime libs the binary depends on so the package is self-contained. +# Mirrors backend/go/qwen3-tts-cpp/package.sh. + +set -e + +CURDIR=$(dirname "$(realpath $0)") +REPO_ROOT="${CURDIR}/../../.." + +mkdir -p $CURDIR/package/lib + +cp -avf $CURDIR/vibevoice-cpp $CURDIR/package/ +cp -fv $CURDIR/libgovibevoicecpp-*.so $CURDIR/package/ +cp -fv $CURDIR/run.sh $CURDIR/package/ + +# Detect architecture and copy appropriate libraries +if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then + echo "Detected x86_64 architecture, copying x86_64 libraries..." 
+ cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so + cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6 + cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1 + cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6 + cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6 + cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1 + cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2 + cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1 + cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0 +elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then + echo "Detected ARM64 architecture, copying ARM64 libraries..." + cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so + cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6 + cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1 + cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6 + cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6 + cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1 + cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2 + cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1 + cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0 +elif [ $(uname -s) = "Darwin" ]; then + echo "Detected Darwin" +else + echo "Error: Could not detect architecture" + exit 1 +fi + +# Package GPU libraries based on BUILD_TYPE +GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh" +if [ -f "$GPU_LIB_SCRIPT" ]; then + echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..." 
+ source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib" + package_gpu_libs +fi + +echo "Packaging completed successfully" +ls -liah $CURDIR/package/ +ls -liah $CURDIR/package/lib/ diff --git a/backend/go/vibevoice-cpp/run.sh b/backend/go/vibevoice-cpp/run.sh new file mode 100755 index 000000000..93e92d5b8 --- /dev/null +++ b/backend/go/vibevoice-cpp/run.sh @@ -0,0 +1,49 @@ +#!/bin/bash +set -ex + +CURDIR=$(dirname "$(realpath $0)") + +cd / + +echo "CPU info:" +if [ "$(uname)" != "Darwin" ]; then + grep -e "model\sname" /proc/cpuinfo | head -1 + grep -e "flags" /proc/cpuinfo | head -1 +fi + +LIBRARY="$CURDIR/libgovibevoicecpp-fallback.so" + +if [ "$(uname)" != "Darwin" ]; then + if grep -q -e "\savx\s" /proc/cpuinfo ; then + echo "CPU: AVX found OK" + if [ -e $CURDIR/libgovibevoicecpp-avx.so ]; then + LIBRARY="$CURDIR/libgovibevoicecpp-avx.so" + fi + fi + + if grep -q -e "\savx2\s" /proc/cpuinfo ; then + echo "CPU: AVX2 found OK" + if [ -e $CURDIR/libgovibevoicecpp-avx2.so ]; then + LIBRARY="$CURDIR/libgovibevoicecpp-avx2.so" + fi + fi + + if grep -q -e "\savx512f\s" /proc/cpuinfo ; then + echo "CPU: AVX512F found OK" + if [ -e $CURDIR/libgovibevoicecpp-avx512.so ]; then + LIBRARY="$CURDIR/libgovibevoicecpp-avx512.so" + fi + fi +fi + +export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH +export VIBEVOICECPP_LIBRARY=$LIBRARY + +if [ -f $CURDIR/lib/ld.so ]; then + echo "Using lib/ld.so" + echo "Using library: $LIBRARY" + exec $CURDIR/lib/ld.so $CURDIR/vibevoice-cpp "$@" +fi + +echo "Using library: $LIBRARY" +exec $CURDIR/vibevoice-cpp "$@" diff --git a/backend/go/vibevoice-cpp/test.sh b/backend/go/vibevoice-cpp/test.sh new file mode 100755 index 000000000..7dbf58cb8 --- /dev/null +++ b/backend/go/vibevoice-cpp/test.sh @@ -0,0 +1,74 @@ +#!/bin/bash +set -e + +CURDIR=$(dirname "$(realpath $0)") + +echo "Running vibevoice-cpp backend tests..." + +# Required env-vars (set automatically when missing): +# VIBEVOICE_MODEL_DIR : directory containing the gguf bundle. 
+# VIBEVOICE_BINARY : path to the built backend (default ./vibevoice-cpp) +# +# Tests skip when the model bundle is absent and the auto-download +# fails (e.g. no network on the runner) so local devs without HF access +# still get green compile output. + +cd "$CURDIR" + +if [ -z "$VIBEVOICE_MODEL_DIR" ]; then + export VIBEVOICE_MODEL_DIR="./vibevoice-models" + + if [ ! -d "$VIBEVOICE_MODEL_DIR" ]; then + echo "Creating vibevoice-models directory for tests..." + mkdir -p "$VIBEVOICE_MODEL_DIR" + + REPO_ID="mudler/vibevoice.cpp-models" + echo "Repository: ${REPO_ID}" + + # Q4_K instead of Q8_0 for the ASR model: smaller download + # (10 GB vs 14 GB), fits on ubuntu-latest's free disk after the + # runner image is loaded. The unit/closed-loop test only needs + # decode quality, not Q8_0 precision. + FILES=( + "vibevoice-realtime-0.5B-q8_0.gguf" + "vibevoice-asr-q4_k.gguf" + "tokenizer.gguf" + "voice-en-Carter_man.gguf" + ) + + BASE_URL="https://huggingface.co/${REPO_ID}/resolve/main" + + download_ok=1 + for file in "${FILES[@]}"; do + dest="${VIBEVOICE_MODEL_DIR}/${file}" + if [ -f "${dest}" ]; then + echo " [skip] ${file} (already exists)" + else + echo " [download] ${file}..." + if ! curl -fL -o "${dest}" "${BASE_URL}/${file}" --progress-bar; then + echo " [warn] failed to download ${file} - network or HF unavailable" + rm -f "${dest}" + download_ok=0 + break + fi + echo " [done] ${file}" + fi + done + + if [ "$download_ok" != "1" ]; then + echo "vibevoice-cpp: model bundle unavailable - tests will skip model-dependent cases." + unset VIBEVOICE_MODEL_DIR + fi + fi +fi + +# Ensure the per-variant .so the binary will dlopen actually exists - +# without one, every test will hit a Dlopen panic during server start. +if [ ! -f "${CURDIR}/libgovibevoicecpp-fallback.so" ]; then + echo "vibevoice-cpp: libgovibevoicecpp-fallback.so missing - run \`make\` first." + exit 1 +fi + +go test -v -timeout 900s . + +echo "All vibevoice-cpp tests passed." 
diff --git a/backend/go/vibevoice-cpp/vibevoicecpp_test.go b/backend/go/vibevoice-cpp/vibevoicecpp_test.go new file mode 100644 index 000000000..ffb36629e --- /dev/null +++ b/backend/go/vibevoice-cpp/vibevoicecpp_test.go @@ -0,0 +1,382 @@ +package main + +import ( + "context" + "os" + "os/exec" + "path/filepath" + "regexp" + "strings" + "testing" + "time" + + pb "github.com/mudler/LocalAI/pkg/grpc/proto" + . "github.com/onsi/ginkgo/v2" + . "github.com/onsi/gomega" + "google.golang.org/grpc" + "google.golang.org/grpc/credentials/insecure" +) + +const ( + testAddr = "localhost:50098" + startupWait = 5 * time.Second +) + +func TestVibevoiceCpp(t *testing.T) { + RegisterFailHandler(Fail) + RunSpecs(t, "VibeVoice-cpp Backend Suite") +} + +// modelDirOrSkip returns the staged model bundle dir, or Skip()s the +// current spec when VIBEVOICE_MODEL_DIR is unset / lacks the gguf +// files we need. Tests that don't depend on a model (Locking, error +// paths) don't call this. +func modelDirOrSkip() string { + dir := os.Getenv("VIBEVOICE_MODEL_DIR") + if dir == "" { + Skip("VIBEVOICE_MODEL_DIR not set, skipping model-dependent specs") + } + if _, err := os.Stat(filepath.Join(dir, "tokenizer.gguf")); os.IsNotExist(err) { + Skip("tokenizer.gguf missing in " + dir) + } + tts, _ := filepath.Glob(filepath.Join(dir, "vibevoice-realtime-*.gguf")) + asr, _ := filepath.Glob(filepath.Join(dir, "vibevoice-asr-*.gguf")) + if len(tts) == 0 && len(asr) == 0 { + Skip("neither realtime TTS nor ASR gguf found in " + dir) + } + return dir +} + +// startServer launches the prebuilt backend binary and returns a +// running *exec.Cmd. test.sh ensures `./vibevoice-cpp` is built; if +// it isn't, every gRPC spec is skipped with a clear reason. 
+func startServer() *exec.Cmd { + binary := os.Getenv("VIBEVOICE_BINARY") + if binary == "" { + binary = "./vibevoice-cpp" + } + if _, err := os.Stat(binary); os.IsNotExist(err) { + Skip("backend binary not found at " + binary) + } + cmd := exec.Command(binary, "--addr", testAddr) + cmd.Stdout = os.Stderr + cmd.Stderr = os.Stderr + Expect(cmd.Start()).To(Succeed()) + time.Sleep(startupWait) + return cmd +} + +func stopServer(cmd *exec.Cmd) { + if cmd == nil || cmd.Process == nil { + return + } + _ = cmd.Process.Kill() + _, _ = cmd.Process.Wait() +} + +func dialGRPC() *grpc.ClientConn { + conn, err := grpc.Dial(testAddr, + grpc.WithTransportCredentials(insecure.NewCredentials()), + grpc.WithDefaultCallOptions( + grpc.MaxCallRecvMsgSize(50*1024*1024), + grpc.MaxCallSendMsgSize(50*1024*1024), + ), + ) + Expect(err).ToNot(HaveOccurred()) + return conn +} + +var _ = Describe("VibeVoice-cpp", func() { + Context("backend semantics (no purego load needed)", func() { + It("is locking - the engine has process-global state", func() { + Expect((&VibevoiceCpp{}).Locking()).To(BeTrue()) + }) + + It("rejects Load with empty ModelFile", func() { + err := (&VibevoiceCpp{}).Load(&pb.ModelOptions{}) + Expect(err).To(HaveOccurred()) + Expect(err.Error()).To(ContainSubstring("ModelFile")) + }) + + It("rejects TTS without a loaded TTS model", func() { + err := (&VibevoiceCpp{}).TTS(&pb.TTSRequest{ + Text: "no model loaded", + Dst: "/tmp/should-not-be-written.wav", + }) + Expect(err).To(HaveOccurred()) + }) + + It("rejects AudioTranscription without a loaded ASR model", func() { + _, err := (&VibevoiceCpp{}).AudioTranscription(&pb.TranscriptRequest{ + Dst: "/tmp/some.wav", + }) + Expect(err).To(HaveOccurred()) + }) + + It("closes the channel and errors on TTSStream without a loaded model", func() { + ch := make(chan []byte, 4) + err := (&VibevoiceCpp{}).TTSStream(&pb.TTSRequest{ + Text: "no model loaded", + Dst: "/tmp/should-not-be-written.wav", + }, ch) + Expect(err).To(HaveOccurred()) 
+ // The server hangs forever if the channel stays open; this guard + // catches a regression of the e2e DeadlineExceeded we're fixing. + _, ok := <-ch + Expect(ok).To(BeFalse(), "TTSStream must close results channel even on error") + }) + + // parseOptions + slot fill is the source of the closed-loop CI + // regression where ModelFile=tts.gguf + Options[asr_model=...] + // resulted in a load with empty tts slot. These specs assert + // the slot resolution before we ever call into purego. + Describe("ModelFile slot resolution", func() { + It("fills tts slot from ModelFile when only asr_model is in Options", func() { + v := &VibevoiceCpp{} + v.modelRoot = "/abs/root" + role := v.parseOptions([]string{"asr_model=/abs/root/asr.gguf", "tokenizer=/abs/root/tokenizer.gguf"}, v.modelRoot) + Expect(v.asrModel).To(Equal("/abs/root/asr.gguf")) + Expect(v.ttsModel).To(BeEmpty()) + Expect(role).To(BeEmpty()) + // Mirror the Load() default-fill block: + if v.ttsModel == "" { + v.ttsModel = "/abs/root/tts.gguf" + } + Expect(v.ttsModel).To(Equal("/abs/root/tts.gguf")) + Expect(v.asrModel).To(Equal("/abs/root/asr.gguf")) + }) + + It("fills asr slot from ModelFile when type=asr is set", func() { + v := &VibevoiceCpp{} + v.modelRoot = "/abs/root" + role := v.parseOptions([]string{"type=asr", "tokenizer=/abs/root/tokenizer.gguf"}, v.modelRoot) + Expect(role).To(Equal("asr")) + Expect(v.asrModel).To(BeEmpty()) + Expect(v.ttsModel).To(BeEmpty()) + }) + + It("respects explicit tts_model override over ModelFile", func() { + v := &VibevoiceCpp{} + v.modelRoot = "/abs/root" + _ = v.parseOptions([]string{"tts_model=/abs/root/alt.gguf"}, v.modelRoot) + Expect(v.ttsModel).To(Equal("/abs/root/alt.gguf")) + }) + + It("accepts colon-separated options too", func() { + v := &VibevoiceCpp{} + v.modelRoot = "/abs/root" + role := v.parseOptions([]string{"type:asr", "tokenizer:/abs/root/tokenizer.gguf"}, v.modelRoot) + Expect(role).To(Equal("asr")) + Expect(v.tokenizer).To(Equal("/abs/root/tokenizer.gguf")) +
}) + }) + + // The gallery flow puts everything under the LocalAI models dir, + // and parameters/options carry paths *relative* to that dir. + // LocalAI core sets opts.ModelPath to the models dir; the backend + // must resolve every relative path against that root, never CWD. + Describe("resolvePath (relative-to-modelRoot)", func() { + It("joins relative path onto relTo", func() { + Expect(resolvePath("vibevoice-cpp/tokenizer.gguf", "/data/models")). + To(Equal("/data/models/vibevoice-cpp/tokenizer.gguf")) + }) + + It("passes absolute paths through unchanged", func() { + Expect(resolvePath("/abs/somewhere/tokenizer.gguf", "/data/models")). + To(Equal("/abs/somewhere/tokenizer.gguf")) + }) + + It("returns input unchanged when relTo is empty", func() { + Expect(resolvePath("vibevoice-cpp/tokenizer.gguf", "")). + To(Equal("vibevoice-cpp/tokenizer.gguf")) + }) + + It("returns empty input unchanged", func() { + Expect(resolvePath("", "/data/models")).To(BeEmpty()) + }) + + It("does not consult CWD - bare filenames stay relative to modelRoot", func() { + // Even if the test runs in a directory containing a + // file with this name, the lookup must not fall back + // to CWD. This is the trap the production gallery flow + // would otherwise hit when LocalAI is launched from a + // directory that happens to contain a same-named file. + prev, _ := os.Getwd() + DeferCleanup(func() { _ = os.Chdir(prev) }) + tmpCWD, err := os.MkdirTemp("", "vv-cwd-*") + Expect(err).ToNot(HaveOccurred()) + DeferCleanup(func() { _ = os.RemoveAll(tmpCWD) }) + Expect(os.WriteFile(filepath.Join(tmpCWD, "tokenizer.gguf"), + []byte("not the real one"), 0o644)).To(Succeed()) + Expect(os.Chdir(tmpCWD)).To(Succeed()) + + got := resolvePath("tokenizer.gguf", "/data/models") + Expect(got).To(Equal("/data/models/tokenizer.gguf")) + }) + }) + + // Round-trip the gallery layout: relative paths in Options + + // an absolute ModelFile (as LocalAI core delivers them) end + // up resolved correctly inside the backend struct. 
+ It("Load resolves relative Options paths against opts.ModelPath", func() { + tmpDir, err := os.MkdirTemp("", "vv-relpath-*") + Expect(err).ToNot(HaveOccurred()) + DeferCleanup(func() { _ = os.RemoveAll(tmpDir) }) + + // Lay out the bundle exactly as the gallery would after install: + // /vibevoice-cpp/{tts,tokenizer,voice}.gguf + subDir := filepath.Join(tmpDir, "vibevoice-cpp") + Expect(os.MkdirAll(subDir, 0o755)).To(Succeed()) + tts := filepath.Join(subDir, "vibevoice-realtime-stub.gguf") + tok := filepath.Join(subDir, "tokenizer.gguf") + voice := filepath.Join(subDir, "voice.gguf") + for _, p := range []string{tts, tok, voice} { + Expect(os.WriteFile(p, []byte("stub"), 0o644)).To(Succeed()) + } + + // Mirror Load()'s pre-purego prefix: parse + slot fill. + v := &VibevoiceCpp{} + modelFile := tts // core delivers this as an abspath already + v.modelRoot = tmpDir + role := v.parseOptions([]string{ + "tokenizer=vibevoice-cpp/tokenizer.gguf", + "voice=vibevoice-cpp/voice.gguf", + }, v.modelRoot) + Expect(role).To(BeEmpty()) + if v.ttsModel == "" { + v.ttsModel = modelFile + } + + Expect(v.ttsModel).To(Equal(tts)) + Expect(v.tokenizer).To(Equal(tok)) + Expect(v.voice).To(Equal(voice)) + Expect(v.asrModel).To(BeEmpty()) + }) + + It("closes the channel and errors on AudioTranscriptionStream without a loaded model", func() { + ch := make(chan *pb.TranscriptStreamResponse, 4) + err := (&VibevoiceCpp{}).AudioTranscriptionStream(&pb.TranscriptRequest{ + Dst: "/tmp/some.wav", + }, ch) + Expect(err).To(HaveOccurred()) + _, ok := <-ch + Expect(ok).To(BeFalse(), "AudioTranscriptionStream must close results channel even on error") + }) + }) + + Context("gRPC server lifecycle", func() { + var cmd *exec.Cmd + + AfterEach(func() { + stopServer(cmd) + cmd = nil + }) + + It("answers Health checks", func() { + cmd = startServer() + conn := dialGRPC() + defer func() { _ = conn.Close() }() + + resp, err := pb.NewBackendClient(conn).Health(context.Background(), &pb.HealthMessage{}) + 
Expect(err).ToNot(HaveOccurred()) + Expect(string(resp.Message)).To(Equal("OK")) + }) + + It("loads the realtime TTS model", func() { + dir := modelDirOrSkip() + tts, _ := filepath.Glob(filepath.Join(dir, "vibevoice-realtime-*.gguf")) + if len(tts) == 0 { + Skip("realtime TTS gguf missing") + } + + cmd = startServer() + conn := dialGRPC() + defer func() { _ = conn.Close() }() + + // Mirror the gallery contract: ModelFile is whatever LocalAI + // core hands us; ModelPath is the models root; Options[] + // carry paths relative to ModelPath. + resp, err := pb.NewBackendClient(conn).LoadModel(context.Background(), &pb.ModelOptions{ + ModelFile: filepath.Base(tts[0]), + ModelPath: dir, + Threads: 4, + Options: []string{"tokenizer=tokenizer.gguf"}, + }) + Expect(err).ToNot(HaveOccurred()) + Expect(resp.Success).To(BeTrue(), "LoadModel msg=%q", resp.Message) + }) + + It("runs a closed-loop TTS -> ASR with >=80% word recall", func() { + dir := modelDirOrSkip() + tts, _ := filepath.Glob(filepath.Join(dir, "vibevoice-realtime-*.gguf")) + asr, _ := filepath.Glob(filepath.Join(dir, "vibevoice-asr-*.gguf")) + if len(tts) == 0 || len(asr) == 0 { + Skip("closed-loop needs both realtime TTS and ASR ggufs") + } + + tmpDir, err := os.MkdirTemp("", "vibevoice-cpp-closedloop-*") + Expect(err).ToNot(HaveOccurred()) + DeferCleanup(func() { _ = os.RemoveAll(tmpDir) }) + wav := filepath.Join(tmpDir, "say.wav") + + cmd = startServer() + conn := dialGRPC() + defer func() { _ = conn.Close() }() + client := pb.NewBackendClient(conn) + + // Gallery convention: ModelPath is the models root, every + // path inside Options[] is relative to it. 
+ voiceMatches, _ := filepath.Glob(filepath.Join(dir, "voice-*.gguf")) + loadOpts := &pb.ModelOptions{ + ModelFile: filepath.Base(tts[0]), + ModelPath: dir, + Threads: 4, + Options: []string{ + "asr_model=" + filepath.Base(asr[0]), + "tokenizer=tokenizer.gguf", + }, + } + if len(voiceMatches) > 0 { + loadOpts.Options = append(loadOpts.Options, "voice="+filepath.Base(voiceMatches[0])) + } + loadResp, err := client.LoadModel(context.Background(), loadOpts) + Expect(err).ToNot(HaveOccurred()) + Expect(loadResp.Success).To(BeTrue(), "LoadModel msg=%q", loadResp.Message) + + srcText := "Hello world this is a test of the synthesis system." + _, err = client.TTS(context.Background(), &pb.TTSRequest{ + Text: srcText, + Dst: wav, + }) + Expect(err).ToNot(HaveOccurred()) + + info, err := os.Stat(wav) + Expect(err).ToNot(HaveOccurred()) + Expect(info.Size()).To(BeNumerically(">=", 1000), + "TTS produced suspiciously small wav (%d bytes)", info.Size()) + + resp, err := client.AudioTranscription(context.Background(), &pb.TranscriptRequest{ + Dst: wav, + }) + Expect(err).ToNot(HaveOccurred()) + got := strings.ToLower(resp.Text) + GinkgoWriter.Printf("source : %s\n", srcText) + GinkgoWriter.Printf("transcribed: %s\n", got) + + wordRE := regexp.MustCompile(`[a-z]+`) + srcWords := wordRE.FindAllString(strings.ToLower(srcText), -1) + Expect(srcWords).ToNot(BeEmpty()) + hits := 0 + for _, w := range srcWords { + if strings.Contains(got, w) { + hits++ + } + } + recall := float64(hits) / float64(len(srcWords)) + GinkgoWriter.Printf("recall: %d/%d = %.2f%%\n", hits, len(srcWords), recall*100) + Expect(recall).To(BeNumerically(">=", 0.80), + "closed-loop recall too low: %d/%d = %.2f%%", + hits, len(srcWords), recall*100) + }) + }) +}) diff --git a/backend/index.yaml b/backend/index.yaml index 9042754f8..c6085827c 100644 --- a/backend/index.yaml +++ b/backend/index.yaml @@ -572,6 +572,34 @@ nvidia-l4t: "nvidia-l4t-arm64-qwen3-tts-cpp" nvidia-l4t-cuda-12: "nvidia-l4t-arm64-qwen3-tts-cpp" 
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-qwen3-tts-cpp" +- &vibevoicecpp + name: "vibevoice-cpp" + description: | + vibevoice.cpp C++ backend using GGML. Native C++ port of Microsoft VibeVoice for both + text-to-speech (with voice cloning via voice prompt GGUFs) and long-form ASR with + speaker diarization. Outputs 24kHz mono WAV; ASR returns per-speaker JSON segments. + urls: + - https://github.com/mudler/vibevoice.cpp + tags: + - text-to-speech + - tts + - speech-to-text + - asr + - voice-cloning + - diarization + alias: "vibevoice-cpp" + capabilities: + default: "cpu-vibevoice-cpp" + nvidia: "cuda12-vibevoice-cpp" + nvidia-cuda-13: "cuda13-vibevoice-cpp" + nvidia-cuda-12: "cuda12-vibevoice-cpp" + intel: "intel-sycl-f16-vibevoice-cpp" + metal: "metal-vibevoice-cpp" + amd: "rocm-vibevoice-cpp" + vulkan: "vulkan-vibevoice-cpp" + nvidia-l4t: "nvidia-l4t-arm64-vibevoice-cpp" + nvidia-l4t-cuda-12: "nvidia-l4t-arm64-vibevoice-cpp" + nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vibevoice-cpp" - &faster-whisper icon: https://avatars.githubusercontent.com/u/1520500?s=200&v=4 description: | @@ -2656,6 +2684,107 @@ uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-qwen3-tts-cpp" mirrors: - localai/localai-backends:master-gpu-nvidia-cuda-13-qwen3-tts-cpp +## vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "nvidia-l4t-arm64-vibevoice-cpp" + uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-vibevoice-cpp" + mirrors: + - localai/localai-backends:latest-nvidia-l4t-arm64-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "nvidia-l4t-arm64-vibevoice-cpp-development" + uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-vibevoice-cpp" + mirrors: + - localai/localai-backends:master-nvidia-l4t-arm64-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "cuda13-nvidia-l4t-arm64-vibevoice-cpp" + uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-vibevoice-cpp" + mirrors: + - 
localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "cuda13-nvidia-l4t-arm64-vibevoice-cpp-development" + uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-vibevoice-cpp" + mirrors: + - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "cpu-vibevoice-cpp" + uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-vibevoice-cpp" + mirrors: + - localai/localai-backends:latest-cpu-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "metal-vibevoice-cpp" + uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-vibevoice-cpp" + mirrors: + - localai/localai-backends:latest-metal-darwin-arm64-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "metal-vibevoice-cpp-development" + uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-vibevoice-cpp" + mirrors: + - localai/localai-backends:master-metal-darwin-arm64-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "cpu-vibevoice-cpp-development" + uri: "quay.io/go-skynet/local-ai-backends:master-cpu-vibevoice-cpp" + mirrors: + - localai/localai-backends:master-cpu-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "cuda12-vibevoice-cpp" + uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-vibevoice-cpp" + mirrors: + - localai/localai-backends:latest-gpu-nvidia-cuda-12-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "rocm-vibevoice-cpp" + uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-vibevoice-cpp" + mirrors: + - localai/localai-backends:latest-gpu-rocm-hipblas-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "intel-sycl-f32-vibevoice-cpp" + uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-vibevoice-cpp" + mirrors: + - localai/localai-backends:latest-gpu-intel-sycl-f32-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "intel-sycl-f16-vibevoice-cpp" + uri: 
"quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-vibevoice-cpp" + mirrors: + - localai/localai-backends:latest-gpu-intel-sycl-f16-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "vulkan-vibevoice-cpp" + uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-vibevoice-cpp" + mirrors: + - localai/localai-backends:latest-gpu-vulkan-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "vulkan-vibevoice-cpp-development" + uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-vibevoice-cpp" + mirrors: + - localai/localai-backends:master-gpu-vulkan-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "cuda12-vibevoice-cpp-development" + uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-vibevoice-cpp" + mirrors: + - localai/localai-backends:master-gpu-nvidia-cuda-12-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "rocm-vibevoice-cpp-development" + uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-vibevoice-cpp" + mirrors: + - localai/localai-backends:master-gpu-rocm-hipblas-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "intel-sycl-f32-vibevoice-cpp-development" + uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-vibevoice-cpp" + mirrors: + - localai/localai-backends:master-gpu-intel-sycl-f32-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "intel-sycl-f16-vibevoice-cpp-development" + uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-vibevoice-cpp" + mirrors: + - localai/localai-backends:master-gpu-intel-sycl-f16-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "cuda13-vibevoice-cpp" + uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-vibevoice-cpp" + mirrors: + - localai/localai-backends:latest-gpu-nvidia-cuda-13-vibevoice-cpp +- !!merge <<: *vibevoicecpp + name: "cuda13-vibevoice-cpp-development" + uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-vibevoice-cpp" + mirrors: + - 
localai/localai-backends:master-gpu-nvidia-cuda-13-vibevoice-cpp ## kokoro - !!merge <<: *kokoro name: "kokoro-development" diff --git a/gallery/index.yaml b/gallery/index.yaml index d8ce559c3..8234e2184 100644 --- a/gallery/index.yaml +++ b/gallery/index.yaml @@ -1681,6 +1681,83 @@ - filename: acestep-cpp/vae-BF16.gguf uri: huggingface://Serveurperso/ACE-Step-1.5-GGUF/vae-BF16.gguf sha256: 0599862ac5d15cd308e1d2e368373aea6c02e25ebd1737ad4a4562a0901b0ef8 +- name: "vibevoice-cpp" + license: mit + icon: https://github.com/microsoft/VibeVoice/raw/main/Figures/VibeVoice_logo_white.png + tags: + - tts + - text-to-speech + - voice-cloning + - vibevoice + - vibevoice-cpp + - gguf + url: "github:mudler/LocalAI/gallery/virtual.yaml@master" + urls: + - https://huggingface.co/mudler/vibevoice.cpp-models + - https://github.com/mudler/vibevoice.cpp + - https://github.com/microsoft/VibeVoice + description: | + VibeVoice Realtime 0.5B (C++ / GGML, Q8_0) - native C++ port of Microsoft VibeVoice + via the vibevoice-cpp backend. 24kHz mono TTS with voice cloning from a single + reference voice prompt. Default voice prompt: en-Carter_man. 
+ overrides: + name: vibevoice-cpp + backend: vibevoice-cpp + parameters: + model: vibevoice-cpp/vibevoice-realtime-0.5B-q8_0.gguf + options: + - tokenizer=vibevoice-cpp/tokenizer.gguf + - voice=vibevoice-cpp/voice-en-Carter_man.gguf + known_usecases: + - tts + files: + - filename: vibevoice-cpp/vibevoice-realtime-0.5B-q8_0.gguf + uri: huggingface://mudler/vibevoice.cpp-models/vibevoice-realtime-0.5B-q8_0.gguf + sha256: 5251e3f0386d1056a90c61b6c7359a4775da44dd19402499bef1989c4b5c653a + - filename: vibevoice-cpp/tokenizer.gguf + uri: huggingface://mudler/vibevoice.cpp-models/tokenizer.gguf + sha256: 37dc3b722d5677e37e29a57df55aa05c485116eeb5459e57ff8dde616b4986f6 + - filename: vibevoice-cpp/voice-en-Carter_man.gguf + uri: huggingface://mudler/vibevoice.cpp-models/voice-en-Carter_man.gguf + sha256: b15cd8b9cae6ee2c3d20b0ee6e7bfe93f13489f8b63b6834e9bbf0dfabf6505a +- name: "vibevoice-cpp-asr" + license: mit + icon: https://github.com/microsoft/VibeVoice/raw/main/Figures/VibeVoice_logo_white.png + tags: + - stt + - speech-to-text + - asr + - audio-transcription + - diarization + - vibevoice + - vibevoice-cpp + - gguf + url: "github:mudler/LocalAI/gallery/virtual.yaml@master" + urls: + - https://huggingface.co/mudler/vibevoice.cpp-models + - https://github.com/mudler/vibevoice.cpp + - https://github.com/microsoft/VibeVoice + description: | + VibeVoice ASR 7B (C++ / GGML, Q4_K) - long-form speech-to-text with speaker + diarization. Returns per-speaker JSON segments with start/end timestamps. + English-only. ~10 GB download. 
+ overrides: + name: vibevoice-cpp-asr + backend: vibevoice-cpp + parameters: + model: vibevoice-cpp-asr/vibevoice-asr-q4_k.gguf + options: + - type=asr + - tokenizer=vibevoice-cpp-asr/tokenizer.gguf + known_usecases: + - transcript + files: + - filename: vibevoice-cpp-asr/vibevoice-asr-q4_k.gguf + uri: huggingface://mudler/vibevoice.cpp-models/vibevoice-asr-q4_k.gguf + sha256: 4eee48b9d0d42f71b773b804aa6728c99971c38d54f3c86cf1fd0fc1fc49a9ad + - filename: vibevoice-cpp-asr/tokenizer.gguf + uri: huggingface://mudler/vibevoice.cpp-models/tokenizer.gguf + sha256: 37dc3b722d5677e37e29a57df55aa05c485116eeb5459e57ff8dde616b4986f6 - name: "qwen3-tts-cpp" license: apache-2.0 tags: