mirror of
https://github.com/mudler/LocalAI.git
synced 2026-04-29 19:44:13 -04:00
feat(vibevoice-cpp): add purego TTS+ASR backend (#9610)
* feat(vibevoice-cpp): add purego TTS+ASR backend
Wire up Microsoft VibeVoice via the vibevoice.cpp C ABI as a new
purego-based Go backend that serves both Backend.TTS and
Backend.AudioTranscription from a single gRPC binary. Mirrors the
qwen3-tts-cpp / sherpa-onnx pattern so the variant matrix
(cpu/cuda12/cuda13/metal/rocm/sycl-f16/f32/vulkan/l4t) and the
e2e-backends gRPC harness reuse existing infrastructure.
- backend/go/vibevoice-cpp/ - Makefile, CMakeLists, purego shim, gRPC
Backend with model-dir auto-detection, closed-loop TTS->ASR smoke test
- backend/index.yaml - &vibevoicecpp meta + 18 image entries
- Makefile - .NOTPARALLEL, BACKEND_VIBEVOICE_CPP, docker-build wiring,
test-extra-backend-vibevoice-cpp-{tts,transcription} e2e wrappers
- .github/workflows/backend.yml - matrix entries for all variants
- .github/workflows/test-extra.yml - per-backend smoke + 2 gRPC e2e jobs
* feat(vibevoice-cpp): drop hardcoded glob detection, add gallery entries
Refactor backend Load() to follow the standard Options[] convention
used by sherpa-onnx and the rest of the multi-role backends:
ModelFile is the primary gguf, supplementary paths come through
opts.Options[] as key=value (or key:value for Make-target compat),
resolved against opts.ModelPath. type=asr/tts decides the role of
ModelFile when neither tts_model nor asr_model is set explicitly.
Add gallery/index.yaml entries:
- vibevoice-cpp - realtime 0.5B Q8_0 TTS + tokenizer + Carter voice
- vibevoice-cpp-asr - long-form ASR Q8_0 + tokenizer
Both pull from huggingface://mudler/vibevoice.cpp-models with sha256
verification. parameters.model + Options[] paths are siblings under
{models_dir} per the qwen3-tts-cpp convention.
Update Makefile e2e wrappers to pass BACKEND_TEST_OPTIONS comma+colon
style, and tighten the per-backend Go closed-loop test to use the
explicit Options API.
* fix(vibevoice-cpp): force whole-archive link so vv_capi_* exports survive
libvibevoice is a STATIC archive linked into the MODULE library.
Without --whole-archive (or -force_load on Apple, /WHOLEARCHIVE on
MSVC), the linker garbage-collects symbols not referenced from this
translation unit - which means dlopen+RegisterLibFunc panics with
'undefined symbol: vv_capi_load' at backend startup, since purego
looks them up by name and our cpp/govibevoicecpp.cpp doesn't call
them directly.
* test(vibevoice-cpp): rewrite suite with Ginkgo v2
Match the convention used by backend/go/sherpa-onnx/backend_test.go.
The suite now covers backend semantics that don't need purego (Locking,
empty-ModelFile rejection, TTS/ASR-without-loaded-model errors) on top
of the gRPC lifecycle specs (Health, Load, closed-loop TTS->ASR).
Model-dependent specs Skip() when VIBEVOICE_MODEL_DIR is unset, so
`go test ./backend/go/vibevoice-cpp/` is green on a clean checkout
and runs the heavyweight closed-loop spec when test.sh has staged
the bundle.
* fix(vibevoice-cpp): implement TTSStream + AudioTranscriptionStream
The gRPC server's stream handlers (pkg/grpc/server.go) spawn a
goroutine that ranges over a chan; the only thing closing that chan
is the backend's own *Stream method. With the default Base stub
returning 'unimplemented' and never touching the chan, the server
goroutine hangs forever and the client hits DeadlineExceeded - which
is exactly what the e2e harness saw in the test-extra-backend-vibevoice-cpp-tts
matrix run.
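
The hang described above can be reproduced in miniature. The sketch below is an assumption-laden model, not the code in pkg/grpc/server.go: `stubStream`, `fixedStream` and `drain` are hypothetical names, with `drain` standing in for the server goroutine that ranges over the channel.

```go
package main

import (
	"errors"
	"fmt"
)

// stubStream models a default stub: it reports an error and never
// touches the channel, so a consumer ranging over it would block forever.
func stubStream(out chan []byte) error {
	return errors.New("unimplemented")
}

// fixedStream closes the channel on every return path, including errors.
func fixedStream(out chan []byte, chunks [][]byte, fail bool) error {
	defer close(out)
	if fail {
		return errors.New("synthesis failed")
	}
	for _, c := range chunks {
		out <- c
	}
	return nil
}

// drain plays the role of the server goroutine: the range loop only
// ends once the channel has been closed.
func drain(out chan []byte) int {
	n := 0
	for range out {
		n++
	}
	return n
}

func main() {
	out := make(chan []byte)
	done := make(chan int)
	go func() { done <- drain(out) }()
	err := fixedStream(out, [][]byte{[]byte("a"), []byte("b")}, false)
	fmt.Println(<-done, err)
}
```

Putting `close` in a `defer` is what lets the consumer finish on the error path as well as on success.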
TTSStream synthesizes via vv_capi_tts to a tempfile, then emits a
streaming WAV header (chunk sizes 0xFFFFFFFF so HTTP clients can
start playback before the full PCM lands) followed by the PCM body
in 64 KB slices. The header + >=2 PCM frames satisfy the harness's
'expected >=2 chunks' assertion and give a real progressive stream.
AudioTranscriptionStream runs the offline transcription, emits each
segment as a delta, and closes with a final_result whose Text equals
the concatenated deltas (the harness asserts those match).
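
The delta/final contract above can be illustrated with a small sketch. The `segment` type and `streamTranscript` function are hypothetical, and plain concatenation without a separator is an assumption of this example.

```go
package main

import (
	"fmt"
	"strings"
)

// segment is a hypothetical stand-in for one transcribed span.
type segment struct{ Text string }

// streamTranscript emits each segment's text as a delta and returns the
// final text, built by concatenating exactly the deltas that were emitted.
func streamTranscript(segs []segment, emit func(delta string)) string {
	var final strings.Builder
	for _, s := range segs {
		emit(s.Text)
		final.WriteString(s.Text)
	}
	return final.String()
}

func main() {
	var deltas []string
	final := streamTranscript(
		[]segment{{"ask not "}, {"what your country"}},
		func(d string) { deltas = append(deltas, d) },
	)
	fmt.Println(final == strings.Join(deltas, ""))
}
```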
Two new Ginkgo specs guard the close-channel-on-error path so the
deadline-exceeded regression can't come back silently.
* fix(vibevoice-cpp): silence errcheck on cleanup paths
Lint flagged six unchecked Close()/Remove()/RemoveAll() calls along
purely-cleanup deferred paths. Wrap each in '_ = ...' (or a closure
for defers that take args) - matches what the rest of the LocalAI
backend/go/* tree already does for these callsites.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(vibevoice-cpp): closed-loop slot fill + modelRoot-relative path resolution
Two bugs the test-extra-backend-vibevoice-cpp-* CI matrix surfaced:
1. Closed-loop Load with ModelFile=tts.gguf + Options[asr_model=...] left
v.ttsModel empty, because the default-fill block only ran when BOTH
slots were empty. vv_capi_load then got tts="" + a voice and the
C side rejected it with rc=-3 'TTS model required to load a voice'.
Fix: ModelFile fills the *primary* role-slot (decided by 'type=' in
Options, defaulting to tts) independently of the secondary, so
ModelFile + asr_model resolves to both.
2. resolvePath stat'd CWD before falling back to relTo. With LocalAI
launched from a directory that happens to contain a same-named
file, supplementary Options[] paths could leak away from the
models dir. Drop the CWD probe entirely - relative paths now
*always* join onto opts.ModelPath (the gallery convention).
New Ginkgo coverage:
  * 'ModelFile slot resolution' (4 specs) - asr_model+ModelFile, type=asr,
    explicit tts_model override, key:value variant.
  * 'resolvePath (relative-to-modelRoot)' (5 specs) - join, abs passthrough,
    empty input, empty relTo, and the CWD-trap regression test.
  * 'Load resolves relative Options paths against opts.ModelPath' - end-
    to-end gallery layout round-trip.
Verified locally: 19/19 specs pass (with model bundle, including the
closed-loop TTS->ASR; without bundle, 17 pass + 2 model-dependent skip).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(vibevoice-cpp): use gallery convention in closed-loop spec
The 'loads the realtime TTS model' / closed-loop specs were passing
already-prefixed paths into Options[]:
Options: ['tokenizer=' + filepath.Join(modelDir, 'tokenizer.gguf')]
Combined with no ModelPath set on the request, the backend's
modelRoot fell back to filepath.Dir(ModelFile) = modelDir, then
resolvePath joined the prefixed Options path on top of it -
producing 'vibevoice-models/vibevoice-models/tokenizer.gguf' when
the CI's VIBEVOICE_MODEL_DIR is the relative './vibevoice-models'.
The fix is to mirror the gallery contract LocalAI core actually
sends in production: ModelPath is the models root (absolute),
ModelFile is a name *under* it, every Options[] path is relative
to ModelPath. Uses filepath.Base() to get bare filenames.
Verified locally with both VIBEVOICE_MODEL_DIR=/tmp/vv-bundle (abs)
and VIBEVOICE_MODEL_DIR=vibevoice-models (the relative shape that
broke CI). Both: 19/19 specs pass, ~55-60s.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(vibevoice-cpp): switch ASR to Q4_K + bump transcription timeout
The Q8_0 ASR gguf is ~14 GB - too big to fit alongside the runner
image, the docker build cache, and the test artifacts on a free
ubuntu-latest GHA runner; 'test-extra-backend-vibevoice-cpp-transcription'
was getting SIGTERM'd at 90 min before the model could finish loading.
Switch to Q4_K (~10 GB on disk, slightly faster CPU decode) for:
  * the e2e harness Make target
  * the gallery 'vibevoice-cpp-asr' entry (parameters + files block)
  * the per-backend test.sh auto-download list
Bump tests-vibevoice-cpp-grpc-transcription's timeout-minutes from
90 to 150 - even with Q4_K, the 30 s JFK clip on a CPU runner needs
runway above the previous 90 min cap.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(vibevoice-cpp): drop transcription gRPC e2e job - too heavy for free runners
The vibevoice ASR is a 7B-parameter model. Even on Q4_K (~10 GB on
disk) a single 30 s transcription saturates the per-test 30 min
timeout in the e2e-backends harness on a 4-core ubuntu-latest, and
the 10 GB download + Docker layer + working space leaves no headroom
on the runner's free disk. Two attempts in CI got SIGTERM'd at the
LoadModel boundary - the bottleneck isn't tunable from the workflow
side without a paid-tier runner.
The per-backend tests-vibevoice-cpp job already runs the same
AudioTranscription path via a closed-loop TTS->ASR Ginkgo spec - same
gRPC contract, same model, single process - so the standalone
tests-vibevoice-cpp-grpc-transcription job was redundant on top of
the disk/CPU pressure.
The Makefile target test-extra-backend-vibevoice-cpp-transcription
stays for local invocation on workstations that can afford it -
useful when developing the streaming codepaths.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(vibevoice-cpp): restore transcription gRPC e2e on bigger-runner
Switch tests-vibevoice-cpp-grpc-transcription from ubuntu-latest to
the self-hosted 'bigger-runner' label that GPU image builds in
backend.yml use, plus the documented Free-disk-space prep step (purge
dotnet / ghc / android / CodeQL caches) the disabled vllm/sglang
entries in this file describe. That gives the 7B-param Q4_K ASR
model the disk + CPU runway it needs.
Keep timeout-minutes: 150 - even on a beefier runner the 30 s JFK
decode plus 10 GB download has to fit comfortably.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(vibevoice-cpp): apt-get install make on bigger-runner before transcription e2e
bigger-runner is a self-hosted bare runner without the standard
ubuntu image's preinstalled build tools, so the previous job died at
the very first command with 'make: command not found' (exit 127).
Add the Dependencies step that the disabled vllm/sglang entries in
this file already document - apt-get installs make + build-essential
+ curl + unzip + ca-certificates + git + tar before the make target
runs. Mirrors how every other 'runs-on: bigger-runner' entry in
backend.yml prepares the runner.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
committed by GitHub
parent 13fe37df89
commit fe6eb57082
122
.github/workflows/backend.yml
vendored
@@ -698,6 +698,19 @@ jobs:
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
- build-type: 'cublas'
  cuda-major-version: "12"
  cuda-minor-version: "8"
  platforms: 'linux/amd64'
  tag-latest: 'auto'
  tag-suffix: '-gpu-nvidia-cuda-12-vibevoice-cpp'
  runs-on: 'ubuntu-latest'
  base-image: "ubuntu:24.04"
  skip-drivers: 'false'
  backend: "vibevoice-cpp"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
- build-type: 'cublas'
  cuda-major-version: "12"
  cuda-minor-version: "8"
@@ -1440,6 +1453,19 @@ jobs:
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
- build-type: 'cublas'
  cuda-major-version: "13"
  cuda-minor-version: "0"
  platforms: 'linux/amd64'
  tag-latest: 'auto'
  tag-suffix: '-gpu-nvidia-cuda-13-vibevoice-cpp'
  runs-on: 'ubuntu-latest'
  base-image: "ubuntu:24.04"
  skip-drivers: 'false'
  backend: "vibevoice-cpp"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
- build-type: 'cublas'
  cuda-major-version: "13"
  cuda-minor-version: "0"
@@ -1466,6 +1492,19 @@ jobs:
  backend: "qwen3-tts-cpp"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
- build-type: 'cublas'
  cuda-major-version: "13"
  cuda-minor-version: "0"
  platforms: 'linux/arm64'
  skip-drivers: 'false'
  tag-latest: 'auto'
  tag-suffix: '-nvidia-l4t-cuda-13-arm64-vibevoice-cpp'
  base-image: "ubuntu:24.04"
  ubuntu-version: '2404'
  runs-on: 'ubuntu-24.04-arm'
  backend: "vibevoice-cpp"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
- build-type: 'cublas'
  cuda-major-version: "13"
  cuda-minor-version: "0"
@@ -2633,6 +2672,85 @@ jobs:
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
# vibevoice-cpp
- build-type: ''
  cuda-major-version: ""
  cuda-minor-version: ""
  platforms: 'linux/amd64,linux/arm64'
  tag-latest: 'auto'
  tag-suffix: '-cpu-vibevoice-cpp'
  runs-on: 'ubuntu-latest'
  base-image: "ubuntu:24.04"
  skip-drivers: 'false'
  backend: "vibevoice-cpp"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
- build-type: 'sycl_f32'
  cuda-major-version: ""
  cuda-minor-version: ""
  platforms: 'linux/amd64'
  tag-latest: 'auto'
  tag-suffix: '-gpu-intel-sycl-f32-vibevoice-cpp'
  runs-on: 'ubuntu-latest'
  base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
  skip-drivers: 'false'
  backend: "vibevoice-cpp"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
- build-type: 'sycl_f16'
  cuda-major-version: ""
  cuda-minor-version: ""
  platforms: 'linux/amd64'
  tag-latest: 'auto'
  tag-suffix: '-gpu-intel-sycl-f16-vibevoice-cpp'
  runs-on: 'ubuntu-latest'
  base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
  skip-drivers: 'false'
  backend: "vibevoice-cpp"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
- build-type: 'vulkan'
  cuda-major-version: ""
  cuda-minor-version: ""
  platforms: 'linux/amd64,linux/arm64'
  tag-latest: 'auto'
  tag-suffix: '-gpu-vulkan-vibevoice-cpp'
  runs-on: 'ubuntu-latest'
  base-image: "ubuntu:24.04"
  skip-drivers: 'false'
  backend: "vibevoice-cpp"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
- build-type: 'cublas'
  cuda-major-version: "12"
  cuda-minor-version: "0"
  platforms: 'linux/arm64'
  skip-drivers: 'false'
  tag-latest: 'auto'
  tag-suffix: '-nvidia-l4t-arm64-vibevoice-cpp'
  base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
  runs-on: 'ubuntu-24.04-arm'
  backend: "vibevoice-cpp"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2204'
- build-type: 'hipblas'
  cuda-major-version: ""
  cuda-minor-version: ""
  platforms: 'linux/amd64'
  tag-latest: 'auto'
  tag-suffix: '-gpu-rocm-hipblas-vibevoice-cpp'
  base-image: "rocm/dev-ubuntu-24.04:6.4.4"
  runs-on: 'ubuntu-latest'
  skip-drivers: 'false'
  backend: "vibevoice-cpp"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
# voxtral
- build-type: ''
  cuda-major-version: ""
@@ -3027,6 +3145,10 @@ jobs:
  tag-suffix: "-metal-darwin-arm64-qwen3-tts-cpp"
  build-type: "metal"
  lang: "go"
- backend: "vibevoice-cpp"
  tag-suffix: "-metal-darwin-arm64-vibevoice-cpp"
  build-type: "metal"
  lang: "go"
- backend: "voxtral"
  tag-suffix: "-metal-darwin-arm64-voxtral"
  build-type: "metal"
92
.github/workflows/test-extra.yml
vendored
@@ -36,6 +36,7 @@ jobs:
sglang: ${{ steps.detect.outputs.sglang }}
acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
vibevoice-cpp: ${{ steps.detect.outputs.vibevoice-cpp }}
voxtral: ${{ steps.detect.outputs.voxtral }}
kokoros: ${{ steps.detect.outputs.kokoros }}
insightface: ${{ steps.detect.outputs.insightface }}
@@ -792,6 +793,97 @@ jobs:
      - name: Test qwen3-tts-cpp
        run: |
          make --jobs=5 --output-sync=target -C backend/go/qwen3-tts-cpp test
  # Per-backend smoke for vibevoice-cpp: builds the .so + Go binary and
  # runs `make -C backend/go/vibevoice-cpp test`. test.sh auto-downloads
  # the published mudler/vibevoice.cpp-models bundle (TTS Q8_0 + ASR Q4_K
  # + tokenizer + voice) and runs the closed-loop TTS → ASR Go test.
  tests-vibevoice-cpp:
    needs: detect-changes
    if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential cmake curl libopenblas-dev ffmpeg
      - name: Setup Go
        uses: actions/setup-go@v5
      - name: Display Go version
        run: go version
      - name: Proto Dependencies
        run: |
          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
          rm protoc.zip
          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
          PATH="$PATH:$HOME/go/bin" make protogen-go
      - name: Build vibevoice-cpp
        run: |
          make --jobs=5 --output-sync=target -C backend/go/vibevoice-cpp
      - name: Test vibevoice-cpp
        run: |
          make --jobs=5 --output-sync=target -C backend/go/vibevoice-cpp test
  # End-to-end TTS via the e2e-backends gRPC harness. Builds the
  # vibevoice-cpp Docker image and drives Backend/TTS against it with a
  # real LocalAI gRPC client.
  tests-vibevoice-cpp-grpc-tts:
    needs: detect-changes
    if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.25.4'
      - name: Build vibevoice-cpp backend image and run TTS gRPC e2e tests
        run: |
          make test-extra-backend-vibevoice-cpp-tts
  # End-to-end transcription via the e2e-backends gRPC harness. The
  # vibevoice ASR is a 7B-param model (Q4_K weights ~10 GB on disk)
  # and the JFK 30 s decode is too heavy for a free 4-core
  # ubuntu-latest pool runner - two CI attempts got SIGTERM'd during
  # LoadModel, before the test could even progress. Use the
  # self-hosted 'bigger-runner' label (same one the GPU image builds
  # in backend.yml use) and the documented dotnet/ghc/android cache
  # purge to clear ~10-20 GB of headroom for the model + Docker
  # image + working dir.
  tests-vibevoice-cpp-grpc-transcription:
    needs: detect-changes
    if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: bigger-runner
    timeout-minutes: 150
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y --no-install-recommends \
            make build-essential curl unzip ca-certificates git tar
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.25.4'
      - name: Free disk space
        run: |
          sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
          df -h
      - name: Build vibevoice-cpp backend image and run ASR gRPC e2e tests
        run: |
          make test-extra-backend-vibevoice-cpp-transcription
  tests-voxtral:
    needs: detect-changes
    if: needs.detect-changes.outputs.voxtral == 'true' || needs.detect-changes.outputs.run-all == 'true'
32
Makefile
@@ -1,5 +1,5 @@
# Disable parallel execution for backend builds
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad backends/sherpa-onnx
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/tinygrad backends/sherpa-onnx
GOCMD=go
GOTEST=$(GOCMD) test
@@ -833,6 +833,32 @@ test-extra-backend-sherpa-onnx-tts: docker-build-sherpa-onnx
	BACKEND_TEST_CAPS=health,load,tts \
	$(MAKE) test-extra-backend

## VibeVoice TTS via the vibevoice-cpp backend. ModelFile is the
## realtime gguf; the supplementary tokenizer + voice prompt land
## alongside it under the harness's models dir and are wired through
## via the standard Options[] convention (tokenizer=, voice=).
test-extra-backend-vibevoice-cpp-tts: docker-build-vibevoice-cpp
	BACKEND_IMAGE=local-ai-backend:vibevoice-cpp \
	BACKEND_TEST_MODEL_URL='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/vibevoice-realtime-0.5B-q8_0.gguf#vibevoice-realtime-0.5B-q8_0.gguf' \
	BACKEND_TEST_EXTRA_FILES='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/tokenizer.gguf#tokenizer.gguf|https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/voice-en-Carter_man.gguf#voice-en-Carter_man.gguf' \
	BACKEND_TEST_OPTIONS=tokenizer:tokenizer.gguf,voice:voice-en-Carter_man.gguf \
	BACKEND_TEST_CAPS=health,load,tts \
	$(MAKE) test-extra-backend

## VibeVoice ASR (long-form, with diarization). type=asr tells the
## backend's Load() to slot ModelFile into the asr_model role; the
## tokenizer is supplied via Options[]. Uses the Q4_K quant (~10 GB)
## rather than Q8_0 (~14 GB) so the bundle fits inside ubuntu-latest's
## post-image disk budget.
test-extra-backend-vibevoice-cpp-transcription: docker-build-vibevoice-cpp
	BACKEND_IMAGE=local-ai-backend:vibevoice-cpp \
	BACKEND_TEST_MODEL_URL='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/vibevoice-asr-q4_k.gguf#vibevoice-asr-q4_k.gguf' \
	BACKEND_TEST_EXTRA_FILES='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/tokenizer.gguf#tokenizer.gguf' \
	BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
	BACKEND_TEST_OPTIONS=type:asr,tokenizer:tokenizer.gguf \
	BACKEND_TEST_CAPS=health,load,transcription \
	$(MAKE) test-extra-backend

## sglang mirrors the vllm setup: HuggingFace model id, same tiny Qwen,
## tool-call extraction via sglang's native qwen parser. CPU builds use
## sglang's upstream pyproject_cpu.toml recipe (see backend/python/sglang/install.sh).
@@ -969,6 +995,7 @@ BACKEND_WHISPER = whisper|golang|.|false|true
BACKEND_VOXTRAL = voxtral|golang|.|false|true
BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
BACKEND_VIBEVOICE_CPP = vibevoice-cpp|golang|.|false|true
BACKEND_OPUS = opus|golang|.|false|true
BACKEND_SHERPA_ONNX = sherpa-onnx|golang|.|false|true
@@ -1075,6 +1102,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_WHISPERX)))
$(eval $(call generate-docker-build-target,$(BACKEND_ACE_STEP)))
$(eval $(call generate-docker-build-target,$(BACKEND_ACESTEP_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_QWEN3_TTS_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_MLX)))
$(eval $(call generate-docker-build-target,$(BACKEND_MLX_VLM)))
$(eval $(call generate-docker-build-target,$(BACKEND_MLX_DISTRIBUTED)))
@@ -1089,7 +1117,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
docker-save-%: backend-images
docker save local-ai-backend:$* -o backend-images/$*.tar
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
########################################################
### Mock Backend for E2E Tests
71
backend/go/vibevoice-cpp/CMakeLists.txt
Normal file
@@ -0,0 +1,71 @@
cmake_minimum_required(VERSION 3.18)
project(govibevoicecpp LANGUAGES C CXX)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

set(VIBEVOICE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/sources/vibevoice.cpp)

# Override upstream's CMAKE_CUDA_ARCHITECTURES before add_subdirectory.
if(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
    set(CMAKE_CUDA_ARCHITECTURES "75-virtual;80-virtual;86-real;89-real")
endif()

# Force-disable upstream tests/examples — we only need libvibevoice.
set(VIBEVOICE_BUILD_TESTS OFF CACHE BOOL "" FORCE)
set(VIBEVOICE_BUILD_EXAMPLES OFF CACHE BOOL "" FORCE)
set(VIBEVOICE_BUILD_SERVER OFF CACHE BOOL "" FORCE)

# vibevoice.cpp's top-level CMakeLists already adds third_party/ggml as a
# subdirectory — no need to add it explicitly here, just include the
# whole project.
add_subdirectory(${VIBEVOICE_DIR} vibevoice EXCLUDE_FROM_ALL)

add_library(govibevoicecpp MODULE cpp/govibevoicecpp.cpp)

# libvibevoice is STATIC; without --whole-archive the linker GCs the
# vv_capi_* symbols (purego dlopens them by name, nothing in our
# translation unit references them). Force the static archive's
# entire contents into the MODULE so dlsym finds vv_capi_load etc.
if(APPLE)
    target_link_libraries(govibevoicecpp PRIVATE -Wl,-force_load $<TARGET_FILE:vibevoice>)
elseif(MSVC)
    target_link_libraries(govibevoicecpp PRIVATE vibevoice)
    set_property(TARGET govibevoicecpp APPEND PROPERTY LINK_FLAGS "/WHOLEARCHIVE:vibevoice")
else()
    target_link_libraries(govibevoicecpp PRIVATE
        -Wl,--whole-archive vibevoice -Wl,--no-whole-archive)
endif()

target_include_directories(govibevoicecpp PRIVATE ${VIBEVOICE_DIR}/include)
target_include_directories(govibevoicecpp SYSTEM PRIVATE ${VIBEVOICE_DIR}/third_party/ggml/include)

# Link GPU backends if available — vibevoice's own CMake already links
# these to the libvibevoice STATIC library, but we re-link them on the
# MODULE so resolved symbols include all backend kernels.
foreach(backend blas cuda metal vulkan)
    if(TARGET ggml-${backend})
        target_link_libraries(govibevoicecpp PRIVATE ggml-${backend})
        string(TOUPPER ${backend} BACKEND_UPPER)
        target_compile_definitions(govibevoicecpp PRIVATE VIBEVOICE_HAVE_${BACKEND_UPPER})
        if(backend STREQUAL "cuda")
            find_package(CUDAToolkit QUIET)
            if(CUDAToolkit_FOUND)
                target_link_libraries(govibevoicecpp PRIVATE CUDA::cudart)
            endif()
        endif()
    endif()
endforeach()

if(MSVC)
    target_compile_options(govibevoicecpp PRIVATE /W4 /wd4100 /wd4505)
else()
    target_compile_options(govibevoicecpp PRIVATE -Wall -Wextra -Wshadow
        -Wno-unused-parameter -Wno-unused-function -Wno-sign-conversion)
endif()

if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)
    target_link_libraries(govibevoicecpp PRIVATE stdc++fs)
endif()

set_property(TARGET govibevoicecpp PROPERTY CXX_STANDARD 17)
set_target_properties(govibevoicecpp PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})
128
backend/go/vibevoice-cpp/Makefile
Normal file
@@ -0,0 +1,128 @@
|
||||
CMAKE_ARGS?=
|
||||
BUILD_TYPE?=
|
||||
NATIVE?=false
|
||||
|
||||
GOCMD?=go
|
||||
GO_TAGS?=
|
||||
JOBS?=$(shell nproc --ignore=1)
|
||||
|
||||
# vibevoice.cpp version
|
||||
VIBEVOICE_REPO?=https://github.com/mudler/vibevoice.cpp
|
||||
VIBEVOICE_CPP_VERSION?=master
|
||||
SO_TARGET?=libgovibevoicecpp.so
|
||||
|
||||
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
|
||||
CMAKE_ARGS+=-DVIBEVOICE_BUILD_TESTS=OFF
|
||||
CMAKE_ARGS+=-DVIBEVOICE_BUILD_EXAMPLES=OFF
|
||||
|
||||
ifeq ($(NATIVE),false)
|
||||
CMAKE_ARGS+=-DGGML_NATIVE=OFF
|
||||
endif
|
||||
|
||||
ifeq ($(BUILD_TYPE),cublas)
|
||||
CMAKE_ARGS+=-DGGML_CUDA=ON -DVIBEVOICE_GGML_CUDA=ON
|
||||
else ifeq ($(BUILD_TYPE),openblas)
|
||||
CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
|
||||
else ifeq ($(BUILD_TYPE),clblas)
|
||||
CMAKE_ARGS+=-DGGML_CLBLAST=ON -DCLBlast_DIR=/some/path
|
||||
else ifeq ($(BUILD_TYPE),hipblas)
|
||||
CMAKE_ARGS+=-DGGML_HIPBLAS=ON -DVIBEVOICE_GGML_HIPBLAS=ON
|
||||
else ifeq ($(BUILD_TYPE),vulkan)
|
||||
CMAKE_ARGS+=-DGGML_VULKAN=ON -DVIBEVOICE_GGML_VULKAN=ON
|
||||
else ifeq ($(OS),Darwin)
|
||||
ifneq ($(BUILD_TYPE),metal)
|
||||
CMAKE_ARGS+=-DGGML_METAL=OFF
|
||||
else
|
||||
CMAKE_ARGS+=-DGGML_METAL=ON -DVIBEVOICE_GGML_METAL=ON
|
||||
CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
|
||||
endif
|
||||
endif
|
||||
|
||||
ifeq ($(BUILD_TYPE),sycl_f16)
|
||||
CMAKE_ARGS+=-DGGML_SYCL=ON \
|
||||
-DCMAKE_C_COMPILER=icx \
|
||||
-DCMAKE_CXX_COMPILER=icpx \
|
||||
-DGGML_SYCL_F16=ON
|
||||
endif
|
||||
|
||||
ifeq ($(BUILD_TYPE),sycl_f32)
|
||||
CMAKE_ARGS+=-DGGML_SYCL=ON \
|
||||
-DCMAKE_C_COMPILER=icx \
|
||||
-DCMAKE_CXX_COMPILER=icpx
|
||||
endif
|
||||
|
||||
sources/vibevoice.cpp:
	mkdir -p sources/vibevoice.cpp
	cd sources/vibevoice.cpp && \
	git init && \
	git remote add origin $(VIBEVOICE_REPO) && \
	git fetch origin && \
	git checkout $(VIBEVOICE_CPP_VERSION) && \
	git submodule update --init --recursive --depth 1 --single-branch
|
||||
|
||||
# Detect OS
|
||||
UNAME_S := $(shell uname -s)
|
||||
|
||||
# Only build CPU variants on Linux
|
||||
ifeq ($(UNAME_S),Linux)
|
||||
VARIANT_TARGETS = libgovibevoicecpp-avx.so libgovibevoicecpp-avx2.so libgovibevoicecpp-avx512.so libgovibevoicecpp-fallback.so
|
||||
else
|
||||
# On non-Linux (e.g., Darwin), build only fallback variant
|
||||
VARIANT_TARGETS = libgovibevoicecpp-fallback.so
|
||||
endif
|
||||
|
||||
vibevoice-cpp: main.go govibevoicecpp.go $(VARIANT_TARGETS)
|
||||
CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o vibevoice-cpp ./
|
||||
|
||||
package: vibevoice-cpp
|
||||
bash package.sh
|
||||
|
||||
build: package
|
||||
|
||||
clean: purge
|
||||
rm -rf libgovibevoicecpp*.so package sources/vibevoice.cpp vibevoice-cpp
|
||||
|
||||
purge:
|
||||
rm -rf build*
|
||||
|
||||
# Variants must build sequentially
|
||||
.NOTPARALLEL:
|
||||
|
||||
# Build all variants (Linux only)
|
||||
ifeq ($(UNAME_S),Linux)
|
||||
libgovibevoicecpp-avx.so: sources/vibevoice.cpp
|
||||
$(info ${GREEN}I vibevoice-cpp build info:avx${RESET})
|
||||
SO_TARGET=libgovibevoicecpp-avx.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgovibevoicecpp-custom
|
||||
rm -rf build-libgovibevoicecpp-avx.so
|
||||
|
||||
libgovibevoicecpp-avx2.so: sources/vibevoice.cpp
|
||||
$(info ${GREEN}I vibevoice-cpp build info:avx2${RESET})
|
||||
SO_TARGET=libgovibevoicecpp-avx2.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgovibevoicecpp-custom
|
||||
rm -rf build-libgovibevoicecpp-avx2.so
|
||||
|
||||
libgovibevoicecpp-avx512.so: sources/vibevoice.cpp
|
||||
$(info ${GREEN}I vibevoice-cpp build info:avx512${RESET})
|
||||
SO_TARGET=libgovibevoicecpp-avx512.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgovibevoicecpp-custom
|
||||
rm -rf build-libgovibevoicecpp-avx512.so
|
||||
endif
|
||||
|
||||
# Build fallback variant (all platforms)
|
||||
libgovibevoicecpp-fallback.so: sources/vibevoice.cpp
|
||||
$(info ${GREEN}I vibevoice-cpp build info:fallback${RESET})
|
||||
SO_TARGET=libgovibevoicecpp-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgovibevoicecpp-custom
|
||||
rm -rf build-libgovibevoicecpp-fallback.so
|
||||
|
||||
libgovibevoicecpp-custom: CMakeLists.txt cpp/govibevoicecpp.cpp cpp/govibevoicecpp.h
|
||||
mkdir -p build-$(SO_TARGET) && \
|
||||
cd build-$(SO_TARGET) && \
|
||||
cmake .. $(CMAKE_ARGS) && \
|
||||
cmake --build . --config Release -j$(JOBS) --target govibevoicecpp && \
|
||||
cd .. && \
|
||||
mv build-$(SO_TARGET)/libgovibevoicecpp.so ./$(SO_TARGET)
|
||||
|
||||
test: vibevoice-cpp
|
||||
@echo "Running vibevoice-cpp tests..."
|
||||
bash test.sh
|
||||
@echo "vibevoice-cpp tests completed."
|
||||
|
||||
all: vibevoice-cpp package
|
||||
41
backend/go/vibevoice-cpp/cpp/govibevoicecpp.cpp
Normal file
@@ -0,0 +1,41 @@
|
||||
// vibevoice.cpp ships its purego-friendly ABI in vibevoice_capi.h.
// This translation unit is intentionally tiny: pulling in the header
// (and linking libvibevoice PRIVATE in CMake) is enough to make the
// vv_capi_* symbols visible from the produced MODULE library.
//
// We do install a ggml log redirect so backend logs land on the gRPC
// server's stderr, following the same pattern as
// backend/go/qwen3-tts-cpp/cpp/.
|
||||
|
||||
#include "govibevoicecpp.h"
|
||||
|
||||
#include "ggml.h"
|
||||
#include "ggml-backend.h"
|
||||
|
||||
#include <cstdio>
|
||||
|
||||
namespace {
|
||||
|
||||
void govibevoice_log_cb(enum ggml_log_level level, const char* msg, void* /*ud*/) {
|
||||
if (!msg) return;
|
||||
const char* tag = "?????";
|
||||
switch (level) {
|
||||
case GGML_LOG_LEVEL_DEBUG: tag = "DEBUG"; break;
|
||||
case GGML_LOG_LEVEL_INFO: tag = "INFO"; break;
|
||||
case GGML_LOG_LEVEL_WARN: tag = "WARN"; break;
|
||||
case GGML_LOG_LEVEL_ERROR: tag = "ERROR"; break;
|
||||
default: break;
|
||||
}
|
||||
std::fprintf(stderr, "[%-5s] %s", tag, msg);
|
||||
std::fflush(stderr);
|
||||
}
|
||||
|
||||
struct LogInstaller {
|
||||
LogInstaller() {
|
||||
ggml_log_set(govibevoice_log_cb, nullptr);
|
||||
ggml_backend_load_all();
|
||||
}
|
||||
};
|
||||
|
||||
LogInstaller g_install;
|
||||
|
||||
} // namespace
|
||||
7
backend/go/vibevoice-cpp/cpp/govibevoicecpp.h
Normal file
@@ -0,0 +1,7 @@
|
||||
#pragma once
|
||||
|
||||
// Re-exports the vibevoice.cpp flat C ABI so this MODULE library
|
||||
// resolves the same symbols that purego.RegisterLibFunc looks up by
|
||||
// name. The actual definitions live in libvibevoice (linked PRIVATE).
|
||||
|
||||
#include "vibevoice_capi.h"
|
||||
387
backend/go/vibevoice-cpp/govibevoicecpp.go
Normal file
@@ -0,0 +1,387 @@
|
||||
package main
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"strings"
|
||||
|
||||
laudio "github.com/mudler/LocalAI/pkg/audio"
|
||||
"github.com/mudler/LocalAI/pkg/grpc/base"
|
||||
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
|
||||
)
|
||||
|
||||
// vibevoice.cpp synthesizes 24 kHz mono 16-bit PCM. Hardcoded - the
|
||||
// model itself is fixed-rate; if the upstream ever changes this we'll
|
||||
// pick it up via vv_capi_version().
|
||||
const vibevoiceSampleRate = uint32(24000)
|
||||
|
||||
// purego-bound entry points from libgovibevoicecpp.
|
||||
var (
|
||||
CppLoad func(ttsModel, asrModel, tokenizer, voice string, threads int32) int32
|
||||
CppTTS func(text, voicePath, dstWav string,
|
||||
nSteps int32, cfgScale float32, maxSpeechFrames int32, seed uint32) int32
|
||||
CppASR func(srcWav string, outJSON []byte, capacity uint64,
|
||||
maxNewTokens int32) int32
|
||||
CppUnload func()
|
||||
CppVersion func() string
|
||||
)
|
||||
|
||||
// VibevoiceCpp speaks gRPC against vibevoice.cpp's flat C ABI. The
|
||||
// engine is a single global, so we serialize calls through SingleThread.
|
||||
type VibevoiceCpp struct {
|
||||
base.SingleThread
|
||||
threads int
|
||||
|
||||
// modelRoot is the directory we use to resolve relative paths
|
||||
// from Options[] and per-call overrides (TTSRequest.Voice).
|
||||
// Source of truth: opts.ModelPath; falls back to the dir of
|
||||
// the primary ModelFile when ModelPath is empty.
|
||||
modelRoot string
|
||||
|
||||
ttsModel string
|
||||
asrModel string
|
||||
tokenizer string
|
||||
voice string
|
||||
}
|
||||
|
||||
// resolvePath joins a relative path onto `relTo`. The gallery
|
||||
// convention is that Options[] carry paths relative to the LocalAI
|
||||
// models dir (opts.ModelPath), so anything not absolute is treated
|
||||
// as a sibling of the primary ModelFile - never CWD. Empty / already-
|
||||
// absolute / no-relTo inputs pass through unchanged.
|
||||
func resolvePath(p, relTo string) string {
|
||||
if p == "" || filepath.IsAbs(p) || relTo == "" {
|
||||
return p
|
||||
}
|
||||
return filepath.Join(relTo, p)
|
||||
}
|
||||
|
||||
// parseOptions reads opts.Options[] and pulls out the per-role
|
||||
// overrides documented in the gallery entries. Accepts both "key=value"
|
||||
// (gallery YAML style) and "key:value" (Make-target / env-var style).
|
||||
func (v *VibevoiceCpp) parseOptions(opts []string, relTo string) string {
|
||||
role := ""
|
||||
for _, raw := range opts {
|
||||
k, val, ok := strings.Cut(raw, "=")
|
||||
if !ok {
|
||||
k, val, ok = strings.Cut(raw, ":")
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
}
|
||||
key := strings.TrimSpace(k)
|
||||
val = strings.TrimSpace(val)
|
||||
switch key {
|
||||
case "type":
|
||||
role = strings.ToLower(val)
|
||||
case "tokenizer":
|
||||
v.tokenizer = resolvePath(val, relTo)
|
||||
case "voice":
|
||||
v.voice = resolvePath(val, relTo)
|
||||
case "tts_model":
|
||||
v.ttsModel = resolvePath(val, relTo)
|
||||
case "asr_model":
|
||||
v.asrModel = resolvePath(val, relTo)
|
||||
}
|
||||
}
|
||||
return role
|
||||
}
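Review note: the separator fallback in parseOptions can be exercised in isolation. The sketch below is a minimal restatement under stated assumptions; `splitOption` is a hypothetical helper name that does not exist in the backend, and it only covers the split-and-trim step, not the per-key assignment.

```go
package main

import (
	"fmt"
	"strings"
)

// splitOption is a hypothetical helper (not part of the backend). It mirrors
// the separator fallback used by parseOptions: try "=" first, then ":".
func splitOption(raw string) (key, val string, ok bool) {
	k, v, found := strings.Cut(raw, "=")
	if !found {
		k, v, found = strings.Cut(raw, ":")
		if !found {
			// No separator at all: the option is ignored.
			return "", "", false
		}
	}
	return strings.TrimSpace(k), strings.TrimSpace(v), true
}

func main() {
	for _, raw := range []string{"tokenizer=tokenizer.gguf", "type:asr", "bogus"} {
		k, v, ok := splitOption(raw)
		fmt.Printf("%q -> key=%q val=%q ok=%v\n", raw, k, v, ok)
	}
}
```

One consequence of trying "=" first: a colon-style option whose value happens to contain "=" is split at the "=", so the key ends up including the colon and matches none of the known keys.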
|
||||
|
||||
func (v *VibevoiceCpp) Load(opts *pb.ModelOptions) error {
|
||||
if opts.ModelFile == "" {
|
||||
return fmt.Errorf("vibevoice-cpp: ModelFile is required")
|
||||
}
|
||||
modelFile := opts.ModelFile
|
||||
if !filepath.IsAbs(modelFile) && opts.ModelPath != "" {
|
||||
modelFile = filepath.Join(opts.ModelPath, modelFile)
|
||||
}
|
||||
|
||||
// ModelPath is the LocalAI core's models root, propagated over
|
||||
// gRPC. Use it as the resolution base for Options[] (and later
|
||||
// for TTSRequest.Voice) so gallery entries can reference paths
|
||||
// like "tokenizer=tokenizer.gguf" and have them resolved
|
||||
// against the same root the core used to drop the files.
|
||||
v.modelRoot = opts.ModelPath
|
||||
if v.modelRoot == "" {
|
||||
v.modelRoot = filepath.Dir(modelFile)
|
||||
}
|
||||
role := v.parseOptions(opts.Options, v.modelRoot)
|
||||
|
||||
// ModelFile fills the "primary" role-slot determined by `type=`
|
||||
// in Options (defaults to tts). The other slot stays exactly as
|
||||
// Options set it - so a closed-loop config with ModelFile=tts.gguf
|
||||
// + Options[asr_model=asr.gguf] resolves correctly to both slots,
|
||||
// and an explicit `tts_model=` / `asr_model=` always wins over
|
||||
// ModelFile for its own slot.
|
||||
primaryIsASR := false
|
||||
switch role {
|
||||
case "asr", "transcript", "stt", "speech-to-text":
|
||||
primaryIsASR = true
|
||||
}
|
||||
if primaryIsASR {
|
||||
if v.asrModel == "" {
|
||||
v.asrModel = modelFile
|
||||
}
|
||||
} else if v.ttsModel == "" {
|
||||
v.ttsModel = modelFile
|
||||
}
|
||||
|
||||
if v.ttsModel == "" && v.asrModel == "" {
|
||||
return fmt.Errorf("vibevoice-cpp: no TTS or ASR model resolved from ModelFile=%q + options", opts.ModelFile)
|
||||
}
|
||||
if v.tokenizer == "" {
|
||||
return fmt.Errorf("vibevoice-cpp: tokenizer is required - pass options: [tokenizer=<path>]")
|
||||
}
|
||||
|
||||
threads := int(opts.Threads)
|
||||
if threads <= 0 {
|
||||
threads = 4
|
||||
}
|
||||
v.threads = threads
|
||||
|
||||
fmt.Fprintf(os.Stderr,
|
||||
"[vibevoice-cpp] Loading: tts=%q asr=%q tokenizer=%q voice=%q threads=%d\n",
|
||||
v.ttsModel, v.asrModel, v.tokenizer, v.voice, threads)
|
||||
|
||||
if rc := CppLoad(v.ttsModel, v.asrModel, v.tokenizer, v.voice, int32(threads)); rc != 0 {
|
||||
return fmt.Errorf("vibevoice-cpp: vv_capi_load failed (rc=%d)", rc)
|
||||
}
|
||||
return nil
|
||||
}
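Review note: the slot rules in Load (ModelFile fills the slot named by `type=`, explicit `tts_model=` / `asr_model=` always win) can be written as a small pure function. This is an illustrative sketch only; `assignPrimary` is a hypothetical name, and it assumes the role has already been lower-cased as parseOptions does.

```go
package main

import "fmt"

// assignPrimary is a hypothetical, side-effect-free restatement of the slot
// logic: modelFile fills the slot selected by role unless that slot was
// already set explicitly through Options[].
func assignPrimary(role, modelFile, tts, asr string) (string, string) {
	switch role {
	case "asr", "transcript", "stt", "speech-to-text":
		if asr == "" {
			asr = modelFile
		}
	default: // empty or unknown role: ModelFile is treated as the TTS model
		if tts == "" {
			tts = modelFile
		}
	}
	return tts, asr
}

func main() {
	// Closed-loop config: ModelFile is the TTS model, ASR comes from Options[].
	tts, asr := assignPrimary("", "tts.gguf", "", "asr.gguf")
	fmt.Printf("tts=%q asr=%q\n", tts, asr)
}
```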
|
||||
|
||||
func (v *VibevoiceCpp) TTS(req *pb.TTSRequest) error {
|
||||
if v.ttsModel == "" {
|
||||
return fmt.Errorf("vibevoice-cpp: TTS requested but no realtime model was loaded")
|
||||
}
|
||||
text := req.Text
|
||||
dst := req.Dst
|
||||
if text == "" || dst == "" {
|
||||
return fmt.Errorf("vibevoice-cpp: TTS requires both text and dst")
|
||||
}
|
||||
|
||||
// req.Voice may be a bare filename (e.g. "voice-en-Emma.gguf") or an
|
||||
// absolute path. Resolve via the same modelRoot Load() used for
|
||||
// Options[] so a swap-voice request mirrors the gallery's layout.
|
||||
voice := resolvePath(req.Voice, v.modelRoot)
|
||||
|
||||
if req.Language != nil && *req.Language != "" {
|
||||
fmt.Fprintf(os.Stderr,
|
||||
"[vibevoice-cpp] note: TTSRequest.language=%q ignored - vibevoice picks language from the voice prompt\n",
|
||||
*req.Language)
|
||||
}
|
||||
|
||||
const (
|
||||
defaultSteps = 20
|
||||
defaultMaxFrames = 200
|
||||
)
|
||||
defaultCfg := float32(1.3)
|
||||
if rc := CppTTS(text, voice, dst,
|
||||
int32(defaultSteps), defaultCfg, int32(defaultMaxFrames), 0); rc != 0 {
|
||||
return fmt.Errorf("vibevoice-cpp: vv_capi_tts failed (rc=%d)", rc)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// asrSegment matches vibevoice's JSON output:
|
||||
//
|
||||
// [{"Start":0.0,"End":2.8,"Speaker":0,"Content":"…"}, ...]
|
||||
type asrSegment struct {
|
||||
Start float64 `json:"Start"`
|
||||
End float64 `json:"End"`
|
||||
Speaker int `json:"Speaker"`
|
||||
Content string `json:"Content"`
|
||||
}
|
||||
|
||||
// callASR invokes vv_capi_asr with a buffer that grows on demand.
// vv_capi_asr returns: >0 bytes written, 0 no transcript, <0 error or
// -required_size. We honor the resize protocol once before giving up.
func (v *VibevoiceCpp) callASR(srcWav string, maxNewTokens int32) (string, error) {
	const startCap = 256 * 1024
	buf := make([]byte, startCap)
	rc := CppASR(srcWav, buf, uint64(len(buf)), maxNewTokens)
	if rc < 0 {
		need := -int(rc)
		if need > 0 && need < (16<<20) && need > len(buf) {
			buf = make([]byte, need+64)
			rc = CppASR(srcWav, buf, uint64(len(buf)), maxNewTokens)
		}
	}
	if rc < 0 {
		return "", fmt.Errorf("vibevoice-cpp: vv_capi_asr failed (rc=%d)", rc)
	}
	if rc == 0 {
		return "", nil
	}
	return string(buf[:rc]), nil
}
|
||||
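Review note: the retry-once resize protocol described for callASR can be tried without the native library. In the sketch below, `fakeASR` is a stand-in invented for illustration (it is not the real vv_capi_asr), and `callWithRetry` and its `maxCap` parameter are hypothetical names; only the control flow is meant to match.

```go
package main

import "fmt"

// fakeASR stands in for the C entry point (assumption, for illustration):
// it copies the transcript into dst and returns the byte count, or
// -required_size when dst is too small.
func fakeASR(transcript string, dst []byte) int {
	if len(transcript) > len(dst) {
		return -len(transcript)
	}
	return copy(dst, transcript)
}

// callWithRetry honors the resize request once, within maxCap, then gives up.
func callWithRetry(transcript string, startCap, maxCap int) (string, error) {
	buf := make([]byte, startCap)
	rc := fakeASR(transcript, buf)
	if rc < 0 {
		need := -rc
		if need > len(buf) && need < maxCap {
			buf = make([]byte, need)
			rc = fakeASR(transcript, buf)
		}
	}
	if rc < 0 {
		return "", fmt.Errorf("asr failed (rc=%d)", rc)
	}
	return string(buf[:rc]), nil
}

func main() {
	out, err := callWithRetry("hello world", 4, 1024)
	fmt.Printf("out=%q err=%v\n", out, err)
}
```

Note that a small negative value, smaller in magnitude than the current buffer, is never treated as a size request, so plain error codes fall straight through to the error path.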
|
||||
// TTSStream is the streaming counterpart to TTS. vibevoice's C ABI is
|
||||
// file-only (vv_capi_tts writes a complete WAV), so we synthesize to
|
||||
// a tempfile, then emit a streaming-WAV header followed by the PCM
|
||||
// body in chunks. The main reason this exists at all is the gRPC
|
||||
// server wrapper (pkg/grpc/server.go:TTSStream) blocks on a channel
|
||||
// that only this method can close - if we leave the default Base
|
||||
// stub in place, every TTSStream call hangs until the client
|
||||
// deadline.
|
||||
func (v *VibevoiceCpp) TTSStream(req *pb.TTSRequest, results chan []byte) error {
|
||||
defer close(results)
|
||||
if v.ttsModel == "" {
|
||||
return fmt.Errorf("vibevoice-cpp: TTSStream requested but no realtime model was loaded")
|
||||
}
|
||||
if req.Text == "" {
|
||||
return fmt.Errorf("vibevoice-cpp: TTSStream requires text")
|
||||
}
|
||||
|
||||
tmp, err := os.CreateTemp("", "vibevoice-cpp-stream-*.wav")
|
||||
if err != nil {
|
||||
return fmt.Errorf("vibevoice-cpp: tempfile: %w", err)
|
||||
}
|
||||
dst := tmp.Name()
|
||||
_ = tmp.Close()
|
||||
defer func() { _ = os.Remove(dst) }()
|
||||
|
||||
if err := v.TTS(&pb.TTSRequest{
|
||||
Text: req.Text,
|
||||
Voice: req.Voice,
|
||||
Dst: dst,
|
||||
Language: req.Language,
|
||||
}); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
wav, err := os.ReadFile(dst)
|
||||
if err != nil {
|
||||
return fmt.Errorf("vibevoice-cpp: read tempfile: %w", err)
|
||||
}
|
||||
|
||||
// Streaming WAV header: declare 0xFFFFFFFF for chunk sizes so HTTP
|
||||
// clients can start playback before they see the full PCM.
|
||||
const streamingSize = 0xFFFFFFFF
|
||||
hdr := laudio.NewWAVHeaderWithRate(streamingSize, vibevoiceSampleRate)
|
||||
hdr.ChunkSize = streamingSize
|
||||
hdrBuf := make([]byte, 0, laudio.WAVHeaderSize)
|
||||
w := newByteWriter(&hdrBuf)
|
||||
if err := hdr.Write(w); err != nil {
|
||||
return fmt.Errorf("vibevoice-cpp: write WAV header: %w", err)
|
||||
}
|
||||
results <- hdrBuf
|
||||
|
||||
// PCM body: send in ~64 KB slices so the client gets multiple
|
||||
// reply chunks (e2e harness asserts >=2 frames).
|
||||
pcm := laudio.StripWAVHeader(wav)
|
||||
const chunkBytes = 64 * 1024
|
||||
for off := 0; off < len(pcm); off += chunkBytes {
|
||||
end := off + chunkBytes
|
||||
if end > len(pcm) {
|
||||
end = len(pcm)
|
||||
}
|
||||
chunk := make([]byte, end-off)
|
||||
copy(chunk, pcm[off:end])
|
||||
results <- chunk
|
||||
}
|
||||
return nil
|
||||
}
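Review note: the body-chunking loop in TTSStream is easy to check on its own. The sketch below assumes a hypothetical helper `chunkPCM` that returns the slices instead of sending them on a channel; `size` must be positive.

```go
package main

import "fmt"

// chunkPCM (hypothetical helper) splits pcm into pieces of at most size
// bytes. Each piece is copied, so later writes to pcm do not alter chunks
// that have already been handed to a consumer.
func chunkPCM(pcm []byte, size int) [][]byte {
	var out [][]byte
	for off := 0; off < len(pcm); off += size {
		end := off + size
		if end > len(pcm) {
			end = len(pcm)
		}
		chunk := make([]byte, end-off)
		copy(chunk, pcm[off:end])
		out = append(out, chunk)
	}
	return out
}

func main() {
	chunks := chunkPCM(make([]byte, 10), 4)
	for i, c := range chunks {
		fmt.Printf("chunk %d: %d bytes\n", i, len(c))
	}
}
```

With 64 KB chunks and 24 kHz mono 16-bit PCM (48,000 bytes per second), any clip longer than roughly 1.4 seconds yields at least two body chunks in addition to the header frame.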
|
||||
|
||||
// byteWriter adapts a *[]byte to io.Writer so we can hand it to
|
||||
// laudio.WAVHeader.Write without allocating a bytes.Buffer.
|
||||
type byteWriter struct{ buf *[]byte }
|
||||
|
||||
func newByteWriter(b *[]byte) *byteWriter { return &byteWriter{buf: b} }
|
||||
func (w *byteWriter) Write(p []byte) (int, error) {
|
||||
*w.buf = append(*w.buf, p...)
|
||||
return len(p), nil
|
||||
}
|
||||
|
||||
func (v *VibevoiceCpp) AudioTranscription(req *pb.TranscriptRequest) (pb.TranscriptResult, error) {
|
||||
if v.asrModel == "" {
|
||||
return pb.TranscriptResult{}, fmt.Errorf("vibevoice-cpp: AudioTranscription requested but no ASR model was loaded")
|
||||
}
|
||||
if req.Dst == "" {
|
||||
return pb.TranscriptResult{}, fmt.Errorf("vibevoice-cpp: TranscriptRequest.dst (audio path) is required")
|
||||
}
|
||||
|
||||
out, err := v.callASR(req.Dst, 0)
|
||||
if err != nil {
|
||||
return pb.TranscriptResult{}, err
|
||||
}
|
||||
if out == "" {
|
||||
return pb.TranscriptResult{}, nil
|
||||
}
|
||||
|
||||
var segs []asrSegment
|
||||
if err := json.Unmarshal([]byte(out), &segs); err != nil {
|
||||
fmt.Fprintf(os.Stderr,
|
||||
"[vibevoice-cpp] WARNING: vv_capi_asr returned non-JSON, falling back to single segment: %v\n", err)
|
||||
return pb.TranscriptResult{
|
||||
Segments: []*pb.TranscriptSegment{{Id: 0, Text: strings.TrimSpace(out)}},
|
||||
Text: strings.TrimSpace(out),
|
||||
}, nil
|
||||
}
|
||||
|
||||
segments := make([]*pb.TranscriptSegment, 0, len(segs))
|
||||
parts := make([]string, 0, len(segs))
|
||||
var duration float32
|
||||
for i, s := range segs {
|
||||
// LocalAI's whisper backend uses int64 100ns ticks for
|
||||
// Start/End (seconds * 1e7); follow the same convention so
|
||||
// consumers can mix vibevoice and whisper transcripts.
|
||||
segments = append(segments, &pb.TranscriptSegment{
|
||||
Id: int32(i),
|
||||
Text: s.Content,
|
||||
Start: int64(s.Start * 1e7),
|
||||
End: int64(s.End * 1e7),
|
||||
Speaker: fmt.Sprintf("%d", s.Speaker),
|
||||
})
|
||||
parts = append(parts, strings.TrimSpace(s.Content))
|
||||
if float32(s.End) > duration {
|
||||
duration = float32(s.End)
|
||||
}
|
||||
}
|
||||
return pb.TranscriptResult{
|
||||
Segments: segments,
|
||||
Text: strings.TrimSpace(strings.Join(parts, " ")),
|
||||
Duration: duration,
|
||||
}, nil
|
||||
}
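Review note: the timestamp convention used above (seconds multiplied by 1e7, giving 100 ns ticks) and the duration tracking can be sketched as follows. `toTicks`, `maxEnd`, and the `segment` type are hypothetical names for illustration; the conversion truncates toward zero, exactly as the `int64(...)` conversion in the source does.

```go
package main

import "fmt"

// segment is a hypothetical stand-in for one decoded ASR segment.
type segment struct {
	Start, End float64 // seconds
}

// toTicks converts seconds to 100 ns ticks (seconds * 1e7), truncating
// toward zero like a plain int64 conversion.
func toTicks(sec float64) int64 {
	return int64(sec * 1e7)
}

// maxEnd returns the largest End across segments, used as the duration.
func maxEnd(segs []segment) float32 {
	var d float32
	for _, s := range segs {
		if float32(s.End) > d {
			d = float32(s.End)
		}
	}
	return d
}

func main() {
	segs := []segment{{0, 1.5}, {1.5, 4.25}}
	for _, s := range segs {
		fmt.Println(toTicks(s.Start), toTicks(s.End))
	}
	fmt.Println(maxEnd(segs))
}
```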
|
||||
|
||||
// AudioTranscriptionStream wraps AudioTranscription so the streaming
|
||||
// gRPC endpoint (server.go:AudioTranscriptionStream) sees its channel
|
||||
// close and the client doesn't sit waiting until deadline. vibevoice's
|
||||
// ASR doesn't expose token-level streaming - vv_capi_asr decodes the
|
||||
// whole audio and returns a JSON segment list - so we run the offline
|
||||
// transcription, emit each segment's content as a delta, then close
|
||||
// with a final_result whose Text equals the concatenated deltas (the
|
||||
// e2e harness asserts those match).
|
||||
func (v *VibevoiceCpp) AudioTranscriptionStream(req *pb.TranscriptRequest, results chan *pb.TranscriptStreamResponse) error {
|
||||
defer close(results)
|
||||
res, err := v.AudioTranscription(req)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
var assembled strings.Builder
|
||||
for _, seg := range res.Segments {
|
||||
if seg == nil {
|
||||
continue
|
||||
}
|
||||
txt := strings.TrimSpace(seg.Text)
|
||||
if txt == "" {
|
||||
continue
|
||||
}
|
||||
delta := txt
|
||||
if assembled.Len() > 0 {
|
||||
delta = " " + txt
|
||||
}
|
||||
results <- &pb.TranscriptStreamResponse{Delta: delta}
|
||||
assembled.WriteString(delta)
|
||||
}
|
||||
final := pb.TranscriptResult{
|
||||
Segments: res.Segments,
|
||||
Duration: res.Duration,
|
||||
Language: res.Language,
|
||||
Text: assembled.String(),
|
||||
}
|
||||
results <- &pb.TranscriptStreamResponse{FinalResult: &final}
|
||||
return nil
|
||||
}
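Review note: the invariant the streaming wrapper relies on is that the emitted deltas, concatenated, equal the final text. A minimal sketch of that assembly step, assuming a hypothetical helper `streamDeltas` that works on plain strings rather than protobuf segments:

```go
package main

import (
	"fmt"
	"strings"
)

// streamDeltas (hypothetical helper) trims each segment text, skips empty
// ones, and prefixes every delta after the first with a single space so
// that the concatenated deltas equal the final text.
func streamDeltas(texts []string) (deltas []string, final string) {
	var assembled strings.Builder
	for _, t := range texts {
		txt := strings.TrimSpace(t)
		if txt == "" {
			continue
		}
		delta := txt
		if assembled.Len() > 0 {
			delta = " " + txt
		}
		deltas = append(deltas, delta)
		assembled.WriteString(delta)
	}
	return deltas, assembled.String()
}

func main() {
	deltas, final := streamDeltas([]string{" hello ", "", "world"})
	fmt.Printf("deltas=%q final=%q\n", deltas, final)
}
```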
|
||||
49
backend/go/vibevoice-cpp/main.go
Normal file
@@ -0,0 +1,49 @@
|
||||
package main
|
||||
|
||||
// Started internally by LocalAI - one gRPC server per loaded model.
|
||||
import (
|
||||
"flag"
|
||||
"os"
|
||||
|
||||
"github.com/ebitengine/purego"
|
||||
grpc "github.com/mudler/LocalAI/pkg/grpc"
|
||||
)
|
||||
|
||||
var (
|
||||
addr = flag.String("addr", "localhost:50051", "the address to listen on")
|
||||
)
|
||||
|
||||
type LibFuncs struct {
|
||||
FuncPtr any
|
||||
Name string
|
||||
}
|
||||
|
||||
func main() {
|
||||
libName := os.Getenv("VIBEVOICECPP_LIBRARY")
|
||||
if libName == "" {
|
||||
libName = "./libgovibevoicecpp-fallback.so"
|
||||
}
|
||||
|
||||
lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
|
||||
libFuncs := []LibFuncs{
|
||||
{&CppLoad, "vv_capi_load"},
|
||||
{&CppTTS, "vv_capi_tts"},
|
||||
{&CppASR, "vv_capi_asr"},
|
||||
{&CppUnload, "vv_capi_unload"},
|
||||
{&CppVersion, "vv_capi_version"},
|
||||
}
|
||||
|
||||
for _, lf := range libFuncs {
|
||||
purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
|
||||
}
|
||||
|
||||
flag.Parse()
|
||||
|
||||
if err := grpc.StartServer(*addr, &VibevoiceCpp{}); err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
||||
58
backend/go/vibevoice-cpp/package.sh
Executable file
@@ -0,0 +1,58 @@
|
||||
#!/bin/bash
|
||||
|
||||
# Bundle the vibevoice-cpp binary, the per-variant .so files, and the
|
||||
# runtime libs the binary depends on so the package is self-contained.
|
||||
# Mirrors backend/go/qwen3-tts-cpp/package.sh.
|
||||
|
||||
set -e
|
||||
|
||||
CURDIR=$(dirname "$(realpath "$0")")
|
||||
REPO_ROOT="${CURDIR}/../../.."
|
||||
|
||||
mkdir -p $CURDIR/package/lib
|
||||
|
||||
cp -avf $CURDIR/vibevoice-cpp $CURDIR/package/
|
||||
cp -fv $CURDIR/libgovibevoicecpp-*.so $CURDIR/package/
|
||||
cp -fv $CURDIR/run.sh $CURDIR/package/
|
||||
|
||||
# Detect architecture and copy appropriate libraries
|
||||
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
|
||||
echo "Detected x86_64 architecture, copying x86_64 libraries..."
|
||||
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
|
||||
cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
|
||||
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
|
||||
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
|
||||
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
|
||||
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
|
||||
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
|
||||
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
|
||||
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
|
||||
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
|
||||
echo "Detected ARM64 architecture, copying ARM64 libraries..."
|
||||
cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
|
||||
cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
|
||||
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
|
||||
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
|
||||
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
|
||||
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
|
||||
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
|
||||
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
|
||||
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
|
||||
elif [ "$(uname -s)" = "Darwin" ]; then
|
||||
echo "Detected Darwin"
|
||||
else
|
||||
echo "Error: Could not detect architecture"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Package GPU libraries based on BUILD_TYPE
|
||||
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
|
||||
if [ -f "$GPU_LIB_SCRIPT" ]; then
|
||||
echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
|
||||
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
|
||||
package_gpu_libs
|
||||
fi
|
||||
|
||||
echo "Packaging completed successfully"
|
||||
ls -liah $CURDIR/package/
|
||||
ls -liah $CURDIR/package/lib/
|
||||
49
backend/go/vibevoice-cpp/run.sh
Executable file
@@ -0,0 +1,49 @@
|
||||
#!/bin/bash
|
||||
set -ex
|
||||
|
||||
CURDIR=$(dirname "$(realpath "$0")")
|
||||
|
||||
cd /
|
||||
|
||||
echo "CPU info:"
|
||||
if [ "$(uname)" != "Darwin" ]; then
|
||||
grep -e "model\sname" /proc/cpuinfo | head -1
|
||||
grep -e "flags" /proc/cpuinfo | head -1
|
||||
fi
|
||||
|
||||
LIBRARY="$CURDIR/libgovibevoicecpp-fallback.so"
|
||||
|
||||
if [ "$(uname)" != "Darwin" ]; then
|
||||
if grep -q -e "\savx\s" /proc/cpuinfo ; then
|
||||
echo "CPU: AVX found OK"
|
||||
if [ -e $CURDIR/libgovibevoicecpp-avx.so ]; then
|
||||
LIBRARY="$CURDIR/libgovibevoicecpp-avx.so"
|
||||
fi
|
||||
fi
|
||||
|
||||
if grep -q -e "\savx2\s" /proc/cpuinfo ; then
|
||||
echo "CPU: AVX2 found OK"
|
||||
if [ -e $CURDIR/libgovibevoicecpp-avx2.so ]; then
|
||||
LIBRARY="$CURDIR/libgovibevoicecpp-avx2.so"
|
||||
fi
|
||||
fi
|
||||
|
||||
if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
|
||||
echo "CPU: AVX512F found OK"
|
||||
if [ -e $CURDIR/libgovibevoicecpp-avx512.so ]; then
|
||||
LIBRARY="$CURDIR/libgovibevoicecpp-avx512.so"
|
||||
fi
|
||||
fi
|
||||
fi
|
||||
|
||||
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
|
||||
export VIBEVOICECPP_LIBRARY=$LIBRARY
|
||||
|
||||
if [ -f $CURDIR/lib/ld.so ]; then
|
||||
echo "Using lib/ld.so"
|
||||
echo "Using library: $LIBRARY"
|
||||
exec $CURDIR/lib/ld.so $CURDIR/vibevoice-cpp "$@"
|
||||
fi
|
||||
|
||||
echo "Using library: $LIBRARY"
|
||||
exec $CURDIR/vibevoice-cpp "$@"
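Review note: the variant selection in run.sh (start from the fallback library, then let each later CPU feature that is both present and packaged override the choice) can be modelled in Go for quick reasoning. This is a sketch under assumptions: `pickVariant` is a hypothetical function, it matches whole whitespace-separated flag tokens rather than reproducing the `grep` patterns exactly, and `available` stands in for the `[ -e ... ]` file checks.

```go
package main

import (
	"fmt"
	"strings"
)

// pickVariant (hypothetical) returns the library name run.sh would select,
// given the /proc/cpuinfo flags line and the set of packaged libraries.
func pickVariant(cpuFlags string, available map[string]bool) string {
	flags := map[string]bool{}
	for _, f := range strings.Fields(cpuFlags) {
		flags[f] = true
	}
	lib := "libgovibevoicecpp-fallback.so"
	// Checked in ascending order so the last match wins, as in the script.
	candidates := []struct{ flag, lib string }{
		{"avx", "libgovibevoicecpp-avx.so"},
		{"avx2", "libgovibevoicecpp-avx2.so"},
		{"avx512f", "libgovibevoicecpp-avx512.so"},
	}
	for _, c := range candidates {
		if flags[c.flag] && available[c.lib] {
			lib = c.lib
		}
	}
	return lib
}

func main() {
	avail := map[string]bool{
		"libgovibevoicecpp-avx.so":  true,
		"libgovibevoicecpp-avx2.so": true,
	}
	fmt.Println(pickVariant("fpu sse2 avx avx2 avx512f", avail))
}
```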
|
||||
74
backend/go/vibevoice-cpp/test.sh
Executable file
@@ -0,0 +1,74 @@
|
||||
#!/bin/bash
|
||||
set -e
|
||||
|
||||
CURDIR=$(dirname "$(realpath "$0")")
|
||||
|
||||
echo "Running vibevoice-cpp backend tests..."
|
||||
|
||||
# Env-vars (defaults are applied when unset):
#   VIBEVOICE_MODEL_DIR : directory containing the gguf bundle
#                         (default ./vibevoice-models, auto-downloaded).
#   VIBEVOICE_BINARY    : path to the built backend (default ./vibevoice-cpp).
#
# Model-dependent tests skip when the bundle is absent and the auto-download
# fails (e.g. no network on the runner), so local devs without HF access
# still get green compile output.
|
||||
|
||||
cd "$CURDIR"
|
||||
|
||||
if [ -z "$VIBEVOICE_MODEL_DIR" ]; then
|
||||
export VIBEVOICE_MODEL_DIR="./vibevoice-models"
|
||||
|
||||
if [ ! -d "$VIBEVOICE_MODEL_DIR" ]; then
|
||||
echo "Creating vibevoice-models directory for tests..."
|
||||
mkdir -p "$VIBEVOICE_MODEL_DIR"
|
||||
|
||||
REPO_ID="mudler/vibevoice.cpp-models"
|
||||
echo "Repository: ${REPO_ID}"
|
||||
|
||||
# Q4_K instead of Q8_0 for the ASR model: smaller download
|
||||
# (10 GB vs 14 GB), fits on ubuntu-latest's free disk after the
|
||||
# runner image is loaded. The unit/closed-loop test only needs
|
||||
# decode quality, not Q8_0 precision.
|
||||
FILES=(
|
||||
"vibevoice-realtime-0.5B-q8_0.gguf"
|
||||
"vibevoice-asr-q4_k.gguf"
|
||||
"tokenizer.gguf"
|
||||
"voice-en-Carter_man.gguf"
|
||||
)
|
||||
|
||||
BASE_URL="https://huggingface.co/${REPO_ID}/resolve/main"
|
||||
|
||||
download_ok=1
|
||||
for file in "${FILES[@]}"; do
|
||||
dest="${VIBEVOICE_MODEL_DIR}/${file}"
|
||||
if [ -f "${dest}" ]; then
|
||||
echo " [skip] ${file} (already exists)"
|
||||
else
|
||||
echo " [download] ${file}..."
|
||||
if ! curl -fL -o "${dest}" "${BASE_URL}/${file}" --progress-bar; then
|
||||
echo " [warn] failed to download ${file} - network or HF unavailable"
|
||||
rm -f "${dest}"
|
||||
download_ok=0
|
||||
break
|
||||
fi
|
||||
echo " [done] ${file}"
|
||||
fi
|
||||
done
|
||||
|
||||
if [ "$download_ok" != "1" ]; then
|
||||
echo "vibevoice-cpp: model bundle unavailable - tests will skip model-dependent cases."
|
||||
unset VIBEVOICE_MODEL_DIR
|
||||
fi
|
||||
fi
|
||||
fi
|
||||
|
||||
# Ensure the per-variant .so the binary will dlopen actually exists -
|
||||
# without one, every test will hit a Dlopen panic during server start.
|
||||
if [ ! -f "${CURDIR}/libgovibevoicecpp-fallback.so" ]; then
|
||||
echo "vibevoice-cpp: libgovibevoicecpp-fallback.so missing - run \`make\` first."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
go test -v -timeout 900s .
|
||||
|
||||
echo "All vibevoice-cpp tests passed."
|
||||
382
backend/go/vibevoice-cpp/vibevoicecpp_test.go
Normal file
@@ -0,0 +1,382 @@
|
||||
package main

import (
	"context"
	"os"
	"os/exec"
	"path/filepath"
	"regexp"
	"strings"
	"testing"
	"time"

	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

const (
	testAddr    = "localhost:50098"
	startupWait = 5 * time.Second
)

func TestVibevoiceCpp(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "VibeVoice-cpp Backend Suite")
}

// modelDirOrSkip returns the staged model bundle dir, or Skip()s the
// current spec when VIBEVOICE_MODEL_DIR is unset / lacks the gguf
// files we need. Tests that don't depend on a model (Locking, error
// paths) don't call this.
func modelDirOrSkip() string {
	dir := os.Getenv("VIBEVOICE_MODEL_DIR")
	if dir == "" {
		Skip("VIBEVOICE_MODEL_DIR not set, skipping model-dependent specs")
	}
	if _, err := os.Stat(filepath.Join(dir, "tokenizer.gguf")); os.IsNotExist(err) {
		Skip("tokenizer.gguf missing in " + dir)
	}
	tts, _ := filepath.Glob(filepath.Join(dir, "vibevoice-realtime-*.gguf"))
	asr, _ := filepath.Glob(filepath.Join(dir, "vibevoice-asr-*.gguf"))
	if len(tts) == 0 && len(asr) == 0 {
		Skip("neither realtime TTS nor ASR gguf found in " + dir)
	}
	return dir
}

// startServer launches the prebuilt backend binary and returns a
// running *exec.Cmd. test.sh ensures `./vibevoice-cpp` is built; if
// it isn't, every gRPC spec is skipped with a clear reason.
func startServer() *exec.Cmd {
	binary := os.Getenv("VIBEVOICE_BINARY")
	if binary == "" {
		binary = "./vibevoice-cpp"
	}
	if _, err := os.Stat(binary); os.IsNotExist(err) {
		Skip("backend binary not found at " + binary)
	}
	cmd := exec.Command(binary, "--addr", testAddr)
	cmd.Stdout = os.Stderr
	cmd.Stderr = os.Stderr
	Expect(cmd.Start()).To(Succeed())
	time.Sleep(startupWait)
	return cmd
}

func stopServer(cmd *exec.Cmd) {
	if cmd == nil || cmd.Process == nil {
		return
	}
	_ = cmd.Process.Kill()
	_, _ = cmd.Process.Wait()
}

func dialGRPC() *grpc.ClientConn {
	conn, err := grpc.Dial(testAddr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(50*1024*1024),
			grpc.MaxCallSendMsgSize(50*1024*1024),
		),
	)
	Expect(err).ToNot(HaveOccurred())
	return conn
}

var _ = Describe("VibeVoice-cpp", func() {
	Context("backend semantics (no purego load needed)", func() {
		It("is locking - the engine has process-global state", func() {
			Expect((&VibevoiceCpp{}).Locking()).To(BeTrue())
		})

		It("rejects Load with empty ModelFile", func() {
			err := (&VibevoiceCpp{}).Load(&pb.ModelOptions{})
			Expect(err).To(HaveOccurred())
			Expect(err.Error()).To(ContainSubstring("ModelFile"))
		})

		It("rejects TTS without a loaded TTS model", func() {
			err := (&VibevoiceCpp{}).TTS(&pb.TTSRequest{
				Text: "no model loaded",
				Dst:  "/tmp/should-not-be-written.wav",
			})
			Expect(err).To(HaveOccurred())
		})

		It("rejects AudioTranscription without a loaded ASR model", func() {
			_, err := (&VibevoiceCpp{}).AudioTranscription(&pb.TranscriptRequest{
				Dst: "/tmp/some.wav",
			})
			Expect(err).To(HaveOccurred())
		})

		It("closes the channel and errors on TTSStream without a loaded model", func() {
			ch := make(chan []byte, 4)
			err := (&VibevoiceCpp{}).TTSStream(&pb.TTSRequest{
				Text: "no model loaded",
				Dst:  "/tmp/should-not-be-written.wav",
			}, ch)
			Expect(err).To(HaveOccurred())
			// Server hangs forever if the channel stays open; this guard
			// protects against regressing the e2e DeadlineExceeded fix.
			_, ok := <-ch
			Expect(ok).To(BeFalse(), "TTSStream must close results channel even on error")
		})

		// parseOptions + slot fill is the source of the closed-loop CI
		// regression where ModelFile=tts.gguf + Options[asr_model=...]
		// resulted in a load with empty tts slot. These specs assert
		// the slot resolution before we ever call into purego.
		Describe("ModelFile slot resolution", func() {
			It("fills tts slot from ModelFile when only asr_model is in Options", func() {
				v := &VibevoiceCpp{}
				v.modelRoot = "/abs/root"
				role := v.parseOptions([]string{"asr_model=/abs/root/asr.gguf", "tokenizer=/abs/root/tokenizer.gguf"}, v.modelRoot)
				Expect(v.asrModel).To(Equal("/abs/root/asr.gguf"))
				Expect(v.ttsModel).To(BeEmpty())
				Expect(role).To(BeEmpty())
				// Mirror the Load() default-fill block:
				if v.ttsModel == "" {
					v.ttsModel = "/abs/root/tts.gguf"
				}
				Expect(v.ttsModel).To(Equal("/abs/root/tts.gguf"))
				Expect(v.asrModel).To(Equal("/abs/root/asr.gguf"))
			})

			It("fills asr slot from ModelFile when type=asr is set", func() {
				v := &VibevoiceCpp{}
				v.modelRoot = "/abs/root"
				role := v.parseOptions([]string{"type=asr", "tokenizer=/abs/root/tokenizer.gguf"}, v.modelRoot)
				Expect(role).To(Equal("asr"))
				// parseOptions only reports the role; both slots stay empty
				// here and Load() fills the asr slot from ModelFile afterwards.
				Expect(v.asrModel).To(BeEmpty())
				Expect(v.ttsModel).To(BeEmpty())
			})

			It("respects explicit tts_model override over ModelFile", func() {
				v := &VibevoiceCpp{}
				v.modelRoot = "/abs/root"
				_ = v.parseOptions([]string{"tts_model=/abs/root/alt.gguf"}, v.modelRoot)
				Expect(v.ttsModel).To(Equal("/abs/root/alt.gguf"))
			})

			It("accepts colon-separated options too", func() {
				v := &VibevoiceCpp{}
				v.modelRoot = "/abs/root"
				role := v.parseOptions([]string{"type:asr", "tokenizer:/abs/root/tokenizer.gguf"}, v.modelRoot)
				Expect(role).To(Equal("asr"))
				Expect(v.tokenizer).To(Equal("/abs/root/tokenizer.gguf"))
			})
		})

		// The gallery flow puts everything under <models_dir>/<entry>/,
		// and parameters/options carry paths *relative* to <models_dir>.
		// LocalAI core fills opts.ModelPath = <models_dir>; the backend
		// must resolve every relative path against that root, never CWD.
		Describe("resolvePath (relative-to-modelRoot)", func() {
			It("joins relative path onto relTo", func() {
				Expect(resolvePath("vibevoice-cpp/tokenizer.gguf", "/data/models")).
					To(Equal("/data/models/vibevoice-cpp/tokenizer.gguf"))
			})

			It("passes absolute paths through unchanged", func() {
				Expect(resolvePath("/abs/somewhere/tokenizer.gguf", "/data/models")).
					To(Equal("/abs/somewhere/tokenizer.gguf"))
			})

			It("returns input unchanged when relTo is empty", func() {
				Expect(resolvePath("vibevoice-cpp/tokenizer.gguf", "")).
					To(Equal("vibevoice-cpp/tokenizer.gguf"))
			})

			It("returns empty input unchanged", func() {
				Expect(resolvePath("", "/data/models")).To(BeEmpty())
			})

			It("does not consult CWD - bare filenames stay relative to modelRoot", func() {
				// Even if the test runs in a directory containing a
				// file with this name, the lookup must not fall back
				// to CWD. This is the trap the production gallery flow
				// would otherwise hit when LocalAI is launched from a
				// directory that happens to contain a same-named file.
				prev, _ := os.Getwd()
				DeferCleanup(func() { _ = os.Chdir(prev) })
				tmpCWD, err := os.MkdirTemp("", "vv-cwd-*")
				Expect(err).ToNot(HaveOccurred())
				DeferCleanup(func() { _ = os.RemoveAll(tmpCWD) })
				Expect(os.WriteFile(filepath.Join(tmpCWD, "tokenizer.gguf"),
					[]byte("not the real one"), 0o644)).To(Succeed())
				Expect(os.Chdir(tmpCWD)).To(Succeed())

				got := resolvePath("tokenizer.gguf", "/data/models")
				Expect(got).To(Equal("/data/models/tokenizer.gguf"))
			})
		})

		// Round-trip the gallery layout: relative paths in Options +
		// an absolute ModelFile (as LocalAI core delivers them) end
		// up resolved correctly inside the backend struct.
		It("Load resolves relative Options paths against opts.ModelPath", func() {
			tmpDir, err := os.MkdirTemp("", "vv-relpath-*")
			Expect(err).ToNot(HaveOccurred())
			DeferCleanup(func() { _ = os.RemoveAll(tmpDir) })

			// Lay out the bundle exactly as the gallery would after install:
			// <modelpath>/vibevoice-cpp/{tts,tokenizer,voice}.gguf
			subDir := filepath.Join(tmpDir, "vibevoice-cpp")
			Expect(os.MkdirAll(subDir, 0o755)).To(Succeed())
			tts := filepath.Join(subDir, "vibevoice-realtime-stub.gguf")
			tok := filepath.Join(subDir, "tokenizer.gguf")
			voice := filepath.Join(subDir, "voice.gguf")
			for _, p := range []string{tts, tok, voice} {
				Expect(os.WriteFile(p, []byte("stub"), 0o644)).To(Succeed())
			}

			// Mirror Load()'s pre-purego prefix: parse + slot fill.
			v := &VibevoiceCpp{}
			modelFile := tts // core delivers this as an abspath already
			v.modelRoot = tmpDir
			role := v.parseOptions([]string{
				"tokenizer=vibevoice-cpp/tokenizer.gguf",
				"voice=vibevoice-cpp/voice.gguf",
			}, v.modelRoot)
			Expect(role).To(BeEmpty())
			if v.ttsModel == "" {
				v.ttsModel = modelFile
			}

			Expect(v.ttsModel).To(Equal(tts))
			Expect(v.tokenizer).To(Equal(tok))
			Expect(v.voice).To(Equal(voice))
			Expect(v.asrModel).To(BeEmpty())
		})

		It("closes the channel and errors on AudioTranscriptionStream without a loaded model", func() {
			ch := make(chan *pb.TranscriptStreamResponse, 4)
			err := (&VibevoiceCpp{}).AudioTranscriptionStream(&pb.TranscriptRequest{
				Dst: "/tmp/some.wav",
			}, ch)
			Expect(err).To(HaveOccurred())
			_, ok := <-ch
			Expect(ok).To(BeFalse(), "AudioTranscriptionStream must close results channel even on error")
		})
	})

	Context("gRPC server lifecycle", func() {
		var cmd *exec.Cmd

		AfterEach(func() {
			stopServer(cmd)
			cmd = nil
		})

		It("answers Health checks", func() {
			cmd = startServer()
			conn := dialGRPC()
			defer func() { _ = conn.Close() }()

			resp, err := pb.NewBackendClient(conn).Health(context.Background(), &pb.HealthMessage{})
			Expect(err).ToNot(HaveOccurred())
			Expect(string(resp.Message)).To(Equal("OK"))
		})

		It("loads the realtime TTS model", func() {
			dir := modelDirOrSkip()
			tts, _ := filepath.Glob(filepath.Join(dir, "vibevoice-realtime-*.gguf"))
			if len(tts) == 0 {
				Skip("realtime TTS gguf missing")
			}

			cmd = startServer()
			conn := dialGRPC()
			defer func() { _ = conn.Close() }()

			// Mirror the gallery contract: ModelFile is whatever LocalAI
			// core hands us; ModelPath is the models root; Options[]
			// carry paths relative to ModelPath.
			resp, err := pb.NewBackendClient(conn).LoadModel(context.Background(), &pb.ModelOptions{
				ModelFile: filepath.Base(tts[0]),
				ModelPath: dir,
				Threads:   4,
				Options:   []string{"tokenizer=tokenizer.gguf"},
			})
			Expect(err).ToNot(HaveOccurred())
			Expect(resp.Success).To(BeTrue(), "LoadModel msg=%q", resp.Message)
		})

		It("runs a closed-loop TTS -> ASR with >=80% word recall", func() {
			dir := modelDirOrSkip()
			tts, _ := filepath.Glob(filepath.Join(dir, "vibevoice-realtime-*.gguf"))
			asr, _ := filepath.Glob(filepath.Join(dir, "vibevoice-asr-*.gguf"))
			if len(tts) == 0 || len(asr) == 0 {
				Skip("closed-loop needs both realtime TTS and ASR ggufs")
			}

			tmpDir, err := os.MkdirTemp("", "vibevoice-cpp-closedloop-*")
			Expect(err).ToNot(HaveOccurred())
			DeferCleanup(func() { _ = os.RemoveAll(tmpDir) })
			wav := filepath.Join(tmpDir, "say.wav")

			cmd = startServer()
			conn := dialGRPC()
			defer func() { _ = conn.Close() }()
			client := pb.NewBackendClient(conn)

			// Gallery convention: ModelPath is the models root, every
			// path inside Options[] is relative to it.
			voiceMatches, _ := filepath.Glob(filepath.Join(dir, "voice-*.gguf"))
			loadOpts := &pb.ModelOptions{
				ModelFile: filepath.Base(tts[0]),
				ModelPath: dir,
				Threads:   4,
				Options: []string{
					"asr_model=" + filepath.Base(asr[0]),
					"tokenizer=tokenizer.gguf",
				},
			}
			if len(voiceMatches) > 0 {
				loadOpts.Options = append(loadOpts.Options, "voice="+filepath.Base(voiceMatches[0]))
			}
			loadResp, err := client.LoadModel(context.Background(), loadOpts)
			Expect(err).ToNot(HaveOccurred())
			Expect(loadResp.Success).To(BeTrue(), "LoadModel msg=%q", loadResp.Message)

			srcText := "Hello world this is a test of the synthesis system."
			_, err = client.TTS(context.Background(), &pb.TTSRequest{
				Text: srcText,
				Dst:  wav,
			})
			Expect(err).ToNot(HaveOccurred())

			info, err := os.Stat(wav)
			Expect(err).ToNot(HaveOccurred())
			Expect(info.Size()).To(BeNumerically(">=", 1000),
				"TTS produced suspiciously small wav (%d bytes)", info.Size())

			resp, err := client.AudioTranscription(context.Background(), &pb.TranscriptRequest{
				Dst: wav,
			})
			Expect(err).ToNot(HaveOccurred())
			got := strings.ToLower(resp.Text)
			GinkgoWriter.Printf("source : %s\n", srcText)
			GinkgoWriter.Printf("transcribed: %s\n", got)

			wordRE := regexp.MustCompile(`[a-z]+`)
			srcWords := wordRE.FindAllString(strings.ToLower(srcText), -1)
			Expect(srcWords).ToNot(BeEmpty())
			hits := 0
			for _, w := range srcWords {
				if strings.Contains(got, w) {
					hits++
				}
			}
			recall := float64(hits) / float64(len(srcWords))
			GinkgoWriter.Printf("recall: %d/%d = %.2f%%\n", hits, len(srcWords), recall*100)
			Expect(recall).To(BeNumerically(">=", 0.80),
				"closed-loop recall too low: %d/%d = %.2f%%",
				hits, len(srcWords), recall*100)
		})
	})
})
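The resolvePath specs in the test file above pin down a small contract: relative paths are joined onto the models root, absolute paths and empty inputs pass through, an empty root leaves the input alone, and the filesystem is never consulted. The backend's real resolvePath is not shown in this diff, so the following is only a minimal sketch of a function that would satisfy those specs on a Unix-style filesystem. The name `resolvePathSketch` is hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolvePathSketch is a hypothetical stand-in for the backend's
// resolvePath, written only from the behaviour the specs describe.
// It never touches the filesystem, so the current working directory
// cannot influence the result.
func resolvePathSketch(p, relTo string) string {
	// Empty input, empty root, and absolute paths pass through unchanged.
	if p == "" || relTo == "" || filepath.IsAbs(p) {
		return p
	}
	// Everything else is treated as relative to the models root.
	return filepath.Join(relTo, p)
}

func main() {
	for _, in := range []string{
		"vibevoice-cpp/tokenizer.gguf",  // relative: joined onto the root
		"/abs/somewhere/tokenizer.gguf", // absolute: passed through
		"",                              // empty: stays empty
	} {
		fmt.Printf("%q -> %q\n", in, resolvePathSketch(in, "/data/models"))
	}
}
```

Keeping the helper purely lexical (no `os.Stat`) is what makes the "does not consult CWD" spec hold regardless of where the process is launched.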
@@ -572,6 +572,34 @@
    nvidia-l4t: "nvidia-l4t-arm64-qwen3-tts-cpp"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-qwen3-tts-cpp"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-qwen3-tts-cpp"
- &vibevoicecpp
  name: "vibevoice-cpp"
  description: |
    vibevoice.cpp C++ backend using GGML. Native C++ port of Microsoft VibeVoice for both
    text-to-speech (with voice cloning via voice prompt GGUFs) and long-form ASR with
    speaker diarization. Outputs 24kHz mono WAV; ASR returns per-speaker JSON segments.
  urls:
    - https://github.com/mudler/vibevoice.cpp
  tags:
    - text-to-speech
    - tts
    - speech-to-text
    - asr
    - voice-cloning
    - diarization
  alias: "vibevoice-cpp"
  capabilities:
    default: "cpu-vibevoice-cpp"
    nvidia: "cuda12-vibevoice-cpp"
    nvidia-cuda-13: "cuda13-vibevoice-cpp"
    nvidia-cuda-12: "cuda12-vibevoice-cpp"
    intel: "intel-sycl-f16-vibevoice-cpp"
    metal: "metal-vibevoice-cpp"
    amd: "rocm-vibevoice-cpp"
    vulkan: "vulkan-vibevoice-cpp"
    nvidia-l4t: "nvidia-l4t-arm64-vibevoice-cpp"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-vibevoice-cpp"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vibevoice-cpp"
- &faster-whisper
  icon: https://avatars.githubusercontent.com/u/1520500?s=200&v=4
  description: |
@@ -2656,6 +2684,107 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-qwen3-tts-cpp"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-13-qwen3-tts-cpp
## vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "nvidia-l4t-arm64-vibevoice-cpp"
  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:latest-nvidia-l4t-arm64-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "nvidia-l4t-arm64-vibevoice-cpp-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:master-nvidia-l4t-arm64-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "cuda13-nvidia-l4t-arm64-vibevoice-cpp"
  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "cuda13-nvidia-l4t-arm64-vibevoice-cpp-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "cpu-vibevoice-cpp"
  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:latest-cpu-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "metal-vibevoice-cpp"
  uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:latest-metal-darwin-arm64-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "metal-vibevoice-cpp-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:master-metal-darwin-arm64-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "cpu-vibevoice-cpp-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:master-cpu-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "cuda12-vibevoice-cpp"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-12-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "rocm-vibevoice-cpp"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:latest-gpu-rocm-hipblas-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "intel-sycl-f32-vibevoice-cpp"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:latest-gpu-intel-sycl-f32-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "intel-sycl-f16-vibevoice-cpp"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:latest-gpu-intel-sycl-f16-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "vulkan-vibevoice-cpp"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:latest-gpu-vulkan-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "vulkan-vibevoice-cpp-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:master-gpu-vulkan-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "cuda12-vibevoice-cpp-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "rocm-vibevoice-cpp-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:master-gpu-rocm-hipblas-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "intel-sycl-f32-vibevoice-cpp-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:master-gpu-intel-sycl-f32-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "intel-sycl-f16-vibevoice-cpp-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:master-gpu-intel-sycl-f16-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "cuda13-vibevoice-cpp"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-13-vibevoice-cpp
- !!merge <<: *vibevoicecpp
  name: "cuda13-vibevoice-cpp-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-vibevoice-cpp"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-13-vibevoice-cpp
## kokoro
- !!merge <<: *kokoro
  name: "kokoro-development"
@@ -1681,6 +1681,83 @@
    - filename: acestep-cpp/vae-BF16.gguf
      uri: huggingface://Serveurperso/ACE-Step-1.5-GGUF/vae-BF16.gguf
      sha256: 0599862ac5d15cd308e1d2e368373aea6c02e25ebd1737ad4a4562a0901b0ef8
- name: "vibevoice-cpp"
  license: mit
  icon: https://github.com/microsoft/VibeVoice/raw/main/Figures/VibeVoice_logo_white.png
  tags:
    - tts
    - text-to-speech
    - voice-cloning
    - vibevoice
    - vibevoice-cpp
    - gguf
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
  urls:
    - https://huggingface.co/mudler/vibevoice.cpp-models
    - https://github.com/mudler/vibevoice.cpp
    - https://github.com/microsoft/VibeVoice
  description: |
    VibeVoice Realtime 0.5B (C++ / GGML, Q8_0) - native C++ port of Microsoft VibeVoice
    via the vibevoice-cpp backend. 24kHz mono TTS with voice cloning from a single
    reference voice prompt. Default voice prompt: en-Carter_man.
  overrides:
    name: vibevoice-cpp
    backend: vibevoice-cpp
    parameters:
      model: vibevoice-cpp/vibevoice-realtime-0.5B-q8_0.gguf
    options:
      - tokenizer=vibevoice-cpp/tokenizer.gguf
      - voice=vibevoice-cpp/voice-en-Carter_man.gguf
    known_usecases:
      - tts
  files:
    - filename: vibevoice-cpp/vibevoice-realtime-0.5B-q8_0.gguf
      uri: huggingface://mudler/vibevoice.cpp-models/vibevoice-realtime-0.5B-q8_0.gguf
      sha256: 5251e3f0386d1056a90c61b6c7359a4775da44dd19402499bef1989c4b5c653a
    - filename: vibevoice-cpp/tokenizer.gguf
      uri: huggingface://mudler/vibevoice.cpp-models/tokenizer.gguf
      sha256: 37dc3b722d5677e37e29a57df55aa05c485116eeb5459e57ff8dde616b4986f6
    - filename: vibevoice-cpp/voice-en-Carter_man.gguf
      uri: huggingface://mudler/vibevoice.cpp-models/voice-en-Carter_man.gguf
      sha256: b15cd8b9cae6ee2c3d20b0ee6e7bfe93f13489f8b63b6834e9bbf0dfabf6505a
- name: "vibevoice-cpp-asr"
  license: mit
  icon: https://github.com/microsoft/VibeVoice/raw/main/Figures/VibeVoice_logo_white.png
  tags:
    - stt
    - speech-to-text
    - asr
    - audio-transcription
    - diarization
    - vibevoice
    - vibevoice-cpp
    - gguf
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
  urls:
    - https://huggingface.co/mudler/vibevoice.cpp-models
    - https://github.com/mudler/vibevoice.cpp
    - https://github.com/microsoft/VibeVoice
  description: |
    VibeVoice ASR 7B (C++ / GGML, Q4_K) - long-form speech-to-text with speaker
    diarization. Returns per-speaker JSON segments with start/end timestamps.
    English-only. ~10 GB download.
  overrides:
    name: vibevoice-cpp-asr
    backend: vibevoice-cpp
    parameters:
      model: vibevoice-cpp-asr/vibevoice-asr-q4_k.gguf
    options:
      - type=asr
      - tokenizer=vibevoice-cpp-asr/tokenizer.gguf
    known_usecases:
      - transcript
  files:
    - filename: vibevoice-cpp-asr/vibevoice-asr-q4_k.gguf
      uri: huggingface://mudler/vibevoice.cpp-models/vibevoice-asr-q4_k.gguf
      sha256: 4eee48b9d0d42f71b773b804aa6728c99971c38d54f3c86cf1fd0fc1fc49a9ad
    - filename: vibevoice-cpp-asr/tokenizer.gguf
      uri: huggingface://mudler/vibevoice.cpp-models/tokenizer.gguf
      sha256: 37dc3b722d5677e37e29a57df55aa05c485116eeb5459e57ff8dde616b4986f6
- name: "qwen3-tts-cpp"
  license: apache-2.0
  tags:
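The gallery entries above pass supplementary paths as `key=value` strings in `options`, and the test suite also expects the `key:value` spelling to be accepted. The backend's actual parseOptions is not part of this excerpt, so the following is only a rough sketch of how one such entry could be split, under the assumption that the first `=` or `:` is the separator. The name `splitOptionSketch` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// splitOptionSketch is a hypothetical helper: it splits one Options[]
// entry at the first '=' or ':' and reports whether a non-empty key was
// found. Splitting at the first separator keeps any later ':' or '='
// inside the value intact.
func splitOptionSketch(opt string) (key, value string, ok bool) {
	i := strings.IndexAny(opt, "=:")
	if i <= 0 {
		// No separator at all, or an empty key.
		return "", "", false
	}
	return strings.TrimSpace(opt[:i]), strings.TrimSpace(opt[i+1:]), true
}

func main() {
	for _, opt := range []string{
		"type=asr",
		"tokenizer=vibevoice-cpp-asr/tokenizer.gguf",
		"tokenizer:/abs/root/tokenizer.gguf",
		"malformed",
	} {
		k, v, ok := splitOptionSketch(opt)
		fmt.Printf("%-45q key=%q value=%q ok=%v\n", opt, k, v, ok)
	}
}
```

In a real loader the value of a path-like key would then be resolved against the models root rather than used as-is.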