mirror of https://github.com/mudler/LocalAI.git, synced 2026-04-30 12:08:13 -04:00
* feat(vibevoice-cpp): add purego TTS+ASR backend
Wire up Microsoft VibeVoice via the vibevoice.cpp C ABI as a new
purego-based Go backend that serves both Backend.TTS and
Backend.AudioTranscription from a single gRPC binary. Mirrors the
qwen3-tts-cpp / sherpa-onnx pattern so the variant matrix
(cpu/cuda12/cuda13/metal/rocm/sycl-f16/f32/vulkan/l4t) and the
e2e-backends gRPC harness reuse existing infrastructure.
- backend/go/vibevoice-cpp/ - Makefile, CMakeLists, purego shim, gRPC
Backend with model-dir auto-detection, closed-loop TTS->ASR smoke test
- backend/index.yaml - &vibevoicecpp meta + 18 image entries
- Makefile - .NOTPARALLEL, BACKEND_VIBEVOICE_CPP, docker-build wiring,
test-extra-backend-vibevoice-cpp-{tts,transcription} e2e wrappers
- .github/workflows/backend.yml - matrix entries for all variants
- .github/workflows/test-extra.yml - per-backend smoke + 2 gRPC e2e jobs
* feat(vibevoice-cpp): drop hardcoded glob detection, add gallery entries
Refactor backend Load() to follow the standard Options[] convention
used by sherpa-onnx and the rest of the multi-role backends:
ModelFile is the primary gguf, supplementary paths come through
opts.Options[] as key=value (or key:value for Make-target compat),
resolved against opts.ModelPath. type=asr/tts decides the role of
ModelFile when neither tts_model nor asr_model is set explicitly.
Add gallery/index.yaml entries:
- vibevoice-cpp - realtime 0.5B Q8_0 TTS + tokenizer + Carter voice
- vibevoice-cpp-asr - long-form ASR Q8_0 + tokenizer
Both pull from huggingface://mudler/vibevoice.cpp-models with sha256
verification. parameters.model + Options[] paths are siblings under
{models_dir} per the qwen3-tts-cpp convention.
Update Makefile e2e wrappers to pass BACKEND_TEST_OPTIONS comma+colon
style, and tighten the per-backend Go closed-loop test to use the
explicit Options API.
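The key=value / key:value convention and the type= role default can be sketched roughly like this (a simplified stand-in for the backend's actual parseOptions, not the committed code; function names are illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// splitOption accepts both "key=value" and "key:value" (the colon form
// keeps Make targets simple, where passing a literal '=' is awkward).
func splitOption(opt string) (key, value string, ok bool) {
	if k, v, found := strings.Cut(opt, "="); found {
		return k, v, true
	}
	if k, v, found := strings.Cut(opt, ":"); found {
		return k, v, true
	}
	return "", "", false
}

// primaryRole returns which slot ModelFile should fill when neither
// tts_model nor asr_model is set explicitly: "type=asr" flips it,
// otherwise the default is "tts".
func primaryRole(opts []string) string {
	for _, o := range opts {
		if k, v, ok := splitOption(o); ok && k == "type" {
			return v
		}
	}
	return "tts"
}

func main() {
	fmt.Println(primaryRole([]string{"tokenizer=tok.gguf"})) // tts
	fmt.Println(primaryRole([]string{"type:asr"}))           // asr
}
```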
* fix(vibevoice-cpp): force whole-archive link so vv_capi_* exports survive
libvibevoice is a STATIC archive linked into the MODULE library.
Without --whole-archive (or -force_load on Apple, /WHOLEARCHIVE on
MSVC), the linker garbage-collects symbols not referenced from this
translation unit - which means dlopen+RegisterLibFunc panics with
'undefined symbol: vv_capi_load' at backend startup, since purego
looks them up by name and our cpp/govibevoicecpp.cpp doesn't call
them directly.
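A CMake fragment of this shape does the forcing per platform (target names like vibevoice_shim are illustrative here, not the actual CMakeLists entries):

```cmake
# libvibevoice is STATIC and the MODULE's own TU never calls vv_capi_*,
# so without whole-archive the linker drops those symbols and the later
# dlopen + RegisterLibFunc lookup-by-name fails at backend startup.
if(APPLE)
  target_link_options(vibevoice_shim PRIVATE
    "SHELL:-Wl,-force_load,$<TARGET_FILE:vibevoice>")
elseif(MSVC)
  target_link_options(vibevoice_shim PRIVATE
    "/WHOLEARCHIVE:$<TARGET_FILE:vibevoice>")
else()
  target_link_libraries(vibevoice_shim PRIVATE
    -Wl,--whole-archive vibevoice -Wl,--no-whole-archive)
endif()
```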
* test(vibevoice-cpp): rewrite suite with Ginkgo v2
Match the convention used by backend/go/sherpa-onnx/backend_test.go.
The suite now covers backend semantics that don't need purego (Locking,
empty-ModelFile rejection, TTS/ASR-without-loaded-model errors) on top
of the gRPC lifecycle specs (Health, Load, closed-loop TTS->ASR).
Model-dependent specs Skip() when VIBEVOICE_MODEL_DIR is unset, so
`go test ./backend/go/vibevoice-cpp/` is green on a clean checkout
and runs the heavyweight closed-loop spec when test.sh has staged
the bundle.
* fix(vibevoice-cpp): implement TTSStream + AudioTranscriptionStream
The gRPC server's stream handlers (pkg/grpc/server.go) spawn a
goroutine that ranges over a chan; the only thing closing that chan
is the backend's own *Stream method. With the default Base stub
returning 'unimplemented' and never touching the chan, the server
goroutine hangs forever and the client hits DeadlineExceeded - which
is exactly what the e2e harness saw in the test-extra-backend-vibevoice-cpp-tts
matrix run.
TTSStream synthesizes via vv_capi_tts to a tempfile, then emits a
streaming WAV header (chunk sizes 0xFFFFFFFF so HTTP clients can
start playback before the full PCM lands) followed by the PCM body
in 64 KB slices. The header + >=2 PCM frames satisfy the harness's
'expected >=2 chunks' assertion and give a real progressive stream.
AudioTranscriptionStream runs the offline transcription, emits each
segment as a delta, and closes with a final_result whose Text equals
the concatenated deltas (the harness asserts those match).
Two new Ginkgo specs guard the close-channel-on-error path so the
deadline-exceeded regression can't come back silently.
* fix(vibevoice-cpp): silence errcheck on cleanup paths
Lint flagged six unchecked Close()/Remove()/RemoveAll() calls along
purely-cleanup deferred paths. Wrap each in '_ = ...' (or a closure
for defers that take args) - matches what the rest of the LocalAI
backend/go/* tree already does for these callsites.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(vibevoice-cpp): closed-loop slot fill + modelRoot-relative path resolution
Two bugs the test-extra-backend-vibevoice-cpp-* CI matrix surfaced:
1. Closed-loop Load with ModelFile=tts.gguf + Options[asr_model=...] left
v.ttsModel empty, because the default-fill block only ran when BOTH
slots were empty. vv_capi_load then got tts="" + a voice and the
C side rejected it with rc=-3 'TTS model required to load a voice'.
Fix: ModelFile fills the *primary* role-slot (decided by 'type=' in
Options, defaulting to tts) independently of the secondary, so
ModelFile + asr_model resolves to both.
2. resolvePath stat'd CWD before falling back to relTo. With LocalAI
launched from a directory that happens to contain a same-named
file, supplementary Options[] paths could leak away from the
models dir. Drop the CWD probe entirely - relative paths now
*always* join onto opts.ModelPath (the gallery convention).
New Ginkgo coverage:
* 'ModelFile slot resolution' (4 specs) - asr_model+ModelFile, type=asr,
explicit tts_model override, key:value variant.
* 'resolvePath (relative-to-modelRoot)' (5 specs) - join, abs passthrough,
empty input, empty relTo, and the CWD-trap regression test.
* 'Load resolves relative Options paths against opts.ModelPath' - end-
to-end gallery layout round-trip.
Verified locally: 19/19 specs pass (with model bundle, including the
closed-loop TTS->ASR; without bundle, 17 pass + 2 model-dependent skip).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(vibevoice-cpp): use gallery convention in closed-loop spec
The 'loads the realtime TTS model' / closed-loop specs were passing
already-prefixed paths into Options[]:
Options: ['tokenizer=' + filepath.Join(modelDir, 'tokenizer.gguf')]
Combined with no ModelPath set on the request, the backend's
modelRoot fell back to filepath.Dir(ModelFile) = modelDir, then
resolvePath joined the prefixed Options path on top of it -
producing 'vibevoice-models/vibevoice-models/tokenizer.gguf' when
the CI's VIBEVOICE_MODEL_DIR is the relative './vibevoice-models'.
The fix is to mirror the gallery contract LocalAI core actually
sends in production: ModelPath is the models root (absolute),
ModelFile is a name *under* it, every Options[] path is relative
to ModelPath. Uses filepath.Base() to get bare filenames.
Verified locally with both VIBEVOICE_MODEL_DIR=/tmp/vv-bundle (abs)
and VIBEVOICE_MODEL_DIR=vibevoice-models (the relative shape that
broke CI). Both: 19/19 specs pass, ~55-60s.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(vibevoice-cpp): switch ASR to Q4_K + bump transcription timeout
The Q8_0 ASR gguf is ~14 GB - too big to fit alongside the runner
image, the docker build cache, and the test artifacts on a free
ubuntu-latest GHA runner; 'test-extra-backend-vibevoice-cpp-transcription'
was getting SIGTERM'd at 90 min before the model could finish loading.
Switch to Q4_K (~10 GB on disk, slightly faster CPU decode) for:
* the e2e harness Make target
* the gallery 'vibevoice-cpp-asr' entry (parameters + files block)
* the per-backend test.sh auto-download list
Bump tests-vibevoice-cpp-grpc-transcription's timeout-minutes from
90 to 150 - even with Q4_K, the 30 s JFK clip on a CPU runner needs
runway above the previous 90 min cap.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(vibevoice-cpp): drop transcription gRPC e2e job - too heavy for free runners
The vibevoice ASR is a 7B-parameter model. Even on Q4_K (~10 GB on
disk) a single 30 s transcription saturates the per-test 30 min
timeout in the e2e-backends harness on a 4-core ubuntu-latest, and
the 10 GB download + Docker layer + working space leaves no headroom
on the runner's free disk. Two attempts in CI got SIGTERM'd at the
LoadModel boundary - the bottleneck isn't tunable from the workflow
side without a paid-tier runner.
The per-backend tests-vibevoice-cpp job already runs the same
AudioTranscription path via a closed-loop TTS->ASR Ginkgo spec - same
gRPC contract, same model, single process - so the standalone
tests-vibevoice-cpp-grpc-transcription job was redundant on top of
the disk/CPU pressure.
The Makefile target test-extra-backend-vibevoice-cpp-transcription
stays for local invocation on workstations that can afford it -
useful when developing the streaming codepaths.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(vibevoice-cpp): restore transcription gRPC e2e on bigger-runner
Switch tests-vibevoice-cpp-grpc-transcription from ubuntu-latest to
the self-hosted 'bigger-runner' label that the GPU image builds in
backend.yml use, plus the Free-disk-space prep step (purging the
dotnet / ghc / android / CodeQL caches) that the disabled vllm/sglang
entries in this file already document. That gives the 7B-param Q4_K
ASR model the disk + CPU runway it needs.
Keep timeout-minutes: 150 - even on a beefier runner the 30 s JFK
decode plus 10 GB download has to fit comfortably.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(vibevoice-cpp): apt-get install make on bigger-runner before transcription e2e
bigger-runner is a self-hosted bare runner without the standard
ubuntu image's preinstalled build tools, so the previous job died at
the very first command with 'make: command not found' (exit 127).
Add the Dependencies step that the disabled vllm/sglang entries in
this file already document - apt-get installs make + build-essential
+ curl + unzip + ca-certificates + git + tar before the make target
runs. Mirrors how every other 'runs-on: bigger-runner' entry in
backend.yml prepares the runner.
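In workflow terms the step is roughly this shape (a sketch mirroring the package list above, not the committed YAML):

```yaml
- name: Dependencies
  run: |
    sudo apt-get update
    sudo apt-get install -y make build-essential curl unzip ca-certificates git tar
```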
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
383 lines
13 KiB
Go
package main

import (
	"context"
	"os"
	"os/exec"
	"path/filepath"
	"regexp"
	"strings"
	"testing"
	"time"

	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

const (
	testAddr    = "localhost:50098"
	startupWait = 5 * time.Second
)

func TestVibevoiceCpp(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "VibeVoice-cpp Backend Suite")
}

// modelDirOrSkip returns the staged model bundle dir, or Skip()s the
// current spec when VIBEVOICE_MODEL_DIR is unset / lacks the gguf
// files we need. Tests that don't depend on a model (Locking, error
// paths) don't call this.
func modelDirOrSkip() string {
	dir := os.Getenv("VIBEVOICE_MODEL_DIR")
	if dir == "" {
		Skip("VIBEVOICE_MODEL_DIR not set, skipping model-dependent specs")
	}
	if _, err := os.Stat(filepath.Join(dir, "tokenizer.gguf")); os.IsNotExist(err) {
		Skip("tokenizer.gguf missing in " + dir)
	}
	tts, _ := filepath.Glob(filepath.Join(dir, "vibevoice-realtime-*.gguf"))
	asr, _ := filepath.Glob(filepath.Join(dir, "vibevoice-asr-*.gguf"))
	if len(tts) == 0 && len(asr) == 0 {
		Skip("neither realtime TTS nor ASR gguf found in " + dir)
	}
	return dir
}

// startServer launches the prebuilt backend binary and returns a
// running *exec.Cmd. test.sh ensures `./vibevoice-cpp` is built; if
// it isn't, every gRPC spec is skipped with a clear reason.
func startServer() *exec.Cmd {
	binary := os.Getenv("VIBEVOICE_BINARY")
	if binary == "" {
		binary = "./vibevoice-cpp"
	}
	if _, err := os.Stat(binary); os.IsNotExist(err) {
		Skip("backend binary not found at " + binary)
	}
	cmd := exec.Command(binary, "--addr", testAddr)
	cmd.Stdout = os.Stderr
	cmd.Stderr = os.Stderr
	Expect(cmd.Start()).To(Succeed())
	time.Sleep(startupWait)
	return cmd
}

func stopServer(cmd *exec.Cmd) {
	if cmd == nil || cmd.Process == nil {
		return
	}
	_ = cmd.Process.Kill()
	_, _ = cmd.Process.Wait()
}

func dialGRPC() *grpc.ClientConn {
	conn, err := grpc.Dial(testAddr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(50*1024*1024),
			grpc.MaxCallSendMsgSize(50*1024*1024),
		),
	)
	Expect(err).ToNot(HaveOccurred())
	return conn
}

var _ = Describe("VibeVoice-cpp", func() {
	Context("backend semantics (no purego load needed)", func() {
		It("is locking - the engine has process-global state", func() {
			Expect((&VibevoiceCpp{}).Locking()).To(BeTrue())
		})

		It("rejects Load with empty ModelFile", func() {
			err := (&VibevoiceCpp{}).Load(&pb.ModelOptions{})
			Expect(err).To(HaveOccurred())
			Expect(err.Error()).To(ContainSubstring("ModelFile"))
		})

		It("rejects TTS without a loaded TTS model", func() {
			err := (&VibevoiceCpp{}).TTS(&pb.TTSRequest{
				Text: "no model loaded",
				Dst:  "/tmp/should-not-be-written.wav",
			})
			Expect(err).To(HaveOccurred())
		})

		It("rejects AudioTranscription without a loaded ASR model", func() {
			_, err := (&VibevoiceCpp{}).AudioTranscription(&pb.TranscriptRequest{
				Dst: "/tmp/some.wav",
			})
			Expect(err).To(HaveOccurred())
		})

		It("closes the channel and errors on TTSStream without a loaded model", func() {
			ch := make(chan []byte, 4)
			err := (&VibevoiceCpp{}).TTSStream(&pb.TTSRequest{
				Text: "no model loaded",
				Dst:  "/tmp/should-not-be-written.wav",
			}, ch)
			Expect(err).To(HaveOccurred())
			// Server hangs forever if the channel stays open; this guard
			// is what regresses the e2e DeadlineExceeded we're fixing.
			_, ok := <-ch
			Expect(ok).To(BeFalse(), "TTSStream must close results channel even on error")
		})

		// parseOptions + slot fill is the source of the closed-loop CI
		// regression where ModelFile=tts.gguf + Options[asr_model=...]
		// resulted in a load with empty tts slot. These specs assert
		// the slot resolution before we ever call into purego.
		Describe("ModelFile slot resolution", func() {
			It("fills tts slot from ModelFile when only asr_model is in Options", func() {
				v := &VibevoiceCpp{}
				v.modelRoot = "/abs/root"
				role := v.parseOptions([]string{"asr_model=/abs/root/asr.gguf", "tokenizer=/abs/root/tokenizer.gguf"}, v.modelRoot)
				Expect(v.asrModel).To(Equal("/abs/root/asr.gguf"))
				Expect(v.ttsModel).To(BeEmpty())
				Expect(role).To(BeEmpty())
				// Mirror the Load() default-fill block:
				if v.ttsModel == "" {
					v.ttsModel = "/abs/root/tts.gguf"
				}
				Expect(v.ttsModel).To(Equal("/abs/root/tts.gguf"))
				Expect(v.asrModel).To(Equal("/abs/root/asr.gguf"))
			})

			It("fills asr slot from ModelFile when type=asr is set", func() {
				v := &VibevoiceCpp{}
				v.modelRoot = "/abs/root"
				role := v.parseOptions([]string{"type=asr", "tokenizer=/abs/root/tokenizer.gguf"}, v.modelRoot)
				Expect(role).To(Equal("asr"))
				Expect(v.asrModel).To(BeEmpty())
				Expect(v.ttsModel).To(BeEmpty())
			})

			It("respects explicit tts_model override over ModelFile", func() {
				v := &VibevoiceCpp{}
				v.modelRoot = "/abs/root"
				_ = v.parseOptions([]string{"tts_model=/abs/root/alt.gguf"}, v.modelRoot)
				Expect(v.ttsModel).To(Equal("/abs/root/alt.gguf"))
			})

			It("accepts colon-separated options too", func() {
				v := &VibevoiceCpp{}
				v.modelRoot = "/abs/root"
				role := v.parseOptions([]string{"type:asr", "tokenizer:/abs/root/tokenizer.gguf"}, v.modelRoot)
				Expect(role).To(Equal("asr"))
				Expect(v.tokenizer).To(Equal("/abs/root/tokenizer.gguf"))
			})
		})

		// The gallery flow puts everything under <models_dir>/<entry>/,
		// and parameters/options carry paths *relative* to <models_dir>.
		// LocalAI core fills opts.ModelPath = <models_dir>; the backend
		// must resolve every relative path against that root, never CWD.
		Describe("resolvePath (relative-to-modelRoot)", func() {
			It("joins relative path onto relTo", func() {
				Expect(resolvePath("vibevoice-cpp/tokenizer.gguf", "/data/models")).
					To(Equal("/data/models/vibevoice-cpp/tokenizer.gguf"))
			})

			It("passes absolute paths through unchanged", func() {
				Expect(resolvePath("/abs/somewhere/tokenizer.gguf", "/data/models")).
					To(Equal("/abs/somewhere/tokenizer.gguf"))
			})

			It("returns input unchanged when relTo is empty", func() {
				Expect(resolvePath("vibevoice-cpp/tokenizer.gguf", "")).
					To(Equal("vibevoice-cpp/tokenizer.gguf"))
			})

			It("returns empty input unchanged", func() {
				Expect(resolvePath("", "/data/models")).To(BeEmpty())
			})

			It("does not consult CWD - bare filenames stay relative to modelRoot", func() {
				// Even if the test runs in a directory containing a
				// file with this name, the lookup must not fall back
				// to CWD. This is the trap the production gallery flow
				// would otherwise hit when LocalAI is launched from a
				// directory that happens to contain a same-named file.
				prev, _ := os.Getwd()
				DeferCleanup(func() { _ = os.Chdir(prev) })
				tmpCWD, err := os.MkdirTemp("", "vv-cwd-*")
				Expect(err).ToNot(HaveOccurred())
				DeferCleanup(func() { _ = os.RemoveAll(tmpCWD) })
				Expect(os.WriteFile(filepath.Join(tmpCWD, "tokenizer.gguf"),
					[]byte("not the real one"), 0o644)).To(Succeed())
				Expect(os.Chdir(tmpCWD)).To(Succeed())

				got := resolvePath("tokenizer.gguf", "/data/models")
				Expect(got).To(Equal("/data/models/tokenizer.gguf"))
			})
		})

		// Round-trip the gallery layout: relative paths in Options +
		// an absolute ModelFile (as LocalAI core delivers them) end
		// up resolved correctly inside the backend struct.
		It("Load resolves relative Options paths against opts.ModelPath", func() {
			tmpDir, err := os.MkdirTemp("", "vv-relpath-*")
			Expect(err).ToNot(HaveOccurred())
			DeferCleanup(func() { _ = os.RemoveAll(tmpDir) })

			// Lay out the bundle exactly as the gallery would after install:
			// <modelpath>/vibevoice-cpp/{tts,tokenizer,voice}.gguf
			subDir := filepath.Join(tmpDir, "vibevoice-cpp")
			Expect(os.MkdirAll(subDir, 0o755)).To(Succeed())
			tts := filepath.Join(subDir, "vibevoice-realtime-stub.gguf")
			tok := filepath.Join(subDir, "tokenizer.gguf")
			voice := filepath.Join(subDir, "voice.gguf")
			for _, p := range []string{tts, tok, voice} {
				Expect(os.WriteFile(p, []byte("stub"), 0o644)).To(Succeed())
			}

			// Mirror Load()'s pre-purego prefix: parse + slot fill.
			v := &VibevoiceCpp{}
			modelFile := tts // core delivers this as an abspath already
			v.modelRoot = tmpDir
			role := v.parseOptions([]string{
				"tokenizer=vibevoice-cpp/tokenizer.gguf",
				"voice=vibevoice-cpp/voice.gguf",
			}, v.modelRoot)
			Expect(role).To(BeEmpty())
			if v.ttsModel == "" {
				v.ttsModel = modelFile
			}

			Expect(v.ttsModel).To(Equal(tts))
			Expect(v.tokenizer).To(Equal(tok))
			Expect(v.voice).To(Equal(voice))
			Expect(v.asrModel).To(BeEmpty())
		})

		It("closes the channel and errors on AudioTranscriptionStream without a loaded model", func() {
			ch := make(chan *pb.TranscriptStreamResponse, 4)
			err := (&VibevoiceCpp{}).AudioTranscriptionStream(&pb.TranscriptRequest{
				Dst: "/tmp/some.wav",
			}, ch)
			Expect(err).To(HaveOccurred())
			_, ok := <-ch
			Expect(ok).To(BeFalse(), "AudioTranscriptionStream must close results channel even on error")
		})
	})

	Context("gRPC server lifecycle", func() {
		var cmd *exec.Cmd

		AfterEach(func() {
			stopServer(cmd)
			cmd = nil
		})

		It("answers Health checks", func() {
			cmd = startServer()
			conn := dialGRPC()
			defer func() { _ = conn.Close() }()

			resp, err := pb.NewBackendClient(conn).Health(context.Background(), &pb.HealthMessage{})
			Expect(err).ToNot(HaveOccurred())
			Expect(string(resp.Message)).To(Equal("OK"))
		})

		It("loads the realtime TTS model", func() {
			dir := modelDirOrSkip()
			tts, _ := filepath.Glob(filepath.Join(dir, "vibevoice-realtime-*.gguf"))
			if len(tts) == 0 {
				Skip("realtime TTS gguf missing")
			}

			cmd = startServer()
			conn := dialGRPC()
			defer func() { _ = conn.Close() }()

			// Mirror the gallery contract: ModelFile is whatever LocalAI
			// core hands us; ModelPath is the models root; Options[]
			// carry paths relative to ModelPath.
			resp, err := pb.NewBackendClient(conn).LoadModel(context.Background(), &pb.ModelOptions{
				ModelFile: filepath.Base(tts[0]),
				ModelPath: dir,
				Threads:   4,
				Options:   []string{"tokenizer=tokenizer.gguf"},
			})
			Expect(err).ToNot(HaveOccurred())
			Expect(resp.Success).To(BeTrue(), "LoadModel msg=%q", resp.Message)
		})

		It("runs a closed-loop TTS -> ASR with >=80% word recall", func() {
			dir := modelDirOrSkip()
			tts, _ := filepath.Glob(filepath.Join(dir, "vibevoice-realtime-*.gguf"))
			asr, _ := filepath.Glob(filepath.Join(dir, "vibevoice-asr-*.gguf"))
			if len(tts) == 0 || len(asr) == 0 {
				Skip("closed-loop needs both realtime TTS and ASR ggufs")
			}

			tmpDir, err := os.MkdirTemp("", "vibevoice-cpp-closedloop-*")
			Expect(err).ToNot(HaveOccurred())
			DeferCleanup(func() { _ = os.RemoveAll(tmpDir) })
			wav := filepath.Join(tmpDir, "say.wav")

			cmd = startServer()
			conn := dialGRPC()
			defer func() { _ = conn.Close() }()
			client := pb.NewBackendClient(conn)

			// Gallery convention: ModelPath is the models root, every
			// path inside Options[] is relative to it.
			voiceMatches, _ := filepath.Glob(filepath.Join(dir, "voice-*.gguf"))
			loadOpts := &pb.ModelOptions{
				ModelFile: filepath.Base(tts[0]),
				ModelPath: dir,
				Threads:   4,
				Options: []string{
					"asr_model=" + filepath.Base(asr[0]),
					"tokenizer=tokenizer.gguf",
				},
			}
			if len(voiceMatches) > 0 {
				loadOpts.Options = append(loadOpts.Options, "voice="+filepath.Base(voiceMatches[0]))
			}
			loadResp, err := client.LoadModel(context.Background(), loadOpts)
			Expect(err).ToNot(HaveOccurred())
			Expect(loadResp.Success).To(BeTrue(), "LoadModel msg=%q", loadResp.Message)

			srcText := "Hello world this is a test of the synthesis system."
			_, err = client.TTS(context.Background(), &pb.TTSRequest{
				Text: srcText,
				Dst:  wav,
			})
			Expect(err).ToNot(HaveOccurred())

			info, err := os.Stat(wav)
			Expect(err).ToNot(HaveOccurred())
			Expect(info.Size()).To(BeNumerically(">=", 1000),
				"TTS produced suspiciously small wav (%d bytes)", info.Size())

			resp, err := client.AudioTranscription(context.Background(), &pb.TranscriptRequest{
				Dst: wav,
			})
			Expect(err).ToNot(HaveOccurred())
			got := strings.ToLower(resp.Text)
			GinkgoWriter.Printf("source : %s\n", srcText)
			GinkgoWriter.Printf("transcribed: %s\n", got)

			wordRE := regexp.MustCompile(`[a-z]+`)
			srcWords := wordRE.FindAllString(strings.ToLower(srcText), -1)
			Expect(srcWords).ToNot(BeEmpty())
			hits := 0
			for _, w := range srcWords {
				if strings.Contains(got, w) {
					hits++
				}
			}
			recall := float64(hits) / float64(len(srcWords))
			GinkgoWriter.Printf("recall: %d/%d = %.2f%%\n", hits, len(srcWords), recall*100)
			Expect(recall).To(BeNumerically(">=", 0.80),
				"closed-loop recall too low: %d/%d = %.2f%%",
				hits, len(srcWords), recall*100)
		})
	})
})