mirror of
https://github.com/mudler/LocalAI.git
synced 2026-04-29 19:44:13 -04:00
* feat(vibevoice-cpp): add purego TTS+ASR backend
Wire up Microsoft VibeVoice via the vibevoice.cpp C ABI as a new
purego-based Go backend that serves both Backend.TTS and
Backend.AudioTranscription from a single gRPC binary. Mirrors the
qwen3-tts-cpp / sherpa-onnx pattern so the variant matrix
(cpu/cuda12/cuda13/metal/rocm/sycl-f16/f32/vulkan/l4t) and the
e2e-backends gRPC harness reuse existing infrastructure.
- backend/go/vibevoice-cpp/ - Makefile, CMakeLists, purego shim, gRPC
Backend with model-dir auto-detection, closed-loop TTS->ASR smoke test
- backend/index.yaml - &vibevoicecpp meta + 18 image entries
- Makefile - .NOTPARALLEL, BACKEND_VIBEVOICE_CPP, docker-build wiring,
test-extra-backend-vibevoice-cpp-{tts,transcription} e2e wrappers
- .github/workflows/backend.yml - matrix entries for all variants
- .github/workflows/test-extra.yml - per-backend smoke + 2 gRPC e2e jobs
* feat(vibevoice-cpp): drop hardcoded glob detection, add gallery entries
Refactor backend Load() to follow the standard Options[] convention
used by sherpa-onnx and the rest of the multi-role backends:
ModelFile is the primary gguf, supplementary paths come through
opts.Options[] as key=value (or key:value for Make-target compat),
resolved against opts.ModelPath. type=asr/tts decides the role of
ModelFile when neither tts_model nor asr_model is set explicitly.
Add gallery/index.yaml entries:
- vibevoice-cpp - realtime 0.5B Q8_0 TTS + tokenizer + Carter voice
- vibevoice-cpp-asr - long-form ASR Q8_0 + tokenizer
Both pull from huggingface://mudler/vibevoice.cpp-models with sha256
verification. parameters.model + Options[] paths are siblings under
{models_dir} per the qwen3-tts-cpp convention.
Update Makefile e2e wrappers to pass BACKEND_TEST_OPTIONS comma+colon
style, and tighten the per-backend Go closed-loop test to use the
explicit Options API.
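The key=value / key:value convention above can be sketched as follows. This is a minimal illustration, not the backend's actual code; `splitOption` and `resolveOption` are hypothetical names standing in for the parsing the commit describes:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// splitOption mirrors the documented convention: try "key=value"
// first (gallery YAML style), then fall back to "key:value"
// (Make-target style, where '=' clashes with make variable syntax).
func splitOption(raw string) (key, val string, ok bool) {
	key, val, ok = strings.Cut(raw, "=")
	if !ok {
		key, val, ok = strings.Cut(raw, ":")
	}
	return strings.TrimSpace(key), strings.TrimSpace(val), ok
}

// resolveOption joins relative values onto the models root
// (opts.ModelPath); absolute values pass through unchanged.
func resolveOption(val, modelRoot string) string {
	if val == "" || filepath.IsAbs(val) {
		return val
	}
	return filepath.Join(modelRoot, val)
}

func main() {
	modelRoot := "/models"
	k, v, _ := splitOption("tokenizer=tokenizer.gguf")
	fmt.Println(k, "->", resolveOption(v, modelRoot))
	k, v, _ = splitOption("asr_model:asr-q4.gguf")
	fmt.Println(k, "->", resolveOption(v, modelRoot))
}
```

Both spellings resolve to the same sibling-of-the-model path, which is what lets the Make targets and the gallery YAML share one Options[] contract.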
* fix(vibevoice-cpp): force whole-archive link so vv_capi_* exports survive
libvibevoice is a STATIC archive linked into the MODULE library.
Without --whole-archive (or -force_load on Apple, /WHOLEARCHIVE on
MSVC), the linker garbage-collects symbols not referenced from this
translation unit - which means dlopen+RegisterLibFunc panics with
'undefined symbol: vv_capi_load' at backend startup, since purego
looks them up by name and our cpp/govibevoicecpp.cpp doesn't call
them directly.
* test(vibevoice-cpp): rewrite suite with Ginkgo v2
Match the convention used by backend/go/sherpa-onnx/backend_test.go.
The suite now covers backend semantics that don't need purego (Locking,
empty-ModelFile rejection, TTS/ASR-without-loaded-model errors) on top
of the gRPC lifecycle specs (Health, Load, closed-loop TTS->ASR).
Model-dependent specs Skip() when VIBEVOICE_MODEL_DIR is unset, so
`go test ./backend/go/vibevoice-cpp/` is green on a clean checkout
and runs the heavyweight closed-loop spec when test.sh has staged
the bundle.
* fix(vibevoice-cpp): implement TTSStream + AudioTranscriptionStream
The gRPC server's stream handlers (pkg/grpc/server.go) spawn a
goroutine that ranges over a chan; the only thing closing that chan
is the backend's own *Stream method. With the default Base stub
returning 'unimplemented' and never touching the chan, the server
goroutine hangs forever and the client hits DeadlineExceeded - which
is exactly what the e2e harness saw in the test-extra-backend-vibevoice-cpp-tts
matrix run.
TTSStream synthesizes via vv_capi_tts to a tempfile, then emits a
streaming WAV header (chunk sizes 0xFFFFFFFF so HTTP clients can
start playback before the full PCM lands) followed by the PCM body
in 64 KB slices. The header + >=2 PCM frames satisfy the harness's
'expected >=2 chunks' assertion and give a real progressive stream.
AudioTranscriptionStream runs the offline transcription, emits each
segment as a delta, and closes with a final_result whose Text equals
the concatenated deltas (the harness asserts those match).
Two new Ginkgo specs guard the close-channel-on-error path so the
deadline-exceeded regression can't come back silently.
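The streaming-WAV-header trick can be sketched independently of LocalAI's laudio helpers. This is an illustrative stand-alone version (the real backend goes through laudio.NewWAVHeaderWithRate); it hand-assembles a 44-byte PCM header whose RIFF and data chunk sizes are the 0xFFFFFFFF "unknown length" sentinel:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// streamingWAVHeader builds a 44-byte PCM WAV header whose RIFF and
// data chunk sizes are 0xFFFFFFFF, signalling "length unknown" so a
// client can start playback before the full PCM body has arrived.
func streamingWAVHeader(sampleRate uint32, channels, bitsPerSample uint16) []byte {
	var b bytes.Buffer
	byteRate := sampleRate * uint32(channels) * uint32(bitsPerSample) / 8
	blockAlign := channels * bitsPerSample / 8

	b.WriteString("RIFF")
	binary.Write(&b, binary.LittleEndian, uint32(0xFFFFFFFF)) // ChunkSize: unknown
	b.WriteString("WAVE")
	b.WriteString("fmt ")
	binary.Write(&b, binary.LittleEndian, uint32(16)) // fmt chunk length
	binary.Write(&b, binary.LittleEndian, uint16(1))  // AudioFormat: PCM
	binary.Write(&b, binary.LittleEndian, channels)
	binary.Write(&b, binary.LittleEndian, sampleRate)
	binary.Write(&b, binary.LittleEndian, byteRate)
	binary.Write(&b, binary.LittleEndian, blockAlign)
	binary.Write(&b, binary.LittleEndian, bitsPerSample)
	b.WriteString("data")
	binary.Write(&b, binary.LittleEndian, uint32(0xFFFFFFFF)) // data size: unknown
	return b.Bytes()
}

func main() {
	hdr := streamingWAVHeader(24000, 1, 16) // vibevoice's fixed output format
	fmt.Println(len(hdr))                   // 44
}
```

After the header, each 64 KB PCM slice is just raw bytes appended to the stream; the sentinel sizes are why players don't wait for a final length field.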
* fix(vibevoice-cpp): silence errcheck on cleanup paths
Lint flagged six unchecked Close()/Remove()/RemoveAll() calls along
purely-cleanup deferred paths. Wrap each in '_ = ...' (or a closure
for defers that take args) - matches what the rest of the LocalAI
backend/go/* tree already does for these callsites.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
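The errcheck-silencing pattern described above looks like this in isolation. A small sketch, not code from the backend (`writeScratch` is a made-up helper): cleanup calls whose errors are intentionally ignored get an explicit `_ =`, and defers that take arguments go through a closure:

```go
package main

import (
	"fmt"
	"os"
)

// writeScratch writes to a tempfile and cleans up best-effort.
// The `_ =` wrappers tell errcheck the ignored error is deliberate.
func writeScratch() error {
	f, err := os.CreateTemp("", "scratch-*.txt")
	if err != nil {
		return err
	}
	defer func() { _ = os.Remove(f.Name()) }() // defer with args: closure
	defer func() { _ = f.Close() }()           // best-effort close

	_, err = f.WriteString("hello")
	return err
}

func main() {
	fmt.Println(writeScratch() == nil) // true
}
```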
* fix(vibevoice-cpp): closed-loop slot fill + modelRoot-relative path resolution
Two bugs the test-extra-backend-vibevoice-cpp-* CI matrix surfaced:
1. Closed-loop Load with ModelFile=tts.gguf + Options[asr_model=...] left
v.ttsModel empty, because the default-fill block only ran when BOTH
slots were empty. vv_capi_load then got tts="" + a voice and the
C side rejected it with rc=-3 'TTS model required to load a voice'.
Fix: ModelFile fills the *primary* role-slot (decided by 'type=' in
Options, defaulting to tts) independently of the secondary, so
ModelFile + asr_model resolves to both.
2. resolvePath stat'd CWD before falling back to relTo. With LocalAI
launched from a directory that happens to contain a same-named
file, supplementary Options[] paths could leak away from the
models dir. Drop the CWD probe entirely - relative paths now
*always* join onto opts.ModelPath (the gallery convention).
New Ginkgo coverage:
* 'ModelFile slot resolution' (4 specs) - asr_model+ModelFile, type=asr,
explicit tts_model override, key:value variant.
* 'resolvePath (relative-to-modelRoot)' (5 specs) - join, abs passthrough,
empty input, empty relTo, and the CWD-trap regression test.
* 'Load resolves relative Options paths against opts.ModelPath' - end-
to-end gallery layout round-trip.
Verified locally: 19/19 specs pass (with model bundle, including the
closed-loop TTS->ASR; without bundle, 17 pass + 2 model-dependent skip).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
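The slot-fill fix in bug 1 can be reduced to a few lines. This is an illustrative distillation (`fillSlots` is not the backend's actual function): ModelFile fills only the primary role-slot chosen by `type=`, leaving the other slot exactly as Options set it:

```go
package main

import "fmt"

// fillSlots mirrors the fix: ModelFile lands in the primary slot
// (tts unless role says asr) independently of whether the secondary
// slot was already filled by an explicit Options[] override.
func fillSlots(modelFile, role, ttsModel, asrModel string) (tts, asr string) {
	tts, asr = ttsModel, asrModel
	if role == "asr" {
		if asr == "" {
			asr = modelFile
		}
	} else if tts == "" {
		tts = modelFile
	}
	return tts, asr
}

func main() {
	// Closed-loop config: ModelFile=tts.gguf + Options[asr_model=asr.gguf].
	// The pre-fix code only defaulted when BOTH slots were empty, so
	// tts stayed "" and vv_capi_load rejected the voice with rc=-3.
	tts, asr := fillSlots("tts.gguf", "", "", "asr.gguf")
	fmt.Println(tts, asr) // tts.gguf asr.gguf
}
```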
* test(vibevoice-cpp): use gallery convention in closed-loop spec
The 'loads the realtime TTS model' / closed-loop specs were passing
already-prefixed paths into Options[]:
Options: ['tokenizer=' + filepath.Join(modelDir, 'tokenizer.gguf')]
Combined with no ModelPath set on the request, the backend's
modelRoot fell back to filepath.Dir(ModelFile) = modelDir, then
resolvePath joined the prefixed Options path on top of it -
producing 'vibevoice-models/vibevoice-models/tokenizer.gguf' when
the CI's VIBEVOICE_MODEL_DIR is the relative './vibevoice-models'.
The fix is to mirror the gallery contract LocalAI core actually
sends in production: ModelPath is the models root (absolute),
ModelFile is a name *under* it, every Options[] path is relative
to ModelPath. Uses filepath.Base() to get bare filenames.
Verified locally with both VIBEVOICE_MODEL_DIR=/tmp/vv-bundle (abs)
and VIBEVOICE_MODEL_DIR=vibevoice-models (the relative shape that
broke CI). Both: 19/19 specs pass, ~55-60s.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(vibevoice-cpp): switch ASR to Q4_K + bump transcription timeout
The Q8_0 ASR gguf is ~14 GB - too big to fit alongside the runner
image, the docker build cache, and the test artifacts on a free
ubuntu-latest GHA runner; 'test-extra-backend-vibevoice-cpp-transcription'
was getting SIGTERM'd at 90 min before the model could finish loading.
Switch to Q4_K (~10 GB on disk, slightly faster CPU decode) for:
* the e2e harness Make target
* the gallery 'vibevoice-cpp-asr' entry (parameters + files block)
* the per-backend test.sh auto-download list
Bump tests-vibevoice-cpp-grpc-transcription's timeout-minutes from
90 to 150 - even with Q4_K, the 30 s JFK clip on a CPU runner needs
runway above the previous 90 min cap.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(vibevoice-cpp): drop transcription gRPC e2e job - too heavy for free runners
The vibevoice ASR is a 7B-parameter model. Even on Q4_K (~10 GB on
disk) a single 30 s transcription saturates the per-test 30 min
timeout in the e2e-backends harness on a 4-core ubuntu-latest, and
the 10 GB download + Docker layer + working space leaves no headroom
on the runner's free disk. Two attempts in CI got SIGTERM'd at the
LoadModel boundary - the bottleneck isn't tunable from the workflow
side without a paid-tier runner.
The per-backend tests-vibevoice-cpp job already runs the same
AudioTranscription path via a closed-loop TTS->ASR Ginkgo spec - same
gRPC contract, same model, single process - so the standalone
tests-vibevoice-cpp-grpc-transcription job was redundant on top of
the disk/CPU pressure.
The Makefile target test-extra-backend-vibevoice-cpp-transcription
stays for local invocation on workstations that can afford it -
useful when developing the streaming codepaths.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(vibevoice-cpp): restore transcription gRPC e2e on bigger-runner
Switch tests-vibevoice-cpp-grpc-transcription from ubuntu-latest to
the self-hosted 'bigger-runner' label that GPU image builds in
backend.yml use, plus the documented Free-disk-space prep step (purge
dotnet / ghc / android / CodeQL caches) that the disabled vllm/sglang
entries in this file describe. That gives the 7B-param Q4_K ASR
model the disk + CPU runway it needs.
Keep timeout-minutes: 150 - even on a beefier runner the 30 s JFK
decode plus 10 GB download has to fit comfortably.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(vibevoice-cpp): apt-get install make on bigger-runner before transcription e2e
bigger-runner is a self-hosted bare runner without the standard
ubuntu image's preinstalled build tools, so the previous job died at
the very first command with 'make: command not found' (exit 127).
Add the Dependencies step that the disabled vllm/sglang entries in
this file already document - apt-get installs make + build-essential
+ curl + unzip + ca-certificates + git + tar before the make target
runs. Mirrors how every other 'runs-on: bigger-runner' entry in
backend.yml prepares the runner.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
388 lines
12 KiB
Go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strings"

	laudio "github.com/mudler/LocalAI/pkg/audio"
	"github.com/mudler/LocalAI/pkg/grpc/base"
	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
)

// vibevoice.cpp synthesizes 24 kHz mono 16-bit PCM. Hardcoded - the
// model itself is fixed-rate; if the upstream ever changes this we'll
// pick it up via vv_capi_version().
const vibevoiceSampleRate = uint32(24000)

// purego-bound entry points from libgovibevoicecpp.
var (
	CppLoad func(ttsModel, asrModel, tokenizer, voice string, threads int32) int32
	CppTTS  func(text, voicePath, dstWav string,
		nSteps int32, cfgScale float32, maxSpeechFrames int32, seed uint32) int32
	CppASR func(srcWav string, outJSON []byte, capacity uint64,
		maxNewTokens int32) int32
	CppUnload  func()
	CppVersion func() string
)

// VibevoiceCpp speaks gRPC against vibevoice.cpp's flat C ABI. The
// engine is a single global, so we serialize calls through SingleThread.
type VibevoiceCpp struct {
	base.SingleThread

	threads int

	// modelRoot is the directory we use to resolve relative paths
	// from Options[] and per-call overrides (TTSRequest.Voice).
	// Source of truth: opts.ModelPath; falls back to the dir of
	// the primary ModelFile when ModelPath is empty.
	modelRoot string

	ttsModel  string
	asrModel  string
	tokenizer string
	voice     string
}

// resolvePath joins a relative path onto `relTo`. The gallery
// convention is that Options[] carry paths relative to the LocalAI
// models dir (opts.ModelPath), so anything not absolute is treated
// as a sibling of the primary ModelFile - never CWD. Empty / already-
// absolute / no-relTo inputs pass through unchanged.
func resolvePath(p, relTo string) string {
	if p == "" || filepath.IsAbs(p) || relTo == "" {
		return p
	}
	return filepath.Join(relTo, p)
}

// parseOptions reads opts.Options[] and pulls out the per-role
// overrides documented in the gallery entries. Accepts both "key=value"
// (gallery YAML style) and "key:value" (Make-target / env-var style).
func (v *VibevoiceCpp) parseOptions(opts []string, relTo string) string {
	role := ""
	for _, raw := range opts {
		k, val, ok := strings.Cut(raw, "=")
		if !ok {
			k, val, ok = strings.Cut(raw, ":")
			if !ok {
				continue
			}
		}
		key := strings.TrimSpace(k)
		val = strings.TrimSpace(val)
		switch key {
		case "type":
			role = strings.ToLower(val)
		case "tokenizer":
			v.tokenizer = resolvePath(val, relTo)
		case "voice":
			v.voice = resolvePath(val, relTo)
		case "tts_model":
			v.ttsModel = resolvePath(val, relTo)
		case "asr_model":
			v.asrModel = resolvePath(val, relTo)
		}
	}
	return role
}

func (v *VibevoiceCpp) Load(opts *pb.ModelOptions) error {
	if opts.ModelFile == "" {
		return fmt.Errorf("vibevoice-cpp: ModelFile is required")
	}
	modelFile := opts.ModelFile
	if !filepath.IsAbs(modelFile) && opts.ModelPath != "" {
		modelFile = filepath.Join(opts.ModelPath, modelFile)
	}

	// ModelPath is the LocalAI core's models root, propagated over
	// gRPC. Use it as the resolution base for Options[] (and later
	// for TTSRequest.Voice) so gallery entries can reference paths
	// like "tokenizer=tokenizer.gguf" and have them resolved
	// against the same root the core used to drop the files.
	v.modelRoot = opts.ModelPath
	if v.modelRoot == "" {
		v.modelRoot = filepath.Dir(modelFile)
	}
	role := v.parseOptions(opts.Options, v.modelRoot)

	// ModelFile fills the "primary" role-slot determined by `type=`
	// in Options (defaults to tts). The other slot stays exactly as
	// Options set it - so a closed-loop config with ModelFile=tts.gguf
	// + Options[asr_model=asr.gguf] resolves correctly to both slots,
	// and an explicit `tts_model=` / `asr_model=` always wins over
	// ModelFile for its own slot.
	primaryIsASR := false
	switch role {
	case "asr", "transcript", "stt", "speech-to-text":
		primaryIsASR = true
	}
	if primaryIsASR {
		if v.asrModel == "" {
			v.asrModel = modelFile
		}
	} else if v.ttsModel == "" {
		v.ttsModel = modelFile
	}

	if v.ttsModel == "" && v.asrModel == "" {
		return fmt.Errorf("vibevoice-cpp: no TTS or ASR model resolved from ModelFile=%q + options", opts.ModelFile)
	}
	if v.tokenizer == "" {
		return fmt.Errorf("vibevoice-cpp: tokenizer is required - pass options: [tokenizer=<path>]")
	}

	threads := int(opts.Threads)
	if threads <= 0 {
		threads = 4
	}
	v.threads = threads

	fmt.Fprintf(os.Stderr,
		"[vibevoice-cpp] Loading: tts=%q asr=%q tokenizer=%q voice=%q threads=%d\n",
		v.ttsModel, v.asrModel, v.tokenizer, v.voice, threads)

	if rc := CppLoad(v.ttsModel, v.asrModel, v.tokenizer, v.voice, int32(threads)); rc != 0 {
		return fmt.Errorf("vibevoice-cpp: vv_capi_load failed (rc=%d)", rc)
	}
	return nil
}

func (v *VibevoiceCpp) TTS(req *pb.TTSRequest) error {
	if v.ttsModel == "" {
		return fmt.Errorf("vibevoice-cpp: TTS requested but no realtime model was loaded")
	}
	text := req.Text
	dst := req.Dst
	if text == "" || dst == "" {
		return fmt.Errorf("vibevoice-cpp: TTS requires both text and dst")
	}

	// req.Voice may be a bare filename (e.g. "voice-en-Emma.gguf") or an
	// absolute path. Resolve via the same modelRoot Load() used for
	// Options[] so a swap-voice request mirrors the gallery's layout.
	voice := resolvePath(req.Voice, v.modelRoot)

	if req.Language != nil && *req.Language != "" {
		fmt.Fprintf(os.Stderr,
			"[vibevoice-cpp] note: TTSRequest.language=%q ignored - vibevoice picks language from the voice prompt\n",
			*req.Language)
	}

	const (
		defaultSteps     = 20
		defaultMaxFrames = 200
	)
	defaultCfg := float32(1.3)
	if rc := CppTTS(text, voice, dst,
		int32(defaultSteps), defaultCfg, int32(defaultMaxFrames), 0); rc != 0 {
		return fmt.Errorf("vibevoice-cpp: vv_capi_tts failed (rc=%d)", rc)
	}
	return nil
}

// asrSegment matches vibevoice's JSON output:
//
//	[{"Start":0.0,"End":2.8,"Speaker":0,"Content":"…"}, ...]
type asrSegment struct {
	Start   float64 `json:"Start"`
	End     float64 `json:"End"`
	Speaker int     `json:"Speaker"`
	Content string  `json:"Content"`
}

// callASR invokes vv_capi_asr with a buffer that grows on demand.
// vv_capi_asr returns: >0 bytes written, 0 no transcript, <0 error or
// -required_size. We honor the resize protocol once before giving up.
func (v *VibevoiceCpp) callASR(srcWav string, maxNewTokens int32) (string, error) {
	const startCap = 256 * 1024
	buf := make([]byte, startCap)
	rc := CppASR(srcWav, buf, uint64(len(buf)), maxNewTokens)
	if rc < 0 {
		need := -int(rc)
		if need > 0 && need < (16<<20) && need > len(buf) {
			buf = make([]byte, need+64)
			rc = CppASR(srcWav, buf, uint64(len(buf)), maxNewTokens)
		}
	}
	if rc < 0 {
		return "", fmt.Errorf("vibevoice-cpp: vv_capi_asr failed (rc=%d)", rc)
	}
	if rc == 0 {
		return "", nil
	}
	return string(buf[:rc]), nil
}

// TTSStream is the streaming counterpart to TTS. vibevoice's C ABI is
// file-only (vv_capi_tts writes a complete WAV), so we synthesize to
// a tempfile, then emit a streaming-WAV header followed by the PCM
// body in chunks. The main reason this exists at all is the gRPC
// server wrapper (pkg/grpc/server.go:TTSStream) blocks on a channel
// that only this method can close - if we leave the default Base
// stub in place, every TTSStream call hangs until the client
// deadline.
func (v *VibevoiceCpp) TTSStream(req *pb.TTSRequest, results chan []byte) error {
	defer close(results)
	if v.ttsModel == "" {
		return fmt.Errorf("vibevoice-cpp: TTSStream requested but no realtime model was loaded")
	}
	if req.Text == "" {
		return fmt.Errorf("vibevoice-cpp: TTSStream requires text")
	}

	tmp, err := os.CreateTemp("", "vibevoice-cpp-stream-*.wav")
	if err != nil {
		return fmt.Errorf("vibevoice-cpp: tempfile: %w", err)
	}
	dst := tmp.Name()
	_ = tmp.Close()
	defer func() { _ = os.Remove(dst) }()

	if err := v.TTS(&pb.TTSRequest{
		Text:     req.Text,
		Voice:    req.Voice,
		Dst:      dst,
		Language: req.Language,
	}); err != nil {
		return err
	}

	wav, err := os.ReadFile(dst)
	if err != nil {
		return fmt.Errorf("vibevoice-cpp: read tempfile: %w", err)
	}

	// Streaming WAV header: declare 0xFFFFFFFF for chunk sizes so HTTP
	// clients can start playback before they see the full PCM.
	const streamingSize = 0xFFFFFFFF
	hdr := laudio.NewWAVHeaderWithRate(streamingSize, vibevoiceSampleRate)
	hdr.ChunkSize = streamingSize
	hdrBuf := make([]byte, 0, laudio.WAVHeaderSize)
	w := newByteWriter(&hdrBuf)
	if err := hdr.Write(w); err != nil {
		return fmt.Errorf("vibevoice-cpp: write WAV header: %w", err)
	}
	results <- hdrBuf

	// PCM body: send in ~64 KB slices so the client gets multiple
	// reply chunks (e2e harness asserts >=2 frames).
	pcm := laudio.StripWAVHeader(wav)
	const chunkBytes = 64 * 1024
	for off := 0; off < len(pcm); off += chunkBytes {
		end := off + chunkBytes
		if end > len(pcm) {
			end = len(pcm)
		}
		chunk := make([]byte, end-off)
		copy(chunk, pcm[off:end])
		results <- chunk
	}
	return nil
}

// byteWriter adapts a *[]byte to io.Writer so we can hand it to
// laudio.WAVHeader.Write without allocating a bytes.Buffer.
type byteWriter struct{ buf *[]byte }

func newByteWriter(b *[]byte) *byteWriter { return &byteWriter{buf: b} }

func (w *byteWriter) Write(p []byte) (int, error) {
	*w.buf = append(*w.buf, p...)
	return len(p), nil
}

func (v *VibevoiceCpp) AudioTranscription(req *pb.TranscriptRequest) (pb.TranscriptResult, error) {
	if v.asrModel == "" {
		return pb.TranscriptResult{}, fmt.Errorf("vibevoice-cpp: AudioTranscription requested but no ASR model was loaded")
	}
	if req.Dst == "" {
		return pb.TranscriptResult{}, fmt.Errorf("vibevoice-cpp: TranscriptRequest.dst (audio path) is required")
	}

	out, err := v.callASR(req.Dst, 0)
	if err != nil {
		return pb.TranscriptResult{}, err
	}
	if out == "" {
		return pb.TranscriptResult{}, nil
	}

	var segs []asrSegment
	if err := json.Unmarshal([]byte(out), &segs); err != nil {
		fmt.Fprintf(os.Stderr,
			"[vibevoice-cpp] WARNING: vv_capi_asr returned non-JSON, falling back to single segment: %v\n", err)
		return pb.TranscriptResult{
			Segments: []*pb.TranscriptSegment{{Id: 0, Text: strings.TrimSpace(out)}},
			Text:     strings.TrimSpace(out),
		}, nil
	}

	segments := make([]*pb.TranscriptSegment, 0, len(segs))
	parts := make([]string, 0, len(segs))
	var duration float32
	for i, s := range segs {
		// LocalAI's whisper backend uses int64 100ns ticks for
		// Start/End (seconds * 1e7); follow the same convention so
		// consumers can mix vibevoice and whisper transcripts.
		segments = append(segments, &pb.TranscriptSegment{
			Id:      int32(i),
			Text:    s.Content,
			Start:   int64(s.Start * 1e7),
			End:     int64(s.End * 1e7),
			Speaker: fmt.Sprintf("%d", s.Speaker),
		})
		parts = append(parts, strings.TrimSpace(s.Content))
		if float32(s.End) > duration {
			duration = float32(s.End)
		}
	}
	return pb.TranscriptResult{
		Segments: segments,
		Text:     strings.TrimSpace(strings.Join(parts, " ")),
		Duration: duration,
	}, nil
}

// AudioTranscriptionStream wraps AudioTranscription so the streaming
// gRPC endpoint (server.go:AudioTranscriptionStream) sees its channel
// close and the client doesn't sit waiting until deadline. vibevoice's
// ASR doesn't expose token-level streaming - vv_capi_asr decodes the
// whole audio and returns a JSON segment list - so we run the offline
// transcription, emit each segment's content as a delta, then close
// with a final_result whose Text equals the concatenated deltas (the
// e2e harness asserts those match).
func (v *VibevoiceCpp) AudioTranscriptionStream(req *pb.TranscriptRequest, results chan *pb.TranscriptStreamResponse) error {
	defer close(results)
	res, err := v.AudioTranscription(req)
	if err != nil {
		return err
	}
	var assembled strings.Builder
	for _, seg := range res.Segments {
		if seg == nil {
			continue
		}
		txt := strings.TrimSpace(seg.Text)
		if txt == "" {
			continue
		}
		delta := txt
		if assembled.Len() > 0 {
			delta = " " + txt
		}
		results <- &pb.TranscriptStreamResponse{Delta: delta}
		assembled.WriteString(delta)
	}
	final := pb.TranscriptResult{
		Segments: res.Segments,
		Duration: res.Duration,
		Language: res.Language,
		Text:     assembled.String(),
	}
	results <- &pb.TranscriptStreamResponse{FinalResult: &final}
	return nil
}
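The tick conversion used in AudioTranscription above is simple enough to state as a one-liner. A minimal worked instance of the convention (seconds * 1e7 = int64 100 ns ticks, matching LocalAI's whisper backend):

```go
package main

import "fmt"

// secondsToTicks converts a segment boundary in seconds to int64
// 100 ns ticks (seconds * 1e7), the Start/End convention shared with
// the whisper backend so consumers can mix both transcript kinds.
func secondsToTicks(sec float64) int64 {
	return int64(sec * 1e7)
}

func main() {
	fmt.Println(secondsToTicks(1.5)) // 15000000 ticks = 1.5 s
}
```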