mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-06 07:46:15 -04:00
* feat(crispasr): backend source files (Go gRPC server, C-ABI shim, build files)

* polish(crispasr): brand error strings + fix stale shim comment

* build(crispasr): register backend in root Makefile

  Mirror the whisper Go backend registration for the new crispasr backend: NOTPARALLEL entry, prepare-test-extra/test-extra hooks, BACKEND_CRISPASR definition, docker-build target generation, and the docker-build-backends aggregate target.

* ci(crispasr): add backend build matrix entries

  Mirror the 11 whisper golang Dockerfile matrix entries (CPU amd64/arm64, CUDA 12/13, L4T CUDA 13, Intel SYCL f32/f16, Vulkan amd64/arm64, L4T arm64, ROCm hipblas) with backend and tag-suffix substituted to crispasr.

* feat(gallery): add crispasr backend gallery entries

  Add the crispasr meta anchor and its full set of image gallery entries (cpu, metal, cuda12/13, rocm, intel-sycl f32/f16, vulkan, L4T arm64, L4T cuda13 arm64, plus -development variants), mirroring the whisper backend gallery block.

* ci(crispasr): bump CRISPASR_VERSION via bump_deps workflow

  Track CrispStrobe/CrispASR main branch and bump CRISPASR_VERSION in backend/go/crispasr/Makefile.

* build(crispasr): don't wire fixture-gated test into test-extra

  Mirror the whisper Go backend: its AudioTranscription test is gated on model/audio fixtures and skips in CI, so building crispasr (the heaviest ggml compile in the tree) inside the unit-test lane adds a long compile for zero coverage. The backend image build in backend-matrix.yml remains the authoritative compile check.

* ci(crispasr): add darwin metal build entry (mirror whisper)

  The metal-crispasr gallery entries and capabilities.metal mapping reference -metal-darwin-arm64-crispasr, which is only produced by an includeDarwin entry. Mirror whisper's darwin metal entry so the tag actually gets built.

* ci(crispasr): place hipblas matrix entry next to whisper twin

* feat(crispasr): register crispasr as pref-only ASR backend + test

* test(crispasr): port whisper behavioral suite (cancellation + streaming)

* test(crispasr): fix skip message env var names to CRISPASR_*

* feat(crispasr): switch shim to crispasr_session_* multi-architecture API

  The shim used whisper_full(), which in CrispASR is the whisper-only path: libcrispasr only transcribes Whisper GGUFs through it. Multi-architecture transcription (Parakeet, Voxtral, Qwen3-ASR, Canary, Granite, FunASR, Paraformer, SenseVoice, ...) goes through the crispasr_session_* C-ABI, which auto-detects the architecture from the GGUF and dispatches to the matching backend.

  Rewrite the C shim around crispasr_session_open / _transcribe_lang / _result_* and add get_backend() so the selected backend is logged. load_model now takes a threads param (session_open binds n_threads at open). The session result is segment+word based with no token IDs and no per-decode callback, so drop n_tokens / get_token_id / get_segment_speaker_turn_next / set_new_segment_callback. set_abort is kept for API parity but is best-effort: the session transcribe is blocking with no abort hook.

  Update the purego bindings and gocrispasr.go to match: tokens are left empty, speaker-turn handling is removed, and AudioTranscriptionStream emits one delta per non-empty segment after the blocking decode returns (no progressive streaming via the session API), preserving the concat(deltas) == final.Text invariant.

  crispasr_session_set_translate is exported by libcrispasr but not declared in crispasr.h, so it is forward-declared in the shim alongside the open/transcribe/result functions.

* build(crispasr): link full CrispASR backend set for multi-arch support

  The shim's crispasr_session_* dispatch calls into the per-architecture backend libs (parakeet, voxtral, qwen3_asr, canary, funasr, paraformer, sensevoice, ...), which CrispASR builds as static archives. Linking only crispasr + ggml dead-stripped every backend object from the final module (nm backend-symbol count: 0), leaving a whisper-only .so. Link the same backend set as crispasr-cli so the static archives are pulled in. After this the module carries the backend symbols (nm count 407, .so grows from ~2.1MB to ~6.7MB) and the session API can dispatch to every compiled-in architecture.

  Also rewrite ${CMAKE_SOURCE_DIR}/examples/talk-llama to ${PROJECT_SOURCE_DIR}/... in the vendored src/CMakeLists.txt: CrispASR locates its vendored llama.cpp via ${CMAKE_SOURCE_DIR}, which is wrong when CrispASR is add_subdirectory'd (CMAKE_SOURCE_DIR points at this backend dir, not the CrispASR root). PROJECT_SOURCE_DIR is correct both standalone and as a subproject; the sed is idempotent.

* test(crispasr): adapt suite to session API (blocking, no decode callback)

  Register the new symbol set (drop the removed token/speaker/callback funcs, add get_backend; load_model now takes 2 args).

  The session transcribe is blocking with no abort hook, so a mid-decode cancel can't interrupt it: change the cancellation spec to cancel the context before the call and assert codes.Canceled from the pre-call ctx.Err() check, dropping the <5s mid-decode timing assertion. The streaming spec still holds with per-segment post-decode emission (>=2 deltas, concat(deltas) == final.Text).

* feat(gallery): add CrispASR ASR model entries (-crispasr)

* fix(gallery): keep only session-auto-detectable CrispASR ASR models

  The crispasr backend loads models via crispasr_session_open, which auto-detects the backend from the GGUF general.architecture using crispasr_detect_backend_from_gguf. Architectures not in that detect map cannot be opened, so those gallery entries fail to load.

  Removed entries whose architecture is not wired into CrispASR v0.6.11's session auto-detect router (they can be re-added when upstream maps them):

  - Not in the detect map: data2vec, firered-asr, funasr, fun-asr-mlt-nano, glm-asr, hubert, kyutai-stt, mega-asr, mimo-asr, moonshine{,-de,-streaming,-tiny-de}, omniasr{,-llm,-llm-1b}, paraformer, sensevoice.
  - Pending verification (filename-heuristic routed, not arch-detected): parakeet-ctc-0.6b, parakeet-ctc-1.1b. Their GGUFs are routed to the fastconformer-ctc backend by a filename heuristic in the model registry, which implies general.architecture is not a mapped string.

  Kept the parakeet rnnt/tdt_ctc variants: convert-parakeet-to-gguf.py writes general.architecture="parakeet" unconditionally and encodes the rnnt/ctc distinction in metadata fields, so they session-auto-detect.

* feat(crispasr): TTS synthesis via crispasr_session_synthesize (24kHz)

  Add tts_synthesize/tts_free/tts_set_voice to the C-ABI shim. They reuse the already-open g_session (crispasr_session_open auto-detects a TTS model) and dispatch to the upstream synthesis call, which returns malloc'd 24 kHz mono float PCM. Orpheus needs a SNAC codec path that we do not set, so it returns NULL here and surfaces as an error Go-side.

* feat(crispasr): implement TTS/TTSStream gRPC methods

  Bind the new shim functions via purego and implement TTS, TTSStream and a writeWAV24k helper. synthesize copies the C-owned PCM out before freeing it; TTS writes a 24 kHz mono 16-bit WAV to req.Dst via go-audio/wav. CrispASR has no progressive synth, so TTSStream synthesizes fully, encodes to WAV, and emits the bytes as a single chunk; it owns the results-channel close (the gRPC server wrapper ranges until close), mirroring vibevoice-cpp's TTSStream.

* feat(crispasr): log when a TTS voice override is not honored

* feat(gallery): add CrispASR vibevoice-tts model entry

  Only vibevoice-tts works through the current shim: qwen3-tts, chatterbox, and orpheus require companion codec/s3gen/SNAC paths (set_codec_path / set_s3gen_path) that the shim doesn't wire yet, and kokoro/indextts/voxcpm2 aren't in the session auto-detect map. Those are follow-ups.

* test(crispasr): gated TTS synthesis spec

* fix(crispasr): satisfy golangci-lint (errcheck defers + unsafeptr nolint)

  The crispasr Go file is entirely new, so new-from-merge-base lints every line (unlike the grandfathered whisper backend it was forked from):

  - handle os.RemoveAll / fh.Close return values in AudioTranscription
  - annotate the two intentional C-pointer unsafe.Slice sites with //nolint:govet

* feat(crispasr): backend: and codec: model options (explicit arch + companion files)

  Add two model-config options to the CrispASR backend via opts.Options:

  - backend:<name> selects an explicit CrispASR backend (bypassing auto-detect) by routing load_model through crispasr_session_open_explicit, unlocking architectures the detector won't pick on its own (qwen3, cohere, granite, voxtral, moonshine, mimo-asr, orpheus, kokoro, chatterbox, etc.).
  - codec:<path> loads a companion file (qwen3-tts codec, orpheus SNAC, chatterbox s3gen, or mimo-asr tokenizer) via the universal crispasr_session_set_codec_path setter after the session opens. A relative path resolves against the model directory. rc==0 means success or not-applicable; only a negative rc is fatal.

  The C shim load_model gains a backend_name argument and a new set_codec_path entry point; the Go bridge parses the prefix:value options and registers the new symbol. The vad_only path is unchanged.

* feat(gallery): expand CrispASR models via backend:/codec: options (explicit arch + companions)

* refactor(gallery): use virtual.yaml base for crispasr models

  The crispasr entries are just backend + model + a couple options, fully expressed inline via overrides:/files: in gallery/index.yaml. Point each url: at the shared gallery/virtual.yaml (the established 'virtual' model trick) and drop the 36 redundant per-model gallery/*-crispasr.yaml files.

* fix(gallery): drop voice-requiring TTS entries (keep vibevoice-tts)

  Real e2e showed qwen3-tts/orpheus/chatterbox don't synthesize through the current shim: the codec: companion loads fine, but these engines additionally need a voice pack / voice prompt / reference clip (qwen3-tts base errors 'no voice'; chatterbox is zero-shot cloning; orpheus uses named voices) that the backend doesn't wire. (qwen3-tts also can't auto-detect: its GGUF arch is 'qwen3tts', unmapped by the detector, so it would need backend:qwen3-tts.) Removed to avoid shipping non-working gallery entries; vibevoice-tts (built-in voice, e2e-verified) remains the working TTS. Voice-pack wiring is a follow-up.

* feat(crispasr): speaker: and voice: TTS options (baked speakers + voice packs/prompts)

  speaker:<name> -> crispasr_session_set_speaker_name (baked speakers: qwen3-tts CustomVoice, orpheus). voice:<path>(+voice_text:<ref>) -> crispasr_session_set_voice (voice-pack GGUF, or WAV zero-shot clone with ref text). Applied at Load as the default voice; req.Voice still overrides the speaker per request.

* feat(gallery): re-add e2e-verified TTS engines (chatterbox, qwen3-tts-customvoice, orpheus)

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>

254 lines
8.2 KiB
C++
#include "crispasr_shim.h"
#include "ggml-backend.h"
#include "crispasr.h"
#include <atomic>
#include <cstdint> // int64_t, uint32_t
#include <cstdio>  // fprintf, fputs, fflush
#include <vector>

// Opaque session types. crispasr.h declares `struct crispasr_session;` but
// neither the result type nor the open/transcribe/result accessors. Those are
// CA_EXPORT extern "C" symbols in src/crispasr_c_api.cpp, so we forward-declare
// exactly the ones we use. Signatures verified against
// sources/CrispASR/src/crispasr_c_api.cpp.
struct crispasr_session_result;
extern "C" {
crispasr_session *crispasr_session_open(const char *model_path, int n_threads);
crispasr_session *crispasr_session_open_explicit(const char *model_path,
                                                 const char *backend_name,
                                                 int n_threads);
int crispasr_session_set_codec_path(crispasr_session *s, const char *path);
void crispasr_session_close(crispasr_session *s);
const char *crispasr_session_backend(crispasr_session *s);
int crispasr_session_set_translate(crispasr_session *s, int enable);
crispasr_session_result *crispasr_session_transcribe_lang(
    crispasr_session *s, const float *pcm, int n_samples, const char *language);
int crispasr_session_result_n_segments(crispasr_session_result *r);
const char *crispasr_session_result_segment_text(crispasr_session_result *r,
                                                 int i);
int64_t crispasr_session_result_segment_t0(crispasr_session_result *r, int i);
int64_t crispasr_session_result_segment_t1(crispasr_session_result *r, int i);
void crispasr_session_result_free(crispasr_session_result *r);
float *crispasr_session_synthesize(crispasr_session *s, const char *text,
                                   int *out_n_samples);
void crispasr_pcm_free(float *pcm);
int crispasr_session_set_speaker_name(crispasr_session *s, const char *name);
int crispasr_session_set_voice(crispasr_session *s, const char *path,
                               const char *ref_text_or_null);
}

static crispasr_session *g_session = nullptr;
static crispasr_session_result *g_result = nullptr;

static struct whisper_vad_context *vctx;
static std::vector<float> flat_segs;

static std::atomic<int> g_abort{0};

extern "C" void set_abort(int v) {
  g_abort.store(v, std::memory_order_relaxed);
}

static void ggml_log_cb(enum ggml_log_level level, const char *log,
                        void *data) {
  const char *level_str;

  if (!log) {
    return;
  }

  switch (level) {
  case GGML_LOG_LEVEL_DEBUG:
    level_str = "DEBUG";
    break;
  case GGML_LOG_LEVEL_INFO:
    level_str = "INFO";
    break;
  case GGML_LOG_LEVEL_WARN:
    level_str = "WARN";
    break;
  case GGML_LOG_LEVEL_ERROR:
    level_str = "ERROR";
    break;
  default: /* Potential future-proofing */
    level_str = "?????";
    break;
  }

  fprintf(stderr, "[%-5s] ", level_str);
  fputs(log, stderr);
  fflush(stderr);
}

int load_model(const char *const model_path, int threads,
               const char *backend_name) {
  whisper_log_set(ggml_log_cb, nullptr);
  ggml_backend_load_all();

  if (backend_name && *backend_name) {
    g_session =
        crispasr_session_open_explicit(model_path, backend_name, threads);
  } else {
    g_session = crispasr_session_open(model_path, threads);
  }
  if (g_session == nullptr) {
    fprintf(stderr, "error: failed to open CrispASR session for model\n");
    return 1;
  }

  fprintf(stderr, "info: CrispASR backend selected: %s\n",
          crispasr_session_backend(g_session));
  return 0;
}

// set_codec_path forwards a companion file (qwen3-tts codec, orpheus SNAC,
// chatterbox s3gen, or mimo-asr tokenizer) to the active session. Returns 0 on
// success or when the active backend needs no companion, negative on failure,
// and -1 when no session is open.
int set_codec_path(const char *path) {
  return g_session ? crispasr_session_set_codec_path(g_session, path) : -1;
}

int load_model_vad(const char *const model_path) {
  whisper_log_set(ggml_log_cb, nullptr);
  ggml_backend_load_all();

  struct whisper_vad_context_params vcparams =
      whisper_vad_default_context_params();

  // XXX: Overridden to false in upstream due to performance?
  // vcparams.use_gpu = true;

  vctx = whisper_vad_init_from_file_with_params(model_path, vcparams);
  if (vctx == nullptr) {
    fprintf(stderr, "error: Failed to init model as VAD\n");
    return 1;
  }

  return 0;
}

// vad runs speech detection on pcmf32 and returns the segments as a flat
// array of [t0, t1] pairs. *segs_out points into flat_segs, which is owned by
// the shim and stays valid only until the next vad call.
int vad(float pcmf32[], size_t pcmf32_len, float **segs_out,
        size_t *segs_out_len) {
  if (!whisper_vad_detect_speech(vctx, pcmf32, pcmf32_len)) {
    fprintf(stderr, "error: failed to detect speech\n");
    return 1;
  }

  struct whisper_vad_params params = whisper_vad_default_params();
  struct whisper_vad_segments *segs =
      whisper_vad_segments_from_probs(vctx, params);
  size_t segn = whisper_vad_segments_n_segments(segs);

  // fprintf(stderr, "Got segments %zd\n", segn);

  flat_segs.clear();

  // Index with size_t to avoid a signed/unsigned comparison against segn.
  for (size_t i = 0; i < segn; i++) {
    flat_segs.push_back(whisper_vad_segments_get_segment_t0(segs, (int)i));
    flat_segs.push_back(whisper_vad_segments_get_segment_t1(segs, (int)i));
  }

  // fprintf(stderr, "setting out variables: %p=%p -> %p, %p=%zx -> %zx\n",
  //         segs_out, *segs_out, flat_segs.data(), segs_out_len, *segs_out_len,
  //         flat_segs.size());
  *segs_out = flat_segs.data();
  *segs_out_len = flat_segs.size();

  // fprintf(stderr, "freeing segs\n");
  whisper_vad_free_segments(segs);

  // fprintf(stderr, "returning\n");
  return 0;
}

// threads, diarize and prompt are accepted for Go-side API parity but unused
// in Phase 1: the thread count is fixed at session open, and diarization and
// the initial prompt are separate CrispASR features not yet wired through the
// session ASR path.
int transcribe(uint32_t threads, char *lang, bool translate, bool diarize,
               float pcmf32[], size_t pcmf32_len, size_t *segs_out_len,
               char *prompt) {
  (void)threads;
  (void)diarize;
  (void)prompt;

  if (!g_session) {
    return 1;
  }

  // Reset stale abort flag from any prior cancelled call. set_abort remains
  // best-effort: the session transcribe call is blocking and exposes no abort
  // hook, so a mid-decode abort cannot interrupt it.
  g_abort.store(0, std::memory_order_relaxed);

  crispasr_session_set_translate(g_session, translate ? 1 : 0);

  if (g_result) {
    crispasr_session_result_free(g_result);
    g_result = nullptr;
  }

  const char *language = (lang && *lang) ? lang : nullptr;
  g_result = crispasr_session_transcribe_lang(g_session, pcmf32, (int)pcmf32_len,
                                              language);
  if (!g_result) {
    fprintf(stderr, "error: transcription failed\n");
    return 1;
  }

  *segs_out_len = crispasr_session_result_n_segments(g_result);
  return 0;
}

const char *get_segment_text(int i) {
  if (!g_result) {
    return "";
  }
  return crispasr_session_result_segment_text(g_result, i);
}

int64_t get_segment_t0(int i) {
  if (!g_result) {
    return 0;
  }
  return crispasr_session_result_segment_t0(g_result, i);
}

int64_t get_segment_t1(int i) {
  if (!g_result) {
    return 0;
  }
  return crispasr_session_result_segment_t1(g_result, i);
}

const char *get_backend(void) {
  return g_session ? crispasr_session_backend(g_session) : "";
}

// TTS uses the already-open session (crispasr_session_open auto-detects a TTS
// model). Output is 24 kHz mono float PCM (upstream CrispASR convention),
// malloc'd by the C API; the caller must release it via tts_free.
float *tts_synthesize(const char *text, int *out_n_samples) {
  if (out_n_samples) *out_n_samples = 0;
  if (!g_session || !text) return nullptr;
  return crispasr_session_synthesize(g_session, text, out_n_samples);
}

void tts_free(float *pcm) {
  if (pcm) crispasr_pcm_free(pcm);
}

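// tts_set_voice selects a baked speaker by name. It is a no-op returning 0
// when no session is open or name is null or empty; otherwise it returns the
// result code of crispasr_session_set_speaker_name.
int tts_set_voice(const char *name) {
  if (!g_session || !name || !*name) return 0;
  return crispasr_session_set_speaker_name(g_session, name);
}
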
// tts_set_voice_file loads a voice from a file: a .gguf path selects a voice
// pack, a .wav path with a non-empty ref_text performs zero-shot voice cloning
// (the C API returns -2 when ref_text is required but missing). Returns -1 when
// no session is open or path is null.
int tts_set_voice_file(const char *path, const char *ref_text) {
  if (!g_session || !path) return -1;
  const char *ref = (ref_text && *ref_text) ? ref_text : nullptr;
  return crispasr_session_set_voice(g_session, path, ref);
}