From 76fe0bb9294dafb52c3a8d4d4ccbaf927999be3c Mon Sep 17 00:00:00 2001
From: "LocalAI [bot]" <139863280+localai-bot@users.noreply.github.com>
Date: Sun, 31 May 2026 12:11:03 +0200
Subject: [PATCH] =?UTF-8?q?feat(crispasr):=20add=20CrispASR=20backend=20?=
 =?UTF-8?q?=E2=80=94=20multi-architecture=20ASR=20+=20TTS=20(#10099)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* feat(crispasr): backend source files (Go gRPC server, C-ABI shim, build files)

Signed-off-by: Ettore Di Giacinto

* polish(crispasr): brand error strings + fix stale shim comment

Signed-off-by: Ettore Di Giacinto

* build(crispasr): register backend in root Makefile

Mirror the whisper Go backend registration for the new crispasr backend: NOTPARALLEL entry, prepare-test-extra/test-extra hooks, BACKEND_CRISPASR definition, docker-build target generation, and the docker-build-backends aggregate target.

Signed-off-by: Ettore Di Giacinto

* ci(crispasr): add backend build matrix entries

Mirror the 11 whisper golang Dockerfile matrix entries (CPU amd64/arm64, CUDA 12/13, L4T CUDA 13, Intel SYCL f32/f16, Vulkan amd64/arm64, L4T arm64, ROCm hipblas) with backend and tag-suffix substituted to crispasr.

Signed-off-by: Ettore Di Giacinto

* feat(gallery): add crispasr backend gallery entries

Add the crispasr meta anchor and its full set of image gallery entries (cpu, metal, cuda12/13, rocm, intel-sycl f32/f16, vulkan, L4T arm64, L4T cuda13 arm64, plus -development variants), mirroring the whisper backend gallery block.

Signed-off-by: Ettore Di Giacinto

* ci(crispasr): bump CRISPASR_VERSION via bump_deps workflow

Track CrispStrobe/CrispASR main branch and bump CRISPASR_VERSION in backend/go/crispasr/Makefile.
Signed-off-by: Ettore Di Giacinto

* build(crispasr): don't wire fixture-gated test into test-extra

Mirror the whisper Go backend: its AudioTranscription test is gated on model/audio fixtures and skips in CI, so building crispasr (the heaviest ggml compile in the tree) inside the unit-test lane adds a long compile for zero coverage. The backend image build in backend-matrix.yml remains the authoritative compile check.

Signed-off-by: Ettore Di Giacinto

* ci(crispasr): add darwin metal build entry (mirror whisper)

The metal-crispasr gallery entries and capabilities.metal mapping reference -metal-darwin-arm64-crispasr, which is only produced by an includeDarwin entry. Mirror whisper's darwin metal entry so the tag actually gets built.

Signed-off-by: Ettore Di Giacinto

* ci(crispasr): place hipblas matrix entry next to whisper twin

Signed-off-by: Ettore Di Giacinto

* feat(crispasr): register crispasr as pref-only ASR backend + test

Signed-off-by: Ettore Di Giacinto

* test(crispasr): port whisper behavioral suite (cancellation + streaming)

Signed-off-by: Ettore Di Giacinto

* test(crispasr): fix skip message env var names to CRISPASR_*

Signed-off-by: Ettore Di Giacinto

* feat(crispasr): switch shim to crispasr_session_* multi-architecture API

The shim used whisper_full(), which in CrispASR is the whisper-only path: libcrispasr only transcribes Whisper GGUFs through it. Multi-architecture transcription (Parakeet, Voxtral, Qwen3-ASR, Canary, Granite, FunASR, Paraformer, SenseVoice, ...) goes through the crispasr_session_* C-ABI, which auto-detects the architecture from the GGUF and dispatches to the matching backend.

Rewrite the C shim around crispasr_session_open / _transcribe_lang / _result_* and add get_backend() so the selected backend is logged. load_model now takes a threads param (session_open binds n_threads at open).
The session result is segment+word based with no token IDs and no per-decode callback, so drop n_tokens / get_token_id / get_segment_speaker_turn_next / set_new_segment_callback. set_abort is kept for API parity but is best-effort: the session transcribe is blocking with no abort hook.

Update the purego bindings and gocrispasr.go to match: tokens are left empty, speaker-turn handling is removed, and AudioTranscriptionStream emits one delta per non-empty segment after the blocking decode returns (no progressive streaming via the session API), preserving the concat(deltas) == final.Text invariant.

crispasr_session_set_translate is exported by libcrispasr but not declared in crispasr.h, so it is forward-declared in the shim alongside the open/transcribe/result functions.

Signed-off-by: Ettore Di Giacinto

* build(crispasr): link full CrispASR backend set for multi-arch support

The shim's crispasr_session_* dispatch calls into the per-architecture backend libs (parakeet, voxtral, qwen3_asr, canary, funasr, paraformer, sensevoice, ...), which CrispASR builds as static archives. Linking only crispasr + ggml dead-stripped every backend object from the final module (nm backend-symbol count: 0), leaving a whisper-only .so.

Link the same backend set as crispasr-cli so the static archives are pulled in. After this the module carries the backend symbols (nm count 407, .so grows from ~2.1MB to ~6.7MB) and the session API can dispatch to every compiled-in architecture.

Also rewrite ${CMAKE_SOURCE_DIR}/examples/talk-llama to ${PROJECT_SOURCE_DIR}/... in the vendored src/CMakeLists.txt: CrispASR locates its vendored llama.cpp via ${CMAKE_SOURCE_DIR}, which is wrong when CrispASR is add_subdirectory'd (CMAKE_SOURCE_DIR points at this backend dir, not the CrispASR root). PROJECT_SOURCE_DIR is correct both standalone and as a subproject; the sed is idempotent.
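
Why the path rewrite is idempotent can be sketched in Go. This is an illustration only: the backend Makefile uses sed, the helper name rewriteTalkLlamaPath and the sample CMake line are hypothetical, and strings.ReplaceAll replaces every occurrence whereas the sed expression replaces the first one per line.

```go
package main

import (
	"fmt"
	"strings"
)

// rewriteTalkLlamaPath is a hypothetical Go equivalent of the sed rewrite:
// it swaps the CMAKE_SOURCE_DIR-based talk-llama path for the
// PROJECT_SOURCE_DIR-based one.
func rewriteTalkLlamaPath(cmake string) string {
	const from = "${CMAKE_SOURCE_DIR}/examples/talk-llama"
	const to = "${PROJECT_SOURCE_DIR}/examples/talk-llama"
	return strings.ReplaceAll(cmake, from, to)
}

func main() {
	// Sample input line, made up for this sketch.
	in := "add_subdirectory(${CMAKE_SOURCE_DIR}/examples/talk-llama llama)"
	once := rewriteTalkLlamaPath(in)
	twice := rewriteTalkLlamaPath(once)
	fmt.Println(once)
	// The replacement text no longer contains the search text, so a second
	// pass finds nothing to change.
	fmt.Println("idempotent:", once == twice)
}
```

Because the output never contains the original pattern, re-running the step against an already patched checkout leaves the file unchanged.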
Signed-off-by: Ettore Di Giacinto

* test(crispasr): adapt suite to session API (blocking, no decode callback)

Register the new symbol set (drop the removed token/speaker/callback funcs, add get_backend; load_model now takes 2 args). The session transcribe is blocking with no abort hook, so a mid-decode cancel can't interrupt it: change the cancellation spec to cancel the context before the call and assert codes.Canceled from the pre-call ctx.Err() check, dropping the <5s mid-decode timing assertion. The streaming spec still holds with per-segment post-decode emission (>=2 deltas, concat(deltas) == final.Text).

Signed-off-by: Ettore Di Giacinto

* feat(gallery): add CrispASR ASR model entries (-crispasr)

Signed-off-by: Ettore Di Giacinto

* fix(gallery): keep only session-auto-detectable CrispASR ASR models

The crispasr backend loads models via crispasr_session_open, which auto-detects the backend from the GGUF general.architecture using crispasr_detect_backend_from_gguf. Architectures not in that detect map cannot be opened, so those gallery entries fail to load.

Removed entries whose architecture is not wired into CrispASR v0.6.11's session auto-detect router (they can be re-added when upstream maps them):

- Not in the detect map: data2vec, firered-asr, funasr, fun-asr-mlt-nano, glm-asr, hubert, kyutai-stt, mega-asr, mimo-asr, moonshine{,-de,-streaming,-tiny-de}, omniasr{,-llm,-llm-1b}, paraformer, sensevoice.
- Pending verification (filename-heuristic routed, not arch-detected): parakeet-ctc-0.6b, parakeet-ctc-1.1b. Their GGUFs are routed to the fastconformer-ctc backend by a filename heuristic in the model registry, which implies general.architecture is not a mapped string.

Kept the parakeet rnnt/tdt_ctc variants: convert-parakeet-to-gguf.py writes general.architecture="parakeet" unconditionally and encodes the rnnt/ctc distinction in metadata fields, so they session-auto-detect.
Signed-off-by: Ettore Di Giacinto

* feat(crispasr): TTS synthesis via crispasr_session_synthesize (24kHz)

Add tts_synthesize/tts_free/tts_set_voice to the C-ABI shim. They reuse the already-open g_session (crispasr_session_open auto-detects a TTS model) and dispatch to the upstream synthesis call, which returns malloc'd 24 kHz mono float PCM. Orpheus needs a SNAC codec path that we do not set, so it returns NULL here and surfaces as an error Go-side.

Signed-off-by: Ettore Di Giacinto

* feat(crispasr): implement TTS/TTSStream gRPC methods

Bind the new shim functions via purego and implement TTS, TTSStream and a writeWAV24k helper. synthesize copies the C-owned PCM out before freeing it; TTS writes a 24 kHz mono 16-bit WAV to req.Dst via go-audio/wav. CrispASR has no progressive synth, so TTSStream synthesizes fully, encodes to WAV, and emits the bytes as a single chunk; it owns the results-channel close (the gRPC server wrapper ranges until close), mirroring vibevoice-cpp's TTSStream.

Signed-off-by: Ettore Di Giacinto

* feat(crispasr): log when a TTS voice override is not honored

Signed-off-by: Ettore Di Giacinto

* feat(gallery): add CrispASR vibevoice-tts model entry

Only vibevoice-tts works through the current shim: qwen3-tts, chatterbox, and orpheus require companion codec/s3gen/SNAC paths (set_codec_path / set_s3gen_path) that the shim doesn't wire yet, and kokoro/indextts/voxcpm2 aren't in the session auto-detect map. Those are follow-ups.
Signed-off-by: Ettore Di Giacinto

* test(crispasr): gated TTS synthesis spec

Signed-off-by: Ettore Di Giacinto

* fix(crispasr): satisfy golangci-lint (errcheck defers + unsafeptr nolint)

The crispasr Go file is entirely new, so new-from-merge-base lints every line (unlike the grandfathered whisper backend it was forked from):

- handle os.RemoveAll / fh.Close return values in AudioTranscription
- annotate the two intentional C-pointer unsafe.Slice sites with //nolint:govet

Signed-off-by: Ettore Di Giacinto

* feat(crispasr): backend: and codec: model options (explicit arch + companion files)

Add two model-config options to the CrispASR backend via opts.Options:

- backend: selects an explicit CrispASR backend (bypassing auto-detect) by routing load_model through crispasr_session_open_explicit, unlocking architectures the detector won't pick on its own (qwen3, cohere, granite, voxtral, moonshine, mimo-asr, orpheus, kokoro, chatterbox, etc.).
- codec: loads a companion file (qwen3-tts codec, orpheus SNAC, chatterbox s3gen, or mimo-asr tokenizer) via the universal crispasr_session_set_codec_path setter after the session opens. A relative path resolves against the model directory. rc==0 means success or not-applicable; only a negative rc is fatal.

The C shim load_model gains a backend_name argument and a new set_codec_path entry point; the Go bridge parses the prefix:value options and registers the new symbol. The vad_only path is unchanged.

Signed-off-by: Ettore Di Giacinto

* feat(gallery): expand CrispASR models via backend:/codec: options (explicit arch + companions)

Signed-off-by: Ettore Di Giacinto

* refactor(gallery): use virtual.yaml base for crispasr models

The crispasr entries are just backend + model + a couple options, fully expressed inline via overrides:/files: in gallery/index.yaml. Point each url: at the shared gallery/virtual.yaml (the established 'virtual' model trick) and drop the 36 redundant per-model gallery/*-crispasr.yaml files.
Signed-off-by: Ettore Di Giacinto

* fix(gallery): drop voice-requiring TTS entries (keep vibevoice-tts)

Real e2e showed qwen3-tts/orpheus/chatterbox don't synthesize through the current shim: the codec: companion loads fine, but these engines additionally need a voice pack / voice prompt / reference clip (qwen3-tts base errors 'no voice'; chatterbox is zero-shot cloning; orpheus uses named voices) that the backend doesn't wire. (qwen3-tts also can't auto-detect: its GGUF arch is 'qwen3tts', unmapped by the detector — would need backend:qwen3-tts.) Removed to avoid shipping non-working gallery entries; vibevoice-tts (built-in voice, e2e-verified) remains the working TTS. Voice-pack wiring is a follow-up.

Signed-off-by: Ettore Di Giacinto

* feat(crispasr): speaker: and voice: TTS options (baked speakers + voice packs/prompts)

speaker: -> crispasr_session_set_speaker_name (baked speakers: qwen3-tts CustomVoice, orpheus). voice:(+voice_text:) -> crispasr_session_set_voice (voice-pack GGUF, or WAV zero-shot clone with ref text). Applied at Load as the default voice; req.Voice still overrides the speaker per request.
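
The precedence between the load-time default and the per-request voice can be sketched as follows. This is a minimal illustration under assumptions: ttsDefaults and resolveSpeaker are hypothetical names, the speaker names are placeholders, and the sketch assumes an empty request voice means "no override".

```go
package main

import "fmt"

// ttsDefaults holds what the speaker:, voice: and voice_text: options set
// at load time. Field names are hypothetical.
type ttsDefaults struct {
	Speaker   string // from speaker:
	VoicePath string // from voice:
	VoiceText string // from voice_text:
}

// resolveSpeaker picks the speaker for one request: a non-empty request
// voice wins, otherwise the load-time default applies.
func resolveSpeaker(d ttsDefaults, reqVoice string) string {
	if reqVoice != "" {
		return reqVoice
	}
	return d.Speaker
}

func main() {
	d := ttsDefaults{Speaker: "speaker-a"} // placeholder name
	fmt.Println(resolveSpeaker(d, ""))
	fmt.Println(resolveSpeaker(d, "speaker-b"))
}
```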
Signed-off-by: Ettore Di Giacinto

* feat(gallery): re-add e2e-verified TTS engines (chatterbox, qwen3-tts-customvoice, orpheus)

Signed-off-by: Ettore Di Giacinto

---------

Signed-off-by: Ettore Di Giacinto
Co-authored-by: Ettore Di Giacinto
---
 .github/backend-matrix.yml                  | 151 ++++
 .github/workflows/bump_deps.yaml            | 4 +
 Makefile                                    | 6 +-
 backend/go/crispasr/.gitignore              | 5 +
 backend/go/crispasr/CMakeLists.txt          | 30 +
 backend/go/crispasr/Makefile                | 132 +++
 backend/go/crispasr/cpp/crispasr_shim.cpp   | 253 ++++++
 backend/go/crispasr/cpp/crispasr_shim.h     | 23 +
 backend/go/crispasr/gocrispasr.go           | 497 ++++++++++++
 backend/go/crispasr/gocrispasr_test.go      | 193 +++++
 backend/go/crispasr/main.go                 | 58 ++
 backend/go/crispasr/package.sh              | 65 ++
 backend/go/crispasr/run.sh                  | 52 ++
 backend/index.yaml                          | 152 ++++
 core/http/endpoints/localai/backend.go      | 1 +
 core/http/endpoints/localai/backend_test.go | 1 +
 gallery/index.yaml                          | 841 ++++++++++++++++++++
 17 files changed, 2462 insertions(+), 2 deletions(-)
 create mode 100644 backend/go/crispasr/.gitignore
 create mode 100644 backend/go/crispasr/CMakeLists.txt
 create mode 100644 backend/go/crispasr/Makefile
 create mode 100644 backend/go/crispasr/cpp/crispasr_shim.cpp
 create mode 100644 backend/go/crispasr/cpp/crispasr_shim.h
 create mode 100644 backend/go/crispasr/gocrispasr.go
 create mode 100644 backend/go/crispasr/gocrispasr_test.go
 create mode 100644 backend/go/crispasr/main.go
 create mode 100755 backend/go/crispasr/package.sh
 create mode 100755 backend/go/crispasr/run.sh

diff --git a/.github/backend-matrix.yml b/.github/backend-matrix.yml
index de41aba52..9d7e36259 100644
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
@@ -716,6 +716,19 @@ include:
 dockerfile: "./backend/Dockerfile.golang" context: "./" ubuntu-version: '2404' + - build-type: 'cublas' + cuda-major-version: "12" + cuda-minor-version: "8" + platforms: 'linux/amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-nvidia-cuda-12-crispasr' + runs-on: 'ubuntu-latest' +
base-image: "ubuntu:24.04" + skip-drivers: 'false' + backend: "crispasr" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' - build-type: 'cublas' cuda-major-version: "12" cuda-minor-version: "8" @@ -1569,6 +1582,19 @@ include: dockerfile: "./backend/Dockerfile.golang" context: "./" ubuntu-version: '2404' + - build-type: 'cublas' + cuda-major-version: "13" + cuda-minor-version: "0" + platforms: 'linux/amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-nvidia-cuda-13-crispasr' + runs-on: 'ubuntu-latest' + base-image: "ubuntu:24.04" + skip-drivers: 'false' + backend: "crispasr" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' - build-type: 'cublas' cuda-major-version: "13" cuda-minor-version: "0" @@ -1595,6 +1621,19 @@ include: backend: "whisper" dockerfile: "./backend/Dockerfile.golang" context: "./" + - build-type: 'cublas' + cuda-major-version: "13" + cuda-minor-version: "0" + platforms: 'linux/arm64' + skip-drivers: 'false' + tag-latest: 'auto' + tag-suffix: '-nvidia-l4t-cuda-13-arm64-crispasr' + base-image: "ubuntu:24.04" + ubuntu-version: '2404' + runs-on: 'ubuntu-24.04-arm' + backend: "crispasr" + dockerfile: "./backend/Dockerfile.golang" + context: "./" - build-type: 'cublas' cuda-major-version: "13" cuda-minor-version: "0" @@ -2889,6 +2928,20 @@ include: dockerfile: "./backend/Dockerfile.golang" context: "./" ubuntu-version: '2404' + - build-type: '' + cuda-major-version: "" + cuda-minor-version: "" + platforms: 'linux/amd64' + platform-tag: 'amd64' + tag-latest: 'auto' + tag-suffix: '-cpu-crispasr' + runs-on: 'ubuntu-latest' + base-image: "ubuntu:24.04" + skip-drivers: 'false' + backend: "crispasr" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' - build-type: '' cuda-major-version: "" cuda-minor-version: "" @@ -2903,6 +2956,20 @@ include: dockerfile: "./backend/Dockerfile.golang" context: "./" ubuntu-version: '2404' + - build-type: '' + 
cuda-major-version: "" + cuda-minor-version: "" + platforms: 'linux/arm64' + platform-tag: 'arm64' + tag-latest: 'auto' + tag-suffix: '-cpu-crispasr' + runs-on: 'ubuntu-24.04-arm' + base-image: "ubuntu:24.04" + skip-drivers: 'false' + backend: "crispasr" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' - build-type: 'sycl_f32' cuda-major-version: "" cuda-minor-version: "" @@ -2916,6 +2983,19 @@ include: dockerfile: "./backend/Dockerfile.golang" context: "./" ubuntu-version: '2404' + - build-type: 'sycl_f32' + cuda-major-version: "" + cuda-minor-version: "" + platforms: 'linux/amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-intel-sycl-f32-crispasr' + runs-on: 'ubuntu-latest' + base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04" + skip-drivers: 'false' + backend: "crispasr" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' - build-type: 'sycl_f16' cuda-major-version: "" cuda-minor-version: "" @@ -2929,6 +3009,19 @@ include: dockerfile: "./backend/Dockerfile.golang" context: "./" ubuntu-version: '2404' + - build-type: 'sycl_f16' + cuda-major-version: "" + cuda-minor-version: "" + platforms: 'linux/amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-intel-sycl-f16-crispasr' + runs-on: 'ubuntu-latest' + base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04" + skip-drivers: 'false' + backend: "crispasr" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' - build-type: 'vulkan' cuda-major-version: "" cuda-minor-version: "" @@ -2943,6 +3036,20 @@ include: dockerfile: "./backend/Dockerfile.golang" context: "./" ubuntu-version: '2404' + - build-type: 'vulkan' + cuda-major-version: "" + cuda-minor-version: "" + platforms: 'linux/amd64' + platform-tag: 'amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-vulkan-crispasr' + runs-on: 'ubuntu-latest' + base-image: "ubuntu:24.04" + skip-drivers: 'false' + backend: "crispasr" + dockerfile: 
"./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' - build-type: 'vulkan' cuda-major-version: "" cuda-minor-version: "" @@ -2957,6 +3064,20 @@ include: dockerfile: "./backend/Dockerfile.golang" context: "./" ubuntu-version: '2404' + - build-type: 'vulkan' + cuda-major-version: "" + cuda-minor-version: "" + platforms: 'linux/arm64' + platform-tag: 'arm64' + tag-latest: 'auto' + tag-suffix: '-gpu-vulkan-crispasr' + runs-on: 'ubuntu-24.04-arm' + base-image: "ubuntu:24.04" + skip-drivers: 'false' + backend: "crispasr" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' - build-type: 'cublas' cuda-major-version: "12" cuda-minor-version: "0" @@ -2970,6 +3091,19 @@ include: dockerfile: "./backend/Dockerfile.golang" context: "./" ubuntu-version: '2204' + - build-type: 'cublas' + cuda-major-version: "12" + cuda-minor-version: "0" + platforms: 'linux/arm64' + skip-drivers: 'false' + tag-latest: 'auto' + tag-suffix: '-nvidia-l4t-arm64-crispasr' + base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0" + runs-on: 'ubuntu-24.04-arm' + backend: "crispasr" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2204' - build-type: 'hipblas' cuda-major-version: "" cuda-minor-version: "" @@ -2983,6 +3117,19 @@ include: dockerfile: "./backend/Dockerfile.golang" context: "./" ubuntu-version: '2404' + - build-type: 'hipblas' + cuda-major-version: "" + cuda-minor-version: "" + platforms: 'linux/amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-rocm-hipblas-crispasr' + base-image: "rocm/dev-ubuntu-24.04:7.2.1" + runs-on: 'ubuntu-latest' + skip-drivers: 'false' + backend: "crispasr" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' # parakeet-cpp - build-type: '' cuda-major-version: "" @@ -4124,6 +4271,10 @@ includeDarwin: tag-suffix: "-metal-darwin-arm64-whisper" build-type: "metal" lang: "go" + - backend: "crispasr" + tag-suffix: "-metal-darwin-arm64-crispasr" + build-type: "metal" + 
lang: "go" - backend: "parakeet-cpp" tag-suffix: "-metal-darwin-arm64-parakeet-cpp" build-type: "metal" diff --git a/.github/workflows/bump_deps.yaml b/.github/workflows/bump_deps.yaml index 95612ae5b..5f1ac0c21 100644 --- a/.github/workflows/bump_deps.yaml +++ b/.github/workflows/bump_deps.yaml @@ -30,6 +30,10 @@ jobs: variable: "WHISPER_CPP_VERSION" branch: "master" file: "backend/go/whisper/Makefile" + - repository: "CrispStrobe/CrispASR" + variable: "CRISPASR_VERSION" + branch: "main" + file: "backend/go/crispasr/Makefile" - repository: "mudler/parakeet.cpp" variable: "PARAKEET_VERSION" branch: "master" diff --git a/Makefile b/Makefile index a401762cf..5ea1db5bd 100644 --- a/Makefile +++ b/Makefile @@ -1,5 +1,5 @@ # Disable parallel execution for backend builds -.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio +.NOTPARALLEL: backends/diffusers backends/llama-cpp 
backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio GOCMD=go GOTEST=$(GOCMD) test @@ -1162,6 +1162,7 @@ BACKEND_HUGGINGFACE = huggingface|golang|.|false|true BACKEND_SILERO_VAD = silero-vad|golang|.|false|true BACKEND_STABLEDIFFUSION_GGML = stablediffusion-ggml|golang|.|--progress=plain|true BACKEND_WHISPER = whisper|golang|.|false|true +BACKEND_CRISPASR = crispasr|golang|.|false|true BACKEND_PARAKEET_CPP = parakeet-cpp|golang|.|false|true BACKEND_VOXTRAL = voxtral|golang|.|false|true BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true @@ -1250,6 +1251,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE))) $(eval $(call generate-docker-build-target,$(BACKEND_SILERO_VAD))) $(eval $(call generate-docker-build-target,$(BACKEND_STABLEDIFFUSION_GGML))) $(eval $(call generate-docker-build-target,$(BACKEND_WHISPER))) +$(eval $(call generate-docker-build-target,$(BACKEND_CRISPASR))) $(eval $(call 
generate-docker-build-target,$(BACKEND_PARAKEET_CPP))) $(eval $(call generate-docker-build-target,$(BACKEND_VOXTRAL))) $(eval $(call generate-docker-build-target,$(BACKEND_OPUS))) @@ -1300,7 +1302,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX))) docker-save-%: backend-images docker save local-ai-backend:$* -o backend-images/$*.tar -docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy +docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo 
docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy ######################################################## ### Mock Backend for E2E Tests diff --git a/backend/go/crispasr/.gitignore b/backend/go/crispasr/.gitignore new file mode 100644 index 000000000..537510174 --- /dev/null +++ b/backend/go/crispasr/.gitignore @@ -0,0 +1,5 @@ +sources +build* +libgocrispasr*.so +crispasr +package diff --git a/backend/go/crispasr/CMakeLists.txt b/backend/go/crispasr/CMakeLists.txt new file mode 100644 index 000000000..5ed6a9c3f --- /dev/null +++ b/backend/go/crispasr/CMakeLists.txt @@ -0,0 +1,30 @@ +cmake_minimum_required(VERSION 3.12) +project(gocrispasr LANGUAGES C CXX) +set(CMAKE_POSITION_INDEPENDENT_CODE ON) +set(CMAKE_EXPORT_COMPILE_COMMANDS ON) + +add_subdirectory(./sources/CrispASR) + +add_library(gocrispasr MODULE cpp/crispasr_shim.cpp) +target_include_directories(gocrispasr PRIVATE + ${CMAKE_CURRENT_SOURCE_DIR}/sources/CrispASR/include + ${CMAKE_CURRENT_SOURCE_DIR}/sources/CrispASR/ggml/include) +# Link the same backend set as crispasr-cli (examples/cli/CMakeLists.txt) so +# the session API can dispatch to every compiled-in architecture, not just +# whisper. crispasr is the referencer; the backend static libs supply the +# per-architecture symbols; ggml is the math/runtime base. 
+target_link_libraries(gocrispasr PRIVATE + crispasr + parakeet canary canary_ctc cohere granite_speech granite_nle + voxtral voxtral4b qwen3_asr qwen3_tts orpheus chatterbox indextts + kokoro voxcpm2_tts m2m100 t5_translate wav2vec2-ggml vibevoice + silero-lid pyannote-seg funasr paraformer sensevoice + crisp_audio + ggml) + +if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0) + target_link_libraries(gocrispasr PRIVATE stdc++fs) +endif() + +set_property(TARGET gocrispasr PROPERTY CXX_STANDARD 17) +set_target_properties(gocrispasr PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}) diff --git a/backend/go/crispasr/Makefile b/backend/go/crispasr/Makefile new file mode 100644 index 000000000..3d57067b0 --- /dev/null +++ b/backend/go/crispasr/Makefile @@ -0,0 +1,132 @@ +CMAKE_ARGS?= +BUILD_TYPE?= +NATIVE?=false + +GOCMD?=go +GO_TAGS?= +JOBS?=$(shell nproc --ignore=1) + +# CrispASR version (release tag) +CRISPASR_REPO?=https://github.com/CrispStrobe/CrispASR +CRISPASR_VERSION?=v0.6.11 +SO_TARGET?=libgocrispasr.so + +CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF +# Keep the build lean: no tests/examples/server/SDL2/curl/ffmpeg (the FROM scratch +# image cannot satisfy those runtime deps). All ASR/TTS model backends stay enabled. 
+CMAKE_ARGS+=-DCRISPASR_BUILD_TESTS=OFF -DCRISPASR_BUILD_EXAMPLES=OFF -DCRISPASR_BUILD_SERVER=OFF +CMAKE_ARGS+=-DCRISPASR_SDL2=OFF -DCRISPASR_CURL=OFF -DCRISPASR_FFMPEG=OFF + +ifeq ($(NATIVE),false) + CMAKE_ARGS+=-DGGML_NATIVE=OFF +endif + +ifeq ($(BUILD_TYPE),cublas) + CMAKE_ARGS+=-DGGML_CUDA=ON +else ifeq ($(BUILD_TYPE),openblas) + CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS +else ifeq ($(BUILD_TYPE),clblas) + CMAKE_ARGS+=-DGGML_CLBLAST=ON -DCLBlast_DIR=/some/path +else ifeq ($(BUILD_TYPE),hipblas) + CMAKE_ARGS+=-DGGML_HIPBLAS=ON +else ifeq ($(BUILD_TYPE),vulkan) + CMAKE_ARGS+=-DGGML_VULKAN=ON +else ifeq ($(OS),Darwin) + ifneq ($(BUILD_TYPE),metal) + CMAKE_ARGS+=-DGGML_METAL=OFF + else + CMAKE_ARGS+=-DGGML_METAL=ON + CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON + endif +endif + +ifeq ($(BUILD_TYPE),sycl_f16) + CMAKE_ARGS+=-DGGML_SYCL=ON \ + -DCMAKE_C_COMPILER=icx \ + -DCMAKE_CXX_COMPILER=icpx \ + -DGGML_SYCL_F16=ON +endif + +ifeq ($(BUILD_TYPE),sycl_f32) + CMAKE_ARGS+=-DGGML_SYCL=ON \ + -DCMAKE_C_COMPILER=icx \ + -DCMAKE_CXX_COMPILER=icpx +endif + +sources/CrispASR: + mkdir -p sources/CrispASR + cd sources/CrispASR && \ + git init && \ + git remote add origin $(CRISPASR_REPO) && \ + git fetch origin && \ + git checkout $(CRISPASR_VERSION) && \ + git submodule update --init --recursive --depth 1 --single-branch + # CrispASR's src/CMakeLists.txt locates its vendored llama.cpp + # (crispasr-llama-core, used by the chat C-ABI) via ${CMAKE_SOURCE_DIR}, + # which assumes CrispASR is the top-level CMake project. We add_subdirectory + # it, so ${CMAKE_SOURCE_DIR} is THIS backend dir and the talk-llama sources + # aren't found. Rewrite to ${PROJECT_SOURCE_DIR} (the crispasr project root), + # which is correct both standalone and as a subproject. Idempotent. 
+ sed -i 's#\$${CMAKE_SOURCE_DIR}/examples/talk-llama#\$${PROJECT_SOURCE_DIR}/examples/talk-llama#' sources/CrispASR/src/CMakeLists.txt + +# Detect OS +UNAME_S := $(shell uname -s) + +ifeq ($(UNAME_S),Linux) + VARIANT_TARGETS = libgocrispasr-avx.so libgocrispasr-avx2.so libgocrispasr-avx512.so libgocrispasr-fallback.so +else + VARIANT_TARGETS = libgocrispasr-fallback.so +endif + +crispasr: main.go gocrispasr.go $(VARIANT_TARGETS) + CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o crispasr ./ + +package: crispasr + bash package.sh + +build: package + +clean: purge + rm -rf libgocrispasr*.so package sources/CrispASR crispasr + +purge: + rm -rf build* + +ifeq ($(UNAME_S),Linux) +libgocrispasr-avx.so: sources/CrispASR + $(MAKE) purge + $(info ${GREEN}I crispasr build info:avx${RESET}) + SO_TARGET=libgocrispasr-avx.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgocrispasr-custom + rm -rfv build* + +libgocrispasr-avx2.so: sources/CrispASR + $(MAKE) purge + $(info ${GREEN}I crispasr build info:avx2${RESET}) + SO_TARGET=libgocrispasr-avx2.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgocrispasr-custom + rm -rfv build* + +libgocrispasr-avx512.so: sources/CrispASR + $(MAKE) purge + $(info ${GREEN}I crispasr build info:avx512${RESET}) + SO_TARGET=libgocrispasr-avx512.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgocrispasr-custom + rm -rfv build* +endif + +libgocrispasr-fallback.so: sources/CrispASR + $(MAKE) purge + $(info ${GREEN}I crispasr build info:fallback${RESET}) + SO_TARGET=libgocrispasr-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgocrispasr-custom + rm -rfv build* + +libgocrispasr-custom: CMakeLists.txt 
cpp/crispasr_shim.cpp cpp/crispasr_shim.h + mkdir -p build-$(SO_TARGET) && \ + cd build-$(SO_TARGET) && \ + cmake .. $(CMAKE_ARGS) && \ + cmake --build . --config Release -j$(JOBS) && \ + cd .. && \ + mv build-$(SO_TARGET)/libgocrispasr.so ./$(SO_TARGET) + +test: crispasr + CGO_ENABLED=0 $(GOCMD) test -v ./... + +all: crispasr package diff --git a/backend/go/crispasr/cpp/crispasr_shim.cpp b/backend/go/crispasr/cpp/crispasr_shim.cpp new file mode 100644 index 000000000..bf6151ae1 --- /dev/null +++ b/backend/go/crispasr/cpp/crispasr_shim.cpp @@ -0,0 +1,253 @@ +#include "crispasr_shim.h" +#include "ggml-backend.h" +#include "crispasr.h" +#include +#include + +// Opaque session types. crispasr.h declares `struct crispasr_session;` but not +// the result type nor the open/transcribe/result accessors — those are +// CA_EXPORT extern "C" symbols in src/crispasr_c_api.cpp, so we forward-declare +// exactly the ones we use. Signatures verified against +// sources/CrispASR/src/crispasr_c_api.cpp. 
+struct crispasr_session_result; +extern "C" { +crispasr_session *crispasr_session_open(const char *model_path, int n_threads); +crispasr_session *crispasr_session_open_explicit(const char *model_path, + const char *backend_name, + int n_threads); +int crispasr_session_set_codec_path(crispasr_session *s, const char *path); +void crispasr_session_close(crispasr_session *s); +const char *crispasr_session_backend(crispasr_session *s); +int crispasr_session_set_translate(crispasr_session *s, int enable); +crispasr_session_result *crispasr_session_transcribe_lang( + crispasr_session *s, const float *pcm, int n_samples, const char *language); +int crispasr_session_result_n_segments(crispasr_session_result *r); +const char *crispasr_session_result_segment_text(crispasr_session_result *r, + int i); +int64_t crispasr_session_result_segment_t0(crispasr_session_result *r, int i); +int64_t crispasr_session_result_segment_t1(crispasr_session_result *r, int i); +void crispasr_session_result_free(crispasr_session_result *r); +float *crispasr_session_synthesize(crispasr_session *s, const char *text, + int *out_n_samples); +void crispasr_pcm_free(float *pcm); +int crispasr_session_set_speaker_name(crispasr_session *s, const char *name); +int crispasr_session_set_voice(crispasr_session *s, const char *path, + const char *ref_text_or_null); +} + +static crispasr_session *g_session = nullptr; +static crispasr_session_result *g_result = nullptr; + +static struct whisper_vad_context *vctx; +static std::vector<float> flat_segs; + +static std::atomic<int> g_abort{0}; + +extern "C" void set_abort(int v) { + g_abort.store(v, std::memory_order_relaxed); +} + +static void ggml_log_cb(enum ggml_log_level level, const char *log, + void *data) { + const char *level_str; + + if (!log) { + return; + } + + switch (level) { + case GGML_LOG_LEVEL_DEBUG: + level_str = "DEBUG"; + break; + case GGML_LOG_LEVEL_INFO: + level_str = "INFO"; + break; + case GGML_LOG_LEVEL_WARN: + level_str = "WARN"; + break; + case 
GGML_LOG_LEVEL_ERROR: + level_str = "ERROR"; + break; + default: /* Potential future-proofing */ + level_str = "?????"; + break; + } + + fprintf(stderr, "[%-5s] ", level_str); + fputs(log, stderr); + fflush(stderr); +} + +int load_model(const char *const model_path, int threads, + const char *backend_name) { + whisper_log_set(ggml_log_cb, nullptr); + ggml_backend_load_all(); + + if (backend_name && *backend_name) { + g_session = + crispasr_session_open_explicit(model_path, backend_name, threads); + } else { + g_session = crispasr_session_open(model_path, threads); + } + if (g_session == nullptr) { + fprintf(stderr, "error: failed to open CrispASR session for model\n"); + return 1; + } + + fprintf(stderr, "info: CrispASR backend selected: %s\n", + crispasr_session_backend(g_session)); + return 0; +} + +// set_codec_path forwards a companion file (qwen3-tts codec, orpheus SNAC, +// chatterbox s3gen, or mimo-asr tokenizer) to the active session. Returns 0 on +// success or when the active backend needs no companion, negative on failure, +// and -1 when no session is open. +int set_codec_path(const char *path) { + return g_session ? crispasr_session_set_codec_path(g_session, path) : -1; +} + +int load_model_vad(const char *const model_path) { + whisper_log_set(ggml_log_cb, nullptr); + ggml_backend_load_all(); + + struct whisper_vad_context_params vcparams = + whisper_vad_default_context_params(); + + // XXX: Overridden to false in upstream due to performance? 
+ // vcparams.use_gpu = true; + + vctx = whisper_vad_init_from_file_with_params(model_path, vcparams); + if (vctx == nullptr) { + fprintf(stderr, "error: Failed to init model as VAD\n"); + return 1; + } + + return 0; +} + +int vad(float pcmf32[], size_t pcmf32_len, float **segs_out, + size_t *segs_out_len) { + if (!whisper_vad_detect_speech(vctx, pcmf32, pcmf32_len)) { + fprintf(stderr, "error: failed to detect speech\n"); + return 1; + } + + struct whisper_vad_params params = whisper_vad_default_params(); + struct whisper_vad_segments *segs = + whisper_vad_segments_from_probs(vctx, params); + size_t segn = whisper_vad_segments_n_segments(segs); + + // fprintf(stderr, "Got segments %zd\n", segn); + + flat_segs.clear(); + + for (int i = 0; i < segn; i++) { + flat_segs.push_back(whisper_vad_segments_get_segment_t0(segs, i)); + flat_segs.push_back(whisper_vad_segments_get_segment_t1(segs, i)); + } + + // fprintf(stderr, "setting out variables: %p=%p -> %p, %p=%zx -> %zx\n", + // segs_out, *segs_out, flat_segs.data(), segs_out_len, *segs_out_len, + // flat_segs.size()); + *segs_out = flat_segs.data(); + *segs_out_len = flat_segs.size(); + + // fprintf(stderr, "freeing segs\n"); + whisper_vad_free_segments(segs); + + // fprintf(stderr, "returning\n"); + return 0; +} + +// threads, diarize and prompt are accepted for Go-side API parity but unused +// in Phase 1: the thread count is fixed at session open, and diarization and +// the initial prompt are separate CrispASR features not yet wired through the +// session ASR path. +int transcribe(uint32_t threads, char *lang, bool translate, bool diarize, + float pcmf32[], size_t pcmf32_len, size_t *segs_out_len, + char *prompt) { + (void)threads; + (void)diarize; + (void)prompt; + + if (!g_session) { + return 1; + } + + // Reset stale abort flag from any prior cancelled call. set_abort remains + // best-effort: the session transcribe call is blocking and exposes no abort + // hook, so a mid-decode abort cannot interrupt it. 
+ g_abort.store(0, std::memory_order_relaxed); + + crispasr_session_set_translate(g_session, translate ? 1 : 0); + + if (g_result) { + crispasr_session_result_free(g_result); + g_result = nullptr; + } + + const char *language = (lang && *lang) ? lang : nullptr; + g_result = crispasr_session_transcribe_lang(g_session, pcmf32, (int)pcmf32_len, + language); + if (!g_result) { + fprintf(stderr, "error: transcription failed\n"); + return 1; + } + + *segs_out_len = crispasr_session_result_n_segments(g_result); + return 0; +} + +const char *get_segment_text(int i) { + if (!g_result) { + return ""; + } + return crispasr_session_result_segment_text(g_result, i); +} + +int64_t get_segment_t0(int i) { + if (!g_result) { + return 0; + } + return crispasr_session_result_segment_t0(g_result, i); +} + +int64_t get_segment_t1(int i) { + if (!g_result) { + return 0; + } + return crispasr_session_result_segment_t1(g_result, i); +} + +const char *get_backend(void) { + return g_session ? crispasr_session_backend(g_session) : ""; +} + +// TTS uses the already-open session (crispasr_session_open auto-detects a TTS +// model). Output is 24 kHz mono float PCM (upstream CrispASR convention), +// malloc'd by the C API; the caller must release it via tts_free. +float *tts_synthesize(const char *text, int *out_n_samples) { + if (out_n_samples) *out_n_samples = 0; + if (!g_session || !text) return nullptr; + return crispasr_session_synthesize(g_session, text, out_n_samples); +} + +void tts_free(float *pcm) { + if (pcm) crispasr_pcm_free(pcm); +} + +int tts_set_voice(const char *name) { + if (!g_session || !name || !*name) return 0; + return crispasr_session_set_speaker_name(g_session, name); +} + +// tts_set_voice_file loads a voice from a file: a .gguf path selects a voice +// pack, a .wav path with a non-empty ref_text performs zero-shot voice cloning +// (the C API returns -2 when ref_text is required but missing). Returns -1 when +// no session is open or path is null. 
+int tts_set_voice_file(const char *path, const char *ref_text) { + if (!g_session || !path) return -1; + const char *ref = (ref_text && *ref_text) ? ref_text : nullptr; + return crispasr_session_set_voice(g_session, path, ref); +} diff --git a/backend/go/crispasr/cpp/crispasr_shim.h b/backend/go/crispasr/cpp/crispasr_shim.h new file mode 100644 index 000000000..7c593951a --- /dev/null +++ b/backend/go/crispasr/cpp/crispasr_shim.h @@ -0,0 +1,23 @@ +#include +#include + +extern "C" { +int load_model(const char *const model_path, int threads, + const char *backend_name); +int set_codec_path(const char *path); +int load_model_vad(const char *const model_path); +int vad(float pcmf32[], size_t pcmf32_size, float **segs_out, + size_t *segs_out_len); +int transcribe(uint32_t threads, char *lang, bool translate, bool diarize, + float pcmf32[], size_t pcmf32_len, size_t *segs_out_len, + char *prompt); +const char *get_segment_text(int i); +int64_t get_segment_t0(int i); +int64_t get_segment_t1(int i); +const char *get_backend(void); +void set_abort(int v); +float *tts_synthesize(const char *text, int *out_n_samples); // 24kHz mono float, malloc'd; NULL on failure +void tts_free(float *pcm); +int tts_set_voice(const char *name); // best-effort speaker selection; 0 ok +int tts_set_voice_file(const char *path, const char *ref_text); // load voice pack (.gguf) or zero-shot clone (.wav + ref_text) +} diff --git a/backend/go/crispasr/gocrispasr.go b/backend/go/crispasr/gocrispasr.go new file mode 100644 index 000000000..dc21a28fd --- /dev/null +++ b/backend/go/crispasr/gocrispasr.go @@ -0,0 +1,497 @@ +package main + +import ( + "context" + "fmt" + "os" + "path/filepath" + "strings" + "sync" + "unsafe" + + "github.com/go-audio/audio" + "github.com/go-audio/wav" + "github.com/mudler/LocalAI/pkg/grpc/base" + pb "github.com/mudler/LocalAI/pkg/grpc/proto" + "github.com/mudler/LocalAI/pkg/utils" + "google.golang.org/grpc/codes" + "google.golang.org/grpc/status" +) + +var ( + 
CppLoadModel func(modelPath string, threads int, backendName string) int + CppSetCodecPath func(path string) int + CppLoadModelVAD func(modelPath string) int + CppVAD func(pcmf32 []float32, pcmf32Size uintptr, segsOut unsafe.Pointer, segsOutLen unsafe.Pointer) int + CppTranscribe func(threads uint32, lang string, translate bool, diarize bool, pcmf32 []float32, pcmf32Len uintptr, segsOutLen unsafe.Pointer, prompt string) int + CppGetSegmentText func(i int) string + CppGetSegmentStart func(i int) int64 + CppGetSegmentEnd func(i int) int64 + CppGetBackend func() string + CppSetAbort func(v int) + CppTTSSynthesize func(text string, outNSamples unsafe.Pointer) uintptr + CppTTSFree func(ptr uintptr) + CppTTSSetVoice func(name string) int + CppTTSSetVoiceFile func(path string, refText string) int +) + +type CrispASR struct { + base.SingleThread +} + +// splitOption splits a "prefix:value" model option into its key and value, +// matching the convention used by other backends (see sherpa-onnx). It returns +// ok=false when the option carries no ':' separator. 
+func splitOption(oo string) (key, value string, ok bool) { + parts := strings.SplitN(oo, ":", 2) + if len(parts) != 2 { + return "", "", false + } + return parts[0], parts[1], true +} + +func (w *CrispASR) Load(opts *pb.ModelOptions) error { + vadOnly := false + backendName := "" + codecPath := "" + speakerName := "" + voicePath := "" + voiceRefText := "" + + for _, oo := range opts.Options { + if oo == "vad_only" { + vadOnly = true + continue + } + switch key, value, ok := splitOption(oo); { + case ok && key == "backend": + backendName = value + case ok && key == "codec": + codecPath = value + case ok && key == "speaker": + speakerName = value + case ok && key == "voice": + voicePath = value + case ok && key == "voice_text": + voiceRefText = value + default: + fmt.Fprintf(os.Stderr, "Unrecognized option: %v\n", oo) + } + } + + if vadOnly { + if ret := CppLoadModelVAD(opts.ModelFile); ret != 0 { + return fmt.Errorf("Failed to load CrispASR VAD model") + } + + return nil + } + + // Resolve a relative companion path against the model directory so a config + // can reference a sibling codec/tokenizer file by name alone. + if codecPath != "" && !filepath.IsAbs(codecPath) { + codecPath = filepath.Join(filepath.Dir(opts.ModelFile), codecPath) + } + + // A voice file (.gguf pack or .wav prompt) is resolved against the model + // directory just like the codec, so a config can reference a sibling file. + if voicePath != "" && !filepath.IsAbs(voicePath) { + voicePath = filepath.Join(filepath.Dir(opts.ModelFile), voicePath) + } + + if ret := CppLoadModel(opts.ModelFile, int(opts.Threads), backendName); ret != 0 { + return fmt.Errorf("Failed to load CrispASR transcription model") + } + + // Load the companion file (codec/tokenizer/s3gen) after the session is open. + // rc==0 means success or "not applicable" for the active backend; only a + // negative code is fatal. 
+ if codecPath != "" { + if rc := CppSetCodecPath(codecPath); rc < 0 { + return fmt.Errorf("crispasr: failed to load companion file %q (rc=%d)", codecPath, rc) + } + fmt.Fprintf(os.Stderr, "CrispASR companion file loaded: %s\n", codecPath) + } + + // Apply the Load-time default voice. A baked speaker (speaker:) is selected + // by name and is best-effort: a backend that can't honor it is logged, not + // fatal. A voice file (voice:) is a hard requirement once configured, so a + // negative rc fails Load. + if speakerName != "" { + if rc := CppTTSSetVoice(speakerName); rc != 0 { + fmt.Fprintf(os.Stderr, "crispasr: speaker %q not applied (rc=%d)\n", speakerName, rc) + } + } + if voicePath != "" { + if rc := CppTTSSetVoiceFile(voicePath, voiceRefText); rc < 0 { + return fmt.Errorf("crispasr: failed to load voice %q (rc=%d)", voicePath, rc) + } + fmt.Fprintf(os.Stderr, "CrispASR voice loaded: %s\n", voicePath) + } + + fmt.Fprintf(os.Stderr, "CrispASR backend selected: %s\n", CppGetBackend()) + + return nil +} + +func (w *CrispASR) VAD(req *pb.VADRequest) (pb.VADResponse, error) { + audio := req.Audio + // We expect 0xdeadbeef to be overwritten; if we see it in a stack trace we know it wasn't + segsPtr, segsLen := uintptr(0xdeadbeef), uintptr(0xdeadbeef) + segsPtrPtr, segsLenPtr := unsafe.Pointer(&segsPtr), unsafe.Pointer(&segsLen) + + if ret := CppVAD(audio, uintptr(len(audio)), segsPtrPtr, segsLenPtr); ret != 0 { + return pb.VADResponse{}, fmt.Errorf("Failed VAD") + } + + // Happens when the C++ vector has not had any elements pushed to it + if segsPtr == 0 { + return pb.VADResponse{ + Segments: []*pb.VADSegment{}, + }, nil + } + + // unsafeptr warning is caused by segsPtr being on the stack and therefore being subject to stack copying AFAICT + // however the stack shouldn't have grown between setting segsPtr and now, also the memory pointed to is allocated by C++ + segs := unsafe.Slice((*float32)(unsafe.Pointer(segsPtr)), segsLen) //nolint:govet // segsPtr addresses 
C++-owned heap memory passed back through the cgo-free purego boundary; the uintptr->Pointer round-trip is intentional and the buffer outlives this read. + + vadSegments := []*pb.VADSegment{} + for i := range len(segs) >> 1 { + s := segs[2*i] / 100 + t := segs[2*i+1] / 100 + vadSegments = append(vadSegments, &pb.VADSegment{ + Start: s, + End: t, + }) + } + + return pb.VADResponse{ + Segments: vadSegments, + }, nil +} + +func (w *CrispASR) AudioTranscription(ctx context.Context, opts *pb.TranscriptRequest) (pb.TranscriptResult, error) { + if err := ctx.Err(); err != nil { + return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled") + } + + dir, err := os.MkdirTemp("", "crispasr") + if err != nil { + return pb.TranscriptResult{}, err + } + defer func() { _ = os.RemoveAll(dir) }() + + convertedPath := filepath.Join(dir, "converted.wav") + + if err := utils.AudioToWav(opts.Dst, convertedPath); err != nil { + return pb.TranscriptResult{}, err + } + + fh, err := os.Open(convertedPath) + if err != nil { + return pb.TranscriptResult{}, err + } + defer func() { _ = fh.Close() }() + + d := wav.NewDecoder(fh) + buf, err := d.FullPCMBuffer() + if err != nil { + return pb.TranscriptResult{}, err + } + + data := buf.AsFloat32Buffer().Data + var duration float32 + if buf.Format != nil && buf.Format.SampleRate > 0 { + duration = float32(len(data)) / float32(buf.Format.SampleRate) + } + segsLen := uintptr(0xdeadbeef) + segsLenPtr := unsafe.Pointer(&segsLen) + + // Watcher: flips the C-side abort flag when ctx is cancelled. The + // goroutine is joined synchronously (close(done) signals it to exit, + // wg.Wait() blocks until it has) so a late CppSetAbort(1) cannot fire + // after the function returns and corrupt the next transcription call. 
+ done := make(chan struct{}) + var wg sync.WaitGroup + wg.Add(1) + go func() { + defer wg.Done() + select { + case <-ctx.Done(): + CppSetAbort(1) + case <-done: + } + }() + defer func() { + close(done) + wg.Wait() + }() + + ret := CppTranscribe(opts.Threads, opts.Language, opts.Translate, opts.Diarize, data, uintptr(len(data)), segsLenPtr, opts.Prompt) + if ret == 2 { + return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled") + } + if ret != 0 { + return pb.TranscriptResult{}, fmt.Errorf("Failed Transcribe") + } + + segments := []*pb.TranscriptSegment{} + text := "" + for i := range int(segsLen) { + // segment start/end conversion factor taken from https://github.com/ggml-org/whisper.cpp/blob/master/examples/cli/cli.cpp#L895 + s := CppGetSegmentStart(i) * (10000000) + t := CppGetSegmentEnd(i) * (10000000) + // The session result can emit bytes that aren't valid UTF-8 (e.g. a + // multibyte codepoint split across token boundaries); protobuf string + // fields reject those at marshal time. Scrub before the value escapes + // cgo. The session result is segment+word based and exposes no token + // IDs, so Tokens is left empty. + txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "�") + + segment := &pb.TranscriptSegment{ + Id: int32(i), + Text: txt, + Start: s, End: t, + } + + segments = append(segments, segment) + + text += " " + strings.TrimSpace(txt) + } + + return pb.TranscriptResult{ + Segments: segments, + Text: strings.TrimSpace(text), + Language: opts.Language, + Duration: duration, + }, nil +} + +// AudioTranscriptionStream runs the session transcribe to completion and then +// emits one delta per non-empty segment, followed by a final TranscriptResult. +// Progressive/real-time streaming isn't available via the session API (there +// is no per-decode callback), so deltas are emitted per-segment after the +// blocking decode returns rather than as segments are produced. 
The offline +// AudioTranscription is unchanged; both paths share the session and the +// SingleThread concurrency model. +func (w *CrispASR) AudioTranscriptionStream(ctx context.Context, opts *pb.TranscriptRequest, results chan *pb.TranscriptStreamResponse) error { + defer close(results) + + if err := ctx.Err(); err != nil { + return status.Error(codes.Canceled, "transcription cancelled") + } + + dir, err := os.MkdirTemp("", "crispasr") + if err != nil { + return err + } + defer func() { _ = os.RemoveAll(dir) }() + + convertedPath := filepath.Join(dir, "converted.wav") + if err := utils.AudioToWav(opts.Dst, convertedPath); err != nil { + return err + } + + fh, err := os.Open(convertedPath) + if err != nil { + return err + } + defer func() { _ = fh.Close() }() + + d := wav.NewDecoder(fh) + buf, err := d.FullPCMBuffer() + if err != nil { + return err + } + data := buf.AsFloat32Buffer().Data + var duration float32 + if buf.Format != nil && buf.Format.SampleRate > 0 { + duration = float32(len(data)) / float32(buf.Format.SampleRate) + } + + // Same abort-watcher pattern as AudioTranscription. Joined synchronously + // so a late CppSetAbort(1) cannot fire after this function returns. + // Best-effort only: the session transcribe is blocking with no abort hook. + done := make(chan struct{}) + var wg sync.WaitGroup + wg.Add(1) + go func() { + defer wg.Done() + select { + case <-ctx.Done(): + CppSetAbort(1) + case <-done: + } + }() + defer func() { + close(done) + wg.Wait() + }() + + segsLen := uintptr(0xdeadbeef) + segsLenPtr := unsafe.Pointer(&segsLen) + ret := CppTranscribe(opts.Threads, opts.Language, opts.Translate, opts.Diarize, data, uintptr(len(data)), segsLenPtr, opts.Prompt) + if ret == 2 { + return status.Error(codes.Canceled, "transcription cancelled") + } + if ret != 0 { + return fmt.Errorf("Failed Transcribe") + } + + // Walk the segments once: emit a delta per non-empty segment and build the + // final TranscriptResult.Segments alongside. 
The first delta has no leading + // space and subsequent ones are prefixed with a single space, so + // concat(deltas) == final.Text exactly, matching the e2e contract. + segments := []*pb.TranscriptSegment{} + var assembled strings.Builder + for i := range int(segsLen) { + s := CppGetSegmentStart(i) * 10000000 + t := CppGetSegmentEnd(i) * 10000000 + txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "�") + segments = append(segments, &pb.TranscriptSegment{ + Id: int32(i), + Text: txt, + Start: s, End: t, + }) + + trimmed := strings.TrimSpace(txt) + if trimmed == "" { + continue + } + var delta string + if assembled.Len() == 0 { + delta = trimmed + } else { + delta = " " + trimmed + } + results <- &pb.TranscriptStreamResponse{Delta: delta} + assembled.WriteString(delta) + } + + final := &pb.TranscriptResult{ + Segments: segments, + Text: assembled.String(), + Language: opts.Language, + Duration: duration, + } + results <- &pb.TranscriptStreamResponse{FinalResult: final} + return nil +} + +// synthesize returns 24 kHz mono float32 PCM for text via the open session. +func (w *CrispASR) synthesize(text string) ([]float32, error) { + if text == "" { + return nil, fmt.Errorf("crispasr: TTS requires non-empty text") + } + var n int32 + ptr := CppTTSSynthesize(text, unsafe.Pointer(&n)) + if ptr == 0 || n <= 0 { + return nil, fmt.Errorf("crispasr: synthesis failed (the loaded model may not be a supported TTS backend, or needs extra config e.g. orpheus SNAC codec)") + } + defer CppTTSFree(ptr) + src := unsafe.Slice((*float32)(unsafe.Pointer(ptr)), int(n)) //nolint:govet // ptr addresses C-allocated PCM returned across the purego boundary; copied out immediately below, before tts_free. + out := make([]float32, int(n)) // copy out of C memory before free + copy(out, src) + return out, nil +} + +// setVoice applies a per-call speaker/voice override (best effort). 
CrispASR +// returns a negative code when the active backend can't honor the name; we log +// it rather than fail, so an unknown voice falls back to the default speaker. +func setVoice(voice string) { + v := strings.TrimSpace(voice) + if v == "" { + return + } + if rc := CppTTSSetVoice(v); rc != 0 { + fmt.Fprintf(os.Stderr, "crispasr: voice %q not applied by the active TTS backend (rc=%d); using default\n", v, rc) + } +} + +func (w *CrispASR) TTS(req *pb.TTSRequest) error { + if req.Dst == "" { + return fmt.Errorf("crispasr: TTS requires a destination path") + } + setVoice(req.Voice) + pcm, err := w.synthesize(req.Text) + if err != nil { + return err + } + return writeWAV24k(req.Dst, pcm) +} + +// TTSStream is the streaming counterpart to TTS. CrispASR has no progressive +// (native streaming) synth, so we synthesize the whole utterance, encode it to +// a 24 kHz WAV, and emit the encoded bytes as a single chunk. The gRPC server +// wrapper (pkg/grpc/server.go:TTSStream) ranges over the channel until it is +// closed, so this method owns the close - mirrors vibevoice-cpp's TTSStream. +func (w *CrispASR) TTSStream(req *pb.TTSRequest, results chan []byte) error { + defer close(results) + + if req.Text == "" { + return fmt.Errorf("crispasr: TTSStream requires text") + } + setVoice(req.Voice) + pcm, err := w.synthesize(req.Text) + if err != nil { + return err + } + + tmp, err := os.CreateTemp("", "crispasr-tts-stream-*.wav") + if err != nil { + return fmt.Errorf("crispasr: tempfile: %w", err) + } + dst := tmp.Name() + if err := tmp.Close(); err != nil { + return fmt.Errorf("crispasr: close tempfile: %w", err) + } + defer func() { _ = os.Remove(dst) }() + + if err := writeWAV24k(dst, pcm); err != nil { + return err + } + + encoded, err := os.ReadFile(dst) + if err != nil { + return fmt.Errorf("crispasr: read tempfile: %w", err) + } + results <- encoded + return nil +} + +// writeWAV24k writes pcm as a 24000 Hz, mono, 16-bit PCM WAV at dst. 
+func writeWAV24k(dst string, pcm []float32) error { + f, err := os.Create(dst) + if err != nil { + return fmt.Errorf("crispasr: create %q: %w", dst, err) + } + + enc := wav.NewEncoder(f, 24000, 16, 1, 1) + ints := make([]int, len(pcm)) + for i, s := range pcm { + if s > 1 { + s = 1 + } else if s < -1 { + s = -1 + } + ints[i] = int(s * 32767) + } + buf := &audio.IntBuffer{ + Format: &audio.Format{NumChannels: 1, SampleRate: 24000}, + Data: ints, + SourceBitDepth: 16, + } + if err := enc.Write(buf); err != nil { + _ = enc.Close() + _ = f.Close() + return fmt.Errorf("crispasr: encode WAV: %w", err) + } + if err := enc.Close(); err != nil { + _ = f.Close() + return fmt.Errorf("crispasr: finalize WAV: %w", err) + } + if err := f.Close(); err != nil { + return fmt.Errorf("crispasr: close %q: %w", dst, err) + } + return nil +} diff --git a/backend/go/crispasr/gocrispasr_test.go b/backend/go/crispasr/gocrispasr_test.go new file mode 100644 index 000000000..63ea47907 --- /dev/null +++ b/backend/go/crispasr/gocrispasr_test.go @@ -0,0 +1,193 @@ +package main + +import ( + "context" + "os" + "path/filepath" + "strings" + "sync" + "testing" + + "github.com/ebitengine/purego" + pb "github.com/mudler/LocalAI/pkg/grpc/proto" + . "github.com/onsi/ginkgo/v2" + . "github.com/onsi/gomega" + "google.golang.org/grpc/codes" + "google.golang.org/grpc/status" +) + +func TestCrispASR(t *testing.T) { + RegisterFailHandler(Fail) + RunSpecs(t, "CrispASR Backend Suite") +} + +var ( + libLoadOnce sync.Once + libLoadErr error +) + +// ensureLibLoaded mirrors main.go's bootstrap so a Go test can drive the +// bridge without spinning up the gRPC server. Skips the current spec when the +// shared library isn't present (e.g. running before the crispasr backend has been built). 
+func ensureLibLoaded() { + libLoadOnce.Do(func() { + libName := os.Getenv("CRISPASR_LIBRARY") + if libName == "" { + libName = "./libgocrispasr-fallback.so" + } + if _, err := os.Stat(libName); err != nil { + libLoadErr = err + return + } + gosd, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL) + if err != nil { + libLoadErr = err + return + } + purego.RegisterLibFunc(&CppLoadModel, gosd, "load_model") + purego.RegisterLibFunc(&CppSetCodecPath, gosd, "set_codec_path") + purego.RegisterLibFunc(&CppTranscribe, gosd, "transcribe") + purego.RegisterLibFunc(&CppGetSegmentText, gosd, "get_segment_text") + purego.RegisterLibFunc(&CppGetSegmentStart, gosd, "get_segment_t0") + purego.RegisterLibFunc(&CppGetSegmentEnd, gosd, "get_segment_t1") + purego.RegisterLibFunc(&CppGetBackend, gosd, "get_backend") + purego.RegisterLibFunc(&CppSetAbort, gosd, "set_abort") + purego.RegisterLibFunc(&CppTTSSynthesize, gosd, "tts_synthesize") + purego.RegisterLibFunc(&CppTTSFree, gosd, "tts_free") + purego.RegisterLibFunc(&CppTTSSetVoice, gosd, "tts_set_voice") + purego.RegisterLibFunc(&CppTTSSetVoiceFile, gosd, "tts_set_voice_file") + }) + if libLoadErr != nil { + Skip("crispasr library not loadable: " + libLoadErr.Error()) + } +} + +// fixturesOrSkip returns the model + audio paths or skips the spec if either +// env var is unset. The test never runs in default CI: it requires a real +// ASR model and a long audio file (~3 minutes) on disk. +func fixturesOrSkip() (string, string) { + modelPath := os.Getenv("CRISPASR_MODEL_PATH") + audioPath := os.Getenv("CRISPASR_AUDIO_PATH") + if modelPath == "" || audioPath == "" { + Skip("set CRISPASR_MODEL_PATH and CRISPASR_AUDIO_PATH to run this spec") + } + return modelPath, audioPath +} + +// ttsModelOrSkip returns the TTS model path or skips the spec when the env var +// is unset. Like the transcription fixtures, this never runs in default CI: it +// needs a real TTS model (e.g. a vibevoice GGUF) on disk. 
+func ttsModelOrSkip() string { + modelPath := os.Getenv("CRISPASR_TTS_MODEL_PATH") + if modelPath == "" { + Skip("set CRISPASR_TTS_MODEL_PATH to run this spec") + } + return modelPath +} + +var _ = Describe("CrispASR", func() { + Context("AudioTranscription cancellation", func() { + It("returns codes.Canceled on a pre-cancelled context and still succeeds afterwards", func() { + modelPath, audioPath := fixturesOrSkip() + ensureLibLoaded() + + w := &CrispASR{} + Expect(w.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed()) + + // The session transcribe is blocking and exposes no abort hook, so + // a mid-decode cancel can't interrupt it. The contract we can rely + // on is the pre-call ctx.Err() check: a context cancelled before + // the call must yield codes.Canceled without starting a decode. + ctx, cancel := context.WithCancel(context.Background()) + cancel() + + _, err := w.AudioTranscription(ctx, &pb.TranscriptRequest{ + Dst: audioPath, + Threads: 4, + Language: "en", + }) + Expect(err).To(HaveOccurred(), "expected pre-cancelled context to fail") + st, ok := status.FromError(err) + Expect(ok).To(BeTrue(), "expected gRPC status error, got %v", err) + Expect(st.Code()).To(Equal(codes.Canceled), "expected codes.Canceled, got %v", err) + + // Subsequent transcription must succeed — proves g_abort reset. 
+ res, err := w.AudioTranscription(context.Background(), &pb.TranscriptRequest{ + Dst: audioPath, + Threads: 4, + Language: "en", + }) + Expect(err).ToNot(HaveOccurred(), "post-cancel transcription failed") + Expect(res.Text).ToNot(BeEmpty(), "post-cancel transcription returned empty text") + }) + }) + + Context("AudioTranscriptionStream", func() { + It("emits one delta per non-empty segment for a multi-segment clip", func() { + modelPath, audioPath := fixturesOrSkip() + ensureLibLoaded() + + w := &CrispASR{} + Expect(w.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed()) + + results := make(chan *pb.TranscriptStreamResponse, 64) + done := make(chan error, 1) + go func() { + done <- w.AudioTranscriptionStream(context.Background(), &pb.TranscriptRequest{ + Dst: audioPath, + Threads: 4, + Language: "en", + Stream: true, + }, results) + }() + + var deltas []string + var assembled strings.Builder + var finalText string + var finalSegmentCount int + for chunk := range results { + if d := chunk.GetDelta(); d != "" { + deltas = append(deltas, d) + assembled.WriteString(d) + } + if final := chunk.GetFinalResult(); final != nil { + finalText = final.GetText() + finalSegmentCount = len(final.GetSegments()) + } + } + Expect(<-done).ToNot(HaveOccurred()) + + // One delta per non-empty segment is emitted after the blocking + // decode returns (the session API has no per-decode callback), so a + // multi-segment clip MUST produce >=2 delta events, and + // concat(deltas) MUST equal final.Text exactly. 
+ Expect(len(deltas)).To(BeNumerically(">=", 2), + "expected multiple deltas from a multi-segment clip, got %d (assembled=%q)", + len(deltas), assembled.String()) + Expect(finalSegmentCount).To(BeNumerically(">=", 2), + "expected final to carry multiple segments") + Expect(assembled.String()).To(Equal(finalText), + "concat(deltas) must equal final.Text") + }) + }) + + Context("TTS", func() { + It("synthesizes a non-empty WAV", func() { + ttsModel := ttsModelOrSkip() + ensureLibLoaded() + + w := &CrispASR{} + Expect(w.Load(&pb.ModelOptions{ModelFile: ttsModel})).To(Succeed()) + + dst := filepath.Join(GinkgoT().TempDir(), "out.wav") + Expect(w.TTS(&pb.TTSRequest{Text: "Hello from CrispASR.", Dst: dst})).To(Succeed()) + + info, err := os.Stat(dst) + Expect(err).ToNot(HaveOccurred(), "synthesized WAV should exist at %q", dst) + // A real 24 kHz mono WAV is a 44-byte header plus samples; anything + // this small would mean an empty/failed synth. + Expect(info.Size()).To(BeNumerically(">", 1024), + "expected a non-trivial WAV, got %d bytes", info.Size()) + }) + }) +}) diff --git a/backend/go/crispasr/main.go b/backend/go/crispasr/main.go new file mode 100644 index 000000000..c2069bd85 --- /dev/null +++ b/backend/go/crispasr/main.go @@ -0,0 +1,58 @@ +package main + +// Note: this is started internally by LocalAI and a server is allocated for each model +import ( + "flag" + "os" + + "github.com/ebitengine/purego" + grpc "github.com/mudler/LocalAI/pkg/grpc" +) + +var ( + addr = flag.String("addr", "localhost:50051", "the address to connect to") +) + +type LibFuncs struct { + FuncPtr any + Name string +} + +func main() { + libName := os.Getenv("CRISPASR_LIBRARY") + if libName == "" { + libName = "./libgocrispasr-fallback.so" + } + + lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL) + if err != nil { + panic(err) + } + + libFuncs := []LibFuncs{ + {&CppLoadModel, "load_model"}, + {&CppSetCodecPath, "set_codec_path"}, + {&CppLoadModelVAD, "load_model_vad"}, 
+ {&CppVAD, "vad"}, + {&CppTranscribe, "transcribe"}, + {&CppGetSegmentText, "get_segment_text"}, + {&CppGetSegmentStart, "get_segment_t0"}, + {&CppGetSegmentEnd, "get_segment_t1"}, + {&CppGetBackend, "get_backend"}, + {&CppSetAbort, "set_abort"}, + {&CppTTSSynthesize, "tts_synthesize"}, + {&CppTTSFree, "tts_free"}, + {&CppTTSSetVoice, "tts_set_voice"}, + {&CppTTSSetVoiceFile, "tts_set_voice_file"}, + } + + for _, lf := range libFuncs { + purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name) + } + + flag.Parse() + + if err := grpc.StartServer(*addr, &CrispASR{}); err != nil { + panic(err) + } +} diff --git a/backend/go/crispasr/package.sh b/backend/go/crispasr/package.sh new file mode 100755 index 000000000..09e95f13a --- /dev/null +++ b/backend/go/crispasr/package.sh @@ -0,0 +1,65 @@ +#!/bin/bash + +# Script to copy the appropriate libraries based on architecture +# This script is used in the final stage of the Dockerfile + +set -e + +CURDIR=$(dirname "$(realpath $0)") +REPO_ROOT="${CURDIR}/../../.." + +# Create lib directory +mkdir -p $CURDIR/package/lib + +cp -avf $CURDIR/crispasr $CURDIR/package/ +cp -fv $CURDIR/libgocrispasr-*.so $CURDIR/package/ +cp -fv $CURDIR/run.sh $CURDIR/package/ + +# Detect architecture and copy appropriate libraries +if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then + # x86_64 architecture + echo "Detected x86_64 architecture, copying x86_64 libraries..." 
+ cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so + cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6 + cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1 + cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6 + cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6 + cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1 + cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1 + cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6 + cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2 + cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1 + cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0 +elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then + # ARM64 architecture + echo "Detected ARM64 architecture, copying ARM64 libraries..." + cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so + cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6 + cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1 + cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6 + cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6 + cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1 + cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1 + cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6 + cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2 + cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1 + cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0 +elif [ $(uname -s) = "Darwin" ]; then + echo "Detected Darwin" +else + echo "Error: Could not detect architecture" + exit 1 
+fi + +# Package GPU libraries based on BUILD_TYPE +# The GPU library packaging script will detect BUILD_TYPE and copy appropriate GPU libraries +GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh" +if [ -f "$GPU_LIB_SCRIPT" ]; then + echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..." + source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib" + package_gpu_libs +fi + +echo "Packaging completed successfully" +ls -liah $CURDIR/package/ +ls -liah $CURDIR/package/lib/ diff --git a/backend/go/crispasr/run.sh b/backend/go/crispasr/run.sh new file mode 100755 index 000000000..942b418a3 --- /dev/null +++ b/backend/go/crispasr/run.sh @@ -0,0 +1,52 @@ +#!/bin/bash +set -ex + +# Get the absolute current dir where the script is located +CURDIR=$(dirname "$(realpath $0)") + +cd / + +echo "CPU info:" +if [ "$(uname)" != "Darwin" ]; then + grep -e "model\sname" /proc/cpuinfo | head -1 + grep -e "flags" /proc/cpuinfo | head -1 +fi + +LIBRARY="$CURDIR/libgocrispasr-fallback.so" + +if [ "$(uname)" != "Darwin" ]; then + if grep -q -e "\savx\s" /proc/cpuinfo ; then + echo "CPU: AVX found OK" + if [ -e $CURDIR/libgocrispasr-avx.so ]; then + LIBRARY="$CURDIR/libgocrispasr-avx.so" + fi + fi + + if grep -q -e "\savx2\s" /proc/cpuinfo ; then + echo "CPU: AVX2 found OK" + if [ -e $CURDIR/libgocrispasr-avx2.so ]; then + LIBRARY="$CURDIR/libgocrispasr-avx2.so" + fi + fi + + # Check avx 512 + if grep -q -e "\savx512f\s" /proc/cpuinfo ; then + echo "CPU: AVX512F found OK" + if [ -e $CURDIR/libgocrispasr-avx512.so ]; then + LIBRARY="$CURDIR/libgocrispasr-avx512.so" + fi + fi +fi + +export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH +export CRISPASR_LIBRARY=$LIBRARY + +# If there is a lib/ld.so, use it +if [ -f $CURDIR/lib/ld.so ]; then + echo "Using lib/ld.so" + echo "Using library: $LIBRARY" + exec $CURDIR/lib/ld.so $CURDIR/crispasr "$@" +fi + +echo "Using library: $LIBRARY" +exec $CURDIR/crispasr "$@" diff --git a/backend/index.yaml b/backend/index.yaml index 
887e2e57e..37e689071 100644 --- a/backend/index.yaml +++ b/backend/index.yaml @@ -122,6 +122,33 @@ nvidia-cuda-12: "cuda12-whisper" nvidia-l4t-cuda-12: "nvidia-l4t-arm64-whisper" nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-whisper" +- &crispasr + name: "crispasr" + alias: "crispasr" + license: mit + icon: https://user-images.githubusercontent.com/1991296/235238348-05d0f6a4-da44-4900-a1de-d0707e75b763.jpeg + description: | + CrispASR unified speech engine (whisper.cpp fork on ggml) supporting many ASR architectures (Parakeet, Canary, Voxtral, Qwen3-ASR, Granite, Wav2Vec2, Moonshine, OmniASR, FireRedASR, and more). + urls: + - https://github.com/CrispStrobe/CrispASR + tags: + - audio-transcription + - CPU + - GPU + - CUDA + - HIP + capabilities: + default: "cpu-crispasr" + nvidia: "cuda12-crispasr" + intel: "intel-sycl-f16-crispasr" + metal: "metal-crispasr" + amd: "rocm-crispasr" + vulkan: "vulkan-crispasr" + nvidia-l4t: "nvidia-l4t-arm64-crispasr" + nvidia-cuda-13: "cuda13-crispasr" + nvidia-cuda-12: "cuda12-crispasr" + nvidia-l4t-cuda-12: "nvidia-l4t-arm64-crispasr" + nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-crispasr" - &parakeetcpp name: "parakeet-cpp" alias: "parakeet-cpp" @@ -1957,6 +1984,131 @@ uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-whisper" mirrors: - localai/localai-backends:master-gpu-nvidia-cuda-13-whisper +## crispasr +- !!merge <<: *crispasr + name: "crispasr-development" + capabilities: + default: "cpu-crispasr-development" + nvidia: "cuda12-crispasr-development" + intel: "intel-sycl-f16-crispasr-development" + metal: "metal-crispasr-development" + amd: "rocm-crispasr-development" + vulkan: "vulkan-crispasr-development" + nvidia-l4t: "nvidia-l4t-arm64-crispasr-development" + nvidia-cuda-13: "cuda13-crispasr-development" + nvidia-cuda-12: "cuda12-crispasr-development" + nvidia-l4t-cuda-12: "nvidia-l4t-arm64-crispasr-development" + nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-crispasr-development" +- !!merge <<: *crispasr +
name: "nvidia-l4t-arm64-crispasr" + uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-crispasr" + mirrors: + - localai/localai-backends:latest-nvidia-l4t-arm64-crispasr +- !!merge <<: *crispasr + name: "nvidia-l4t-arm64-crispasr-development" + uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-crispasr" + mirrors: + - localai/localai-backends:master-nvidia-l4t-arm64-crispasr +- !!merge <<: *crispasr + name: "cuda13-nvidia-l4t-arm64-crispasr" + uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-crispasr" + mirrors: + - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-crispasr +- !!merge <<: *crispasr + name: "cuda13-nvidia-l4t-arm64-crispasr-development" + uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-crispasr" + mirrors: + - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-crispasr +- !!merge <<: *crispasr + name: "cpu-crispasr" + uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-crispasr" + mirrors: + - localai/localai-backends:latest-cpu-crispasr +- !!merge <<: *crispasr + name: "metal-crispasr" + uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-crispasr" + mirrors: + - localai/localai-backends:latest-metal-darwin-arm64-crispasr +- !!merge <<: *crispasr + name: "metal-crispasr-development" + uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-crispasr" + mirrors: + - localai/localai-backends:master-metal-darwin-arm64-crispasr +- !!merge <<: *crispasr + name: "cpu-crispasr-development" + uri: "quay.io/go-skynet/local-ai-backends:master-cpu-crispasr" + mirrors: + - localai/localai-backends:master-cpu-crispasr +- !!merge <<: *crispasr + name: "cuda12-crispasr" + uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-crispasr" + mirrors: + - localai/localai-backends:latest-gpu-nvidia-cuda-12-crispasr +- !!merge <<: *crispasr + name: "rocm-crispasr" + uri: 
"quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-crispasr" + mirrors: + - localai/localai-backends:latest-gpu-rocm-hipblas-crispasr +- !!merge <<: *crispasr + name: "intel-sycl-f32-crispasr" + uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-crispasr" + mirrors: + - localai/localai-backends:latest-gpu-intel-sycl-f32-crispasr +- !!merge <<: *crispasr + name: "intel-sycl-f16-crispasr" + uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-crispasr" + mirrors: + - localai/localai-backends:latest-gpu-intel-sycl-f16-crispasr +- !!merge <<: *crispasr + name: "vulkan-crispasr" + uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-crispasr" + mirrors: + - localai/localai-backends:latest-gpu-vulkan-crispasr +- !!merge <<: *crispasr + name: "vulkan-crispasr-development" + uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-crispasr" + mirrors: + - localai/localai-backends:master-gpu-vulkan-crispasr +- !!merge <<: *crispasr + name: "metal-crispasr" + uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-crispasr" + mirrors: + - localai/localai-backends:latest-metal-darwin-arm64-crispasr +- !!merge <<: *crispasr + name: "metal-crispasr-development" + uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-crispasr" + mirrors: + - localai/localai-backends:master-metal-darwin-arm64-crispasr +- !!merge <<: *crispasr + name: "cuda12-crispasr-development" + uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-crispasr" + mirrors: + - localai/localai-backends:master-gpu-nvidia-cuda-12-crispasr +- !!merge <<: *crispasr + name: "rocm-crispasr-development" + uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-crispasr" + mirrors: + - localai/localai-backends:master-gpu-rocm-hipblas-crispasr +- !!merge <<: *crispasr + name: "intel-sycl-f32-crispasr-development" + uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-crispasr" + mirrors: + - 
localai/localai-backends:master-gpu-intel-sycl-f32-crispasr +- !!merge <<: *crispasr + name: "intel-sycl-f16-crispasr-development" + uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-crispasr" + mirrors: + - localai/localai-backends:master-gpu-intel-sycl-f16-crispasr +- !!merge <<: *crispasr + name: "cuda13-crispasr" + uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-crispasr" + mirrors: + - localai/localai-backends:latest-gpu-nvidia-cuda-13-crispasr +- !!merge <<: *crispasr + name: "cuda13-crispasr-development" + uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-crispasr" + mirrors: + - localai/localai-backends:master-gpu-nvidia-cuda-13-crispasr ## parakeet-cpp - !!merge <<: *parakeetcpp name: "parakeet-cpp-development" diff --git a/core/http/endpoints/localai/backend.go b/core/http/endpoints/localai/backend.go index da5808c26..cbda648d6 100644 --- a/core/http/endpoints/localai/backend.go +++ b/core/http/endpoints/localai/backend.go @@ -31,6 +31,7 @@ var knownPrefOnlyBackends = []schema.KnownBackend{ {Name: "mlx-vlm", Modality: "text", AutoDetect: false, Description: "MLX vision-language models (preference-only)"}, // ASR {Name: "whisperx", Modality: "asr", AutoDetect: false, Description: "WhisperX transcription (preference-only)"}, + {Name: "crispasr", Modality: "asr", AutoDetect: false, Description: "CrispASR multi-architecture transcription (preference-only)"}, // TTS {Name: "kokoros", Modality: "tts", AutoDetect: false, Description: "Kokoros TTS (preference-only)"}, {Name: "qwen-tts", Modality: "tts", AutoDetect: false, Description: "Qwen TTS (preference-only)"}, diff --git a/core/http/endpoints/localai/backend_test.go b/core/http/endpoints/localai/backend_test.go index 4a7a9497c..0c21bb7b4 100644 --- a/core/http/endpoints/localai/backend_test.go +++ b/core/http/endpoints/localai/backend_test.go @@ -140,6 +140,7 @@ var _ = Describe("Backend Endpoints", func() { expectPrefOnly("trl", "text") 
expectPrefOnly("mlx-vlm", "text") expectPrefOnly("whisperx", "asr") + expectPrefOnly("crispasr", "asr") expectPrefOnly("kokoros", "tts") expectPrefOnly("qwen-tts", "tts") expectPrefOnly("qwen3-tts-cpp", "tts") diff --git a/gallery/index.yaml b/gallery/index.yaml index ecd648f32..865eec9f0 100644 --- a/gallery/index.yaml +++ b/gallery/index.yaml @@ -31771,3 +31771,844 @@ - filename: parakeet-cpp/tdt_ctc-1.1b-f16.gguf uri: huggingface://mudler/parakeet-cpp-gguf/tdt_ctc-1.1b-f16.gguf sha256: cd53f64eefac2623a12f2f118ef50b56622dc3012f42c815c6adf0d08292f387 + +- name: parakeet-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/parakeet-tdt-0.6b-v3-GGUF + description: | + NVIDIA Parakeet TDT 0.6B v3 (FastConformer + Token-and-Duration Transducer), 25-language ASR. Runs via the CrispASR backend. Default GGUF size ~467 MB. + tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: parakeet-crispasr + parameters: + model: parakeet-tdt-0.6b-v3-q4_k.gguf + files: + - filename: parakeet-tdt-0.6b-v3-q4_k.gguf + uri: huggingface://cstr/parakeet-tdt-0.6b-v3-GGUF/parakeet-tdt-0.6b-v3-q4_k.gguf +- name: parakeet-v2-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/parakeet-tdt-0.6b-v2-GGUF + description: | + NVIDIA Parakeet TDT 0.6B v2 (FastConformer + TDT), English-only ASR. Runs via the CrispASR backend. Default GGUF size ~468 MB. 
+ tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: parakeet-v2-crispasr + parameters: + model: parakeet-tdt-0.6b-v2-q4_k.gguf + files: + - filename: parakeet-tdt-0.6b-v2-q4_k.gguf + uri: huggingface://cstr/parakeet-tdt-0.6b-v2-GGUF/parakeet-tdt-0.6b-v2-q4_k.gguf +- name: parakeet-ja-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/parakeet-tdt-0.6b-ja-GGUF + description: | + NVIDIA Parakeet TDT 0.6B Japanese ASR (F16 default; Q4_K is quantisation-sensitive for this model). Runs via the CrispASR backend. Default GGUF size ~1.24 GB. + tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: parakeet-ja-crispasr + parameters: + model: parakeet-tdt-0.6b-ja.gguf + files: + - filename: parakeet-tdt-0.6b-ja.gguf + uri: huggingface://cstr/parakeet-tdt-0.6b-ja-GGUF/parakeet-tdt-0.6b-ja.gguf +- name: parakeet-tdt-1.1b-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/parakeet-tdt-1.1b-GGUF + description: | + NVIDIA Parakeet TDT 1.1B (42-layer FastConformer encoder), English-only ASR. Runs via the CrispASR backend. Default GGUF size ~808 MB. + tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: parakeet-tdt-1.1b-crispasr + parameters: + model: parakeet-tdt-1.1b-q4_k.gguf + files: + - filename: parakeet-tdt-1.1b-q4_k.gguf + uri: huggingface://cstr/parakeet-tdt-1.1b-GGUF/parakeet-tdt-1.1b-q4_k.gguf +- name: parakeet-tdt_ctc-110m-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/parakeet-tdt_ctc-110m-GGUF + description: | + NVIDIA Parakeet hybrid TDT+CTC 110M (smallest, CTC decode), English-only ASR. Runs via the CrispASR backend. 
Default GGUF size ~91 MB. + tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: parakeet-tdt_ctc-110m-crispasr + parameters: + model: parakeet-tdt_ctc-110m-q4_k.gguf + files: + - filename: parakeet-tdt_ctc-110m-q4_k.gguf + uri: huggingface://cstr/parakeet-tdt_ctc-110m-GGUF/parakeet-tdt_ctc-110m-q4_k.gguf +- name: parakeet-tdt_ctc-1.1b-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/parakeet-tdt_ctc-1.1b-GGUF + description: | + NVIDIA Parakeet hybrid TDT+CTC 1.1B (multilingual, casing + punctuation) ASR. Runs via the CrispASR backend. Default GGUF size ~810 MB. + tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: parakeet-tdt_ctc-1.1b-crispasr + parameters: + model: parakeet-tdt_ctc-1.1b-q4_k.gguf + files: + - filename: parakeet-tdt_ctc-1.1b-q4_k.gguf + uri: huggingface://cstr/parakeet-tdt_ctc-1.1b-GGUF/parakeet-tdt_ctc-1.1b-q4_k.gguf +- name: parakeet-rnnt-0.6b-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/parakeet-rnnt-0.6b-GGUF + description: | + NVIDIA Parakeet RNN-Transducer 0.6B (24-layer FastConformer) ASR. Runs via the CrispASR backend. Default GGUF size ~447 MB. + tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: parakeet-rnnt-0.6b-crispasr + parameters: + model: parakeet-rnnt-0.6b-q4_k.gguf + files: + - filename: parakeet-rnnt-0.6b-q4_k.gguf + uri: huggingface://cstr/parakeet-rnnt-0.6b-GGUF/parakeet-rnnt-0.6b-q4_k.gguf +- name: parakeet-rnnt-1.1b-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/parakeet-rnnt-1.1b-GGUF + description: | + NVIDIA Parakeet RNN-Transducer 1.1B (42-layer FastConformer) ASR. 
Runs via the CrispASR backend. Default GGUF size ~770 MB. + tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: parakeet-rnnt-1.1b-crispasr + parameters: + model: parakeet-rnnt-1.1b-q4_k.gguf + files: + - filename: parakeet-rnnt-1.1b-q4_k.gguf + uri: huggingface://cstr/parakeet-rnnt-1.1b-GGUF/parakeet-rnnt-1.1b-q4_k.gguf +- name: fastconformer-ctc-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/stt-en-fastconformer-ctc-large-GGUF + description: | + NVIDIA STT-EN FastConformer-CTC Large, English ASR. Runs via the CrispASR backend. Default GGUF size ~83 MB. + tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: fastconformer-ctc-crispasr + parameters: + model: stt-en-fastconformer-ctc-large-q4_k.gguf + files: + - filename: stt-en-fastconformer-ctc-large-q4_k.gguf + uri: huggingface://cstr/stt-en-fastconformer-ctc-large-GGUF/stt-en-fastconformer-ctc-large-q4_k.gguf +- name: canary-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/canary-1b-v2-GGUF + description: | + NVIDIA Canary 1B v2 (FastConformer encoder-decoder), multilingual ASR + translation. Runs via the CrispASR backend. Default GGUF size ~600 MB. + tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: canary-crispasr + parameters: + model: canary-1b-v2-q4_k.gguf + files: + - filename: canary-1b-v2-q4_k.gguf + uri: huggingface://cstr/canary-1b-v2-GGUF/canary-1b-v2-q4_k.gguf +- name: voxtral-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/voxtral-mini-3b-2507-GGUF + description: | + Mistral Voxtral Mini 3B (audio LLM) ASR. Runs via the CrispASR backend. Default GGUF size ~2.5 GB. 
+ tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: voxtral-crispasr + parameters: + model: voxtral-mini-3b-2507-q4_k.gguf + files: + - filename: voxtral-mini-3b-2507-q4_k.gguf + uri: huggingface://cstr/voxtral-mini-3b-2507-GGUF/voxtral-mini-3b-2507-q4_k.gguf +- name: voxtral4b-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/voxtral-mini-4b-realtime-GGUF + description: | + Mistral Voxtral Mini 4B Realtime (audio LLM) ASR. Runs via the CrispASR backend. Default GGUF size ~3.3 GB. + tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: voxtral4b-crispasr + parameters: + model: voxtral-mini-4b-realtime-q4_k.gguf + files: + - filename: voxtral-mini-4b-realtime-q4_k.gguf + uri: huggingface://cstr/voxtral-mini-4b-realtime-GGUF/voxtral-mini-4b-realtime-q4_k.gguf +- name: granite-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/granite-speech-4.0-1b-GGUF + description: | + IBM Granite Speech 4.0 1B ASR. Runs via the CrispASR backend. Default GGUF size ~2.94 GB. + tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: granite-crispasr + parameters: + model: granite-speech-4.0-1b-q4_k.gguf + files: + - filename: granite-speech-4.0-1b-q4_k.gguf + uri: huggingface://cstr/granite-speech-4.0-1b-GGUF/granite-speech-4.0-1b-q4_k.gguf +- name: granite-4.1-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/granite-speech-4.1-2b-GGUF + description: | + IBM Granite Speech 4.1 2B ASR. Runs via the CrispASR backend. Default GGUF size ~2.94 GB. 
+ tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: granite-4.1-crispasr + parameters: + model: granite-speech-4.1-2b-q4_k.gguf + files: + - filename: granite-speech-4.1-2b-q4_k.gguf + uri: huggingface://cstr/granite-speech-4.1-2b-GGUF/granite-speech-4.1-2b-q4_k.gguf +- name: granite-4.1-plus-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/granite-speech-4.1-2b-plus-GGUF + description: | + IBM Granite Speech 4.1 2B Plus ASR. Runs via the CrispASR backend. Default GGUF size ~2.96 GB. + tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: granite-4.1-plus-crispasr + parameters: + model: granite-speech-4.1-2b-plus-q4_k.gguf + files: + - filename: granite-speech-4.1-2b-plus-q4_k.gguf + uri: huggingface://cstr/granite-speech-4.1-2b-plus-GGUF/granite-speech-4.1-2b-plus-q4_k.gguf +- name: granite-4.1-nar-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/granite-speech-4.1-2b-nar-GGUF + description: | + IBM Granite Speech 4.1 2B NAR (non-autoregressive) ASR. Runs via the CrispASR backend. Default GGUF size ~3.2 GB. + tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: granite-4.1-nar-crispasr + parameters: + model: granite-speech-4.1-2b-nar-q4_k.gguf + files: + - filename: granite-speech-4.1-2b-nar-q4_k.gguf + uri: huggingface://cstr/granite-speech-4.1-2b-nar-GGUF/granite-speech-4.1-2b-nar-q4_k.gguf +- name: qwen3-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/qwen3-asr-0.6b-GGUF + description: | + Qwen3-ASR 0.6B ASR. Runs via the CrispASR backend. Default GGUF size ~500 MB. 
+ tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: qwen3-crispasr + parameters: + model: qwen3-asr-0.6b-q4_k.gguf + files: + - filename: qwen3-asr-0.6b-q4_k.gguf + uri: huggingface://cstr/qwen3-asr-0.6b-GGUF/qwen3-asr-0.6b-q4_k.gguf +- name: qwen3-1.7b-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/qwen3-asr-1.7b-GGUF + description: | + Qwen3-ASR 1.7B ASR. Runs via the CrispASR backend. Default GGUF size ~1.3 GB. + tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: qwen3-1.7b-crispasr + parameters: + model: qwen3-asr-1.7b-q4_k.gguf + files: + - filename: qwen3-asr-1.7b-q4_k.gguf + uri: huggingface://cstr/qwen3-asr-1.7b-GGUF/qwen3-asr-1.7b-q4_k.gguf +- name: cohere-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/cohere-transcribe-03-2026-GGUF + description: | + Cohere Transcribe (03-2026) ASR. Runs via the CrispASR backend. Default GGUF size ~550 MB. + tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: cohere-crispasr + parameters: + model: cohere-transcribe-q4_k.gguf + files: + - filename: cohere-transcribe-q4_k.gguf + uri: huggingface://cstr/cohere-transcribe-03-2026-GGUF/cohere-transcribe-q4_k.gguf +- name: wav2vec2-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/wav2vec2-large-xlsr-53-english-GGUF + description: | + wav2vec2 Large XLSR-53 English (CTC) ASR. Runs via the CrispASR backend. Default GGUF size ~212 MB. 
+ tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: wav2vec2-crispasr + parameters: + model: wav2vec2-xlsr-en-q4_k.gguf + files: + - filename: wav2vec2-xlsr-en-q4_k.gguf + uri: huggingface://cstr/wav2vec2-large-xlsr-53-english-GGUF/wav2vec2-xlsr-en-q4_k.gguf +- name: wav2vec2-de-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/wav2vec2-large-xlsr-53-german-GGUF + description: | + wav2vec2 Large XLSR-53 German (CTC) ASR. Runs via the CrispASR backend. Default GGUF size ~222 MB. + tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: wav2vec2-de-crispasr + parameters: + model: wav2vec2-large-xlsr-53-german-q4_k.gguf + files: + - filename: wav2vec2-large-xlsr-53-german-q4_k.gguf + uri: huggingface://cstr/wav2vec2-large-xlsr-53-german-GGUF/wav2vec2-large-xlsr-53-german-q4_k.gguf +- name: vibevoice-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/vibevoice-asr-GGUF + description: | + VibeVoice ASR. Runs via the CrispASR backend. Default GGUF size ~4.5 GB. + tags: + - crispasr + - asr + - speech-recognition + - stt + - gguf + overrides: + backend: crispasr + known_usecases: + - transcript + name: vibevoice-crispasr + parameters: + model: vibevoice-asr-q4_k.gguf + files: + - filename: vibevoice-asr-q4_k.gguf + uri: huggingface://cstr/vibevoice-asr-GGUF/vibevoice-asr-q4_k.gguf +- name: vibevoice-tts-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/vibevoice-realtime-0.5b-GGUF + description: | + VibeVoice Realtime 0.5B text-to-speech (TTS) model, synthesized through the CrispASR backend. Produces 24 kHz mono audio; runs end-to-end on CPU with a built-in default voice. Default GGUF size ~636 MB. 
+ tags: + - crispasr + - tts + - text-to-speech + - gguf + overrides: + backend: crispasr + known_usecases: + - tts + name: vibevoice-tts-crispasr + parameters: + model: vibevoice-realtime-0.5b-q4_k.gguf + files: + - filename: vibevoice-realtime-0.5b-q4_k.gguf + uri: huggingface://cstr/vibevoice-realtime-0.5b-GGUF/vibevoice-realtime-0.5b-q4_k.gguf +- name: chatterbox-tts-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/chatterbox-GGUF + description: | + Chatterbox (ResembleAI, MIT) text-to-speech synthesized through the CrispASR backend. Two-GGUF runtime: a Llama T3 token model plus an S3Gen codec companion (tokens to 24 kHz waveform). Auto-detected by CrispASR and ships with a built-in default voice; runs end-to-end on CPU and produces 24 kHz mono audio. Default GGUF sizes ~630 MB (T3) + ~358 MB (S3Gen). + tags: + - crispasr + - tts + - text-to-speech + - gguf + overrides: + backend: crispasr + known_usecases: + - tts + name: chatterbox-tts-crispasr + options: + - "codec:chatterbox-s3gen-q8_0.gguf" + parameters: + model: chatterbox-t3-q8_0.gguf + files: + - filename: chatterbox-t3-q8_0.gguf + uri: huggingface://cstr/chatterbox-GGUF/chatterbox-t3-q8_0.gguf + - filename: chatterbox-s3gen-q8_0.gguf + uri: huggingface://cstr/chatterbox-GGUF/chatterbox-s3gen-q8_0.gguf +- name: qwen3-tts-customvoice-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/qwen3-tts-0.6b-customvoice-GGUF + description: | + Qwen3-TTS CustomVoice 0.6B (12 Hz) text-to-speech synthesized through the CrispASR backend. Fixed-speaker fine-tune driven via an explicit backend selector plus a tokenizer codec companion. Ships baked speakers (vivian, aiden, dylan, eric, ono_anna, ryan, serena, sohee, uncle_fu); the default config selects vivian. Runs end-to-end on CPU and produces 24 kHz mono audio. Default GGUF sizes ~968 MB (talker) + ~358 MB (tokenizer). 
+ tags: + - crispasr + - tts + - text-to-speech + - gguf + overrides: + backend: crispasr + known_usecases: + - tts + name: qwen3-tts-customvoice-crispasr + options: + - "backend:qwen3-tts" + - "codec:qwen3-tts-tokenizer-12hz.gguf" + - "speaker:vivian" + parameters: + model: qwen3-tts-12hz-0.6b-customvoice-q8_0.gguf + files: + - filename: qwen3-tts-12hz-0.6b-customvoice-q8_0.gguf + uri: huggingface://cstr/qwen3-tts-0.6b-customvoice-GGUF/qwen3-tts-12hz-0.6b-customvoice-q8_0.gguf + - filename: qwen3-tts-tokenizer-12hz.gguf + uri: huggingface://cstr/qwen3-tts-tokenizer-12hz-GGUF/qwen3-tts-tokenizer-12hz.gguf +- name: orpheus-tts-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/orpheus-3b-base-GGUF + description: | + Orpheus-3B (Llama-3.2 base) text-to-speech synthesized through the CrispASR backend. Auto-detected by CrispASR; needs a SNAC 24 kHz codec companion and a baked speaker. Ships speaker tara (selected by the default config). Runs end-to-end on CPU and produces 24 kHz mono audio. Default GGUF sizes ~3.5 GB (model) + ~26 MB (SNAC codec). + tags: + - crispasr + - tts + - text-to-speech + - gguf + overrides: + backend: crispasr + known_usecases: + - tts + name: orpheus-tts-crispasr + options: + - "codec:snac-24khz.gguf" + - "speaker:tara" + parameters: + model: orpheus-3b-base-q8_0.gguf + files: + - filename: orpheus-3b-base-q8_0.gguf + uri: huggingface://cstr/orpheus-3b-base-GGUF/orpheus-3b-base-q8_0.gguf + - filename: snac-24khz.gguf + uri: huggingface://cstr/snac-24khz-GGUF/snac-24khz.gguf +- name: hubert-crispasr + url: github:mudler/LocalAI/gallery/virtual.yaml@master + urls: + - https://huggingface.co/cstr/hubert-large-ls960-ft-GGUF + description: | + HuBERT Large (LS960 fine-tune) CTC speech recognition, English. Runs via the CrispASR backend with an explicit backend selector. Default GGUF size ~200 MB. 
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: hubert-crispasr
+    options:
+      - "backend:hubert"
+    parameters:
+      model: hubert-large-ls960-ft-q4_k.gguf
+  files:
+    - filename: hubert-large-ls960-ft-q4_k.gguf
+      uri: huggingface://cstr/hubert-large-ls960-ft-GGUF/hubert-large-ls960-ft-q4_k.gguf
+- name: data2vec-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/data2vec-audio-960h-GGUF
+  description: |
+    data2vec Audio Base (960h) CTC speech recognition, English. Runs via the CrispASR backend with an explicit backend selector. Default GGUF size ~60 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: data2vec-crispasr
+    options:
+      - "backend:data2vec"
+    parameters:
+      model: data2vec-audio-base-960h-q4_k.gguf
+  files:
+    - filename: data2vec-audio-base-960h-q4_k.gguf
+      uri: huggingface://cstr/data2vec-audio-960h-GGUF/data2vec-audio-base-960h-q4_k.gguf
+- name: glm-asr-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/glm-asr-nano-GGUF
+  description: |
+    GLM-ASR Nano speech recognition. Runs via the CrispASR backend with an explicit backend selector. Default GGUF size ~1.2 GB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: glm-asr-crispasr
+    options:
+      - "backend:glm-asr"
+    parameters:
+      model: glm-asr-nano-q4_k.gguf
+  files:
+    - filename: glm-asr-nano-q4_k.gguf
+      uri: huggingface://cstr/glm-asr-nano-GGUF/glm-asr-nano-q4_k.gguf
+- name: kyutai-stt-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/kyutai-stt-1b-GGUF
+  description: |
+    Kyutai STT 1B (Moshi-style) speech recognition. Runs via the CrispASR backend with an explicit backend selector. Default GGUF size ~636 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: kyutai-stt-crispasr
+    options:
+      - "backend:kyutai-stt"
+    parameters:
+      model: kyutai-stt-1b-q4_k.gguf
+  files:
+    - filename: kyutai-stt-1b-q4_k.gguf
+      uri: huggingface://cstr/kyutai-stt-1b-GGUF/kyutai-stt-1b-q4_k.gguf
+- name: firered-asr-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/firered-asr2-aed-GGUF
+  description: |
+    FireRed-ASR2 AED speech recognition. Runs via the CrispASR backend with an explicit backend selector. Default GGUF size ~918 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: firered-asr-crispasr
+    options:
+      - "backend:firered-asr"
+    parameters:
+      model: firered-asr2-aed-q4_k.gguf
+  files:
+    - filename: firered-asr2-aed-q4_k.gguf
+      uri: huggingface://cstr/firered-asr2-aed-GGUF/firered-asr2-aed-q4_k.gguf
+- name: moonshine-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/moonshine-tiny-GGUF
+  description: |
+    Moonshine Tiny speech recognition, English. Runs via the CrispASR backend with an explicit backend selector and a companion tokenizer. Default GGUF size ~20 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: moonshine-crispasr
+    options:
+      - "backend:moonshine"
+      - "codec:tokenizer.bin"
+    parameters:
+      model: moonshine-tiny-q4_k.gguf
+  files:
+    - filename: moonshine-tiny-q4_k.gguf
+      uri: huggingface://cstr/moonshine-tiny-GGUF/moonshine-tiny-q4_k.gguf
+    - filename: tokenizer.bin
+      uri: huggingface://cstr/moonshine-tiny-GGUF/tokenizer.bin
+- name: moonshine-de-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/moonshine-base-de-fidoriel-GGUF
+  description: |
+    Moonshine Base German fine-tune (fidoriel), best-quality German Moonshine. Runs via the CrispASR backend with an explicit backend selector and a companion tokenizer. Default GGUF size ~39 MB.
+  license: CC-BY-NC-SA-4.0
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: moonshine-de-crispasr
+    options:
+      - "backend:moonshine"
+      - "codec:tokenizer.bin"
+    parameters:
+      model: moonshine-base-de-fidoriel-q4_k.gguf
+  files:
+    - filename: moonshine-base-de-fidoriel-q4_k.gguf
+      uri: huggingface://cstr/moonshine-base-de-fidoriel-GGUF/moonshine-base-de-fidoriel-q4_k.gguf
+    - filename: tokenizer.bin
+      uri: huggingface://cstr/moonshine-base-de-fidoriel-GGUF/tokenizer.bin
+- name: moonshine-tiny-de-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/moonshine-tiny-de-fidoriel-GGUF
+  description: |
+    Moonshine Tiny German fine-tune (fidoriel), smaller/faster German Moonshine. Runs via the CrispASR backend with an explicit backend selector and a companion tokenizer. Default GGUF size ~17 MB.
+  license: CC-BY-NC-SA-4.0
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: moonshine-tiny-de-crispasr
+    options:
+      - "backend:moonshine"
+      - "codec:tokenizer.bin"
+    parameters:
+      model: moonshine-tiny-de-fidoriel-q4_k.gguf
+  files:
+    - filename: moonshine-tiny-de-fidoriel-q4_k.gguf
+      uri: huggingface://cstr/moonshine-tiny-de-fidoriel-GGUF/moonshine-tiny-de-fidoriel-q4_k.gguf
+    - filename: tokenizer.bin
+      uri: huggingface://cstr/moonshine-tiny-de-fidoriel-GGUF/tokenizer.bin
+- name: moonshine-streaming-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/moonshine-streaming-tiny-GGUF
+  description: |
+    Moonshine Streaming Tiny speech recognition. Runs via the CrispASR backend with an explicit backend selector and a companion tokenizer. Default GGUF size ~31 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: moonshine-streaming-crispasr
+    options:
+      - "backend:moonshine-streaming"
+      - "codec:tokenizer.bin"
+    parameters:
+      model: moonshine-streaming-tiny-q4_k.gguf
+  files:
+    - filename: moonshine-streaming-tiny-q4_k.gguf
+      uri: huggingface://cstr/moonshine-streaming-tiny-GGUF/moonshine-streaming-tiny-q4_k.gguf
+    - filename: tokenizer.bin
+      uri: huggingface://cstr/moonshine-streaming-tiny-GGUF/tokenizer.bin
+- name: mimo-asr-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/mimo-asr-GGUF
+  description: |
+    MiMo-ASR speech recognition. Runs via the CrispASR backend with an explicit backend selector and a companion tokenizer GGUF. Default GGUF size ~4.2 GB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: mimo-asr-crispasr
+    options:
+      - "backend:mimo-asr"
+      - "codec:mimo-tokenizer-q4_k.gguf"
+    parameters:
+      model: mimo-asr-q4_k.gguf
+  files:
+    - filename: mimo-asr-q4_k.gguf
+      uri: huggingface://cstr/mimo-asr-GGUF/mimo-asr-q4_k.gguf
+    - filename: mimo-tokenizer-q4_k.gguf
+      uri: huggingface://cstr/mimo-tokenizer-GGUF/mimo-tokenizer-q4_k.gguf