feat(crispasr): add CrispASR backend — multi-architecture ASR + TTS (#10099)

* feat(crispasr): backend source files (Go gRPC server, C-ABI shim, build files)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* polish(crispasr): brand error strings + fix stale shim comment

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* build(crispasr): register backend in root Makefile

Mirror the whisper Go backend registration for the new crispasr backend: NOTPARALLEL entry, prepare-test-extra/test-extra hooks, BACKEND_CRISPASR definition, docker-build target generation, and the docker-build-backends aggregate target.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): add backend build matrix entries

Mirror the 11 whisper golang Dockerfile matrix entries (CPU amd64/arm64, CUDA 12/13, L4T CUDA 13, Intel SYCL f32/f16, Vulkan amd64/arm64, L4T arm64, ROCm hipblas) with backend and tag-suffix substituted to crispasr.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add crispasr backend gallery entries

Add the crispasr meta anchor and its full set of image gallery entries (cpu, metal, cuda12/13, rocm, intel-sycl f32/f16, vulkan, L4T arm64, L4T cuda13 arm64, plus -development variants), mirroring the whisper backend gallery block.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): bump CRISPASR_VERSION via bump_deps workflow

Track CrispStrobe/CrispASR main branch and bump CRISPASR_VERSION in backend/go/crispasr/Makefile.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* build(crispasr): don't wire fixture-gated test into test-extra

Mirror the whisper Go backend: its AudioTranscription test is gated on model/audio fixtures and skips in CI, so building crispasr (the heaviest ggml compile in the tree) inside the unit-test lane adds a long compile for zero coverage. The backend image build in backend-matrix.yml remains the authoritative compile check.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): add darwin metal build entry (mirror whisper)

The metal-crispasr gallery entries and capabilities.metal mapping reference -metal-darwin-arm64-crispasr, which is only produced by an includeDarwin entry. Mirror whisper's darwin metal entry so the tag actually gets built.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): place hipblas matrix entry next to whisper twin

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): register crispasr as pref-only ASR backend + test

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): port whisper behavioral suite (cancellation + streaming)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): fix skip message env var names to CRISPASR_*

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): switch shim to crispasr_session_* multi-architecture API

The shim used whisper_full(), which in CrispASR is the whisper-only path: libcrispasr only transcribes Whisper GGUFs through it. Multi-architecture transcription (Parakeet, Voxtral, Qwen3-ASR, Canary, Granite, FunASR, Paraformer, SenseVoice, ...) goes through the crispasr_session_* C-ABI, which auto-detects the architecture from the GGUF and dispatches to the matching backend.

Rewrite the C shim around crispasr_session_open / _transcribe_lang / _result_* and add get_backend() so the selected backend is logged. load_model now takes a threads param (session_open binds n_threads at open). The session result is segment+word based with no token IDs and no per-decode callback, so drop n_tokens / get_token_id / get_segment_speaker_turn_next / set_new_segment_callback. set_abort is kept for API parity but is best-effort: the session transcribe is blocking with no abort hook.

Update the purego bindings and gocrispasr.go to match: tokens are left empty, speaker-turn handling is removed, and AudioTranscriptionStream emits one delta per non-empty segment after the blocking decode returns (no progressive streaming via the session API), preserving the concat(deltas) == final.Text invariant.

crispasr_session_set_translate is exported by libcrispasr but not declared in crispasr.h, so it is forward-declared in the shim alongside the open/transcribe/result functions.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* build(crispasr): link full CrispASR backend set for multi-arch support

The shim's crispasr_session_* dispatch calls into the per-architecture backend libs (parakeet, voxtral, qwen3_asr, canary, funasr, paraformer, sensevoice, ...), which CrispASR builds as static archives. Linking only crispasr + ggml dead-stripped every backend object from the final module (nm backend-symbol count: 0), leaving a whisper-only .so.

Link the same backend set as crispasr-cli so the static archives are pulled in. After this the module carries the backend symbols (nm count 407, .so grows from ~2.1MB to ~6.7MB) and the session API can dispatch to every compiled-in architecture.

Also rewrite ${CMAKE_SOURCE_DIR}/examples/talk-llama to ${PROJECT_SOURCE_DIR}/... in the vendored src/CMakeLists.txt: CrispASR locates its vendored llama.cpp via ${CMAKE_SOURCE_DIR}, which is wrong when CrispASR is add_subdirectory'd (CMAKE_SOURCE_DIR points at this backend dir, not the CrispASR root). PROJECT_SOURCE_DIR is correct both standalone and as a subproject; the sed is idempotent.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): adapt suite to session API (blocking, no decode callback)

Register the new symbol set (drop the removed token/speaker/callback funcs, add get_backend; load_model now takes 2 args).

The session transcribe is blocking with no abort hook, so a mid-decode cancel can't interrupt it: change the cancellation spec to cancel the context before the call and assert codes.Canceled from the pre-call ctx.Err() check, dropping the <5s mid-decode timing assertion. The streaming spec still holds with per-segment post-decode emission (>=2 deltas, concat(deltas) == final.Text).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add CrispASR ASR model entries (-crispasr)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(gallery): keep only session-auto-detectable CrispASR ASR models

The crispasr backend loads models via crispasr_session_open, which auto-detects the backend from the GGUF general.architecture using crispasr_detect_backend_from_gguf. Architectures not in that detect map cannot be opened, so those gallery entries fail to load.

Removed entries whose architecture is not wired into CrispASR v0.6.11's session auto-detect router (they can be re-added when upstream maps them):

- Not in the detect map: data2vec, firered-asr, funasr, fun-asr-mlt-nano, glm-asr, hubert, kyutai-stt, mega-asr, mimo-asr, moonshine{,-de,-streaming,-tiny-de}, omniasr{,-llm,-llm-1b}, paraformer, sensevoice.
- Pending verification (filename-heuristic routed, not arch-detected): parakeet-ctc-0.6b, parakeet-ctc-1.1b. Their GGUFs are routed to the fastconformer-ctc backend by a filename heuristic in the model registry, which implies general.architecture is not a mapped string.

Kept the parakeet rnnt/tdt_ctc variants: convert-parakeet-to-gguf.py writes general.architecture="parakeet" unconditionally and encodes the rnnt/ctc distinction in metadata fields, so they session-auto-detect.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): TTS synthesis via crispasr_session_synthesize (24kHz)

Add tts_synthesize/tts_free/tts_set_voice to the C-ABI shim. They reuse the already-open g_session (crispasr_session_open auto-detects a TTS model) and dispatch to the upstream synthesis call, which returns malloc'd 24 kHz mono float PCM. Orpheus needs a SNAC codec path that we do not set, so it returns NULL here and surfaces as an error Go-side.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): implement TTS/TTSStream gRPC methods

Bind the new shim functions via purego and implement TTS, TTSStream and a writeWAV24k helper. synthesize copies the C-owned PCM out before freeing it; TTS writes a 24 kHz mono 16-bit WAV to req.Dst via go-audio/wav. CrispASR has no progressive synth, so TTSStream synthesizes fully, encodes to WAV, and emits the bytes as a single chunk; it owns the results-channel close (the gRPC server wrapper ranges until close), mirroring vibevoice-cpp's TTSStream.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): log when a TTS voice override is not honored

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add CrispASR vibevoice-tts model entry

Only vibevoice-tts works through the current shim: qwen3-tts, chatterbox, and orpheus require companion codec/s3gen/SNAC paths (set_codec_path / set_s3gen_path) that the shim doesn't wire yet, and kokoro/indextts/voxcpm2 aren't in the session auto-detect map. Those are follow-ups.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): gated TTS synthesis spec

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(crispasr): satisfy golangci-lint (errcheck defers + unsafeptr nolint)

The crispasr Go file is entirely new, so new-from-merge-base lints every line (unlike the grandfathered whisper backend it was forked from):

- handle os.RemoveAll / fh.Close return values in AudioTranscription
- annotate the two intentional C-pointer unsafe.Slice sites with //nolint:govet

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): backend: and codec: model options (explicit arch + companion files)

Add two model-config options to the CrispASR backend via opts.Options:

- backend:<name> selects an explicit CrispASR backend (bypassing auto-detect) by routing load_model through crispasr_session_open_explicit, unlocking architectures the detector won't pick on its own (qwen3, cohere, granite, voxtral, moonshine, mimo-asr, orpheus, kokoro, chatterbox, etc.).
- codec:<path> loads a companion file (qwen3-tts codec, orpheus SNAC, chatterbox s3gen, or mimo-asr tokenizer) via the universal crispasr_session_set_codec_path setter after the session opens. A relative path resolves against the model directory. rc==0 means success or not-applicable; only a negative rc is fatal.

The C shim load_model gains a backend_name argument and a new set_codec_path entry point; the Go bridge parses the prefix:value options and registers the new symbol. The vad_only path is unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): expand CrispASR models via backend:/codec: options (explicit arch + companions)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(gallery): use virtual.yaml base for crispasr models

The crispasr entries are just backend + model + a couple options, fully expressed inline via overrides:/files: in gallery/index.yaml. Point each url: at the shared gallery/virtual.yaml (the established 'virtual' model trick) and drop the 36 redundant per-model gallery/*-crispasr.yaml files.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(gallery): drop voice-requiring TTS entries (keep vibevoice-tts)

Real e2e showed qwen3-tts/orpheus/chatterbox don't synthesize through the current shim: the codec: companion loads fine, but these engines additionally need a voice pack / voice prompt / reference clip (qwen3-tts base errors 'no voice'; chatterbox is zero-shot cloning; orpheus uses named voices) that the backend doesn't wire. (qwen3-tts also can't auto-detect: its GGUF arch is 'qwen3tts', unmapped by the detector — would need backend:qwen3-tts.) Removed to avoid shipping non-working gallery entries; vibevoice-tts (built-in voice, e2e-verified) remains the working TTS. Voice-pack wiring is a follow-up.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): speaker: and voice: TTS options (baked speakers + voice packs/prompts)

speaker:<name> -> crispasr_session_set_speaker_name (baked speakers: qwen3-tts CustomVoice, orpheus). voice:<path>(+voice_text:<ref>) -> crispasr_session_set_voice (voice-pack GGUF, or WAV zero-shot clone with ref text). Applied at Load as the default voice; req.Voice still overrides the speaker per request.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): re-add e2e-verified TTS engines (chatterbox, qwen3-tts-customvoice, orpheus)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
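The backend: and codec: options in the commit message are prefix:value strings, and a relative codec path resolves against the model directory. A minimal Go sketch of that parsing step follows. It is an illustration only: the type loadOptions, the function parseOptions, and the file name snac.gguf are hypothetical and not taken from gocrispasr.go, and Unix-style paths are assumed.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// loadOptions holds the two settings named in the commit message.
// Hypothetical type for this sketch; the real backend may store them differently.
type loadOptions struct {
	Backend string // explicit CrispASR backend name; empty means auto-detect
	Codec   string // companion file path; empty means none
}

// parseOptions splits each "prefix:value" entry on its first colon and
// resolves a relative codec path against modelDir.
func parseOptions(opts []string, modelDir string) loadOptions {
	var out loadOptions
	for _, o := range opts {
		key, val, ok := strings.Cut(o, ":")
		if !ok || val == "" {
			continue // not a prefix:value option; ignored in this sketch
		}
		switch key {
		case "backend":
			out.Backend = val
		case "codec":
			if !filepath.IsAbs(val) {
				val = filepath.Join(modelDir, val)
			}
			out.Codec = val
		}
	}
	return out
}

func main() {
	// "snac.gguf" is a made-up companion file name.
	got := parseOptions([]string{"backend:orpheus", "codec:snac.gguf"}, "/models")
	fmt.Printf("backend=%q codec=%q\n", got.Backend, got.Codec)
}
```

Splitting on the first colon only keeps values that themselves contain a colon intact. How the real bridge treats unknown prefixes or empty values is not stated in the commit message, so skipping them here is an assumption.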
2026-07-18 20:24:11 -04:00 · 2026-05-31 12:11:03 +02:00
parent baa11133f1
commit 76fe0bb929
17 changed files with 2462 additions and 2 deletions
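The commit message states that AudioTranscriptionStream emits one delta per non-empty segment after the blocking decode returns, preserving concat(deltas) == final.Text. A minimal Go sketch of one way to keep that invariant: build the final text from the same deltas that are emitted. The function segmentDeltas is hypothetical, and the real backend may trim or join segment text differently.

```go
package main

import (
	"fmt"
	"strings"
)

// segmentDeltas returns one delta per non-empty segment, plus the final text.
// The final text is joined from the same deltas, so concatenating the deltas
// reproduces it by construction.
func segmentDeltas(segments []string) (deltas []string, final string) {
	for _, seg := range segments {
		if seg == "" {
			continue // empty segments produce no delta
		}
		deltas = append(deltas, seg)
	}
	return deltas, strings.Join(deltas, "")
}

func main() {
	deltas, final := segmentDeltas([]string{" Hello", "", " world"})
	fmt.Println(len(deltas), strings.Join(deltas, "") == final)
}
```

Deriving the final text from the emitted deltas, rather than assembling it separately, means the two cannot drift apart if segment filtering changes later.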
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
@@ -716,6 +716,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'cublas'
+    cuda-major-version: "12"
+    cuda-minor-version: "8"
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-nvidia-cuda-12-crispasr'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "crispasr"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "8"
@@ -1569,6 +1582,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-nvidia-cuda-13-crispasr'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "crispasr"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -1595,6 +1621,19 @@ include:
    backend: "whisper"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/arm64'
+    skip-drivers: 'false'
+    tag-latest: 'auto'
+    tag-suffix: '-nvidia-l4t-cuda-13-arm64-crispasr'
+    base-image: "ubuntu:24.04"
+    ubuntu-version: '2404'
+    runs-on: 'ubuntu-24.04-arm'
+    backend: "crispasr"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -2889,6 +2928,20 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: ''
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    platform-tag: 'amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-cpu-crispasr'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "crispasr"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2903,6 +2956,20 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: ''
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/arm64'
+    platform-tag: 'arm64'
+    tag-latest: 'auto'
+    tag-suffix: '-cpu-crispasr'
+    runs-on: 'ubuntu-24.04-arm'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "crispasr"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'sycl_f32'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2916,6 +2983,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'sycl_f32'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-intel-sycl-f32-crispasr'
+    runs-on: 'ubuntu-latest'
+    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
+    skip-drivers: 'false'
+    backend: "crispasr"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'sycl_f16'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2929,6 +3009,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'sycl_f16'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-intel-sycl-f16-crispasr'
+    runs-on: 'ubuntu-latest'
+    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
+    skip-drivers: 'false'
+    backend: "crispasr"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'vulkan'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2943,6 +3036,20 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'vulkan'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    platform-tag: 'amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-vulkan-crispasr'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "crispasr"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'vulkan'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2957,6 +3064,20 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'vulkan'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/arm64'
+    platform-tag: 'arm64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-vulkan-crispasr'
+    runs-on: 'ubuntu-24.04-arm'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "crispasr"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "0"
@@ -2970,6 +3091,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2204'
+  - build-type: 'cublas'
+    cuda-major-version: "12"
+    cuda-minor-version: "0"
+    platforms: 'linux/arm64'
+    skip-drivers: 'false'
+    tag-latest: 'auto'
+    tag-suffix: '-nvidia-l4t-arm64-crispasr'
+    base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
+    runs-on: 'ubuntu-24.04-arm'
+    backend: "crispasr"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2204'
  - build-type: 'hipblas'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2983,6 +3117,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'hipblas'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-rocm-hipblas-crispasr'
+    base-image: "rocm/dev-ubuntu-24.04:7.2.1"
+    runs-on: 'ubuntu-latest'
+    skip-drivers: 'false'
+    backend: "crispasr"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  # parakeet-cpp
  - build-type: ''
    cuda-major-version: ""
@@ -4124,6 +4271,10 @@ includeDarwin:
    tag-suffix: "-metal-darwin-arm64-whisper"
    build-type: "metal"
    lang: "go"
+  - backend: "crispasr"
+    tag-suffix: "-metal-darwin-arm64-crispasr"
+    build-type: "metal"
+    lang: "go"
  - backend: "parakeet-cpp"
    tag-suffix: "-metal-darwin-arm64-parakeet-cpp"
    build-type: "metal"
--- a/.github/workflows/bump_deps.yaml
+++ b/.github/workflows/bump_deps.yaml
@@ -30,6 +30,10 @@ jobs:
            variable: "WHISPER_CPP_VERSION"
            branch: "master"
            file: "backend/go/whisper/Makefile"
+          - repository: "CrispStrobe/CrispASR"
+            variable: "CRISPASR_VERSION"
+            branch: "main"
+            file: "backend/go/crispasr/Makefile"
          - repository: "mudler/parakeet.cpp"
            variable: "PARAKEET_VERSION"
            branch: "master"
--- a/Makefile
+++ b/Makefile
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio

 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -1162,6 +1162,7 @@ BACKEND_HUGGINGFACE = huggingface|golang|.|false|true
 BACKEND_SILERO_VAD = silero-vad|golang|.|false|true
 BACKEND_STABLEDIFFUSION_GGML = stablediffusion-ggml|golang|.|--progress=plain|true
 BACKEND_WHISPER = whisper|golang|.|false|true
+BACKEND_CRISPASR = crispasr|golang|.|false|true
 BACKEND_PARAKEET_CPP = parakeet-cpp|golang|.|false|true
 BACKEND_VOXTRAL = voxtral|golang|.|false|true
 BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
@@ -1250,6 +1251,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SILERO_VAD)))
 $(eval $(call generate-docker-build-target,$(BACKEND_STABLEDIFFUSION_GGML)))
 $(eval $(call generate-docker-build-target,$(BACKEND_WHISPER)))
+$(eval $(call generate-docker-build-target,$(BACKEND_CRISPASR)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PARAKEET_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VOXTRAL)))
 $(eval $(call generate-docker-build-target,$(BACKEND_OPUS)))
@@ -1300,7 +1302,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar

-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy

 ########################################################
 ### Mock Backend for E2E Tests
--- a/backend/go/crispasr/.gitignore
+++ b/backend/go/crispasr/.gitignore
@@ -0,0 +1,5 @@
+sources
+build*
+libgocrispasr*.so
+crispasr
+package
--- a/backend/go/crispasr/CMakeLists.txt
+++ b/backend/go/crispasr/CMakeLists.txt
@@ -0,0 +1,30 @@
+cmake_minimum_required(VERSION 3.12)
+project(gocrispasr LANGUAGES C CXX)
+set(CMAKE_POSITION_INDEPENDENT_CODE ON)
+set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
+
+add_subdirectory(./sources/CrispASR)
+
+add_library(gocrispasr MODULE cpp/crispasr_shim.cpp)
+target_include_directories(gocrispasr PRIVATE
+    ${CMAKE_CURRENT_SOURCE_DIR}/sources/CrispASR/include
+    ${CMAKE_CURRENT_SOURCE_DIR}/sources/CrispASR/ggml/include)
+# Link the same backend set as crispasr-cli (examples/cli/CMakeLists.txt) so
+# the session API can dispatch to every compiled-in architecture, not just
+# whisper. crispasr is the referencer; the backend static libs supply the
+# per-architecture symbols; ggml is the math/runtime base.
+target_link_libraries(gocrispasr PRIVATE
+    crispasr
+    parakeet canary canary_ctc cohere granite_speech granite_nle
+    voxtral voxtral4b qwen3_asr qwen3_tts orpheus chatterbox indextts
+    kokoro voxcpm2_tts m2m100 t5_translate wav2vec2-ggml vibevoice
+    silero-lid pyannote-seg funasr paraformer sensevoice
+    crisp_audio
+    ggml)
+
+if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)
+    target_link_libraries(gocrispasr PRIVATE stdc++fs)
+endif()
+
+set_property(TARGET gocrispasr PROPERTY CXX_STANDARD 17)
+set_target_properties(gocrispasr PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})
--- a/backend/go/crispasr/Makefile
+++ b/backend/go/crispasr/Makefile
@@ -0,0 +1,132 @@
+CMAKE_ARGS?=
+BUILD_TYPE?=
+NATIVE?=false
+
+GOCMD?=go
+GO_TAGS?=
+JOBS?=$(shell nproc --ignore=1)
+
+# CrispASR version (release tag)
+CRISPASR_REPO?=https://github.com/CrispStrobe/CrispASR
+CRISPASR_VERSION?=v0.6.11
+SO_TARGET?=libgocrispasr.so
+
+CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
+# Keep the build lean: no tests/examples/server/SDL2/curl/ffmpeg (the FROM scratch
+# image cannot satisfy those runtime deps). All ASR/TTS model backends stay enabled.
+CMAKE_ARGS+=-DCRISPASR_BUILD_TESTS=OFF -DCRISPASR_BUILD_EXAMPLES=OFF -DCRISPASR_BUILD_SERVER=OFF
+CMAKE_ARGS+=-DCRISPASR_SDL2=OFF -DCRISPASR_CURL=OFF -DCRISPASR_FFMPEG=OFF
+
+ifeq ($(NATIVE),false)
+	CMAKE_ARGS+=-DGGML_NATIVE=OFF
+endif
+
+ifeq ($(BUILD_TYPE),cublas)
+	CMAKE_ARGS+=-DGGML_CUDA=ON
+else ifeq ($(BUILD_TYPE),openblas)
+	CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
+else ifeq ($(BUILD_TYPE),clblas)
+	CMAKE_ARGS+=-DGGML_CLBLAST=ON -DCLBlast_DIR=/some/path
+else ifeq ($(BUILD_TYPE),hipblas)
+	CMAKE_ARGS+=-DGGML_HIPBLAS=ON
+else ifeq ($(BUILD_TYPE),vulkan)
+	CMAKE_ARGS+=-DGGML_VULKAN=ON
+else ifeq ($(OS),Darwin)
+	ifneq ($(BUILD_TYPE),metal)
+		CMAKE_ARGS+=-DGGML_METAL=OFF
+	else
+		CMAKE_ARGS+=-DGGML_METAL=ON
+		CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
+	endif
+endif
+
+ifeq ($(BUILD_TYPE),sycl_f16)
+	CMAKE_ARGS+=-DGGML_SYCL=ON \
+		-DCMAKE_C_COMPILER=icx \
+		-DCMAKE_CXX_COMPILER=icpx \
+		-DGGML_SYCL_F16=ON
+endif
+
+ifeq ($(BUILD_TYPE),sycl_f32)
+	CMAKE_ARGS+=-DGGML_SYCL=ON \
+		-DCMAKE_C_COMPILER=icx \
+		-DCMAKE_CXX_COMPILER=icpx
+endif
+
+sources/CrispASR:
+	mkdir -p sources/CrispASR
+	cd sources/CrispASR && \
+	git init && \
+	git remote add origin $(CRISPASR_REPO) && \
+	git fetch origin && \
+	git checkout $(CRISPASR_VERSION) && \
+	git submodule update --init --recursive --depth 1 --single-branch
+	# CrispASR's src/CMakeLists.txt locates its vendored llama.cpp
+	# (crispasr-llama-core, used by the chat C-ABI) via ${CMAKE_SOURCE_DIR},
+	# which assumes CrispASR is the top-level CMake project. We add_subdirectory
+	# it, so ${CMAKE_SOURCE_DIR} is THIS backend dir and the talk-llama sources
+	# aren't found. Rewrite to ${PROJECT_SOURCE_DIR} (the crispasr project root),
+	# which is correct both standalone and as a subproject. Idempotent.
+	sed -i 's#\$${CMAKE_SOURCE_DIR}/examples/talk-llama#\$${PROJECT_SOURCE_DIR}/examples/talk-llama#' sources/CrispASR/src/CMakeLists.txt
+
+# Detect OS
+UNAME_S := $(shell uname -s)
+
+ifeq ($(UNAME_S),Linux)
+	VARIANT_TARGETS = libgocrispasr-avx.so libgocrispasr-avx2.so libgocrispasr-avx512.so libgocrispasr-fallback.so
+else
+	VARIANT_TARGETS = libgocrispasr-fallback.so
+endif
+
+crispasr: main.go gocrispasr.go $(VARIANT_TARGETS)
+	CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o crispasr ./
+
+package: crispasr
+	bash package.sh
+
+build: package
+
+clean: purge
+	rm -rf libgocrispasr*.so package sources/CrispASR crispasr
+
+purge:
+	rm -rf build*
+
+ifeq ($(UNAME_S),Linux)
+libgocrispasr-avx.so: sources/CrispASR
+	$(MAKE) purge
+	$(info ${GREEN}I crispasr build info:avx${RESET})
+	SO_TARGET=libgocrispasr-avx.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgocrispasr-custom
+	rm -rfv build*
+
+libgocrispasr-avx2.so: sources/CrispASR
+	$(MAKE) purge
+	$(info ${GREEN}I crispasr build info:avx2${RESET})
+	SO_TARGET=libgocrispasr-avx2.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgocrispasr-custom
+	rm -rfv build*
+
+libgocrispasr-avx512.so: sources/CrispASR
+	$(MAKE) purge
+	$(info ${GREEN}I crispasr build info:avx512${RESET})
+	SO_TARGET=libgocrispasr-avx512.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgocrispasr-custom
+	rm -rfv build*
+endif
+
+libgocrispasr-fallback.so: sources/CrispASR
+	$(MAKE) purge
+	$(info ${GREEN}I crispasr build info:fallback${RESET})
+	SO_TARGET=libgocrispasr-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgocrispasr-custom
+	rm -rfv build*
+
+libgocrispasr-custom: CMakeLists.txt cpp/crispasr_shim.cpp cpp/crispasr_shim.h
+	mkdir -p build-$(SO_TARGET) && \
+	cd build-$(SO_TARGET) && \
+	cmake .. $(CMAKE_ARGS) && \
+	cmake --build . --config Release -j$(JOBS) && \
+	cd .. && \
+	mv build-$(SO_TARGET)/libgocrispasr.so ./$(SO_TARGET)
+
+test: crispasr
+	CGO_ENABLED=0 $(GOCMD) test -v ./...
+
+all: crispasr package
--- a/backend/go/crispasr/cpp/crispasr_shim.cpp
+++ b/backend/go/crispasr/cpp/crispasr_shim.cpp
@@ -0,0 +1,253 @@
+#include "crispasr_shim.h"
+#include "ggml-backend.h"
+#include "crispasr.h"
+#include <atomic>
+#include <vector>
+
+// Opaque session types. crispasr.h declares `struct crispasr_session;` but not
+// the result type nor the open/transcribe/result accessors — those are
+// CA_EXPORT extern "C" symbols in src/crispasr_c_api.cpp, so we forward-declare
+// exactly the ones we use. Signatures verified against
+// sources/CrispASR/src/crispasr_c_api.cpp.
+struct crispasr_session_result;
+extern "C" {
+crispasr_session *crispasr_session_open(const char *model_path, int n_threads);
+crispasr_session *crispasr_session_open_explicit(const char *model_path,
+                                                 const char *backend_name,
+                                                 int n_threads);
+int crispasr_session_set_codec_path(crispasr_session *s, const char *path);
+void crispasr_session_close(crispasr_session *s);
+const char *crispasr_session_backend(crispasr_session *s);
+int crispasr_session_set_translate(crispasr_session *s, int enable);
+crispasr_session_result *crispasr_session_transcribe_lang(
+    crispasr_session *s, const float *pcm, int n_samples, const char *language);
+int crispasr_session_result_n_segments(crispasr_session_result *r);
+const char *crispasr_session_result_segment_text(crispasr_session_result *r,
+                                                  int i);
+int64_t crispasr_session_result_segment_t0(crispasr_session_result *r, int i);
+int64_t crispasr_session_result_segment_t1(crispasr_session_result *r, int i);
+void crispasr_session_result_free(crispasr_session_result *r);
+float *crispasr_session_synthesize(crispasr_session *s, const char *text,
+                                   int *out_n_samples);
+void crispasr_pcm_free(float *pcm);
+int crispasr_session_set_speaker_name(crispasr_session *s, const char *name);
+int crispasr_session_set_voice(crispasr_session *s, const char *path,
+                               const char *ref_text_or_null);
+}
+
+static crispasr_session *g_session = nullptr;
+static crispasr_session_result *g_result = nullptr;
+
+static struct whisper_vad_context *vctx;
+static std::vector<float> flat_segs;
+
+static std::atomic<int> g_abort{0};
+
+extern "C" void set_abort(int v) {
+  g_abort.store(v, std::memory_order_relaxed);
+}
+
+static void ggml_log_cb(enum ggml_log_level level, const char *log,
+                        void *data) {
+  const char *level_str;
+
+  if (!log) {
+    return;
+  }
+
+  switch (level) {
+  case GGML_LOG_LEVEL_DEBUG:
+    level_str = "DEBUG";
+    break;
+  case GGML_LOG_LEVEL_INFO:
+    level_str = "INFO";
+    break;
+  case GGML_LOG_LEVEL_WARN:
+    level_str = "WARN";
+    break;
+  case GGML_LOG_LEVEL_ERROR:
+    level_str = "ERROR";
+    break;
+  default: /* Potential future-proofing */
+    level_str = "?????";
+    break;
+  }
+
+  fprintf(stderr, "[%-5s] ", level_str);
+  fputs(log, stderr);
+  fflush(stderr);
+}
+
+int load_model(const char *const model_path, int threads,
+               const char *backend_name) {
+  whisper_log_set(ggml_log_cb, nullptr);
+  ggml_backend_load_all();
+
+  if (backend_name && *backend_name) {
+    g_session =
+        crispasr_session_open_explicit(model_path, backend_name, threads);
+  } else {
+    g_session = crispasr_session_open(model_path, threads);
+  }
+  if (g_session == nullptr) {
+    fprintf(stderr, "error: failed to open CrispASR session for model\n");
+    return 1;
+  }
+
+  fprintf(stderr, "info: CrispASR backend selected: %s\n",
+          crispasr_session_backend(g_session));
+  return 0;
+}
+
+// set_codec_path forwards a companion file (qwen3-tts codec, orpheus SNAC,
+// chatterbox s3gen, or mimo-asr tokenizer) to the active session. Returns 0 on
+// success or when the active backend needs no companion, negative on failure,
+// and -1 when no session is open.
+int set_codec_path(const char *path) {
+  return g_session ? crispasr_session_set_codec_path(g_session, path) : -1;
+}
+
+int load_model_vad(const char *const model_path) {
+  whisper_log_set(ggml_log_cb, nullptr);
+  ggml_backend_load_all();
+
+  struct whisper_vad_context_params vcparams =
+      whisper_vad_default_context_params();
+
+  // XXX: Overridden to false in upstream due to performance?
+  // vcparams.use_gpu = true;
+
+  vctx = whisper_vad_init_from_file_with_params(model_path, vcparams);
+  if (vctx == nullptr) {
+    fprintf(stderr, "error: Failed to init model as VAD\n");
+    return 1;
+  }
+
+  return 0;
+}
+
+int vad(float pcmf32[], size_t pcmf32_len, float **segs_out,
+        size_t *segs_out_len) {
+  if (!whisper_vad_detect_speech(vctx, pcmf32, pcmf32_len)) {
+    fprintf(stderr, "error: failed to detect speech\n");
+    return 1;
+  }
+
+  struct whisper_vad_params params = whisper_vad_default_params();
+  struct whisper_vad_segments *segs =
+      whisper_vad_segments_from_probs(vctx, params);
+  size_t segn = whisper_vad_segments_n_segments(segs);
+
+  // fprintf(stderr, "Got segments %zd\n", segn);
+
+  flat_segs.clear();
+
+  for (int i = 0; i < (int)segn; i++) {
+    flat_segs.push_back(whisper_vad_segments_get_segment_t0(segs, i));
+    flat_segs.push_back(whisper_vad_segments_get_segment_t1(segs, i));
+  }
+
+  // fprintf(stderr, "setting out variables: %p=%p -> %p, %p=%zx -> %zx\n",
+  //         segs_out, *segs_out, flat_segs.data(), segs_out_len, *segs_out_len,
+  //         flat_segs.size());
+  *segs_out = flat_segs.data();
+  *segs_out_len = flat_segs.size();
+
+  // fprintf(stderr, "freeing segs\n");
+  whisper_vad_free_segments(segs);
+
+  // fprintf(stderr, "returning\n");
+  return 0;
+}
+
+// threads, diarize and prompt are accepted for Go-side API parity but unused
+// in Phase 1: the thread count is fixed at session open, and diarization and
+// the initial prompt are separate CrispASR features not yet wired through the
+// session ASR path.
+int transcribe(uint32_t threads, char *lang, bool translate, bool diarize,
+               float pcmf32[], size_t pcmf32_len, size_t *segs_out_len,
+               char *prompt) {
+  (void)threads;
+  (void)diarize;
+  (void)prompt;
+
+  if (!g_session) {
+    return 1;
+  }
+
+  // Reset stale abort flag from any prior cancelled call. set_abort remains
+  // best-effort: the session transcribe call is blocking and exposes no abort
+  // hook, so a mid-decode abort cannot interrupt it.
+  g_abort.store(0, std::memory_order_relaxed);
+
+  crispasr_session_set_translate(g_session, translate ? 1 : 0);
+
+  if (g_result) {
+    crispasr_session_result_free(g_result);
+    g_result = nullptr;
+  }
+
+  const char *language = (lang && *lang) ? lang : nullptr;
+  g_result = crispasr_session_transcribe_lang(g_session, pcmf32, (int)pcmf32_len,
+                                              language);
+  if (!g_result) {
+    fprintf(stderr, "error: transcription failed\n");
+    return 1;
+  }
+
+  *segs_out_len = crispasr_session_result_n_segments(g_result);
+  return 0;
+}
+
+const char *get_segment_text(int i) {
+  if (!g_result) {
+    return "";
+  }
+  return crispasr_session_result_segment_text(g_result, i);
+}
+
+int64_t get_segment_t0(int i) {
+  if (!g_result) {
+    return 0;
+  }
+  return crispasr_session_result_segment_t0(g_result, i);
+}
+
+int64_t get_segment_t1(int i) {
+  if (!g_result) {
+    return 0;
+  }
+  return crispasr_session_result_segment_t1(g_result, i);
+}
+
+const char *get_backend(void) {
+  return g_session ? crispasr_session_backend(g_session) : "";
+}
+
+// TTS uses the already-open session (crispasr_session_open auto-detects a TTS
+// model). Output is 24 kHz mono float PCM (upstream CrispASR convention),
+// malloc'd by the C API; the caller must release it via tts_free.
+float *tts_synthesize(const char *text, int *out_n_samples) {
+  if (out_n_samples) *out_n_samples = 0;
+  if (!g_session || !text) return nullptr;
+  return crispasr_session_synthesize(g_session, text, out_n_samples);
+}
+
+void tts_free(float *pcm) {
+  if (pcm) crispasr_pcm_free(pcm);
+}
+
+int tts_set_voice(const char *name) {
+  if (!g_session || !name || !*name) return 0;
+  return crispasr_session_set_speaker_name(g_session, name);
+}
+
+// tts_set_voice_file loads a voice from a file: a .gguf path selects a voice
+// pack, a .wav path with a non-empty ref_text performs zero-shot voice cloning
+// (the C API returns -2 when ref_text is required but missing). Returns -1 when
+// no session is open or path is null.
+int tts_set_voice_file(const char *path, const char *ref_text) {
+  if (!g_session || !path) return -1;
+  const char *ref = (ref_text && *ref_text) ? ref_text : nullptr;
+  return crispasr_session_set_voice(g_session, path, ref);
+}
--- a/backend/go/crispasr/cpp/crispasr_shim.h
+++ b/backend/go/crispasr/cpp/crispasr_shim.h
@@ -0,0 +1,23 @@
+#include <cstddef>
+#include <cstdint>
+
+extern "C" {
+int load_model(const char *const model_path, int threads,
+               const char *backend_name);
+int set_codec_path(const char *path);
+int load_model_vad(const char *const model_path);
+int vad(float pcmf32[], size_t pcmf32_size, float **segs_out,
+        size_t *segs_out_len);
+int transcribe(uint32_t threads, char *lang, bool translate, bool diarize,
+               float pcmf32[], size_t pcmf32_len, size_t *segs_out_len,
+               char *prompt);
+const char *get_segment_text(int i);
+int64_t get_segment_t0(int i);
+int64_t get_segment_t1(int i);
+const char *get_backend(void);
+void set_abort(int v);
+float *tts_synthesize(const char *text, int *out_n_samples); // 24kHz mono float, malloc'd; NULL on failure
+void tts_free(float *pcm);
+int tts_set_voice(const char *name); // best-effort speaker selection; 0 ok
+int tts_set_voice_file(const char *path, const char *ref_text); // load voice pack (.gguf) or zero-shot clone (.wav + ref_text)
+}
--- a/backend/go/crispasr/gocrispasr.go
+++ b/backend/go/crispasr/gocrispasr.go
@@ -0,0 +1,497 @@
+package main
+
+import (
+	"context"
+	"fmt"
+	"os"
+	"path/filepath"
+	"strings"
+	"sync"
+	"unsafe"
+
+	"github.com/go-audio/audio"
+	"github.com/go-audio/wav"
+	"github.com/mudler/LocalAI/pkg/grpc/base"
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+	"github.com/mudler/LocalAI/pkg/utils"
+	"google.golang.org/grpc/codes"
+	"google.golang.org/grpc/status"
+)
+
+var (
+	CppLoadModel       func(modelPath string, threads int, backendName string) int
+	CppSetCodecPath    func(path string) int
+	CppLoadModelVAD    func(modelPath string) int
+	CppVAD             func(pcmf32 []float32, pcmf32Size uintptr, segsOut unsafe.Pointer, segsOutLen unsafe.Pointer) int
+	CppTranscribe      func(threads uint32, lang string, translate bool, diarize bool, pcmf32 []float32, pcmf32Len uintptr, segsOutLen unsafe.Pointer, prompt string) int
+	CppGetSegmentText  func(i int) string
+	CppGetSegmentStart func(i int) int64
+	CppGetSegmentEnd   func(i int) int64
+	CppGetBackend      func() string
+	CppSetAbort        func(v int)
+	CppTTSSynthesize   func(text string, outNSamples unsafe.Pointer) uintptr
+	CppTTSFree         func(ptr uintptr)
+	CppTTSSetVoice     func(name string) int
+	CppTTSSetVoiceFile func(path string, refText string) int
+)
+
+type CrispASR struct {
+	base.SingleThread
+}
+
+// splitOption splits a "prefix:value" model option into its key and value,
+// matching the convention used by other backends (see sherpa-onnx). It returns
+// ok=false when the option carries no ':' separator.
+func splitOption(oo string) (key, value string, ok bool) {
+	parts := strings.SplitN(oo, ":", 2)
+	if len(parts) != 2 {
+		return "", "", false
+	}
+	return parts[0], parts[1], true
+}
+
+func (w *CrispASR) Load(opts *pb.ModelOptions) error {
+	vadOnly := false
+	backendName := ""
+	codecPath := ""
+	speakerName := ""
+	voicePath := ""
+	voiceRefText := ""
+
+	for _, oo := range opts.Options {
+		if oo == "vad_only" {
+			vadOnly = true
+			continue
+		}
+		switch key, value, ok := splitOption(oo); {
+		case ok && key == "backend":
+			backendName = value
+		case ok && key == "codec":
+			codecPath = value
+		case ok && key == "speaker":
+			speakerName = value
+		case ok && key == "voice":
+			voicePath = value
+		case ok && key == "voice_text":
+			voiceRefText = value
+		default:
+			fmt.Fprintf(os.Stderr, "Unrecognized option: %v\n", oo)
+		}
+	}
+
+	if vadOnly {
+		if ret := CppLoadModelVAD(opts.ModelFile); ret != 0 {
+			return fmt.Errorf("Failed to load CrispASR VAD model")
+		}
+
+		return nil
+	}
+
+	// Resolve a relative companion path against the model directory so a config
+	// can reference a sibling codec/tokenizer file by name alone.
+	if codecPath != "" && !filepath.IsAbs(codecPath) {
+		codecPath = filepath.Join(filepath.Dir(opts.ModelFile), codecPath)
+	}
+
+	// A voice file (.gguf pack or .wav prompt) is resolved against the model
+	// directory just like the codec, so a config can reference a sibling file.
+	if voicePath != "" && !filepath.IsAbs(voicePath) {
+		voicePath = filepath.Join(filepath.Dir(opts.ModelFile), voicePath)
+	}
+
+	if ret := CppLoadModel(opts.ModelFile, int(opts.Threads), backendName); ret != 0 {
+		return fmt.Errorf("Failed to load CrispASR model")
+	}
+
+	// Load the companion file (codec/tokenizer/s3gen) after the session is open.
+	// rc==0 means success or "not applicable" for the active backend; only a
+	// negative code is fatal.
+	if codecPath != "" {
+		if rc := CppSetCodecPath(codecPath); rc < 0 {
+			return fmt.Errorf("crispasr: failed to load companion file %q (rc=%d)", codecPath, rc)
+		}
+		fmt.Fprintf(os.Stderr, "CrispASR companion file loaded: %s\n", codecPath)
+	}
+
+	// Apply the Load-time default voice. A baked speaker (speaker:) is selected
+	// by name and is best-effort: a backend that can't honor it is logged, not
+	// fatal. A voice file (voice:) is a hard requirement once configured, so a
+	// negative rc fails Load.
+	if speakerName != "" {
+		if rc := CppTTSSetVoice(speakerName); rc != 0 {
+			fmt.Fprintf(os.Stderr, "crispasr: speaker %q not applied (rc=%d)\n", speakerName, rc)
+		}
+	}
+	if voicePath != "" {
+		if rc := CppTTSSetVoiceFile(voicePath, voiceRefText); rc < 0 {
+			return fmt.Errorf("crispasr: failed to load voice %q (rc=%d)", voicePath, rc)
+		}
+		fmt.Fprintf(os.Stderr, "CrispASR voice loaded: %s\n", voicePath)
+	}
+
+	fmt.Fprintf(os.Stderr, "CrispASR backend selected: %s\n", CppGetBackend())
+
+	return nil
+}
+
+func (w *CrispASR) VAD(req *pb.VADRequest) (pb.VADResponse, error) {
+	audio := req.Audio
+	// We expect 0xdeadbeef to be overwritten and if we see it in a stack trace we know it wasn't
+	segsPtr, segsLen := uintptr(0xdeadbeef), uintptr(0xdeadbeef)
+	segsPtrPtr, segsLenPtr := unsafe.Pointer(&segsPtr), unsafe.Pointer(&segsLen)
+
+	if ret := CppVAD(audio, uintptr(len(audio)), segsPtrPtr, segsLenPtr); ret != 0 {
+		return pb.VADResponse{}, fmt.Errorf("Failed VAD")
+	}
+
+	// Happens when CPP vector has not had any elements pushed to it
+	if segsPtr == 0 {
+		return pb.VADResponse{
+			Segments: []*pb.VADSegment{},
+		}, nil
+	}
+
+	// unsafeptr warning is caused by segsPtr being on the stack and therefore being subject to stack copying AFAICT
+	// however the stack shouldn't have grown between setting segsPtr and now, also the memory pointed to is allocated by C++
+	segs := unsafe.Slice((*float32)(unsafe.Pointer(segsPtr)), segsLen) //nolint:govet // segsPtr addresses C++-owned heap memory passed back through the cgo-free purego boundary; the uintptr->Pointer round-trip is intentional and the buffer outlives this read.
+
+	vadSegments := []*pb.VADSegment{}
+	for i := range len(segs) >> 1 {
+		s := segs[2*i] / 100
+		t := segs[2*i+1] / 100
+		vadSegments = append(vadSegments, &pb.VADSegment{
+			Start: s,
+			End:   t,
+		})
+	}
+
+	return pb.VADResponse{
+		Segments: vadSegments,
+	}, nil
+}
+
+func (w *CrispASR) AudioTranscription(ctx context.Context, opts *pb.TranscriptRequest) (pb.TranscriptResult, error) {
+	if err := ctx.Err(); err != nil {
+		return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
+	}
+
+	dir, err := os.MkdirTemp("", "crispasr")
+	if err != nil {
+		return pb.TranscriptResult{}, err
+	}
+	defer func() { _ = os.RemoveAll(dir) }()
+
+	convertedPath := filepath.Join(dir, "converted.wav")
+
+	if err := utils.AudioToWav(opts.Dst, convertedPath); err != nil {
+		return pb.TranscriptResult{}, err
+	}
+
+	fh, err := os.Open(convertedPath)
+	if err != nil {
+		return pb.TranscriptResult{}, err
+	}
+	defer func() { _ = fh.Close() }()
+
+	d := wav.NewDecoder(fh)
+	buf, err := d.FullPCMBuffer()
+	if err != nil {
+		return pb.TranscriptResult{}, err
+	}
+
+	data := buf.AsFloat32Buffer().Data
+	var duration float32
+	if buf.Format != nil && buf.Format.SampleRate > 0 {
+		duration = float32(len(data)) / float32(buf.Format.SampleRate)
+	}
+	segsLen := uintptr(0xdeadbeef)
+	segsLenPtr := unsafe.Pointer(&segsLen)
+
+	// Watcher: flips the C-side abort flag when ctx is cancelled. The
+	// goroutine is joined synchronously (close(done) signals it to exit,
+	// wg.Wait() blocks until it has) so a late CppSetAbort(1) cannot fire
+	// after the function returns and corrupt the next transcription call.
+	done := make(chan struct{})
+	var wg sync.WaitGroup
+	wg.Add(1)
+	go func() {
+		defer wg.Done()
+		select {
+		case <-ctx.Done():
+			CppSetAbort(1)
+		case <-done:
+		}
+	}()
+	defer func() {
+		close(done)
+		wg.Wait()
+	}()
+
+	ret := CppTranscribe(opts.Threads, opts.Language, opts.Translate, opts.Diarize, data, uintptr(len(data)), segsLenPtr, opts.Prompt)
+	if ret == 2 {
+		return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
+	}
+	if ret != 0 {
+		return pb.TranscriptResult{}, fmt.Errorf("Failed Transcribe")
+	}
+
+	segments := []*pb.TranscriptSegment{}
+	text := ""
+	for i := range int(segsLen) {
+		// segment start/end conversion factor taken from https://github.com/ggml-org/whisper.cpp/blob/master/examples/cli/cli.cpp#L895
+		s := CppGetSegmentStart(i) * (10000000)
+		t := CppGetSegmentEnd(i) * (10000000)
+		// The session result can emit bytes that aren't valid UTF-8 (e.g. a
+		// multibyte codepoint split across token boundaries); protobuf string
+		// fields reject those at marshal time. Scrub before the value escapes
+		// cgo. The session result is segment+word based and exposes no token
+		// IDs, so Tokens is left empty.
+		txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "\uFFFD")
+
+		segment := &pb.TranscriptSegment{
+			Id:    int32(i),
+			Text:  txt,
+			Start: s, End: t,
+		}
+
+		segments = append(segments, segment)
+
+		text += " " + strings.TrimSpace(txt)
+	}
+
+	return pb.TranscriptResult{
+		Segments: segments,
+		Text:     strings.TrimSpace(text),
+		Language: opts.Language,
+		Duration: duration,
+	}, nil
+}
+
+// AudioTranscriptionStream runs the session transcribe to completion and then
+// emits one delta per non-empty segment, followed by a final TranscriptResult.
+// Progressive/real-time streaming isn't available via the session API (there
+// is no per-decode callback), so deltas are emitted per-segment after the
+// blocking decode returns rather than as segments are produced. The offline
+// AudioTranscription is unchanged; both paths share the session and the
+// SingleThread concurrency model.
+func (w *CrispASR) AudioTranscriptionStream(ctx context.Context, opts *pb.TranscriptRequest, results chan *pb.TranscriptStreamResponse) error {
+	defer close(results)
+
+	if err := ctx.Err(); err != nil {
+		return status.Error(codes.Canceled, "transcription cancelled")
+	}
+
+	dir, err := os.MkdirTemp("", "crispasr")
+	if err != nil {
+		return err
+	}
+	defer func() { _ = os.RemoveAll(dir) }()
+
+	convertedPath := filepath.Join(dir, "converted.wav")
+	if err := utils.AudioToWav(opts.Dst, convertedPath); err != nil {
+		return err
+	}
+
+	fh, err := os.Open(convertedPath)
+	if err != nil {
+		return err
+	}
+	defer func() { _ = fh.Close() }()
+
+	d := wav.NewDecoder(fh)
+	buf, err := d.FullPCMBuffer()
+	if err != nil {
+		return err
+	}
+	data := buf.AsFloat32Buffer().Data
+	var duration float32
+	if buf.Format != nil && buf.Format.SampleRate > 0 {
+		duration = float32(len(data)) / float32(buf.Format.SampleRate)
+	}
+
+	// Same abort-watcher pattern as AudioTranscription. Joined synchronously
+	// so a late CppSetAbort(1) cannot fire after this function returns.
+	// Best-effort only: the session transcribe is blocking with no abort hook.
+	done := make(chan struct{})
+	var wg sync.WaitGroup
+	wg.Add(1)
+	go func() {
+		defer wg.Done()
+		select {
+		case <-ctx.Done():
+			CppSetAbort(1)
+		case <-done:
+		}
+	}()
+	defer func() {
+		close(done)
+		wg.Wait()
+	}()
+
+	segsLen := uintptr(0xdeadbeef)
+	segsLenPtr := unsafe.Pointer(&segsLen)
+	ret := CppTranscribe(opts.Threads, opts.Language, opts.Translate, opts.Diarize, data, uintptr(len(data)), segsLenPtr, opts.Prompt)
+	if ret == 2 {
+		return status.Error(codes.Canceled, "transcription cancelled")
+	}
+	if ret != 0 {
+		return fmt.Errorf("Failed Transcribe")
+	}
+
+	// Walk the segments once: emit a delta per non-empty segment and build the
+	// final TranscriptResult.Segments alongside. The first delta has no leading
+	// space and subsequent ones are prefixed with a single space, so
+	// concat(deltas) == final.Text exactly, matching the e2e contract.
+	segments := []*pb.TranscriptSegment{}
+	var assembled strings.Builder
+	for i := range int(segsLen) {
+		s := CppGetSegmentStart(i) * 10000000
+		t := CppGetSegmentEnd(i) * 10000000
+		txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "\uFFFD")
+		segments = append(segments, &pb.TranscriptSegment{
+			Id:    int32(i),
+			Text:  txt,
+			Start: s, End: t,
+		})
+
+		trimmed := strings.TrimSpace(txt)
+		if trimmed == "" {
+			continue
+		}
+		var delta string
+		if assembled.Len() == 0 {
+			delta = trimmed
+		} else {
+			delta = " " + trimmed
+		}
+		results <- &pb.TranscriptStreamResponse{Delta: delta}
+		assembled.WriteString(delta)
+	}
+
+	final := &pb.TranscriptResult{
+		Segments: segments,
+		Text:     assembled.String(),
+		Language: opts.Language,
+		Duration: duration,
+	}
+	results <- &pb.TranscriptStreamResponse{FinalResult: final}
+	return nil
+}
+
+// synthesize returns 24 kHz mono float32 PCM for text via the open session.
+func (w *CrispASR) synthesize(text string) ([]float32, error) {
+	if text == "" {
+		return nil, fmt.Errorf("crispasr: TTS requires non-empty text")
+	}
+	var n int32
+	ptr := CppTTSSynthesize(text, unsafe.Pointer(&n))
+	if ptr == 0 || n <= 0 {
+		return nil, fmt.Errorf("crispasr: synthesis failed (the loaded model may not be a supported TTS backend, or needs extra config e.g. orpheus SNAC codec)")
+	}
+	defer CppTTSFree(ptr)
+	src := unsafe.Slice((*float32)(unsafe.Pointer(ptr)), int(n)) //nolint:govet // ptr addresses C-allocated PCM returned across the purego boundary; copied out immediately below, before tts_free.
+	out := make([]float32, int(n)) // copy out of C memory before free
+	copy(out, src)
+	return out, nil
+}
+
+// setVoice applies a per-call speaker/voice override (best effort). CrispASR
+// returns a negative code when the active backend can't honor the name; we log
+// it rather than fail, so an unknown voice falls back to the default speaker.
+func setVoice(voice string) {
+	v := strings.TrimSpace(voice)
+	if v == "" {
+		return
+	}
+	if rc := CppTTSSetVoice(v); rc != 0 {
+		fmt.Fprintf(os.Stderr, "crispasr: voice %q not applied by the active TTS backend (rc=%d); using default\n", v, rc)
+	}
+}
+
+func (w *CrispASR) TTS(req *pb.TTSRequest) error {
+	if req.Dst == "" {
+		return fmt.Errorf("crispasr: TTS requires a destination path")
+	}
+	setVoice(req.Voice)
+	pcm, err := w.synthesize(req.Text)
+	if err != nil {
+		return err
+	}
+	return writeWAV24k(req.Dst, pcm)
+}
+
+// TTSStream is the streaming counterpart to TTS. CrispASR has no progressive
+// (native streaming) synth, so we synthesize the whole utterance, encode it to
+// a 24 kHz WAV, and emit the encoded bytes as a single chunk. The gRPC server
+// wrapper (pkg/grpc/server.go:TTSStream) ranges over the channel until it is
+// closed, so this method owns the close - mirrors vibevoice-cpp's TTSStream.
+func (w *CrispASR) TTSStream(req *pb.TTSRequest, results chan []byte) error {
+	defer close(results)
+
+	if req.Text == "" {
+		return fmt.Errorf("crispasr: TTSStream requires text")
+	}
+	setVoice(req.Voice)
+	pcm, err := w.synthesize(req.Text)
+	if err != nil {
+		return err
+	}
+
+	tmp, err := os.CreateTemp("", "crispasr-tts-stream-*.wav")
+	if err != nil {
+		return fmt.Errorf("crispasr: tempfile: %w", err)
+	}
+	dst := tmp.Name()
+	if err := tmp.Close(); err != nil {
+		return fmt.Errorf("crispasr: close tempfile: %w", err)
+	}
+	defer func() { _ = os.Remove(dst) }()
+
+	if err := writeWAV24k(dst, pcm); err != nil {
+		return err
+	}
+
+	encoded, err := os.ReadFile(dst)
+	if err != nil {
+		return fmt.Errorf("crispasr: read tempfile: %w", err)
+	}
+	results <- encoded
+	return nil
+}
+
+// writeWAV24k writes pcm as a 24000 Hz, mono, 16-bit PCM WAV at dst.
+func writeWAV24k(dst string, pcm []float32) error {
+	f, err := os.Create(dst)
+	if err != nil {
+		return fmt.Errorf("crispasr: create %q: %w", dst, err)
+	}
+
+	enc := wav.NewEncoder(f, 24000, 16, 1, 1)
+	ints := make([]int, len(pcm))
+	for i, s := range pcm {
+		if s > 1 {
+			s = 1
+		} else if s < -1 {
+			s = -1
+		}
+		ints[i] = int(s * 32767)
+	}
+	buf := &audio.IntBuffer{
+		Format:         &audio.Format{NumChannels: 1, SampleRate: 24000},
+		Data:           ints,
+		SourceBitDepth: 16,
+	}
+	if err := enc.Write(buf); err != nil {
+		_ = enc.Close()
+		_ = f.Close()
+		return fmt.Errorf("crispasr: encode WAV: %w", err)
+	}
+	if err := enc.Close(); err != nil {
+		_ = f.Close()
+		return fmt.Errorf("crispasr: finalize WAV: %w", err)
+	}
+	if err := f.Close(); err != nil {
+		return fmt.Errorf("crispasr: close %q: %w", dst, err)
+	}
+	return nil
+}
--- a/backend/go/crispasr/gocrispasr_test.go
+++ b/backend/go/crispasr/gocrispasr_test.go
@@ -0,0 +1,193 @@
+package main
+
+import (
+	"context"
+	"os"
+	"path/filepath"
+	"strings"
+	"sync"
+	"testing"
+
+	"github.com/ebitengine/purego"
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+	"google.golang.org/grpc/codes"
+	"google.golang.org/grpc/status"
+)
+
+func TestCrispASR(t *testing.T) {
+	RegisterFailHandler(Fail)
+	RunSpecs(t, "CrispASR Backend Suite")
+}
+
+var (
+	libLoadOnce sync.Once
+	libLoadErr  error
+)
+
+// ensureLibLoaded mirrors main.go's bootstrap so a Go test can drive the
+// bridge without spinning up the gRPC server. Skips the current spec when the
+// shared library isn't present (e.g. before the crispasr backend is built).
+func ensureLibLoaded() {
+	libLoadOnce.Do(func() {
+		libName := os.Getenv("CRISPASR_LIBRARY")
+		if libName == "" {
+			libName = "./libgocrispasr-fallback.so"
+		}
+		if _, err := os.Stat(libName); err != nil {
+			libLoadErr = err
+			return
+		}
+		gosd, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
+		if err != nil {
+			libLoadErr = err
+			return
+		}
+		purego.RegisterLibFunc(&CppLoadModel, gosd, "load_model")
+		purego.RegisterLibFunc(&CppSetCodecPath, gosd, "set_codec_path")
+		purego.RegisterLibFunc(&CppTranscribe, gosd, "transcribe")
+		purego.RegisterLibFunc(&CppGetSegmentText, gosd, "get_segment_text")
+		purego.RegisterLibFunc(&CppGetSegmentStart, gosd, "get_segment_t0")
+		purego.RegisterLibFunc(&CppGetSegmentEnd, gosd, "get_segment_t1")
+		purego.RegisterLibFunc(&CppGetBackend, gosd, "get_backend")
+		purego.RegisterLibFunc(&CppSetAbort, gosd, "set_abort")
+		purego.RegisterLibFunc(&CppTTSSynthesize, gosd, "tts_synthesize")
+		purego.RegisterLibFunc(&CppTTSFree, gosd, "tts_free")
+		purego.RegisterLibFunc(&CppTTSSetVoice, gosd, "tts_set_voice")
+		purego.RegisterLibFunc(&CppTTSSetVoiceFile, gosd, "tts_set_voice_file")
+	})
+	if libLoadErr != nil {
+		Skip("crispasr library not loadable: " + libLoadErr.Error())
+	}
+}
+
+// fixturesOrSkip returns the model + audio paths or skips the spec if either
+// env var is unset. The test never runs in default CI — it requires a real
+// ASR model and a long audio file (~3 minutes) on disk.
+func fixturesOrSkip() (string, string) {
+	modelPath := os.Getenv("CRISPASR_MODEL_PATH")
+	audioPath := os.Getenv("CRISPASR_AUDIO_PATH")
+	if modelPath == "" || audioPath == "" {
+		Skip("set CRISPASR_MODEL_PATH and CRISPASR_AUDIO_PATH to run this spec")
+	}
+	return modelPath, audioPath
+}
+
+// ttsModelOrSkip returns the TTS model path or skips the spec when the env var
+// is unset. Like the transcription fixtures, this never runs in default CI — it
+// needs a real TTS model (e.g. a vibevoice GGUF) on disk.
+func ttsModelOrSkip() string {
+	modelPath := os.Getenv("CRISPASR_TTS_MODEL_PATH")
+	if modelPath == "" {
+		Skip("set CRISPASR_TTS_MODEL_PATH to run this spec")
+	}
+	return modelPath
+}
+
+var _ = Describe("CrispASR", func() {
+	Context("AudioTranscription cancellation", func() {
+		It("returns codes.Canceled on a pre-cancelled context and still succeeds afterwards", func() {
+			modelPath, audioPath := fixturesOrSkip()
+			ensureLibLoaded()
+
+			w := &CrispASR{}
+			Expect(w.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
+
+			// The session transcribe is blocking and exposes no abort hook, so
+			// a mid-decode cancel can't interrupt it. The contract we can rely
+			// on is the pre-call ctx.Err() check: a context cancelled before
+			// the call must yield codes.Canceled without starting a decode.
+			ctx, cancel := context.WithCancel(context.Background())
+			cancel()
+
+			_, err := w.AudioTranscription(ctx, &pb.TranscriptRequest{
+				Dst:      audioPath,
+				Threads:  4,
+				Language: "en",
+			})
+			Expect(err).To(HaveOccurred(), "expected pre-cancelled context to fail")
+			st, ok := status.FromError(err)
+			Expect(ok).To(BeTrue(), "expected gRPC status error, got %v", err)
+			Expect(st.Code()).To(Equal(codes.Canceled), "expected codes.Canceled, got %v", err)
+
+			// Subsequent transcription must succeed — proves g_abort reset.
+			res, err := w.AudioTranscription(context.Background(), &pb.TranscriptRequest{
+				Dst:      audioPath,
+				Threads:  4,
+				Language: "en",
+			})
+			Expect(err).ToNot(HaveOccurred(), "post-cancel transcription failed")
+			Expect(res.Text).ToNot(BeEmpty(), "post-cancel transcription returned empty text")
+		})
+	})
+
+	Context("AudioTranscriptionStream", func() {
+		It("emits one delta per non-empty segment for a multi-segment clip", func() {
+			modelPath, audioPath := fixturesOrSkip()
+			ensureLibLoaded()
+
+			w := &CrispASR{}
+			Expect(w.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
+
+			results := make(chan *pb.TranscriptStreamResponse, 64)
+			done := make(chan error, 1)
+			go func() {
+				done <- w.AudioTranscriptionStream(context.Background(), &pb.TranscriptRequest{
+					Dst:      audioPath,
+					Threads:  4,
+					Language: "en",
+					Stream:   true,
+				}, results)
+			}()
+
+			var deltas []string
+			var assembled strings.Builder
+			var finalText string
+			var finalSegmentCount int
+			for chunk := range results {
+				if d := chunk.GetDelta(); d != "" {
+					deltas = append(deltas, d)
+					assembled.WriteString(d)
+				}
+				if final := chunk.GetFinalResult(); final != nil {
+					finalText = final.GetText()
+					finalSegmentCount = len(final.GetSegments())
+				}
+			}
+			Expect(<-done).ToNot(HaveOccurred())
+
+			// One delta per non-empty segment is emitted after the blocking
+			// decode returns (the session API has no per-decode callback), so a
+			// multi-segment clip MUST produce >=2 delta events, and
+			// concat(deltas) MUST equal final.Text exactly.
+			Expect(len(deltas)).To(BeNumerically(">=", 2),
+				"expected multiple deltas from a multi-segment clip, got %d (assembled=%q)",
+				len(deltas), assembled.String())
+			Expect(finalSegmentCount).To(BeNumerically(">=", 2),
+				"expected final to carry multiple segments")
+			Expect(assembled.String()).To(Equal(finalText),
+				"concat(deltas) must equal final.Text")
+		})
+	})
+
+	Context("TTS", func() {
+		It("synthesizes a non-empty WAV", func() {
+			ttsModel := ttsModelOrSkip()
+			ensureLibLoaded()
+
+			w := &CrispASR{}
+			Expect(w.Load(&pb.ModelOptions{ModelFile: ttsModel})).To(Succeed())
+
+			dst := filepath.Join(GinkgoT().TempDir(), "out.wav")
+			Expect(w.TTS(&pb.TTSRequest{Text: "Hello from CrispASR.", Dst: dst})).To(Succeed())
+
+			info, err := os.Stat(dst)
+			Expect(err).ToNot(HaveOccurred(), "synthesized WAV should exist at %q", dst)
+			// A real 24 kHz mono WAV is a 44-byte header plus samples; a file of
+			// 1 KiB or less would mean an empty or failed synth.
+			Expect(info.Size()).To(BeNumerically(">", 1024),
+				"expected a non-trivial WAV, got %d bytes", info.Size())
+		})
+	})
+})
--- a/backend/go/crispasr/main.go
+++ b/backend/go/crispasr/main.go
@@ -0,0 +1,58 @@
+package main
+
+// Note: this is started internally by LocalAI and a server is allocated for each model
+import (
+	"flag"
+	"os"
+
+	"github.com/ebitengine/purego"
+	grpc "github.com/mudler/LocalAI/pkg/grpc"
+)
+
+var (
+	addr = flag.String("addr", "localhost:50051", "the address to listen on")
+)
+
+type LibFuncs struct {
+	FuncPtr any
+	Name    string
+}
+
+func main() {
+	libName := os.Getenv("CRISPASR_LIBRARY")
+	if libName == "" {
+		libName = "./libgocrispasr-fallback.so"
+	}
+
+	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
+	if err != nil {
+		panic(err)
+	}
+
+	libFuncs := []LibFuncs{
+		{&CppLoadModel, "load_model"},
+		{&CppSetCodecPath, "set_codec_path"},
+		{&CppLoadModelVAD, "load_model_vad"},
+		{&CppVAD, "vad"},
+		{&CppTranscribe, "transcribe"},
+		{&CppGetSegmentText, "get_segment_text"},
+		{&CppGetSegmentStart, "get_segment_t0"},
+		{&CppGetSegmentEnd, "get_segment_t1"},
+		{&CppGetBackend, "get_backend"},
+		{&CppSetAbort, "set_abort"},
+		{&CppTTSSynthesize, "tts_synthesize"},
+		{&CppTTSFree, "tts_free"},
+		{&CppTTSSetVoice, "tts_set_voice"},
+		{&CppTTSSetVoiceFile, "tts_set_voice_file"},
+	}
+
+	for _, lf := range libFuncs {
+		purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
+	}
+
+	flag.Parse()
+
+	if err := grpc.StartServer(*addr, &CrispASR{}); err != nil {
+		panic(err)
+	}
+}
--- a/backend/go/crispasr/package.sh
+++ b/backend/go/crispasr/package.sh
@@ -0,0 +1,65 @@
+#!/bin/bash
+
+# Script to copy the appropriate libraries based on architecture
+# This script is used in the final stage of the Dockerfile
+
+set -e
+
+CURDIR=$(dirname "$(realpath "$0")")
+REPO_ROOT="${CURDIR}/../../.."
+
+# Create lib directory
+mkdir -p $CURDIR/package/lib
+
+cp -avf $CURDIR/crispasr $CURDIR/package/
+cp -fv $CURDIR/libgocrispasr-*.so $CURDIR/package/
+cp -fv $CURDIR/run.sh $CURDIR/package/
+
+# Detect architecture and copy appropriate libraries
+if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
+    # x86_64 architecture
+    echo "Detected x86_64 architecture, copying x86_64 libraries..."
+    cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
+    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
+    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
+elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
+    # ARM64 architecture
+    echo "Detected ARM64 architecture, copying ARM64 libraries..."
+    cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
+    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
+    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
+elif [ "$(uname -s)" = "Darwin" ]; then
+    echo "Detected Darwin"
+else
+    echo "Error: Could not detect architecture" >&2
+    exit 1
+fi
+
+# Package GPU libraries based on BUILD_TYPE
+# The GPU library packaging script will detect BUILD_TYPE and copy appropriate GPU libraries
+GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
+if [ -f "$GPU_LIB_SCRIPT" ]; then
+    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
+    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
+    package_gpu_libs
+fi
+
+echo "Packaging completed successfully"
+ls -liah $CURDIR/package/
+ls -liah $CURDIR/package/lib/
--- a/backend/go/crispasr/run.sh
+++ b/backend/go/crispasr/run.sh
@@ -0,0 +1,52 @@
+#!/bin/bash
+set -ex
+
+# Get the absolute current dir where the script is located
+CURDIR=$(dirname "$(realpath "$0")")
+
+cd /
+
+echo "CPU info:"
+if [ "$(uname)" != "Darwin" ]; then
+	grep -e "model\sname" /proc/cpuinfo | head -1
+	grep -e "flags" /proc/cpuinfo | head -1
+fi
+
+LIBRARY="$CURDIR/libgocrispasr-fallback.so"
+
+if [ "$(uname)" != "Darwin" ]; then
+	if grep -q -e "\savx\s" /proc/cpuinfo ; then
+		echo "CPU:    AVX    found OK"
+		if [ -e "$CURDIR/libgocrispasr-avx.so" ]; then
+			LIBRARY="$CURDIR/libgocrispasr-avx.so"
+		fi
+	fi
+
+	if grep -q -e "\savx2\s" /proc/cpuinfo ; then
+		echo "CPU:    AVX2   found OK"
+		if [ -e "$CURDIR/libgocrispasr-avx2.so" ]; then
+			LIBRARY="$CURDIR/libgocrispasr-avx2.so"
+		fi
+	fi
+
+	# Check AVX-512 (foundation subset)
+	if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
+		echo "CPU:    AVX512F found OK"
+		if [ -e "$CURDIR/libgocrispasr-avx512.so" ]; then
+			LIBRARY="$CURDIR/libgocrispasr-avx512.so"
+		fi
+	fi
+fi
+
+export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
+export CRISPASR_LIBRARY=$LIBRARY
+
+# If there is a lib/ld.so, use it
+if [ -f "$CURDIR/lib/ld.so" ]; then
+	echo "Using lib/ld.so"
+	echo "Using library: $LIBRARY"
+	exec "$CURDIR/lib/ld.so" "$CURDIR/crispasr" "$@"
+fi
+
+echo "Using library: $LIBRARY"
+exec "$CURDIR/crispasr" "$@"
--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -122,6 +122,33 @@
    nvidia-cuda-12: "cuda12-whisper"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-whisper"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-whisper"
+- &crispasr
+  name: "crispasr"
+  alias: "crispasr"
+  license: mit
+  icon: https://user-images.githubusercontent.com/1991296/235238348-05d0f6a4-da44-4900-a1de-d0707e75b763.jpeg
+  description: |
+    CrispASR unified speech engine (whisper.cpp fork on ggml) supporting many ASR architectures (Parakeet, Canary, Voxtral, Qwen3-ASR, Granite, Wav2Vec2, Moonshine, OmniASR, FireRedASR, and more).
+  urls:
+    - https://github.com/CrispStrobe/CrispASR
+  tags:
+    - audio-transcription
+    - CPU
+    - GPU
+    - CUDA
+    - HIP
+  capabilities:
+    default: "cpu-crispasr"
+    nvidia: "cuda12-crispasr"
+    intel: "intel-sycl-f16-crispasr"
+    metal: "metal-crispasr"
+    amd: "rocm-crispasr"
+    vulkan: "vulkan-crispasr"
+    nvidia-l4t: "nvidia-l4t-arm64-crispasr"
+    nvidia-cuda-13: "cuda13-crispasr"
+    nvidia-cuda-12: "cuda12-crispasr"
+    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-crispasr"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-crispasr"
 - &parakeetcpp
  name: "parakeet-cpp"
  alias: "parakeet-cpp"
@@ -1957,6 +1984,131 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-whisper"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-13-whisper
+## crispasr
+- !!merge <<: *crispasr
+  name: "crispasr-development"
+  capabilities:
+    default: "cpu-crispasr-development"
+    nvidia: "cuda12-crispasr-development"
+    intel: "intel-sycl-f16-crispasr-development"
+    metal: "metal-crispasr-development"
+    amd: "rocm-crispasr-development"
+    vulkan: "vulkan-crispasr-development"
+    nvidia-l4t: "nvidia-l4t-arm64-crispasr-development"
+    nvidia-cuda-13: "cuda13-crispasr-development"
+    nvidia-cuda-12: "cuda12-crispasr-development"
+    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-crispasr-development"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-crispasr-development"
+- !!merge <<: *crispasr
+  name: "nvidia-l4t-arm64-crispasr"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-crispasr"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-arm64-crispasr
+- !!merge <<: *crispasr
+  name: "nvidia-l4t-arm64-crispasr-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-crispasr"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-arm64-crispasr
+- !!merge <<: *crispasr
+  name: "cuda13-nvidia-l4t-arm64-crispasr"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-crispasr"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-crispasr
+- !!merge <<: *crispasr
+  name: "cuda13-nvidia-l4t-arm64-crispasr-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-crispasr"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-crispasr
+- !!merge <<: *crispasr
+  name: "cpu-crispasr"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-crispasr"
+  mirrors:
+    - localai/localai-backends:latest-cpu-crispasr
+- !!merge <<: *crispasr
+  name: "metal-crispasr"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-crispasr"
+  mirrors:
+    - localai/localai-backends:latest-metal-darwin-arm64-crispasr
+- !!merge <<: *crispasr
+  name: "metal-crispasr-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-crispasr"
+  mirrors:
+    - localai/localai-backends:master-metal-darwin-arm64-crispasr
+- !!merge <<: *crispasr
+  name: "cpu-crispasr-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-crispasr"
+  mirrors:
+    - localai/localai-backends:master-cpu-crispasr
+- !!merge <<: *crispasr
+  name: "cuda12-crispasr"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-crispasr"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-12-crispasr
+- !!merge <<: *crispasr
+  name: "rocm-crispasr"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-crispasr"
+  mirrors:
+    - localai/localai-backends:latest-gpu-rocm-hipblas-crispasr
+- !!merge <<: *crispasr
+  name: "intel-sycl-f32-crispasr"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-crispasr"
+  mirrors:
+    - localai/localai-backends:latest-gpu-intel-sycl-f32-crispasr
+- !!merge <<: *crispasr
+  name: "intel-sycl-f16-crispasr"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-crispasr"
+  mirrors:
+    - localai/localai-backends:latest-gpu-intel-sycl-f16-crispasr
+- !!merge <<: *crispasr
+  name: "vulkan-crispasr"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-crispasr"
+  mirrors:
+    - localai/localai-backends:latest-gpu-vulkan-crispasr
+- !!merge <<: *crispasr
+  name: "vulkan-crispasr-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-crispasr"
+  mirrors:
+    - localai/localai-backends:master-gpu-vulkan-crispasr
+- !!merge <<: *crispasr
+  name: "metal-crispasr"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-crispasr"
+  mirrors:
+    - localai/localai-backends:latest-metal-darwin-arm64-crispasr
+- !!merge <<: *crispasr
+  name: "metal-crispasr-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-crispasr"
+  mirrors:
+    - localai/localai-backends:master-metal-darwin-arm64-crispasr
+- !!merge <<: *crispasr
+  name: "cuda12-crispasr-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-crispasr"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-12-crispasr
+- !!merge <<: *crispasr
+  name: "rocm-crispasr-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-crispasr"
+  mirrors:
+    - localai/localai-backends:master-gpu-rocm-hipblas-crispasr
+- !!merge <<: *crispasr
+  name: "intel-sycl-f32-crispasr-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-crispasr"
+  mirrors:
+    - localai/localai-backends:master-gpu-intel-sycl-f32-crispasr
+- !!merge <<: *crispasr
+  name: "intel-sycl-f16-crispasr-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-crispasr"
+  mirrors:
+    - localai/localai-backends:master-gpu-intel-sycl-f16-crispasr
+- !!merge <<: *crispasr
+  name: "cuda13-crispasr"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-crispasr"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-13-crispasr
+- !!merge <<: *crispasr
+  name: "cuda13-crispasr-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-crispasr"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-13-crispasr
 ## parakeet-cpp
 - !!merge <<: *parakeetcpp
  name: "parakeet-cpp-development"
--- a/core/http/endpoints/localai/backend.go
+++ b/core/http/endpoints/localai/backend.go
@@ -31,6 +31,7 @@ var knownPrefOnlyBackends = []schema.KnownBackend{
 	{Name: "mlx-vlm", Modality: "text", AutoDetect: false, Description: "MLX vision-language models (preference-only)"},
 	// ASR
 	{Name: "whisperx", Modality: "asr", AutoDetect: false, Description: "WhisperX transcription (preference-only)"},
+	{Name: "crispasr", Modality: "asr", AutoDetect: false, Description: "CrispASR multi-architecture transcription (preference-only)"},
 	// TTS
 	{Name: "kokoros", Modality: "tts", AutoDetect: false, Description: "Kokoros TTS (preference-only)"},
 	{Name: "qwen-tts", Modality: "tts", AutoDetect: false, Description: "Qwen TTS (preference-only)"},
--- a/core/http/endpoints/localai/backend_test.go
+++ b/core/http/endpoints/localai/backend_test.go
@@ -140,6 +140,7 @@ var _ = Describe("Backend Endpoints", func() {
 			expectPrefOnly("trl", "text")
 			expectPrefOnly("mlx-vlm", "text")
 			expectPrefOnly("whisperx", "asr")
+			expectPrefOnly("crispasr", "asr")
 			expectPrefOnly("kokoros", "tts")
 			expectPrefOnly("qwen-tts", "tts")
 			expectPrefOnly("qwen3-tts-cpp", "tts")
--- a/gallery/index.yaml
+++ b/gallery/index.yaml
@@ -31771,3 +31771,844 @@
    - filename: parakeet-cpp/tdt_ctc-1.1b-f16.gguf
      uri: huggingface://mudler/parakeet-cpp-gguf/tdt_ctc-1.1b-f16.gguf
      sha256: cd53f64eefac2623a12f2f118ef50b56622dc3012f42c815c6adf0d08292f387
+
+- name: parakeet-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/parakeet-tdt-0.6b-v3-GGUF
+  description: |
+    NVIDIA Parakeet TDT 0.6B v3 (FastConformer + Token-and-Duration Transducer), 25-language ASR. Runs via the CrispASR backend. Default GGUF size ~467 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: parakeet-crispasr
+    parameters:
+      model: parakeet-tdt-0.6b-v3-q4_k.gguf
+  files:
+    - filename: parakeet-tdt-0.6b-v3-q4_k.gguf
+      uri: huggingface://cstr/parakeet-tdt-0.6b-v3-GGUF/parakeet-tdt-0.6b-v3-q4_k.gguf
+- name: parakeet-v2-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/parakeet-tdt-0.6b-v2-GGUF
+  description: |
+    NVIDIA Parakeet TDT 0.6B v2 (FastConformer + TDT), English-only ASR. Runs via the CrispASR backend. Default GGUF size ~468 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: parakeet-v2-crispasr
+    parameters:
+      model: parakeet-tdt-0.6b-v2-q4_k.gguf
+  files:
+    - filename: parakeet-tdt-0.6b-v2-q4_k.gguf
+      uri: huggingface://cstr/parakeet-tdt-0.6b-v2-GGUF/parakeet-tdt-0.6b-v2-q4_k.gguf
+- name: parakeet-ja-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/parakeet-tdt-0.6b-ja-GGUF
+  description: |
+    NVIDIA Parakeet TDT 0.6B Japanese ASR (F16 default; Q4_K is quantisation-sensitive for this model). Runs via the CrispASR backend. Default GGUF size ~1.24 GB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: parakeet-ja-crispasr
+    parameters:
+      model: parakeet-tdt-0.6b-ja.gguf
+  files:
+    - filename: parakeet-tdt-0.6b-ja.gguf
+      uri: huggingface://cstr/parakeet-tdt-0.6b-ja-GGUF/parakeet-tdt-0.6b-ja.gguf
+- name: parakeet-tdt-1.1b-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/parakeet-tdt-1.1b-GGUF
+  description: |
+    NVIDIA Parakeet TDT 1.1B (42-layer FastConformer encoder), English-only ASR. Runs via the CrispASR backend. Default GGUF size ~808 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: parakeet-tdt-1.1b-crispasr
+    parameters:
+      model: parakeet-tdt-1.1b-q4_k.gguf
+  files:
+    - filename: parakeet-tdt-1.1b-q4_k.gguf
+      uri: huggingface://cstr/parakeet-tdt-1.1b-GGUF/parakeet-tdt-1.1b-q4_k.gguf
+- name: parakeet-tdt_ctc-110m-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/parakeet-tdt_ctc-110m-GGUF
+  description: |
+    NVIDIA Parakeet hybrid TDT+CTC 110M (smallest, CTC decode), English-only ASR. Runs via the CrispASR backend. Default GGUF size ~91 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: parakeet-tdt_ctc-110m-crispasr
+    parameters:
+      model: parakeet-tdt_ctc-110m-q4_k.gguf
+  files:
+    - filename: parakeet-tdt_ctc-110m-q4_k.gguf
+      uri: huggingface://cstr/parakeet-tdt_ctc-110m-GGUF/parakeet-tdt_ctc-110m-q4_k.gguf
+- name: parakeet-tdt_ctc-1.1b-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/parakeet-tdt_ctc-1.1b-GGUF
+  description: |
+    NVIDIA Parakeet hybrid TDT+CTC 1.1B (multilingual, casing + punctuation) ASR. Runs via the CrispASR backend. Default GGUF size ~810 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: parakeet-tdt_ctc-1.1b-crispasr
+    parameters:
+      model: parakeet-tdt_ctc-1.1b-q4_k.gguf
+  files:
+    - filename: parakeet-tdt_ctc-1.1b-q4_k.gguf
+      uri: huggingface://cstr/parakeet-tdt_ctc-1.1b-GGUF/parakeet-tdt_ctc-1.1b-q4_k.gguf
+- name: parakeet-rnnt-0.6b-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/parakeet-rnnt-0.6b-GGUF
+  description: |
+    NVIDIA Parakeet RNN-Transducer 0.6B (24-layer FastConformer) ASR. Runs via the CrispASR backend. Default GGUF size ~447 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: parakeet-rnnt-0.6b-crispasr
+    parameters:
+      model: parakeet-rnnt-0.6b-q4_k.gguf
+  files:
+    - filename: parakeet-rnnt-0.6b-q4_k.gguf
+      uri: huggingface://cstr/parakeet-rnnt-0.6b-GGUF/parakeet-rnnt-0.6b-q4_k.gguf
+- name: parakeet-rnnt-1.1b-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/parakeet-rnnt-1.1b-GGUF
+  description: |
+    NVIDIA Parakeet RNN-Transducer 1.1B (42-layer FastConformer) ASR. Runs via the CrispASR backend. Default GGUF size ~770 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: parakeet-rnnt-1.1b-crispasr
+    parameters:
+      model: parakeet-rnnt-1.1b-q4_k.gguf
+  files:
+    - filename: parakeet-rnnt-1.1b-q4_k.gguf
+      uri: huggingface://cstr/parakeet-rnnt-1.1b-GGUF/parakeet-rnnt-1.1b-q4_k.gguf
+- name: fastconformer-ctc-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/stt-en-fastconformer-ctc-large-GGUF
+  description: |
+    NVIDIA STT-EN FastConformer-CTC Large, English ASR. Runs via the CrispASR backend. Default GGUF size ~83 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: fastconformer-ctc-crispasr
+    parameters:
+      model: stt-en-fastconformer-ctc-large-q4_k.gguf
+  files:
+    - filename: stt-en-fastconformer-ctc-large-q4_k.gguf
+      uri: huggingface://cstr/stt-en-fastconformer-ctc-large-GGUF/stt-en-fastconformer-ctc-large-q4_k.gguf
+- name: canary-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/canary-1b-v2-GGUF
+  description: |
+    NVIDIA Canary 1B v2 (FastConformer encoder-decoder), multilingual ASR + translation. Runs via the CrispASR backend. Default GGUF size ~600 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: canary-crispasr
+    parameters:
+      model: canary-1b-v2-q4_k.gguf
+  files:
+    - filename: canary-1b-v2-q4_k.gguf
+      uri: huggingface://cstr/canary-1b-v2-GGUF/canary-1b-v2-q4_k.gguf
+- name: voxtral-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/voxtral-mini-3b-2507-GGUF
+  description: |
+    Mistral Voxtral Mini 3B (audio LLM) ASR. Runs via the CrispASR backend. Default GGUF size ~2.5 GB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: voxtral-crispasr
+    parameters:
+      model: voxtral-mini-3b-2507-q4_k.gguf
+  files:
+    - filename: voxtral-mini-3b-2507-q4_k.gguf
+      uri: huggingface://cstr/voxtral-mini-3b-2507-GGUF/voxtral-mini-3b-2507-q4_k.gguf
+- name: voxtral4b-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/voxtral-mini-4b-realtime-GGUF
+  description: |
+    Mistral Voxtral Mini 4B Realtime (audio LLM) ASR. Runs via the CrispASR backend. Default GGUF size ~3.3 GB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: voxtral4b-crispasr
+    parameters:
+      model: voxtral-mini-4b-realtime-q4_k.gguf
+  files:
+    - filename: voxtral-mini-4b-realtime-q4_k.gguf
+      uri: huggingface://cstr/voxtral-mini-4b-realtime-GGUF/voxtral-mini-4b-realtime-q4_k.gguf
+- name: granite-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/granite-speech-4.0-1b-GGUF
+  description: |
+    IBM Granite Speech 4.0 1B ASR. Runs via the CrispASR backend. Default GGUF size ~2.94 GB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: granite-crispasr
+    parameters:
+      model: granite-speech-4.0-1b-q4_k.gguf
+  files:
+    - filename: granite-speech-4.0-1b-q4_k.gguf
+      uri: huggingface://cstr/granite-speech-4.0-1b-GGUF/granite-speech-4.0-1b-q4_k.gguf
+- name: granite-4.1-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/granite-speech-4.1-2b-GGUF
+  description: |
+    IBM Granite Speech 4.1 2B ASR. Runs via the CrispASR backend. Default GGUF size ~2.94 GB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: granite-4.1-crispasr
+    parameters:
+      model: granite-speech-4.1-2b-q4_k.gguf
+  files:
+    - filename: granite-speech-4.1-2b-q4_k.gguf
+      uri: huggingface://cstr/granite-speech-4.1-2b-GGUF/granite-speech-4.1-2b-q4_k.gguf
+- name: granite-4.1-plus-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/granite-speech-4.1-2b-plus-GGUF
+  description: |
+    IBM Granite Speech 4.1 2B Plus ASR. Runs via the CrispASR backend. Default GGUF size ~2.96 GB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: granite-4.1-plus-crispasr
+    parameters:
+      model: granite-speech-4.1-2b-plus-q4_k.gguf
+  files:
+    - filename: granite-speech-4.1-2b-plus-q4_k.gguf
+      uri: huggingface://cstr/granite-speech-4.1-2b-plus-GGUF/granite-speech-4.1-2b-plus-q4_k.gguf
+- name: granite-4.1-nar-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/granite-speech-4.1-2b-nar-GGUF
+  description: |
+    IBM Granite Speech 4.1 2B NAR (non-autoregressive) ASR. Runs via the CrispASR backend. Default GGUF size ~3.2 GB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: granite-4.1-nar-crispasr
+    parameters:
+      model: granite-speech-4.1-2b-nar-q4_k.gguf
+  files:
+    - filename: granite-speech-4.1-2b-nar-q4_k.gguf
+      uri: huggingface://cstr/granite-speech-4.1-2b-nar-GGUF/granite-speech-4.1-2b-nar-q4_k.gguf
+- name: qwen3-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/qwen3-asr-0.6b-GGUF
+  description: |
+    Qwen3-ASR 0.6B ASR. Runs via the CrispASR backend. Default GGUF size ~500 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: qwen3-crispasr
+    parameters:
+      model: qwen3-asr-0.6b-q4_k.gguf
+  files:
+    - filename: qwen3-asr-0.6b-q4_k.gguf
+      uri: huggingface://cstr/qwen3-asr-0.6b-GGUF/qwen3-asr-0.6b-q4_k.gguf
+- name: qwen3-1.7b-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/qwen3-asr-1.7b-GGUF
+  description: |
+    Qwen3-ASR 1.7B ASR. Runs via the CrispASR backend. Default GGUF size ~1.3 GB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: qwen3-1.7b-crispasr
+    parameters:
+      model: qwen3-asr-1.7b-q4_k.gguf
+  files:
+    - filename: qwen3-asr-1.7b-q4_k.gguf
+      uri: huggingface://cstr/qwen3-asr-1.7b-GGUF/qwen3-asr-1.7b-q4_k.gguf
+- name: cohere-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/cohere-transcribe-03-2026-GGUF
+  description: |
+    Cohere Transcribe (03-2026) ASR. Runs via the CrispASR backend. Default GGUF size ~550 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: cohere-crispasr
+    parameters:
+      model: cohere-transcribe-q4_k.gguf
+  files:
+    - filename: cohere-transcribe-q4_k.gguf
+      uri: huggingface://cstr/cohere-transcribe-03-2026-GGUF/cohere-transcribe-q4_k.gguf
+- name: wav2vec2-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/wav2vec2-large-xlsr-53-english-GGUF
+  description: |
+    wav2vec2 Large XLSR-53 English (CTC) ASR. Runs via the CrispASR backend. Default GGUF size ~212 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: wav2vec2-crispasr
+    parameters:
+      model: wav2vec2-xlsr-en-q4_k.gguf
+  files:
+    - filename: wav2vec2-xlsr-en-q4_k.gguf
+      uri: huggingface://cstr/wav2vec2-large-xlsr-53-english-GGUF/wav2vec2-xlsr-en-q4_k.gguf
+- name: wav2vec2-de-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/wav2vec2-large-xlsr-53-german-GGUF
+  description: |
+    wav2vec2 Large XLSR-53 German (CTC) ASR. Runs via the CrispASR backend. Default GGUF size ~222 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: wav2vec2-de-crispasr
+    parameters:
+      model: wav2vec2-large-xlsr-53-german-q4_k.gguf
+  files:
+    - filename: wav2vec2-large-xlsr-53-german-q4_k.gguf
+      uri: huggingface://cstr/wav2vec2-large-xlsr-53-german-GGUF/wav2vec2-large-xlsr-53-german-q4_k.gguf
+- name: vibevoice-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/vibevoice-asr-GGUF
+  description: |
+    VibeVoice ASR. Runs via the CrispASR backend. Default GGUF size ~4.5 GB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: vibevoice-crispasr
+    parameters:
+      model: vibevoice-asr-q4_k.gguf
+  files:
+    - filename: vibevoice-asr-q4_k.gguf
+      uri: huggingface://cstr/vibevoice-asr-GGUF/vibevoice-asr-q4_k.gguf
+- name: vibevoice-tts-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/vibevoice-realtime-0.5b-GGUF
+  description: |
+    VibeVoice Realtime 0.5B text-to-speech (TTS) model, synthesized through the CrispASR backend. Produces 24 kHz mono audio; runs end-to-end on CPU with a built-in default voice. Default GGUF size ~636 MB.
+  tags:
+    - crispasr
+    - tts
+    - text-to-speech
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - tts
+    name: vibevoice-tts-crispasr
+    parameters:
+      model: vibevoice-realtime-0.5b-q4_k.gguf
+  files:
+    - filename: vibevoice-realtime-0.5b-q4_k.gguf
+      uri: huggingface://cstr/vibevoice-realtime-0.5b-GGUF/vibevoice-realtime-0.5b-q4_k.gguf
+- name: chatterbox-tts-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/chatterbox-GGUF
+  description: |
+    Chatterbox (ResembleAI, MIT) text-to-speech synthesized through the CrispASR backend. Two-GGUF runtime: a Llama T3 token model plus an S3Gen codec companion (tokens to 24 kHz waveform). Auto-detected by CrispASR and ships with a built-in default voice; runs end-to-end on CPU and produces 24 kHz mono audio. Default GGUF sizes ~630 MB (T3) + ~358 MB (S3Gen).
+  tags:
+    - crispasr
+    - tts
+    - text-to-speech
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - tts
+    name: chatterbox-tts-crispasr
+    options:
+      - "codec:chatterbox-s3gen-q8_0.gguf"
+    parameters:
+      model: chatterbox-t3-q8_0.gguf
+  files:
+    - filename: chatterbox-t3-q8_0.gguf
+      uri: huggingface://cstr/chatterbox-GGUF/chatterbox-t3-q8_0.gguf
+    - filename: chatterbox-s3gen-q8_0.gguf
+      uri: huggingface://cstr/chatterbox-GGUF/chatterbox-s3gen-q8_0.gguf
+- name: qwen3-tts-customvoice-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/qwen3-tts-0.6b-customvoice-GGUF
+  description: |
+    Qwen3-TTS CustomVoice 0.6B (12 Hz) text-to-speech synthesized through the CrispASR backend. Fixed-speaker fine-tune driven via an explicit backend selector plus a tokenizer codec companion. Ships baked speakers (vivian, aiden, dylan, eric, ono_anna, ryan, serena, sohee, uncle_fu); the default config selects vivian. Runs end-to-end on CPU and produces 24 kHz mono audio. Default GGUF sizes ~968 MB (talker) + ~358 MB (tokenizer).
+  tags:
+    - crispasr
+    - tts
+    - text-to-speech
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - tts
+    name: qwen3-tts-customvoice-crispasr
+    options:
+      - "backend:qwen3-tts"
+      - "codec:qwen3-tts-tokenizer-12hz.gguf"
+      - "speaker:vivian"
+    parameters:
+      model: qwen3-tts-12hz-0.6b-customvoice-q8_0.gguf
+  files:
+    - filename: qwen3-tts-12hz-0.6b-customvoice-q8_0.gguf
+      uri: huggingface://cstr/qwen3-tts-0.6b-customvoice-GGUF/qwen3-tts-12hz-0.6b-customvoice-q8_0.gguf
+    - filename: qwen3-tts-tokenizer-12hz.gguf
+      uri: huggingface://cstr/qwen3-tts-tokenizer-12hz-GGUF/qwen3-tts-tokenizer-12hz.gguf
+- name: orpheus-tts-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/orpheus-3b-base-GGUF
+  description: |
+    Orpheus-3B (Llama-3.2 base) text-to-speech synthesized through the CrispASR backend. Auto-detected by CrispASR; needs a SNAC 24 kHz codec companion and a baked speaker. Ships with the baked speaker tara, which the default config selects. Runs end-to-end on CPU and produces 24 kHz mono audio. Default GGUF sizes ~3.5 GB (model) + ~26 MB (SNAC codec).
+  tags:
+    - crispasr
+    - tts
+    - text-to-speech
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - tts
+    name: orpheus-tts-crispasr
+    options:
+      - "codec:snac-24khz.gguf"
+      - "speaker:tara"
+    parameters:
+      model: orpheus-3b-base-q8_0.gguf
+  files:
+    - filename: orpheus-3b-base-q8_0.gguf
+      uri: huggingface://cstr/orpheus-3b-base-GGUF/orpheus-3b-base-q8_0.gguf
+    - filename: snac-24khz.gguf
+      uri: huggingface://cstr/snac-24khz-GGUF/snac-24khz.gguf
+- name: hubert-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/hubert-large-ls960-ft-GGUF
+  description: |
+    HuBERT Large (LS960 fine-tune) CTC speech recognition, English. Runs via the CrispASR backend with an explicit backend selector. Default GGUF size ~200 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: hubert-crispasr
+    options:
+      - "backend:hubert"
+    parameters:
+      model: hubert-large-ls960-ft-q4_k.gguf
+  files:
+    - filename: hubert-large-ls960-ft-q4_k.gguf
+      uri: huggingface://cstr/hubert-large-ls960-ft-GGUF/hubert-large-ls960-ft-q4_k.gguf
+- name: data2vec-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/data2vec-audio-960h-GGUF
+  description: |
+    data2vec Audio Base (960h) CTC speech recognition, English. Runs via the CrispASR backend with an explicit backend selector. Default GGUF size ~60 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: data2vec-crispasr
+    options:
+      - "backend:data2vec"
+    parameters:
+      model: data2vec-audio-base-960h-q4_k.gguf
+  files:
+    - filename: data2vec-audio-base-960h-q4_k.gguf
+      uri: huggingface://cstr/data2vec-audio-960h-GGUF/data2vec-audio-base-960h-q4_k.gguf
+- name: glm-asr-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/glm-asr-nano-GGUF
+  description: |
+    GLM-ASR Nano speech recognition. Runs via the CrispASR backend with an explicit backend selector. Default GGUF size ~1.2 GB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: glm-asr-crispasr
+    options:
+      - "backend:glm-asr"
+    parameters:
+      model: glm-asr-nano-q4_k.gguf
+  files:
+    - filename: glm-asr-nano-q4_k.gguf
+      uri: huggingface://cstr/glm-asr-nano-GGUF/glm-asr-nano-q4_k.gguf
+- name: kyutai-stt-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/kyutai-stt-1b-GGUF
+  description: |
+    Kyutai STT 1B (Moshi-style) speech recognition. Runs via the CrispASR backend with an explicit backend selector. Default GGUF size ~636 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: kyutai-stt-crispasr
+    options:
+      - "backend:kyutai-stt"
+    parameters:
+      model: kyutai-stt-1b-q4_k.gguf
+  files:
+    - filename: kyutai-stt-1b-q4_k.gguf
+      uri: huggingface://cstr/kyutai-stt-1b-GGUF/kyutai-stt-1b-q4_k.gguf
+- name: firered-asr-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/firered-asr2-aed-GGUF
+  description: |
+    FireRed-ASR2 AED speech recognition. Runs via the CrispASR backend with an explicit backend selector. Default GGUF size ~918 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: firered-asr-crispasr
+    options:
+      - "backend:firered-asr"
+    parameters:
+      model: firered-asr2-aed-q4_k.gguf
+  files:
+    - filename: firered-asr2-aed-q4_k.gguf
+      uri: huggingface://cstr/firered-asr2-aed-GGUF/firered-asr2-aed-q4_k.gguf
+- name: moonshine-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/moonshine-tiny-GGUF
+  description: |
+    Moonshine Tiny speech recognition, English. Runs via the CrispASR backend with an explicit backend selector and a companion tokenizer. Default GGUF size ~20 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: moonshine-crispasr
+    options:
+      - "backend:moonshine"
+      - "codec:tokenizer.bin"
+    parameters:
+      model: moonshine-tiny-q4_k.gguf
+  files:
+    - filename: moonshine-tiny-q4_k.gguf
+      uri: huggingface://cstr/moonshine-tiny-GGUF/moonshine-tiny-q4_k.gguf
+    - filename: tokenizer.bin
+      uri: huggingface://cstr/moonshine-tiny-GGUF/tokenizer.bin
+- name: moonshine-de-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/moonshine-base-de-fidoriel-GGUF
+  description: |
+    Moonshine Base German fine-tune (fidoriel), the larger, higher-quality of the two German Moonshine entries. Runs via the CrispASR backend with an explicit backend selector and a companion tokenizer. Default GGUF size ~39 MB.
+  license: CC-BY-NC-SA-4.0
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: moonshine-de-crispasr
+    options:
+      - "backend:moonshine"
+      - "codec:tokenizer.bin"
+    parameters:
+      model: moonshine-base-de-fidoriel-q4_k.gguf
+  files:
+    - filename: moonshine-base-de-fidoriel-q4_k.gguf
+      uri: huggingface://cstr/moonshine-base-de-fidoriel-GGUF/moonshine-base-de-fidoriel-q4_k.gguf
+    - filename: tokenizer.bin
+      uri: huggingface://cstr/moonshine-base-de-fidoriel-GGUF/tokenizer.bin
+- name: moonshine-tiny-de-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/moonshine-tiny-de-fidoriel-GGUF
+  description: |
+    Moonshine Tiny German fine-tune (fidoriel), smaller/faster German Moonshine. Runs via the CrispASR backend with an explicit backend selector and a companion tokenizer. Default GGUF size ~17 MB.
+  license: CC-BY-NC-SA-4.0
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: moonshine-tiny-de-crispasr
+    options:
+      - "backend:moonshine"
+      - "codec:tokenizer.bin"
+    parameters:
+      model: moonshine-tiny-de-fidoriel-q4_k.gguf
+  files:
+    - filename: moonshine-tiny-de-fidoriel-q4_k.gguf
+      uri: huggingface://cstr/moonshine-tiny-de-fidoriel-GGUF/moonshine-tiny-de-fidoriel-q4_k.gguf
+    - filename: tokenizer.bin
+      uri: huggingface://cstr/moonshine-tiny-de-fidoriel-GGUF/tokenizer.bin
+- name: moonshine-streaming-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/moonshine-streaming-tiny-GGUF
+  description: |
+    Moonshine Streaming Tiny speech recognition. Runs via the CrispASR backend with an explicit backend selector and a companion tokenizer. Default GGUF size ~31 MB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: moonshine-streaming-crispasr
+    options:
+      - "backend:moonshine-streaming"
+      - "codec:tokenizer.bin"
+    parameters:
+      model: moonshine-streaming-tiny-q4_k.gguf
+  files:
+    - filename: moonshine-streaming-tiny-q4_k.gguf
+      uri: huggingface://cstr/moonshine-streaming-tiny-GGUF/moonshine-streaming-tiny-q4_k.gguf
+    - filename: tokenizer.bin
+      uri: huggingface://cstr/moonshine-streaming-tiny-GGUF/tokenizer.bin
+- name: mimo-asr-crispasr
+  url: github:mudler/LocalAI/gallery/virtual.yaml@master
+  urls:
+    - https://huggingface.co/cstr/mimo-asr-GGUF
+  description: |
+    MiMo-ASR speech recognition. Runs via the CrispASR backend with an explicit backend selector and a companion tokenizer GGUF. Default GGUF size ~4.2 GB.
+  tags:
+    - crispasr
+    - asr
+    - speech-recognition
+    - stt
+    - gguf
+  overrides:
+    backend: crispasr
+    known_usecases:
+      - transcript
+    name: mimo-asr-crispasr
+    options:
+      - "backend:mimo-asr"
+      - "codec:mimo-tokenizer-q4_k.gguf"
+    parameters:
+      model: mimo-asr-q4_k.gguf
+  files:
+    - filename: mimo-asr-q4_k.gguf
+      uri: huggingface://cstr/mimo-asr-GGUF/mimo-asr-q4_k.gguf
+    - filename: mimo-tokenizer-q4_k.gguf
+      uri: huggingface://cstr/mimo-tokenizer-GGUF/mimo-tokenizer-q4_k.gguf