feat(crispasr): add CrispASR backend — multi-architecture ASR + TTS (#10099)

* feat(crispasr): backend source files (Go gRPC server, C-ABI shim, build files)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* polish(crispasr): brand error strings + fix stale shim comment

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* build(crispasr): register backend in root Makefile

Mirror the whisper Go backend registration for the new crispasr
backend: NOTPARALLEL entry, prepare-test-extra/test-extra hooks,
BACKEND_CRISPASR definition, docker-build target generation, and the
docker-build-backends aggregate target.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): add backend build matrix entries

Mirror the 11 whisper golang Dockerfile matrix entries (CPU amd64/arm64,
CUDA 12/13, L4T CUDA 13, Intel SYCL f32/f16, Vulkan amd64/arm64, L4T
arm64, ROCm hipblas) with backend and tag-suffix substituted to crispasr.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add crispasr backend gallery entries

Add the crispasr meta anchor and its full set of image gallery entries
(cpu, metal, cuda12/13, rocm, intel-sycl f32/f16, vulkan, L4T arm64,
L4T cuda13 arm64, plus -development variants), mirroring the whisper
backend gallery block.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): bump CRISPASR_VERSION via bump_deps workflow

Track CrispStrobe/CrispASR main branch and bump CRISPASR_VERSION in
backend/go/crispasr/Makefile.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* build(crispasr): don't wire fixture-gated test into test-extra

Mirror the whisper Go backend: its AudioTranscription test is gated on
model/audio fixtures and skips in CI, so building crispasr (the heaviest
ggml compile in the tree) inside the unit-test lane adds a long compile
for zero coverage. The backend image build in backend-matrix.yml remains
the authoritative compile check.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): add darwin metal build entry (mirror whisper)

The metal-crispasr gallery entries and capabilities.metal mapping
reference -metal-darwin-arm64-crispasr, which is only produced by an
includeDarwin entry. Mirror whisper's darwin metal entry so the tag
actually gets built.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): place hipblas matrix entry next to whisper twin

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): register crispasr as pref-only ASR backend + test

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): port whisper behavioral suite (cancellation + streaming)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): fix skip message env var names to CRISPASR_*

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): switch shim to crispasr_session_* multi-architecture API

The shim used whisper_full(), which in CrispASR is the whisper-only path:
libcrispasr only transcribes Whisper GGUFs through it. Multi-architecture
transcription (Parakeet, Voxtral, Qwen3-ASR, Canary, Granite, FunASR,
Paraformer, SenseVoice, ...) goes through the crispasr_session_* C-ABI,
which auto-detects the architecture from the GGUF and dispatches to the
matching backend.

Rewrite the C shim around crispasr_session_open / _transcribe_lang /
_result_* and add get_backend() so the selected backend is logged.
load_model now takes a threads param (session_open binds n_threads at
open). The session result is segment+word based with no token IDs and no
per-decode callback, so drop n_tokens / get_token_id /
get_segment_speaker_turn_next / set_new_segment_callback. set_abort is
kept for API parity but is best-effort: the session transcribe is blocking
with no abort hook.

Update the purego bindings and gocrispasr.go to match: tokens are left
empty, speaker-turn handling is removed, and AudioTranscriptionStream
emits one delta per non-empty segment after the blocking decode returns
(no progressive streaming via the session API), preserving the
concat(deltas) == final.Text invariant.
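
A minimal sketch of the per-segment emission described above, assuming
the final text is the plain concatenation of the segment texts (the real
backend may trim or join differently). segmentDeltas is a hypothetical
helper, not a function from gocrispasr.go.

```go
package main

import (
	"fmt"
	"strings"
)

// segmentDeltas returns one delta per non-empty segment, in order, plus
// the final text built from the same pieces. Empty segments are skipped
// so no empty delta is ever emitted.
func segmentDeltas(segments []string) (deltas []string, final string) {
	for _, seg := range segments {
		if seg == "" {
			continue
		}
		deltas = append(deltas, seg)
	}
	return deltas, strings.Join(deltas, "")
}

func main() {
	deltas, final := segmentDeltas([]string{" Hello", "", " world."})
	// The invariant from the commit message: concat(deltas) == final.
	fmt.Println(len(deltas), strings.Join(deltas, "") == final) // prints: 2 true
}
```

Building the final text from the emitted deltas, rather than separately,
is what makes the invariant hold by construction in this sketch.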

crispasr_session_set_translate is exported by libcrispasr but not declared
in crispasr.h, so it is forward-declared in the shim alongside the
open/transcribe/result functions.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* build(crispasr): link full CrispASR backend set for multi-arch support

The shim's crispasr_session_* dispatch calls into the per-architecture
backend libs (parakeet, voxtral, qwen3_asr, canary, funasr, paraformer,
sensevoice, ...), which CrispASR builds as static archives. Linking only
crispasr + ggml dead-stripped every backend object from the final module
(nm backend-symbol count: 0), leaving a whisper-only .so.

Link the same backend set as crispasr-cli so the static archives are
pulled in. After this the module carries the backend symbols (nm count
407, .so grows from ~2.1MB to ~6.7MB) and the session API can dispatch to
every compiled-in architecture.

Also rewrite ${CMAKE_SOURCE_DIR}/examples/talk-llama to
${PROJECT_SOURCE_DIR}/... in the vendored src/CMakeLists.txt: CrispASR
locates its vendored llama.cpp via ${CMAKE_SOURCE_DIR}, which is wrong when
CrispASR is add_subdirectory'd (CMAKE_SOURCE_DIR points at this backend
dir, not the CrispASR root). PROJECT_SOURCE_DIR is correct both standalone
and as a subproject; the sed is idempotent.
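
The idempotence claim can be illustrated with a small Go sketch.
rewritePath is a hypothetical stand-in for the sed expression, and the
sample CMake line is invented for illustration. Unlike sed without the g
flag, strings.ReplaceAll rewrites every occurrence on a line; that
difference does not affect idempotence.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	oldRef = "${CMAKE_SOURCE_DIR}/examples/talk-llama"
	newRef = "${PROJECT_SOURCE_DIR}/examples/talk-llama"
)

// rewritePath replaces the top-level-only reference with the
// project-relative one. Because newRef does not contain oldRef, a second
// pass finds nothing left to replace.
func rewritePath(cmake string) string {
	return strings.ReplaceAll(cmake, oldRef, newRef)
}

func main() {
	// Hypothetical line; the real src/CMakeLists.txt content is not shown here.
	line := "add_subdirectory(${CMAKE_SOURCE_DIR}/examples/talk-llama)"
	once := rewritePath(line)
	twice := rewritePath(once)
	fmt.Println(once == twice, once != line) // prints: true true
}
```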

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): adapt suite to session API (blocking, no decode callback)

Register the new symbol set (drop the removed token/speaker/callback funcs,
add get_backend; load_model now takes 2 args). The session transcribe is
blocking with no abort hook, so a mid-decode cancel can't interrupt it:
change the cancellation spec to cancel the context before the call and
assert codes.Canceled from the pre-call ctx.Err() check, dropping the
<5s mid-decode timing assertion. The streaming spec still holds with
per-segment post-decode emission (>=2 deltas, concat(deltas) == final.Text).
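
A sketch of the pre-call check the adapted cancellation spec relies on,
using only the standard library. transcribe, blockingDecode and
canceledContext are hypothetical names; the real backend reports the
result as gRPC codes.Canceled, which this sketch replaces with the plain
context error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// transcribe checks for cancellation once, before the blocking decode
// starts. blockingDecode stands in for the session transcribe call, which
// cannot be interrupted once it is running.
func transcribe(ctx context.Context, blockingDecode func() string) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", err
	}
	return blockingDecode(), nil
}

// canceledContext returns a context that is already canceled, matching
// the spec's "cancel before the call" setup.
func canceledContext() context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return ctx
}

func main() {
	called := false
	_, err := transcribe(canceledContext(), func() string {
		called = true
		return "text"
	})
	// The decode must not run, and the error must be the cancellation.
	fmt.Println(errors.Is(err, context.Canceled), called) // prints: true false
}
```

A cancel that arrives after the check has passed is not observed until
the decode returns, which is why the timing assertion was dropped.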

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add CrispASR ASR model entries (-crispasr)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(gallery): keep only session-auto-detectable CrispASR ASR models

The crispasr backend loads models via crispasr_session_open, which
auto-detects the backend from the GGUF general.architecture using
crispasr_detect_backend_from_gguf. Architectures not in that detect
map cannot be opened, so those gallery entries fail to load.

Removed entries whose architecture is not wired into CrispASR
v0.6.11's session auto-detect router (they can be re-added when
upstream maps them):

- Not in the detect map: data2vec, firered-asr, funasr,
  fun-asr-mlt-nano, glm-asr, hubert, kyutai-stt, mega-asr, mimo-asr,
  moonshine{,-de,-streaming,-tiny-de}, omniasr{,-llm,-llm-1b},
  paraformer, sensevoice.
- Pending verification (filename-heuristic routed, not arch-detected):
  parakeet-ctc-0.6b, parakeet-ctc-1.1b. Their GGUFs are routed to the
  fastconformer-ctc backend by a filename heuristic in the model
  registry, which implies general.architecture is not a mapped string.

Kept the parakeet rnnt/tdt_ctc variants: convert-parakeet-to-gguf.py
writes general.architecture="parakeet" unconditionally and encodes the
rnnt/ctc distinction in metadata fields, so they session-auto-detect.
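
The pruning rule amounts to a set-membership filter on
general.architecture. The sketch below is illustrative only:
loadableEntries and the entry names are hypothetical, and the detectable
set lists just the one architecture this message confirms ("parakeet"),
not the full upstream detect map.

```go
package main

import (
	"fmt"
	"sort"
)

// loadableEntries returns the names of gallery entries whose
// general.architecture value appears in the detectable set, sorted for
// stable output.
func loadableEntries(entries map[string]string, detectable map[string]bool) []string {
	var keep []string
	for name, arch := range entries {
		if detectable[arch] {
			keep = append(keep, name)
		}
	}
	sort.Strings(keep)
	return keep
}

func main() {
	detectable := map[string]bool{"parakeet": true} // partial, for illustration
	entries := map[string]string{
		"parakeet-tdt-example": "parakeet",   // hypothetical entry name
		"sensevoice-example":   "sensevoice", // hypothetical entry name
	}
	fmt.Println(loadableEntries(entries, detectable)) // prints: [parakeet-tdt-example]
}
```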

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): TTS synthesis via crispasr_session_synthesize (24kHz)

Add tts_synthesize/tts_free/tts_set_voice to the C-ABI shim. They reuse
the already-open g_session (crispasr_session_open auto-detects a TTS
model) and dispatch to the upstream synthesis call, which returns
malloc'd 24 kHz mono float PCM. Orpheus needs a SNAC codec path that we
do not set, so it returns NULL here and surfaces as an error Go-side.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): implement TTS/TTSStream gRPC methods

Bind the new shim functions via purego and implement TTS, TTSStream and
a writeWAV24k helper. synthesize copies the C-owned PCM out before
freeing it; TTS writes a 24 kHz mono 16-bit WAV to req.Dst via
go-audio/wav. CrispASR has no progressive synth, so TTSStream
synthesizes fully, encodes to WAV, and emits the bytes as a single
chunk; it owns the results-channel close (the gRPC server wrapper ranges
until close), mirroring vibevoice-cpp's TTSStream.
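
For reference, a stdlib-only sketch of the 24 kHz mono 16-bit WAV layout.
The backend itself uses go-audio/wav; encodeWAV24k is a hypothetical
helper that only shows the float-to-int16 conversion and the canonical
44-byte header, assuming input samples in [-1, 1].

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const sampleRate = 24000 // 24 kHz mono, per the commit message

// encodeWAV24k converts float PCM to 16-bit little-endian samples and
// prepends a canonical 44-byte RIFF/WAVE header.
func encodeWAV24k(pcm []float32) []byte {
	dataLen := len(pcm) * 2
	out := make([]byte, 44+dataLen)
	le := binary.LittleEndian

	copy(out[0:], "RIFF")
	le.PutUint32(out[4:], uint32(36+dataLen)) // file size minus 8
	copy(out[8:], "WAVEfmt ")
	le.PutUint32(out[16:], 16)           // fmt chunk size
	le.PutUint16(out[20:], 1)            // audio format: PCM
	le.PutUint16(out[22:], 1)            // channels: mono
	le.PutUint32(out[24:], sampleRate)   // sample rate
	le.PutUint32(out[28:], sampleRate*2) // byte rate = rate * block align
	le.PutUint16(out[32:], 2)            // block align: 1 channel * 2 bytes
	le.PutUint16(out[34:], 16)           // bits per sample
	copy(out[36:], "data")
	le.PutUint32(out[40:], uint32(dataLen))

	for i, s := range pcm {
		// Clamp so out-of-range floats do not wrap around.
		if s > 1 {
			s = 1
		} else if s < -1 {
			s = -1
		}
		le.PutUint16(out[44+2*i:], uint16(int16(s*32767)))
	}
	return out
}

func main() {
	wav := encodeWAV24k([]float32{0, 0.5, -1})
	fmt.Println(len(wav), string(wav[:4])) // prints: 50 RIFF
}
```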

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): log when a TTS voice override is not honored

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add CrispASR vibevoice-tts model entry

Only vibevoice-tts works through the current shim: qwen3-tts, chatterbox,
and orpheus require companion codec/s3gen/SNAC paths (set_codec_path /
set_s3gen_path) that the shim doesn't wire yet, and kokoro/indextts/voxcpm2
aren't in the session auto-detect map. Those are follow-ups.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): gated TTS synthesis spec

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(crispasr): satisfy golangci-lint (errcheck defers + unsafeptr nolint)

The crispasr Go file is entirely new, so new-from-merge-base lints every
line (unlike the grandfathered whisper backend it was forked from):
- handle os.RemoveAll / fh.Close return values in AudioTranscription
- annotate the two intentional C-pointer unsafe.Slice sites with //nolint:govet

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): backend: and codec: model options (explicit arch + companion files)

Add two model-config options to the CrispASR backend via opts.Options:

- backend:<name> selects an explicit CrispASR backend (bypassing
  auto-detect) by routing load_model through
  crispasr_session_open_explicit, unlocking architectures the
  detector won't pick on its own (qwen3, cohere, granite, voxtral,
  moonshine, mimo-asr, orpheus, kokoro, chatterbox, etc.).
- codec:<path> loads a companion file (qwen3-tts codec, orpheus SNAC,
  chatterbox s3gen, or mimo-asr tokenizer) via the universal
  crispasr_session_set_codec_path setter after the session opens. A
  relative path resolves against the model directory. rc==0 means
  success or not-applicable; only a negative rc is fatal.

The C shim load_model gains a backend_name argument and a new
set_codec_path entry point; the Go bridge parses the prefix:value
options and registers the new symbol. The vad_only path is unchanged.
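
A sketch of how the prefix:value options might be parsed, under stated
assumptions: parseOptions, modelOptions and codecError are hypothetical
names, "codec.gguf" is an invented file name, and the real Go bridge may
handle unknown or malformed options differently.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

type modelOptions struct {
	Backend string // explicit backend name; empty means auto-detect
	Codec   string // companion file path; empty means none
}

// parseOptions splits each option at the first ':' and resolves a
// relative codec path against the model directory.
func parseOptions(opts []string, modelDir string) modelOptions {
	var out modelOptions
	for _, o := range opts {
		key, value, ok := strings.Cut(o, ":")
		if !ok || value == "" {
			continue // not a prefix:value option
		}
		switch key {
		case "backend":
			out.Backend = value
		case "codec":
			if !filepath.IsAbs(value) {
				value = filepath.Join(modelDir, value)
			}
			out.Codec = value
		}
	}
	return out
}

// codecError treats only a negative rc as fatal; rc == 0 covers both
// success and not-applicable, as the commit message describes.
func codecError(rc int) error {
	if rc < 0 {
		return fmt.Errorf("set_codec_path failed: rc=%d", rc)
	}
	return nil
}

func main() {
	opts := parseOptions([]string{"backend:qwen3-tts", "codec:codec.gguf"}, "/models")
	fmt.Println(opts.Backend, opts.Codec, codecError(0), codecError(-1))
}
```

Splitting at the first ':' only keeps values that themselves contain a
colon intact.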

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): expand CrispASR models via backend:/codec: options (explicit arch + companions)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(gallery): use virtual.yaml base for crispasr models

The crispasr entries are just backend + model + a couple options, fully
expressed inline via overrides:/files: in gallery/index.yaml. Point each
url: at the shared gallery/virtual.yaml (the established 'virtual' model
trick) and drop the 36 redundant per-model gallery/*-crispasr.yaml files.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(gallery): drop voice-requiring TTS entries (keep vibevoice-tts)

End-to-end testing showed that qwen3-tts, orpheus, and chatterbox do not
synthesize through the current shim. The codec: companion loads fine, but
these engines also need a voice pack, voice prompt, or reference clip that
the backend doesn't wire: qwen3-tts base errors with 'no voice', chatterbox
is zero-shot cloning, and orpheus uses named voices. qwen3-tts also can't
be auto-detected: its GGUF arch is 'qwen3tts', which the detector does not
map, so it would need backend:qwen3-tts. The entries are removed to avoid
shipping non-working gallery entries; vibevoice-tts (built-in voice,
e2e-verified) remains the working TTS. Voice-pack wiring is a follow-up.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): speaker: and voice: TTS options (baked speakers + voice packs/prompts)

- speaker:<name> calls crispasr_session_set_speaker_name to select a baked
  speaker (qwen3-tts CustomVoice, orpheus).
- voice:<path>, optionally with voice_text:<ref>, calls
  crispasr_session_set_voice with either a voice-pack GGUF or a WAV for
  zero-shot cloning plus its reference text.

Both are applied at Load as the default voice; req.Voice still overrides
the speaker per request.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): re-add e2e-verified TTS engines (chatterbox, qwen3-tts-customvoice, orpheus)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
LocalAI [bot]
2026-05-31 12:11:03 +02:00
committed by GitHub
parent baa11133f1
commit 76fe0bb929
17 changed files with 2462 additions and 2 deletions


@@ -716,6 +716,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-crispasr'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -1569,6 +1582,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-crispasr'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1595,6 +1621,19 @@ include:
backend: "whisper"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-crispasr'
base-image: "ubuntu:24.04"
ubuntu-version: '2404'
runs-on: 'ubuntu-24.04-arm'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -2889,6 +2928,20 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-crispasr'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
@@ -2903,6 +2956,20 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-crispasr'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2916,6 +2983,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f32-crispasr'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2929,6 +3009,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f16-crispasr'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2943,6 +3036,20 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-crispasr'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2957,6 +3064,20 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-crispasr'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
@@ -2970,6 +3091,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64-crispasr'
base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
runs-on: 'ubuntu-24.04-arm'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2983,6 +3117,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-crispasr'
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
runs-on: 'ubuntu-latest'
skip-drivers: 'false'
backend: "crispasr"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# parakeet-cpp
- build-type: ''
cuda-major-version: ""
@@ -4124,6 +4271,10 @@ includeDarwin:
tag-suffix: "-metal-darwin-arm64-whisper"
build-type: "metal"
lang: "go"
- backend: "crispasr"
tag-suffix: "-metal-darwin-arm64-crispasr"
build-type: "metal"
lang: "go"
- backend: "parakeet-cpp"
tag-suffix: "-metal-darwin-arm64-parakeet-cpp"
build-type: "metal"


@@ -30,6 +30,10 @@ jobs:
variable: "WHISPER_CPP_VERSION"
branch: "master"
file: "backend/go/whisper/Makefile"
- repository: "CrispStrobe/CrispASR"
variable: "CRISPASR_VERSION"
branch: "main"
file: "backend/go/crispasr/Makefile"
- repository: "mudler/parakeet.cpp"
variable: "PARAKEET_VERSION"
branch: "master"


@@ -1,5 +1,5 @@
# Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
GOCMD=go
GOTEST=$(GOCMD) test
@@ -1162,6 +1162,7 @@ BACKEND_HUGGINGFACE = huggingface|golang|.|false|true
BACKEND_SILERO_VAD = silero-vad|golang|.|false|true
BACKEND_STABLEDIFFUSION_GGML = stablediffusion-ggml|golang|.|--progress=plain|true
BACKEND_WHISPER = whisper|golang|.|false|true
BACKEND_CRISPASR = crispasr|golang|.|false|true
BACKEND_PARAKEET_CPP = parakeet-cpp|golang|.|false|true
BACKEND_VOXTRAL = voxtral|golang|.|false|true
BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
@@ -1250,6 +1251,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
$(eval $(call generate-docker-build-target,$(BACKEND_SILERO_VAD)))
$(eval $(call generate-docker-build-target,$(BACKEND_STABLEDIFFUSION_GGML)))
$(eval $(call generate-docker-build-target,$(BACKEND_WHISPER)))
$(eval $(call generate-docker-build-target,$(BACKEND_CRISPASR)))
$(eval $(call generate-docker-build-target,$(BACKEND_PARAKEET_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_VOXTRAL)))
$(eval $(call generate-docker-build-target,$(BACKEND_OPUS)))
@@ -1300,7 +1302,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
docker-save-%: backend-images
docker save local-ai-backend:$* -o backend-images/$*.tar
-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy
########################################################
### Mock Backend for E2E Tests

backend/go/crispasr/.gitignore (new file, 5 lines)

@@ -0,0 +1,5 @@
sources
build*
libgocrispasr*.so
crispasr
package


@@ -0,0 +1,30 @@
cmake_minimum_required(VERSION 3.12)
project(gocrispasr LANGUAGES C CXX)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
add_subdirectory(./sources/CrispASR)
add_library(gocrispasr MODULE cpp/crispasr_shim.cpp)
target_include_directories(gocrispasr PRIVATE
${CMAKE_CURRENT_SOURCE_DIR}/sources/CrispASR/include
${CMAKE_CURRENT_SOURCE_DIR}/sources/CrispASR/ggml/include)
# Link the same backend set as crispasr-cli (examples/cli/CMakeLists.txt) so
# the session API can dispatch to every compiled-in architecture, not just
# whisper. crispasr is the referencer; the backend static libs supply the
# per-architecture symbols; ggml is the math/runtime base.
target_link_libraries(gocrispasr PRIVATE
crispasr
parakeet canary canary_ctc cohere granite_speech granite_nle
voxtral voxtral4b qwen3_asr qwen3_tts orpheus chatterbox indextts
kokoro voxcpm2_tts m2m100 t5_translate wav2vec2-ggml vibevoice
silero-lid pyannote-seg funasr paraformer sensevoice
crisp_audio
ggml)
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)
target_link_libraries(gocrispasr PRIVATE stdc++fs)
endif()
set_property(TARGET gocrispasr PROPERTY CXX_STANDARD 17)
set_target_properties(gocrispasr PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})


@@ -0,0 +1,132 @@
CMAKE_ARGS?=
BUILD_TYPE?=
NATIVE?=false
GOCMD?=go
GO_TAGS?=
JOBS?=$(shell nproc --ignore=1)
# CrispASR version (release tag)
CRISPASR_REPO?=https://github.com/CrispStrobe/CrispASR
CRISPASR_VERSION?=v0.6.11
SO_TARGET?=libgocrispasr.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
# Keep the build lean: no tests/examples/server/SDL2/curl/ffmpeg (the FROM scratch
# image cannot satisfy those runtime deps). All ASR/TTS model backends stay enabled.
CMAKE_ARGS+=-DCRISPASR_BUILD_TESTS=OFF -DCRISPASR_BUILD_EXAMPLES=OFF -DCRISPASR_BUILD_SERVER=OFF
CMAKE_ARGS+=-DCRISPASR_SDL2=OFF -DCRISPASR_CURL=OFF -DCRISPASR_FFMPEG=OFF
ifeq ($(NATIVE),false)
CMAKE_ARGS+=-DGGML_NATIVE=OFF
endif
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS+=-DGGML_CUDA=ON
else ifeq ($(BUILD_TYPE),openblas)
CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
else ifeq ($(BUILD_TYPE),clblas)
CMAKE_ARGS+=-DGGML_CLBLAST=ON -DCLBlast_DIR=/some/path
else ifeq ($(BUILD_TYPE),hipblas)
CMAKE_ARGS+=-DGGML_HIPBLAS=ON
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=ON
else ifeq ($(OS),Darwin)
ifneq ($(BUILD_TYPE),metal)
CMAKE_ARGS+=-DGGML_METAL=OFF
else
CMAKE_ARGS+=-DGGML_METAL=ON
CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
endif
endif
ifeq ($(BUILD_TYPE),sycl_f16)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx \
-DGGML_SYCL_F16=ON
endif
ifeq ($(BUILD_TYPE),sycl_f32)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx
endif
sources/CrispASR:
mkdir -p sources/CrispASR
cd sources/CrispASR && \
git init && \
git remote add origin $(CRISPASR_REPO) && \
git fetch origin && \
git checkout $(CRISPASR_VERSION) && \
git submodule update --init --recursive --depth 1 --single-branch
# CrispASR's src/CMakeLists.txt locates its vendored llama.cpp
# (crispasr-llama-core, used by the chat C-ABI) via ${CMAKE_SOURCE_DIR},
# which assumes CrispASR is the top-level CMake project. We add_subdirectory
# it, so ${CMAKE_SOURCE_DIR} is THIS backend dir and the talk-llama sources
# aren't found. Rewrite to ${PROJECT_SOURCE_DIR} (the crispasr project root),
# which is correct both standalone and as a subproject. Idempotent.
sed -i 's#\$${CMAKE_SOURCE_DIR}/examples/talk-llama#\$${PROJECT_SOURCE_DIR}/examples/talk-llama#' sources/CrispASR/src/CMakeLists.txt
# Detect OS
UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S),Linux)
VARIANT_TARGETS = libgocrispasr-avx.so libgocrispasr-avx2.so libgocrispasr-avx512.so libgocrispasr-fallback.so
else
VARIANT_TARGETS = libgocrispasr-fallback.so
endif
crispasr: main.go gocrispasr.go $(VARIANT_TARGETS)
CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o crispasr ./
package: crispasr
bash package.sh
build: package
clean: purge
rm -rf libgocrispasr*.so package sources/CrispASR crispasr
purge:
rm -rf build*
ifeq ($(UNAME_S),Linux)
libgocrispasr-avx.so: sources/CrispASR
$(MAKE) purge
$(info ${GREEN}I crispasr build info:avx${RESET})
SO_TARGET=libgocrispasr-avx.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgocrispasr-custom
rm -rfv build*
libgocrispasr-avx2.so: sources/CrispASR
$(MAKE) purge
$(info ${GREEN}I crispasr build info:avx2${RESET})
SO_TARGET=libgocrispasr-avx2.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgocrispasr-custom
rm -rfv build*
libgocrispasr-avx512.so: sources/CrispASR
$(MAKE) purge
$(info ${GREEN}I crispasr build info:avx512${RESET})
SO_TARGET=libgocrispasr-avx512.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgocrispasr-custom
rm -rfv build*
endif
libgocrispasr-fallback.so: sources/CrispASR
$(MAKE) purge
$(info ${GREEN}I crispasr build info:fallback${RESET})
SO_TARGET=libgocrispasr-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgocrispasr-custom
rm -rfv build*
libgocrispasr-custom: CMakeLists.txt cpp/crispasr_shim.cpp cpp/crispasr_shim.h
mkdir -p build-$(SO_TARGET) && \
cd build-$(SO_TARGET) && \
cmake .. $(CMAKE_ARGS) && \
cmake --build . --config Release -j$(JOBS) && \
cd .. && \
mv build-$(SO_TARGET)/libgocrispasr.so ./$(SO_TARGET)
test: crispasr
CGO_ENABLED=0 $(GOCMD) test -v ./...
all: crispasr package


@@ -0,0 +1,253 @@
#include "crispasr_shim.h"
#include "ggml-backend.h"
#include "crispasr.h"
#include <atomic>
#include <vector>
// Opaque session types. crispasr.h declares `struct crispasr_session;` but not
// the result type nor the open/transcribe/result accessors — those are
// CA_EXPORT extern "C" symbols in src/crispasr_c_api.cpp, so we forward-declare
// exactly the ones we use. Signatures verified against
// sources/CrispASR/src/crispasr_c_api.cpp.
struct crispasr_session_result;
extern "C" {
crispasr_session *crispasr_session_open(const char *model_path, int n_threads);
crispasr_session *crispasr_session_open_explicit(const char *model_path,
const char *backend_name,
int n_threads);
int crispasr_session_set_codec_path(crispasr_session *s, const char *path);
void crispasr_session_close(crispasr_session *s);
const char *crispasr_session_backend(crispasr_session *s);
int crispasr_session_set_translate(crispasr_session *s, int enable);
crispasr_session_result *crispasr_session_transcribe_lang(
crispasr_session *s, const float *pcm, int n_samples, const char *language);
int crispasr_session_result_n_segments(crispasr_session_result *r);
const char *crispasr_session_result_segment_text(crispasr_session_result *r,
int i);
int64_t crispasr_session_result_segment_t0(crispasr_session_result *r, int i);
int64_t crispasr_session_result_segment_t1(crispasr_session_result *r, int i);
void crispasr_session_result_free(crispasr_session_result *r);
float *crispasr_session_synthesize(crispasr_session *s, const char *text,
int *out_n_samples);
void crispasr_pcm_free(float *pcm);
int crispasr_session_set_speaker_name(crispasr_session *s, const char *name);
int crispasr_session_set_voice(crispasr_session *s, const char *path,
const char *ref_text_or_null);
}
static crispasr_session *g_session = nullptr;
static crispasr_session_result *g_result = nullptr;
static struct whisper_vad_context *vctx;
static std::vector<float> flat_segs;
static std::atomic<int> g_abort{0};
extern "C" void set_abort(int v) {
g_abort.store(v, std::memory_order_relaxed);
}
static void ggml_log_cb(enum ggml_log_level level, const char *log,
void *data) {
const char *level_str;
if (!log) {
return;
}
switch (level) {
case GGML_LOG_LEVEL_DEBUG:
level_str = "DEBUG";
break;
case GGML_LOG_LEVEL_INFO:
level_str = "INFO";
break;
case GGML_LOG_LEVEL_WARN:
level_str = "WARN";
break;
case GGML_LOG_LEVEL_ERROR:
level_str = "ERROR";
break;
default: /* Potential future-proofing */
level_str = "?????";
break;
}
fprintf(stderr, "[%-5s] ", level_str);
fputs(log, stderr);
fflush(stderr);
}
int load_model(const char *const model_path, int threads,
const char *backend_name) {
whisper_log_set(ggml_log_cb, nullptr);
ggml_backend_load_all();
if (backend_name && *backend_name) {
g_session =
crispasr_session_open_explicit(model_path, backend_name, threads);
} else {
g_session = crispasr_session_open(model_path, threads);
}
if (g_session == nullptr) {
fprintf(stderr, "error: failed to open CrispASR session for model\n");
return 1;
}
fprintf(stderr, "info: CrispASR backend selected: %s\n",
crispasr_session_backend(g_session));
return 0;
}
// set_codec_path forwards a companion file (qwen3-tts codec, orpheus SNAC,
// chatterbox s3gen, or mimo-asr tokenizer) to the active session. Returns 0 on
// success or when the active backend needs no companion, negative on failure,
// and -1 when no session is open.
int set_codec_path(const char *path) {
return g_session ? crispasr_session_set_codec_path(g_session, path) : -1;
}
int load_model_vad(const char *const model_path) {
whisper_log_set(ggml_log_cb, nullptr);
ggml_backend_load_all();
struct whisper_vad_context_params vcparams =
whisper_vad_default_context_params();
// XXX: Overridden to false in upstream due to performance?
// vcparams.use_gpu = true;
vctx = whisper_vad_init_from_file_with_params(model_path, vcparams);
if (vctx == nullptr) {
fprintf(stderr, "error: Failed to init model as VAD\n");
return 1;
}
return 0;
}
int vad(float pcmf32[], size_t pcmf32_len, float **segs_out,
size_t *segs_out_len) {
if (!whisper_vad_detect_speech(vctx, pcmf32, pcmf32_len)) {
fprintf(stderr, "error: failed to detect speech\n");
return 1;
}
struct whisper_vad_params params = whisper_vad_default_params();
struct whisper_vad_segments *segs =
whisper_vad_segments_from_probs(vctx, params);
size_t segn = whisper_vad_segments_n_segments(segs);
// fprintf(stderr, "Got segments %zd\n", segn);
flat_segs.clear();
for (size_t i = 0; i < segn; i++) {
flat_segs.push_back(whisper_vad_segments_get_segment_t0(segs, i));
flat_segs.push_back(whisper_vad_segments_get_segment_t1(segs, i));
}
// fprintf(stderr, "setting out variables: %p=%p -> %p, %p=%zx -> %zx\n",
// segs_out, *segs_out, flat_segs.data(), segs_out_len, *segs_out_len,
// flat_segs.size());
*segs_out = flat_segs.data();
*segs_out_len = flat_segs.size();
// fprintf(stderr, "freeing segs\n");
whisper_vad_free_segments(segs);
// fprintf(stderr, "returning\n");
return 0;
}
// threads, diarize and prompt are accepted for Go-side API parity but unused
// in Phase 1: the thread count is fixed at session open, and diarization and
// the initial prompt are separate CrispASR features not yet wired through the
// session ASR path.
int transcribe(uint32_t threads, char *lang, bool translate, bool diarize,
float pcmf32[], size_t pcmf32_len, size_t *segs_out_len,
char *prompt) {
(void)threads;
(void)diarize;
(void)prompt;
if (!g_session) {
return 1;
}
// Reset stale abort flag from any prior cancelled call. set_abort remains
// best-effort: the session transcribe call is blocking and exposes no abort
// hook, so a mid-decode abort cannot interrupt it.
g_abort.store(0, std::memory_order_relaxed);
crispasr_session_set_translate(g_session, translate ? 1 : 0);
if (g_result) {
crispasr_session_result_free(g_result);
g_result = nullptr;
}
const char *language = (lang && *lang) ? lang : nullptr;
g_result = crispasr_session_transcribe_lang(g_session, pcmf32, (int)pcmf32_len,
language);
if (!g_result) {
fprintf(stderr, "error: transcription failed\n");
return 1;
}
*segs_out_len = crispasr_session_result_n_segments(g_result);
return 0;
}
const char *get_segment_text(int i) {
if (!g_result) {
return "";
}
return crispasr_session_result_segment_text(g_result, i);
}
int64_t get_segment_t0(int i) {
if (!g_result) {
return 0;
}
return crispasr_session_result_segment_t0(g_result, i);
}
int64_t get_segment_t1(int i) {
if (!g_result) {
return 0;
}
return crispasr_session_result_segment_t1(g_result, i);
}
const char *get_backend(void) {
return g_session ? crispasr_session_backend(g_session) : "";
}
// TTS uses the already-open session (crispasr_session_open auto-detects a TTS
// model). Output is 24 kHz mono float PCM (upstream CrispASR convention),
// malloc'd by the C API; the caller must release it via tts_free.
float *tts_synthesize(const char *text, int *out_n_samples) {
if (out_n_samples) *out_n_samples = 0;
if (!g_session || !text) return nullptr;
return crispasr_session_synthesize(g_session, text, out_n_samples);
}
void tts_free(float *pcm) {
if (pcm) crispasr_pcm_free(pcm);
}
int tts_set_voice(const char *name) {
if (!g_session || !name || !*name) return 0;
return crispasr_session_set_speaker_name(g_session, name);
}
// tts_set_voice_file loads a voice from a file: a .gguf path selects a voice
// pack, a .wav path with a non-empty ref_text performs zero-shot voice cloning
// (the C API returns -2 when ref_text is required but missing). Returns -1 when
// no session is open or path is null.
int tts_set_voice_file(const char *path, const char *ref_text) {
if (!g_session || !path) return -1;
const char *ref = (ref_text && *ref_text) ? ref_text : nullptr;
return crispasr_session_set_voice(g_session, path, ref);
}
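
Note on the `vad` output layout: the shim hands back one flat float array of alternating start/end values, and the Go `VAD` method in this change divides each value by 100 before filling the response. The sketch below shows that unpacking in isolation. It assumes the values are centiseconds (which the division by 100 suggests but the shim does not state), and `Segment` and `pairsToSegments` are hypothetical names, not part of the backend.

```go
package main

import "fmt"

// Segment is a hypothetical stand-in for pb.VADSegment, with times in seconds.
type Segment struct {
	Start, End float32
}

// pairsToSegments converts a flat [t0, t1, t0, t1, ...] slice into segments.
// Assumption: inputs are centiseconds, so dividing by 100 yields seconds.
// A trailing unpaired value is ignored.
func pairsToSegments(flat []float32) []Segment {
	out := make([]Segment, 0, len(flat)/2)
	for i := 0; i+1 < len(flat); i += 2 {
		out = append(out, Segment{Start: flat[i] / 100, End: flat[i+1] / 100})
	}
	return out
}

func main() {
	// Two made-up speech regions: 0.5s-1.5s and 3.0s-4.25s.
	flat := []float32{50, 150, 300, 425}
	for _, s := range pairsToSegments(flat) {
		fmt.Printf("%.2f -> %.2f\n", s.Start, s.End)
	}
}
```

Because the flat buffer is owned by the shim (`flat_segs` is a static vector that the next `vad` call clears), a caller along these lines should copy the values out before invoking `vad` again.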


@@ -0,0 +1,23 @@
#include <cstddef>
#include <cstdint>
extern "C" {
int load_model(const char *const model_path, int threads,
const char *backend_name);
int set_codec_path(const char *path);
int load_model_vad(const char *const model_path);
int vad(float pcmf32[], size_t pcmf32_size, float **segs_out,
size_t *segs_out_len);
int transcribe(uint32_t threads, char *lang, bool translate, bool diarize,
float pcmf32[], size_t pcmf32_len, size_t *segs_out_len,
char *prompt);
const char *get_segment_text(int i);
int64_t get_segment_t0(int i);
int64_t get_segment_t1(int i);
const char *get_backend(void);
void set_abort(int v);
float *tts_synthesize(const char *text, int *out_n_samples); // 24kHz mono float, malloc'd; NULL on failure
void tts_free(float *pcm);
int tts_set_voice(const char *name); // best-effort speaker selection; 0 ok
int tts_set_voice_file(const char *path, const char *ref_text); // load voice pack (.gguf) or zero-shot clone (.wav + ref_text)
}
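
Note on the TTS sample format: `tts_synthesize` returns float PCM, while the Go side of this change writes a 16-bit WAV, clamping each sample to [-1, 1] and scaling by 32767. The following is a minimal standalone sketch of that conversion. `floatToPCM16` is a hypothetical helper; the backend itself builds an `[]int` for the go-audio encoder rather than an `[]int16`.

```go
package main

import "fmt"

// floatToPCM16 clamps each sample to [-1, 1] and scales it to the signed
// 16-bit range. The float-to-integer conversion truncates toward zero.
func floatToPCM16(pcm []float32) []int16 {
	out := make([]int16, len(pcm))
	for i, s := range pcm {
		if s > 1 {
			s = 1
		} else if s < -1 {
			s = -1
		}
		out[i] = int16(s * 32767)
	}
	return out
}

func main() {
	// Out-of-range inputs (2 and -3) are clamped rather than wrapped.
	fmt.Println(floatToPCM16([]float32{0, 1, -1, 2, -3, 0.5}))
}
```

Clamping before the cast matters: without it, a sample slightly above 1.0 would overflow the 16-bit range and show up as a loud click in the output.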


@@ -0,0 +1,497 @@
package main
import (
"context"
"fmt"
"os"
"path/filepath"
"strings"
"sync"
"unsafe"
"github.com/go-audio/audio"
"github.com/go-audio/wav"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/utils"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
)
var (
CppLoadModel func(modelPath string, threads int, backendName string) int
CppSetCodecPath func(path string) int
CppLoadModelVAD func(modelPath string) int
CppVAD func(pcmf32 []float32, pcmf32Size uintptr, segsOut unsafe.Pointer, segsOutLen unsafe.Pointer) int
CppTranscribe func(threads uint32, lang string, translate bool, diarize bool, pcmf32 []float32, pcmf32Len uintptr, segsOutLen unsafe.Pointer, prompt string) int
CppGetSegmentText func(i int) string
CppGetSegmentStart func(i int) int64
CppGetSegmentEnd func(i int) int64
CppGetBackend func() string
CppSetAbort func(v int)
CppTTSSynthesize func(text string, outNSamples unsafe.Pointer) uintptr
CppTTSFree func(ptr uintptr)
CppTTSSetVoice func(name string) int
CppTTSSetVoiceFile func(path string, refText string) int
)
type CrispASR struct {
base.SingleThread
}
// splitOption splits a "prefix:value" model option into its key and value,
// matching the convention used by other backends (see sherpa-onnx). It returns
// ok=false when the option carries no ':' separator.
func splitOption(oo string) (key, value string, ok bool) {
parts := strings.SplitN(oo, ":", 2)
if len(parts) != 2 {
return "", "", false
}
return parts[0], parts[1], true
}
func (w *CrispASR) Load(opts *pb.ModelOptions) error {
vadOnly := false
backendName := ""
codecPath := ""
speakerName := ""
voicePath := ""
voiceRefText := ""
for _, oo := range opts.Options {
if oo == "vad_only" {
vadOnly = true
continue
}
switch key, value, ok := splitOption(oo); {
case ok && key == "backend":
backendName = value
case ok && key == "codec":
codecPath = value
case ok && key == "speaker":
speakerName = value
case ok && key == "voice":
voicePath = value
case ok && key == "voice_text":
voiceRefText = value
default:
fmt.Fprintf(os.Stderr, "Unrecognized option: %v\n", oo)
}
}
if vadOnly {
if ret := CppLoadModelVAD(opts.ModelFile); ret != 0 {
return fmt.Errorf("Failed to load CrispASR VAD model")
}
return nil
}
// Resolve a relative companion path against the model directory so a config
// can reference a sibling codec/tokenizer file by name alone.
if codecPath != "" && !filepath.IsAbs(codecPath) {
codecPath = filepath.Join(filepath.Dir(opts.ModelFile), codecPath)
}
// A voice file (.gguf pack or .wav prompt) is resolved against the model
// directory just like the codec, so a config can reference a sibling file.
if voicePath != "" && !filepath.IsAbs(voicePath) {
voicePath = filepath.Join(filepath.Dir(opts.ModelFile), voicePath)
}
if ret := CppLoadModel(opts.ModelFile, int(opts.Threads), backendName); ret != 0 {
return fmt.Errorf("Failed to load CrispASR transcription model")
}
// Load the companion file (codec/tokenizer/s3gen) after the session is open.
// rc==0 means success or "not applicable" for the active backend; only a
// negative code is fatal.
if codecPath != "" {
if rc := CppSetCodecPath(codecPath); rc < 0 {
return fmt.Errorf("crispasr: failed to load companion file %q (rc=%d)", codecPath, rc)
}
fmt.Fprintf(os.Stderr, "CrispASR companion file loaded: %s\n", codecPath)
}
// Apply the Load-time default voice. A baked speaker (speaker:) is selected
// by name and is best-effort: a backend that can't honor it is logged, not
// fatal. A voice file (voice:) is a hard requirement once configured, so a
// negative rc fails Load.
if speakerName != "" {
if rc := CppTTSSetVoice(speakerName); rc != 0 {
fmt.Fprintf(os.Stderr, "crispasr: speaker %q not applied (rc=%d)\n", speakerName, rc)
}
}
if voicePath != "" {
if rc := CppTTSSetVoiceFile(voicePath, voiceRefText); rc < 0 {
return fmt.Errorf("crispasr: failed to load voice %q (rc=%d)", voicePath, rc)
}
fmt.Fprintf(os.Stderr, "CrispASR voice loaded: %s\n", voicePath)
}
fmt.Fprintf(os.Stderr, "CrispASR backend selected: %s\n", CppGetBackend())
return nil
}
func (w *CrispASR) VAD(req *pb.VADRequest) (pb.VADResponse, error) {
audio := req.Audio
// We expect 0xdeadbeef to be overwritten and if we see it in a stack trace we know it wasn't
segsPtr, segsLen := uintptr(0xdeadbeef), uintptr(0xdeadbeef)
segsPtrPtr, segsLenPtr := unsafe.Pointer(&segsPtr), unsafe.Pointer(&segsLen)
if ret := CppVAD(audio, uintptr(len(audio)), segsPtrPtr, segsLenPtr); ret != 0 {
return pb.VADResponse{}, fmt.Errorf("Failed VAD")
}
// Happens when CPP vector has not had any elements pushed to it
if segsPtr == 0 {
return pb.VADResponse{
Segments: []*pb.VADSegment{},
}, nil
}
// The unsafeptr warning arises because segsPtr lives on the stack and is therefore subject to stack copying, AFAICT.
// However, the stack shouldn't have grown between setting segsPtr and now, and the memory it points to is allocated by C++.
segs := unsafe.Slice((*float32)(unsafe.Pointer(segsPtr)), segsLen) //nolint:govet // segsPtr addresses C++-owned heap memory passed back through the cgo-free purego boundary; the uintptr->Pointer round-trip is intentional and the buffer outlives this read.
vadSegments := []*pb.VADSegment{}
for i := range len(segs) >> 1 {
s := segs[2*i] / 100
t := segs[2*i+1] / 100
vadSegments = append(vadSegments, &pb.VADSegment{
Start: s,
End: t,
})
}
return pb.VADResponse{
Segments: vadSegments,
}, nil
}
func (w *CrispASR) AudioTranscription(ctx context.Context, opts *pb.TranscriptRequest) (pb.TranscriptResult, error) {
if err := ctx.Err(); err != nil {
return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
}
dir, err := os.MkdirTemp("", "crispasr")
if err != nil {
return pb.TranscriptResult{}, err
}
defer func() { _ = os.RemoveAll(dir) }()
convertedPath := filepath.Join(dir, "converted.wav")
if err := utils.AudioToWav(opts.Dst, convertedPath); err != nil {
return pb.TranscriptResult{}, err
}
fh, err := os.Open(convertedPath)
if err != nil {
return pb.TranscriptResult{}, err
}
defer func() { _ = fh.Close() }()
d := wav.NewDecoder(fh)
buf, err := d.FullPCMBuffer()
if err != nil {
return pb.TranscriptResult{}, err
}
data := buf.AsFloat32Buffer().Data
var duration float32
if buf.Format != nil && buf.Format.SampleRate > 0 {
duration = float32(len(data)) / float32(buf.Format.SampleRate)
}
segsLen := uintptr(0xdeadbeef)
segsLenPtr := unsafe.Pointer(&segsLen)
// Watcher: flips the C-side abort flag when ctx is cancelled. The
// goroutine is joined synchronously (close(done) signals it to exit,
// wg.Wait() blocks until it has) so a late CppSetAbort(1) cannot fire
// after the function returns and corrupt the next transcription call.
done := make(chan struct{})
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
select {
case <-ctx.Done():
CppSetAbort(1)
case <-done:
}
}()
defer func() {
close(done)
wg.Wait()
}()
ret := CppTranscribe(opts.Threads, opts.Language, opts.Translate, opts.Diarize, data, uintptr(len(data)), segsLenPtr, opts.Prompt)
if ret == 2 {
return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
}
if ret != 0 {
return pb.TranscriptResult{}, fmt.Errorf("Failed Transcribe")
}
segments := []*pb.TranscriptSegment{}
text := ""
for i := range int(segsLen) {
// segment start/end conversion factor taken from https://github.com/ggml-org/whisper.cpp/blob/master/examples/cli/cli.cpp#L895
s := CppGetSegmentStart(i) * (10000000)
t := CppGetSegmentEnd(i) * (10000000)
// The session result can emit bytes that aren't valid UTF-8 (e.g. a
// multibyte codepoint split across token boundaries); protobuf string
// fields reject those at marshal time. Scrub before the value escapes
// cgo. The session result is segment+word based and exposes no token
// IDs, so Tokens is left empty.
txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "\uFFFD")
segment := &pb.TranscriptSegment{
Id: int32(i),
Text: txt,
Start: s, End: t,
}
segments = append(segments, segment)
text += " " + strings.TrimSpace(txt)
}
return pb.TranscriptResult{
Segments: segments,
Text: strings.TrimSpace(text),
Language: opts.Language,
Duration: duration,
}, nil
}
// AudioTranscriptionStream runs the session transcribe to completion and then
// emits one delta per non-empty segment, followed by a final TranscriptResult.
// Progressive/real-time streaming isn't available via the session API (there
// is no per-decode callback), so deltas are emitted per-segment after the
// blocking decode returns rather than as segments are produced. The offline
// AudioTranscription is unchanged; both paths share the session and the
// SingleThread concurrency model.
func (w *CrispASR) AudioTranscriptionStream(ctx context.Context, opts *pb.TranscriptRequest, results chan *pb.TranscriptStreamResponse) error {
defer close(results)
if err := ctx.Err(); err != nil {
return status.Error(codes.Canceled, "transcription cancelled")
}
dir, err := os.MkdirTemp("", "crispasr")
if err != nil {
return err
}
defer func() { _ = os.RemoveAll(dir) }()
convertedPath := filepath.Join(dir, "converted.wav")
if err := utils.AudioToWav(opts.Dst, convertedPath); err != nil {
return err
}
fh, err := os.Open(convertedPath)
if err != nil {
return err
}
defer func() { _ = fh.Close() }()
d := wav.NewDecoder(fh)
buf, err := d.FullPCMBuffer()
if err != nil {
return err
}
data := buf.AsFloat32Buffer().Data
var duration float32
if buf.Format != nil && buf.Format.SampleRate > 0 {
duration = float32(len(data)) / float32(buf.Format.SampleRate)
}
// Same abort-watcher pattern as AudioTranscription. Joined synchronously
// so a late CppSetAbort(1) cannot fire after this function returns.
// Best-effort only: the session transcribe is blocking with no abort hook.
done := make(chan struct{})
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
select {
case <-ctx.Done():
CppSetAbort(1)
case <-done:
}
}()
defer func() {
close(done)
wg.Wait()
}()
segsLen := uintptr(0xdeadbeef)
segsLenPtr := unsafe.Pointer(&segsLen)
ret := CppTranscribe(opts.Threads, opts.Language, opts.Translate, opts.Diarize, data, uintptr(len(data)), segsLenPtr, opts.Prompt)
if ret == 2 {
return status.Error(codes.Canceled, "transcription cancelled")
}
if ret != 0 {
return fmt.Errorf("Failed Transcribe")
}
// Walk the segments once: emit a delta per non-empty segment and build the
// final TranscriptResult.Segments alongside. The first delta has no leading
// space and subsequent ones are prefixed with a single space, so
// concat(deltas) == final.Text exactly, matching the e2e contract.
segments := []*pb.TranscriptSegment{}
var assembled strings.Builder
for i := range int(segsLen) {
s := CppGetSegmentStart(i) * 10000000
t := CppGetSegmentEnd(i) * 10000000
txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "\uFFFD")
segments = append(segments, &pb.TranscriptSegment{
Id: int32(i),
Text: txt,
Start: s, End: t,
})
trimmed := strings.TrimSpace(txt)
if trimmed == "" {
continue
}
var delta string
if assembled.Len() == 0 {
delta = trimmed
} else {
delta = " " + trimmed
}
results <- &pb.TranscriptStreamResponse{Delta: delta}
assembled.WriteString(delta)
}
final := &pb.TranscriptResult{
Segments: segments,
Text: assembled.String(),
Language: opts.Language,
Duration: duration,
}
results <- &pb.TranscriptStreamResponse{FinalResult: final}
return nil
}
// synthesize returns 24 kHz mono float32 PCM for text via the open session.
func (w *CrispASR) synthesize(text string) ([]float32, error) {
if text == "" {
return nil, fmt.Errorf("crispasr: TTS requires non-empty text")
}
var n int32
ptr := CppTTSSynthesize(text, unsafe.Pointer(&n))
if ptr == 0 || n <= 0 {
return nil, fmt.Errorf("crispasr: synthesis failed (the loaded model may not be a supported TTS backend, or needs extra config e.g. orpheus SNAC codec)")
}
defer CppTTSFree(ptr)
src := unsafe.Slice((*float32)(unsafe.Pointer(ptr)), int(n)) //nolint:govet // ptr addresses C-allocated PCM returned across the purego boundary; copied out immediately below, before tts_free.
out := make([]float32, int(n)) // copy out of C memory before free
copy(out, src)
return out, nil
}
// setVoice applies a per-call speaker/voice override (best effort). CrispASR
// returns a negative code when the active backend can't honor the name; we log
// it rather than fail, so an unknown voice falls back to the default speaker.
func setVoice(voice string) {
v := strings.TrimSpace(voice)
if v == "" {
return
}
if rc := CppTTSSetVoice(v); rc != 0 {
fmt.Fprintf(os.Stderr, "crispasr: voice %q not applied by the active TTS backend (rc=%d); using default\n", v, rc)
}
}
func (w *CrispASR) TTS(req *pb.TTSRequest) error {
if req.Dst == "" {
return fmt.Errorf("crispasr: TTS requires a destination path")
}
setVoice(req.Voice)
pcm, err := w.synthesize(req.Text)
if err != nil {
return err
}
return writeWAV24k(req.Dst, pcm)
}
// TTSStream is the streaming counterpart to TTS. CrispASR has no progressive
// (native streaming) synth, so we synthesize the whole utterance, encode it to
// a 24 kHz WAV, and emit the encoded bytes as a single chunk. The gRPC server
// wrapper (pkg/grpc/server.go:TTSStream) ranges over the channel until it is
// closed, so this method owns the close - mirrors vibevoice-cpp's TTSStream.
func (w *CrispASR) TTSStream(req *pb.TTSRequest, results chan []byte) error {
defer close(results)
if req.Text == "" {
return fmt.Errorf("crispasr: TTSStream requires text")
}
setVoice(req.Voice)
pcm, err := w.synthesize(req.Text)
if err != nil {
return err
}
tmp, err := os.CreateTemp("", "crispasr-tts-stream-*.wav")
if err != nil {
return fmt.Errorf("crispasr: tempfile: %w", err)
}
dst := tmp.Name()
if err := tmp.Close(); err != nil {
return fmt.Errorf("crispasr: close tempfile: %w", err)
}
defer func() { _ = os.Remove(dst) }()
if err := writeWAV24k(dst, pcm); err != nil {
return err
}
encoded, err := os.ReadFile(dst)
if err != nil {
return fmt.Errorf("crispasr: read tempfile: %w", err)
}
results <- encoded
return nil
}
// writeWAV24k writes pcm as a 24000 Hz, mono, 16-bit PCM WAV at dst.
func writeWAV24k(dst string, pcm []float32) error {
f, err := os.Create(dst)
if err != nil {
return fmt.Errorf("crispasr: create %q: %w", dst, err)
}
enc := wav.NewEncoder(f, 24000, 16, 1, 1)
ints := make([]int, len(pcm))
for i, s := range pcm {
if s > 1 {
s = 1
} else if s < -1 {
s = -1
}
ints[i] = int(s * 32767)
}
buf := &audio.IntBuffer{
Format: &audio.Format{NumChannels: 1, SampleRate: 24000},
Data: ints,
SourceBitDepth: 16,
}
if err := enc.Write(buf); err != nil {
_ = enc.Close()
_ = f.Close()
return fmt.Errorf("crispasr: encode WAV: %w", err)
}
if err := enc.Close(); err != nil {
_ = f.Close()
return fmt.Errorf("crispasr: finalize WAV: %w", err)
}
if err := f.Close(); err != nil {
return fmt.Errorf("crispasr: close %q: %w", dst, err)
}
return nil
}
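
Note on model options: `Load` treats each option as either a bare flag (`vad_only`) or a `key:value` pair split on the first colon, and resolves relative `codec:` and `voice:` paths against the model file's directory. The sketch below illustrates that behaviour outside the backend. `parseOptions` and `resolveSibling` are hypothetical helpers, `strings.Cut` is swapped in for the `strings.SplitN` call used by `splitOption`, and the option values and Unix-style paths are made up.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// parseOptions collects "key:value" options into a map and returns bare
// flags (options without ':') separately. Only the first ':' splits, so a
// value may itself contain colons.
func parseOptions(opts []string) (kv map[string]string, flags []string) {
	kv = make(map[string]string)
	for _, o := range opts {
		key, value, ok := strings.Cut(o, ":")
		if !ok {
			flags = append(flags, o)
			continue
		}
		kv[key] = value
	}
	return kv, flags
}

// resolveSibling joins a relative path onto the model file's directory and
// leaves empty or absolute paths untouched.
func resolveSibling(modelFile, p string) string {
	if p == "" || filepath.IsAbs(p) {
		return p
	}
	return filepath.Join(filepath.Dir(modelFile), p)
}

func main() {
	kv, flags := parseOptions([]string{
		"vad_only",
		"codec:codec.gguf",
		"voice_text:hello: world",
	})
	fmt.Println(flags, kv["voice_text"])
	fmt.Println(resolveSibling("/models/tts/model.gguf", kv["codec"]))
}
```

Splitting on the first colon only is what lets a `voice_text:` reference sentence contain its own colons without being truncated.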

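Note on the abort watcher: both transcription paths start a goroutine that sets the C-side abort flag when the context is cancelled, then join it (`close(done)` followed by `wg.Wait()`) before returning. The sketch below reproduces that pattern with a plain atomic flag in place of `CppSetAbort`. `runWithAbort` and the two demo functions are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runWithAbort calls work while a watcher goroutine sets flag to 1 if ctx is
// cancelled first. The watcher is joined before returning, so it can never
// write to flag after runWithAbort has returned.
func runWithAbort(ctx context.Context, flag *atomic.Int32, work func()) {
	done := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		select {
		case <-ctx.Done():
			flag.Store(1) // stands in for CppSetAbort(1)
		case <-done:
		}
	}()
	defer func() {
		close(done)
		wg.Wait()
	}()
	work()
}

// demoCompleted runs work to completion without cancelling; the flag stays 0.
func demoCompleted() int32 {
	var flag atomic.Int32
	runWithAbort(context.Background(), &flag, func() {})
	return flag.Load()
}

// demoCancelled cancels mid-work and waits until the watcher has set the flag.
func demoCancelled() int32 {
	var flag atomic.Int32
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	runWithAbort(ctx, &flag, func() {
		cancel()
		for flag.Load() == 0 {
			time.Sleep(time.Millisecond)
		}
	})
	return flag.Load()
}

func main() {
	fmt.Println(demoCompleted(), demoCancelled()) // prints: 0 1
}
```

The join is the important part. Without `wg.Wait()`, a watcher that wakes up late could set the flag after the call has returned and poison the next request that shares the same session.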

@@ -0,0 +1,193 @@
package main
import (
"context"
"os"
"path/filepath"
"strings"
"sync"
"testing"
"github.com/ebitengine/purego"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
)
func TestCrispASR(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "CrispASR Backend Suite")
}
var (
libLoadOnce sync.Once
libLoadErr error
)
// ensureLibLoaded mirrors main.go's bootstrap so a Go test can drive the
// bridge without spinning up the gRPC server. Skips the current spec when the
// shared library isn't present (e.g. running before `make backends/crispasr`).
func ensureLibLoaded() {
libLoadOnce.Do(func() {
libName := os.Getenv("CRISPASR_LIBRARY")
if libName == "" {
libName = "./libgocrispasr-fallback.so"
}
if _, err := os.Stat(libName); err != nil {
libLoadErr = err
return
}
gosd, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
libLoadErr = err
return
}
purego.RegisterLibFunc(&CppLoadModel, gosd, "load_model")
purego.RegisterLibFunc(&CppSetCodecPath, gosd, "set_codec_path")
purego.RegisterLibFunc(&CppTranscribe, gosd, "transcribe")
purego.RegisterLibFunc(&CppGetSegmentText, gosd, "get_segment_text")
purego.RegisterLibFunc(&CppGetSegmentStart, gosd, "get_segment_t0")
purego.RegisterLibFunc(&CppGetSegmentEnd, gosd, "get_segment_t1")
purego.RegisterLibFunc(&CppGetBackend, gosd, "get_backend")
purego.RegisterLibFunc(&CppSetAbort, gosd, "set_abort")
purego.RegisterLibFunc(&CppTTSSynthesize, gosd, "tts_synthesize")
purego.RegisterLibFunc(&CppTTSFree, gosd, "tts_free")
purego.RegisterLibFunc(&CppTTSSetVoice, gosd, "tts_set_voice")
purego.RegisterLibFunc(&CppTTSSetVoiceFile, gosd, "tts_set_voice_file")
})
if libLoadErr != nil {
Skip("crispasr library not loadable: " + libLoadErr.Error())
}
}
// fixturesOrSkip returns the model + audio paths or skips the spec if either
// env var is unset. The test never runs in default CI — it requires a real
// whisper model and a long audio file (~3 minutes) on disk.
func fixturesOrSkip() (string, string) {
modelPath := os.Getenv("CRISPASR_MODEL_PATH")
audioPath := os.Getenv("CRISPASR_AUDIO_PATH")
if modelPath == "" || audioPath == "" {
Skip("set CRISPASR_MODEL_PATH and CRISPASR_AUDIO_PATH to run this spec")
}
return modelPath, audioPath
}
// ttsModelOrSkip returns the TTS model path or skips the spec when the env var
// is unset. Like the transcription fixtures, this never runs in default CI — it
// needs a real TTS model (e.g. a vibevoice GGUF) on disk.
func ttsModelOrSkip() string {
modelPath := os.Getenv("CRISPASR_TTS_MODEL_PATH")
if modelPath == "" {
Skip("set CRISPASR_TTS_MODEL_PATH to run this spec")
}
return modelPath
}
var _ = Describe("CrispASR", func() {
Context("AudioTranscription cancellation", func() {
It("returns codes.Canceled on a pre-cancelled context and still succeeds afterwards", func() {
modelPath, audioPath := fixturesOrSkip()
ensureLibLoaded()
w := &CrispASR{}
Expect(w.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
// The session transcribe is blocking and exposes no abort hook, so
// a mid-decode cancel can't interrupt it. The contract we can rely
// on is the pre-call ctx.Err() check: a context cancelled before
// the call must yield codes.Canceled without starting a decode.
ctx, cancel := context.WithCancel(context.Background())
cancel()
_, err := w.AudioTranscription(ctx, &pb.TranscriptRequest{
Dst: audioPath,
Threads: 4,
Language: "en",
})
Expect(err).To(HaveOccurred(), "expected pre-cancelled context to fail")
st, ok := status.FromError(err)
Expect(ok).To(BeTrue(), "expected gRPC status error, got %v", err)
Expect(st.Code()).To(Equal(codes.Canceled), "expected codes.Canceled, got %v", err)
// Subsequent transcription must succeed — proves g_abort reset.
res, err := w.AudioTranscription(context.Background(), &pb.TranscriptRequest{
Dst: audioPath,
Threads: 4,
Language: "en",
})
Expect(err).ToNot(HaveOccurred(), "post-cancel transcription failed")
Expect(res.Text).ToNot(BeEmpty(), "post-cancel transcription returned empty text")
})
})
Context("AudioTranscriptionStream", func() {
It("emits one delta per non-empty segment for a multi-segment clip", func() {
modelPath, audioPath := fixturesOrSkip()
ensureLibLoaded()
w := &CrispASR{}
Expect(w.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
results := make(chan *pb.TranscriptStreamResponse, 64)
done := make(chan error, 1)
go func() {
done <- w.AudioTranscriptionStream(context.Background(), &pb.TranscriptRequest{
Dst: audioPath,
Threads: 4,
Language: "en",
Stream: true,
}, results)
}()
var deltas []string
var assembled strings.Builder
var finalText string
var finalSegmentCount int
for chunk := range results {
if d := chunk.GetDelta(); d != "" {
deltas = append(deltas, d)
assembled.WriteString(d)
}
if final := chunk.GetFinalResult(); final != nil {
finalText = final.GetText()
finalSegmentCount = len(final.GetSegments())
}
}
Expect(<-done).ToNot(HaveOccurred())
// One delta per non-empty segment is emitted after the blocking
// decode returns (the session API has no per-decode callback), so a
// multi-segment clip MUST produce >=2 delta events, and
// concat(deltas) MUST equal final.Text exactly.
Expect(len(deltas)).To(BeNumerically(">=", 2),
"expected multiple deltas from a multi-segment clip, got %d (assembled=%q)",
len(deltas), assembled.String())
Expect(finalSegmentCount).To(BeNumerically(">=", 2),
"expected final to carry multiple segments")
Expect(assembled.String()).To(Equal(finalText),
"concat(deltas) must equal final.Text")
})
})
Context("TTS", func() {
It("synthesizes a non-empty WAV", func() {
ttsModel := ttsModelOrSkip()
ensureLibLoaded()
w := &CrispASR{}
Expect(w.Load(&pb.ModelOptions{ModelFile: ttsModel})).To(Succeed())
dst := filepath.Join(GinkgoT().TempDir(), "out.wav")
Expect(w.TTS(&pb.TTSRequest{Text: "Hello from CrispASR.", Dst: dst})).To(Succeed())
info, err := os.Stat(dst)
Expect(err).ToNot(HaveOccurred(), "synthesized WAV should exist at %q", dst)
// A real 24 kHz mono WAV is a 44-byte header plus samples; a file of 1 KiB
// or less would mean an empty or failed synth.
Expect(info.Size()).To(BeNumerically(">", 1024),
"expected a non-trivial WAV, got %d bytes", info.Size())
})
})
})
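
Note on the streaming contract checked above: one delta is emitted per non-empty segment, the first without a leading space and each later one prefixed by a single space, so concatenating the deltas reproduces the final text exactly. The sketch below shows that assembly rule on its own. `deltasFor` is a hypothetical helper and the segment strings are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// deltasFor returns one delta per non-empty segment: the first is the
// trimmed text, later ones are the trimmed text prefixed with one space.
// Whitespace-only segments produce no delta.
func deltasFor(segments []string) []string {
	var deltas []string
	for _, seg := range segments {
		trimmed := strings.TrimSpace(seg)
		if trimmed == "" {
			continue
		}
		if len(deltas) > 0 {
			trimmed = " " + trimmed
		}
		deltas = append(deltas, trimmed)
	}
	return deltas
}

func main() {
	deltas := deltasFor([]string{" First segment.", "   ", "Second segment. "})
	fmt.Printf("%q\n", deltas)
	fmt.Printf("%q\n", strings.Join(deltas, ""))
}
```

Putting the separator at the front of each later delta, rather than at the end of each earlier one, means the final text never needs a trailing trim to match the concatenation.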


@@ -0,0 +1,58 @@
package main
// Note: this is started internally by LocalAI and a server is allocated for each model
import (
"flag"
"os"
"github.com/ebitengine/purego"
grpc "github.com/mudler/LocalAI/pkg/grpc"
)
var (
addr = flag.String("addr", "localhost:50051", "the address to connect to")
)
type LibFuncs struct {
FuncPtr any
Name string
}
func main() {
libName := os.Getenv("CRISPASR_LIBRARY")
if libName == "" {
libName = "./libgocrispasr-fallback.so"
}
lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
panic(err)
}
libFuncs := []LibFuncs{
{&CppLoadModel, "load_model"},
{&CppSetCodecPath, "set_codec_path"},
{&CppLoadModelVAD, "load_model_vad"},
{&CppVAD, "vad"},
{&CppTranscribe, "transcribe"},
{&CppGetSegmentText, "get_segment_text"},
{&CppGetSegmentStart, "get_segment_t0"},
{&CppGetSegmentEnd, "get_segment_t1"},
{&CppGetBackend, "get_backend"},
{&CppSetAbort, "set_abort"},
{&CppTTSSynthesize, "tts_synthesize"},
{&CppTTSFree, "tts_free"},
{&CppTTSSetVoice, "tts_set_voice"},
{&CppTTSSetVoiceFile, "tts_set_voice_file"},
}
for _, lf := range libFuncs {
purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
}
flag.Parse()
if err := grpc.StartServer(*addr, &CrispASR{}); err != nil {
panic(err)
}
}

backend/go/crispasr/package.sh Executable file

@@ -0,0 +1,65 @@
#!/bin/bash
# Script to copy the appropriate libraries based on architecture
# This script is used in the final stage of the Dockerfile
set -e
CURDIR=$(dirname "$(realpath "$0")")
REPO_ROOT="${CURDIR}/../../.."
# Create lib directory
mkdir -p $CURDIR/package/lib
cp -avf $CURDIR/crispasr $CURDIR/package/
cp -fv $CURDIR/libgocrispasr-*.so $CURDIR/package/
cp -fv $CURDIR/run.sh $CURDIR/package/
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
# ARM64 architecture
echo "Detected ARM64 architecture, copying ARM64 libraries..."
cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ $(uname -s) = "Darwin" ]; then
echo "Detected Darwin"
else
echo "Error: Could not detect architecture"
exit 1
fi
# Package GPU libraries based on BUILD_TYPE
# The GPU library packaging script will detect BUILD_TYPE and copy appropriate GPU libraries
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
    package_gpu_libs
fi
echo "Packaging completed successfully"
ls -liah $CURDIR/package/
ls -liah $CURDIR/package/lib/
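
The packaging script above picks its library directory from whichever dynamic loader exists on disk, then copies a fixed set of runtime libraries from the matching multiarch directory. A minimal sketch of that decision follows, assuming it were factored into helpers. `detect_triplet`, `runtime_lib_sources` and the optional root argument are hypothetical names introduced for illustration and are not part of the commit; the snippet only prints paths and copies nothing.

```shell
#!/bin/bash
# Sketch only. detect_triplet and runtime_lib_sources are hypothetical helpers,
# not part of package.sh. The optional root argument exists purely so the
# check can be pointed at a scratch directory.

# Print the multiarch triplet matching whichever dynamic loader is present.
detect_triplet() {
    local root="${1:-}"
    if [ -f "${root}/lib64/ld-linux-x86-64.so.2" ]; then
        echo "x86_64-linux-gnu"
    elif [ -f "${root}/lib/ld-linux-aarch64.so.1" ]; then
        echo "aarch64-linux-gnu"
    else
        return 1
    fi
}

# Print the source path of every runtime library the script bundles.
runtime_lib_sources() {
    local triplet="$1" lib
    for lib in libc.so.6 libgcc_s.so.1 libstdc++.so.6 libm.so.6 \
               libgomp.so.1 libdl.so.2 librt.so.1 libpthread.so.0; do
        echo "/lib/${triplet}/${lib}"
    done
}

# Dry run: list what would be copied on this machine, if it is recognised.
if triplet=$(detect_triplet); then
    runtime_lib_sources "$triplet"
else
    echo "no known glibc loader found"
fi
```

Listing each library once in a loop also makes it easier to see that the set is the same for both architectures.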

backend/go/crispasr/run.sh Executable file

@@ -0,0 +1,52 @@
#!/bin/bash
set -ex
# Get the absolute current dir where the script is located
CURDIR=$(dirname "$(realpath $0)")
cd /
echo "CPU info:"
if [ "$(uname)" != "Darwin" ]; then
    grep -e "model\sname" /proc/cpuinfo | head -1
    grep -e "flags" /proc/cpuinfo | head -1
fi
LIBRARY="$CURDIR/libgocrispasr-fallback.so"
if [ "$(uname)" != "Darwin" ]; then
    if grep -q -e "\savx\s" /proc/cpuinfo ; then
        echo "CPU: AVX found OK"
        if [ -e $CURDIR/libgocrispasr-avx.so ]; then
            LIBRARY="$CURDIR/libgocrispasr-avx.so"
        fi
    fi
    if grep -q -e "\savx2\s" /proc/cpuinfo ; then
        echo "CPU: AVX2 found OK"
        if [ -e $CURDIR/libgocrispasr-avx2.so ]; then
            LIBRARY="$CURDIR/libgocrispasr-avx2.so"
        fi
    fi
    # Check avx 512
    if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
        echo "CPU: AVX512F found OK"
        if [ -e $CURDIR/libgocrispasr-avx512.so ]; then
            LIBRARY="$CURDIR/libgocrispasr-avx512.so"
        fi
    fi
fi
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
export CRISPASR_LIBRARY=$LIBRARY
# If there is a lib/ld.so, use it
if [ -f $CURDIR/lib/ld.so ]; then
    echo "Using lib/ld.so"
    echo "Using library: $LIBRARY"
    exec $CURDIR/lib/ld.so $CURDIR/crispasr "$@"
fi
echo "Using library: $LIBRARY"
exec $CURDIR/crispasr "$@"
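
run.sh starts from the fallback library and upgrades to the AVX, AVX2 and AVX512 builds in that order, so the last variant that is both advertised by the CPU and present on disk wins. The sketch below restates that precedence as a function over a flags string so it can be tried without reading `/proc/cpuinfo`. `pick_library` is a hypothetical helper, and it assumes the flags are separated by single spaces.

```shell
#!/bin/bash
# Sketch only. pick_library is a hypothetical helper, not part of run.sh.
# Arguments: DIR holding the libgocrispasr-*.so files, FLAGS as a
# space-separated list of CPU flags.
pick_library() {
    local dir="$1" flags=" $2 " lib="$1/libgocrispasr-fallback.so"
    local variant flag
    # Later variants override earlier ones, matching the order in run.sh.
    for variant in avx avx2 avx512; do
        flag="$variant"
        # The AVX512 build is gated on the avx512f CPU flag.
        if [ "$variant" = "avx512" ]; then
            flag="avx512f"
        fi
        case "$flags" in
            *" $flag "*)
                # Only switch when the matching build was actually shipped.
                if [ -e "$dir/libgocrispasr-$variant.so" ]; then
                    lib="$dir/libgocrispasr-$variant.so"
                fi
                ;;
        esac
    done
    echo "$lib"
}

# Example call with a made-up flag list.
pick_library "${CURDIR:-.}" "fpu sse2 avx avx2"
```

Because every check requires the file to exist, a CPU that reports `avx512f` still gets the AVX2 build when no AVX512 library was packaged.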


@@ -122,6 +122,33 @@
nvidia-cuda-12: "cuda12-whisper"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-whisper"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-whisper"
- &crispasr
name: "crispasr"
alias: "crispasr"
license: mit
icon: https://user-images.githubusercontent.com/1991296/235238348-05d0f6a4-da44-4900-a1de-d0707e75b763.jpeg
description: |
CrispASR unified speech engine (whisper.cpp fork on ggml) supporting many ASR architectures (Parakeet, Canary, Voxtral, Qwen3-ASR, Granite, Wav2Vec2, Moonshine, OmniASR, FireRedASR, and more).
urls:
- https://github.com/CrispStrobe/CrispASR
tags:
- audio-transcription
- CPU
- GPU
- CUDA
- HIP
capabilities:
default: "cpu-crispasr"
nvidia: "cuda12-crispasr"
intel: "intel-sycl-f16-crispasr"
metal: "metal-crispasr"
amd: "rocm-crispasr"
vulkan: "vulkan-crispasr"
nvidia-l4t: "nvidia-l4t-arm64-crispasr"
nvidia-cuda-13: "cuda13-crispasr"
nvidia-cuda-12: "cuda12-crispasr"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-crispasr"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-crispasr"
- &parakeetcpp
name: "parakeet-cpp"
alias: "parakeet-cpp"
@@ -1957,6 +1984,131 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-whisper"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-whisper
## crispasr
- !!merge <<: *crispasr
name: "crispasr-development"
capabilities:
default: "cpu-crispasr-development"
nvidia: "cuda12-crispasr-development"
intel: "intel-sycl-f16-crispasr-development"
metal: "metal-crispasr-development"
amd: "rocm-crispasr-development"
vulkan: "vulkan-crispasr-development"
nvidia-l4t: "nvidia-l4t-arm64-crispasr-development"
nvidia-cuda-13: "cuda13-crispasr-development"
nvidia-cuda-12: "cuda12-crispasr-development"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-crispasr-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-crispasr-development"
- !!merge <<: *crispasr
name: "nvidia-l4t-arm64-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-crispasr"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-crispasr
- !!merge <<: *crispasr
name: "nvidia-l4t-arm64-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-crispasr"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-crispasr
- !!merge <<: *crispasr
name: "cuda13-nvidia-l4t-arm64-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-crispasr"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-crispasr
- !!merge <<: *crispasr
name: "cuda13-nvidia-l4t-arm64-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-crispasr"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-crispasr
- !!merge <<: *crispasr
name: "cpu-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-crispasr"
mirrors:
- localai/localai-backends:latest-cpu-crispasr
- !!merge <<: *crispasr
name: "metal-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-crispasr"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-crispasr
- !!merge <<: *crispasr
name: "metal-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-crispasr"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-crispasr
- !!merge <<: *crispasr
name: "cpu-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-crispasr"
mirrors:
- localai/localai-backends:master-cpu-crispasr
- !!merge <<: *crispasr
name: "cuda12-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-crispasr"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-crispasr
- !!merge <<: *crispasr
name: "rocm-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-crispasr"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-crispasr
- !!merge <<: *crispasr
name: "intel-sycl-f32-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-crispasr"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f32-crispasr
- !!merge <<: *crispasr
name: "intel-sycl-f16-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-crispasr"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f16-crispasr
- !!merge <<: *crispasr
name: "vulkan-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-crispasr"
mirrors:
- localai/localai-backends:latest-gpu-vulkan-crispasr
- !!merge <<: *crispasr
name: "vulkan-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-crispasr"
mirrors:
- localai/localai-backends:master-gpu-vulkan-crispasr
- !!merge <<: *crispasr
name: "metal-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-crispasr"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-crispasr
- !!merge <<: *crispasr
name: "metal-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-crispasr"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-crispasr
- !!merge <<: *crispasr
name: "cuda12-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-crispasr"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-crispasr
- !!merge <<: *crispasr
name: "rocm-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-crispasr"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-crispasr
- !!merge <<: *crispasr
name: "intel-sycl-f32-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-crispasr"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f32-crispasr
- !!merge <<: *crispasr
name: "intel-sycl-f16-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-crispasr"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f16-crispasr
- !!merge <<: *crispasr
name: "cuda13-crispasr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-crispasr"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-crispasr
- !!merge <<: *crispasr
name: "cuda13-crispasr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-crispasr"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-crispasr
## parakeet-cpp
- !!merge <<: *parakeetcpp
name: "parakeet-cpp-development"
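
The image entries in this hunk follow one pattern: release entries point at a `latest-` tag, `-development` entries point at the matching `master-` tag, and each has a mirror under `localai/localai-backends` with the same tag. A rough sketch of that naming rule follows. `gallery_entry` is a hypothetical generator, not a script in the repository, and the YAML indentation it prints is an assumption, since the indentation is not visible in this view.

```shell
#!/bin/bash
# Sketch only. gallery_entry is a hypothetical generator, not part of the repo.
# Arguments: NAME is the gallery entry name, SUFFIX is the image tag suffix.
gallery_entry() {
    local name="$1" suffix="$2" channel="latest" tag
    # Development entries track the master branch images.
    case "$name" in
        *-development) channel="master" ;;
    esac
    tag="${channel}-${suffix}"
    printf '%s\n' '- !!merge <<: *crispasr'
    printf '  name: "%s"\n' "$name"
    printf '  uri: "quay.io/go-skynet/local-ai-backends:%s"\n' "$tag"
    printf '  mirrors:\n'
    printf '    - localai/localai-backends:%s\n' "$tag"
}

# Example: the release and development CPU entries.
gallery_entry "cpu-crispasr" "cpu-crispasr"
gallery_entry "cpu-crispasr-development" "cpu-crispasr"
```

The entry name and the tag suffix are passed separately because they differ for most GPU images, for example `cuda12-crispasr` versus `gpu-nvidia-cuda-12-crispasr`.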


@@ -31,6 +31,7 @@ var knownPrefOnlyBackends = []schema.KnownBackend{
{Name: "mlx-vlm", Modality: "text", AutoDetect: false, Description: "MLX vision-language models (preference-only)"},
// ASR
{Name: "whisperx", Modality: "asr", AutoDetect: false, Description: "WhisperX transcription (preference-only)"},
{Name: "crispasr", Modality: "asr", AutoDetect: false, Description: "CrispASR multi-architecture transcription (preference-only)"},
// TTS
{Name: "kokoros", Modality: "tts", AutoDetect: false, Description: "Kokoros TTS (preference-only)"},
{Name: "qwen-tts", Modality: "tts", AutoDetect: false, Description: "Qwen TTS (preference-only)"},


@@ -140,6 +140,7 @@ var _ = Describe("Backend Endpoints", func() {
expectPrefOnly("trl", "text")
expectPrefOnly("mlx-vlm", "text")
expectPrefOnly("whisperx", "asr")
expectPrefOnly("crispasr", "asr")
expectPrefOnly("kokoros", "tts")
expectPrefOnly("qwen-tts", "tts")
expectPrefOnly("qwen3-tts-cpp", "tts")


@@ -31771,3 +31771,844 @@
- filename: parakeet-cpp/tdt_ctc-1.1b-f16.gguf
uri: huggingface://mudler/parakeet-cpp-gguf/tdt_ctc-1.1b-f16.gguf
sha256: cd53f64eefac2623a12f2f118ef50b56622dc3012f42c815c6adf0d08292f387
- name: parakeet-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/parakeet-tdt-0.6b-v3-GGUF
description: |
NVIDIA Parakeet TDT 0.6B v3 (FastConformer + Token-and-Duration Transducer), 25-language ASR. Runs via the CrispASR backend. Default GGUF size ~467 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: parakeet-crispasr
parameters:
model: parakeet-tdt-0.6b-v3-q4_k.gguf
files:
- filename: parakeet-tdt-0.6b-v3-q4_k.gguf
uri: huggingface://cstr/parakeet-tdt-0.6b-v3-GGUF/parakeet-tdt-0.6b-v3-q4_k.gguf
- name: parakeet-v2-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/parakeet-tdt-0.6b-v2-GGUF
description: |
NVIDIA Parakeet TDT 0.6B v2 (FastConformer + TDT), English-only ASR. Runs via the CrispASR backend. Default GGUF size ~468 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: parakeet-v2-crispasr
parameters:
model: parakeet-tdt-0.6b-v2-q4_k.gguf
files:
- filename: parakeet-tdt-0.6b-v2-q4_k.gguf
uri: huggingface://cstr/parakeet-tdt-0.6b-v2-GGUF/parakeet-tdt-0.6b-v2-q4_k.gguf
- name: parakeet-ja-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/parakeet-tdt-0.6b-ja-GGUF
description: |
NVIDIA Parakeet TDT 0.6B Japanese ASR (F16 default; Q4_K is quantisation-sensitive for this model). Runs via the CrispASR backend. Default GGUF size ~1.24 GB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: parakeet-ja-crispasr
parameters:
model: parakeet-tdt-0.6b-ja.gguf
files:
- filename: parakeet-tdt-0.6b-ja.gguf
uri: huggingface://cstr/parakeet-tdt-0.6b-ja-GGUF/parakeet-tdt-0.6b-ja.gguf
- name: parakeet-tdt-1.1b-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/parakeet-tdt-1.1b-GGUF
description: |
NVIDIA Parakeet TDT 1.1B (42-layer FastConformer encoder), English-only ASR. Runs via the CrispASR backend. Default GGUF size ~808 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: parakeet-tdt-1.1b-crispasr
parameters:
model: parakeet-tdt-1.1b-q4_k.gguf
files:
- filename: parakeet-tdt-1.1b-q4_k.gguf
uri: huggingface://cstr/parakeet-tdt-1.1b-GGUF/parakeet-tdt-1.1b-q4_k.gguf
- name: parakeet-tdt_ctc-110m-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/parakeet-tdt_ctc-110m-GGUF
description: |
NVIDIA Parakeet hybrid TDT+CTC 110M (smallest, CTC decode), English-only ASR. Runs via the CrispASR backend. Default GGUF size ~91 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: parakeet-tdt_ctc-110m-crispasr
parameters:
model: parakeet-tdt_ctc-110m-q4_k.gguf
files:
- filename: parakeet-tdt_ctc-110m-q4_k.gguf
uri: huggingface://cstr/parakeet-tdt_ctc-110m-GGUF/parakeet-tdt_ctc-110m-q4_k.gguf
- name: parakeet-tdt_ctc-1.1b-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/parakeet-tdt_ctc-1.1b-GGUF
description: |
NVIDIA Parakeet hybrid TDT+CTC 1.1B (multilingual, casing + punctuation) ASR. Runs via the CrispASR backend. Default GGUF size ~810 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: parakeet-tdt_ctc-1.1b-crispasr
parameters:
model: parakeet-tdt_ctc-1.1b-q4_k.gguf
files:
- filename: parakeet-tdt_ctc-1.1b-q4_k.gguf
uri: huggingface://cstr/parakeet-tdt_ctc-1.1b-GGUF/parakeet-tdt_ctc-1.1b-q4_k.gguf
- name: parakeet-rnnt-0.6b-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/parakeet-rnnt-0.6b-GGUF
description: |
NVIDIA Parakeet RNN-Transducer 0.6B (24-layer FastConformer) ASR. Runs via the CrispASR backend. Default GGUF size ~447 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: parakeet-rnnt-0.6b-crispasr
parameters:
model: parakeet-rnnt-0.6b-q4_k.gguf
files:
- filename: parakeet-rnnt-0.6b-q4_k.gguf
uri: huggingface://cstr/parakeet-rnnt-0.6b-GGUF/parakeet-rnnt-0.6b-q4_k.gguf
- name: parakeet-rnnt-1.1b-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/parakeet-rnnt-1.1b-GGUF
description: |
NVIDIA Parakeet RNN-Transducer 1.1B (42-layer FastConformer) ASR. Runs via the CrispASR backend. Default GGUF size ~770 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: parakeet-rnnt-1.1b-crispasr
parameters:
model: parakeet-rnnt-1.1b-q4_k.gguf
files:
- filename: parakeet-rnnt-1.1b-q4_k.gguf
uri: huggingface://cstr/parakeet-rnnt-1.1b-GGUF/parakeet-rnnt-1.1b-q4_k.gguf
- name: fastconformer-ctc-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/stt-en-fastconformer-ctc-large-GGUF
description: |
NVIDIA STT-EN FastConformer-CTC Large, English ASR. Runs via the CrispASR backend. Default GGUF size ~83 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: fastconformer-ctc-crispasr
parameters:
model: stt-en-fastconformer-ctc-large-q4_k.gguf
files:
- filename: stt-en-fastconformer-ctc-large-q4_k.gguf
uri: huggingface://cstr/stt-en-fastconformer-ctc-large-GGUF/stt-en-fastconformer-ctc-large-q4_k.gguf
- name: canary-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/canary-1b-v2-GGUF
description: |
NVIDIA Canary 1B v2 (FastConformer encoder-decoder), multilingual ASR + translation. Runs via the CrispASR backend. Default GGUF size ~600 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: canary-crispasr
parameters:
model: canary-1b-v2-q4_k.gguf
files:
- filename: canary-1b-v2-q4_k.gguf
uri: huggingface://cstr/canary-1b-v2-GGUF/canary-1b-v2-q4_k.gguf
- name: voxtral-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/voxtral-mini-3b-2507-GGUF
description: |
Mistral Voxtral Mini 3B (audio LLM) ASR. Runs via the CrispASR backend. Default GGUF size ~2.5 GB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: voxtral-crispasr
parameters:
model: voxtral-mini-3b-2507-q4_k.gguf
files:
- filename: voxtral-mini-3b-2507-q4_k.gguf
uri: huggingface://cstr/voxtral-mini-3b-2507-GGUF/voxtral-mini-3b-2507-q4_k.gguf
- name: voxtral4b-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/voxtral-mini-4b-realtime-GGUF
description: |
Mistral Voxtral Mini 4B Realtime (audio LLM) ASR. Runs via the CrispASR backend. Default GGUF size ~3.3 GB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: voxtral4b-crispasr
parameters:
model: voxtral-mini-4b-realtime-q4_k.gguf
files:
- filename: voxtral-mini-4b-realtime-q4_k.gguf
uri: huggingface://cstr/voxtral-mini-4b-realtime-GGUF/voxtral-mini-4b-realtime-q4_k.gguf
- name: granite-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/granite-speech-4.0-1b-GGUF
description: |
IBM Granite Speech 4.0 1B ASR. Runs via the CrispASR backend. Default GGUF size ~2.94 GB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: granite-crispasr
parameters:
model: granite-speech-4.0-1b-q4_k.gguf
files:
- filename: granite-speech-4.0-1b-q4_k.gguf
uri: huggingface://cstr/granite-speech-4.0-1b-GGUF/granite-speech-4.0-1b-q4_k.gguf
- name: granite-4.1-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/granite-speech-4.1-2b-GGUF
description: |
IBM Granite Speech 4.1 2B ASR. Runs via the CrispASR backend. Default GGUF size ~2.94 GB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: granite-4.1-crispasr
parameters:
model: granite-speech-4.1-2b-q4_k.gguf
files:
- filename: granite-speech-4.1-2b-q4_k.gguf
uri: huggingface://cstr/granite-speech-4.1-2b-GGUF/granite-speech-4.1-2b-q4_k.gguf
- name: granite-4.1-plus-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/granite-speech-4.1-2b-plus-GGUF
description: |
IBM Granite Speech 4.1 2B Plus ASR. Runs via the CrispASR backend. Default GGUF size ~2.96 GB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: granite-4.1-plus-crispasr
parameters:
model: granite-speech-4.1-2b-plus-q4_k.gguf
files:
- filename: granite-speech-4.1-2b-plus-q4_k.gguf
uri: huggingface://cstr/granite-speech-4.1-2b-plus-GGUF/granite-speech-4.1-2b-plus-q4_k.gguf
- name: granite-4.1-nar-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/granite-speech-4.1-2b-nar-GGUF
description: |
IBM Granite Speech 4.1 2B NAR (non-autoregressive) ASR. Runs via the CrispASR backend. Default GGUF size ~3.2 GB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: granite-4.1-nar-crispasr
parameters:
model: granite-speech-4.1-2b-nar-q4_k.gguf
files:
- filename: granite-speech-4.1-2b-nar-q4_k.gguf
uri: huggingface://cstr/granite-speech-4.1-2b-nar-GGUF/granite-speech-4.1-2b-nar-q4_k.gguf
- name: qwen3-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/qwen3-asr-0.6b-GGUF
description: |
Qwen3-ASR 0.6B ASR. Runs via the CrispASR backend. Default GGUF size ~500 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: qwen3-crispasr
parameters:
model: qwen3-asr-0.6b-q4_k.gguf
files:
- filename: qwen3-asr-0.6b-q4_k.gguf
uri: huggingface://cstr/qwen3-asr-0.6b-GGUF/qwen3-asr-0.6b-q4_k.gguf
- name: qwen3-1.7b-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/qwen3-asr-1.7b-GGUF
description: |
Qwen3-ASR 1.7B ASR. Runs via the CrispASR backend. Default GGUF size ~1.3 GB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: qwen3-1.7b-crispasr
parameters:
model: qwen3-asr-1.7b-q4_k.gguf
files:
- filename: qwen3-asr-1.7b-q4_k.gguf
uri: huggingface://cstr/qwen3-asr-1.7b-GGUF/qwen3-asr-1.7b-q4_k.gguf
- name: cohere-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/cohere-transcribe-03-2026-GGUF
description: |
Cohere Transcribe (03-2026) ASR. Runs via the CrispASR backend. Default GGUF size ~550 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: cohere-crispasr
parameters:
model: cohere-transcribe-q4_k.gguf
files:
- filename: cohere-transcribe-q4_k.gguf
uri: huggingface://cstr/cohere-transcribe-03-2026-GGUF/cohere-transcribe-q4_k.gguf
- name: wav2vec2-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/wav2vec2-large-xlsr-53-english-GGUF
description: |
wav2vec2 Large XLSR-53 English (CTC) ASR. Runs via the CrispASR backend. Default GGUF size ~212 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: wav2vec2-crispasr
parameters:
model: wav2vec2-xlsr-en-q4_k.gguf
files:
- filename: wav2vec2-xlsr-en-q4_k.gguf
uri: huggingface://cstr/wav2vec2-large-xlsr-53-english-GGUF/wav2vec2-xlsr-en-q4_k.gguf
- name: wav2vec2-de-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/wav2vec2-large-xlsr-53-german-GGUF
description: |
wav2vec2 Large XLSR-53 German (CTC) ASR. Runs via the CrispASR backend. Default GGUF size ~222 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: wav2vec2-de-crispasr
parameters:
model: wav2vec2-large-xlsr-53-german-q4_k.gguf
files:
- filename: wav2vec2-large-xlsr-53-german-q4_k.gguf
uri: huggingface://cstr/wav2vec2-large-xlsr-53-german-GGUF/wav2vec2-large-xlsr-53-german-q4_k.gguf
- name: vibevoice-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/vibevoice-asr-GGUF
description: |
VibeVoice ASR. Runs via the CrispASR backend. Default GGUF size ~4.5 GB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: vibevoice-crispasr
parameters:
model: vibevoice-asr-q4_k.gguf
files:
- filename: vibevoice-asr-q4_k.gguf
uri: huggingface://cstr/vibevoice-asr-GGUF/vibevoice-asr-q4_k.gguf
- name: vibevoice-tts-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/vibevoice-realtime-0.5b-GGUF
description: |
VibeVoice Realtime 0.5B text-to-speech (TTS) model, synthesized through the CrispASR backend. Produces 24 kHz mono audio; runs end-to-end on CPU with a built-in default voice. Default GGUF size ~636 MB.
tags:
- crispasr
- tts
- text-to-speech
- gguf
overrides:
backend: crispasr
known_usecases:
- tts
name: vibevoice-tts-crispasr
parameters:
model: vibevoice-realtime-0.5b-q4_k.gguf
files:
- filename: vibevoice-realtime-0.5b-q4_k.gguf
uri: huggingface://cstr/vibevoice-realtime-0.5b-GGUF/vibevoice-realtime-0.5b-q4_k.gguf
- name: chatterbox-tts-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/chatterbox-GGUF
description: |
Chatterbox (ResembleAI, MIT) text-to-speech synthesized through the CrispASR backend. Two-GGUF runtime: a Llama T3 token model plus an S3Gen codec companion (tokens to 24 kHz waveform). Auto-detected by CrispASR and ships with a built-in default voice; runs end-to-end on CPU and produces 24 kHz mono audio. Default GGUF sizes ~630 MB (T3) + ~358 MB (S3Gen).
tags:
- crispasr
- tts
- text-to-speech
- gguf
overrides:
backend: crispasr
known_usecases:
- tts
name: chatterbox-tts-crispasr
options:
- "codec:chatterbox-s3gen-q8_0.gguf"
parameters:
model: chatterbox-t3-q8_0.gguf
files:
- filename: chatterbox-t3-q8_0.gguf
uri: huggingface://cstr/chatterbox-GGUF/chatterbox-t3-q8_0.gguf
- filename: chatterbox-s3gen-q8_0.gguf
uri: huggingface://cstr/chatterbox-GGUF/chatterbox-s3gen-q8_0.gguf
- name: qwen3-tts-customvoice-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/qwen3-tts-0.6b-customvoice-GGUF
description: |
Qwen3-TTS CustomVoice 0.6B (12 Hz) text-to-speech synthesized through the CrispASR backend. Fixed-speaker fine-tune driven via an explicit backend selector plus a tokenizer codec companion. Ships baked speakers (vivian, aiden, dylan, eric, ono_anna, ryan, serena, sohee, uncle_fu); the default config selects vivian. Runs end-to-end on CPU and produces 24 kHz mono audio. Default GGUF sizes ~968 MB (talker) + ~358 MB (tokenizer).
tags:
- crispasr
- tts
- text-to-speech
- gguf
overrides:
backend: crispasr
known_usecases:
- tts
name: qwen3-tts-customvoice-crispasr
options:
- "backend:qwen3-tts"
- "codec:qwen3-tts-tokenizer-12hz.gguf"
- "speaker:vivian"
parameters:
model: qwen3-tts-12hz-0.6b-customvoice-q8_0.gguf
files:
- filename: qwen3-tts-12hz-0.6b-customvoice-q8_0.gguf
uri: huggingface://cstr/qwen3-tts-0.6b-customvoice-GGUF/qwen3-tts-12hz-0.6b-customvoice-q8_0.gguf
- filename: qwen3-tts-tokenizer-12hz.gguf
uri: huggingface://cstr/qwen3-tts-tokenizer-12hz-GGUF/qwen3-tts-tokenizer-12hz.gguf
- name: orpheus-tts-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/orpheus-3b-base-GGUF
description: |
Orpheus-3B (Llama-3.2 base) text-to-speech synthesized through the CrispASR backend. Auto-detected by CrispASR; needs a SNAC 24 kHz codec companion and a baked speaker. Ships speaker tara (selected by the default config). Runs end-to-end on CPU and produces 24 kHz mono audio. Default GGUF sizes ~3.5 GB (model) + ~26 MB (SNAC codec).
tags:
- crispasr
- tts
- text-to-speech
- gguf
overrides:
backend: crispasr
known_usecases:
- tts
name: orpheus-tts-crispasr
options:
- "codec:snac-24khz.gguf"
- "speaker:tara"
parameters:
model: orpheus-3b-base-q8_0.gguf
files:
- filename: orpheus-3b-base-q8_0.gguf
uri: huggingface://cstr/orpheus-3b-base-GGUF/orpheus-3b-base-q8_0.gguf
- filename: snac-24khz.gguf
uri: huggingface://cstr/snac-24khz-GGUF/snac-24khz.gguf
- name: hubert-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/hubert-large-ls960-ft-GGUF
description: |
HuBERT Large (LS960 fine-tune) CTC speech recognition, English. Runs via the CrispASR backend with an explicit backend selector. Default GGUF size ~200 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: hubert-crispasr
options:
- "backend:hubert"
parameters:
model: hubert-large-ls960-ft-q4_k.gguf
files:
- filename: hubert-large-ls960-ft-q4_k.gguf
uri: huggingface://cstr/hubert-large-ls960-ft-GGUF/hubert-large-ls960-ft-q4_k.gguf
- name: data2vec-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/data2vec-audio-960h-GGUF
description: |
data2vec Audio Base (960h) CTC speech recognition, English. Runs via the CrispASR backend with an explicit backend selector. Default GGUF size ~60 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: data2vec-crispasr
options:
- "backend:data2vec"
parameters:
model: data2vec-audio-base-960h-q4_k.gguf
files:
- filename: data2vec-audio-base-960h-q4_k.gguf
uri: huggingface://cstr/data2vec-audio-960h-GGUF/data2vec-audio-base-960h-q4_k.gguf
- name: glm-asr-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/glm-asr-nano-GGUF
description: |
GLM-ASR Nano speech recognition. Runs via the CrispASR backend with an explicit backend selector. Default GGUF size ~1.2 GB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: glm-asr-crispasr
options:
- "backend:glm-asr"
parameters:
model: glm-asr-nano-q4_k.gguf
files:
- filename: glm-asr-nano-q4_k.gguf
uri: huggingface://cstr/glm-asr-nano-GGUF/glm-asr-nano-q4_k.gguf
- name: kyutai-stt-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/kyutai-stt-1b-GGUF
description: |
Kyutai STT 1B (Moshi-style) speech recognition. Runs via the CrispASR backend with an explicit backend selector. Default GGUF size ~636 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: kyutai-stt-crispasr
options:
- "backend:kyutai-stt"
parameters:
model: kyutai-stt-1b-q4_k.gguf
files:
- filename: kyutai-stt-1b-q4_k.gguf
uri: huggingface://cstr/kyutai-stt-1b-GGUF/kyutai-stt-1b-q4_k.gguf
- name: firered-asr-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/firered-asr2-aed-GGUF
description: |
FireRed-ASR2 AED speech recognition. Runs via the CrispASR backend with an explicit backend selector. Default GGUF size ~918 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: firered-asr-crispasr
options:
- "backend:firered-asr"
parameters:
model: firered-asr2-aed-q4_k.gguf
files:
- filename: firered-asr2-aed-q4_k.gguf
uri: huggingface://cstr/firered-asr2-aed-GGUF/firered-asr2-aed-q4_k.gguf
- name: moonshine-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/moonshine-tiny-GGUF
description: |
Moonshine Tiny speech recognition, English. Runs via the CrispASR backend with an explicit backend selector and a companion tokenizer. Default GGUF size ~20 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: moonshine-crispasr
options:
- "backend:moonshine"
- "codec:tokenizer.bin"
parameters:
model: moonshine-tiny-q4_k.gguf
files:
- filename: moonshine-tiny-q4_k.gguf
uri: huggingface://cstr/moonshine-tiny-GGUF/moonshine-tiny-q4_k.gguf
- filename: tokenizer.bin
uri: huggingface://cstr/moonshine-tiny-GGUF/tokenizer.bin
- name: moonshine-de-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/moonshine-base-de-fidoriel-GGUF
description: |
Moonshine Base German fine-tune (fidoriel), best-quality German Moonshine. Runs via the CrispASR backend with an explicit backend selector and a companion tokenizer. Default GGUF size ~39 MB.
license: CC-BY-NC-SA-4.0
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: moonshine-de-crispasr
options:
- "backend:moonshine"
- "codec:tokenizer.bin"
parameters:
model: moonshine-base-de-fidoriel-q4_k.gguf
files:
- filename: moonshine-base-de-fidoriel-q4_k.gguf
uri: huggingface://cstr/moonshine-base-de-fidoriel-GGUF/moonshine-base-de-fidoriel-q4_k.gguf
- filename: tokenizer.bin
uri: huggingface://cstr/moonshine-base-de-fidoriel-GGUF/tokenizer.bin
- name: moonshine-tiny-de-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/moonshine-tiny-de-fidoriel-GGUF
description: |
Moonshine Tiny German fine-tune (fidoriel), smaller/faster German Moonshine. Runs via the CrispASR backend with an explicit backend selector and a companion tokenizer. Default GGUF size ~17 MB.
license: CC-BY-NC-SA-4.0
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: moonshine-tiny-de-crispasr
options:
- "backend:moonshine"
- "codec:tokenizer.bin"
parameters:
model: moonshine-tiny-de-fidoriel-q4_k.gguf
files:
- filename: moonshine-tiny-de-fidoriel-q4_k.gguf
uri: huggingface://cstr/moonshine-tiny-de-fidoriel-GGUF/moonshine-tiny-de-fidoriel-q4_k.gguf
- filename: tokenizer.bin
uri: huggingface://cstr/moonshine-tiny-de-fidoriel-GGUF/tokenizer.bin
- name: moonshine-streaming-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/moonshine-streaming-tiny-GGUF
description: |
Moonshine Streaming Tiny speech recognition. Runs via the CrispASR backend with an explicit backend selector and a companion tokenizer. Default GGUF size ~31 MB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: moonshine-streaming-crispasr
options:
- "backend:moonshine-streaming"
- "codec:tokenizer.bin"
parameters:
model: moonshine-streaming-tiny-q4_k.gguf
files:
- filename: moonshine-streaming-tiny-q4_k.gguf
uri: huggingface://cstr/moonshine-streaming-tiny-GGUF/moonshine-streaming-tiny-q4_k.gguf
- filename: tokenizer.bin
uri: huggingface://cstr/moonshine-streaming-tiny-GGUF/tokenizer.bin
- name: mimo-asr-crispasr
url: github:mudler/LocalAI/gallery/virtual.yaml@master
urls:
- https://huggingface.co/cstr/mimo-asr-GGUF
description: |
MiMo-ASR speech recognition. Runs via the CrispASR backend with an explicit backend selector and a companion tokenizer GGUF. Default GGUF size ~4.2 GB.
tags:
- crispasr
- asr
- speech-recognition
- stt
- gguf
overrides:
backend: crispasr
known_usecases:
- transcript
name: mimo-asr-crispasr
options:
- "backend:mimo-asr"
- "codec:mimo-tokenizer-q4_k.gguf"
parameters:
model: mimo-asr-q4_k.gguf
files:
- filename: mimo-asr-q4_k.gguf
uri: huggingface://cstr/mimo-asr-GGUF/mimo-asr-q4_k.gguf
- filename: mimo-tokenizer-q4_k.gguf
uri: huggingface://cstr/mimo-tokenizer-GGUF/mimo-tokenizer-q4_k.gguf
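
Several entries above pass extra settings through `options:` as `key:value` strings, for example `backend:qwen3-tts`, `codec:qwen3-tts-tokenizer-12hz.gguf` and `speaker:vivian`. How the CrispASR backend parses them is not shown in this part of the diff. The sketch below assumes a split on the first colon only, so values that themselves contain a colon stay intact. `parse_option` is a hypothetical helper.

```shell
#!/bin/bash
# Sketch only. parse_option is a hypothetical helper; splitting on the first
# colon is an assumption about how such option strings are read.
parse_option() {
    local opt="$1"
    case "$opt" in
        *:*)
            # ${opt%%:*} keeps the text before the first colon,
            # ${opt#*:} keeps everything after it.
            printf '%s=%s\n' "${opt%%:*}" "${opt#*:}"
            ;;
        *)
            # No colon: not a key:value option.
            return 1
            ;;
    esac
}

# Example: the options of the qwen3-tts-customvoice-crispasr entry.
for opt in "backend:qwen3-tts" "codec:qwen3-tts-tokenizer-12hz.gguf" "speaker:vivian"; do
    parse_option "$opt"
done
```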