fix(buun-llama-cpp): shim cudaMemcpy{To,From}Symbol + WARP_SIZE on fwht128 shuffles

Two more hipblas-only build failures in buun's fattn.cu, fixed under the
same patches/ infrastructure:

1. cudaMemcpyToSymbol / cudaMemcpyFromSymbol: buun's Q² calibration +
   TCQ codebook upload paths call the symbol variants of cudaMemcpy.
   ggml/src/ggml-cuda/vendors/hip.h aliases every other cudaMemcpy* name
   (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2DAsync, …) but the symbol
   pair was never added. 15+ "use of undeclared identifier" errors
   across fattn.cu lines 40, 54, 74-76, 94, 100-101, 371, 883, 905, 954,
   976, 1449, 1463. Add the two missing aliases alongside the existing
   memcpy block.

2. __shfl_xor_sync fwht128 calls: same 3-arg omission pattern as the
   earlier argmax top-K fix. Lines 512 (ggml_cuda_fwht128 intra-warp
   butterfly) and 536 (fwht128_store_half neighbor fetch) drop the width
   argument that hip.h:33 requires. Add WARP_SIZE.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
fix(buun-llama-cpp): pass WARP_SIZE to argmax __shfl_xor_sync calls
2026-05-23 08:10:48 -04:00 · 2026-04-24 20:09:36 +00:00 · 2026-04-24 16:29:29 +00:00 · 2026-04-24 13:57:30 +00:00 · 2026-04-24 12:52:54 +00:00 · 2026-04-24 12:52:53 +00:00
82 changed files with 7812 additions and 173 deletions
--- a/.agents/ai-coding-assistants.md
+++ b/.agents/ai-coding-assistants.md
@@ -35,19 +35,33 @@ All contributions must comply with LocalAI's licensing requirements:

 ## Signed-off-by and Developer Certificate of Origin

-**AI agents MUST NOT add `Signed-off-by` tags.** Only humans can legally
-certify the Developer Certificate of Origin (DCO). The human submitter
-is responsible for:
+Only humans can certify the Developer Certificate of Origin (DCO). AI
+agents MUST NOT invent or guess a human identity for `Signed-off-by` —
+doing so forges the DCO certification.

-- Reviewing all AI-generated code
+However, when a human operator explicitly directs the AI to commit on
+their behalf, the AI is acting as a typing tool — no different from an
+editor macro or `git commit -s`. In that case the AI SHOULD add
+`Signed-off-by:` using the **configured `user.name` / `user.email`** of
+the current git repository (i.e. the operator's own identity). The
+resulting trailer is the operator's signature; they take responsibility
+for it by reviewing and pushing the commit. The AI MUST NOT use any
+other identity and MUST NOT add its own name to the sign-off.
+
+When running `git commit`, prefer `git commit --signoff` (or `-s`) so
+the trailer is emitted by git itself from the configured identity,
+rather than hand-writing it in a heredoc — this guarantees the sign-off
+matches whatever identity the operator is currently using.
+
+The human submitter remains responsible for:
+
+- Reviewing all AI-generated code before it's pushed or merged
 - Ensuring compliance with licensing requirements
-- Adding their own `Signed-off-by` tag (when the project requires DCO)
-  to certify the contribution
 - Taking full responsibility for the contribution

-AI agents MUST NOT add `Co-Authored-By` trailers for themselves either.
-A human reviewer owns the contribution; the AI's involvement is recorded
-via `Assisted-by` (see below).
+AI agents MUST NOT add `Co-Authored-By` trailers for themselves. A human
+reviewer owns the contribution; the AI's involvement is recorded via
+`Assisted-by` (see below).

 ## Attribution

@@ -84,6 +98,12 @@ Assisted-by: Claude:claude-opus-4-7 golangci-lint
 Signed-off-by: Jane Developer <jane@example.com>
 ```

+The `Signed-off-by` line uses Jane's own identity because Jane is the
+submitter operating the AI. If Jane asks Claude to create the commit via
+`git commit -s`, git emits that exact trailer from Jane's configured
+identity — no separate human step is needed beyond Jane reviewing the
+diff before pushing.
+
 ## Scope and Responsibility

 Using an AI assistant does not reduce the contributor's responsibility.
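The sign-off guidance in the diff above can be tried out in a throwaway repository. This is a minimal sketch, assuming `git` is installed; the temporary repository, the file name, and the commit subject are made up for illustration, and the identity is the example one used in the document.

```shell
#!/bin/sh
set -eu

# Throwaway repository so the experiment does not touch a real checkout.
repo=$(mktemp -d)
cd "$repo"
git init -q

# The operator's identity, as configured for this repository.
# (Example identity taken from the document; substitute your own.)
git config user.name "Jane Developer"
git config user.email "jane@example.com"

echo "demo" > file.txt
git add file.txt

# -s / --signoff lets git emit the trailer from the configured identity,
# instead of hand-writing it into the commit message.
git commit -q -s -m "demo: add file" -m "Assisted-by: Claude:claude-opus-4-7"

# Extract the sign-off line git generated.
signoff=$(git log -1 --format=%B | grep '^Signed-off-by:')
echo "$signoff"
```

Because the trailer is derived from `user.name` / `user.email`, changing the configured identity changes the sign-off without touching the commit command.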
--- a/.github/workflows/backend.yml
+++ b/.github/workflows/backend.yml
@@ -399,6 +399,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
+          - build-type: 'cublas'
+            cuda-major-version: "12"
+            cuda-minor-version: "8"
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: '-gpu-nvidia-cuda-12-buun-llama-cpp'
+            runs-on: 'bigger-runner'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            backend: "buun-llama-cpp"
+            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
+            context: "./"
+            ubuntu-version: '2404'
          - build-type: 'cublas'
            cuda-major-version: "12"
            cuda-minor-version: "8"
@@ -894,6 +907,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
+          - build-type: 'cublas'
+            cuda-major-version: "13"
+            cuda-minor-version: "0"
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: '-gpu-nvidia-cuda-13-buun-llama-cpp'
+            runs-on: 'ubuntu-latest'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            backend: "buun-llama-cpp"
+            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
+            context: "./"
+            ubuntu-version: '2404'
          - build-type: 'cublas'
            cuda-major-version: "13"
            cuda-minor-version: "0"
@@ -920,6 +946,19 @@ jobs:
            backend: "turboquant"
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
+          - build-type: 'cublas'
+            cuda-major-version: "13"
+            cuda-minor-version: "0"
+            platforms: 'linux/arm64'
+            skip-drivers: 'false'
+            tag-latest: 'auto'
+            tag-suffix: '-nvidia-l4t-cuda-13-arm64-buun-llama-cpp'
+            base-image: "ubuntu:24.04"
+            runs-on: 'ubuntu-24.04-arm'
+            ubuntu-version: '2404'
+            backend: "buun-llama-cpp"
+            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
+            context: "./"
          - build-type: 'cublas'
            cuda-major-version: "13"
            cuda-minor-version: "0"
@@ -1454,6 +1493,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
+          - build-type: 'hipblas'
+            cuda-major-version: ""
+            cuda-minor-version: ""
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: '-gpu-rocm-hipblas-buun-llama-cpp'
+            runs-on: 'ubuntu-latest'
+            base-image: "rocm/dev-ubuntu-24.04:7.2.1"
+            skip-drivers: 'false'
+            backend: "buun-llama-cpp"
+            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
+            context: "./"
+            ubuntu-version: '2404'
          - build-type: 'hipblas'
            cuda-major-version: ""
            cuda-minor-version: ""
@@ -1703,6 +1755,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
+          - build-type: 'sycl_f32'
+            cuda-major-version: ""
+            cuda-minor-version: ""
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: '-gpu-intel-sycl-f32-buun-llama-cpp'
+            runs-on: 'ubuntu-latest'
+            base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
+            skip-drivers: 'false'
+            backend: "buun-llama-cpp"
+            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
+            context: "./"
+            ubuntu-version: '2404'
          - build-type: 'sycl_f16'
            cuda-major-version: ""
            cuda-minor-version: ""
@@ -1729,6 +1794,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
+          - build-type: 'sycl_f16'
+            cuda-major-version: ""
+            cuda-minor-version: ""
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: '-gpu-intel-sycl-f16-buun-llama-cpp'
+            runs-on: 'ubuntu-latest'
+            base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
+            skip-drivers: 'false'
+            backend: "buun-llama-cpp"
+            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
+            context: "./"
+            ubuntu-version: '2404'
          - build-type: 'intel'
            cuda-major-version: ""
            cuda-minor-version: ""
@@ -2134,6 +2212,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
+          - build-type: ''
+            cuda-major-version: ""
+            cuda-minor-version: ""
+            platforms: 'linux/amd64,linux/arm64'
+            tag-latest: 'auto'
+            tag-suffix: '-cpu-buun-llama-cpp'
+            runs-on: 'bigger-runner'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            backend: "buun-llama-cpp"
+            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
+            context: "./"
+            ubuntu-version: '2404'
          - build-type: ''
            cuda-major-version: ""
            cuda-minor-version: ""
@@ -2173,6 +2264,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2204'
+          - build-type: 'cublas'
+            cuda-major-version: "12"
+            cuda-minor-version: "0"
+            platforms: 'linux/arm64'
+            skip-drivers: 'false'
+            tag-latest: 'auto'
+            tag-suffix: '-nvidia-l4t-arm64-buun-llama-cpp'
+            base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
+            runs-on: 'ubuntu-24.04-arm'
+            backend: "buun-llama-cpp"
+            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
+            context: "./"
+            ubuntu-version: '2204'
          - build-type: 'vulkan'
            cuda-major-version: ""
            cuda-minor-version: ""
@@ -2199,6 +2303,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
+          - build-type: 'vulkan'
+            cuda-major-version: ""
+            cuda-minor-version: ""
+            platforms: 'linux/amd64,linux/arm64'
+            tag-latest: 'auto'
+            tag-suffix: '-gpu-vulkan-buun-llama-cpp'
+            runs-on: 'bigger-runner'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            backend: "buun-llama-cpp"
+            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
+            context: "./"
+            ubuntu-version: '2404'
          # Stablediffusion-ggml
          - build-type: ''
            cuda-major-version: ""
@@ -2877,6 +2994,49 @@ jobs:
            dockerfile: "./backend/Dockerfile.python"
            context: "./"
            ubuntu-version: '2404'
+          # sherpa-onnx CPU
+          - build-type: ''
+            cuda-major-version: ""
+            cuda-minor-version: ""
+            platforms: 'linux/amd64,linux/arm64'
+            tag-latest: 'auto'
+            tag-suffix: '-cpu-sherpa-onnx'
+            runs-on: 'ubuntu-latest'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            backend: "sherpa-onnx"
+            dockerfile: "./backend/Dockerfile.golang"
+            context: "./"
+            ubuntu-version: '2404'
+          # sherpa-onnx CUDA 12
+          - build-type: 'cublas'
+            cuda-major-version: "12"
+            cuda-minor-version: "8"
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: '-gpu-nvidia-cuda-12-sherpa-onnx'
+            runs-on: 'ubuntu-latest'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            backend: "sherpa-onnx"
+            dockerfile: "./backend/Dockerfile.golang"
+            context: "./"
+            ubuntu-version: '2404'
+          # sherpa-onnx CUDA 13 — requires onnxruntime 1.24.x+ for the
+          # gpu_cuda13 tarball; sherpa-onnx SHERPA_COMMIT pins to v1.12.39.
+          - build-type: 'cublas'
+            cuda-major-version: "13"
+            cuda-minor-version: "0"
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: '-gpu-nvidia-cuda-13-sherpa-onnx'
+            runs-on: 'ubuntu-latest'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            backend: "sherpa-onnx"
+            dockerfile: "./backend/Dockerfile.golang"
+            context: "./"
+            ubuntu-version: '2404'
  backend-jobs-darwin:
    uses: ./.github/workflows/backend_build_darwin.yml
    strategy:
--- a/.github/workflows/test-extra.yml
+++ b/.github/workflows/test-extra.yml
@@ -32,6 +32,7 @@ jobs:
      llama-cpp: ${{ steps.detect.outputs.llama-cpp }}
      ik-llama-cpp: ${{ steps.detect.outputs.ik-llama-cpp }}
      turboquant: ${{ steps.detect.outputs.turboquant }}
+      buun-llama-cpp: ${{ steps.detect.outputs['buun-llama-cpp'] }}
      vllm: ${{ steps.detect.outputs.vllm }}
      sglang: ${{ steps.detect.outputs.sglang }}
      acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
@@ -40,6 +41,7 @@ jobs:
      kokoros: ${{ steps.detect.outputs.kokoros }}
      insightface: ${{ steps.detect.outputs.insightface }}
      speaker-recognition: ${{ steps.detect.outputs.speaker-recognition }}
+      sherpa-onnx: ${{ steps.detect.outputs.sherpa-onnx }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6
@@ -506,6 +508,72 @@ jobs:
      - name: Build llama-cpp backend image and run audio transcription gRPC e2e tests
        run: |
          make test-extra-backend-llama-cpp-transcription
+  # Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked LLM.
+  # Builds the sherpa-onnx Docker image, extracts the rootfs so the e2e suite
+  # can discover the backend binary + shared libs, downloads the three model
+  # bundles (silero-vad, omnilingual-asr, vits-ljs) and drives the realtime
+  # websocket spec end-to-end.
+  tests-sherpa-onnx-realtime:
+    needs: detect-changes
+    if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
+    runs-on: ubuntu-latest
+    timeout-minutes: 90
+    steps:
+      - name: Clone
+        uses: actions/checkout@v6
+        with:
+          submodules: true
+      - name: Setup Go
+        uses: actions/setup-go@v5
+        with:
+          go-version: '1.25.4'
+      - name: Setup Node.js
+        uses: actions/setup-node@v6
+        with:
+          node-version: '22'
+      - name: Build sherpa-onnx backend image and run realtime e2e tests
+        run: |
+          make test-extra-e2e-realtime-sherpa
+  # Streaming ASR via the sherpa-onnx online recognizer (zipformer
+  # transducer). Exercises both AudioTranscription (buffered) and
+  # AudioTranscriptionStream (real-time deltas) on the e2e-backends
+  # harness.
+  tests-sherpa-onnx-grpc-transcription:
+    needs: detect-changes
+    if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
+    runs-on: ubuntu-latest
+    timeout-minutes: 90
+    steps:
+      - name: Clone
+        uses: actions/checkout@v6
+        with:
+          submodules: true
+      - name: Setup Go
+        uses: actions/setup-go@v5
+        with:
+          go-version: '1.25.4'
+      - name: Build sherpa-onnx backend image and run streaming ASR gRPC e2e tests
+        run: |
+          make test-extra-backend-sherpa-onnx-transcription
+  # VITS TTS via the sherpa-onnx backend. Drives both TTS (file write) and
+  # TTSStream (PCM chunks) on the e2e-backends harness.
+  tests-sherpa-onnx-grpc-tts:
+    needs: detect-changes
+    if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
+    runs-on: ubuntu-latest
+    timeout-minutes: 90
+    steps:
+      - name: Clone
+        uses: actions/checkout@v6
+        with:
+          submodules: true
+      - name: Setup Go
+        uses: actions/setup-go@v5
+        with:
+          go-version: '1.25.4'
+      - name: Build sherpa-onnx backend image and run TTS gRPC e2e tests
+        run: |
+          make test-extra-backend-sherpa-onnx-tts
  tests-ik-llama-cpp-grpc:
    needs: detect-changes
    if: needs.detect-changes.outputs.ik-llama-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
@@ -546,6 +614,30 @@ jobs:
      - name: Build turboquant backend image and run gRPC e2e tests
        run: |
          make test-extra-backend-turboquant
+  tests-buun-llama-cpp-grpc:
+    needs: detect-changes
+    if: needs.detect-changes.outputs['buun-llama-cpp'] == 'true' || needs.detect-changes.outputs.run-all == 'true'
+    runs-on: ubuntu-latest
+    timeout-minutes: 90
+    steps:
+      - name: Clone
+        uses: actions/checkout@v6
+        with:
+          submodules: true
+      - name: Setup Go
+        uses: actions/setup-go@v5
+        with:
+          go-version: '1.25.4'
+      # Exercises the buun-llama-cpp (fork-of-a-fork) backend with the
+      # fork-specific TurboQuant/TCQ KV-cache types. BACKEND_TEST_CACHE_TYPE_V
+      # is set to turbo3 so the test round-trips through the fork's KV
+      # allow-list — picking a stock llama.cpp type would only re-test the
+      # shared code path. DFlash speculative decoding is not exercised here
+      # because the one known public target/drafter pair (Qwen3.5-27B) is too
+      # large for CI.
+      - name: Build buun-llama-cpp backend image and run gRPC e2e tests
+        run: |
+          make test-extra-backend-buun-llama-cpp
  # tests-vllm-grpc is currently disabled in CI.
  #
  # The prebuilt vllm CPU wheel is compiled with AVX-512 VNNI/BF16
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -195,7 +195,7 @@ jobs:
        run: go version
      - name: Dependencies
        run: |
-          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm opus
+          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm opus ffmpeg
          pip install --user --no-cache-dir grpcio-tools grpcio
      - name: Setup Node.js
        uses: actions/setup-node@v6
--- a/Makefile
+++ b/Makefile
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/buun-llama-cpp backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad backends/sherpa-onnx

 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -394,7 +394,13 @@ protoc:
 .PHONY: protogen-go
 protogen-go: protoc install-go-tools
 	mkdir -p pkg/grpc/proto
-	./protoc --experimental_allow_proto3_optional -Ibackend/ --go_out=pkg/grpc/proto/ --go_opt=paths=source_relative --go-grpc_out=pkg/grpc/proto/ --go-grpc_opt=paths=source_relative \
+	# install-go-tools writes protoc-gen-go and protoc-gen-go-grpc into
+	# $(shell go env GOPATH)/bin, which isn't on every dev's PATH. protoc
+	# resolves its code-gen plugins via PATH, so without this prefix the
+	# generate step fails with "protoc-gen-go: program not found". Prepend
+	# GOPATH/bin so the freshly-installed plugins win without requiring a
+	# shell-profile change.
+	PATH="$$(go env GOPATH)/bin:$$PATH" ./protoc --experimental_allow_proto3_optional -Ibackend/ --go_out=pkg/grpc/proto/ --go_opt=paths=source_relative --go-grpc_out=pkg/grpc/proto/ --go-grpc_opt=paths=source_relative \
    backend/backend.proto

 core/config/inference_defaults.json: ## Fetch inference defaults from unsloth (only if missing)
@@ -539,6 +545,19 @@ test-extra-backend-turboquant: docker-build-turboquant
 	BACKEND_TEST_CACHE_TYPE_V=turbo3 \
 	$(MAKE) test-extra-backend

+## buun-llama-cpp: exercises the fork-of-a-fork backend (spiritbuun/buun-llama-cpp)
+## with the *TurboQuant/TCQ-specific* KV-cache types (turbo3 for V). Same rationale
+## as turboquant above: picking a standard llama.cpp type would only re-test the
+## shared code path. buun inherits turboquant's turbo2/turbo3/turbo4 and adds
+## turbo2_tcq / turbo3_tcq on top. DFlash speculative decoding is not exercised
+## here because no small DFlash drafter model exists (the known public pair is
+## Qwen3.5-27B, ~54 GB).
+test-extra-backend-buun-llama-cpp: docker-build-buun-llama-cpp
+	BACKEND_IMAGE=local-ai-backend:buun-llama-cpp \
+	BACKEND_TEST_CACHE_TYPE_K=q8_0 \
+	BACKEND_TEST_CACHE_TYPE_V=turbo3 \
+	$(MAKE) test-extra-backend
+
 ## Audio transcription wrapper for the llama-cpp backend.
 ## Drives the new AudioTranscription / AudioTranscriptionStream RPCs against
 ## ggml-org/Qwen3-ASR-0.6B-GGUF (a small ASR model that requires its mmproj
@@ -780,6 +799,44 @@ test-extra-backend-speaker-recognition-ecapa: docker-build-speaker-recognition
 test-extra-backend-speaker-recognition-all: \
 	test-extra-backend-speaker-recognition-ecapa

+## Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked
+## LLM. Extracts the sherpa-onnx Docker image rootfs, downloads the three
+## gallery-referenced model bundles (silero-vad, omnilingual-asr, vits-ljs),
+## writes the corresponding model config YAMLs, and runs the realtime
+## websocket spec in tests/e2e with REALTIME_* env vars wiring the sherpa
+## slots into the pipeline. The LLM slot stays on the in-repo mock-backend
+## registered unconditionally by tests/e2e/e2e_suite_test.go. See
+## tests/e2e/run-realtime-sherpa.sh for the full orchestration.
+test-extra-e2e-realtime-sherpa: build-mock-backend docker-build-sherpa-onnx protogen-go react-ui
+	bash tests/e2e/run-realtime-sherpa.sh
+
+## Streaming ASR via the sherpa-onnx online recognizer. Uses the streaming
+## zipformer English model (encoder/decoder/joiner int8 + tokens) from the
+## sherpa-onnx gallery entry. Drives both AudioTranscription and
+## AudioTranscriptionStream via the e2e-backends gRPC harness; streaming
+## emits real partial deltas during decode. Each file is renamed on download
+## to the shape sherpa-onnx's online loader expects (encoder.int8.onnx etc.).
+test-extra-backend-sherpa-onnx-transcription: docker-build-sherpa-onnx
+	BACKEND_IMAGE=local-ai-backend:sherpa-onnx \
+	BACKEND_TEST_MODEL_URL='https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/encoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx#encoder.int8.onnx' \
+	BACKEND_TEST_EXTRA_FILES='https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/decoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx#decoder.int8.onnx|https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/joiner-epoch-99-avg-1-chunk-16-left-128.int8.onnx#joiner.int8.onnx|https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/tokens.txt' \
+	BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
+	BACKEND_TEST_CAPS=health,load,transcription \
+	BACKEND_TEST_OPTIONS=subtype=online \
+	$(MAKE) test-extra-backend
+
+## VITS TTS via the sherpa-onnx backend. Pulls the individual files from
+## HuggingFace (the vits-ljs release tarball lives on the k2-fsa github
+## but is also mirrored as discrete files on HF). Exercises both
+## TTS (write-to-file) and TTSStream (PCM chunks + WAV header) via the
+## e2e-backends gRPC harness.
+test-extra-backend-sherpa-onnx-tts: docker-build-sherpa-onnx
+	BACKEND_IMAGE=local-ai-backend:sherpa-onnx \
+	BACKEND_TEST_MODEL_URL='https://huggingface.co/csukuangfj/vits-ljs/resolve/main/vits-ljs.onnx#vits-ljs.onnx' \
+	BACKEND_TEST_EXTRA_FILES='https://huggingface.co/csukuangfj/vits-ljs/resolve/main/tokens.txt|https://huggingface.co/csukuangfj/vits-ljs/resolve/main/lexicon.txt' \
+	BACKEND_TEST_CAPS=health,load,tts \
+	$(MAKE) test-extra-backend
+
 ## sglang mirrors the vllm setup: HuggingFace model id, same tiny Qwen,
 ## tool-call extraction via sglang's native qwen parser. CPU builds use
 ## sglang's upstream pyproject_cpu.toml recipe (see backend/python/sglang/install.sh).
@@ -905,6 +962,11 @@ BACKEND_IK_LLAMA_CPP = ik-llama-cpp|ik-llama-cpp|.|false|false
 # turboquant is a llama.cpp fork with TurboQuant KV-cache quantization.
 # Reuses backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile.
 BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
+# buun-llama-cpp is a fork-of-a-fork (spiritbuun/buun-llama-cpp forks
+# TheTom/llama-cpp-turboquant) that adds DFlash block-diffusion speculative
+# decoding and extra TCQ KV-cache variants on top of TurboQuant. Same thin
+# wrapper pattern as turboquant — reuses backend/cpp/llama-cpp grpc-server.
+BACKEND_BUUN_LLAMA_CPP = buun-llama-cpp|buun-llama-cpp|.|false|false

 # Golang backends
 BACKEND_PIPER = piper|golang|.|false|true
@@ -917,6 +979,7 @@ BACKEND_VOXTRAL = voxtral|golang|.|false|true
 BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
 BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
 BACKEND_OPUS = opus|golang|.|false|true
+BACKEND_SHERPA_ONNX = sherpa-onnx|golang|.|false|true

 # Python backends with root context
 BACKEND_RERANKERS = rerankers|python|.|false|true
@@ -984,6 +1047,7 @@ endef
 $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
+$(eval $(call generate-docker-build-target,$(BACKEND_BUUN_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
@@ -1029,12 +1093,13 @@ $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_QUANTIZATION)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TINYGRAD)))
 $(eval $(call generate-docker-build-target,$(BACKEND_KOKOROS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
+$(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))

 # Pattern rule for docker-save targets
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar

-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface docker-build-speaker-recognition
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-buun-llama-cpp docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx

 ########################################################
 ### Mock Backend for E2E Tests
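The `protogen-go` change in the Makefile diff above prefixes a single command with `PATH=...` so that freshly installed plugins are found without editing a shell profile. The sketch below illustrates that scoping with a hypothetical stand-in plugin (`protoc-gen-demo`) in a temporary directory playing the role of `$(go env GOPATH)/bin`; neither `go` nor `protoc` is needed to run it. Inside a Makefile recipe the same assignment is written with `$$` so that the shell, not make, expands it.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-in for $(go env GOPATH)/bin holding a fake plugin.
plugin_dir=$(mktemp -d)
printf '#!/bin/sh\necho demo-plugin\n' > "$plugin_dir/protoc-gen-demo"
chmod +x "$plugin_dir/protoc-gen-demo"

# Before: the plugin directory is not on PATH, so lookup fails.
if command -v protoc-gen-demo >/dev/null 2>&1; then before=found; else before=missing; fi

# During: the PATH prefix applies only to this one child command.
during=$(PATH="$plugin_dir:$PATH" sh -c 'command -v protoc-gen-demo')

# After: the calling shell's PATH is unchanged.
if command -v protoc-gen-demo >/dev/null 2>&1; then after=found; else after=missing; fi

echo "before=$before during=$during after=$after"
```

The prefix form keeps the change local to the one command that needs it, which is why no shell-profile edit is required.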
--- a/backend/Dockerfile.buun-llama-cpp
+++ b/backend/Dockerfile.buun-llama-cpp
@@ -0,0 +1,290 @@
+ARG BASE_IMAGE=ubuntu:24.04
+ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
+
+
+# The grpc target does one thing: it builds and installs GRPC.  This is in its own layer so that it can be effectively cached by CI.
+# You probably don't need to change anything here, and if you do, make sure that CI is adjusted so that the cache continues to work.
+FROM ${GRPC_BASE_IMAGE} AS grpc
+
+# This is a bit of a hack, but it's required in order to be able to effectively cache this layer in CI
+ARG GRPC_MAKEFLAGS="-j4 -Otarget"
+ARG GRPC_VERSION=v1.65.0
+ARG CMAKE_FROM_SOURCE=false
+# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
+ARG CMAKE_VERSION=3.31.10
+
+ENV MAKEFLAGS=${GRPC_MAKEFLAGS}
+
+WORKDIR /build
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        ca-certificates \
+        build-essential curl libssl-dev \
+        git wget && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+
+# Install CMake (the version in 22.04 is too old)
+RUN <<EOT bash
+    if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
+        curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
+    else
+        apt-get update && \
+        apt-get install -y \
+            cmake && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+# We install GRPC to a different prefix here so that we can copy in only the build artifacts later
+# saves several hundred MB on the final docker image size vs copying in the entire GRPC source tree
+# and running make install in the target container
+RUN git clone --recurse-submodules --jobs 4 -b ${GRPC_VERSION} --depth 1 --shallow-submodules https://github.com/grpc/grpc && \
+    mkdir -p /build/grpc/cmake/build && \
+    cd /build/grpc/cmake/build && \
+    sed -i "216i\  TESTONLY" "../../third_party/abseil-cpp/absl/container/CMakeLists.txt" && \
+    cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX:PATH=/opt/grpc ../.. && \
+    make && \
+    make install && \
+    rm -rf /build
+
+FROM ${BASE_IMAGE} AS builder
+ARG CMAKE_FROM_SOURCE=false
+ARG CMAKE_VERSION=3.31.10
+# We can target specific CUDA ARCHITECTURES like --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
+ARG CUDA_DOCKER_ARCH
+ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
+ARG CMAKE_ARGS
+ENV CMAKE_ARGS=${CMAKE_ARGS}
+ARG BACKEND=rerankers
+ARG BUILD_TYPE
+ENV BUILD_TYPE=${BUILD_TYPE}
+ARG CUDA_MAJOR_VERSION
+ARG CUDA_MINOR_VERSION
+ARG SKIP_DRIVERS=false
+ENV CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION}
+ENV CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION}
+ENV DEBIAN_FRONTEND=noninteractive
+ARG TARGETARCH
+ARG TARGETVARIANT
+ARG GO_VERSION=1.25.4
+ARG UBUNTU_VERSION=2404
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        build-essential \
+        ccache git \
+        ca-certificates \
+        make \
+        pkg-config libcurl4-openssl-dev \
+        curl unzip \
+        libssl-dev wget && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+
+# Cuda
+ENV PATH=/usr/local/cuda/bin:${PATH}
+
+# HipBLAS requirements
+ENV PATH=/opt/rocm/bin:${PATH}
+
+
+# Vulkan requirements
+RUN <<EOT bash
+    if [ "${BUILD_TYPE}" = "vulkan" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
+        apt-get update && \
+        apt-get install -y  --no-install-recommends \
+            software-properties-common pciutils wget gpg-agent && \
+        apt-get install -y libglm-dev cmake libxcb-dri3-0 libxcb-present0 libpciaccess0 \
+            libpng-dev libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev g++ gcc \
+            libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
+            git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
+            ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
+            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
+        if [ "amd64" = "$TARGETARCH" ]; then
+            wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
+            tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
+            rm vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
+            mkdir -p /opt/vulkan-sdk && \
+            mv 1.4.335.0 /opt/vulkan-sdk/ && \
+            cd /opt/vulkan-sdk/1.4.335.0 && \
+            ./vulkansdk --no-deps --maxjobs \
+                vulkan-loader \
+                vulkan-validationlayers \
+                vulkan-extensionlayer \
+                vulkan-tools \
+                shaderc && \
+            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/bin/* /usr/bin/ && \
+            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/lib/* /usr/lib/x86_64-linux-gnu/ && \
+            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/include/* /usr/include/ && \
+            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/share/* /usr/share/ && \
+            rm -rf /opt/vulkan-sdk
+        fi
+        if [ "arm64" = "$TARGETARCH" ]; then
+            mkdir vulkan && cd vulkan && \
+            curl -L -o vulkan-sdk.tar.xz https://github.com/mudler/vulkan-sdk-arm/releases/download/1.4.335.0/vulkansdk-ubuntu-24.04-arm-1.4.335.0.tar.xz && \
+            tar -xvf vulkan-sdk.tar.xz && \
+            rm vulkan-sdk.tar.xz && \
+            cd 1.4.335.0 && \
+            cp -rfv aarch64/bin/* /usr/bin/ && \
+            cp -rfv aarch64/lib/* /usr/lib/aarch64-linux-gnu/ && \
+            cp -rfv aarch64/include/* /usr/include/ && \
+            cp -rfv aarch64/share/* /usr/share/ && \
+            cd ../.. && \
+            rm -rf vulkan
+        fi
+        ldconfig && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+# CuBLAS requirements
+RUN <<EOT bash
+    if ( [ "${BUILD_TYPE}" = "cublas" ] || [ "${BUILD_TYPE}" = "l4t" ] ) && [ "${SKIP_DRIVERS}" = "false" ]; then
+        apt-get update && \
+        apt-get install -y  --no-install-recommends \
+            software-properties-common pciutils
+        if [ "amd64" = "$TARGETARCH" ]; then
+            curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/x86_64/cuda-keyring_1.1-1_all.deb
+        fi
+        if [ "arm64" = "$TARGETARCH" ]; then
+            if [ "${CUDA_MAJOR_VERSION}" = "13" ]; then
+                curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/sbsa/cuda-keyring_1.1-1_all.deb
+            else
+                curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/arm64/cuda-keyring_1.1-1_all.deb
+            fi
+        fi
+        dpkg -i cuda-keyring_1.1-1_all.deb && \
+        rm -f cuda-keyring_1.1-1_all.deb && \
+        apt-get update && \
+        apt-get install -y --no-install-recommends \
+            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
+        if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "$TARGETARCH" ]; then
+            apt-get install -y --no-install-recommends \
+            libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libcudnn9-cuda-${CUDA_MAJOR_VERSION} cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
+        fi
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+
+# https://github.com/NVIDIA/Isaac-GR00T/issues/343
+RUN <<EOT bash
+    if [ "${BUILD_TYPE}" = "cublas" ] && [ "${TARGETARCH}" = "arm64" ]; then
+        wget https://developer.download.nvidia.com/compute/cudss/0.6.0/local_installers/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
+        dpkg -i cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
+        cp /var/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0/cudss-*-keyring.gpg /usr/share/keyrings/ && \
+        apt-get update && apt-get -y install cudss cudss-cuda-${CUDA_MAJOR_VERSION} && \
+        wget https://developer.download.nvidia.com/compute/nvpl/25.5/local_installers/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
+        dpkg -i nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
+        cp /var/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5/nvpl-*-keyring.gpg /usr/share/keyrings/ && \
+        apt-get update && apt-get install -y nvpl
+    fi
+EOT
+
+# If we are building with clblas support, we need the libraries for the builds
+RUN if [ "${BUILD_TYPE}" = "clblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
+        apt-get update && \
+        apt-get install -y --no-install-recommends \
+            libclblast-dev && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/* \
+    ; fi
+
+RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
+        apt-get update && \
+        apt-get install -y --no-install-recommends \
+            hipblas-dev \
+            rocblas-dev && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/* && \
+        # I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install, which results in local-ai and others not being able
+        # to locate the libraries. We run ldconfig ourselves to work around this packaging deficiency
+        ldconfig && \
+        # Log which GPU architectures have rocBLAS kernel support
+        echo "rocBLAS library data architectures:" && \
+        (ls /opt/rocm*/lib/rocblas/library/Kernels* 2>/dev/null || ls /opt/rocm*/lib64/rocblas/library/Kernels* 2>/dev/null) | grep -oP 'gfx[0-9a-z+-]+' | sort -u || \
+        echo "WARNING: No rocBLAS kernel data found" \
+    ; fi
+
+RUN echo "TARGETARCH: $TARGETARCH"
+
+# We need protoc installed, and the version in 22.04 is too old.  We will create one as part of installing the GRPC build below
+# but that will also bring in a newer version of absl which stablediffusion cannot compile with.  This version of protoc is only
+# here so that we can generate the grpc code for the stablediffusion build
+RUN <<EOT bash
+    if [ "amd64" = "$TARGETARCH" ]; then
+        curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-x86_64.zip -o protoc.zip && \
+        unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
+        rm protoc.zip
+    fi
+    if [ "arm64" = "$TARGETARCH" ]; then
+        curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-aarch_64.zip -o protoc.zip && \
+        unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
+        rm protoc.zip
+    fi
+EOT
+
+# Install CMake (the version in 22.04 is too old)
+RUN <<EOT bash
+    if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
+        curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
+    else
+        apt-get update && \
+        apt-get install -y \
+            cmake && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+COPY --from=grpc /opt/grpc /usr/local
+
+
+COPY . /LocalAI
+
+RUN <<'EOT' bash
+set -euxo pipefail
+
+if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
+  CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
+  export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
+  echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
+  rm -rf /LocalAI/backend/cpp/buun-llama-cpp-*-build
+fi
+
+cd /LocalAI/backend/cpp/buun-llama-cpp
+
+if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
+  make buun-llama-cpp-fallback
+  make buun-llama-cpp-grpc
+  make buun-llama-cpp-rpc-server
+else
+  make buun-llama-cpp-avx
+  make buun-llama-cpp-avx2
+  make buun-llama-cpp-avx512
+  make buun-llama-cpp-fallback
+  make buun-llama-cpp-grpc
+  make buun-llama-cpp-rpc-server
+fi
+EOT
+
+
+# Copy libraries using a script to handle architecture differences
+RUN make -BC /LocalAI/backend/cpp/buun-llama-cpp package
+
+
+FROM scratch
+
+
+# Copy all available binaries (the build process only creates the appropriate ones for the target architecture)
+COPY --from=builder /LocalAI/backend/cpp/buun-llama-cpp/package/. ./
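Reviewer note on the `CUDA_DOCKER_ARCH` handling in the build step above: the `${CUDA_DOCKER_ARCH//;/\\;}` expansion rewrites every `;` in the architecture list as `\;`, presumably so the list is not split when `CMAKE_ARGS` is later re-expanded by make and the shell. The sketch below isolates that expansion in a small helper so its effect can be inspected; `escape_cuda_archs` is a hypothetical name used only for illustration and does not exist in the Dockerfile.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the Dockerfile): wraps the same
# parameter expansion the build step uses, replacing each ';' with '\;'.
escape_cuda_archs() {
  local archs="$1"
  printf '%s\n' "${archs//;/\\;}"
}

# Same example list as the comment next to ARG CUDA_DOCKER_ARCH.
escaped="$(escape_cuda_archs '75;86;89;120')"
printf '%s\n' "-DCMAKE_CUDA_ARCHITECTURES=${escaped}"
```

Running this prints the flag with each separator escaped, which is the form the `echo "CMAKE_ARGS(env) = ..."` line in the build step would log.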
--- a/backend/cpp/buun-llama-cpp/Makefile
+++ b/backend/cpp/buun-llama-cpp/Makefile
@@ -0,0 +1,85 @@
+
+# Pinned to the HEAD of master on https://github.com/spiritbuun/buun-llama-cpp.
+# Auto-bumped nightly by .github/workflows/bump_deps.yaml.
+BUUN_LLAMA_VERSION?=22464d0848b87c5d56b52fdf6af2e5da46bf803e
+LLAMA_REPO?=https://github.com/spiritbuun/buun-llama-cpp
+
+CMAKE_ARGS?=
+BUILD_TYPE?=
+NATIVE?=false
+ONEAPI_VARS?=/opt/intel/oneapi/setvars.sh
+TARGET?=--target grpc-server
+JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)
+ARCH?=$(shell uname -m)
+
+CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
+LLAMA_CPP_DIR := $(CURRENT_MAKEFILE_DIR)/../llama-cpp
+
+GREEN := \033[0;32m
+RESET := \033[0m
+
+# buun-llama-cpp is a llama.cpp fork-of-a-fork (spiritbuun/buun-llama-cpp forked
+# TheTom/llama-cpp-turboquant, which itself forked ggml-org/llama.cpp). Rather
+# than duplicating grpc-server.cpp / CMakeLists.txt / prepare.sh we reuse the
+# ones in backend/cpp/llama-cpp, and only swap which repo+sha the fetch step
+# pulls. Each flavor target copies ../llama-cpp into a sibling
+# ../buun-llama-cpp-<flavor>-build directory, then invokes llama-cpp's own
+# build-llama-cpp-grpc-server with LLAMA_REPO/LLAMA_VERSION overridden to point
+# at the fork.
+PATCHES_DIR := $(CURRENT_MAKEFILE_DIR)/patches
+
+# Each flavor target:
+#   1. copies backend/cpp/llama-cpp/ (grpc-server.cpp + prepare.sh + CMakeLists.txt + Makefile)
+#      into a sibling buun-llama-cpp-<flavor>-build directory;
+#   2. clones the buun fork into buun-llama-cpp-<flavor>-build/llama.cpp via the
+#      copy's own `llama.cpp` target, overriding LLAMA_REPO/LLAMA_VERSION;
+#   3. applies patches from backend/cpp/buun-llama-cpp/patches/ to the cloned
+#      fork sources (for backporting upstream commits the fork hasn't pulled);
+#   4. runs the copy's `grpc-server` target, which produces the binary we copy
+#      up as buun-llama-cpp-<flavor>.
+define buun-llama-cpp-build
+	rm -rf $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build
+	cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build
+	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build purge
+	# Augment the copied grpc-server.cpp's KV-cache allow-list with the
+	# fork's turbo2/turbo3/turbo4/turbo2_tcq/turbo3_tcq types and wire up the
+	# DFlash-specific option handlers (tree_budget / draft_topk). We patch the
+	# *copy*, never the original under backend/cpp/llama-cpp/, so the stock
+	# llama-cpp build stays compiling against vanilla upstream.
+	bash $(CURRENT_MAKEFILE_DIR)/patch-grpc-server.sh $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build/grpc-server.cpp
+	$(info $(GREEN)I buun-llama-cpp build info:$(1)$(RESET))
+	LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(BUUN_LLAMA_VERSION) \
+	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build llama.cpp
+	bash $(CURRENT_MAKEFILE_DIR)/apply-patches.sh $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build/llama.cpp $(PATCHES_DIR)
+	CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" \
+	LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(BUUN_LLAMA_VERSION) \
+	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build grpc-server
+	cp -rfv $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build/grpc-server buun-llama-cpp-$(1)
+endef
+
+buun-llama-cpp-avx2:
+	$(call buun-llama-cpp-build,avx2,-DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server)
+
+buun-llama-cpp-avx512:
+	$(call buun-llama-cpp-build,avx512,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server)
+
+buun-llama-cpp-avx:
+	$(call buun-llama-cpp-build,avx,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
+
+buun-llama-cpp-fallback:
+	$(call buun-llama-cpp-build,fallback,-DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
+
+buun-llama-cpp-grpc:
+	$(call buun-llama-cpp-build,grpc,-DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server --target rpc-server)
+
+buun-llama-cpp-rpc-server: buun-llama-cpp-grpc
+	cp -rf $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-grpc-build/llama.cpp/build/bin/rpc-server buun-llama-cpp-rpc-server
+
+package:
+	bash package.sh
+
+purge:
+	rm -rf $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-*-build
+	rm -rf buun-llama-cpp-* package
+
+clean: purge
--- a/backend/cpp/buun-llama-cpp/apply-patches.sh
+++ b/backend/cpp/buun-llama-cpp/apply-patches.sh
@@ -0,0 +1,50 @@
+#!/bin/bash
+# Apply the buun-llama-cpp patch series to a cloned buun-llama-cpp checkout.
+#
+# buun-llama-cpp is a fork-of-a-fork that branched off upstream llama.cpp
+# before some API changes the shared backend/cpp/llama-cpp/grpc-server.cpp
+# depends on. We carry those upstream commits as patch files under
+# backend/cpp/buun-llama-cpp/patches/ and apply them here so the reused
+# grpc-server source compiles against the fork unmodified.
+#
+# Drop the corresponding patch from patches/ whenever the fork catches up with
+# upstream — the build will fail fast if a patch stops applying, which is the
+# signal to retire it.
+
+set -euo pipefail
+
+if [[ $# -ne 2 ]]; then
+    echo "usage: $0 <llama.cpp-src-dir> <patches-dir>" >&2
+    exit 2
+fi
+
+SRC_DIR=$1
+PATCHES_DIR=$2
+
+if [[ ! -d "$SRC_DIR" ]]; then
+    echo "source dir does not exist: $SRC_DIR" >&2
+    exit 2
+fi
+
+if [[ ! -d "$PATCHES_DIR" ]]; then
+    echo "no patches dir at $PATCHES_DIR, nothing to apply"
+    exit 0
+fi
+
+shopt -s nullglob
+patches=("$PATCHES_DIR"/*.patch)
+shopt -u nullglob
+
+if [[ ${#patches[@]} -eq 0 ]]; then
+    echo "no .patch files in $PATCHES_DIR, nothing to apply"
+    exit 0
+fi
+
+cd "$SRC_DIR"
+
+for patch in "${patches[@]}"; do
+    echo "==> applying $patch"
+    git apply --verbose "$patch"
+done
+
+echo "all buun-llama-cpp patches applied successfully"
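Reviewer note on apply-patches.sh: the script relies on `git apply` failing fast, which works because the Makefile recreates the build directory on every run. If the script were ever re-run against an already patched checkout, one possible guard is a reverse dry run with `git apply --reverse --check`. The sketch below shows that idea in a throwaway sandbox; `apply_once`, the demo file, and the demo patch are all hypothetical and not part of this change.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway sandbox standing in for the cloned fork and the patches/ directory.
work="$(mktemp -d)"
git init -q "$work/src"
mkdir "$work/patches"
printf 'alpha\nbeta\n' > "$work/src/file.txt"
cat > "$work/patches/0001-demo.patch" <<'EOF'
--- a/file.txt
+++ b/file.txt
@@ -1,2 +1,3 @@
 alpha
+inserted
 beta
EOF

# apply_once SRC_DIR PATCH: hypothetical helper, not part of apply-patches.sh.
# If the patch reverses cleanly it is already applied, so skip it;
# otherwise apply it and let git fail loudly if it no longer fits.
apply_once() {
  local src_dir="$1" patch="$2"
  if git -C "$src_dir" apply --reverse --check "$patch" 2>/dev/null; then
    echo "==> already applied, skipping $patch"
  else
    echo "==> applying $patch"
    git -C "$src_dir" apply "$patch"
  fi
}

apply_once "$work/src" "$work/patches/0001-demo.patch"
apply_once "$work/src" "$work/patches/0001-demo.patch"   # second run skips
```

The trade-off is that a skipped patch no longer signals "retire me", so the fail-fast behavior in the script as submitted may well be the better default for CI.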
--- a/backend/cpp/buun-llama-cpp/package.sh
+++ b/backend/cpp/buun-llama-cpp/package.sh
@@ -0,0 +1,57 @@
+#!/bin/bash
+
+# Script to copy the appropriate libraries based on architecture
+# This script is used in the final stage of the Dockerfile
+
+set -e
+
+CURDIR=$(dirname "$(realpath "$0")")
+REPO_ROOT="${CURDIR}/../../.."
+
+# Create lib directory
+mkdir -p $CURDIR/package/lib
+
+cp -avrf $CURDIR/buun-llama-cpp-* $CURDIR/package/
+cp -rfv $CURDIR/run.sh $CURDIR/package/
+
+# Detect architecture and copy appropriate libraries
+if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
+    # x86_64 architecture
+    echo "Detected x86_64 architecture, copying x86_64 libraries..."
+    cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
+    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
+    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
+elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
+    # ARM64 architecture
+    echo "Detected ARM64 architecture, copying ARM64 libraries..."
+    cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
+    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
+    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
+else
+    echo "Error: Could not detect architecture"
+    exit 1
+fi
+
+# Package GPU libraries based on BUILD_TYPE
+GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
+if [ -f "$GPU_LIB_SCRIPT" ]; then
+    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
+    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
+    package_gpu_libs
+fi
+
+echo "Packaging completed successfully"
+ls -liah $CURDIR/package/
+ls -liah $CURDIR/package/lib/
--- a/backend/cpp/buun-llama-cpp/patch-grpc-server.sh
+++ b/backend/cpp/buun-llama-cpp/patch-grpc-server.sh
@@ -0,0 +1,162 @@
+#!/bin/bash
+# Patch the shared backend/cpp/llama-cpp/grpc-server.cpp *copy* used by the
+# buun-llama-cpp build to account for gaps between upstream and the fork, including:
+#
+#   1. Augment the kv_cache_types[] allow-list so `LoadModel` accepts the
+#      fork-specific `turbo2` / `turbo3` / `turbo4` cache types plus the buun
+#      additions `turbo2_tcq` / `turbo3_tcq`.
+#
+#   2. Wire up buun-exclusive speculative-decoding option handlers
+#      (tree_budget / draft_topk) alongside the existing spec_* handlers.
+#      These reference struct fields (common_params.speculative.tree_budget
+#      and .draft_topk) that only exist in buun's common/common.h — adding
+#      them to the shared backend/cpp/llama-cpp/grpc-server.cpp would break
+#      the stock llama-cpp build, so we inject them only into the buun copy.
+#
+#   3. Replace `get_media_marker()` (added upstream in ggml-org/llama.cpp#21962,
+#      server-side random per-instance marker) with the legacy "<__media__>"
+#      literal. The fork branched before that PR, so server-common.cpp has no
+#      get_media_marker symbol. The fork's mtmd_default_marker() still returns
+#      "<__media__>", and Go-side tooling falls back to that sentinel when the
+#      backend does not expose media_marker, so substituting the literal keeps
+#      behavior identical on the buun path.
+#
+# We patch the *copy* sitting in buun-llama-cpp-<flavor>-build/, never the
+# original under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps
+# compiling against vanilla upstream.
+#
+# Idempotent: skips each insertion if its marker is already present (so re-runs
+# of the same build dir don't double-insert).
+
+set -euo pipefail
+
+if [[ $# -ne 1 ]]; then
+    echo "usage: $0 <grpc-server.cpp>" >&2
+    exit 2
+fi
+
+SRC=$1
+
+if [[ ! -f "$SRC" ]]; then
+    echo "grpc-server.cpp not found at $SRC" >&2
+    exit 2
+fi
+
+if grep -q 'GGML_TYPE_TURBO2_TCQ' "$SRC"; then
+    echo "==> $SRC already has buun cache types, skipping KV allow-list patch"
+else
+    echo "==> patching $SRC to allow turbo2/turbo3/turbo4/turbo2_tcq/turbo3_tcq KV-cache types"
+
+    # Insert the five TURBO entries right after the first `    GGML_TYPE_Q5_1,`
+    # line (the kv_cache_types[] allow-list). Using awk because the builder
+    # image does not ship python3, and GNU sed's multi-line `a\` quoting is
+    # awkward.
+    awk '
+        /^    GGML_TYPE_Q5_1,$/ && !done {
+            print
+            print "    // buun-llama-cpp fork extras — added by patch-grpc-server.sh"
+            print "    GGML_TYPE_TURBO2_0,"
+            print "    GGML_TYPE_TURBO3_0,"
+            print "    GGML_TYPE_TURBO4_0,"
+            print "    GGML_TYPE_TURBO2_TCQ,"
+            print "    GGML_TYPE_TURBO3_TCQ,"
+            done = 1
+            next
+        }
+        { print }
+        END {
+            if (!done) {
+                print "patch-grpc-server.sh: anchor `    GGML_TYPE_Q5_1,` not found" > "/dev/stderr"
+                exit 1
+            }
+        }
+    ' "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+
+    echo "==> KV allow-list patch OK"
+fi
+
+if grep -q 'optname, "tree_budget"' "$SRC"; then
+    echo "==> $SRC already has DFlash option handlers, skipping"
+else
+    echo "==> patching $SRC to add tree_budget / draft_topk option handlers"
+
+    # Insert two new `else if` handlers between the inner close-brace of the
+    # `spec_p_split` block and the next `} else if (…spec_ngram_size_n…)` line.
+    # Upstream writes each `} else if` as a single physical line, so we don't
+    # emit an outer `}` ourselves — the existing next line provides both the
+    # close of our `draft_topk` block and the open of `spec_ngram_size_n`.
+    # Anchor on the exact 3-line body of spec_p_split so we can't drift.
+    awk '
+        prev2 == "        } else if (!strcmp(optname, \"spec_p_split\")) {" &&
+        prev1 ~ /^ +if \(optval != NULL\) \{$/ &&
+        $0    ~ /^ +try \{ params\.speculative\.p_split = std::stof\(optval_str\); \} catch \(\.\.\.\) \{\}$/ &&
+        !done {
+            print                        # print the try-line itself
+            getline inner_close          # read "            }" closing the inner if
+            print inner_close            # print it — this closes spec_p_split body
+            print "        // buun-llama-cpp DFlash options — added by patch-grpc-server.sh"
+            print "        } else if (!strcmp(optname, \"tree_budget\")) {"
+            print "            if (optval != NULL) {"
+            print "                try { params.speculative.tree_budget = std::stoi(optval_str); } catch (...) {}"
+            print "            }"
+            print "        } else if (!strcmp(optname, \"draft_topk\")) {"
+            print "            if (optval != NULL) {"
+            print "                try { params.speculative.draft_topk = std::stoi(optval_str); } catch (...) {}"
+            print "            }"
+            # The next source line (`} else if (…spec_ngram_size_n…) {`) closes
+            # our draft_topk block and continues the chain naturally; fall back
+            # into the main loop to emit it and everything after.
+            done = 1
+            prev2 = prev1
+            prev1 = inner_close
+            next
+        }
+        { print; prev2 = prev1; prev1 = $0 }
+        END {
+            if (!done) {
+                print "patch-grpc-server.sh: spec_p_split anchor not found" > "/dev/stderr"
+                exit 1
+            }
+        }
+    ' "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+
+    echo "==> DFlash option-handler patch OK"
+fi
+
+if grep -qE 'ctx_server\.get_meta\(\)\.logit_bias_eog|params_base\.sampling\.logit_bias_eog,' "$SRC"; then
+    echo "==> patching $SRC to drop the logit_bias_eog arg from params_from_json_cmpl() callsites (buun still uses the pre-refactor 4-arg signature)"
+    # Upstream llama.cpp refactored params_from_json_cmpl to take a precomputed
+    # logit_bias_eog vector after buun's 2026-04-05 fork-point — simultaneously
+    # adding server_context_meta::logit_bias_eog as the supplier. Buun carries
+    # neither change: its params_from_json_cmpl is still 4-arg, and internally
+    # derives logit_bias_eog from the common_params it's passed. So we just
+    # delete the argument line entirely — the remaining 4 args match buun's
+    # signature and the resulting behavior matches upstream bit-for-bit
+    # (upstream's 5th arg is the same data buun derives internally).
+    #
+    # Guard is broad so this works whether the line has been run through this
+    # block before (leaving params_base.sampling.logit_bias_eog,) or not
+    # (leaving the original ctx_server.get_meta().logit_bias_eog,).
+    sed -E '/^[[:space:]]+(ctx_server\.get_meta\(\)\.logit_bias_eog|params_base\.sampling\.logit_bias_eog),$/d' "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+    echo "==> logit_bias_eog arg drop OK"
+else
+    echo "==> $SRC has no logit_bias_eog arg line, skipping"
+fi
+
+if grep -q 'get_media_marker()' "$SRC"; then
+    echo "==> patching $SRC to replace get_media_marker() with legacy \"<__media__>\" literal"
+    # Only one call site today (ModelMetadata), but replace all occurrences to
+    # stay robust if upstream adds more. Use a temp file to avoid relying on
+    # sed -i portability (the builder image uses GNU sed, but keeping this
+    # consistent with the awk block above).
+    sed 's/get_media_marker()/"<__media__>"/g' "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+    echo "==> get_media_marker() substitution OK"
+else
+    echo "==> $SRC has no get_media_marker() call, skipping media-marker patch"
+fi
+
+echo "==> all patches applied"
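Reviewer note on patch-grpc-server.sh: the KV allow-list block combines three ideas: a `grep` guard for idempotency, an `awk` pass that inserts after the first anchor match only, and a hard failure when the anchor is missing. The sketch below reproduces that pattern on a made-up four-line file so it can be tried in isolation. The file contents, `TYPE_A` / `TYPE_EXTRA`, and `insert_after_anchor` are hypothetical stand-ins, not names from grpc-server.cpp.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Made-up input standing in for the kv_cache_types[] allow-list.
src="$(mktemp)"
printf '%s\n' 'static const int types[] = {' '    TYPE_A,' '    TYPE_B,' '};' > "$src"

# insert_after_anchor FILE: hypothetical helper mirroring the script's
# guard + awk + mv sequence.
insert_after_anchor() {
  local file="$1"
  # Guard: if the marker is already present, do nothing (idempotent re-run).
  if grep -q 'TYPE_EXTRA' "$file"; then
    echo "==> already patched, skipping"
    return 0
  fi
  # Insert after the first anchor match only; exit non-zero if no anchor.
  awk '
    /^    TYPE_A,$/ && !done { print; print "    TYPE_EXTRA,"; done = 1; next }
    { print }
    END { if (!done) { print "anchor not found" > "/dev/stderr"; exit 1 } }
  ' "$file" > "$file.tmp"
  mv "$file.tmp" "$file"
  echo "==> patch OK"
}

insert_after_anchor "$src"
insert_after_anchor "$src"   # second call hits the guard and changes nothing
cat "$src"
```

Writing to a temporary file and moving it into place, as the script does, avoids depending on `sed -i` or in-place awk extensions.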
--- a/backend/cpp/buun-llama-cpp/patches/0001-fattn-atomicAdd-double-shim.patch
+++ b/backend/cpp/buun-llama-cpp/patches/0001-fattn-atomicAdd-double-shim.patch
@@ -0,0 +1,46 @@
+Subject: [PATCH] ggml-cuda/fattn: provide atomicAdd(double*,double) shim for pre-sm_60
+
+Buun's Q² calibration path in ggml_cuda_turbo_scale_q calls
+  atomicAdd(&d_q_channel_sq_fattn[threadIdx.x], (double)(val * val));
+but native double atomicAdd is only available on compute capability 6.0
+and newer. Compiling against a CUDA arch list that includes older
+architectures (LocalAI's CUDA 12 Docker image builds for the full
+published arch range) fails with:
+
+    fattn.cu(812): error: no instance of overloaded function "atomicAdd"
+      matches the argument list, argument types are: (double *, double)
+
+Add the canonical CUDA-programming-guide shim at the top of fattn.cu so
+pre-sm_60 codegen has a definition to call. On sm_60+ the native CUDA
+intrinsic is used and the shim is elided via __CUDA_ARCH__.
+
+--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
+@@ -7,6 +7,27 @@
+
+ #include <atomic>
+
++// Pre-sm_60 double atomicAdd shim. Native double atomicAdd(double*,double)
++// is only available on CUDA compute capability 6.0+ (see CUDA C Programming
++// Guide, B.15 Atomic Functions). Buun's Q² calibration path below calls
++// atomicAdd with a double*; without this definition, nvcc fails to find a
++// matching overload whenever the compile target list includes pre-sm_60
++// architectures. The standard CAS loop implementation below matches the
++// semantics of the native intrinsic.
++#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
++static __device__ double atomicAdd(double * address, double val) {
++    unsigned long long int * address_as_ull = (unsigned long long int *)address;
++    unsigned long long int old = *address_as_ull;
++    unsigned long long int assumed;
++    do {
++        assumed = old;
++        old = atomicCAS(address_as_ull, assumed,
++                        __double_as_longlong(val + __longlong_as_double(assumed)));
++    } while (assumed != old);
++    return __longlong_as_double(old);
++}
++#endif
++
+ // InnerQ: update the fattn-side inverse scale array from host (all devices)
+ void turbo_innerq_update_fattn_scales(const float * scale_inv) {
+     int cur_device;
--- a/backend/cpp/buun-llama-cpp/patches/0002-argmax-shfl-xor-sync-add-width.patch
+++ b/backend/cpp/buun-llama-cpp/patches/0002-argmax-shfl-xor-sync-add-width.patch
@@ -0,0 +1,32 @@
+Subject: [PATCH] ggml-cuda/argmax: pass WARP_SIZE to the top-K __shfl_xor_sync calls
+
+Two __shfl_xor_sync calls in the top-K intra-warp merge drop the `width`
+argument and rely on the CUDA default (warpSize). Every other call in
+the same file already passes WARP_SIZE explicitly, and the HIP/ROCm
+compatibility shim at ggml/src/ggml-cuda/vendors/hip.h:33 is a 4-arg
+function-like macro — so the 3-arg form fails to preprocess when
+building with hipcc against ROCm:
+
+    argmax.cu:265: error: too few arguments provided to function-like
+      macro invocation
+    note: macro '__shfl_xor_sync' defined here:
+      #define __shfl_xor_sync(mask, var, laneMask, width) \
+              __shfl_xor(var, laneMask, width)
+
+Align the two call sites with the rest of the file by passing WARP_SIZE
+explicitly. On CUDA the generated code is unchanged (warpSize is the
+default); on HIP it now matches the macro's arity.
+
+--- a/ggml/src/ggml-cuda/argmax.cu
+++ b/ggml/src/ggml-cuda/argmax.cu
+@@ -262,8 +262,8 @@
+     // Each step: lane gets partner's min element, if it beats our min, replace and re-heapify
+     for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
+         for (int i = 0; i < K; i++) {
+-            float partner_val = __shfl_xor_sync(0xFFFFFFFF, heap_val[i], offset);
+-            int partner_idx = __shfl_xor_sync(0xFFFFFFFF, heap_idx[i], offset);
++            float partner_val = __shfl_xor_sync(0xFFFFFFFF, heap_val[i], offset, WARP_SIZE);
++            int partner_idx = __shfl_xor_sync(0xFFFFFFFF, heap_idx[i], offset, WARP_SIZE);
+             if (partner_val > heap_val[0]) {
+                 heap_val[0] = partner_val;
+                 heap_idx[0] = partner_idx;
--- a/backend/cpp/buun-llama-cpp/patches/0003-hip-add-memcpy-symbol-aliases.patch
+++ b/backend/cpp/buun-llama-cpp/patches/0003-hip-add-memcpy-symbol-aliases.patch
@@ -0,0 +1,24 @@
+Subject: [PATCH] ggml-cuda/vendors/hip: alias cudaMemcpy{To,From}Symbol to hip counterparts
+
+Buun's Q² calibration + TCQ codebook upload paths in fattn.cu use
+cudaMemcpyToSymbol / cudaMemcpyFromSymbol. The HIP-compat header in
+ggml/src/ggml-cuda/vendors/hip.h already aliases the plain cudaMemcpy
+family (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2DAsync, …) but is
+missing the symbol variants. Building with hipcc therefore fails with
+15+ "use of undeclared identifier 'cudaMemcpyToSymbol'" errors.
+
+Add the two missing aliases alongside the existing memcpy block. HIP
+provides hipMemcpy{To,From}Symbol with the same signature as CUDA's
+equivalents, so this is a straight name substitution.
+
+--- a/ggml/src/ggml-cuda/vendors/hip.h
+++ b/ggml/src/ggml-cuda/vendors/hip.h
+@@ -85,6 +85,8 @@
+ #define cudaMemcpyDeviceToDevice hipMemcpyDeviceToDevice
+ #define cudaMemcpyDeviceToHost hipMemcpyDeviceToHost
+ #define cudaMemcpyHostToDevice hipMemcpyHostToDevice
++#define cudaMemcpyToSymbol hipMemcpyToSymbol
++#define cudaMemcpyFromSymbol hipMemcpyFromSymbol
+ #define cudaMemcpyKind hipMemcpyKind
+ #define cudaMemset hipMemset
+ #define cudaMemsetAsync hipMemsetAsync
--- a/backend/cpp/buun-llama-cpp/patches/0004-fattn-fwht128-shfl-xor-sync-add-width.patch
+++ b/backend/cpp/buun-llama-cpp/patches/0004-fattn-fwht128-shfl-xor-sync-add-width.patch
@@ -0,0 +1,36 @@
+Subject: [PATCH] ggml-cuda/fattn: pass WARP_SIZE to fwht128 __shfl_xor_sync calls
+
+Same issue as the argmax top-K fix: two __shfl_xor_sync call sites in
+the FWHT-128 butterfly kernels (ggml_cuda_fwht128 and fwht128_store_half)
+use the 3-arg CUDA form and omit the `width` argument that the HIP
+function-like macro in vendors/hip.h:33 requires. Hipcc fails with:
+
+    fattn.cu:512: too few arguments provided to function-like macro
+      invocation
+    note: macro '__shfl_xor_sync' defined here:
+      #define __shfl_xor_sync(mask, var, laneMask, width) \
+              __shfl_xor(var, laneMask, width)
+
+Add WARP_SIZE to both calls. CUDA codegen is unchanged (warpSize is the
+default); HIP now matches the macro arity.
+
+--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
+@@ -509,7 +509,7 @@
+     // Intra-warp passes: shuffle xor with stride h, no smem, no sync.
+     #pragma unroll
+     for (int h = 1; h <= 16; h *= 2) {
+-        const float other = __shfl_xor_sync(0xFFFFFFFF, val, h);
++        const float other = __shfl_xor_sync(0xFFFFFFFF, val, h, WARP_SIZE);
+         val = (tid & h) ? (other - val) : (val + other);
+     }
+
+@@ -533,7 +533,7 @@
+ static __device__ __forceinline__ void fwht128_store_half(
+         float val, half * dst_base) {
+     const int tid = threadIdx.x;
+-    const float neighbor = __shfl_xor_sync(0xFFFFFFFF, val, 1);
++    const float neighbor = __shfl_xor_sync(0xFFFFFFFF, val, 1, WARP_SIZE);
+     if ((tid & 1) == 0) {
+         const half2 packed = __floats2half2_rn(val, neighbor);
+         *((half2 *)(dst_base + tid)) = packed;
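Reviewer note: the intra-warp loop in the first hunk is a Walsh-Hadamard butterfly over the lanes of one warp. The sketch below models it on the host in plain C so the data flow can be checked without a GPU. `sim_fwht_warp` is a hypothetical name, the array stands in for one register per lane, and reading `lanes[tid ^ h]` plays the role of the shuffle. It assumes a warp of 32 lanes, which matches the loop bound of 16 in the hunk.

```c
#include <assert.h>

#define WARP_SIZE 32

/* Host-side model of the intra-warp passes: lanes[] holds one value per lane.
 * Each pass pairs lane tid with lane tid ^ h and applies the 2x2 butterfly. */
static void sim_fwht_warp(float lanes[WARP_SIZE]) {
    for (int h = 1; h <= WARP_SIZE / 2; h *= 2) {
        float next[WARP_SIZE];
        for (int tid = 0; tid < WARP_SIZE; tid++) {
            const float val = lanes[tid];
            const float other = lanes[tid ^ h];  /* value the shuffle would fetch */
            next[tid] = (tid & h) ? (other - val) : (val + other);
        }
        /* All lanes update together, so commit the pass only after every
         * lane has read its partner's old value. */
        for (int tid = 0; tid < WARP_SIZE; tid++) {
            lanes[tid] = next[tid];
        }
    }
}
```

Two properties make this easy to sanity-check: a unit impulse in lane 0 transforms to all ones, and applying the unnormalised transform twice scales the input by 32.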
--- a/backend/cpp/buun-llama-cpp/run.sh
+++ b/backend/cpp/buun-llama-cpp/run.sh
@@ -0,0 +1,65 @@
+#!/bin/bash
+set -ex
+
+# Get the absolute current dir where the script is located
+CURDIR=$(dirname "$(realpath "$0")")
+
+cd /
+
+echo "CPU info:"
+grep -e "model\sname" /proc/cpuinfo | head -1
+grep -e "flags" /proc/cpuinfo | head -1
+
+BINARY=buun-llama-cpp-fallback
+
+if grep -q -e "\savx\s" /proc/cpuinfo ; then
+	echo "CPU:    AVX    found OK"
+	if [ -e $CURDIR/buun-llama-cpp-avx ]; then
+		BINARY=buun-llama-cpp-avx
+	fi
+fi
+
+if grep -q -e "\savx2\s" /proc/cpuinfo ; then
+	echo "CPU:    AVX2   found OK"
+	if [ -e $CURDIR/buun-llama-cpp-avx2 ]; then
+		BINARY=buun-llama-cpp-avx2
+	fi
+fi
+
+# Check avx 512
+if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
+	echo "CPU:    AVX512F found OK"
+	if [ -e $CURDIR/buun-llama-cpp-avx512 ]; then
+		BINARY=buun-llama-cpp-avx512
+	fi
+fi
+
+if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then
+	if [ -e $CURDIR/buun-llama-cpp-grpc ]; then
+		BINARY=buun-llama-cpp-grpc
+	fi
+fi
+
+# Extend ld library path with the dir where this script is located/lib
+if [ "$(uname)" == "Darwin" ]; then
+	export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
+else
+	export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
+	# Tell rocBLAS where to find TensileLibrary data (GPU kernel tuning files)
+	if [ -d "$CURDIR/lib/rocblas/library" ]; then
+		export ROCBLAS_TENSILE_LIBPATH=$CURDIR/lib/rocblas/library
+	fi
+fi
+
+# If there is a lib/ld.so, use it
+if [ -f $CURDIR/lib/ld.so ]; then
+	echo "Using lib/ld.so"
+	echo "Using binary: $BINARY"
+	exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
+fi
+
+echo "Using binary: $BINARY"
+exec $CURDIR/$BINARY "$@"
+
+# We should never reach this point, however just in case we do, run fallback
+exec $CURDIR/buun-llama-cpp-fallback "$@"
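Reviewer note: the selection logic in run.sh relies on two things, whole-token matching (so `avx` does not match inside `avx2`) and ordering (later checks override earlier ones). A rough sketch of that logic in C follows. `has_cpu_flag` and `pick_binary` are hypothetical helpers, the sketch skips the file-existence checks and the `LLAMACPP_GRPC_SERVERS` override, and unlike the `\s...\s` grep patterns it also accepts a flag at the very start or end of the string.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 when `name` appears as a whole whitespace-delimited token. */
static int has_cpu_flag(const char *flags, const char *name) {
    const size_t n = strlen(name);
    const char *p = flags;
    if (n == 0) return 0;
    while ((p = strstr(p, name)) != NULL) {
        const int starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        const int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\t' || p[n] == '\n';
        if (starts && ends) return 1;
        p += 1;  /* keep scanning past a partial match such as "avx" in "avx2" */
    }
    return 0;
}

/* Mirrors the order of the checks in run.sh: the last matching check wins. */
static const char *pick_binary(const char *flags) {
    const char *binary = "buun-llama-cpp-fallback";
    if (has_cpu_flag(flags, "avx"))     binary = "buun-llama-cpp-avx";
    if (has_cpu_flag(flags, "avx2"))    binary = "buun-llama-cpp-avx2";
    if (has_cpu_flag(flags, "avx512f")) binary = "buun-llama-cpp-avx512";
    return binary;
}
```

Because the checks are sequential rather than `elif` branches, a CPU that reports all three flags ends up on the AVX512 build, provided that binary was shipped.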
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=286ce324baed17c95faec77792eaa6bdb1c7a5f5
+IK_LLAMA_VERSION?=16996aeab772c69b6473597038b2ef0b85297e8b
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/ik-llama-cpp/patches/0002-clip-ggml-quantize-chunk-user-data.patch
+++ b/backend/cpp/ik-llama-cpp/patches/0002-clip-ggml-quantize-chunk-user-data.patch
@@ -0,0 +1,11 @@
+--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
+@@ -2494,7 +2494,7 @@
+             }
+             new_data = work.data();
+
+-            new_size = ggml_quantize_chunk(new_type, f32_data, new_data, 0, n_elms/cur->ne[0], cur->ne[0], nullptr);
++            new_size = ggml_quantize_chunk(new_type, f32_data, new_data, 0, n_elms/cur->ne[0], cur->ne[0], nullptr, nullptr);
+         } else {
+             new_type = cur->type;
+             new_data = cur->data;
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@

-LLAMA_VERSION?=0d0764dfd257c0ae862525c05778207f87b99b1c
+LLAMA_VERSION?=187a45637054881ecacf17f8e2f6f8f2ba7df1c7
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

 CMAKE_ARGS?=
--- a/backend/go/local-store/store.go
+++ b/backend/go/local-store/store.go
@@ -4,7 +4,6 @@ package main
 // It is meant to be used by the main executable that is the server for the specific backend type (falcon, gpt3, etc)
 import (
 	"container/heap"
-	"errors"
 	"fmt"
 	"math"
 	"slices"
@@ -100,9 +99,16 @@ func sortIntoKeySlicese(keys []*pb.StoresKey) [][]float32 {
 }

 func (s *Store) Load(opts *pb.ModelOptions) error {
-	if opts.Model != "" {
-		return errors.New("not implemented")
-	}
+	// local-store is an in-memory vector store with no on-disk artefact to
+	// load — opts.Model is just a namespace identifier. The old `!= ""` guard
+	// rejected any non-empty model name with "not implemented", which broke
+	// callers that pass a namespace to isolate embedding spaces (face vs.
+	// voice biometrics both go through local-store but need distinct stores
+	// so ArcFace 512-D and ECAPA-TDNN 192-D don't collide). Namespace
+	// isolation is already handled upstream: ModelLoader spawns a fresh
+	// local-store process per (backend, model) tuple, so each namespace is
+	// its own Store{} instance. Nothing to do here beyond accepting the load.
+	_ = opts
 	return nil
 }

--- a/backend/go/sherpa-onnx/.gitignore
+++ b/backend/go/sherpa-onnx/.gitignore
@@ -0,0 +1,11 @@
+.cache/
+sources/
+build*/
+package/
+backend-assets/
+sherpa-onnx
+*.so
+compile_commands.json
+sherpa-onnx-whisper-*
+vits-ljs/
+streaming-zipformer-en/
--- a/backend/go/sherpa-onnx/Makefile
+++ b/backend/go/sherpa-onnx/Makefile
@@ -0,0 +1,120 @@
+CURRENT_DIR=$(abspath ./)
+GOCMD=go
+
+ONNX_VERSION?=1.24.4
+# v1.12.39 — includes upstream's onnxruntime 1.24.4 bump (#3501). Earlier
+# pinned commits only support onnxruntime 1.23.2, which has no CUDA 13
+# pre-built tarball, blocking the -gpu-nvidia-cuda-13 build matrix entry.
+SHERPA_COMMIT?=7288d15e3e31a7bd589b2ba88828d521e7a6b140
+ONNX_ARCH?=x64
+ONNX_OS?=linux
+
+ifneq (,$(findstring aarch64,$(shell uname -m)))
+	ONNX_ARCH=aarch64
+endif
+
+ifeq ($(OS),Darwin)
+	ONNX_OS=osx
+	ifneq (,$(findstring aarch64,$(shell uname -m)))
+		ONNX_ARCH=arm64
+	else ifneq (,$(findstring arm64,$(shell uname -m)))
+		ONNX_ARCH=arm64
+	else
+		ONNX_ARCH=x86_64
+	endif
+endif
+
+# Upstream onnxruntime ships CUDA 12 and CUDA 13 variants under different
+# names: -gpu-<ver>.tgz for CUDA 12, -gpu_cuda13-<ver>.tgz for CUDA 13
+# (note underscore vs dash). CUDA 13 tarballs only exist from 1.24.x onward.
+ifeq ($(BUILD_TYPE),cublas)
+	SHERPA_GPU=ON
+	ONNX_PROVIDER=cuda
+	ifeq ($(CUDA_MAJOR_VERSION),13)
+		ONNX_VARIANT=-gpu_cuda13
+	else
+		ONNX_VARIANT=-gpu
+	endif
+else
+	ONNX_VARIANT=
+	SHERPA_GPU=OFF
+	ONNX_PROVIDER=cpu
+endif
+
+JOBS?=$(shell nproc --ignore=1 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
+
+sources/onnxruntime:
+	mkdir -p sources/onnxruntime
+	curl -L https://github.com/microsoft/onnxruntime/releases/download/v$(ONNX_VERSION)/onnxruntime-$(ONNX_OS)-$(ONNX_ARCH)$(ONNX_VARIANT)-$(ONNX_VERSION).tgz \
+	  -o sources/onnxruntime/onnxruntime.tgz
+	cd sources/onnxruntime && tar -xf onnxruntime.tgz --strip-components=1 && rm onnxruntime.tgz
+
+sources/sherpa-onnx: sources/onnxruntime
+	git clone https://github.com/k2-fsa/sherpa-onnx.git sources/sherpa-onnx
+	cd sources/sherpa-onnx && git checkout $(SHERPA_COMMIT)
+	mkdir -p sources/sherpa-onnx/build
+	# sherpa-onnx's cmake detects a pre-installed onnxruntime via the
+	# SHERPA_ONNXRUNTIME_{INCLUDE,LIB}_DIR env vars (not via -D flags).
+	# Point them at our locally-downloaded Microsoft tarball — without
+	# this, sherpa-onnx falls through to download_onnxruntime() which
+	# fetches from csukuangfj/onnxruntime-libs. For the GPU 1.24.4
+	# build that release mirror publishes `-patched.zip` instead of the
+	# expected `.tgz`, so the download 404s and the build fails.
+	cd sources/sherpa-onnx/build && \
+	SHERPA_ONNXRUNTIME_INCLUDE_DIR=$(CURRENT_DIR)/sources/onnxruntime/include \
+	SHERPA_ONNXRUNTIME_LIB_DIR=$(CURRENT_DIR)/sources/onnxruntime/lib \
+	cmake \
+	  -DCMAKE_BUILD_TYPE=Release \
+	  -DCMAKE_C_FLAGS="-Wno-error=format-security" \
+	  -DCMAKE_CXX_FLAGS="-Wno-error=format-security" \
+	  -DSHERPA_ONNX_ENABLE_GPU=$(SHERPA_GPU) \
+	  -DSHERPA_ONNX_ENABLE_TTS=ON \
+	  -DSHERPA_ONNX_ENABLE_BINARY=OFF \
+	  -DSHERPA_ONNX_ENABLE_PYTHON=OFF \
+	  -DSHERPA_ONNX_ENABLE_TESTS=OFF \
+	  -DSHERPA_ONNX_ENABLE_C_API=ON \
+	  -DBUILD_SHARED_LIBS=ON \
+	  -DSHERPA_ONNX_USE_PRE_INSTALLED_ONNXRUNTIME_IF_AVAILABLE=ON \
+	  ..
+	cd sources/sherpa-onnx/build && make -j$(JOBS)
+
+backend-assets/lib: sources/sherpa-onnx sources/onnxruntime
+	mkdir -p backend-assets/lib
+	cp -rfLv sources/onnxruntime/lib/* backend-assets/lib/
+	cp -rfLv sources/sherpa-onnx/build/lib/*.so* backend-assets/lib/ 2>/dev/null || true
+	cp -rfLv sources/sherpa-onnx/build/lib/*.dylib backend-assets/lib/ 2>/dev/null || true
+
+# libsherpa-shim wraps sherpa-onnx's nested config structs and TTS
+# callback plumbing behind a purego-friendly API: opaque handles plus
+# fixed-signature setters/getters/trampoline. Plain C compile — no cgo.
+SHIM_EXT=so
+ifeq ($(OS),Darwin)
+	SHIM_EXT=dylib
+endif
+
+backend-assets/lib/libsherpa-shim.$(SHIM_EXT): csrc/shim.c csrc/shim.h backend-assets/lib
+	$(CC) -shared -fPIC -O2 \
+	  -I$(CURRENT_DIR)/sources/sherpa-onnx/sherpa-onnx/c-api \
+	  -o $@ csrc/shim.c \
+	  -L$(CURRENT_DIR)/backend-assets/lib \
+	  -lsherpa-onnx-c-api \
+	  -Wl,-rpath,'$$ORIGIN'
+
+sherpa-onnx: backend-assets/lib backend-assets/lib/libsherpa-shim.$(SHIM_EXT)
+	CGO_ENABLED=0 $(GOCMD) build \
+	  -ldflags "$(LD_FLAGS) -X main.onnxProvider=$(ONNX_PROVIDER)" \
+	  -tags "$(GO_TAGS)" -o sherpa-onnx ./
+
+package:
+	bash package.sh
+
+build: sherpa-onnx package
+
+clean:
+	rm -rf sherpa-onnx sources/ backend-assets/ package/ vits-ljs/ sherpa-onnx-whisper-*/
+
+test: sherpa-onnx
+	LD_LIBRARY_PATH=$(CURRENT_DIR)/backend-assets/lib \
+	bash test.sh
+
+.PHONY: build package clean test
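Reviewer note: the tarball name in the `sources/onnxruntime` rule is assembled from five variables, and the dash versus underscore difference between the two GPU variants is easy to misread. The C sketch below composes the same file name so the resulting strings can be inspected. `onnx_variant` and `onnx_tarball_name` are hypothetical helpers; the sketch only reproduces the Makefile's string composition and makes no claim about which files upstream actually publishes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Mirrors the BUILD_TYPE / CUDA_MAJOR_VERSION branches in the Makefile. */
static const char *onnx_variant(const char *build_type, const char *cuda_major) {
    if (strcmp(build_type, "cublas") != 0) return "";
    return strcmp(cuda_major, "13") == 0 ? "-gpu_cuda13" : "-gpu";
}

/* Writes onnxruntime-<os>-<arch><variant>-<version>.tgz into out.
 * Returns the snprintf result (length that was or would have been written). */
static int onnx_tarball_name(char *out, size_t cap, const char *os,
                             const char *arch, const char *build_type,
                             const char *cuda_major, const char *version) {
    return snprintf(out, cap, "onnxruntime-%s-%s%s-%s.tgz", os, arch,
                    onnx_variant(build_type, cuda_major), version);
}
```

For `linux`, `x64`, `cublas`, CUDA major `13` and version `1.24.4` this yields `onnxruntime-linux-x64-gpu_cuda13-1.24.4.tgz`; with CUDA major `12` the variant collapses to `-gpu`, and for a CPU build it is empty.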
--- a/backend/go/sherpa-onnx/backend.go
+++ b/backend/go/sherpa-onnx/backend.go
--- a/backend/go/sherpa-onnx/backend_test.go
+++ b/backend/go/sherpa-onnx/backend_test.go
@@ -0,0 +1,169 @@
+package main
+
+import (
+	"os"
+	"path/filepath"
+	"testing"
+
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+func TestSherpaBackend(t *testing.T) {
+	RegisterFailHandler(Fail)
+	RunSpecs(t, "Sherpa-ONNX Backend Suite")
+}
+
+// Load libsherpa-shim + libsherpa-onnx-c-api via purego before any spec
+// runs — otherwise any Load/TTS/VAD/AudioTranscription call hits a nil
+// function pointer. LD_LIBRARY_PATH must contain the directory holding
+// both .so files; test.sh sets this.
+var _ = BeforeSuite(func() {
+	Expect(loadSherpaLibs()).To(Succeed())
+})
+
+var _ = Describe("Sherpa-ONNX", func() {
+	Context("lifecycle", func() {
+		It("is locking (C API is not thread safe)", func() {
+			Expect((&SherpaBackend{}).Locking()).To(BeTrue())
+		})
+
+		It("errors loading a non-existent model", func() {
+			tmpDir, err := os.MkdirTemp("", "sherpa-test-nonexistent")
+			Expect(err).ToNot(HaveOccurred())
+			defer os.RemoveAll(tmpDir)
+
+			err = (&SherpaBackend{}).Load(&pb.ModelOptions{
+				ModelFile: filepath.Join(tmpDir, "non-existent-model.onnx"),
+			})
+			Expect(err).To(HaveOccurred())
+		})
+
+		It("errors loading a non-existent ASR model", func() {
+			tmpDir, err := os.MkdirTemp("", "sherpa-test-asr")
+			Expect(err).ToNot(HaveOccurred())
+			defer os.RemoveAll(tmpDir)
+
+			err = (&SherpaBackend{}).Load(&pb.ModelOptions{
+				ModelFile: filepath.Join(tmpDir, "model.onnx"),
+				Type:      "asr",
+			})
+			Expect(err).To(HaveOccurred())
+		})
+
+		It("dispatches Load by Type", func() {
+			tmpDir, err := os.MkdirTemp("", "sherpa-test-dispatch")
+			Expect(err).ToNot(HaveOccurred())
+			defer os.RemoveAll(tmpDir)
+
+			modelFile := filepath.Join(tmpDir, "model.onnx")
+			for _, typ := range []string{"", "asr", "vad"} {
+				err := (&SherpaBackend{}).Load(&pb.ModelOptions{ModelFile: modelFile, Type: typ})
+				Expect(err).To(HaveOccurred(), "Type=%q", typ)
+			}
+		})
+	})
+
+	Context("method errors without loaded model", func() {
+		It("rejects TTS", func() {
+			tmpDir, err := os.MkdirTemp("", "sherpa-test-tts")
+			Expect(err).ToNot(HaveOccurred())
+			defer os.RemoveAll(tmpDir)
+
+			err = (&SherpaBackend{}).TTS(&pb.TTSRequest{
+				Text: "should fail — no model loaded",
+				Dst:  filepath.Join(tmpDir, "output.wav"),
+			})
+			Expect(err).To(HaveOccurred())
+		})
+
+		It("rejects AudioTranscription", func() {
+			_, err := (&SherpaBackend{}).AudioTranscription(&pb.TranscriptRequest{
+				Dst: "/tmp/nonexistent.wav",
+			})
+			Expect(err).To(HaveOccurred())
+		})
+
+		It("rejects VAD", func() {
+			_, err := (&SherpaBackend{}).VAD(&pb.VADRequest{
+				Audio: []float32{0.1, 0.2, 0.3},
+			})
+			Expect(err).To(HaveOccurred())
+		})
+	})
+
+	Context("type detection", func() {
+		DescribeTable("isASRType",
+			func(input string, want bool) {
+				Expect(isASRType(input)).To(Equal(want))
+			},
+			Entry("asr", "asr", true),
+			Entry("ASR", "ASR", true),
+			Entry("Asr", "Asr", true),
+			Entry("transcription", "transcription", true),
+			Entry("Transcription", "Transcription", true),
+			Entry("transcribe", "transcribe", true),
+			Entry("Transcribe", "Transcribe", true),
+			Entry("tts", "tts", false),
+			Entry("empty", "", false),
+			Entry("other", "other", false),
+			Entry("vad", "vad", false),
+		)
+
+		DescribeTable("isVADType",
+			func(input string, want bool) {
+				Expect(isVADType(input)).To(Equal(want))
+			},
+			Entry("vad", "vad", true),
+			Entry("VAD", "VAD", true),
+			Entry("Vad", "Vad", true),
+			Entry("asr", "asr", false),
+			Entry("tts", "tts", false),
+			Entry("empty", "", false),
+			Entry("other", "other", false),
+		)
+	})
+
+	Context("option parsing", func() {
+		It("parses float options with fallback on bad input", func() {
+			opts := &pb.ModelOptions{Options: []string{
+				"vad.threshold=0.3",
+				"tts.length_scale=1.25",
+				"bad.number=not-a-float",
+			}}
+			Expect(findOptionFloat(opts, "vad.threshold=", 0.5)).To(BeNumerically("~", 0.3, 1e-6))
+			Expect(findOptionFloat(opts, "tts.length_scale=", 1.0)).To(BeNumerically("~", 1.25, 1e-6))
+			Expect(findOptionFloat(opts, "missing.key=", 0.7)).To(BeNumerically("~", 0.7, 1e-6))
+			Expect(findOptionFloat(opts, "bad.number=", 9.9)).To(BeNumerically("~", 9.9, 1e-6))
+		})
+
+		It("parses int options with fallback on bad input", func() {
+			opts := &pb.ModelOptions{Options: []string{
+				"asr.sample_rate=22050",
+				"online.chunk_samples=800",
+				"bad.int=4.2",
+			}}
+			Expect(findOptionInt(opts, "asr.sample_rate=", 16000)).To(Equal(int32(22050)))
+			Expect(findOptionInt(opts, "online.chunk_samples=", 1600)).To(Equal(int32(800)))
+			Expect(findOptionInt(opts, "missing.key=", 42)).To(Equal(int32(42)))
+			Expect(findOptionInt(opts, "bad.int=", 100)).To(Equal(int32(100)))
+		})
+
+		It("parses bool options (0/1, true/false, yes/no, on/off)", func() {
+			opts := &pb.ModelOptions{Options: []string{
+				"online.enable_endpoint=0",
+				"asr.sense_voice.use_itn=True",
+				"feature.on=yes",
+				"feature.off=Off",
+				"feature.bad=maybe",
+			}}
+			Expect(findOptionBool(opts, "online.enable_endpoint=", 1)).To(Equal(int32(0)))
+			Expect(findOptionBool(opts, "asr.sense_voice.use_itn=", 0)).To(Equal(int32(1)))
+			Expect(findOptionBool(opts, "feature.on=", 0)).To(Equal(int32(1)))
+			Expect(findOptionBool(opts, "feature.off=", 1)).To(Equal(int32(0)))
+			Expect(findOptionBool(opts, "feature.bad=", 1)).To(Equal(int32(1)))
+			Expect(findOptionBool(opts, "missing.key=", 1)).To(Equal(int32(1)))
+		})
+	})
+})
--- a/backend/go/sherpa-onnx/csrc/shim.c
+++ b/backend/go/sherpa-onnx/csrc/shim.c
@@ -0,0 +1,325 @@
+#include "shim.h"
+#include "c-api.h"
+
+#include <stdlib.h>
+#include <string.h>
+
+// Replace the char* field pointed to by `slot` with a strdup of `s`
+// (or NULL if s is NULL). Frees any prior value. If strdup fails the
+// slot is left NULL; the caller will see a Create* failure downstream.
+static void shim_set_str(const char **slot, const char *s) {
+    free((char *)*slot);
+    *slot = s ? strdup(s) : NULL;
+}
+
+// ==================================================================
+// VAD config
+// ==================================================================
+
+void *sherpa_shim_vad_config_new(void) {
+    return calloc(1, sizeof(SherpaOnnxVadModelConfig));
+}
+
+void sherpa_shim_vad_config_free(void *h) {
+    if (!h) return;
+    SherpaOnnxVadModelConfig *c = (SherpaOnnxVadModelConfig *)h;
+    free((char *)c->silero_vad.model);
+    free((char *)c->provider);
+    free(c);
+}
+
+void sherpa_shim_vad_config_set_silero_model(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxVadModelConfig *)h)->silero_vad.model, v);
+}
+void sherpa_shim_vad_config_set_silero_threshold(void *h, float v) {
+    ((SherpaOnnxVadModelConfig *)h)->silero_vad.threshold = v;
+}
+void sherpa_shim_vad_config_set_silero_min_silence_duration(void *h, float v) {
+    ((SherpaOnnxVadModelConfig *)h)->silero_vad.min_silence_duration = v;
+}
+void sherpa_shim_vad_config_set_silero_min_speech_duration(void *h, float v) {
+    ((SherpaOnnxVadModelConfig *)h)->silero_vad.min_speech_duration = v;
+}
+void sherpa_shim_vad_config_set_silero_window_size(void *h, int32_t v) {
+    ((SherpaOnnxVadModelConfig *)h)->silero_vad.window_size = v;
+}
+void sherpa_shim_vad_config_set_silero_max_speech_duration(void *h, float v) {
+    ((SherpaOnnxVadModelConfig *)h)->silero_vad.max_speech_duration = v;
+}
+void sherpa_shim_vad_config_set_sample_rate(void *h, int32_t v) {
+    ((SherpaOnnxVadModelConfig *)h)->sample_rate = v;
+}
+void sherpa_shim_vad_config_set_num_threads(void *h, int32_t v) {
+    ((SherpaOnnxVadModelConfig *)h)->num_threads = v;
+}
+void sherpa_shim_vad_config_set_provider(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxVadModelConfig *)h)->provider, v);
+}
+void sherpa_shim_vad_config_set_debug(void *h, int32_t v) {
+    ((SherpaOnnxVadModelConfig *)h)->debug = v;
+}
+
+void *sherpa_shim_create_vad(void *h, float buffer_size_seconds) {
+    return (void *)SherpaOnnxCreateVoiceActivityDetector(
+        (const SherpaOnnxVadModelConfig *)h, buffer_size_seconds);
+}
+
+// ==================================================================
+// Offline TTS config (VITS)
+// ==================================================================
+
+void *sherpa_shim_tts_config_new(void) {
+    return calloc(1, sizeof(SherpaOnnxOfflineTtsConfig));
+}
+
+void sherpa_shim_tts_config_free(void *h) {
+    if (!h) return;
+    SherpaOnnxOfflineTtsConfig *c = (SherpaOnnxOfflineTtsConfig *)h;
+    free((char *)c->model.vits.model);
+    free((char *)c->model.vits.tokens);
+    free((char *)c->model.vits.lexicon);
+    free((char *)c->model.vits.data_dir);
+    free((char *)c->model.provider);
+    free(c);
+}
+
+void sherpa_shim_tts_config_set_vits_model(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.model, v);
+}
+void sherpa_shim_tts_config_set_vits_tokens(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.tokens, v);
+}
+void sherpa_shim_tts_config_set_vits_lexicon(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.lexicon, v);
+}
+void sherpa_shim_tts_config_set_vits_data_dir(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.data_dir, v);
+}
+void sherpa_shim_tts_config_set_vits_noise_scale(void *h, float v) {
+    ((SherpaOnnxOfflineTtsConfig *)h)->model.vits.noise_scale = v;
+}
+void sherpa_shim_tts_config_set_vits_noise_scale_w(void *h, float v) {
+    ((SherpaOnnxOfflineTtsConfig *)h)->model.vits.noise_scale_w = v;
+}
+void sherpa_shim_tts_config_set_vits_length_scale(void *h, float v) {
+    ((SherpaOnnxOfflineTtsConfig *)h)->model.vits.length_scale = v;
+}
+void sherpa_shim_tts_config_set_num_threads(void *h, int32_t v) {
+    ((SherpaOnnxOfflineTtsConfig *)h)->model.num_threads = v;
+}
+void sherpa_shim_tts_config_set_debug(void *h, int32_t v) {
+    ((SherpaOnnxOfflineTtsConfig *)h)->model.debug = v;
+}
+void sherpa_shim_tts_config_set_provider(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.provider, v);
+}
+void sherpa_shim_tts_config_set_max_num_sentences(void *h, int32_t v) {
+    ((SherpaOnnxOfflineTtsConfig *)h)->max_num_sentences = v;
+}
+
+void *sherpa_shim_create_offline_tts(void *h) {
+    return (void *)SherpaOnnxCreateOfflineTts(
+        (const SherpaOnnxOfflineTtsConfig *)h);
+}
+
+// ==================================================================
+// Offline recognizer config
+// ==================================================================
+
+void *sherpa_shim_offline_recog_config_new(void) {
+    return calloc(1, sizeof(SherpaOnnxOfflineRecognizerConfig));
+}
+
+void sherpa_shim_offline_recog_config_free(void *h) {
+    if (!h) return;
+    SherpaOnnxOfflineRecognizerConfig *c = (SherpaOnnxOfflineRecognizerConfig *)h;
+    free((char *)c->model_config.provider);
+    free((char *)c->model_config.tokens);
+    free((char *)c->model_config.whisper.encoder);
+    free((char *)c->model_config.whisper.decoder);
+    free((char *)c->model_config.whisper.language);
+    free((char *)c->model_config.whisper.task);
+    free((char *)c->model_config.paraformer.model);
+    free((char *)c->model_config.sense_voice.model);
+    free((char *)c->model_config.sense_voice.language);
+    free((char *)c->model_config.omnilingual.model);
+    free((char *)c->decoding_method);
+    free(c);
+}
+
+void sherpa_shim_offline_recog_config_set_num_threads(void *h, int32_t v) {
+    ((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.num_threads = v;
+}
+void sherpa_shim_offline_recog_config_set_debug(void *h, int32_t v) {
+    ((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.debug = v;
+}
+void sherpa_shim_offline_recog_config_set_provider(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.provider, v);
+}
+void sherpa_shim_offline_recog_config_set_tokens(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.tokens, v);
+}
+void sherpa_shim_offline_recog_config_set_feat_sample_rate(void *h, int32_t v) {
+    ((SherpaOnnxOfflineRecognizerConfig *)h)->feat_config.sample_rate = v;
+}
+void sherpa_shim_offline_recog_config_set_feat_feature_dim(void *h, int32_t v) {
+    ((SherpaOnnxOfflineRecognizerConfig *)h)->feat_config.feature_dim = v;
+}
+void sherpa_shim_offline_recog_config_set_decoding_method(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->decoding_method, v);
+}
+void sherpa_shim_offline_recog_config_set_whisper_encoder(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.encoder, v);
+}
+void sherpa_shim_offline_recog_config_set_whisper_decoder(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.decoder, v);
+}
+void sherpa_shim_offline_recog_config_set_whisper_language(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.language, v);
+}
+void sherpa_shim_offline_recog_config_set_whisper_task(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.task, v);
+}
+void sherpa_shim_offline_recog_config_set_whisper_tail_paddings(void *h, int32_t v) {
+    ((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.tail_paddings = v;
+}
+void sherpa_shim_offline_recog_config_set_paraformer_model(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.paraformer.model, v);
+}
+void sherpa_shim_offline_recog_config_set_sense_voice_model(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.sense_voice.model, v);
+}
+void sherpa_shim_offline_recog_config_set_sense_voice_language(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.sense_voice.language, v);
+}
+void sherpa_shim_offline_recog_config_set_sense_voice_use_itn(void *h, int32_t v) {
+    ((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.sense_voice.use_itn = v;
+}
+void sherpa_shim_offline_recog_config_set_omnilingual_model(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.omnilingual.model, v);
+}
+
+void *sherpa_shim_create_offline_recognizer(void *h) {
+    return (void *)SherpaOnnxCreateOfflineRecognizer(
+        (const SherpaOnnxOfflineRecognizerConfig *)h);
+}
+
+// ==================================================================
+// Online recognizer config
+// ==================================================================
+
+void *sherpa_shim_online_recog_config_new(void) {
+    return calloc(1, sizeof(SherpaOnnxOnlineRecognizerConfig));
+}
+
+void sherpa_shim_online_recog_config_free(void *h) {
+    if (!h) return;
+    SherpaOnnxOnlineRecognizerConfig *c = (SherpaOnnxOnlineRecognizerConfig *)h;
+    free((char *)c->model_config.transducer.encoder);
+    free((char *)c->model_config.transducer.decoder);
+    free((char *)c->model_config.transducer.joiner);
+    free((char *)c->model_config.tokens);
+    free((char *)c->model_config.provider);
+    free((char *)c->decoding_method);
+    free(c);
+}
+
+void sherpa_shim_online_recog_config_set_transducer_encoder(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.transducer.encoder, v);
+}
+void sherpa_shim_online_recog_config_set_transducer_decoder(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.transducer.decoder, v);
+}
+void sherpa_shim_online_recog_config_set_transducer_joiner(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.transducer.joiner, v);
+}
+void sherpa_shim_online_recog_config_set_tokens(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.tokens, v);
+}
+void sherpa_shim_online_recog_config_set_num_threads(void *h, int32_t v) {
+    ((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.num_threads = v;
+}
+void sherpa_shim_online_recog_config_set_debug(void *h, int32_t v) {
+    ((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.debug = v;
+}
+void sherpa_shim_online_recog_config_set_provider(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.provider, v);
+}
+void sherpa_shim_online_recog_config_set_feat_sample_rate(void *h, int32_t v) {
+    ((SherpaOnnxOnlineRecognizerConfig *)h)->feat_config.sample_rate = v;
+}
+void sherpa_shim_online_recog_config_set_feat_feature_dim(void *h, int32_t v) {
+    ((SherpaOnnxOnlineRecognizerConfig *)h)->feat_config.feature_dim = v;
+}
+void sherpa_shim_online_recog_config_set_decoding_method(void *h, const char *v) {
+    shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->decoding_method, v);
+}
+void sherpa_shim_online_recog_config_set_enable_endpoint(void *h, int32_t v) {
+    ((SherpaOnnxOnlineRecognizerConfig *)h)->enable_endpoint = v;
+}
+void sherpa_shim_online_recog_config_set_rule1_min_trailing_silence(void *h, float v) {
+    ((SherpaOnnxOnlineRecognizerConfig *)h)->rule1_min_trailing_silence = v;
+}
+void sherpa_shim_online_recog_config_set_rule2_min_trailing_silence(void *h, float v) {
+    ((SherpaOnnxOnlineRecognizerConfig *)h)->rule2_min_trailing_silence = v;
+}
+void sherpa_shim_online_recog_config_set_rule3_min_utterance_length(void *h, float v) {
+    ((SherpaOnnxOnlineRecognizerConfig *)h)->rule3_min_utterance_length = v;
+}
+
+void *sherpa_shim_create_online_recognizer(void *h) {
+    return (void *)SherpaOnnxCreateOnlineRecognizer(
+        (const SherpaOnnxOnlineRecognizerConfig *)h);
+}
+
+// ==================================================================
+// Result-struct accessors
+// ==================================================================
+
+int32_t sherpa_shim_wave_sample_rate(const void *h) {
+    return ((const SherpaOnnxWave *)h)->sample_rate;
+}
+int32_t sherpa_shim_wave_num_samples(const void *h) {
+    return ((const SherpaOnnxWave *)h)->num_samples;
+}
+const float *sherpa_shim_wave_samples(const void *h) {
+    return ((const SherpaOnnxWave *)h)->samples;
+}
+
+const char *sherpa_shim_offline_result_text(const void *h) {
+    return ((const SherpaOnnxOfflineRecognizerResult *)h)->text;
+}
+const char *sherpa_shim_online_result_text(const void *h) {
+    return ((const SherpaOnnxOnlineRecognizerResult *)h)->text;
+}
+
+int32_t sherpa_shim_generated_audio_sample_rate(const void *h) {
+    return ((const SherpaOnnxGeneratedAudio *)h)->sample_rate;
+}
+int32_t sherpa_shim_generated_audio_n(const void *h) {
+    return ((const SherpaOnnxGeneratedAudio *)h)->n;
+}
+const float *sherpa_shim_generated_audio_samples(const void *h) {
+    return ((const SherpaOnnxGeneratedAudio *)h)->samples;
+}
+
+int32_t sherpa_shim_speech_segment_start(const void *h) {
+    return ((const SherpaOnnxSpeechSegment *)h)->start;
+}
+int32_t sherpa_shim_speech_segment_n(const void *h) {
+    return ((const SherpaOnnxSpeechSegment *)h)->n;
+}
+
+// ==================================================================
+// TTS streaming callback trampoline
+// ==================================================================
+
+void *sherpa_shim_tts_generate_with_callback(
+    void *tts, const char *text, int32_t sid, float speed,
+    uintptr_t callback_ptr, uintptr_t user_data) {
+    SherpaOnnxGeneratedAudioCallbackWithArg cb =
+        (SherpaOnnxGeneratedAudioCallbackWithArg)callback_ptr;
+    return (void *)SherpaOnnxOfflineTtsGenerateWithCallbackWithArg(
+        (const SherpaOnnxOfflineTts *)tts, text, sid, speed, cb,
+        (void *)user_data);
+}
--- a/backend/go/sherpa-onnx/csrc/shim.h
+++ b/backend/go/sherpa-onnx/csrc/shim.h
@@ -0,0 +1,129 @@
+#ifndef LOCALAI_SHERPA_ONNX_SHIM_H
+#define LOCALAI_SHERPA_ONNX_SHIM_H
+
+#include <stdint.h>
+
+// libsherpa-shim: purego-friendly wrapper around sherpa-onnx's C API.
+// Purego can't access C struct fields and can't route C callbacks to Go
+// funcs directly. Every function here is a fixed-signature trampoline
+// that replaces one field read/write or callback handoff that the Go
+// backend would otherwise have to do through cgo.
+//
+// String lifetime: setters strdup; _free walks every owned string and
+// frees it. Callers may discard their input buffers the moment a setter
+// returns.
+//
+// Opaque handles are `void *` in both directions. Nothing here holds a
+// reference across calls except config handles (freed via _free) and
+// sherpa-allocated results (freed via sherpa's own Destroy* entry
+// points, which Go calls through purego pass-through).
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+// --- VAD config -----------------------------------------------------
+void *sherpa_shim_vad_config_new(void);
+void  sherpa_shim_vad_config_free(void *cfg);
+void  sherpa_shim_vad_config_set_silero_model(void *cfg, const char *path);
+void  sherpa_shim_vad_config_set_silero_threshold(void *cfg, float v);
+void  sherpa_shim_vad_config_set_silero_min_silence_duration(void *cfg, float v);
+void  sherpa_shim_vad_config_set_silero_min_speech_duration(void *cfg, float v);
+void  sherpa_shim_vad_config_set_silero_window_size(void *cfg, int32_t v);
+void  sherpa_shim_vad_config_set_silero_max_speech_duration(void *cfg, float v);
+void  sherpa_shim_vad_config_set_sample_rate(void *cfg, int32_t v);
+void  sherpa_shim_vad_config_set_num_threads(void *cfg, int32_t v);
+void  sherpa_shim_vad_config_set_provider(void *cfg, const char *v);
+void  sherpa_shim_vad_config_set_debug(void *cfg, int32_t v);
+void *sherpa_shim_create_vad(void *cfg, float buffer_size_seconds);
+
+// --- Offline TTS config (VITS path — the only TTS family the backend uses) ---
+void *sherpa_shim_tts_config_new(void);
+void  sherpa_shim_tts_config_free(void *cfg);
+void  sherpa_shim_tts_config_set_vits_model(void *cfg, const char *v);
+void  sherpa_shim_tts_config_set_vits_tokens(void *cfg, const char *v);
+void  sherpa_shim_tts_config_set_vits_lexicon(void *cfg, const char *v);
+void  sherpa_shim_tts_config_set_vits_data_dir(void *cfg, const char *v);
+void  sherpa_shim_tts_config_set_vits_noise_scale(void *cfg, float v);
+void  sherpa_shim_tts_config_set_vits_noise_scale_w(void *cfg, float v);
+void  sherpa_shim_tts_config_set_vits_length_scale(void *cfg, float v);
+void  sherpa_shim_tts_config_set_num_threads(void *cfg, int32_t v);
+void  sherpa_shim_tts_config_set_debug(void *cfg, int32_t v);
+void  sherpa_shim_tts_config_set_provider(void *cfg, const char *v);
+void  sherpa_shim_tts_config_set_max_num_sentences(void *cfg, int32_t v);
+void *sherpa_shim_create_offline_tts(void *cfg);
+
+// --- Offline recognizer config (Whisper / Paraformer / SenseVoice / Omnilingual) ---
+void *sherpa_shim_offline_recog_config_new(void);
+void  sherpa_shim_offline_recog_config_free(void *cfg);
+void  sherpa_shim_offline_recog_config_set_num_threads(void *cfg, int32_t v);
+void  sherpa_shim_offline_recog_config_set_debug(void *cfg, int32_t v);
+void  sherpa_shim_offline_recog_config_set_provider(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_tokens(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_feat_sample_rate(void *cfg, int32_t v);
+void  sherpa_shim_offline_recog_config_set_feat_feature_dim(void *cfg, int32_t v);
+void  sherpa_shim_offline_recog_config_set_decoding_method(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_whisper_encoder(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_whisper_decoder(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_whisper_language(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_whisper_task(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_whisper_tail_paddings(void *cfg, int32_t v);
+void  sherpa_shim_offline_recog_config_set_paraformer_model(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_sense_voice_model(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_sense_voice_language(void *cfg, const char *v);
+void  sherpa_shim_offline_recog_config_set_sense_voice_use_itn(void *cfg, int32_t v);
+void  sherpa_shim_offline_recog_config_set_omnilingual_model(void *cfg, const char *v);
+void *sherpa_shim_create_offline_recognizer(void *cfg);
+
+// --- Online recognizer config (streaming zipformer transducer) ---
+void *sherpa_shim_online_recog_config_new(void);
+void  sherpa_shim_online_recog_config_free(void *cfg);
+void  sherpa_shim_online_recog_config_set_transducer_encoder(void *cfg, const char *v);
+void  sherpa_shim_online_recog_config_set_transducer_decoder(void *cfg, const char *v);
+void  sherpa_shim_online_recog_config_set_transducer_joiner(void *cfg, const char *v);
+void  sherpa_shim_online_recog_config_set_tokens(void *cfg, const char *v);
+void  sherpa_shim_online_recog_config_set_num_threads(void *cfg, int32_t v);
+void  sherpa_shim_online_recog_config_set_debug(void *cfg, int32_t v);
+void  sherpa_shim_online_recog_config_set_provider(void *cfg, const char *v);
+void  sherpa_shim_online_recog_config_set_feat_sample_rate(void *cfg, int32_t v);
+void  sherpa_shim_online_recog_config_set_feat_feature_dim(void *cfg, int32_t v);
+void  sherpa_shim_online_recog_config_set_decoding_method(void *cfg, const char *v);
+void  sherpa_shim_online_recog_config_set_enable_endpoint(void *cfg, int32_t v);
+void  sherpa_shim_online_recog_config_set_rule1_min_trailing_silence(void *cfg, float v);
+void  sherpa_shim_online_recog_config_set_rule2_min_trailing_silence(void *cfg, float v);
+void  sherpa_shim_online_recog_config_set_rule3_min_utterance_length(void *cfg, float v);
+void *sherpa_shim_create_online_recognizer(void *cfg);
+
+// --- Result accessors (sherpa-allocated; caller destroys via sherpa's own Destroy*) ---
+int32_t      sherpa_shim_wave_sample_rate(const void *wave);
+int32_t      sherpa_shim_wave_num_samples(const void *wave);
+const float *sherpa_shim_wave_samples(const void *wave);
+
+const char *sherpa_shim_offline_result_text(const void *result);
+const char *sherpa_shim_online_result_text(const void *result);
+
+int32_t      sherpa_shim_generated_audio_sample_rate(const void *audio);
+int32_t      sherpa_shim_generated_audio_n(const void *audio);
+const float *sherpa_shim_generated_audio_samples(const void *audio);
+
+int32_t sherpa_shim_speech_segment_start(const void *seg);
+int32_t sherpa_shim_speech_segment_n(const void *seg);
+
+// --- TTS streaming callback trampoline -----------------------------
+// Replaces the //export sherpaTtsGoCallback + callbacks.c bridge pattern.
+// `callback_ptr` is the C-callable function pointer returned by
+// purego.NewCallback. `user_data` is an integer the Go side uses to
+// look up its state (sync.Map keyed by uint64).
+//
+// Returns the sherpa-allocated SherpaOnnxGeneratedAudio. Destroy with
+// SherpaOnnxDestroyOfflineTtsGeneratedAudio (callable directly from
+// Go via purego).
+void *sherpa_shim_tts_generate_with_callback(
+    void *tts, const char *text, int32_t sid, float speed,
+    uintptr_t callback_ptr, uintptr_t user_data);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
--- a/backend/go/sherpa-onnx/main.go
+++ b/backend/go/sherpa-onnx/main.go
@@ -0,0 +1,23 @@
+package main
+
+import (
+	"flag"
+
+	grpc "github.com/mudler/LocalAI/pkg/grpc"
+)
+
+var (
+	addr = flag.String("addr", "localhost:50051", "the address to connect to")
+)
+
+func main() {
+	flag.Parse()
+
+	if err := loadSherpaLibs(); err != nil {
+		panic(err)
+	}
+
+	if err := grpc.StartServer(*addr, &SherpaBackend{}); err != nil {
+		panic(err)
+	}
+}
--- a/backend/go/sherpa-onnx/package.sh
+++ b/backend/go/sherpa-onnx/package.sh
@@ -0,0 +1,51 @@
+#!/bin/bash
+set -e
+
+CURDIR=$(dirname "$(realpath $0)")
+REPO_ROOT="${CURDIR}/../../.."
+
+mkdir -p $CURDIR/package/lib
+
+cp -avf $CURDIR/sherpa-onnx $CURDIR/package/
+cp -avf $CURDIR/run.sh $CURDIR/package/
+cp -rfLv $CURDIR/backend-assets/lib/* $CURDIR/package/lib/
+
+if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
+    echo "Detected x86_64 architecture, copying x86_64 libraries..."
+    cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
+    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
+    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
+elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
+    echo "Detected ARM64 architecture, copying ARM64 libraries..."
+    cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
+    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
+    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
+elif [ $(uname -s) = "Darwin" ]; then
+    echo "Detected Darwin"
+else
+    echo "Error: Could not detect architecture"
+    exit 1
+fi
+
+GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
+if [ -f "$GPU_LIB_SCRIPT" ]; then
+    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
+    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
+    package_gpu_libs
+fi
+
+echo "Packaging completed successfully"
+ls -liah $CURDIR/package/
+ls -liah $CURDIR/package/lib/
--- a/backend/go/sherpa-onnx/run.sh
+++ b/backend/go/sherpa-onnx/run.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+set -ex
+
+CURDIR=$(dirname "$(realpath "$0")")
+
+export LD_LIBRARY_PATH="$CURDIR/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
+
+if [ -f "$CURDIR/lib/ld.so" ]; then
+	echo "Using lib/ld.so"
+	exec "$CURDIR/lib/ld.so" "$CURDIR/sherpa-onnx" "$@"
+fi
+
+exec "$CURDIR/sherpa-onnx" "$@"
--- a/backend/go/sherpa-onnx/test.sh
+++ b/backend/go/sherpa-onnx/test.sh
@@ -0,0 +1,12 @@
+#!/bin/bash
+# Unit tests for the sherpa-onnx backend. Exercises error-path and
+# dispatch logic via SherpaBackend directly (no gRPC). Integration
+# coverage (gRPC TTS / streaming ASR / realtime pipeline) lives in
+# tests/e2e-backends and tests/e2e and runs against the Docker image.
+set -e
+
+CURDIR=$(dirname "$(realpath $0)")
+cd "$CURDIR"
+
+PACKAGES=$(go list ./... | grep -v /sources/)
+go test -v -timeout 60s $PACKAGES
--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -1006,6 +1006,23 @@
    nvidia: "cuda12-neutts"
    amd: "rocm-neutts"
    nvidia-cuda-12: "cuda12-neutts"
+- &sherpa-onnx
+  name: "sherpa-onnx"
+  alias: "sherpa-onnx"
+  urls:
+    - https://k2-fsa.github.io/sherpa/onnx/
+  description: |
+    Sherpa-ONNX backend for text-to-speech (VITS, Matcha, Kokoro), speech-to-text (Whisper, Paraformer, SenseVoice, Omnilingual ASR CTC), and voice activity detection via ONNX Runtime.
+    Supports multi-speaker voices, 1600+ language ASR, and GPU acceleration.
+  tags:
+    - text-to-speech
+    - TTS
+    - speech-to-text
+    - ASR
+  capabilities:
+    default: "cpu-sherpa-onnx"
+    nvidia: "cuda12-sherpa-onnx"
+    nvidia-cuda-12: "cuda12-sherpa-onnx"
 - !!merge <<: *neutts
  name: "neutts-development"
  capabilities:
@@ -3834,3 +3851,30 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-speaker-recognition"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-speaker-recognition
+## sherpa-onnx
+- !!merge <<: *sherpa-onnx
+  name: "sherpa-onnx-development"
+  capabilities:
+    default: "cpu-sherpa-onnx-development"
+    nvidia: "cuda12-sherpa-onnx-development"
+    nvidia-cuda-12: "cuda12-sherpa-onnx-development"
+- !!merge <<: *sherpa-onnx
+  name: "cpu-sherpa-onnx"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-sherpa-onnx"
+  mirrors:
+    - localai/localai-backends:latest-cpu-sherpa-onnx
+- !!merge <<: *sherpa-onnx
+  name: "cpu-sherpa-onnx-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-sherpa-onnx"
+  mirrors:
+    - localai/localai-backends:master-cpu-sherpa-onnx
+- !!merge <<: *sherpa-onnx
+  name: "cuda12-sherpa-onnx"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-sherpa-onnx"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-12-sherpa-onnx
+- !!merge <<: *sherpa-onnx
+  name: "cuda12-sherpa-onnx-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sherpa-onnx"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-12-sherpa-onnx
--- a/backend/python/insightface/engines.py
+++ b/backend/python/insightface/engines.py
@@ -173,6 +173,30 @@ def _build_antispoofer(options: dict[str, str], model_dir: str | None) -> Antisp

 # ─── InsightFaceEngine ────────────────────────────────────────────────

+# Canonical ONNX manifest for each upstream insightface pack (v0.7 release
+# at github.com/deepinsight/insightface/releases). LocalAI's gallery extracts
+# these zips flat into the models directory, so when multiple packs or other
+# backends drop their own ONNX files alongside, the glob-the-directory
+# approach picks up foreign files and insightface's model_zoo.get_model()
+# raises IndexError trying to index `input_shape[2]` on a tensor that isn't
+# shaped like a face model. The manifest lets us pre-filter to only the
+# files that actually belong to the requested pack — deterministic, correct
+# pack choice, no crashes on neighbour ONNX files.
+_KNOWN_PACK_MANIFESTS: dict[str, frozenset[str]] = {
+    "buffalo_l": frozenset({
+        "det_10g.onnx",
+        "w600k_r50.onnx",
+        "genderage.onnx",
+        "2d106det.onnx",
+        "1k3d68.onnx",
+    }),
+    "buffalo_sc": frozenset({
+        "det_500m.onnx",
+        "w600k_mbf.onnx",
+    }),
+}
+
+
 class InsightFaceEngine:
    """Drives insightface's model_zoo directly — no FaceAnalysis wrapper.

@@ -222,6 +246,21 @@ class InsightFaceEngine:
            )

        onnx_files = sorted(glob.glob(os.path.join(pack_dir, "*.onnx")))
+        # When the pack extracts flat into a shared models directory it
+        # mixes with ONNX files from other backends (opencv face engine,
+        # MiniFASNet antispoof, WeSpeaker voice embedding, other buffalo
+        # packs installed earlier). Feeding those into model_zoo.get_model()
+        # blows up inside insightface's router — it assumes a 4-D NCHW
+        # input and indexes `input_shape[2]` on tensors that aren't shaped
+        # like a face model, raising IndexError. For the upstream packs we
+        # know the exact ONNX manifest; scoping to it makes the load
+        # deterministic (without it, det_10g.onnx from buffalo_l sorts
+        # before det_500m.onnx from buffalo_sc and silently wins).
+        manifest = _KNOWN_PACK_MANIFESTS.get(self.model_pack)
+        if manifest is not None:
+            scoped = [f for f in onnx_files if os.path.basename(f) in manifest]
+            if scoped:
+                onnx_files = scoped
        if not onnx_files:
            raise ValueError(f"no ONNX files in pack directory: {pack_dir}")

@@ -231,14 +270,31 @@ class InsightFaceEngine:
        self._providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]

        self.models = {}
+        skipped: list[tuple[str, str]] = []
        for onnx_file in onnx_files:
-            m = model_zoo.get_model(onnx_file, providers=self._providers)
+            try:
+                m = model_zoo.get_model(onnx_file, providers=self._providers)
+            except Exception as err:
+                # Foreign ONNX (wrong rank/shape, non-insightface model) —
+                # older insightface versions raise IndexError / ValueError
+                # instead of returning None. Keep loading the rest.
+                skipped.append((os.path.basename(onnx_file), str(err)))
+                continue
            if m is None:
+                skipped.append((os.path.basename(onnx_file), "unknown taskname"))
                continue
            # First occurrence of each taskname wins (matches FaceAnalysis).
            if m.taskname not in self.models:
                self.models[m.taskname] = m

+        if skipped:
+            import sys
+            print(
+                f"[insightface] skipped {len(skipped)} non-pack ONNX file(s) in {pack_dir}: "
+                + ", ".join(f"{n} ({why})" for n, why in skipped),
+                file=sys.stderr,
+            )
+
        if "detection" not in self.models:
            raise ValueError(f"no detector (taskname='detection') found in {pack_dir}")
        self.det_model = self.models["detection"]
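Reviewer sketch (not part of the patch): the manifest scoping above can be exercised without insightface installed. The helper name `scope_to_pack`, the `KNOWN_PACK_MANIFESTS` copy, and the `/models/...` paths are hypothetical and exist only for illustration. The fallback to the unfiltered list mirrors the `if scoped:` guard in the hunk.

```python
import os

# Illustrative copy of the manifest table added in the diff.
KNOWN_PACK_MANIFESTS = {
    "buffalo_l": frozenset({
        "det_10g.onnx", "w600k_r50.onnx", "genderage.onnx",
        "2d106det.onnx", "1k3d68.onnx",
    }),
    "buffalo_sc": frozenset({"det_500m.onnx", "w600k_mbf.onnx"}),
}


def scope_to_pack(onnx_files, model_pack):
    """Return only the files named in the pack's manifest.

    Falls back to the unfiltered list when the pack is unknown or when
    nothing in the directory matches the manifest.
    """
    manifest = KNOWN_PACK_MANIFESTS.get(model_pack)
    if manifest is None:
        return list(onnx_files)
    scoped = [f for f in onnx_files if os.path.basename(f) in manifest]
    return scoped or list(onnx_files)


# A flat, shared models directory holding two packs plus a foreign model.
shared_dir = sorted([
    "/models/det_10g.onnx",
    "/models/det_500m.onnx",
    "/models/w600k_mbf.onnx",
    "/models/wespeaker.onnx",
])
print(scope_to_pack(shared_dir, "buffalo_sc"))
```

Without the filter, `det_10g.onnx` sorts first and would claim the `detection` slot even when `buffalo_sc` was requested.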
--- a/backend/python/speaker-recognition/engines.py
+++ b/backend/python/speaker-recognition/engines.py
@@ -317,8 +317,23 @@ class OnnxDirectEngine:
        else:
            provider_list = ["CPUExecutionProvider"]
        self._session = ort.InferenceSession(onnx_path, providers=provider_list)
-        self._input_name = self._session.get_inputs()[0].name
+        input_meta = self._session.get_inputs()[0]
+        self._input_name = input_meta.name
+        # Pre-exported speaker encoders come in two shapes:
+        #   rank-2  [batch, samples]          — some 3D-Speaker exports feed raw waveform.
+        #   rank-3  [batch, frames, n_mels]   — WeSpeaker and most Kaldi-lineage encoders
+        #                                        expect pre-computed Kaldi FBank features.
+        # We detect this at load time and branch in embed(), because feeding raw audio
+        # into a rank-3 graph is exactly what triggered
+        # "Invalid rank for input: feats Got: 2 Expected: 3".
+        self._input_rank = len(input_meta.shape) if input_meta.shape is not None else 2
        self._expected_sr = int(options.get("sample_rate", "16000"))
+        self._fbank_mels = int(options.get("fbank_num_mel_bins", "80"))
+        self._fbank_frame_length_ms = float(options.get("fbank_frame_length_ms", "25"))
+        self._fbank_frame_shift_ms = float(options.get("fbank_frame_shift_ms", "10"))
+        # Per-utterance cepstral mean normalisation — on for WeSpeaker by default,
+        # toggleable for encoders that expect raw FBank.
+        self._fbank_cmn = options.get("fbank_cmn", "true").lower() in ("1", "true", "yes")
        self._analysis = AnalysisHead(options)

    def _load_waveform(self, path: str):
@@ -344,11 +359,37 @@ class OnnxDirectEngine:
        import numpy as np

        audio = self._load_waveform(audio_path)
-        feed = audio.reshape(1, -1)
+        if self._input_rank >= 3:
+            feats = self._extract_fbank(audio)        # [frames, n_mels]
+            feed = feats[np.newaxis, :, :]             # [1, frames, n_mels]
+        else:
+            feed = audio.reshape(1, -1)                # [1, samples]
        out = self._session.run(None, {self._input_name: feed})
        vec = np.asarray(out[0]).reshape(-1)
        return [float(x) for x in vec]

+    def _extract_fbank(self, audio):
+        """Compute Kaldi-style FBank features (80 mel bins by default) for
+        speaker encoders that expect pre-featurised input (WeSpeaker, most
+        3D-Speaker exports). torchaudio is already a backend dependency for
+        SpeechBrain, so no new package is required."""
+        import numpy as np
+        import torch  # type: ignore
+        import torchaudio.compliance.kaldi as kaldi  # type: ignore
+
+        tensor = torch.from_numpy(audio).unsqueeze(0)  # [1, samples]
+        feats = kaldi.fbank(
+            tensor,
+            sample_frequency=self._expected_sr,
+            num_mel_bins=self._fbank_mels,
+            frame_length=self._fbank_frame_length_ms,
+            frame_shift=self._fbank_frame_shift_ms,
+            dither=0.0,
+        )  # [frames, n_mels]
+        if self._fbank_cmn:
+            feats = feats - feats.mean(dim=0, keepdim=True)
+        return feats.numpy().astype(np.float32)
+
    def compare(self, audio1: str, audio2: str) -> float:
        return _cosine_distance(self.embed(audio1), self.embed(audio2))

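Reviewer sketch (not part of the patch): the `fbank_cmn` branch subtracts the per-bin mean taken over all frames of one utterance. The pure-Python version below shows that arithmetic on a tiny `[frames][n_mels]` matrix. The function name and the sample values are made up for illustration, and the real code does this on a torch tensor via `feats.mean(dim=0, keepdim=True)`.

```python
def cepstral_mean_normalise(feats):
    """Subtract the per-bin mean over all frames; feats is [frames][n_mels]."""
    if not feats:
        return []
    n_frames = len(feats)
    n_bins = len(feats[0])
    # One mean per mel bin, computed across the time axis.
    means = [sum(frame[b] for frame in feats) / n_frames for b in range(n_bins)]
    return [[frame[b] - means[b] for b in range(n_bins)] for frame in feats]


# Two frames, two mel bins: the bin means are 2.0 and 12.0.
feats = [[1.0, 10.0], [3.0, 14.0]]
normalised = cepstral_mean_normalise(feats)
print(normalised)
```

After normalisation every bin averages to zero across the utterance, which removes constant channel offsets before the encoder sees the features.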
--- a/core/application/application.go
+++ b/core/application/application.go
@@ -81,18 +81,30 @@ func newApplication(appConfig *config.ApplicationConfig) *Application {
 	// The resolver closes over the ModelLoader so the Registry stays
 	// decoupled from loader plumbing; swapping in a postgres-backed
 	// implementation later is a single construction change here.
+	//
+	// `faceStoreName` and `voiceStoreName` are the default namespaces
+	// passed to StoreBackend when the request doesn't override them. Face
+	// and voice MUST use distinct namespaces: the local-store gRPC surface
+	// rejects mixed dimensions inside one namespace ("Try to add key with
+	// length N when existing length is M"). ArcFace buffalo_l produces
+	// 512-dim embeddings while ECAPA-TDNN produces 192-dim; enrolling one
+	// after the other into a shared namespace is exactly how we hit that error.
+	const (
+		faceStoreName  = "localai-face-biometrics"
+		voiceStoreName = "localai-voice-biometrics"
+	)
 	faceStoreResolver := func(_ context.Context, storeName string) (pkggrpc.Backend, error) {
 		return corebackend.StoreBackend(ml, appConfig, storeName, "")
 	}
-	app.faceRegistry = facerecognition.NewStoreRegistry(faceStoreResolver, "", faceEmbeddingDim)
+	app.faceRegistry = facerecognition.NewStoreRegistry(faceStoreResolver, faceStoreName, faceEmbeddingDim)

 	// Voice (speaker) recognition registry — same plumbing, separate
-	// registry so embedding spaces stay isolated (a face vector and a
-	// speaker vector are not comparable).
+	// namespace so embedding spaces stay isolated (a face vector and a
+	// speaker vector are not comparable and differ in dimensionality).
 	voiceStoreResolver := func(_ context.Context, storeName string) (pkggrpc.Backend, error) {
 		return corebackend.StoreBackend(ml, appConfig, storeName, "")
 	}
-	app.voiceRegistry = voicerecognition.NewStoreRegistry(voiceStoreResolver, "", voiceEmbeddingDim)
+	app.voiceRegistry = voicerecognition.NewStoreRegistry(voiceStoreResolver, voiceStoreName, voiceEmbeddingDim)

 	return app
 }
--- a/core/application/startup.go
+++ b/core/application/startup.go
@@ -242,6 +242,12 @@ func New(opts ...config.AppOption) (*Application, error) {
 		bmFn := func() galleryop.BackendManager { return application.GalleryService().BackendManager() }
 		uc := NewUpgradeChecker(options, application.ModelLoader(), application.distributedDB(), bmFn)
 		application.upgradeChecker = uc
+		// Refresh the upgrade cache the moment a backend op finishes — otherwise
+		// the UI keeps showing a just-upgraded backend as upgradeable until the
+		// next 6-hour tick. TriggerCheck is non-blocking.
+		if gs := application.GalleryService(); gs != nil {
+			gs.OnBackendOpCompleted = uc.TriggerCheck
+		}
 		go uc.Run(options.Context)
 	}

--- a/core/backend/stores.go
+++ b/core/backend/stores.go
@@ -11,8 +11,17 @@ func StoreBackend(sl *model.ModelLoader, appConfig *config.ApplicationConfig, st
 	if backend == "" {
 		backend = model.LocalStoreBackend
 	}
+	// ModelLoader caches backend processes by `modelID`, not by the `model`
+	// passed via WithModel. Without a distinct modelID, every StoreBackend
+	// call collapses to the same `modelID=""` cache slot — face (512-D) and
+	// voice (192-D) biometrics would then share the same local-store process
+	// and the second enrollment would fail with
+	//   Try to add key with length N when existing length is M
+	// Use the store namespace as modelID so each namespace gets its own
+	// process instance and its own in-memory Store{}.
 	sc := []model.Option{
 		model.WithBackendString(backend),
+		model.WithModelID(storeName),
 		model.WithModel(storeName),
 	}

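Reviewer sketch (not part of the patch): a toy model of the cache collision described in the comment above. `ToyStore` and `ToyLoader` are hypothetical stand-ins, not LocalAI types. They assume only what the comment states: the loader caches one backend per model ID, and a store rejects keys whose length differs from the first key it saw.

```python
class ToyStore:
    """In-memory store that rejects mixed key lengths, like local-store."""

    def __init__(self):
        self.dim = None
        self.keys = []

    def add(self, key):
        if self.dim is None:
            self.dim = len(key)
        elif len(key) != self.dim:
            raise ValueError(
                f"Try to add key with length {len(key)} "
                f"when existing length is {self.dim}"
            )
        self.keys.append(key)


class ToyLoader:
    """Caches one store per model_id (assumed loader behaviour)."""

    def __init__(self):
        self._cache = {}

    def load(self, model_id):
        if model_id not in self._cache:
            self._cache[model_id] = ToyStore()
        return self._cache[model_id]


loader = ToyLoader()

# Before the fix: both registries resolve to the same "" cache slot.
loader.load("").add([0.0] * 512)       # face embedding
try:
    loader.load("").add([0.0] * 192)   # voice embedding hits the same store
    collided = False
except ValueError:
    collided = True

# After the fix: the namespace doubles as the model ID.
face = loader.load("localai-face-biometrics")
voice = loader.load("localai-voice-biometrics")
face.add([0.0] * 512)
voice.add([0.0] * 192)
print(collided, face.dim, voice.dim)
```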
--- a/core/config/meta/constants.go
+++ b/core/config/meta/constants.go
@@ -37,6 +37,14 @@ var CacheTypeOptions = []FieldOption{
 	{Value: "q4_1", Label: "Q4_1"},
 	{Value: "q5_0", Label: "Q5_0"},
 	{Value: "q5_1", Label: "Q5_1"},
+	// TurboQuant KV-cache types — accepted by the turboquant and
+	// buun-llama-cpp fork backends; stock llama-cpp will reject them at load.
+	{Value: "turbo2", Label: "Turbo2 (TurboQuant)"},
+	{Value: "turbo3", Label: "Turbo3 (TurboQuant)"},
+	{Value: "turbo4", Label: "Turbo4 (TurboQuant)"},
+	// Trellis-Coded Quantization variants — buun-llama-cpp only.
+	{Value: "turbo2_tcq", Label: "Turbo2 TCQ (buun-llama-cpp)"},
+	{Value: "turbo3_tcq", Label: "Turbo3 TCQ (buun-llama-cpp)"},
 }

 var DiffusersPipelineOptions = []FieldOption{
--- a/core/config/model_config.go
+++ b/core/config/model_config.go
@@ -767,7 +767,7 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
 	}

 	if (u & FLAG_VAD) == FLAG_VAD {
-		if c.Backend != "silero-vad" && !(c.Backend == "whisper" && slices.Contains(c.Options, "vad_only")) {
+		if c.Backend != "silero-vad" && c.Backend != "sherpa-onnx" && !(c.Backend == "whisper" && slices.Contains(c.Options, "vad_only")) {
 			return false
 		}
 	}
--- a/core/gallery/backends.go
+++ b/core/gallery/backends.go
@@ -194,6 +194,20 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL

 	name := config.Name
 	backendPath := filepath.Join(systemState.Backend.BackendsPath, name)
+	// Clean up legacy flat-layout artefacts: earlier dev builds of the
+	// golang backends dropped the compiled binary directly at
+	// `<backendsPath>/<name>` (a plain file) instead of
+	// `<backendsPath>/<name>/<name>` (the nested layout the current code
+	// expects). MkdirAll below returns ENOTDIR when such a stale file
+	// exists, permanently blocking any reinstall or upgrade. Remove the
+	// file first so the install can proceed; the new install will write
+	// the correct nested layout, including metadata.json + run.sh.
+	if fi, statErr := os.Lstat(backendPath); statErr == nil && !fi.IsDir() {
+		xlog.Warn("removing stale non-directory backend artefact to make room for fresh install", "path", backendPath)
+		if rmErr := os.Remove(backendPath); rmErr != nil {
+			return fmt.Errorf("failed to remove stale backend artefact at %s: %w", backendPath, rmErr)
+		}
+	}
 	err = os.MkdirAll(backendPath, 0750)
 	if err != nil {
 		return fmt.Errorf("failed to create base path: %v", err)
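Reviewer sketch (not part of the patch): the same clean-up order, translated to Python so it can be run in a scratch directory. The helper name `prepare_backend_dir` is hypothetical. The point is only the sequence: `lstat`, remove the entry if it is not a directory, then create the directory.

```python
import os
import stat
import tempfile


def prepare_backend_dir(backend_path):
    """Remove a stale non-directory entry, then create the directory."""
    try:
        info = os.lstat(backend_path)
    except FileNotFoundError:
        info = None
    if info is not None and not stat.S_ISDIR(info.st_mode):
        # Legacy flat layout: a plain file sits where the directory belongs.
        os.remove(backend_path)
    os.makedirs(backend_path, mode=0o750, exist_ok=True)


with tempfile.TemporaryDirectory() as backends_path:
    stale = os.path.join(backends_path, "sherpa-onnx")
    with open(stale, "w") as fh:
        fh.write("legacy flat-layout binary")
    prepare_backend_dir(stale)
    is_dir_now = os.path.isdir(stale)

print(is_dir_now)
```

Using `lstat` rather than `stat` means a dangling symlink at the backend path is also treated as a stale entry and removed.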
--- a/core/gallery/importers/llama-cpp.go
+++ b/core/gallery/importers/llama-cpp.go
@@ -34,6 +34,7 @@ func (i *LlamaCPPImporter) AdditionalBackends() []KnownBackendEntry {
 	return []KnownBackendEntry{
 		{Name: "ik-llama-cpp", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with ik-quants"},
 		{Name: "turboquant", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with TurboQuant optimizations"},
+		{Name: "buun-llama-cpp", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with DFlash speculative decoding and TurboQuant/TCQ KV-cache quantization"},
 	}
 }

@@ -127,7 +128,7 @@ func (i *LlamaCPPImporter) Import(details Details) (gallery.ModelConfig, error)
 	backend := "llama-cpp"
 	if b, ok := preferencesMap["backend"].(string); ok {
 		switch b {
-		case "ik-llama-cpp", "turboquant":
+		case "ik-llama-cpp", "turboquant", "buun-llama-cpp":
 			backend = b
 		}
 	}
--- a/core/gallery/importers/llama-cpp_test.go
+++ b/core/gallery/importers/llama-cpp_test.go
@@ -181,6 +181,23 @@ var _ = Describe("LlamaCPPImporter", func() {
 			Expect(modelConfig.Files[0].Filename).To(Equal("my-model.gguf"))
 		})

+		It("swaps the emitted backend to buun-llama-cpp when preferred", func() {
+			preferences := json.RawMessage(`{"backend": "buun-llama-cpp"}`)
+			details := Details{
+				URI:         "https://example.com/my-model.gguf",
+				Preferences: preferences,
+			}
+
+			modelConfig, err := importer.Import(details)
+
+			Expect(err).ToNot(HaveOccurred())
+			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: buun-llama-cpp"), fmt.Sprintf("Model config: %+v", modelConfig))
+			Expect(modelConfig.ConfigFile).NotTo(ContainSubstring("backend: llama-cpp\n"), fmt.Sprintf("Model config: %+v", modelConfig))
+			Expect(modelConfig.ConfigFile).To(ContainSubstring("model: my-model.gguf"), fmt.Sprintf("Model config: %+v", modelConfig))
+			Expect(len(modelConfig.Files)).To(Equal(1))
+			Expect(modelConfig.Files[0].Filename).To(Equal("my-model.gguf"))
+		})
+
 		It("keeps backend: llama-cpp for unknown backend preferences", func() {
 			// Unknown backend values must not leak into the emitted YAML —
 			// we only honour the two curated drop-in replacements.
@@ -375,7 +392,7 @@ var _ = Describe("LlamaCPPImporter", func() {
 	})

 	Context("AdditionalBackends", func() {
-		It("advertises ik-llama-cpp and turboquant as drop-in replacements", func() {
+		It("advertises ik-llama-cpp, turboquant, and buun-llama-cpp as drop-in replacements", func() {
 			entries := importer.AdditionalBackends()

 			names := make([]string, 0, len(entries))
@@ -384,7 +401,7 @@ var _ = Describe("LlamaCPPImporter", func() {
 				names = append(names, e.Name)
 				byName[e.Name] = e
 			}
-			Expect(names).To(ConsistOf("ik-llama-cpp", "turboquant"))
+			Expect(names).To(ConsistOf("ik-llama-cpp", "turboquant", "buun-llama-cpp"))

 			ik := byName["ik-llama-cpp"]
 			Expect(ik.Modality).To(Equal("text"))
@@ -393,6 +410,10 @@ var _ = Describe("LlamaCPPImporter", func() {
 			tq := byName["turboquant"]
 			Expect(tq.Modality).To(Equal("text"))
 			Expect(tq.Description).NotTo(BeEmpty())
+
+			bn := byName["buun-llama-cpp"]
+			Expect(bn.Modality).To(Equal("text"))
+			Expect(bn.Description).NotTo(BeEmpty())
 		})
 	})
 })
--- a/core/http/endpoints/anthropic/messages.go
+++ b/core/http/endpoints/anthropic/messages.go
@@ -880,7 +880,7 @@ func convertAnthropicTools(input *schema.AnthropicRequest, cfg *config.ModelConf
 			if tcType, ok := tc["type"].(string); ok && tcType == "tool" {
 				if name, ok := tc["name"].(string); ok {
 					// Force specific tool
-					cfg.SetFunctionCallString(name)
+					cfg.SetFunctionCallNameString(name)
 				}
 			}
 		}
--- a/core/http/endpoints/localai/audio.go
+++ b/core/http/endpoints/localai/audio.go
@@ -14,7 +14,13 @@ import (
 	"github.com/mudler/LocalAI/pkg/utils"
 )

-var audioDataURIPattern = regexp.MustCompile(`^data:([^;]+);base64,`)
+// Match `data:<mime>[;param=value...];base64,` — MediaRecorder in the browser
+// produces data URIs like `data:audio/webm;codecs=opus;base64,...`, so the
+// pre-`;base64,` section can contain zero or more parameter segments. The
+// old `([^;]+)` form only matched exactly one segment and left recordings
+// from the React UI's live-capture tab unparsed, which then failed base64
+// decoding on the leading `data:` bytes.
+var audioDataURIPattern = regexp.MustCompile(`^data:[^,]+?;base64,`)

 var audioDownloadClient = http.Client{Timeout: 30 * time.Second}

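Reviewer sketch (not part of the patch): both patterns use syntax that Python's `re` module interprets the same way as Go's `regexp`, so the behavioural difference can be checked quickly in Python. The payload bytes and the `strip_prefix` helper are made up for illustration.

```python
import base64
import re

OLD_PATTERN = re.compile(r"^data:([^;]+);base64,")
NEW_PATTERN = re.compile(r"^data:[^,]+?;base64,")

payload = base64.b64encode(b"fake-audio").decode("ascii")
plain = "data:audio/wav;base64," + payload
recorded = "data:audio/webm;codecs=opus;base64," + payload


def strip_prefix(pattern, uri):
    """Return the base64 payload, or None when the prefix does not match."""
    m = pattern.match(uri)
    return uri[m.end():] if m else None


# Single-segment URIs work with both patterns.
print(strip_prefix(OLD_PATTERN, plain) == payload)
print(strip_prefix(NEW_PATTERN, plain) == payload)

# The extra ";codecs=opus" parameter defeats the old pattern only.
print(strip_prefix(OLD_PATTERN, recorded))
print(base64.b64decode(strip_prefix(NEW_PATTERN, recorded)))
```

Note that the new pattern has no capture group, so any caller that read the MIME type from submatch 1 would need a separate check.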
--- a/core/http/endpoints/openai/realtime.go
+++ b/core/http/endpoints/openai/realtime.go
@@ -1315,13 +1315,35 @@ func triggerResponse(ctx context.Context, session *Session, conv *Conversation,
 	}
 	thinkingStartToken := reasoning.DetectThinkingStartToken(template, &config.ReasoningConfig)

-	reasoningText, responseWithoutReasoning := reasoning.ExtractReasoningWithConfig(rawResponse, thinkingStartToken, config.ReasoningConfig)
+	// When the C++ autoparser emitted ChatDeltas with actionable data,
+	// prefer them — the backend clears Reply.Message in that path and
+	// delivers parsed content/reasoning/tool-calls via the delta stream
+	// (see pkg/functions/chat_deltas.go, mirrored from chat.go's non-SSE
+	// handling). Without this, Response is empty and realtime would
+	// synthesize silence for replies that actually produced tokens.
+	var reasoningText, responseWithoutReasoning, textContent, cleanedResponse string
+	var toolCalls []functions.FuncCallResults
+	deltaToolCalls := functions.ToolCallsFromChatDeltas(pred.ChatDeltas)
+	deltaContent := functions.ContentFromChatDeltas(pred.ChatDeltas)
+	deltaReasoning := functions.ReasoningFromChatDeltas(pred.ChatDeltas)
+	if len(deltaToolCalls) > 0 || deltaContent != "" {
+		xlog.Debug("[ChatDeltas] realtime: using C++ autoparser deltas",
+			"tool_calls", len(deltaToolCalls),
+			"content_len", len(deltaContent),
+			"reasoning_len", len(deltaReasoning))
+		reasoningText = deltaReasoning
+		responseWithoutReasoning = deltaContent
+		textContent = deltaContent
+		cleanedResponse = deltaContent
+		toolCalls = deltaToolCalls
+	} else {
+		reasoningText, responseWithoutReasoning = reasoning.ExtractReasoningWithConfig(rawResponse, thinkingStartToken, config.ReasoningConfig)
+		textContent = functions.ParseTextContent(responseWithoutReasoning, config.FunctionsConfig)
+		cleanedResponse = functions.CleanupLLMResult(responseWithoutReasoning, config.FunctionsConfig)
+		toolCalls = functions.ParseFunctionCall(cleanedResponse, config.FunctionsConfig)
+	}
 	xlog.Debug("LLM Response", "reasoning", reasoningText, "response_without_reasoning", responseWithoutReasoning)

-	textContent := functions.ParseTextContent(responseWithoutReasoning, config.FunctionsConfig)
-	cleanedResponse := functions.CleanupLLMResult(responseWithoutReasoning, config.FunctionsConfig)
-	toolCalls := functions.ParseFunctionCall(cleanedResponse, config.FunctionsConfig)
-
 	xlog.Debug("Function call parsing", "textContent", textContent, "cleanedResponse", cleanedResponse, "toolCallsCount", len(toolCalls))

 	noActionName := "answer"
--- a/core/http/endpoints/openai/realtime_model.go
+++ b/core/http/endpoints/openai/realtime_model.go
@@ -168,7 +168,7 @@ func (m *wrappedModel) Predict(ctx context.Context, messages schema.Messages, im
 			}
 		} else if toolChoice.Function != nil {
 			// Specific function specified
-			m.LLMConfig.SetFunctionCallString(toolChoice.Function.Name)
+			m.LLMConfig.SetFunctionCallNameString(toolChoice.Function.Name)
 		}
 	}

--- a/core/http/endpoints/openresponses/responses.go
+++ b/core/http/endpoints/openresponses/responses.go
@@ -773,7 +773,7 @@ func convertORToolsToFunctions(input *schema.OpenResponsesRequest, cfg *config.M
 		case map[string]any:
 			if tcType, ok := tc["type"].(string); ok && tcType == "function" {
 				if name, ok := tc["name"].(string); ok {
-					cfg.SetFunctionCallString(name)
+					cfg.SetFunctionCallNameString(name)
 				}
 			}
 		}
--- a/core/http/react-ui/src/App.css
+++ b/core/http/react-ui/src/App.css
--- a/core/http/react-ui/src/components/Sidebar.jsx
+++ b/core/http/react-ui/src/components/Sidebar.jsx
@@ -24,6 +24,18 @@ const sections = [
      { path: '/app/quantize', icon: 'fas fa-compress', label: 'Quantize (Experimental)', feature: 'quantization' },
    ],
  },
+  {
+    id: 'biometrics',
+    title: 'Biometrics',
+    featureMap: {
+      '/app/face': 'face_recognition',
+      '/app/voice': 'voice_recognition',
+    },
+    items: [
+      { path: '/app/face', icon: 'fas fa-face-smile', label: 'Face Recognition', feature: 'face_recognition' },
+      { path: '/app/voice', icon: 'fas fa-microphone-lines', label: 'Voice Recognition', feature: 'voice_recognition' },
+    ],
+  },
  {
    id: 'agents',
    title: 'Agents',
--- a/core/http/react-ui/src/components/biometrics/BoundingBoxCanvas.jsx
+++ b/core/http/react-ui/src/components/biometrics/BoundingBoxCanvas.jsx
@@ -0,0 +1,63 @@
+import { useEffect, useRef, useState } from 'react'
+
+// BoundingBoxCanvas — overlay face-detection rectangles on the user-supplied image.
+// boxes: [{ x, y, w, h, label?, sublabel?, tone? }]
+// tone: 'default' | 'success' | 'warning' | 'error' | 'accent'
+export default function BoundingBoxCanvas({ src, boxes = [], alt = '' }) {
+  const wrapRef = useRef(null)
+  const imgRef = useRef(null)
+  const [dims, setDims] = useState({ w: 0, h: 0, natW: 0, natH: 0 })
+
+  useEffect(() => {
+    const update = () => {
+      if (!wrapRef.current || !imgRef.current) return
+      const rect = imgRef.current.getBoundingClientRect()
+      setDims({
+        w: rect.width,
+        h: rect.height,
+        natW: imgRef.current.naturalWidth || 1,
+        natH: imgRef.current.naturalHeight || 1,
+      })
+    }
+    update()
+    const ro = new ResizeObserver(update)
+    if (imgRef.current) ro.observe(imgRef.current)
+    window.addEventListener('resize', update)
+    return () => {
+      ro.disconnect()
+      window.removeEventListener('resize', update)
+    }
+  }, [src])
+
+  const sx = dims.natW ? dims.w / dims.natW : 1
+  const sy = dims.natH ? dims.h / dims.natH : 1
+
+  return (
+    <div ref={wrapRef} className="biometrics-bbox">
+      {src && <img ref={imgRef} src={src} alt={alt} onLoad={(e) => {
+        setDims({
+          w: e.target.getBoundingClientRect().width,
+          h: e.target.getBoundingClientRect().height,
+          natW: e.target.naturalWidth,
+          natH: e.target.naturalHeight,
+        })
+      }} />}
+      {boxes.map((b, i) => (
+        <div key={i} className={`biometrics-bbox__box tone-${b.tone || 'accent'}`}
+          style={{
+            left: `${b.x * sx}px`,
+            top: `${b.y * sy}px`,
+            width: `${b.w * sx}px`,
+            height: `${b.h * sy}px`,
+          }}>
+          {(b.label || b.sublabel) && (
+            <div className="biometrics-bbox__tag">
+              {b.label && <strong>{b.label}</strong>}
+              {b.sublabel && <span>{b.sublabel}</span>}
+            </div>
+          )}
+        </div>
+      ))}
+    </div>
+  )
+}
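The box placement in BoundingBoxCanvas.jsx reduces to scaling natural-image coordinates by the rendered size. A minimal sketch of that arithmetic, assuming a standalone helper named `scaleBox` (hypothetical, the component does this inline):

```javascript
// Hypothetical helper mirroring the component's inline math.
// box is in natural-image pixels; dims holds rendered (w, h) and natural (natW, natH) sizes.
function scaleBox(box, dims) {
  // Fall back to a 1:1 scale until the natural size is known.
  const sx = dims.natW ? dims.w / dims.natW : 1
  const sy = dims.natH ? dims.h / dims.natH : 1
  return {
    left: box.x * sx,
    top: box.y * sy,
    width: box.w * sx,
    height: box.h * sy,
  }
}

// A 1000x500 image rendered at 500x250 halves every coordinate.
const scaled = scaleBox(
  { x: 100, y: 50, w: 200, h: 100 },
  { w: 500, h: 250, natW: 1000, natH: 500 },
)
console.log(scaled)
```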
--- a/core/http/react-ui/src/components/biometrics/DistributionBars.jsx
+++ b/core/http/react-ui/src/components/biometrics/DistributionBars.jsx
@@ -0,0 +1,33 @@
+// DistributionBars — one horizontal bar per label, width proportional to value.
+// distribution: Record<string, number> (probabilities 0..1: bars scale to the largest value, and the value column prints value × 100 as a percent).
+// dominant: string — highlighted row.
+export default function DistributionBars({ title, distribution, dominant, icon }) {
+  if (!distribution || Object.keys(distribution).length === 0) return null
+  const entries = Object.entries(distribution).sort((a, b) => b[1] - a[1])
+  const max = entries.reduce((m, [, v]) => Math.max(m, v), 0) || 1
+
+  return (
+    <div className="biometrics-dist card">
+      <div className="biometrics-dist__head">
+        {icon && <i className={icon} aria-hidden="true" />}
+        <h3>{title}</h3>
+        {dominant && <span className="biometrics-dist__dominant">{dominant}</span>}
+      </div>
+      <ul className="biometrics-dist__rows">
+        {entries.map(([label, value]) => {
+          const pct = (value / max) * 100
+          const isDominant = label === dominant
+          return (
+            <li key={label} className={`biometrics-dist__row ${isDominant ? 'dominant' : ''}`}>
+              <span className="biometrics-dist__label">{label}</span>
+              <div className="biometrics-dist__bar-wrap" aria-hidden="true">
+                <div className="biometrics-dist__bar" style={{ width: `${pct}%` }} />
+              </div>
+              <span className="biometrics-dist__value">{(value * 100).toFixed(1)}%</span>
+            </li>
+          )
+        })}
+      </ul>
+    </div>
+  )
+}
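The row computation in DistributionBars.jsx can be read as a pure function: sort descending, then express each value relative to the largest. A sketch under that reading; `toRows` is a hypothetical name and the sample distribution is invented:

```javascript
// Hypothetical helper: turn a label -> value map into sorted rows with a
// bar width (pct) relative to the largest value.
function toRows(distribution) {
  if (!distribution) return []
  const entries = Object.entries(distribution).sort((a, b) => b[1] - a[1])
  // Guard against an all-zero distribution so the division stays finite.
  const max = entries.reduce((m, [, v]) => Math.max(m, v), 0) || 1
  return entries.map(([label, value]) => ({ label, value, pct: (value / max) * 100 }))
}

const rows = toRows({ sad: 0.3, happy: 0.6, neutral: 0.1 })
console.log(rows.map((r) => `${r.label}: ${r.pct.toFixed(1)}%`))
```

Note that the widest bar is always 100% wide regardless of its probability; the numeric column, not the bar, carries the absolute value.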
--- a/core/http/react-ui/src/components/biometrics/EmbeddingInspector.jsx
+++ b/core/http/react-ui/src/components/biometrics/EmbeddingInspector.jsx
@@ -0,0 +1,89 @@
+import { useMemo, useRef, useEffect, useState } from 'react'
+
+// EmbeddingInspector — compact visualization of a raw vector returned by /v1/face|voice/embed.
+// embedding: number[] (can be large). dim: int. model: string. elapsedMs: number (optional).
+export default function EmbeddingInspector({ embedding, dim, model, elapsedMs }) {
+  const canvasRef = useRef(null)
+  const [copied, setCopied] = useState(false)
+
+  const summary = useMemo(() => {
+    if (!embedding || !embedding.length) return null
+    let sum = 0, sumSq = 0, min = Infinity, max = -Infinity
+    for (const v of embedding) {
+      sum += v
+      sumSq += v * v
+      if (v < min) min = v
+      if (v > max) max = v
+    }
+    const mean = sum / embedding.length
+    const norm = Math.sqrt(sumSq)
+    return { mean, norm, min, max }
+  }, [embedding])
+
+  useEffect(() => {
+    if (!canvasRef.current || !embedding?.length) return
+    const canvas = canvasRef.current
+    const dpr = window.devicePixelRatio || 1
+    const cssW = canvas.clientWidth
+    const cssH = 60
+    canvas.width = Math.floor(cssW * dpr)
+    canvas.height = Math.floor(cssH * dpr)
+    const ctx = canvas.getContext('2d')
+    ctx.scale(dpr, dpr)
+    ctx.clearRect(0, 0, cssW, cssH)
+
+    const COUNT = Math.min(embedding.length, 128)
+    const values = embedding.slice(0, COUNT)
+    const max = Math.max(...values.map(Math.abs)) || 1
+    const mid = cssH / 2
+    const barW = cssW / COUNT
+    const accent = getComputedStyle(canvas).getPropertyValue('--color-accent').trim() || '#e8a87c'
+    const accentMuted = getComputedStyle(canvas).getPropertyValue('--color-text-muted').trim() || '#6c7084'
+    ctx.strokeStyle = accentMuted
+    ctx.beginPath()
+    ctx.moveTo(0, mid + 0.5)
+    ctx.lineTo(cssW, mid + 0.5)
+    ctx.stroke()
+    ctx.fillStyle = accent
+    for (let i = 0; i < COUNT; i++) {
+      const v = values[i]
+      const h = (Math.abs(v) / max) * (cssH * 0.45)
+      if (v >= 0) ctx.fillRect(i * barW, mid - h, Math.max(0.5, barW - 0.5), h)
+      else ctx.fillRect(i * barW, mid, Math.max(0.5, barW - 0.5), h)
+    }
+  }, [embedding])
+
+  if (!embedding) return null
+
+  const copy = async () => {
+    try {
+      await navigator.clipboard.writeText(JSON.stringify(embedding))
+      setCopied(true)
+      setTimeout(() => setCopied(false), 1500)
+    } catch (_) {
+      /* clipboard gated */
+    }
+  }
+
+  return (
+    <div className="biometrics-embed card">
+      <div className="biometrics-embed__head">
+        <div>
+          <div className="biometrics-embed__title">Embedding vector</div>
+          <div className="biometrics-embed__meta">
+            {dim != null && <span><strong>{dim}</strong> dims</span>}
+            {summary && <span>L2 <strong>{summary.norm.toFixed(3)}</strong></span>}
+            {summary && <span>range <strong>[{summary.min.toFixed(3)}, {summary.max.toFixed(3)}]</strong></span>}
+            {model && <span>model <code>{model}</code></span>}
+            {elapsedMs != null && <span>{elapsedMs.toFixed(0)} ms</span>}
+          </div>
+        </div>
+        <button type="button" className="btn btn-secondary btn-sm" onClick={copy}>
+          <i className={`fas ${copied ? 'fa-check' : 'fa-copy'}`} aria-hidden="true" />
+          {copied ? ' Copied' : ' Copy JSON'}
+        </button>
+      </div>
+      <canvas ref={canvasRef} style={{ width: '100%', height: 60 }} aria-label="Embedding sparkline (first 128 dimensions)" />
+    </div>
+  )
+}
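The summary statistics shown in the EmbeddingInspector header come from a single pass over the vector. A minimal sketch of that pass, with `summarizeEmbedding` as a hypothetical standalone name for what the component computes inside `useMemo`:

```javascript
// Hypothetical helper: one pass over the vector for mean, L2 norm, min and max.
function summarizeEmbedding(embedding) {
  if (!embedding || !embedding.length) return null
  let sum = 0
  let sumSq = 0
  let min = Infinity
  let max = -Infinity
  for (const v of embedding) {
    sum += v
    sumSq += v * v
    if (v < min) min = v
    if (v > max) max = v
  }
  return { mean: sum / embedding.length, norm: Math.sqrt(sumSq), min, max }
}

// A 3-4-5 triangle makes the L2 norm easy to verify by hand.
const summary = summarizeEmbedding([3, -4])
console.log(summary)
```

An L2 norm close to 1 suggests the backend returns unit-normalized embeddings, which is an assumption worth checking before comparing cosine distances by hand.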
--- a/core/http/react-ui/src/components/biometrics/EnrollmentList.jsx
+++ b/core/http/react-ui/src/components/biometrics/EnrollmentList.jsx
@@ -0,0 +1,65 @@
+// EnrollmentList — grid of enrolled subjects (face or voice).
+// entries: [{ id, name, labels?, thumbnail?, registeredAt?, sampleUrl? }]
+// mode: 'image' | 'audio' — controls the card visual.
+export default function EnrollmentList({ entries, onDelete, mode = 'image', highlightId }) {
+  if (!entries || entries.length === 0) {
+    return (
+      <div className="biometrics-enroll__empty">
+        <i className={`fas ${mode === 'image' ? 'fa-user-plus' : 'fa-microphone-lines'}`} aria-hidden="true" />
+        <p>No one enrolled yet. Add a sample using the form on the left to start building your identification store.</p>
+      </div>
+    )
+  }
+
+  return (
+    <ul className="biometrics-enroll__grid" role="list">
+      {entries.map((e) => {
+        const highlight = e.id === highlightId
+        return (
+          <li key={e.id} className={`biometrics-enroll__card ${highlight ? 'highlight' : ''}`}>
+            <div className="biometrics-enroll__media">
+              {mode === 'image' && e.thumbnail
+                ? <img src={e.thumbnail} alt="" />
+                : mode === 'audio' && e.sampleUrl
+                  ? <audio controls src={e.sampleUrl} />
+                  : <div className="biometrics-enroll__initials" aria-hidden="true">{initials(e.name)}</div>}
+            </div>
+            <div className="biometrics-enroll__body">
+              <div className="biometrics-enroll__name">{e.name}</div>
+              {e.labels && Object.keys(e.labels).length > 0 && (
+                <ul className="biometrics-enroll__labels" aria-label="labels">
+                  {Object.entries(e.labels).slice(0, 3).map(([k, v]) => (
+                    <li key={k}><span>{k}</span>{v}</li>
+                  ))}
+                </ul>
+              )}
+              {e.registeredAt && (
+                <div className="biometrics-enroll__meta">
+                  <i className="fas fa-clock" aria-hidden="true" /> {formatTime(e.registeredAt)}
+                </div>
+              )}
+            </div>
+            <button type="button" className="biometrics-enroll__delete" onClick={() => onDelete(e)}
+              aria-label={`Forget ${e.name}`} title="Forget this enrollment">
+              <i className="fas fa-trash" aria-hidden="true" />
+            </button>
+          </li>
+        )
+      })}
+    </ul>
+  )
+}
+
+function initials(name) {
+  if (!name) return '?'
+  return name.trim().split(/\s+/).map(p => p[0] || '').join('').slice(0, 2).toUpperCase()
+}
+
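+function formatTime(ts) {
+  // new Date() does not throw on unparseable input; it returns an
+  // Invalid Date whose toLocaleString() is the text "Invalid Date".
+  // Check the timestamp explicitly and fall back to the raw value.
+  const d = new Date(ts)
+  if (Number.isNaN(d.getTime())) return String(ts)
+  return d.toLocaleString()
+}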
--- a/core/http/react-ui/src/components/biometrics/MatchGauge.jsx
+++ b/core/http/react-ui/src/components/biometrics/MatchGauge.jsx
@@ -0,0 +1,46 @@
+// MatchGauge — distance vs threshold as a single horizontal meter.
+// distance, threshold numeric (cosine distance, lower = closer).
+// Scale is 0 → max with max = max(1.0, 2 × threshold); the threshold marker is centered only when threshold ≥ 0.5.
+export default function MatchGauge({ distance, threshold, confidence, verified, label }) {
+  const max = Math.max(1.0, (threshold || 0.3) * 2)
+  const clamp = (v) => Math.max(0, Math.min(max, v))
+  const tPct = (clamp(threshold || 0) / max) * 100
+  const dPct = distance == null ? null : (clamp(distance) / max) * 100
+  const tone = verified ? 'success' : 'error'
+
+  return (
+    <div className={`biometrics-gauge tone-${tone}`} role="img"
+      aria-label={`${label || 'Match'}: ${verified ? 'match' : 'no match'} at distance ${distance?.toFixed?.(3) ?? '?'} (threshold ${threshold?.toFixed?.(3) ?? '?'})`}>
+      <div className="biometrics-gauge__head">
+        <div className="biometrics-gauge__verdict">
+          <i className={`fas ${verified ? 'fa-circle-check' : 'fa-circle-xmark'}`} aria-hidden="true" />
+          <span>{verified ? 'Match' : 'No match'}</span>
+        </div>
+        {confidence != null && (
+          <div className="biometrics-gauge__confidence">
+            <strong>{typeof confidence === 'number' ? confidence.toFixed(1) : confidence}</strong>
+            <span>confidence</span>
+          </div>
+        )}
+      </div>
+      <div className="biometrics-gauge__track" aria-hidden="true">
+        <div className="biometrics-gauge__zone biometrics-gauge__zone--match"
+          style={{ width: `${tPct}%` }} />
+        <div className="biometrics-gauge__zone biometrics-gauge__zone--miss"
+          style={{ left: `${tPct}%`, width: `${100 - tPct}%` }} />
+        <div className="biometrics-gauge__threshold" style={{ left: `${tPct}%` }}>
+          <span>threshold</span>
+        </div>
+        {dPct != null && (
+          <div className="biometrics-gauge__marker" style={{ left: `${dPct}%` }}>
+            <span>distance</span>
+          </div>
+        )}
+      </div>
+      <div className="biometrics-gauge__footer">
+        <span><em>distance</em> <code>{distance?.toFixed?.(4) ?? '—'}</code></span>
+        <span><em>threshold</em> <code>{threshold?.toFixed?.(4) ?? '—'}</code></span>
+      </div>
+    </div>
+  )
+}
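The marker placement in MatchGauge.jsx follows from how `max` is chosen. The sketch below isolates that arithmetic; `gaugePositions` is a hypothetical helper and the sample distances and thresholds are illustrative, not values from any model:

```javascript
// Hypothetical helper reproducing the gauge's position math.
function gaugePositions(distance, threshold) {
  // The scale never shrinks below 1.0, so small thresholds sit left of center.
  const max = Math.max(1.0, (threshold || 0.3) * 2)
  const clamp = (v) => Math.max(0, Math.min(max, v))
  const tPct = (clamp(threshold || 0) / max) * 100
  const dPct = distance == null ? null : (clamp(distance) / max) * 100
  return { max, tPct, dPct }
}

// threshold 0.3: scale stays at 1.0, marker lands at about 30%.
const low = gaugePositions(0.15, 0.3)
// threshold 0.8: scale grows to 1.6, marker is centered and the distance is clamped.
const high = gaugePositions(2.0, 0.8)
console.log(low, high)
```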
--- a/core/http/react-ui/src/components/biometrics/MediaInput.jsx
+++ b/core/http/react-ui/src/components/biometrics/MediaInput.jsx
@@ -0,0 +1,179 @@
+import { useEffect, useRef, useState } from 'react'
+import { useMediaCapture } from '../../hooks/useMediaCapture'
+import { fileToBase64 } from '../../utils/api'
+
+// MediaInput — one control, three ways to supply a sample.
+// mode: 'image' | 'audio'. onChange receives null | { base64, dataUrl, mime, source, name? } (name is set for file uploads only).
+function UnsupportedNotice({ mode }) {
+  // Detect the likely cause so we can tell the user what to do, instead of just "not supported".
+  const isSecure = typeof window !== 'undefined' && (window.isSecureContext ?? true)
+  const hostname = typeof window !== 'undefined' ? window.location.hostname : ''
+  const origin = typeof window !== 'undefined' ? window.location.origin : ''
+  const thing = mode === 'image' ? 'webcam' : 'microphone'
+
+  if (!isSecure) {
+    return (
+      <div className="biometrics-mediainput__notice">
+        <i className="fas fa-lock" aria-hidden="true" />
+        <div>
+          <strong>{thing} needs a secure origin</strong>
+          <p>
+            Your browser only exposes <code>getUserMedia</code> over HTTPS, <code>localhost</code>,
+            or <code>127.0.0.1</code>. You're on <code>{origin || hostname}</code>. Reach the UI
+            via <code>http://localhost:&lt;port&gt;</code> (or put a TLS terminator in front) and the
+            live {thing} will light up. Upload still works fine from here.
+          </p>
+        </div>
+      </div>
+    )
+  }
+  return (
+    <div className="biometrics-mediainput__notice">
+      <i className="fas fa-circle-info" aria-hidden="true" />
+      <div>
+        <strong>Live {thing} not available</strong>
+        <p>
+          This browser doesn't expose <code>navigator.mediaDevices.getUserMedia</code>. Try another
+          browser, or use the upload tab — the backend accepts either.
+        </p>
+      </div>
+    </div>
+  )
+}
+
+export default function MediaInput({ mode, label, value, onChange, idPrefix = 'media' }) {
+  const [tab, setTab] = useState('file') // 'file' | 'live'
+  const fileRef = useRef(null)
+  const cap = useMediaCapture(mode)
+
+  // Release the device when switching away from the live tab.
+  useEffect(() => {
+    if (tab !== 'live' && cap.active) cap.stop()
+  }, [tab]) // eslint-disable-line react-hooks/exhaustive-deps
+
+  const handleFile = async (e) => {
+    const f = e.target.files?.[0]
+    if (!f) { onChange(null); return }
+    const base64 = await fileToBase64(f)
+    const dataUrl = await new Promise((resolve) => {
+      const reader = new FileReader()
+      reader.onload = () => resolve(reader.result)
+      reader.readAsDataURL(f)
+    })
+    onChange({ base64, dataUrl, mime: f.type, source: 'file', name: f.name })
+  }
+
+  const handleSnap = () => {
+    const shot = cap.snap()
+    if (shot) onChange({ ...shot, source: 'live' })
+  }
+
+  const handleRecordToggle = async () => {
+    if (cap.recording) {
+      cap.stopRecording()
+    } else {
+      const pending = cap.startRecording()
+      if (!pending) return
+      const result = await pending
+      onChange({ ...result, source: 'live' })
+    }
+  }
+
+  const clear = () => {
+    onChange(null)
+    if (fileRef.current) fileRef.current.value = ''
+  }
+
+  const inputId = `${idPrefix}-${mode}-file`
+
+  return (
+    <div className="biometrics-mediainput">
+      {label && <label className="form-label" htmlFor={inputId}>{label}</label>}
+
+      <div className="biometrics-mediainput__tabs" role="tablist" aria-label={`${label || 'Media'} source`}>
+        <button type="button" role="tab" aria-selected={tab === 'file'}
+          className={`biometrics-mediainput__tab ${tab === 'file' ? 'active' : ''}`}
+          onClick={() => setTab('file')}>
+          <i className="fas fa-upload" aria-hidden="true" /> Upload
+        </button>
+        <button type="button" role="tab" aria-selected={tab === 'live'}
+          className={`biometrics-mediainput__tab ${tab === 'live' ? 'active' : ''}`}
+          onClick={() => setTab('live')}>
+          <i className={`fas ${mode === 'image' ? 'fa-camera' : 'fa-microphone'}`} aria-hidden="true" />
+          {mode === 'image' ? ' Webcam' : ' Record'}
+        </button>
+      </div>
+
+      {tab === 'file' && (
+        <div className="biometrics-mediainput__body">
+          <input
+            ref={fileRef}
+            id={inputId}
+            type="file"
+            className="input"
+            accept={mode === 'image' ? 'image/*' : 'audio/*'}
+            onChange={handleFile}
+          />
+        </div>
+      )}
+
+      {tab === 'live' && (
+        <div className="biometrics-mediainput__body">
+          {!cap.supported && <UnsupportedNotice mode={mode} />}
+          {cap.supported && !cap.active && (
+            <button type="button" className="btn btn-secondary btn-full" onClick={cap.start}>
+              <i className={`fas ${mode === 'image' ? 'fa-camera' : 'fa-microphone'}`} aria-hidden="true" />
+              {mode === 'image' ? ' Start webcam' : ' Enable microphone'}
+            </button>
+          )}
+          {cap.active && mode === 'image' && (
+            <div className="biometrics-mediainput__live">
+              <video ref={cap.videoRef} autoPlay muted playsInline className="biometrics-mediainput__video" />
+              <div className="biometrics-mediainput__controls">
+                <button type="button" className="btn btn-primary" onClick={handleSnap}>
+                  <i className="fas fa-circle-dot" aria-hidden="true" /> Capture
+                </button>
+                <button type="button" className="btn btn-secondary" onClick={cap.stop}>Stop</button>
+              </div>
+            </div>
+          )}
+          {cap.active && mode === 'audio' && (
+            <div className="biometrics-mediainput__live">
+              <div className={`biometrics-mediainput__meter ${cap.recording ? 'recording' : ''}`}>
+                <i className="fas fa-microphone" aria-hidden="true" />
+                <span>{cap.recording ? `Recording… ${cap.elapsed.toFixed(1)}s` : 'Microphone ready'}</span>
+              </div>
+              <div className="biometrics-mediainput__controls">
+                <button type="button" className={`btn ${cap.recording ? 'btn-secondary' : 'btn-primary'}`} onClick={handleRecordToggle}>
+                  <i className={`fas ${cap.recording ? 'fa-stop' : 'fa-circle'}`} aria-hidden="true" />
+                  {cap.recording ? ' Stop' : ' Record'}
+                </button>
+                <button type="button" className="btn btn-secondary" onClick={cap.stop} disabled={cap.recording}>Close</button>
+              </div>
+            </div>
+          )}
+          {cap.error && (
+            <p className="biometrics-mediainput__error" role="alert">{cap.error}</p>
+          )}
+        </div>
+      )}
+
+      {value && (
+        <div className="biometrics-mediainput__preview">
+          {mode === 'image'
+            ? <img src={value.dataUrl} alt="" />
+            : <audio controls src={value.dataUrl} />}
+          <div className="biometrics-mediainput__preview-meta">
+            <span className="biometrics-mediainput__source-pill">
+              <i className={`fas ${value.source === 'live' ? (mode === 'image' ? 'fa-camera' : 'fa-microphone') : 'fa-file'}`} aria-hidden="true" />
+              {value.source === 'live' ? ' Captured' : ` ${value.name || 'Uploaded'}`}
+            </span>
+            <button type="button" className="biometrics-mediainput__clear" onClick={clear} aria-label="Remove sample">
+              <i className="fas fa-xmark" aria-hidden="true" />
+            </button>
+          </div>
+        </div>
+      )}
+    </div>
+  )
+}
--- a/core/http/react-ui/src/components/biometrics/TabSwitch.jsx
+++ b/core/http/react-ui/src/components/biometrics/TabSwitch.jsx
@@ -0,0 +1,22 @@
+export default function TabSwitch({ tabs, value, onChange }) {
+  return (
+    <div className="biometrics-tabs" role="tablist">
+      {tabs.map(t => {
+        const active = t.id === value
+        return (
+          <button
+            key={t.id}
+            role="tab"
+            type="button"
+            aria-selected={active}
+            className={`biometrics-tab ${active ? 'active' : ''}`}
+            onClick={() => onChange(t.id)}
+          >
+            {t.icon && <i className={`${t.icon}`} aria-hidden="true" />}
+            <span>{t.label}</span>
+          </button>
+        )
+      })}
+    </div>
+  )
+}
--- a/core/http/react-ui/src/components/biometrics/WaveformStrip.jsx
+++ b/core/http/react-ui/src/components/biometrics/WaveformStrip.jsx
@@ -0,0 +1,99 @@
+import { useEffect, useRef, useState } from 'react'
+
+// WaveformStrip — decode an audio source (data URL or blob URL) via AudioContext,
+// render a mono waveform, and overlay colored segment regions.
+// segments: [{ start: seconds, end: seconds, label?, tone? }]
+export default function WaveformStrip({ src, segments = [], height = 120 }) {
+  const canvasRef = useRef(null)
+  const [duration, setDuration] = useState(0)
+  const [peaks, setPeaks] = useState(null)
+  const [err, setErr] = useState(null)
+
+  useEffect(() => {
+    setPeaks(null)
+    setDuration(0)
+    setErr(null)
+    if (!src) return
+    let cancelled = false
+
+    async function decode() {
+      try {
+        const response = await fetch(src)
+        const buf = await response.arrayBuffer()
+        const Ctx = window.AudioContext || window.webkitAudioContext
+        const ctx = new Ctx()
+        const audioBuf = await ctx.decodeAudioData(buf.slice(0))
+        if (cancelled) { ctx.close(); return }
+        const data = audioBuf.getChannelData(0)
+        const BUCKETS = 480
+        const step = Math.max(1, Math.floor(data.length / BUCKETS))
+        const result = new Float32Array(BUCKETS)
+        for (let i = 0; i < BUCKETS; i++) {
+          let peak = 0
+          const start = i * step
+          const end = Math.min(start + step, data.length)
+          for (let j = start; j < end; j++) {
+            const v = Math.abs(data[j])
+            if (v > peak) peak = v
+          }
+          result[i] = peak
+        }
+        setPeaks(result)
+        setDuration(audioBuf.duration)
+        ctx.close()
+      } catch (e) {
+        if (!cancelled) setErr(e?.message || 'Could not decode audio')
+      }
+    }
+    decode()
+    return () => { cancelled = true }
+  }, [src])
+
+  useEffect(() => {
+    if (!canvasRef.current || !peaks) return
+    const canvas = canvasRef.current
+    const dpr = window.devicePixelRatio || 1
+    const cssW = canvas.clientWidth
+    const cssH = height
+    canvas.width = Math.floor(cssW * dpr)
+    canvas.height = Math.floor(cssH * dpr)
+    const ctx = canvas.getContext('2d')
+    ctx.scale(dpr, dpr)
+    ctx.clearRect(0, 0, cssW, cssH)
+
+    // Waveform
+    const accent = getComputedStyle(canvas).getPropertyValue('--biometrics-wave').trim() || '#e8a87c'
+    ctx.fillStyle = accent
+    const mid = cssH / 2
+    const barW = Math.max(1, cssW / peaks.length)
+    for (let i = 0; i < peaks.length; i++) {
+      const h = Math.max(1, peaks[i] * (cssH * 0.9))
+      ctx.fillRect(i * barW, mid - h / 2, Math.max(0.5, barW - 0.5), h)
+    }
+  }, [peaks, height])
+
+  if (err) return <div className="biometrics-waveform biometrics-waveform--error">{err}</div>
+  if (!src) return null
+
+  return (
+    <div className="biometrics-waveform" style={{ height }}>
+      <canvas ref={canvasRef} style={{ width: '100%', height: '100%' }} />
+      {duration > 0 && segments.map((s, i) => {
+        const left = (Math.max(0, s.start) / duration) * 100
+        const right = (Math.min(duration, s.end) / duration) * 100
+        return (
+          <div key={i} className={`biometrics-waveform__segment tone-${s.tone || 'accent'}`}
+            style={{ left: `${left}%`, width: `${Math.max(0.5, right - left)}%` }}>
+            {s.label && <span className="biometrics-waveform__seglabel">{s.label}</span>}
+          </div>
+        )
+      })}
+      {duration > 0 && (
+        <div className="biometrics-waveform__duration" aria-hidden="true">{duration.toFixed(1)}s</div>
+      )}
+      {!peaks && (
+        <div className="biometrics-waveform__loading">Decoding…</div>
+      )}
+    </div>
+  )
+}
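The waveform in WaveformStrip.jsx is drawn from per-bucket peak amplitudes rather than raw samples. A sketch of that downsampling step as a pure function; `computePeaks` is a hypothetical name, and the bucket count is a parameter here where the component fixes it at 480:

```javascript
// Hypothetical helper: reduce a mono sample buffer to `buckets` peak values.
function computePeaks(data, buckets) {
  // At least one sample per bucket; short inputs leave trailing buckets at 0.
  const step = Math.max(1, Math.floor(data.length / buckets))
  const result = new Float32Array(buckets)
  for (let i = 0; i < buckets; i++) {
    let peak = 0
    const start = i * step
    const end = Math.min(start + step, data.length)
    for (let j = start; j < end; j++) {
      const v = Math.abs(data[j])
      if (v > peak) peak = v
    }
    result[i] = peak
  }
  return result
}

// Four samples into two buckets: each bucket keeps its largest magnitude.
const peaks = computePeaks(new Float32Array([0.1, -0.5, 0.2, 0.9]), 2)
// One sample into four buckets: only the first bucket is populated.
const sparse = computePeaks(new Float32Array([0.25]), 4)
console.log(Array.from(peaks), Array.from(sparse))
```

Because `step` is floored, samples past `buckets * step` are not visited, so the last fraction of a clip may be missing from the drawing.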
--- a/core/http/react-ui/src/hooks/useMediaCapture.js
+++ b/core/http/react-ui/src/hooks/useMediaCapture.js
@@ -0,0 +1,205 @@
+import { useCallback, useEffect, useRef, useState } from 'react'
+
+// Encode an AudioBuffer as a 16-bit PCM mono WAV blob. Libsndfile (which the
+// SpeechBrain / ONNX voice backends use) reads this shape without extra
+// decoders. We downmix to mono because speaker-encoder models expect a single
+// channel and sample-rate resampling is handled server-side.
+function audioBufferToWavBlob(audioBuffer) {
+  const sampleRate = audioBuffer.sampleRate
+  const numFrames = audioBuffer.length
+  const bitsPerSample = 16
+  const blockAlign = bitsPerSample / 8 // mono, 1 channel
+  const byteRate = sampleRate * blockAlign
+  const dataSize = numFrames * blockAlign
+  const out = new ArrayBuffer(44 + dataSize)
+  const view = new DataView(out)
+
+  const writeAscii = (offset, s) => {
+    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i))
+  }
+  writeAscii(0, 'RIFF')
+  view.setUint32(4, 36 + dataSize, true)
+  writeAscii(8, 'WAVE')
+  writeAscii(12, 'fmt ')
+  view.setUint32(16, 16, true)           // fmt chunk size
+  view.setUint16(20, 1, true)            // PCM
+  view.setUint16(22, 1, true)            // mono
+  view.setUint32(24, sampleRate, true)
+  view.setUint32(28, byteRate, true)
+  view.setUint16(32, blockAlign, true)
+  view.setUint16(34, bitsPerSample, true)
+  writeAscii(36, 'data')
+  view.setUint32(40, dataSize, true)
+
+  // Average all input channels into mono, then clamp + convert to int16.
+  const numChannels = audioBuffer.numberOfChannels
+  const channels = []
+  for (let c = 0; c < numChannels; c++) channels.push(audioBuffer.getChannelData(c))
+  let offset = 44
+  for (let i = 0; i < numFrames; i++) {
+    let sum = 0
+    for (let c = 0; c < numChannels; c++) sum += channels[c][i]
+    const mono = Math.max(-1, Math.min(1, sum / numChannels))
+    view.setInt16(offset, mono < 0 ? mono * 0x8000 : mono * 0x7FFF, true)
+    offset += 2
+  }
+  return new Blob([out], { type: 'audio/wav' })
+}
+
+// useMediaCapture — wraps getUserMedia + MediaRecorder for the biometrics pages.
+// mode: 'image' streams video-only for a snap-to-canvas; 'audio' records a clip via MediaRecorder.
+// Consumers attach the returned videoRef to a <video autoPlay muted playsInline/> element.
+export function useMediaCapture(mode) {
+  const [active, setActive] = useState(false)
+  const [recording, setRecording] = useState(false)
+  const [error, setError] = useState(null)
+  const [elapsed, setElapsed] = useState(0)
+
+  const streamRef = useRef(null)
+  const videoRef = useRef(null)
+  const recorderRef = useRef(null)
+  const chunksRef = useRef([])
+  const tickRef = useRef(null)
+  const resolveStopRef = useRef(null)
+
+  const supported = typeof navigator !== 'undefined' && !!navigator.mediaDevices?.getUserMedia
+
+  const stopStream = useCallback(() => {
+    if (tickRef.current) {
+      clearInterval(tickRef.current)
+      tickRef.current = null
+    }
+    if (streamRef.current) {
+      streamRef.current.getTracks().forEach(t => { try { t.stop() } catch (_) { /* ignore */ } })
+      streamRef.current = null
+    }
+    if (videoRef.current) {
+      try { videoRef.current.srcObject = null } catch (_) { /* ignore */ }
+    }
+    setActive(false)
+    setRecording(false)
+    setElapsed(0)
+  }, [])
+
+  const start = useCallback(async () => {
+    if (!supported) {
+      setError('Your browser does not support media capture.')
+      return
+    }
+    setError(null)
+    try {
+      const constraints = mode === 'audio'
+        ? { audio: true }
+        : { video: { facingMode: 'user', width: { ideal: 640 }, height: { ideal: 480 } } }
+      const stream = await navigator.mediaDevices.getUserMedia(constraints)
+      streamRef.current = stream
+      // Attachment happens in the useEffect below — videoRef.current is still
+      // null at this point because the <video> element mounts only after React
+      // processes the setActive(true) state change.
+      setActive(true)
+    } catch (e) {
+      setError(e?.message || 'Could not access device')
+      stopStream()
+    }
+  }, [mode, supported, stopStream])
+
+  // Hook the stream into the <video> once both the stream and the element exist.
+  useEffect(() => {
+    if (mode !== 'image' || !active) return
+    const v = videoRef.current
+    const s = streamRef.current
+    if (!v || !s) return
+    if (v.srcObject !== s) v.srcObject = s
+    const playPromise = v.play()
+    if (playPromise && typeof playPromise.catch === 'function') {
+      playPromise.catch(() => { /* autoplay gated */ })
+    }
+  }, [active, mode])
+
+  // Snap a frame from the live video stream to a PNG base64 (image mode).
+  const snap = useCallback(() => {
+    if (mode !== 'image' || !videoRef.current || !streamRef.current) return null
+    const v = videoRef.current
+    const w = v.videoWidth || 640
+    const h = v.videoHeight || 480
+    const canvas = document.createElement('canvas')
+    canvas.width = w
+    canvas.height = h
+    const ctx = canvas.getContext('2d')
+    ctx.drawImage(v, 0, 0, w, h)
+    const dataUrl = canvas.toDataURL('image/png')
+    const base64 = dataUrl.split(',')[1] || ''
+    return { base64, dataUrl, mime: 'image/png' }
+  }, [mode])
+
+  // Start an audio recording — returns a promise that resolves with a WAV-encoded
+  // {base64, blob, dataUrl, mime} on stopRecording. Transcoding to 16-bit PCM mono
+  // WAV is necessary because the voice backends open the file via libsndfile, which
+  // doesn't handle WebM/Ogg-Opus containers — the browser's native MediaRecorder
+  // output — out of the box.
+  const startRecording = useCallback(() => {
+    if (mode !== 'audio' || !streamRef.current || typeof MediaRecorder === 'undefined') return null
+    chunksRef.current = []
+    const recMime = (typeof MediaRecorder !== 'undefined' && MediaRecorder.isTypeSupported('audio/webm;codecs=opus'))
+      ? 'audio/webm;codecs=opus'
+      : 'audio/webm'
+    let rec
+    try {
+      rec = new MediaRecorder(streamRef.current, { mimeType: recMime })
+    } catch (_) {
+      rec = new MediaRecorder(streamRef.current)
+    }
+    recorderRef.current = rec
+    rec.ondataavailable = (e) => { if (e.data && e.data.size > 0) chunksRef.current.push(e.data) }
+    const donePromise = new Promise((resolve, reject) => {
+      resolveStopRef.current = resolve
+      rec.onstop = async () => {
+        try {
+          const recBlob = new Blob(chunksRef.current, { type: rec.mimeType || recMime })
+          const arrayBuf = await recBlob.arrayBuffer()
+          const Ctx = window.AudioContext || window.webkitAudioContext
+          const ctx = new Ctx()
+          const audioBuf = await ctx.decodeAudioData(arrayBuf.slice(0))
+          const wavBlob = audioBufferToWavBlob(audioBuf)
+          ctx.close()
+          const dataUrl = await new Promise((res) => {
+            const reader = new FileReader()
+            reader.onloadend = () => res(reader.result)
+            reader.readAsDataURL(wavBlob)
+          })
+          const base64 = typeof dataUrl === 'string' ? (dataUrl.split(',')[1] || '') : ''
+          resolve({ blob: wavBlob, base64, dataUrl, mime: 'audio/wav' })
+        } catch (err) {
+          reject(err)
+        } finally {
+          resolveStopRef.current = null
+        }
+      }
+    })
+    rec.start()
+    setRecording(true)
+    setElapsed(0)
+    const started = Date.now()
+    tickRef.current = setInterval(() => setElapsed((Date.now() - started) / 1000), 100)
+    return donePromise
+  }, [mode])
+
+  const stopRecording = useCallback(() => {
+    if (recorderRef.current && recorderRef.current.state !== 'inactive') {
+      recorderRef.current.stop()
+    }
+    if (tickRef.current) {
+      clearInterval(tickRef.current)
+      tickRef.current = null
+    }
+    setRecording(false)
+  }, [])
+
+  // Cleanup on unmount — always release the device.
+  useEffect(() => () => stopStream(), [stopStream])
+
+  return {
+    supported, active, recording, error, elapsed,
+    videoRef, start, stop: stopStream, snap, startRecording, stopRecording,
+  }
+}
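
Reviewer sketch (not part of the patch): the WAV transcoding in the file above comes down to a 44-byte RIFF header followed by little-endian int16 samples. The standalone function below illustrates that layout for input that is already mono, so it can be checked outside the browser. The name `encodeMonoWav` and its signature are hypothetical. Unlike the patch, it rounds each sample before writing it.

```javascript
// Hypothetical helper, assuming Float32 mono samples in [-1, 1].
// Returns an ArrayBuffer holding a 16-bit PCM mono WAV file.
function encodeMonoWav(samples, sampleRate) {
  const bytesPerSample = 2
  const dataSize = samples.length * bytesPerSample
  const buf = new ArrayBuffer(44 + dataSize)
  const view = new DataView(buf)
  const writeAscii = (offset, s) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i))
  }
  writeAscii(0, 'RIFF')
  view.setUint32(4, 36 + dataSize, true)                 // RIFF chunk size
  writeAscii(8, 'WAVE')
  writeAscii(12, 'fmt ')
  view.setUint32(16, 16, true)                           // fmt chunk size
  view.setUint16(20, 1, true)                            // format tag: PCM
  view.setUint16(22, 1, true)                            // channels: mono
  view.setUint32(24, sampleRate, true)
  view.setUint32(28, sampleRate * bytesPerSample, true)  // byte rate
  view.setUint16(32, bytesPerSample, true)               // block align
  view.setUint16(34, 16, true)                           // bits per sample
  writeAscii(36, 'data')
  view.setUint32(40, dataSize, true)
  for (let i = 0; i < samples.length; i++) {
    // Clamp, then scale asymmetrically so -1 maps to -32768 and +1 to 32767.
    const s = Math.max(-1, Math.min(1, samples[i]))
    view.setInt16(44 + i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7FFF), true)
  }
  return buf
}
```

For example, `encodeMonoWav(new Float32Array([0, 1, -1]), 16000)` yields a 50-byte buffer: 44 header bytes plus three 2-byte samples.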
--- a/core/http/react-ui/src/pages/FaceRecognition.jsx
+++ b/core/http/react-ui/src/pages/FaceRecognition.jsx
@@ -0,0 +1,602 @@
+import { useEffect, useMemo, useState } from 'react'
+import { useOutletContext, useParams } from 'react-router-dom'
+import ModelSelector from '../components/ModelSelector'
+import LoadingSpinner from '../components/LoadingSpinner'
+import ErrorWithTraceLink from '../components/ErrorWithTraceLink'
+import TabSwitch from '../components/biometrics/TabSwitch'
+import MediaInput from '../components/biometrics/MediaInput'
+import BoundingBoxCanvas from '../components/biometrics/BoundingBoxCanvas'
+import MatchGauge from '../components/biometrics/MatchGauge'
+import DistributionBars from '../components/biometrics/DistributionBars'
+import EnrollmentList from '../components/biometrics/EnrollmentList'
+import EmbeddingInspector from '../components/biometrics/EmbeddingInspector'
+import { CAP_FACE_RECOGNITION } from '../utils/capabilities'
+import { faceApi } from '../utils/api'
+
+const TABS = [
+  { id: 'analyze',  icon: 'fas fa-chart-column', label: 'Analyze' },
+  { id: 'compare',  icon: 'fas fa-people-arrows', label: 'Compare' },
+  { id: 'enroll',   icon: 'fas fa-id-card',       label: 'Enrollment' },
+  { id: 'embed',    icon: 'fas fa-code',          label: 'Embedding' },
+]
+
+const ENROLL_KEY = 'localai_face_enrollments'
+
+function loadEnrollments() {
+  try {
+    const raw = localStorage.getItem(ENROLL_KEY)
+    if (!raw) return []
+    const parsed = JSON.parse(raw)
+    return Array.isArray(parsed) ? parsed : []
+  } catch (_) { return [] }
+}
+
+function saveEnrollments(list) {
+  try { localStorage.setItem(ENROLL_KEY, JSON.stringify(list.slice(0, 50))) } catch (_) { /* quota */ }
+}
+
+// Parse a textarea of "key: value" lines into a { key: value } object.
+function parseLabels(text) {
+  const out = {}
+  if (!text) return out
+  for (const line of text.split('\n')) {
+    const idx = line.indexOf(':')
+    if (idx === -1) continue
+    const k = line.slice(0, idx).trim()
+    const v = line.slice(idx + 1).trim()
+    if (k) out[k] = v
+  }
+  return out
+}
+
+export default function FaceRecognition() {
+  const { model: urlModel } = useParams()
+  const { addToast } = useOutletContext()
+
+  const [model, setModel] = useState(urlModel || '')
+  const [tab, setTab] = useState('analyze')
+
+  return (
+    <div className="biometrics-page">
+      <header className="biometrics-page__header">
+        <div>
+          <h1 className="page-title"><i className="fas fa-face-smile" aria-hidden="true" /> Face Recognition</h1>
+          <p className="page-subtitle">Compare, identify, and analyze faces using any face model installed on this LocalAI instance. Samples never leave your machine — they go only to the running backend.</p>
+        </div>
+        <div className="biometrics-page__model">
+          <label className="form-label" htmlFor="face-model">Model</label>
+          <ModelSelector value={model} onChange={setModel} capability={CAP_FACE_RECOGNITION} />
+        </div>
+      </header>
+
+      <TabSwitch tabs={TABS} value={tab} onChange={setTab} />
+
+      <div className="biometrics-page__body">
+        {tab === 'analyze' && <AnalyzeTab model={model} addToast={addToast} />}
+        {tab === 'compare' && <CompareTab model={model} addToast={addToast} />}
+        {tab === 'enroll' && <EnrollTab model={model} addToast={addToast} />}
+        {tab === 'embed' && <EmbedTab model={model} addToast={addToast} />}
+      </div>
+    </div>
+  )
+}
+
+// ──────────────────────────── Analyze ────────────────────────────
+
+function AnalyzeTab({ model, addToast }) {
+  const [img, setImg] = useState(null)
+  const [actions, setActions] = useState({ age: true, gender: true, emotion: true, race: true })
+  const [antiSpoofing, setAntiSpoofing] = useState(false)
+  const [loading, setLoading] = useState(false)
+  const [error, setError] = useState(null)
+  const [result, setResult] = useState(null)
+  const [focusIdx, setFocusIdx] = useState(0)
+
+  const submit = async (e) => {
+    e.preventDefault()
+    if (!model) { addToast('Select a face model first', 'warning'); return }
+    if (!img) { addToast('Add an image to analyze', 'warning'); return }
+    setLoading(true); setError(null); setResult(null); setFocusIdx(0)
+    try {
+      const body = {
+        model,
+        img: img.dataUrl,
+        actions: Object.entries(actions).filter(([, v]) => v).map(([k]) => k),
+        anti_spoofing: antiSpoofing,
+      }
+      const data = await faceApi.analyze(body)
+      setResult(data)
+      if (!data?.faces?.length) addToast('No face detected in the image', 'warning')
+    } catch (err) {
+      setError(err.message)
+    } finally {
+      setLoading(false)
+    }
+  }
+
+  const boxes = useMemo(() => (result?.faces || []).map((f, i) => ({
+    x: f.region.x, y: f.region.y, w: f.region.w, h: f.region.h,
+    label: f.dominant_emotion || f.dominant_gender || `Face ${i + 1}`,
+    sublabel: f.age ? `~${Math.round(f.age)}y` : null,
+    tone: i === focusIdx ? 'accent' : 'default',
+  })), [result, focusIdx])
+
+  const faces = result?.faces || []
+  const focus = faces[focusIdx]
+
+  return (
+    <form className="biometrics-twocol" onSubmit={submit}>
+      <aside className="biometrics-panel">
+        <h2 className="biometrics-panel__title">Analyze a face</h2>
+        <MediaInput mode="image" label="Source image" value={img} onChange={setImg} idPrefix="face-analyze" />
+
+        <fieldset className="biometrics-fieldset">
+          <legend>Attributes</legend>
+          <div className="biometrics-chipset" role="group">
+            {['age', 'gender', 'emotion', 'race'].map(k => (
+              <label key={k} className={`biometrics-chip ${actions[k] ? 'active' : ''}`}>
+                <input type="checkbox" checked={actions[k]} onChange={(e) => setActions(a => ({ ...a, [k]: e.target.checked }))} />
+                <span>{k}</span>
+              </label>
+            ))}
+          </div>
+        </fieldset>
+
+        <div className="form-row">
+          <div className="form-row__label">
+            <span className="form-row__label-text">Anti-spoofing</span>
+            <span className="form-row__hint">Reject photos-of-photos (requires model support).</span>
+          </div>
+          <label className="biometrics-switch">
+            <input type="checkbox" checked={antiSpoofing} onChange={(e) => setAntiSpoofing(e.target.checked)} />
+            <span aria-hidden="true" />
+          </label>
+        </div>
+
+        <button type="submit" className="btn btn-primary btn-full" disabled={loading || !img}>
+          {loading ? <><LoadingSpinner size="sm" /> Analyzing…</> : <><i className="fas fa-wand-magic-sparkles" /> Analyze</>}
+        </button>
+      </aside>
+
+      <section className="biometrics-results">
+        {loading && <div className="biometrics-empty"><LoadingSpinner size="lg" /></div>}
+        {error && <ErrorWithTraceLink message={error} />}
+        {!loading && !error && !result && (
+          <EmptyState icon="fas fa-face-smile"
+            title="Drop a portrait to analyze"
+            body="The backend will detect each face and return age, gender, emotion, and race distributions — with an optional liveness check." />
+        )}
+        {result && img && (
+          <>
+            <div className="biometrics-split">
+              <div className="biometrics-split__media">
+                <BoundingBoxCanvas src={img.dataUrl} boxes={boxes} alt="Analyzed source" />
+                {faces.length > 1 && (
+                  <div className="biometrics-facepicker" role="group" aria-label="Select face">
+                    {faces.map((_, i) => (
+                      <button key={i} type="button"
+                        className={`biometrics-facepicker__chip ${i === focusIdx ? 'active' : ''}`}
+                        onClick={() => setFocusIdx(i)}
+                        aria-pressed={i === focusIdx}>
+                        Face {i + 1}
+                      </button>
+                    ))}
+                  </div>
+                )}
+              </div>
+              <div className="biometrics-split__aside">
+                {focus && (
+                  <>
+                    <div className="biometrics-summary card">
+                      <div className="biometrics-summary__head">
+                        <h3><i className="fas fa-user" /> Face {focusIdx + 1}</h3>
+                        {antiSpoofing && <LivenessPill isReal={focus.is_real} score={focus.antispoof_score} />}
+                      </div>
+                      <dl className="biometrics-summary__grid">
+                        {focus.age != null && <><dt>Age</dt><dd>~{Math.round(focus.age)}</dd></>}
+                        {focus.dominant_gender && <><dt>Gender</dt><dd>{focus.dominant_gender}</dd></>}
+                        {focus.dominant_emotion && <><dt>Emotion</dt><dd>{focus.dominant_emotion}</dd></>}
+                        {focus.dominant_race && <><dt>Race</dt><dd>{focus.dominant_race}</dd></>}
+                        {focus.face_confidence != null && <><dt>Detection</dt><dd>{(focus.face_confidence * 100).toFixed(1)}%</dd></>}
+                      </dl>
+                    </div>
+                    <DistributionBars title="Gender" icon="fas fa-venus-mars" distribution={focus.gender} dominant={focus.dominant_gender} />
+                    <DistributionBars title="Emotion" icon="fas fa-face-smile-beam" distribution={focus.emotion} dominant={focus.dominant_emotion} />
+                    <DistributionBars title="Race" icon="fas fa-globe" distribution={focus.race} dominant={focus.dominant_race} />
+                  </>
+                )}
+              </div>
+            </div>
+            <ResponseDetails data={result} />
+          </>
+        )}
+      </section>
+    </form>
+  )
+}
+
+// ──────────────────────────── Compare ────────────────────────────
+
+function CompareTab({ model, addToast }) {
+  const [img1, setImg1] = useState(null)
+  const [img2, setImg2] = useState(null)
+  const [antiSpoofing, setAntiSpoofing] = useState(false)
+  const [threshold, setThreshold] = useState(null)
+  const [loading, setLoading] = useState(false)
+  const [error, setError] = useState(null)
+  const [result, setResult] = useState(null)
+
+  const submit = async (e) => {
+    e.preventDefault()
+    if (!model) { addToast('Select a face model first', 'warning'); return }
+    if (!img1 || !img2) { addToast('Add both images to compare', 'warning'); return }
+    setLoading(true); setError(null); setResult(null)
+    try {
+      const body = { model, img1: img1.dataUrl, img2: img2.dataUrl, anti_spoofing: antiSpoofing }
+      if (threshold != null) body.threshold = threshold
+      const data = await faceApi.verify(body)
+      setResult(data)
+      if (threshold == null && data?.threshold) setThreshold(data.threshold)
+    } catch (err) {
+      setError(err.message)
+    } finally {
+      setLoading(false)
+    }
+  }
+
+  // Re-compute verified locally when user drags the threshold slider post-response.
+  const effective = useMemo(() => {
+    if (!result) return null
+    const t = threshold ?? result.threshold
+    const verified = result.distance <= t
+    const confidence = t > 0 ? Math.max(0, Math.min(100, 100 * (1 - result.distance / t))) : 0
+    return { verified, confidence, threshold: t, distance: result.distance }
+  }, [result, threshold])
+
+  return (
+    <form className="biometrics-twocol" onSubmit={submit}>
+      <aside className="biometrics-panel">
+        <h2 className="biometrics-panel__title">Compare two faces</h2>
+        <MediaInput mode="image" label="First image" value={img1} onChange={setImg1} idPrefix="face-cmp-1" />
+        <MediaInput mode="image" label="Second image" value={img2} onChange={setImg2} idPrefix="face-cmp-2" />
+
+        <div className="form-row">
+          <div className="form-row__label">
+            <span className="form-row__label-text">Anti-spoofing</span>
+            <span className="form-row__hint">Flag photos-of-photos on either image.</span>
+          </div>
+          <label className="biometrics-switch">
+            <input type="checkbox" checked={antiSpoofing} onChange={(e) => setAntiSpoofing(e.target.checked)} />
+            <span aria-hidden="true" />
+          </label>
+        </div>
+
+        <button type="submit" className="btn btn-primary btn-full" disabled={loading || !img1 || !img2}>
+          {loading ? <><LoadingSpinner size="sm" /> Comparing…</> : <><i className="fas fa-equals" /> Compare</>}
+        </button>
+      </aside>
+
+      <section className="biometrics-results">
+        {loading && <div className="biometrics-empty"><LoadingSpinner size="lg" /></div>}
+        {error && <ErrorWithTraceLink message={error} />}
+        {!loading && !error && !result && (
+          <EmptyState icon="fas fa-people-arrows"
+            title="Drop two images to compare"
+            body="The backend will extract an embedding for each face and report the cosine distance between them. A match is declared when the distance is at or below the threshold." />
+        )}
+        {result && effective && (
+          <>
+            <div className="biometrics-compare">
+              <div className="biometrics-compare__panel">
+                <div className="biometrics-compare__label">Image 1</div>
+                <BoundingBoxCanvas src={img1?.dataUrl}
+                  boxes={result.img1_area ? [{ ...result.img1_area, label: result.img1_is_real === false ? 'Spoof' : null, tone: 'accent' }] : []} />
+                {antiSpoofing && result.img1_is_real != null && (
+                  <LivenessPill isReal={result.img1_is_real} score={result.img1_antispoof_score} />
+                )}
+              </div>
+              <div className="biometrics-compare__center">
+                <MatchGauge
+                  distance={effective.distance}
+                  threshold={effective.threshold}
+                  confidence={effective.confidence}
+                  verified={effective.verified}
+                />
+                <div className="biometrics-compare__threshold">
+                  <label htmlFor="face-threshold">Threshold <code>{effective.threshold.toFixed(3)}</code></label>
+                  <input id="face-threshold" type="range" min="0" max="1" step="0.005"
+                    value={effective.threshold}
+                    onChange={(e) => setThreshold(parseFloat(e.target.value))}
+                    aria-describedby="face-threshold-hint" />
+                  <p id="face-threshold-hint" className="biometrics-compare__hint">
+                    Drag to see how the verdict changes. The backend default is <code>{result.threshold?.toFixed(3)}</code>.
+                  </p>
+                </div>
+              </div>
+              <div className="biometrics-compare__panel">
+                <div className="biometrics-compare__label">Image 2</div>
+                <BoundingBoxCanvas src={img2?.dataUrl}
+                  boxes={result.img2_area ? [{ ...result.img2_area, label: result.img2_is_real === false ? 'Spoof' : null, tone: 'accent' }] : []} />
+                {antiSpoofing && result.img2_is_real != null && (
+                  <LivenessPill isReal={result.img2_is_real} score={result.img2_antispoof_score} />
+                )}
+              </div>
+            </div>
+            <ResponseDetails data={result} />
+          </>
+        )}
+      </section>
+    </form>
+  )
+}
+
+// ──────────────────────────── Enrollment (register / identify / forget) ────────────────────────────
+
+function EnrollTab({ model, addToast }) {
+  const [enrolled, setEnrolled] = useState(loadEnrollments)
+  const [enrollName, setEnrollName] = useState('')
+  const [enrollLabels, setEnrollLabels] = useState('')
+  const [enrollImg, setEnrollImg] = useState(null)
+  const [enrolling, setEnrolling] = useState(false)
+  const [enrollErr, setEnrollErr] = useState(null)
+  const [lastEnrolled, setLastEnrolled] = useState(null)
+
+  const [probeImg, setProbeImg] = useState(null)
+  const [topK, setTopK] = useState(5)
+  const [threshold, setThreshold] = useState(0.35)
+  const [identifying, setIdentifying] = useState(false)
+  const [identifyErr, setIdentifyErr] = useState(null)
+  const [identifyResult, setIdentifyResult] = useState(null)
+
+  useEffect(() => { saveEnrollments(enrolled) }, [enrolled])
+
+  const enroll = async (e) => {
+    e.preventDefault()
+    if (!model) { addToast('Select a face model first', 'warning'); return }
+    if (!enrollName.trim()) { addToast('Give this person a name', 'warning'); return }
+    if (!enrollImg) { addToast('Add a sample image', 'warning'); return }
+    setEnrolling(true); setEnrollErr(null)
+    try {
+      const data = await faceApi.register({
+        model,
+        name: enrollName.trim(),
+        img: enrollImg.dataUrl,
+        labels: parseLabels(enrollLabels),
+      })
+      const entry = {
+        id: data.id,
+        name: data.name,
+        labels: parseLabels(enrollLabels),
+        thumbnail: enrollImg.dataUrl,
+        registeredAt: data.registered_at || new Date().toISOString(),
+      }
+      setEnrolled(prev => [entry, ...prev])
+      setLastEnrolled(entry.id)
+      setEnrollName(''); setEnrollLabels(''); setEnrollImg(null)
+      addToast(`Enrolled ${entry.name}`, 'success')
+    } catch (err) {
+      setEnrollErr(err.message)
+    } finally {
+      setEnrolling(false)
+    }
+  }
+
+  const forget = async (entry) => {
+    try {
+      await faceApi.forget({ id: entry.id })
+      setEnrolled(prev => prev.filter(e => e.id !== entry.id))
+      addToast(`Removed ${entry.name}`, 'info')
+    } catch (err) {
+      if (err.status === 404) {
+        setEnrolled(prev => prev.filter(e => e.id !== entry.id))
+        addToast(`${entry.name} was already gone from the backend store`, 'warning')
+      } else {
+        addToast(err.message, 'error')
+      }
+    }
+  }
+
+  const identify = async (e) => {
+    e.preventDefault()
+    if (!model) { addToast('Select a face model first', 'warning'); return }
+    if (!probeImg) { addToast('Add a probe image', 'warning'); return }
+    setIdentifying(true); setIdentifyErr(null); setIdentifyResult(null)
+    try {
+      const data = await faceApi.identify({
+        model,
+        img: probeImg.dataUrl,
+        top_k: topK,
+        threshold,
+      })
+      setIdentifyResult(data)
+      if (!data?.matches?.length) addToast('No matches above threshold', 'info')
+    } catch (err) {
+      setIdentifyErr(err.message)
+    } finally {
+      setIdentifying(false)
+    }
+  }
+
+  return (
+    <div className="biometrics-enrollgrid">
+      <section className="biometrics-enrollgrid__register card">
+        <h2 className="biometrics-panel__title"><i className="fas fa-user-plus" /> Enroll a face</h2>
+        <form onSubmit={enroll}>
+          <div className="form-group">
+            <label className="form-label" htmlFor="face-enroll-name">Name</label>
+            <input id="face-enroll-name" className="input" value={enrollName}
+              onChange={(e) => setEnrollName(e.target.value)} placeholder="e.g. Alice Johnson" />
+          </div>
+          <div className="form-group">
+            <label className="form-label" htmlFor="face-enroll-labels">Labels <span className="form-label__hint">(optional, one per line)</span></label>
+            <textarea id="face-enroll-labels" className="textarea" rows={2}
+              placeholder={"team: engineering\nfloor: 3"}
+              value={enrollLabels} onChange={(e) => setEnrollLabels(e.target.value)} />
+          </div>
+          <MediaInput mode="image" label="Sample image" value={enrollImg} onChange={setEnrollImg} idPrefix="face-enroll" />
+          <button type="submit" className="btn btn-primary btn-full" disabled={enrolling}>
+            {enrolling ? <><LoadingSpinner size="sm" /> Enrolling…</> : <><i className="fas fa-plus" /> Enroll</>}
+          </button>
+          {enrollErr && <div className="biometrics-enrollgrid__err"><ErrorWithTraceLink message={enrollErr} /></div>}
+        </form>
+      </section>
+
+      <section className="biometrics-enrollgrid__identify card">
+        <h2 className="biometrics-panel__title"><i className="fas fa-magnifying-glass" /> Identify someone</h2>
+        <form onSubmit={identify}>
+          <MediaInput mode="image" label="Probe image" value={probeImg} onChange={setProbeImg} idPrefix="face-probe" />
+          <div className="form-grid-2col">
+            <div className="form-group">
+              <label className="form-label" htmlFor="face-topk">Top-K</label>
+              <input id="face-topk" type="number" min="1" max="25" className="input"
+                value={topK} onChange={(e) => setTopK(parseInt(e.target.value, 10) || 1)} />
+            </div>
+            <div className="form-group">
+              <label className="form-label" htmlFor="face-threshold-id">Threshold</label>
+              <input id="face-threshold-id" type="number" min="0" max="1" step="0.01" className="input"
+                value={threshold} onChange={(e) => setThreshold(parseFloat(e.target.value) || 0)} />
+            </div>
+          </div>
+          <button type="submit" className="btn btn-primary btn-full" disabled={identifying || !probeImg}>
+            {identifying ? <><LoadingSpinner size="sm" /> Searching…</> : <><i className="fas fa-magnifying-glass" /> Identify</>}
+          </button>
+          {identifyErr && <div className="biometrics-enrollgrid__err"><ErrorWithTraceLink message={identifyErr} /></div>}
+          {identifyResult && <MatchesList matches={identifyResult.matches || []} enrolled={enrolled} />}
+        </form>
+      </section>
+
+      <section className="biometrics-enrollgrid__list">
+        <div className="biometrics-enroll__head">
+          <h2 className="biometrics-panel__title"><i className="fas fa-id-card" /> Enrolled <span className="biometrics-enroll__count">{enrolled.length}</span></h2>
+        </div>
+        <EnrollmentList entries={enrolled} onDelete={forget} mode="image" highlightId={lastEnrolled} />
+      </section>
+    </div>
+  )
+}
+
+function MatchesList({ matches, enrolled }) {
+  if (!matches.length) {
+    return <div className="biometrics-matches__empty">No candidates above threshold.</div>
+  }
+  return (
+    <ul className="biometrics-matches" aria-label="Matches">
+      {matches.map((m, i) => {
+        const record = enrolled.find(e => e.id === m.id)
+        const conf = Math.max(0, Math.min(100, m.confidence ?? 0))
+        return (
+          <li key={m.id} className={`biometrics-matches__row ${m.match ? 'match' : 'miss'}`}>
+            <div className="biometrics-matches__rank">#{i + 1}</div>
+            <div className="biometrics-matches__avatar">
+              {record?.thumbnail
+                ? <img src={record.thumbnail} alt="" />
+                : <span>{(m.name || '?').slice(0, 2).toUpperCase()}</span>}
+            </div>
+            <div className="biometrics-matches__body">
+              <div className="biometrics-matches__name">
+                <strong>{m.name || m.id}</strong>
+                {m.match ? <span className="biometrics-matches__badge match"><i className="fas fa-check" /> match</span>
+                         : <span className="biometrics-matches__badge miss">below threshold</span>}
+              </div>
+              <div className="biometrics-matches__meter" aria-hidden="true">
+                <div className="biometrics-matches__fill" style={{ width: `${conf}%` }} />
+              </div>
+              <div className="biometrics-matches__meta">
+                <span>distance <code>{m.distance?.toFixed?.(4) ?? '—'}</code></span>
+                <span>confidence <code>{conf.toFixed(1)}%</code></span>
+              </div>
+            </div>
+          </li>
+        )
+      })}
+    </ul>
+  )
+}
+
+// ──────────────────────────── Embedding ────────────────────────────
+
+function EmbedTab({ model, addToast }) {
+  const [img, setImg] = useState(null)
+  const [loading, setLoading] = useState(false)
+  const [error, setError] = useState(null)
+  const [result, setResult] = useState(null)
+  const [elapsedMs, setElapsedMs] = useState(null)
+
+  const submit = async (e) => {
+    e.preventDefault()
+    if (!model) { addToast('Select a face model first', 'warning'); return }
+    if (!img) { addToast('Add an image', 'warning'); return }
+    setLoading(true); setError(null); setResult(null)
+    const started = performance.now()
+    try {
+      const data = await faceApi.embed({ model, img: img.dataUrl })
+      setElapsedMs(performance.now() - started)
+      setResult(data)
+    } catch (err) {
+      setError(err.message)
+    } finally {
+      setLoading(false)
+    }
+  }
+
+  return (
+    <form className="biometrics-twocol" onSubmit={submit}>
+      <aside className="biometrics-panel">
+        <h2 className="biometrics-panel__title">Get a raw embedding</h2>
+        <p className="biometrics-panel__note">
+          Returns a single face embedding vector. This is the same representation the backend uses internally for verify, identify, and compare.
+        </p>
+        <MediaInput mode="image" label="Image" value={img} onChange={setImg} idPrefix="face-embed" />
+        <button type="submit" className="btn btn-primary btn-full" disabled={loading || !img}>
+          {loading ? <><LoadingSpinner size="sm" /> Embedding…</> : <><i className="fas fa-code" /> Extract vector</>}
+        </button>
+      </aside>
+      <section className="biometrics-results">
+        {loading && <div className="biometrics-empty"><LoadingSpinner size="lg" /></div>}
+        {error && <ErrorWithTraceLink message={error} />}
+        {!loading && !error && !result && (
+          <EmptyState icon="fas fa-code"
+            title="Get a face embedding"
+            body="For developers — retrieve the raw vector for a face to store, search, or cluster outside of LocalAI." />
+        )}
+        {result && (
+          <EmbeddingInspector embedding={result.embedding} dim={result.dim} model={result.model} elapsedMs={elapsedMs} />
+        )}
+      </section>
+    </form>
+  )
+}
+
+// ──────────────────────────── Small shared bits ────────────────────────────
+
+function LivenessPill({ isReal, score }) {
+  if (isReal == null) {
+    return <span className="biometrics-pill muted"><i className="fas fa-circle-question" /> Not checked</span>
+  }
+  return (
+    <span className={`biometrics-pill ${isReal ? 'good' : 'bad'}`}>
+      <i className={`fas ${isReal ? 'fa-user-shield' : 'fa-mask'}`} />
+      {isReal ? 'Real' : 'Spoof'}
+      {score != null && <small>{(score * 100).toFixed(0)}%</small>}
+    </span>
+  )
+}
+
+function EmptyState({ icon, title, body }) {
+  return (
+    <div className="biometrics-empty">
+      <i className={icon} aria-hidden="true" />
+      <h3>{title}</h3>
+      <p>{body}</p>
+    </div>
+  )
+}
+
+function ResponseDetails({ data }) {
+  return (
+    <details className="biometrics-response">
+      <summary><i className="fas fa-angle-right" aria-hidden="true" /> Raw response</summary>
+      <pre>{JSON.stringify(data, null, 2)}</pre>
+    </details>
+  )
+}
--- a/core/http/react-ui/src/pages/VoiceRecognition.jsx
+++ b/core/http/react-ui/src/pages/VoiceRecognition.jsx
@@ -0,0 +1,543 @@
+import { useEffect, useMemo, useState } from 'react'
+import { useOutletContext, useParams } from 'react-router-dom'
+import ModelSelector from '../components/ModelSelector'
+import LoadingSpinner from '../components/LoadingSpinner'
+import ErrorWithTraceLink from '../components/ErrorWithTraceLink'
+import TabSwitch from '../components/biometrics/TabSwitch'
+import MediaInput from '../components/biometrics/MediaInput'
+import WaveformStrip from '../components/biometrics/WaveformStrip'
+import MatchGauge from '../components/biometrics/MatchGauge'
+import DistributionBars from '../components/biometrics/DistributionBars'
+import EnrollmentList from '../components/biometrics/EnrollmentList'
+import EmbeddingInspector from '../components/biometrics/EmbeddingInspector'
+import { CAP_SPEAKER_RECOGNITION } from '../utils/capabilities'
+import { voiceApi } from '../utils/api'
+
+const TABS = [
+  { id: 'analyze', icon: 'fas fa-wave-square',      label: 'Analyze' },
+  { id: 'compare', icon: 'fas fa-people-arrows',    label: 'Compare' },
+  { id: 'enroll',  icon: 'fas fa-id-badge',         label: 'Enrollment' },
+  { id: 'embed',   icon: 'fas fa-code',             label: 'Embedding' },
+]
+
+const ENROLL_KEY = 'localai_voice_enrollments'
+
+function loadEnrollments() {
+  try {
+    const raw = localStorage.getItem(ENROLL_KEY)
+    if (!raw) return []
+    const p = JSON.parse(raw)
+    return Array.isArray(p) ? p : []
+  } catch (_) { return [] }
+}
+
+function saveEnrollments(list) {
+  try { localStorage.setItem(ENROLL_KEY, JSON.stringify(list.slice(0, 50))) } catch (_) { /* quota */ }
+}
+
+function parseLabels(text) {
+  const out = {}
+  if (!text) return out
+  for (const line of text.split('\n')) {
+    const idx = line.indexOf(':')
+    if (idx === -1) continue
+    const k = line.slice(0, idx).trim()
+    const v = line.slice(idx + 1).trim()
+    if (k) out[k] = v
+  }
+  return out
+}
+
+const TONE_FOR_SEGMENT = ['accent', 'info', 'success', 'warning', 'data1', 'data2']
+
+export default function VoiceRecognition() {
+  const { model: urlModel } = useParams()
+  const { addToast } = useOutletContext()
+  const [model, setModel] = useState(urlModel || '')
+  const [tab, setTab] = useState('analyze')
+
+  return (
+    <div className="biometrics-page">
+      <header className="biometrics-page__header">
+        <div>
+          <h1 className="page-title"><i className="fas fa-microphone-lines" aria-hidden="true" /> Voice Recognition</h1>
+          <p className="page-subtitle">
+            Compare, identify, and analyze speakers — the audio analog to face recognition. Record directly from your microphone or upload a clip.
+          </p>
+        </div>
+        <div className="biometrics-page__model">
+          <label className="form-label">Model</label>
+          <ModelSelector value={model} onChange={setModel} capability={CAP_SPEAKER_RECOGNITION} />
+        </div>
+      </header>
+
+      <TabSwitch tabs={TABS} value={tab} onChange={setTab} />
+
+      <div className="biometrics-page__body">
+        {tab === 'analyze' && <AnalyzeTab model={model} addToast={addToast} />}
+        {tab === 'compare' && <CompareTab model={model} addToast={addToast} />}
+        {tab === 'enroll' && <EnrollTab model={model} addToast={addToast} />}
+        {tab === 'embed' && <EmbedTab model={model} addToast={addToast} />}
+      </div>
+    </div>
+  )
+}
+
+// ──────────────────────────── Analyze ────────────────────────────
+
+function AnalyzeTab({ model, addToast }) {
+  const [audio, setAudio] = useState(null)
+  const [actions, setActions] = useState({ age: true, gender: true, emotion: true })
+  const [loading, setLoading] = useState(false)
+  const [error, setError] = useState(null)
+  const [result, setResult] = useState(null)
+  const [focusIdx, setFocusIdx] = useState(0)
+
+  const submit = async (e) => {
+    e.preventDefault()
+    if (!model) { addToast('Select a speaker model first', 'warning'); return }
+    if (!audio) { addToast('Add an audio clip', 'warning'); return }
+    setLoading(true); setError(null); setResult(null); setFocusIdx(0)
+    try {
+      const data = await voiceApi.analyze({
+        model,
+        audio: audio.dataUrl,
+        actions: Object.entries(actions).filter(([, v]) => v).map(([k]) => k),
+      })
+      setResult(data)
+      if (!data?.segments?.length) addToast('No speech segments detected', 'warning')
+    } catch (err) {
+      setError(err.message)
+    } finally {
+      setLoading(false)
+    }
+  }
+
+  const segments = useMemo(() => result?.segments || [], [result])
+  const focus = segments[focusIdx]
+  const waveformSegments = useMemo(() => segments.map((s, i) => ({
+    start: s.start, end: s.end,
+    label: s.dominant_emotion || s.dominant_gender || `#${i + 1}`,
+    tone: i === focusIdx ? 'accent' : TONE_FOR_SEGMENT[i % TONE_FOR_SEGMENT.length],
+  })), [segments, focusIdx])
+
+  return (
+    <form className="biometrics-twocol" onSubmit={submit}>
+      <aside className="biometrics-panel">
+        <h2 className="biometrics-panel__title">Analyze a speaker</h2>
+        <MediaInput mode="audio" label="Audio clip" value={audio} onChange={setAudio} idPrefix="voice-analyze" />
+        <fieldset className="biometrics-fieldset">
+          <legend>Attributes</legend>
+          <div className="biometrics-chipset" role="group">
+            {['age', 'gender', 'emotion'].map(k => (
+              <label key={k} className={`biometrics-chip ${actions[k] ? 'active' : ''}`}>
+                <input type="checkbox" checked={actions[k]} onChange={(e) => setActions(a => ({ ...a, [k]: e.target.checked }))} />
+                <span>{k}</span>
+              </label>
+            ))}
+          </div>
+        </fieldset>
+        <button type="submit" className="btn btn-primary btn-full" disabled={loading || !audio}>
+          {loading ? <><LoadingSpinner size="sm" /> Analyzing…</> : <><i className="fas fa-wand-magic-sparkles" /> Analyze</>}
+        </button>
+      </aside>
+
+      <section className="biometrics-results">
+        {loading && <div className="biometrics-empty"><LoadingSpinner size="lg" /></div>}
+        {error && <ErrorWithTraceLink message={error} />}
+        {!loading && !error && !result && (
+          <EmptyState icon="fas fa-wave-square"
+            title="Record or upload a clip to analyze"
+            body="The backend will segment the audio by speaker turn and infer age, gender, and emotion per segment." />
+        )}
+        {result && audio && (
+          <>
+            <WaveformStrip src={audio.dataUrl} segments={waveformSegments} />
+            {segments.length > 1 && (
+              <div className="biometrics-facepicker" role="group" aria-label="Select segment">
+                {segments.map((s, i) => (
+                  <button key={i} type="button"
+                    className={`biometrics-facepicker__chip ${i === focusIdx ? 'active' : ''}`}
+                    onClick={() => setFocusIdx(i)}
+                    aria-pressed={i === focusIdx}>
+                    #{i + 1} <small>{s.start.toFixed(1)}s–{s.end.toFixed(1)}s</small>
+                  </button>
+                ))}
+              </div>
+            )}
+            {focus && (
+              <div className="biometrics-split">
+                <div className="biometrics-split__aside" style={{ gridColumn: '1 / -1' }}>
+                  <div className="biometrics-summary card">
+                    <div className="biometrics-summary__head">
+                      <h3><i className="fas fa-user" /> Segment {focusIdx + 1}
+                        <small>· {focus.start.toFixed(2)}s – {focus.end.toFixed(2)}s</small>
+                      </h3>
+                    </div>
+                    <dl className="biometrics-summary__grid">
+                      {focus.age != null && <><dt>Age</dt><dd>~{Math.round(focus.age)}</dd></>}
+                      {focus.dominant_gender && <><dt>Gender</dt><dd>{focus.dominant_gender}</dd></>}
+                      {focus.dominant_emotion && <><dt>Emotion</dt><dd>{focus.dominant_emotion}</dd></>}
+                    </dl>
+                  </div>
+                  <DistributionBars title="Gender" icon="fas fa-venus-mars" distribution={focus.gender} dominant={focus.dominant_gender} />
+                  <DistributionBars title="Emotion" icon="fas fa-face-smile-beam" distribution={focus.emotion} dominant={focus.dominant_emotion} />
+                </div>
+              </div>
+            )}
+            <ResponseDetails data={result} />
+          </>
+        )}
+      </section>
+    </form>
+  )
+}
+
+// ──────────────────────────── Compare ────────────────────────────
+
+function CompareTab({ model, addToast }) {
+  const [audio1, setAudio1] = useState(null)
+  const [audio2, setAudio2] = useState(null)
+  const [threshold, setThreshold] = useState(null)
+  const [loading, setLoading] = useState(false)
+  const [error, setError] = useState(null)
+  const [result, setResult] = useState(null)
+
+  const submit = async (e) => {
+    e.preventDefault()
+    if (!model) { addToast('Select a speaker model first', 'warning'); return }
+    if (!audio1 || !audio2) { addToast('Add both clips to compare', 'warning'); return }
+    setLoading(true); setError(null); setResult(null)
+    try {
+      const body = { model, audio1: audio1.dataUrl, audio2: audio2.dataUrl }
+      if (threshold != null) body.threshold = threshold
+      const data = await voiceApi.verify(body)
+      setResult(data)
+      if (threshold == null && data?.threshold) setThreshold(data.threshold)
+    } catch (err) {
+      setError(err.message)
+    } finally {
+      setLoading(false)
+    }
+  }
+
+  const effective = useMemo(() => {
+    if (!result) return null
+    const t = threshold ?? result.threshold
+    const verified = result.distance <= t
+    const confidence = t > 0 ? Math.max(0, Math.min(100, 100 * (1 - result.distance / t))) : 0 // t === 0 would yield NaN or -Infinity
+    return { verified, confidence, threshold: t, distance: result.distance }
+  }, [result, threshold])
+
+  return (
+    <form className="biometrics-twocol" onSubmit={submit}>
+      <aside className="biometrics-panel">
+        <h2 className="biometrics-panel__title">Compare two voices</h2>
+        <MediaInput mode="audio" label="First clip" value={audio1} onChange={setAudio1} idPrefix="voice-cmp-1" />
+        <MediaInput mode="audio" label="Second clip" value={audio2} onChange={setAudio2} idPrefix="voice-cmp-2" />
+        <button type="submit" className="btn btn-primary btn-full" disabled={loading || !audio1 || !audio2}>
+          {loading ? <><LoadingSpinner size="sm" /> Comparing…</> : <><i className="fas fa-equals" /> Compare</>}
+        </button>
+      </aside>
+
+      <section className="biometrics-results">
+        {loading && <div className="biometrics-empty"><LoadingSpinner size="lg" /></div>}
+        {error && <ErrorWithTraceLink message={error} />}
+        {!loading && !error && !result && (
+          <EmptyState icon="fas fa-people-arrows"
+            title="Drop two clips to compare"
+            body="We extract a speaker embedding for each clip and report the cosine distance. A match is declared when the distance is at or below the threshold." />
+        )}
+        {result && effective && (
+          <>
+            <div className="biometrics-compare biometrics-compare--voice">
+              <div className="biometrics-compare__panel">
+                <div className="biometrics-compare__label">Clip 1</div>
+                <WaveformStrip src={audio1?.dataUrl} height={80} />
+              </div>
+              <div className="biometrics-compare__center">
+                <MatchGauge
+                  distance={effective.distance}
+                  threshold={effective.threshold}
+                  confidence={effective.confidence}
+                  verified={effective.verified}
+                />
+                <div className="biometrics-compare__threshold">
+                  <label htmlFor="voice-threshold">Threshold <code>{effective.threshold?.toFixed(3)}</code></label>
+                  <input id="voice-threshold" type="range" min="0" max="1" step="0.005"
+                    value={effective.threshold}
+                    onChange={(e) => setThreshold(parseFloat(e.target.value))} />
+                  <p className="biometrics-compare__hint">
+                    Drag to see how the verdict changes. The backend default is <code>{result.threshold?.toFixed(3)}</code>.
+                  </p>
+                </div>
+              </div>
+              <div className="biometrics-compare__panel">
+                <div className="biometrics-compare__label">Clip 2</div>
+                <WaveformStrip src={audio2?.dataUrl} height={80} />
+              </div>
+            </div>
+            <ResponseDetails data={result} />
+          </>
+        )}
+      </section>
+    </form>
+  )
+}
+
+// ──────────────────────────── Enrollment ────────────────────────────
+
+function EnrollTab({ model, addToast }) {
+  const [enrolled, setEnrolled] = useState(loadEnrollments)
+  const [enrollName, setEnrollName] = useState('')
+  const [enrollLabels, setEnrollLabels] = useState('')
+  const [enrollAudio, setEnrollAudio] = useState(null)
+  const [enrolling, setEnrolling] = useState(false)
+  const [enrollErr, setEnrollErr] = useState(null)
+  const [lastEnrolled, setLastEnrolled] = useState(null)
+
+  const [probeAudio, setProbeAudio] = useState(null)
+  const [topK, setTopK] = useState(5)
+  const [threshold, setThreshold] = useState(0.25)
+  const [identifying, setIdentifying] = useState(false)
+  const [identifyErr, setIdentifyErr] = useState(null)
+  const [identifyResult, setIdentifyResult] = useState(null)
+
+  useEffect(() => { saveEnrollments(enrolled) }, [enrolled])
+
+  const enroll = async (e) => {
+    e.preventDefault()
+    if (!model) { addToast('Select a speaker model first', 'warning'); return }
+    if (!enrollName.trim()) { addToast('Give this speaker a name', 'warning'); return }
+    if (!enrollAudio) { addToast('Add a sample clip', 'warning'); return }
+    setEnrolling(true); setEnrollErr(null)
+    try {
+      const data = await voiceApi.register({
+        model,
+        name: enrollName.trim(),
+        audio: enrollAudio.dataUrl,
+        labels: parseLabels(enrollLabels),
+      })
+      const entry = {
+        id: data.id,
+        name: data.name,
+        labels: parseLabels(enrollLabels),
+        sampleUrl: enrollAudio.dataUrl,
+        registeredAt: data.registered_at || new Date().toISOString(),
+      }
+      setEnrolled(prev => [entry, ...prev])
+      setLastEnrolled(entry.id)
+      setEnrollName(''); setEnrollLabels(''); setEnrollAudio(null)
+      addToast(`Enrolled ${entry.name}`, 'success')
+    } catch (err) {
+      setEnrollErr(err.message)
+    } finally {
+      setEnrolling(false)
+    }
+  }
+
+  const forget = async (entry) => {
+    try {
+      await voiceApi.forget({ id: entry.id })
+      setEnrolled(prev => prev.filter(e => e.id !== entry.id))
+      addToast(`Removed ${entry.name}`, 'info')
+    } catch (err) {
+      if (err.status === 404) {
+        setEnrolled(prev => prev.filter(e => e.id !== entry.id))
+        addToast(`${entry.name} was already gone from the backend store`, 'warning')
+      } else {
+        addToast(err.message, 'error')
+      }
+    }
+  }
+
+  const identify = async (e) => {
+    e.preventDefault()
+    if (!model) { addToast('Select a speaker model first', 'warning'); return }
+    if (!probeAudio) { addToast('Add a probe clip', 'warning'); return }
+    setIdentifying(true); setIdentifyErr(null); setIdentifyResult(null)
+    try {
+      const data = await voiceApi.identify({
+        model,
+        audio: probeAudio.dataUrl,
+        top_k: topK,
+        threshold,
+      })
+      setIdentifyResult(data)
+      if (!data?.matches?.length) addToast('No matches above threshold', 'info')
+    } catch (err) {
+      setIdentifyErr(err.message)
+    } finally {
+      setIdentifying(false)
+    }
+  }
+
+  return (
+    <div className="biometrics-enrollgrid">
+      <section className="biometrics-enrollgrid__register card">
+        <h2 className="biometrics-panel__title"><i className="fas fa-user-plus" /> Enroll a voice</h2>
+        <form onSubmit={enroll}>
+          <div className="form-group">
+            <label className="form-label" htmlFor="voice-enroll-name">Name</label>
+            <input id="voice-enroll-name" className="input" value={enrollName}
+              onChange={(e) => setEnrollName(e.target.value)} placeholder="e.g. Alice Johnson" />
+          </div>
+          <div className="form-group">
+            <label className="form-label" htmlFor="voice-enroll-labels">Labels <span className="form-label__hint">(optional, one per line)</span></label>
+            <textarea id="voice-enroll-labels" className="textarea" rows={2}
+              placeholder={"team: engineering\nrole: lead"}
+              value={enrollLabels} onChange={(e) => setEnrollLabels(e.target.value)} />
+          </div>
+          <MediaInput mode="audio" label="Sample clip" value={enrollAudio} onChange={setEnrollAudio} idPrefix="voice-enroll" />
+          <button type="submit" className="btn btn-primary btn-full" disabled={enrolling}>
+            {enrolling ? <><LoadingSpinner size="sm" /> Enrolling…</> : <><i className="fas fa-plus" /> Enroll</>}
+          </button>
+          {enrollErr && <div className="biometrics-enrollgrid__err"><ErrorWithTraceLink message={enrollErr} /></div>}
+        </form>
+      </section>
+
+      <section className="biometrics-enrollgrid__identify card">
+        <h2 className="biometrics-panel__title"><i className="fas fa-magnifying-glass" /> Identify a speaker</h2>
+        <form onSubmit={identify}>
+          <MediaInput mode="audio" label="Probe clip" value={probeAudio} onChange={setProbeAudio} idPrefix="voice-probe" />
+          <div className="form-grid-2col">
+            <div className="form-group">
+              <label className="form-label" htmlFor="voice-topk">Top-K</label>
+              <input id="voice-topk" type="number" min="1" max="25" className="input"
+                value={topK} onChange={(e) => setTopK(parseInt(e.target.value, 10) || 1)} />
+            </div>
+            <div className="form-group">
+              <label className="form-label" htmlFor="voice-threshold-id">Threshold</label>
+              <input id="voice-threshold-id" type="number" min="0" max="1" step="0.01" className="input"
+                value={threshold} onChange={(e) => setThreshold(parseFloat(e.target.value) || 0)} />
+            </div>
+          </div>
+          <button type="submit" className="btn btn-primary btn-full" disabled={identifying || !probeAudio}>
+            {identifying ? <><LoadingSpinner size="sm" /> Searching…</> : <><i className="fas fa-magnifying-glass" /> Identify</>}
+          </button>
+          {identifyErr && <div className="biometrics-enrollgrid__err"><ErrorWithTraceLink message={identifyErr} /></div>}
+          {identifyResult && <MatchesList matches={identifyResult.matches || []} enrolled={enrolled} />}
+        </form>
+      </section>
+
+      <section className="biometrics-enrollgrid__list">
+        <div className="biometrics-enroll__head">
+          <h2 className="biometrics-panel__title"><i className="fas fa-id-badge" /> Enrolled <span className="biometrics-enroll__count">{enrolled.length}</span></h2>
+        </div>
+        <EnrollmentList entries={enrolled} onDelete={forget} mode="audio" highlightId={lastEnrolled} />
+      </section>
+    </div>
+  )
+}
+
+function MatchesList({ matches, enrolled }) {
+  if (!matches.length) {
+    return <div className="biometrics-matches__empty">No candidates above threshold.</div>
+  }
+  return (
+    <ul className="biometrics-matches" aria-label="Matches">
+      {matches.map((m, i) => {
+        const record = enrolled.find(e => e.id === m.id)
+        const conf = Math.max(0, Math.min(100, m.confidence ?? 0))
+        return (
+          <li key={m.id} className={`biometrics-matches__row ${m.match ? 'match' : 'miss'}`}>
+            <div className="biometrics-matches__rank">#{i + 1}</div>
+            <div className="biometrics-matches__avatar">
+              <span>{(m.name || '?').slice(0, 2).toUpperCase()}</span>
+            </div>
+            <div className="biometrics-matches__body">
+              <div className="biometrics-matches__name">
+                <strong>{m.name || m.id}</strong>
+                {m.match ? <span className="biometrics-matches__badge match"><i className="fas fa-check" /> match</span>
+                         : <span className="biometrics-matches__badge miss">below threshold</span>}
+              </div>
+              {record?.sampleUrl && (
+                <audio controls src={record.sampleUrl} className="biometrics-matches__preview" />
+              )}
+              <div className="biometrics-matches__meter" aria-hidden="true">
+                <div className="biometrics-matches__fill" style={{ width: `${conf}%` }} />
+              </div>
+              <div className="biometrics-matches__meta">
+                <span>distance <code>{m.distance?.toFixed?.(4) ?? '—'}</code></span>
+                <span>confidence <code>{conf.toFixed(1)}%</code></span>
+              </div>
+            </div>
+          </li>
+        )
+      })}
+    </ul>
+  )
+}
+
+// ──────────────────────────── Embedding ────────────────────────────
+
+function EmbedTab({ model, addToast }) {
+  const [audio, setAudio] = useState(null)
+  const [loading, setLoading] = useState(false)
+  const [error, setError] = useState(null)
+  const [result, setResult] = useState(null)
+  const [elapsedMs, setElapsedMs] = useState(null)
+
+  const submit = async (e) => {
+    e.preventDefault()
+    if (!model) { addToast('Select a speaker model first', 'warning'); return }
+    if (!audio) { addToast('Add an audio clip', 'warning'); return }
+    setLoading(true); setError(null); setResult(null)
+    const started = performance.now()
+    try {
+      const data = await voiceApi.embed({ model, audio: audio.dataUrl })
+      setElapsedMs(performance.now() - started)
+      setResult(data)
+    } catch (err) {
+      setError(err.message)
+    } finally {
+      setLoading(false)
+    }
+  }
+
+  return (
+    <form className="biometrics-twocol" onSubmit={submit}>
+      <aside className="biometrics-panel">
+        <h2 className="biometrics-panel__title">Get a raw speaker embedding</h2>
+        <p className="biometrics-panel__note">
+          Returns a speaker-encoder vector — the same representation the backend uses internally for verify and identify.
+        </p>
+        <MediaInput mode="audio" label="Audio clip" value={audio} onChange={setAudio} idPrefix="voice-embed" />
+        <button type="submit" className="btn btn-primary btn-full" disabled={loading || !audio}>
+          {loading ? <><LoadingSpinner size="sm" /> Embedding…</> : <><i className="fas fa-code" /> Extract vector</>}
+        </button>
+      </aside>
+      <section className="biometrics-results">
+        {loading && <div className="biometrics-empty"><LoadingSpinner size="lg" /></div>}
+        {error && <ErrorWithTraceLink message={error} />}
+        {!loading && !error && !result && (
+          <EmptyState icon="fas fa-code"
+            title="Get a speaker embedding"
+            body="For developers — retrieve the raw vector for a voice to store, search, or cluster outside of LocalAI." />
+        )}
+        {result && (
+          <EmbeddingInspector embedding={result.embedding} dim={result.dim} model={result.model} elapsedMs={elapsedMs} />
+        )}
+      </section>
+    </form>
+  )
+}
+
+function EmptyState({ icon, title, body }) {
+  return (
+    <div className="biometrics-empty">
+      <i className={icon} aria-hidden="true" />
+      <h3>{title}</h3>
+      <p>{body}</p>
+    </div>
+  )
+}
+
+function ResponseDetails({ data }) {
+  return (
+    <details className="biometrics-response">
+      <summary><i className="fas fa-angle-right" aria-hidden="true" /> Raw response</summary>
+      <pre>{JSON.stringify(data, null, 2)}</pre>
+    </details>
+  )
+}
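The verdict logic in `CompareTab` above can be exercised outside React. The sketch below is a minimal standalone version, assuming the response carries a numeric `distance` and `threshold` as the component expects. `deriveVerdict` is a hypothetical name that does not appear in the diff, and the zero-threshold guard is an assumption added for illustration rather than something the component is confirmed to do.

```javascript
// Hypothetical helper (not part of the diff): derives the verdict and the
// 0-100 confidence that CompareTab passes to MatchGauge.
function deriveVerdict(distance, threshold) {
  // A pair is treated as a match when the distance is at or below the threshold.
  const verified = distance <= threshold
  // Assumption: guard the division so a threshold of 0 (the slider's minimum)
  // yields a confidence of 0 instead of NaN or -Infinity.
  const ratio = threshold > 0 ? distance / threshold : Infinity
  // Confidence falls linearly from 100 (distance 0) to 0 (distance >= threshold).
  const confidence = Math.max(0, Math.min(100, 100 * (1 - ratio)))
  return { verified, confidence }
}

const sample = deriveVerdict(0.2, 0.4)
console.log(sample.verified, sample.confidence.toFixed(1))
```

Dragging the threshold slider only re-runs this arithmetic. It does not trigger a new request, which is why the component keeps the backend's `distance` and recomputes the verdict locally.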
--- a/core/http/react-ui/src/router.jsx
+++ b/core/http/react-ui/src/router.jsx
@@ -34,6 +34,8 @@ import Login from './pages/Login'
 import FineTune from './pages/FineTune'
 import Quantize from './pages/Quantize'
 import Studio from './pages/Studio'
+import FaceRecognition from './pages/FaceRecognition'
+import VoiceRecognition from './pages/VoiceRecognition'
 import Nodes from './pages/Nodes'
 import NodeBackendLogs from './pages/NodeBackendLogs'
 import NotFound from './pages/NotFound'
@@ -73,6 +75,10 @@ const appChildren = [
  { path: 'sound/:model', element: <Sound /> },
  { path: 'studio', element: <Studio /> },
  { path: 'talk', element: <Talk /> },
+  { path: 'face', element: <Feature feature="face_recognition"><FaceRecognition /></Feature> },
+  { path: 'face/:model', element: <Feature feature="face_recognition"><FaceRecognition /></Feature> },
+  { path: 'voice', element: <Feature feature="voice_recognition"><VoiceRecognition /></Feature> },
+  { path: 'voice/:model', element: <Feature feature="voice_recognition"><VoiceRecognition /></Feature> },
  { path: 'usage', element: <Usage /> },
  { path: 'account', element: <Account /> },
  { path: 'users', element: <Admin><Users /></Admin> },
--- a/core/http/react-ui/src/utils/api.js
+++ b/core/http/react-ui/src/utils/api.js
@@ -259,6 +259,26 @@ export const audioApi = {
  },
 }

+// Face biometrics — backend spec: core/http/endpoints/localai/face_*.go
+export const faceApi = {
+  verify: (body) => postJSON(API_CONFIG.endpoints.faceVerify, body),
+  analyze: (body) => postJSON(API_CONFIG.endpoints.faceAnalyze, body),
+  embed: (body) => postJSON(API_CONFIG.endpoints.faceEmbed, body),
+  register: (body) => postJSON(API_CONFIG.endpoints.faceRegister, body),
+  identify: (body) => postJSON(API_CONFIG.endpoints.faceIdentify, body),
+  forget: (body) => postJSON(API_CONFIG.endpoints.faceForget, body),
+}
+
+// Voice biometrics — backend spec: core/http/endpoints/localai/voice_*.go
+export const voiceApi = {
+  verify: (body) => postJSON(API_CONFIG.endpoints.voiceVerify, body),
+  analyze: (body) => postJSON(API_CONFIG.endpoints.voiceAnalyze, body),
+  embed: (body) => postJSON(API_CONFIG.endpoints.voiceEmbed, body),
+  register: (body) => postJSON(API_CONFIG.endpoints.voiceRegister, body),
+  identify: (body) => postJSON(API_CONFIG.endpoints.voiceIdentify, body),
+  forget: (body) => postJSON(API_CONFIG.endpoints.voiceForget, body),
+}
+
 // Realtime / WebRTC
 export const realtimeApi = {
  call: (body) => postJSON(API_CONFIG.endpoints.realtimeCalls, body),
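The wrappers above only forward a body to `postJSON`, so the payload shape is decided by the caller. The following is a rough sketch of how the Voice page assembles two of those bodies. Field names are taken from `CompareTab` and `EnrollTab` in this diff; the helper names `buildVerifyBody`, `parseLabelLines`, and `buildRegisterBody` are hypothetical and exist only for illustration, as are the placeholder model name and data URL.

```javascript
// Hypothetical helper: the same body CompareTab sends to voiceApi.verify.
function buildVerifyBody({ model, audio1, audio2, threshold = null }) {
  const body = { model, audio1, audio2 }
  // The threshold is only included when the user has overridden it;
  // otherwise the backend applies its own default.
  if (threshold != null) body.threshold = threshold
  return body
}

// Hypothetical helper: mirrors the per-line "key: value" parsing of parseLabels.
function parseLabelLines(text) {
  const labels = {}
  if (!text) return labels
  for (const line of text.split('\n')) {
    const idx = line.indexOf(':')
    if (idx === -1) continue // lines without a colon are ignored
    const key = line.slice(0, idx).trim()
    // Only the first colon splits, so values may themselves contain colons.
    if (key) labels[key] = line.slice(idx + 1).trim()
  }
  return labels
}

// Hypothetical helper: the same body EnrollTab sends to voiceApi.register.
function buildRegisterBody({ model, name, audio, labelsText }) {
  return { model, name: name.trim(), audio, labels: parseLabelLines(labelsText) }
}

const registerBody = buildRegisterBody({
  model: 'example-speaker-model', // placeholder model name
  name: ' Alice Johnson ',
  audio: 'data:audio/wav;base64,AAAA', // placeholder data URL
  labelsText: 'team: engineering\nurl: https://example.com\nnot a label',
})
console.log(JSON.stringify(registerBody, null, 2))
```

Keeping payload assembly in plain functions like these makes the request shape testable without mounting a component or mocking the network.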
--- a/core/http/react-ui/src/utils/config.js
+++ b/core/http/react-ui/src/utils/config.js
@@ -73,6 +73,23 @@ export const API_CONFIG = {
    audioTranscriptions: '/v1/audio/transcriptions',
    soundGeneration: '/v1/sound-generation',
    embeddings: '/v1/embeddings',
+
+    // Face biometrics
+    faceVerify: '/v1/face/verify',
+    faceAnalyze: '/v1/face/analyze',
+    faceEmbed: '/v1/face/embed',
+    faceRegister: '/v1/face/register',
+    faceIdentify: '/v1/face/identify',
+    faceForget: '/v1/face/forget',
+
+    // Voice biometrics
+    voiceVerify: '/v1/voice/verify',
+    voiceAnalyze: '/v1/voice/analyze',
+    voiceEmbed: '/v1/voice/embed',
+    voiceRegister: '/v1/voice/register',
+    voiceIdentify: '/v1/voice/identify',
+    voiceForget: '/v1/voice/forget',
+
    modelsList: '/v1/models',
    modelsCapabilities: '/api/models/capabilities',

--- a/core/http/routes/ui_api.go
+++ b/core/http/routes/ui_api.go
@@ -1303,21 +1303,39 @@ func RegisterUIAPIRoutes(app *echo.Echo, cl *config.ModelConfigLoader, ml *model
 			})
 		}

-		uid, err := uuid.NewUUID()
+		id, err := uuid.NewUUID()
 		if err != nil {
 			return c.JSON(http.StatusInternalServerError, map[string]any{"error": err.Error()})
 		}

-		galleryService.BackendGalleryChannel <- galleryop.ManagementOp[gallery.GalleryBackend, any]{
-			ID:                 uid.String(),
+		uid := id.String()
+
+		// Register in opcache so the operation shows up in /api/operations
+		// and the Backends UI can reflect progress on the affected row.
+		opcache.SetBackend(backendName, uid)
+
+		ctx, cancelFunc := context.WithCancel(context.Background())
+		op := galleryop.ManagementOp[gallery.GalleryBackend, any]{
+			ID:                 uid,
 			GalleryElementName: backendName,
 			Galleries:          appConfig.BackendGalleries,
 			Upgrade:            true,
+			Context:            ctx,
+			CancelFunc:         cancelFunc,
 		}
+		// Store cancellation function immediately so queued operations can be cancelled
+		galleryService.StoreCancellation(uid, cancelFunc)
+		// Send from a goroutine: BackendGalleryChannel is unbuffered, so a direct
+		// send would block the HTTP handler whenever the worker is busy.
+		go func() {
+			galleryService.BackendGalleryChannel <- op
+		}()

 		return c.JSON(200, map[string]any{
-			"uuid":      uid.String(),
-			"statusUrl": fmt.Sprintf("/api/backends/job/%s", uid.String()),
+			"jobID":     uid,
+			"uuid":      uid,
+			"statusUrl": fmt.Sprintf("/api/backends/job/%s", uid),
+			"message":   "Backend upgrade started",
 		})
 	}, adminMiddleware)

--- a/core/services/galleryop/service.go
+++ b/core/services/galleryop/service.go
@@ -28,6 +28,13 @@ type GalleryService struct {
 	// Distributed mode (nil when not in distributed mode)
 	natsClient   messaging.Publisher
 	galleryStore *distributed.GalleryStore
+
+	// OnBackendOpCompleted is fired after every successful install/upgrade/delete
+	// on the backend channel. The Application wires this to UpgradeChecker.TriggerCheck
+	// so `/api/backends/upgrades` stops surfacing a backend as upgradeable the moment
+	// the worker finishes — previously the cache only refreshed on the 6-hour tick,
+	// making manual upgrades look like they failed even when they hadn't.
+	OnBackendOpCompleted func()
 }

 func NewGalleryService(appConfig *config.ApplicationConfig, ml *model.ModelLoader) *GalleryService {
@@ -245,6 +252,11 @@ func (g *GalleryService) Start(c context.Context, cl *config.ModelConfigLoader,
 				err := g.backendHandler(&op, systemState)
 				if err != nil {
 					updateError(op.ID, err)
+				} else if g.OnBackendOpCompleted != nil {
+					// Let listeners (e.g. UpgradeChecker) refresh their view of
+					// installed state. Run off the worker goroutine so a slow
+					// callback doesn't stall the next queued operation.
+					go g.OnBackendOpCompleted()
 				}
 				g.removeCancellation(op.ID)

--- a/docs/content/features/text-generation.md
+++ b/docs/content/features/text-generation.md
@@ -631,6 +631,83 @@ The `cache_type_k` / `cache_type_v` fields map to llama.cpp's `-ctk` / `-ctv` fl
 - [Tracked branch: `feature/turboquant-kv-cache`](https://github.com/TheTom/llama-cpp-turboquant/tree/feature/turboquant-kv-cache)


+### buun-llama-cpp (DFlash speculative decoding + TurboQuant/TCQ KV-cache)
+
+[buun-llama-cpp](https://github.com/spiritbuun/buun-llama-cpp) is a fork-of-a-fork: spiritbuun forked `TheTom/llama-cpp-turboquant` (the `turboquant` backend above) and added two independent features on top:
+
+1. **DFlash** — a block-diffusion speculative decoding scheme that uses a dedicated drafter model (new `DFlashDraftModel` GGUF architecture). On a target/drafter pair it emits a block of tokens per speculation step and can be combined with tree-structured verification ("DDTree") for multi-branch draft expansion.
+2. **TCQ (Trellis-Coded Quantization)** — two additional KV-cache types (`turbo2_tcq`, `turbo3_tcq`) on top of the TurboQuant `turbo2` / `turbo3` / `turbo4` already shipped by the parent fork, delivering a reported 10–44% reduction in KL divergence over scalar quantization at 2–3 bits per value.
+
+Like `turboquant`, this backend shares LocalAI's stock `llama-cpp` gRPC server sources — so any GGUF model that runs on `llama-cpp` also runs on `buun-llama-cpp`. Pick it over `turboquant` specifically when you want DFlash speculative decoding or the newer TCQ KV-cache variants.
+
+#### Features
+
+- Drop-in GGUF compatibility with upstream `llama.cpp`.
+- DFlash block-diffusion speculative decoding (CUDA/Metal; no CPU fallback).
+- TurboQuant KV-cache types (`turbo2`, `turbo3`, `turbo4`) inherited from the parent `turboquant` fork, plus buun-exclusive `turbo2_tcq` and `turbo3_tcq` variants.
+- Same feature surface as `llama-cpp`: text generation, embeddings, tool calls, multimodal via mmproj.
+- Available on CPU (AVX/AVX2/AVX512/fallback), NVIDIA CUDA 12/13, AMD ROCm/HIP, Intel SYCL f32/f16, Vulkan, and NVIDIA L4T. Note that DFlash and the `turbo*` KV-cache types have no CPU fallback and fail at model load on CPU-only builds.
+
+#### Setup
+
+`buun-llama-cpp` ships as a separate container image in the LocalAI backend gallery. Install it like any other backend:
+
+```bash
+local-ai backends install buun-llama-cpp
+```
+
+Or pick a specific flavor for your hardware (example tags: `cpu-buun-llama-cpp`, `cuda12-buun-llama-cpp`, `cuda13-buun-llama-cpp`, `rocm-buun-llama-cpp`, `intel-sycl-f16-buun-llama-cpp`, `vulkan-buun-llama-cpp`).
+
+#### YAML configuration — TCQ KV-cache
+
+To run a model with TurboQuant/TCQ quantized KV-cache, set the backend and pick a `turbo*` cache type:
+
+```yaml
+name: my-model
+backend: buun-llama-cpp
+parameters:
+  model: file.gguf
+# Accepted values for the two fork-aware backends include the stock llama.cpp
+# types (f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1), the TurboQuant types
+# (turbo2, turbo3, turbo4), and the buun-only TCQ variants (turbo2_tcq,
+# turbo3_tcq). turbo3 / turbo4 / turbo*_tcq auto-enable flash_attention.
+cache_type_k: turbo3
+cache_type_v: turbo3_tcq
+context_size: 8192
+```
+
+#### YAML configuration — DFlash speculative decoding
+
+DFlash requires a **dedicated drafter model** in the new `DFlashDraftModel` GGUF architecture. At the time of writing, the only known public target/drafter pair is [`z-lab/Qwen3.5-27B`](https://huggingface.co/z-lab/Qwen3.5-27B) + [`z-lab/Qwen3.5-27B-DFlash`](https://huggingface.co/z-lab/Qwen3.5-27B-DFlash).
+
+```yaml
+name: qwen3-dflash
+backend: buun-llama-cpp
+parameters:
+  # Target model (quantized as usual)
+  model: Qwen3.5-27B-Q4_K_M.gguf
+# Drafter model produced by buun's convert_hf_to_gguf.py from the
+# DFlashDraftModel checkpoint. Resolved relative to the models path.
+draft_model: Qwen3.5-27B-DFlash.gguf
+options:
+  # Switches the speculative pipeline from the default draft-model mode to
+  # DFlash (block-diffusion). Required to activate the DFlash code path.
+  - spec_type:dflash
+  # Optional tuning:
+  # - tree_budget:0      # 0 = flat DFlash; >0 = DDTree verification budget
+  # - draft_topk:1       # drafter top-K per position (1 = argmax)
+  # - spec_n_max:16      # cap on draft tokens per speculation step
+```
+
+Under the hood LocalAI wires `draft_model` through to the grpc-server's `params.speculative.mparams_dft.path`, and `spec_type:dflash` is forwarded through the options passthrough to buun's `common_speculative_type_from_name("dflash")`. The `tree_budget` and `draft_topk` options are buun-exclusive; they reference struct fields that only exist in buun's fork, so they're surfaced on this backend only (passing them to stock `llama-cpp` is a no-op).
+
+#### Reference
+
+- [spiritbuun/buun-llama-cpp](https://github.com/spiritbuun/buun-llama-cpp)
+- [TCQ paper / dataset](https://huggingface.co/datasets/spiritbuun/turboquant-tcq-kv-cache) — *"Closing the Gap: Trellis-Coded Quantization for KV Cache at 2-3 Bits"*
+- DFlash target/drafter pair: [`z-lab/Qwen3.5-27B`](https://huggingface.co/z-lab/Qwen3.5-27B) + [`z-lab/Qwen3.5-27B-DFlash`](https://huggingface.co/z-lab/Qwen3.5-27B-DFlash)
+
+
 ### vLLM

 [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference.
--- a/docs/content/reference/compatibility-table.md
+++ b/docs/content/reference/compatibility-table.md
@@ -20,6 +20,7 @@ LocalAI will attempt to automatically load models which are not explicitly confi
 |---------|-------------|------------|------------|-----------|-------------|
 | [llama.cpp](https://github.com/ggerganov/llama.cpp) | LLM inference in C/C++. Supports LLaMA, Mamba, RWKV, Falcon, Starcoder, GPT-2, [and many others](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#description) | GPT, Functions | yes | yes | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
 | [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) | Hard fork of llama.cpp optimized for CPU/hybrid CPU+GPU with IQK quants, custom quant mixes, and MLA for DeepSeek | GPT | yes | yes | CPU (AVX2+) |
+| [buun-llama-cpp](https://github.com/spiritbuun/buun-llama-cpp) | llama.cpp fork with DFlash block-diffusion speculative decoding and TurboQuant/TCQ KV-cache quantization (2–3 bits per value). Accelerated paths are CUDA/Metal only. | GPT, Functions | yes | yes | CUDA, Metal (CPU fallback for non-turbo/non-DFlash only) |
 | [vLLM](https://github.com/vllm-project/vllm) | Fast LLM serving with PagedAttention | GPT, Functions | no | yes | CPU, CUDA 12, ROCm, Intel |
 | [vLLM Omni](https://github.com/vllm-project/vllm) | Unified multimodal generation (text, image, video, audio) | Multimodal GPT, Functions | no | yes | CUDA 12, ROCm |
 | [transformers](https://github.com/huggingface/transformers) | HuggingFace Transformers framework | GPT, Embeddings, Multimodal | yes | yes* | CPU, CUDA 12/13, ROCm, Intel, Metal |
--- a/gallery/index.yaml
+++ b/gallery/index.yaml
@@ -3,40 +3,7 @@
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
  urls:
    - https://huggingface.co/KyleHessling1/Qwopus-GLM-18B-Merged-GGUF
-  description: |
-    # 🪐 Qwen3.5-9B-GLM5.1-Distill-v1
-
-    ## 📌 Model Overview
-
-    **Model Name:** `Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1`
-    **Base Model:** Qwen3.5-9B
-    **Training Type:** Supervised Fine-Tuning (SFT, Distillation)
-    **Parameter Scale:** 9B
-    **Training Framework:** Unsloth
-
-    This model is a distilled variant of **Qwen3.5-9B**, trained on high-quality reasoning data derived from **GLM-5.1**.
-
-    The primary goals are to:
-
-      - Improve **structured reasoning ability**
-      - Enhance **instruction-following consistency**
-      - Activate **latent knowledge via better reasoning structure**
-
-    ## 📊 Training Data
-
-    ### Main Dataset
-
-      - `Jackrong/GLM-5.1-Reasoning-1M-Cleaned`
-      - Cleaned from the original `Kassadin88/GLM-5.1-1000000x` dataset.
-      - Generated from a **GLM-5.1 teacher model**
-      - Approximately **700x** the scale of `Qwen3.5-reasoning-700x`
-      - Training used a **filtered subset**, not the full source dataset.
-
-    ### Auxiliary Dataset
-
-      - `Jackrong/Qwen3.5-reasoning-700x`
-
-    ...
+  description: "# \U0001FA90 Qwen3.5-9B-GLM5.1-Distill-v1\n\n## \U0001F4CC Model Overview\n\n**Model Name:** `Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1`\n**Base Model:** Qwen3.5-9B\n**Training Type:** Supervised Fine-Tuning (SFT, Distillation)\n**Parameter Scale:** 9B\n**Training Framework:** Unsloth\n\nThis model is a distilled variant of **Qwen3.5-9B**, trained on high-quality reasoning data derived from **GLM-5.1**.\n\nThe primary goals are to:\n\n  - Improve **structured reasoning ability**\n  - Enhance **instruction-following consistency**\n  - Activate **latent knowledge via better reasoning structure**\n\n## \U0001F4CA Training Data\n\n### Main Dataset\n\n  - `Jackrong/GLM-5.1-Reasoning-1M-Cleaned`\n  - Cleaned from the original `Kassadin88/GLM-5.1-1000000x` dataset.\n  - Generated from a **GLM-5.1 teacher model**\n  - Approximately **700x** the scale of `Qwen3.5-reasoning-700x`\n  - Training used a **filtered subset**, not the full source dataset.\n\n### Auxiliary Dataset\n\n  - `Jackrong/Qwen3.5-reasoning-700x`\n\n...\n"
  license: "apache-2.0"
  tags:
    - llm
@@ -127,26 +94,7 @@
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
  urls:
    - https://huggingface.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF
-  description: |
-    # 🔥 Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled
-
-    A reasoning SFT fine-tune of `Qwen/Qwen3.6-35B-A3B` on chain-of-thought (CoT) distillation mostly sourced from Claude Opus 4.6. The goal is to preserve Qwen3.6's strong agentic coding and reasoning base while nudging the model toward structured Claude Opus-style reasoning traces and more stable long-form problem solving.
-
-    The training path is text-only. The Qwen3.6 base architecture includes a vision encoder, but this fine-tuning run did not train on image or video examples.
-
-      - **Developed by:** @hesamation
-      - **Base model:** `Qwen/Qwen3.6-35B-A3B`
-      - **License:** apache-2.0
-
-    This fine-tuning run is inspired by Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, including the notebook/training workflow style and Claude Opus reasoning-distillation direction.
-
-    [](https://x.com/Hesamation) [](https://discord.gg/vtJykN3t)
-
-    ## Benchmark Results
-
-    The MMLU-Pro pass used 70 total questions per model: `--limit 5` across 14 MMLU-Pro subjects. Treat this as a smoke/comparative check, not a release-quality full benchmark.
-
-    ...
+  description: "# \U0001F525 Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled\n\nA reasoning SFT fine-tune of `Qwen/Qwen3.6-35B-A3B` on chain-of-thought (CoT) distillation mostly sourced from Claude Opus 4.6. The goal is to preserve Qwen3.6's strong agentic coding and reasoning base while nudging the model toward structured Claude Opus-style reasoning traces and more stable long-form problem solving.\n\nThe training path is text-only. The Qwen3.6 base architecture includes a vision encoder, but this fine-tuning run did not train on image or video examples.\n\n  - **Developed by:** @hesamation\n  - **Base model:** `Qwen/Qwen3.6-35B-A3B`\n  - **License:** apache-2.0\n\nThis fine-tuning run is inspired by Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, including the notebook/training workflow style and Claude Opus reasoning-distillation direction.\n\n[](https://x.com/Hesamation) [](https://discord.gg/vtJykN3t)\n\n## Benchmark Results\n\nThe MMLU-Pro pass used 70 total questions per model: `--limit 5` across 14 MMLU-Pro subjects. Treat this as a smoke/comparative check, not a release-quality full benchmark.\n\n...\n"
  license: "apache-2.0"
  tags:
    - llm
@@ -182,40 +130,7 @@
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
  urls:
    - https://huggingface.co/Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1-GGUF
-  description: |
-    # 🪐 Qwen3.5-9B-GLM5.1-Distill-v1
-
-    ## 📌 Model Overview
-
-    **Model Name:** `Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1`
-    **Base Model:** Qwen3.5-9B
-    **Training Type:** Supervised Fine-Tuning (SFT, Distillation)
-    **Parameter Scale:** 9B
-    **Training Framework:** Unsloth
-
-    This model is a distilled variant of **Qwen3.5-9B**, trained on high-quality reasoning data derived from **GLM-5.1**.
-
-    The primary goals are to:
-
-      - Improve **structured reasoning ability**
-      - Enhance **instruction-following consistency**
-      - Activate **latent knowledge via better reasoning structure**
-
-    ## 📊 Training Data
-
-    ### Main Dataset
-
-      - `Jackrong/GLM-5.1-Reasoning-1M-Cleaned`
-      - Cleaned from the original `Kassadin88/GLM-5.1-1000000x` dataset.
-      - Generated from a **GLM-5.1 teacher model**
-      - Approximately **700x** the scale of `Qwen3.5-reasoning-700x`
-      - Training used a **filtered subset**, not the full source dataset.
-
-    ### Auxiliary Dataset
-
-      - `Jackrong/Qwen3.5-reasoning-700x`
-
-    ...
+  description: "# \U0001FA90 Qwen3.5-9B-GLM5.1-Distill-v1\n\n## \U0001F4CC Model Overview\n\n**Model Name:** `Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1`\n**Base Model:** Qwen3.5-9B\n**Training Type:** Supervised Fine-Tuning (SFT, Distillation)\n**Parameter Scale:** 9B\n**Training Framework:** Unsloth\n\nThis model is a distilled variant of **Qwen3.5-9B**, trained on high-quality reasoning data derived from **GLM-5.1**.\n\nThe primary goals are to:\n\n  - Improve **structured reasoning ability**\n  - Enhance **instruction-following consistency**\n  - Activate **latent knowledge via better reasoning structure**\n\n## \U0001F4CA Training Data\n\n### Main Dataset\n\n  - `Jackrong/GLM-5.1-Reasoning-1M-Cleaned`\n  - Cleaned from the original `Kassadin88/GLM-5.1-1000000x` dataset.\n  - Generated from a **GLM-5.1 teacher model**\n  - Approximately **700x** the scale of `Qwen3.5-reasoning-700x`\n  - Training used a **filtered subset**, not the full source dataset.\n\n### Auxiliary Dataset\n\n  - `Jackrong/Qwen3.5-reasoning-700x`\n\n...\n"
  license: "apache-2.0"
  tags:
    - llm
@@ -1178,6 +1093,134 @@
      - transcript
    parameters:
      model: tiny
+- name: omnilingual-0.3b-ctc-q8-sherpa
+  license: apache-2.0
+  url: "github:mudler/LocalAI/gallery/sherpa-onnx-asr.yaml@master"
+  description: |
+    Omnilingual ASR CTC 300M (int8) is a multilingual automatic speech recognition model supporting 1,600+ languages. Based on Meta's omniASR_CTC_300M architecture (Wav2Vec2 with CTC head), quantized to int8 for efficient inference. Uses the sherpa-onnx backend with ONNX Runtime.
+  urls:
+    - https://huggingface.co/csukuangfj/sherpa-onnx-omnilingual-asr-1600-languages-300M-ctc-int8-2025-11-12
+    - https://k2-fsa.github.io/sherpa/onnx/omnilingual-asr/models.html
+  icon: https://avatars.githubusercontent.com/u/75781706
+  tags:
+    - stt
+    - speech-to-text
+    - asr
+    - audio-transcription
+    - multilingual
+    - omnilingual
+    - sherpa-onnx
+    - cpu
+    - gpu
+  overrides:
+    known_usecases:
+      - transcript
+    parameters:
+      model: omnilingual-asr/model.int8.onnx
+  files:
+    - filename: omnilingual-asr/model.int8.onnx
+      sha256: e7c4e54ee4c4c47829cc6667d5d00ed8ea7bef1dcfeef0fce766f77752a2726c
+      uri: https://huggingface.co/csukuangfj/sherpa-onnx-omnilingual-asr-1600-languages-300M-ctc-int8-2025-11-12/resolve/main/model.int8.onnx
+    - filename: omnilingual-asr/tokens.txt
+      sha256: a7a044c52cb29cbe8b0dc1953e92cefd4ca16b0ed968177b6beab21f9a7d0b31
+      uri: https://huggingface.co/csukuangfj/sherpa-onnx-omnilingual-asr-1600-languages-300M-ctc-int8-2025-11-12/resolve/main/tokens.txt
+- name: streaming-zipformer-en-sherpa
+  license: apache-2.0
+  url: "github:mudler/LocalAI/gallery/sherpa-onnx-asr.yaml@master"
+  description: |
+    Streaming English ASR: sherpa-onnx zipformer transducer (int8, chunk-16 left-128). Low-latency real-time transcription with endpoint detection via sherpa-onnx's online recognizer. English-only; for multilingual offline ASR see omnilingual-0.3b-ctc-q8-sherpa.
+  urls:
+    - https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26
+    - https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/zipformer-transducer-models.html
+  icon: https://avatars.githubusercontent.com/u/75781706
+  tags:
+    - stt
+    - speech-to-text
+    - asr
+    - audio-transcription
+    - streaming
+    - real-time
+    - english
+    - zipformer
+    - sherpa-onnx
+    - cpu
+    - gpu
+  overrides:
+    known_usecases:
+      - transcript
+    parameters:
+      model: streaming-zipformer-en/encoder.int8.onnx
+    options:
+      - subtype=online
+  files:
+    - filename: streaming-zipformer-en/encoder.int8.onnx
+      sha256: 563fde436d16cf7607cf408cd6b30909819d03162652ef389c2450ced3f45ac1
+      uri: https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/encoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx
+    - filename: streaming-zipformer-en/decoder.int8.onnx
+      sha256: 98da299f471e38bb4e1a8df579b8cc9122d6039576a77e357b3c60f17dd83b02
+      uri: https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/decoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx
+    - filename: streaming-zipformer-en/joiner.int8.onnx
+      sha256: d944208d660d67c8d72cd2acaeac971fa5ceb8c80e76c1968148846fedd6e297
+      uri: https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/joiner-epoch-99-avg-1-chunk-16-left-128.int8.onnx
+    - filename: streaming-zipformer-en/tokens.txt
+      sha256: 49e3c2646595fd907228b3c6787069658f67b17377c60aeb8619c4551b2316fb
+      uri: https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/tokens.txt
+- name: silero-vad-sherpa
+  license: mit
+  url: "github:mudler/LocalAI/gallery/sherpa-onnx-vad.yaml@master"
+  description: |
+    Silero VAD served through the sherpa-onnx backend. Uses the same ONNX weights as the dedicated silero-vad backend, loaded through sherpa-onnx's C VAD API. Pairs with the sherpa-onnx ASR entries for round-trip audio pipelines.
+  urls:
+    - https://github.com/snakers4/silero-vad
+    - https://huggingface.co/onnx-community/silero-vad
+  icon: https://github.com/snakers4/silero-models/raw/master/files/silero_logo.jpg
+  tags:
+    - vad
+    - voice-activity-detection
+    - sherpa-onnx
+    - cpu
+    - gpu
+  overrides:
+    known_usecases:
+      - vad
+    parameters:
+      model: silero-vad/silero-vad.onnx
+  files:
+    - filename: silero-vad/silero-vad.onnx
+      sha256: a4a068cd6cf1ea8355b84327595838ca748ec29a25bc91fc82e6c299ccdc5808
+      uri: https://huggingface.co/onnx-community/silero-vad/resolve/main/onnx/model.onnx
+- name: vits-ljs-sherpa
+  license: mit
+  url: "github:mudler/LocalAI/gallery/sherpa-onnx-tts.yaml@master"
+  description: |
+    VITS-LJS English single-speaker TTS served through the sherpa-onnx backend. Trained on the LJSpeech corpus at 22.05 kHz. Pairs with the sherpa-onnx ASR entries for round-trip audio pipelines.
+  urls:
+    - https://github.com/k2-fsa/sherpa-onnx
+    - https://huggingface.co/csukuangfj/vits-ljs
+  icon: https://avatars.githubusercontent.com/u/75781706
+  tags:
+    - tts
+    - text-to-speech
+    - english
+    - vits
+    - sherpa-onnx
+    - cpu
+    - gpu
+  overrides:
+    known_usecases:
+      - tts
+    parameters:
+      model: vits-ljs/vits-ljs.onnx
+  files:
+    - filename: vits-ljs/vits-ljs.onnx
+      sha256: 5bbd273797a9ecf8d94bd6ec02ad16cb41cbb85f055ad98d528ced3e44c9b31a
+      uri: https://huggingface.co/csukuangfj/vits-ljs/resolve/main/vits-ljs.onnx
+    - filename: vits-ljs/tokens.txt
+      sha256: 5fee2c6b238d712287f2ecb08f34a8a8b413bcb7390862ef6fb6fd6f0f8d3a17
+      uri: https://huggingface.co/csukuangfj/vits-ljs/resolve/main/tokens.txt
+    - filename: vits-ljs/lexicon.txt
+      sha256: bdccfc6da71c45c48e2e0056fcf0aab760577c5f959f6c1b5eb3e3e916fd5a0e
+      uri: https://huggingface.co/csukuangfj/vits-ljs/resolve/main/lexicon.txt
 - name: voxcpm-1.5
  license: apache-2.0
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
@@ -3845,7 +3888,7 @@
    cached in the models directory like any other managed model).
    NON-COMMERCIAL RESEARCH USE ONLY. For commercial use see `insightface-opencv`.
  tags: [face-recognition, face-verification, face-embedding, research-only, gpu, cpu]
-  urls: [https://github.com/deepinsight/insightface]
+  urls: ['https://github.com/deepinsight/insightface']
  overrides:
    backend: insightface
    parameters: {model: insightface-buffalo-l}
@@ -3876,7 +3919,7 @@
    cheaper detector — good balance on mid-range hardware.
    NON-COMMERCIAL RESEARCH USE ONLY.
  tags: [face-recognition, face-verification, face-embedding, research-only, gpu, cpu]
-  urls: [https://github.com/deepinsight/insightface]
+  urls: ['https://github.com/deepinsight/insightface']
  overrides:
    backend: insightface
    parameters: {model: insightface-buffalo-m}
@@ -3906,7 +3949,7 @@
    genderage, ~159MB). Good fit for mid-range CPU deployments.
    NON-COMMERCIAL RESEARCH USE ONLY.
  tags: [face-recognition, face-verification, face-embedding, research-only, edge, cpu]
-  urls: [https://github.com/deepinsight/insightface]
+  urls: ['https://github.com/deepinsight/insightface']
  overrides:
    backend: insightface
    parameters: {model: insightface-buffalo-s}
@@ -3938,7 +3981,7 @@
    only verification and embedding are needed.
    NON-COMMERCIAL RESEARCH USE ONLY.
  tags: [face-recognition, face-verification, face-embedding, research-only, edge, cpu]
-  urls: [https://github.com/deepinsight/insightface]
+  urls: ['https://github.com/deepinsight/insightface']
  overrides:
    backend: insightface
    parameters: {model: insightface-buffalo-sc}
@@ -3969,7 +4012,7 @@
    harder benchmarks; pays for it in GPU memory.
    NON-COMMERCIAL RESEARCH USE ONLY.
  tags: [face-recognition, face-verification, face-embedding, research-only, gpu]
-  urls: [https://github.com/deepinsight/insightface]
+  urls: ['https://github.com/deepinsight/insightface']
  overrides:
    backend: insightface
    parameters: {model: insightface-antelopev2}
@@ -4001,7 +4044,7 @@
    Weights are downloaded on install via LocalAI's gallery mechanism
    (~40MB).
  tags: [face-recognition, face-verification, face-embedding, commercial-ok, gpu, cpu]
-  urls: [https://github.com/opencv/opencv_zoo]
+  urls: ['https://github.com/opencv/opencv_zoo']
  overrides:
    backend: insightface
    parameters: {model: face_detection_yunet_2023mar.onnx}
@@ -4035,7 +4078,7 @@
    at comparable accuracy for face tasks. APACHE 2.0 — commercial-safe.
    Weights are downloaded on install via LocalAI's gallery mechanism.
  tags: [face-recognition, face-verification, face-embedding, commercial-ok, edge, cpu]
-  urls: [https://github.com/opencv/opencv_zoo]
+  urls: ['https://github.com/opencv/opencv_zoo']
  overrides:
    backend: insightface
    parameters: {model: face_detection_yunet_2023mar_int8.onnx}
@@ -15923,6 +15966,7 @@
      uri: "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors"
    - filename: "umt5-xxl-encoder-Q8_0.gguf"
      uri: "huggingface://city96/umt5-xxl-encoder-gguf/umt5-xxl-encoder-Q8_0.gguf"
+      sha256: 2521d4de0bf9e1cc6549866463ceae85e4ec3239bc6063f7488810be39033bbc
    - filename: "clip_vision_h.safetensors"
      uri: "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/clip_vision/clip_vision_h.safetensors"
 - name: sd-1.5-ggml
--- a/gallery/sherpa-onnx-asr.yaml
+++ b/gallery/sherpa-onnx-asr.yaml
@@ -0,0 +1,27 @@
+---
+name: "sherpa-onnx-asr"
+
+config_file: |
+  backend: sherpa-onnx
+  type: asr
+  options:
+    # Feature extraction. Most shipped sherpa-onnx ASR models expect
+    # 16 kHz / 80-dim log-mel; derivatives trained at other rates
+    # should override these.
+    - asr.sample_rate=16000
+    - asr.feature_dim=80
+    - asr.decoding_method=greedy_search
+    # Whisper-family defaults (ignored by non-whisper models).
+    - asr.whisper.task=transcribe
+    - asr.whisper.tail_paddings=-1
+    # SenseVoice-family: inverse text normalization is off in upstream
+    # sherpa but on here — we want formatted transcription output
+    # ("100" not "one hundred"). Set to 0 for raw tokens.
+    - asr.sense_voice.use_itn=1
+    # Online (streaming zipformer) ASR. Endpoint detection is upstream-
+    # off but on here — streaming consumers need segment boundaries.
+    - online.enable_endpoint=1
+    - online.rule1_min_trailing_silence=2.4
+    - online.rule2_min_trailing_silence=1.2
+    - online.rule3_min_utterance_length=20.0
+    - online.chunk_samples=1600
--- a/gallery/sherpa-onnx-tts.yaml
+++ b/gallery/sherpa-onnx-tts.yaml
@@ -0,0 +1,14 @@
+---
+name: "sherpa-onnx-tts"
+
+config_file: |
+  backend: sherpa-onnx
+  options:
+    # VITS inference knobs. Matches upstream sherpa-onnx defaults.
+    - tts.noise_scale=0.667
+    - tts.noise_scale_w=0.8
+    - tts.length_scale=1.0
+    - tts.max_num_sentences=1
+    # Speech rate multiplier. Applied at every TTS / TTSStream call
+    # since the TTSRequest proto has no speed field.
+    - tts.speed=1.0
--- a/gallery/sherpa-onnx-vad.yaml
+++ b/gallery/sherpa-onnx-vad.yaml
@@ -0,0 +1,17 @@
+---
+name: "sherpa-onnx-vad"
+
+config_file: |
+  backend: sherpa-onnx
+  type: vad
+  options:
+    # Silero VAD. Defaults mirror upstream sherpa-onnx. Override for
+    # faster turn-taking (lower min_silence) or different sample rate
+    # derivatives (8 kHz Silero variants).
+    - vad.threshold=0.5
+    - vad.min_silence=0.5
+    - vad.min_speech=0.25
+    - vad.window_size=512
+    - vad.max_speech=20.0
+    - vad.sample_rate=16000
+    - vad.buffer_size=60.0
--- a/pkg/downloader/uri.go
+++ b/pkg/downloader/uri.go
@@ -375,9 +375,20 @@ func (uri URI) DownloadFileWithContext(ctx context.Context, filePath, sha string
 	if uri.LooksLikeOCI() {

 		// Only Ollama wants to download to the file, for the rest, we want to download to the directory
-		// so we check if filepath has any extension, otherwise we assume it's a directory
-		if filepath.Ext(filePath) != "" && !strings.HasPrefix(url, OllamaPrefix) {
-			filePath = filepath.Dir(filePath)
+		// so we check if filepath has any extension, otherwise we assume it's a directory.
+		// Caveat: `filepath.Ext` treats any dot-suffix as an extension, so paths like
+		// `backends/local-store.upgrade-tmp` (the tmp dir created by gallery.UpgradeBackend)
+		// look like a "file" to this heuristic and get rewritten to their parent — which
+		// then unpacks the image at `backends/` top level and clobbers the real install
+		// with a flat-layout file. Guard against that by short-circuiting when the caller
+		// has already created the target as a directory: OCI destinations are always dirs
+		// in that case, regardless of what their suffix looks like.
+		if !strings.HasPrefix(url, OllamaPrefix) {
+			if fi, statErr := os.Stat(filePath); statErr == nil && fi.IsDir() {
+				// Existing directory — use as-is.
+			} else if filepath.Ext(filePath) != "" {
+				filePath = filepath.Dir(filePath)
+			}
 		}

 		progressStatus := func(desc ocispec.Descriptor) io.Writer {
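Review note: the guard above exists because `filepath.Ext` reports any dot-suffix in the last path element as an extension. The sketch below reproduces the guarded heuristic in isolation; `ociDestination` and `dottedDirKept` are hypothetical names, and the Ollama prefix check is left out for brevity.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ociDestination mirrors the guarded heuristic: an existing directory is
// used as-is; otherwise a path with a dot-suffix is assumed to be a file and
// is replaced by its parent directory.
func ociDestination(path string) string {
	if fi, err := os.Stat(path); err == nil && fi.IsDir() {
		return path
	}
	if filepath.Ext(path) != "" {
		return filepath.Dir(path)
	}
	return path
}

// dottedDirKept creates a directory named like the upgrade tmp dir and
// reports whether the heuristic leaves it untouched.
func dottedDirKept() (bool, error) {
	root, err := os.MkdirTemp("", "backends")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(root)

	tmp := filepath.Join(root, "local-store.upgrade-tmp")
	if err := os.Mkdir(tmp, 0o755); err != nil {
		return false, err
	}
	return ociDestination(tmp) == tmp, nil
}

func main() {
	// The suffix alone makes the directory look like a file.
	fmt.Println(filepath.Ext("backends/local-store.upgrade-tmp"))

	kept, err := dottedDirKept()
	if err != nil {
		panic(err)
	}
	fmt.Println("existing dotted directory kept:", kept)

	// A path that does not exist and has a suffix still falls back to its parent.
	fmt.Println(ociDestination("no-such-dir/model.bin"))
}
```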
--- a/pkg/utils/base64.go
+++ b/pkg/utils/base64.go
@@ -16,7 +16,13 @@ var base64DownloadClient http.Client = http.Client{
 	Timeout: 30 * time.Second,
 }

-var dataURIPattern = regexp.MustCompile(`^data:([^;]+);base64,`)
+// Match `data:<mime>[;param=value...];base64,` — browser-produced data URIs
+// often carry codec/charset params between the mime type and `;base64,`
+// (e.g. MediaRecorder's `data:audio/webm;codecs=opus;base64,...`). The old
+// `([^;]+)` form only tolerated exactly one segment, so anything with
+// extra params failed the strip and tripped the downstream base64 decoder
+// on the `data:` literal.
+var dataURIPattern = regexp.MustCompile(`^data:[^,]+?;base64,`)

 // GetContentURIAsBase64 checks if the string is an URL, if it's an URL downloads the content in memory encodes it in base64 and returns the base64 string, otherwise returns the string by stripping base64 data headers
 func GetContentURIAsBase64(s string) (string, error) {
--- a/pkg/utils/base64_test.go
+++ b/pkg/utils/base64_test.go
@@ -21,6 +21,15 @@ var _ = Describe("utils/base64 tests", func() {
 		Expect(err).To(BeNil())
 		Expect(b64).To(Equal("BAR"))
 	})
+	It("GetContentURIAsBase64 strips data URI prefixes with codec/charset params", func() {
+		// Browser MediaRecorder produces data URIs like
+		// `data:audio/webm;codecs=opus;base64,...` — the regex must accept
+		// any number of MIME parameters between the type and `;base64,`.
+		input := "data:audio/webm;codecs=opus;base64,PAYLOAD"
+		b64, err := GetContentURIAsBase64(input)
+		Expect(err).To(BeNil())
+		Expect(b64).To(Equal("PAYLOAD"))
+	})
 	It("GetImageURLAsBase64 returns an error for bogus data", func() {
 		input := "FOO"
 		b64, err := GetContentURIAsBase64(input)
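Review note: the difference between the old and new data-URI patterns can be checked in isolation. In the sketch below the two patterns are copied from the diff, while `stripPrefix` and its use of `ReplaceAllString` are assumptions about how the prefix gets removed, not the actual body of `GetContentURIAsBase64`.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Old form: exactly one segment between "data:" and ";base64,".
	oldPattern = regexp.MustCompile(`^data:([^;]+);base64,`)
	// New form: any run of non-comma characters, matched lazily, so extra
	// ";param=value" segments are accepted.
	newPattern = regexp.MustCompile(`^data:[^,]+?;base64,`)
)

// stripPrefix removes a matching data-URI header and returns the payload;
// input that does not match is returned unchanged.
func stripPrefix(re *regexp.Regexp, s string) string {
	return re.ReplaceAllString(s, "")
}

func main() {
	plain := "data:image/png;base64,AAAA"
	withParams := "data:audio/webm;codecs=opus;base64,PAYLOAD"

	fmt.Println(stripPrefix(oldPattern, plain))      // both patterns handle this
	fmt.Println(stripPrefix(newPattern, plain))      // both patterns handle this
	fmt.Println(stripPrefix(oldPattern, withParams)) // header left in place
	fmt.Println(stripPrefix(newPattern, withParams)) // header stripped
}
```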
--- a/pkg/utils/ffmpeg.go
+++ b/pkg/utils/ffmpeg.go
@@ -2,10 +2,13 @@ package utils

 import (
 	"fmt"
+	"io"
 	"os"
 	"os/exec"
 	"strings"

+	laudio "github.com/mudler/LocalAI/pkg/audio"
+
 	"github.com/go-audio/wav"
 )

@@ -16,24 +19,61 @@ func ffmpegCommand(args []string) (string, error) {
 	return string(out), err
 }

-// AudioToWav converts audio to wav for transcribe.
-// TODO: use https://github.com/mccoyst/ogg?
+// AudioToWav converts audio to wav for transcribe (16 kHz mono s16le).
+// WAV files already in the target format are passed through directly;
+// everything else is converted via ffmpeg.
+//
+// The pass-through uses a hardlink (with a Copy fallback for cross-fs
+// src/dst) rather than Rename — callers may invoke this twice against
+// the same fixture (e.g. once for AudioTranscription and once for
+// AudioTranscriptionStream) and expect the original file to remain.
 func AudioToWav(src, dst string) error {
-	if strings.HasSuffix(src, ".wav") {
-		f, err := os.Open(src)
-		if err != nil {
-			return fmt.Errorf("open: %w", err)
-		}
-
-		dec := wav.NewDecoder(f)
-		dec.ReadInfo()
-		f.Close()
-
-		if dec.BitDepth == 16 && dec.NumChans == 1 && dec.SampleRate == 16000 {
-			os.Rename(src, dst)
-			return nil
-		}
+	if strings.HasSuffix(src, ".wav") && isTargetWav(src) {
+		return passthroughWAV(src, dst)
 	}
+	return convertWithFFmpeg(src, dst)
+}
+
+func passthroughWAV(src, dst string) error {
+	if err := os.Link(src, dst); err == nil {
+		return nil
+	}
+	// Fallback: copy. Hardlink fails across filesystems (e.g. src on a
+	// read-only mount, dst in /tmp) or when dst already exists. Caveat: if dst
+	// is already a hardlink to src, os.Create below truncates src as well.
+	in, err := os.Open(src)
+	if err != nil {
+		return fmt.Errorf("open src: %w", err)
+	}
+	defer in.Close()
+	out, err := os.Create(dst)
+	if err != nil {
+		return fmt.Errorf("create dst: %w", err)
+	}
+	defer out.Close()
+	if _, err := io.Copy(out, in); err != nil {
+		return fmt.Errorf("copy: %w", err)
+	}
+	return nil
+}
+
+// isTargetWav returns true when src is a valid WAV already in the
+// target format (16 kHz, mono, 16-bit PCM).
+func isTargetWav(src string) bool {
+	f, err := os.Open(src)
+	if err != nil {
+		return false
+	}
+	defer f.Close()
+
+	dec := wav.NewDecoder(f)
+	if !dec.IsValidFile() {
+		return false
+	}
+	return dec.BitDepth == 16 && dec.NumChans == 1 && dec.SampleRate == 16000
+}
+
+func convertWithFFmpeg(src, dst string) error {
 	commandArgs := []string{"-i", src, "-format", "s16le", "-ar", "16000", "-ac", "1", "-acodec", "pcm_s16le", dst}
 	out, err := ffmpegCommand(commandArgs)
 	if err != nil {
@@ -85,3 +125,18 @@ func AudioConvert(src string, format string) (string, error) {
 	}
 	return dst, nil
 }
+
+// WriteWav16kFromReader reads all PCM data from r and writes a 16 kHz mono
+// 16-bit WAV to w. Useful when the PCM length is not known in advance.
+func WriteWav16kFromReader(w io.Writer, r io.Reader) error {
+	pcm, err := io.ReadAll(r)
+	if err != nil {
+		return fmt.Errorf("read pcm: %w", err)
+	}
+	hdr := laudio.NewWAVHeader(uint32(len(pcm)))
+	if err := hdr.Write(w); err != nil {
+		return fmt.Errorf("write wav header: %w", err)
+	}
+	_, err = w.Write(pcm)
+	return err
+}
--- a/pkg/utils/ffmpeg_test.go
+++ b/pkg/utils/ffmpeg_test.go
@@ -0,0 +1,150 @@
+package utils
+
+import (
+	"encoding/binary"
+	"os"
+	"path/filepath"
+	"testing"
+
+	laudio "github.com/mudler/LocalAI/pkg/audio"
+)
+
+// generateTestWav creates a 16-bit PCM WAV file at the given sample rate and
+// channel count, filled with a repeating ramp (values wrap on int16 overflow).
+func generateTestWav(t *testing.T, path string, sampleRate uint32, numChannels uint16, numSamples int) {
+	t.Helper()
+	f, err := os.Create(path)
+	if err != nil {
+		t.Fatal(err)
+	}
+	defer f.Close()
+
+	bitsPerSample := uint16(16)
+	blockAlign := numChannels * (bitsPerSample / 8)
+	byteRate := sampleRate * uint32(blockAlign)
+	totalSamples := numSamples * int(numChannels)
+	dataSize := uint32(totalSamples) * uint32(bitsPerSample/8)
+
+	hdr := laudio.WAVHeader{
+		ChunkID:       [4]byte{'R', 'I', 'F', 'F'},
+		ChunkSize:     36 + dataSize,
+		Format:        [4]byte{'W', 'A', 'V', 'E'},
+		Subchunk1ID:   [4]byte{'f', 'm', 't', ' '},
+		Subchunk1Size: 16,
+		AudioFormat:   1,
+		NumChannels:   numChannels,
+		SampleRate:    sampleRate,
+		ByteRate:      byteRate,
+		BlockAlign:    blockAlign,
+		BitsPerSample: bitsPerSample,
+		Subchunk2ID:   [4]byte{'d', 'a', 't', 'a'},
+		Subchunk2Size: dataSize,
+	}
+	if err := binary.Write(f, binary.LittleEndian, &hdr); err != nil {
+		t.Fatal(err)
+	}
+
+	for i := 0; i < totalSamples; i++ {
+		sample := int16(1000 * (i % 100))
+		if err := binary.Write(f, binary.LittleEndian, sample); err != nil {
+			t.Fatal(err)
+		}
+	}
+}
+
+func TestAudioToWav_AlreadyCorrectFormat(t *testing.T) {
+	dir := t.TempDir()
+	src := filepath.Join(dir, "input.wav")
+	dst := filepath.Join(dir, "output.wav")
+
+	generateTestWav(t, src, 16000, 1, 1600)
+
+	if err := AudioToWav(src, dst); err != nil {
+		t.Fatalf("AudioToWav failed: %v", err)
+	}
+
+	info, err := os.Stat(dst)
+	if err != nil {
+		t.Fatalf("output not found: %v", err)
+	}
+	if info.Size() == 0 {
+		t.Fatal("output file is empty")
+	}
+}
+
+func TestAudioToWav_ResampleFrom22050(t *testing.T) {
+	dir := t.TempDir()
+	src := filepath.Join(dir, "input.wav")
+	dst := filepath.Join(dir, "output.wav")
+
+	generateTestWav(t, src, 22050, 1, 22050)
+
+	if err := AudioToWav(src, dst); err != nil {
+		t.Fatalf("AudioToWav failed: %v", err)
+	}
+
+	info, err := os.Stat(dst)
+	if err != nil {
+		t.Fatalf("output not found: %v", err)
+	}
+	if info.Size() == 0 {
+		t.Fatal("output file is empty")
+	}
+
+	verifyWavFormat(t, dst, 16000, 1)
+}
+
+func TestAudioToWav_StereoDownmix(t *testing.T) {
+	dir := t.TempDir()
+	src := filepath.Join(dir, "input.wav")
+	dst := filepath.Join(dir, "output.wav")
+
+	generateTestWav(t, src, 16000, 2, 1600)
+
+	if err := AudioToWav(src, dst); err != nil {
+		t.Fatalf("AudioToWav failed: %v", err)
+	}
+
+	verifyWavFormat(t, dst, 16000, 1)
+}
+
+func TestAudioToWav_StereoAndResample(t *testing.T) {
+	dir := t.TempDir()
+	src := filepath.Join(dir, "input.wav")
+	dst := filepath.Join(dir, "output.wav")
+
+	generateTestWav(t, src, 44100, 2, 44100)
+
+	if err := AudioToWav(src, dst); err != nil {
+		t.Fatalf("AudioToWav failed: %v", err)
+	}
+
+	verifyWavFormat(t, dst, 16000, 1)
+}
+
+func verifyWavFormat(t *testing.T, path string, expectedRate uint32, expectedChannels uint16) {
+	t.Helper()
+	f, err := os.Open(path)
+	if err != nil {
+		t.Fatalf("open: %v", err)
+	}
+	defer f.Close()
+
+	var hdr laudio.WAVHeader
+	if err := binary.Read(f, binary.LittleEndian, &hdr); err != nil {
+		t.Fatalf("read header: %v", err)
+	}
+
+	if hdr.SampleRate != expectedRate {
+		t.Errorf("sample rate = %d, want %d", hdr.SampleRate, expectedRate)
+	}
+	if hdr.NumChannels != expectedChannels {
+		t.Errorf("channels = %d, want %d", hdr.NumChannels, expectedChannels)
+	}
+	if hdr.BitsPerSample != 16 {
+		t.Errorf("bit depth = %d, want 16", hdr.BitsPerSample)
+	}
+	if hdr.AudioFormat != 1 {
+		t.Errorf("audio format = %d, want 1 (PCM)", hdr.AudioFormat)
+	}
+}
--- a/scripts/changed-backends.js
+++ b/scripts/changed-backends.js
@@ -32,6 +32,12 @@ function inferBackendPath(item) {
    // via a thin wrapper Makefile. Changes to either dir should retrigger it.
    return `backend/cpp/turboquant/`;
  }
+  if (item.dockerfile.endsWith("buun-llama-cpp")) {
+    // buun-llama-cpp is a fork-of-a-fork (spiritbuun/buun-llama-cpp forks
+    // TheTom/llama-cpp-turboquant) that reuses backend/cpp/llama-cpp sources
+    // the same way turboquant does. Changes to either dir retrigger it.
+    return `backend/cpp/buun-llama-cpp/`;
+  }
  if (item.dockerfile.endsWith("llama-cpp")) {
    return `backend/cpp/llama-cpp/`;
  }
@@ -138,9 +144,10 @@ async function getChangedFiles() {
  // Per-backend boolean outputs
  for (const [backend, pathPrefix] of allBackendPaths) {
    let changed = changedFiles.some(file => file.startsWith(pathPrefix));
-    // turboquant reuses backend/cpp/llama-cpp sources via a thin wrapper;
-    // changes to either directory should retrigger its pipeline.
-    if (backend === "turboquant" && !changed) {
+    // turboquant and buun-llama-cpp reuse backend/cpp/llama-cpp sources via
+    // thin wrapper Makefiles; changes to that directory should retrigger
+    // their pipelines too.
+    if ((backend === "turboquant" || backend === "buun-llama-cpp") && !changed) {
      changed = changedFiles.some(file => file.startsWith("backend/cpp/llama-cpp/"));
    }
    fs.appendFileSync(process.env.GITHUB_OUTPUT, `${backend}=${changed ? 'true' : 'false'}\n`);
--- a/swagger/docs.go
+++ b/swagger/docs.go
@@ -3442,6 +3442,7 @@ const docTemplate = `{
                    }
                },
                "is_real": {
+                    "description": "Liveness fields — see FaceVerifyResponse for why these are pointers.",
                    "type": "boolean"
                },
                "race": {
@@ -3656,12 +3657,25 @@ const docTemplate = `{
                "distance": {
                    "type": "number"
                },
+                "img1_antispoof_score": {
+                    "type": "number"
+                },
                "img1_area": {
                    "$ref": "#/definitions/schema.FacialArea"
                },
+                "img1_is_real": {
+                    "description": "Liveness fields are only populated when the request set\nanti_spoofing=true. Pointers keep them fully absent from the\nJSON response otherwise, so callers can tell \"not checked\"\napart from \"checked and fake\" (which would collapse to zero\nvalues with plain bool+omitempty).",
+                    "type": "boolean"
+                },
+                "img2_antispoof_score": {
+                    "type": "number"
+                },
                "img2_area": {
                    "$ref": "#/definitions/schema.FacialArea"
                },
+                "img2_is_real": {
+                    "type": "boolean"
+                },
                "model": {
                    "type": "string"
                },
--- a/swagger/swagger.json
+++ b/swagger/swagger.json
@@ -3439,6 +3439,7 @@
                    }
                },
                "is_real": {
+                    "description": "Liveness fields — see FaceVerifyResponse for why these are pointers.",
                    "type": "boolean"
                },
                "race": {
@@ -3653,12 +3654,25 @@
                "distance": {
                    "type": "number"
                },
+                "img1_antispoof_score": {
+                    "type": "number"
+                },
                "img1_area": {
                    "$ref": "#/definitions/schema.FacialArea"
                },
+                "img1_is_real": {
+                    "description": "Liveness fields are only populated when the request set\nanti_spoofing=true. Pointers keep them fully absent from the\nJSON response otherwise, so callers can tell \"not checked\"\napart from \"checked and fake\" (which would collapse to zero\nvalues with plain bool+omitempty).",
+                    "type": "boolean"
+                },
+                "img2_antispoof_score": {
+                    "type": "number"
+                },
                "img2_area": {
                    "$ref": "#/definitions/schema.FacialArea"
                },
+                "img2_is_real": {
+                    "type": "boolean"
+                },
                "model": {
                    "type": "string"
                },
--- a/swagger/swagger.yaml
+++ b/swagger/swagger.yaml
@@ -640,6 +640,7 @@ definitions:
          type: number
        type: object
      is_real:
+        description: Liveness fields — see FaceVerifyResponse for why these are pointers.
        type: boolean
      race:
        additionalProperties:
@@ -780,10 +781,24 @@ definitions:
        type: number
      distance:
        type: number
+      img1_antispoof_score:
+        type: number
      img1_area:
        $ref: '#/definitions/schema.FacialArea'
+      img1_is_real:
+        description: |-
+          Liveness fields are only populated when the request set
+          anti_spoofing=true. Pointers keep them fully absent from the
+          JSON response otherwise, so callers can tell "not checked"
+          apart from "checked and fake" (which would collapse to zero
+          values with plain bool+omitempty).
+        type: boolean
+      img2_antispoof_score:
+        type: number
      img2_area:
        $ref: '#/definitions/schema.FacialArea'
+      img2_is_real:
+        type: boolean
      model:
        type: string
      processing_time_ms:
--- a/tests/e2e-backends/backend_test.go
+++ b/tests/e2e-backends/backend_test.go
@@ -40,6 +40,12 @@ import (
 //	                         to download alongside the main model — required for
 //	                         multimodal models like Qwen3-ASR-0.6B-GGUF.
 //	BACKEND_TEST_MMPROJ_FILE Path to an already-available mmproj file.
+//	BACKEND_TEST_EXTRA_FILES Pipe-separated list of companion files to download
+//	                         next to the main model. Each entry is "<url>" or
+//	                         "<url>#<local-name>" (the optional suffix renames
+//	                         the file on disk — useful for sherpa-onnx models
+//	                         whose loader expects specific names like
+//	                         encoder.int8.onnx).
 //	BACKEND_TEST_AUDIO_URL   HTTP(S) URL of a sample audio file used by the
 //	                         transcription specs.
 //	BACKEND_TEST_AUDIO_FILE  Path to an already-available sample audio file.
@@ -71,6 +77,9 @@ import (
 //	                         (default: "What's the weather like in Paris, France?").
 //	BACKEND_TEST_TOOL_NAME   Override the function name expected in the tool call
 //	                         (default: "get_weather").
+//	BACKEND_TEST_TTS_TEXT    Override the text synthesized by the tts/ttsstream
+//	                         specs (default: "The quick brown fox jumps over the
+//	                         lazy dog.").
 //
 // The suite is intentionally model-format-agnostic: it only ever passes the
 // file path to LoadModel, so GGUF, ONNX, safetensors, .bin etc. all work so
@@ -83,6 +92,7 @@ const (
 	capEmbeddings    = "embeddings"
 	capTools         = "tools"
 	capTranscription = "transcription"
+	capTTS           = "tts"
 	capImage         = "image"
 	capFaceDetect    = "face_detect"
 	capFaceEmbed     = "face_embed"
@@ -100,6 +110,7 @@ const (
 	defaultImagePrompt        = "a photograph of an astronaut riding a horse"
 	defaultImageSteps         = 4
 	defaultVerifyDistanceCeil = float32(0.6) // upper bound for same-person; SFace runs closer to 0.5 ArcFace to 0.35.
+	defaultTTSText            = "The quick brown fox jumps over the lazy dog."
 )

 func defaultCaps() map[string]bool {
@@ -111,6 +122,17 @@ func defaultCaps() map[string]bool {
 	}
 }

+// splitURLAndName parses a "<url>#<local-name>" entry. The #name suffix is
+// optional — if absent, defaultName is returned. Used by the main-model
+// and extras download paths so a test can rename downloaded files to the
+// shape the backend's loader expects.
+func splitURLAndName(entry, defaultName string) (url, name string) {
+	if hash := strings.Index(entry, "#"); hash >= 0 {
+		return entry[:hash], entry[hash+1:]
+	}
+	return entry, defaultName
+}
+
 // parseCaps reads BACKEND_TEST_CAPS and returns the enabled capability set.
 // An empty/unset value falls back to defaultCaps().
 func parseCaps() map[string]bool {
@@ -205,9 +227,33 @@ var _ = Describe("Backend container", Ordered, func() {
 		Expect(filepath.Join(binaryDir, "run.sh")).To(BeAnExistingFile())

 		// Download the model once if not provided and no HF name given.
+		// BACKEND_TEST_MODEL_URL accepts an optional "#<local-name>" suffix
+		// for cases where the backend expects the model file to have a
+		// specific name (e.g. sherpa-onnx's online recognizer finds
+		// encoder/decoder/joiner by filename substring).
 		if modelFile == "" && modelName == "" {
-			modelFile = filepath.Join(workDir, "model.bin")
-			downloadFile(modelURL, modelFile)
+			url, name := splitURLAndName(modelURL, "model.bin")
+			modelFile = filepath.Join(workDir, name)
+			downloadFile(url, modelFile)
+		}
+
+		// Multi-file models (sherpa-onnx streaming zipformer, sherpa-onnx
+		// Omnilingual, any split encoder/decoder/joiner bundle) need
+		// companion files next to the main model. BACKEND_TEST_EXTRA_FILES
+		// is a pipe-separated list of "<url>[#<local-name>]" entries; each
+		// is downloaded into the same directory as modelFile. The optional
+		// <local-name> renames the saved file (useful when upstream URLs
+		// have stamp/version suffixes the loader doesn't recognise).
+		if extraSpec := strings.TrimSpace(os.Getenv("BACKEND_TEST_EXTRA_FILES")); extraSpec != "" && modelFile != "" {
+			modelDir := filepath.Dir(modelFile)
+			for _, entry := range strings.Split(extraSpec, "|") {
+				entry = strings.TrimSpace(entry)
+				if entry == "" {
+					continue
+				}
+				url, name := splitURLAndName(entry, filepath.Base(entry))
+				downloadFile(url, filepath.Join(modelDir, name))
+			}
 		}

 		// Multimodal projector (mmproj): required by audio/vision-capable
@@ -869,6 +915,62 @@ var _ = Describe("Backend container", Ordered, func() {
 		}
 		GinkgoWriter.Printf("voice_analyze: %d segments\n", len(res.GetSegments()))
 	})
+
+	It("synthesizes speech via TTS", func() {
+		if !caps[capTTS] {
+			Skip("tts capability not enabled")
+		}
+		text := os.Getenv("BACKEND_TEST_TTS_TEXT")
+		if text == "" {
+			text = defaultTTSText
+		}
+		dst := filepath.Join(workDir, "tts.wav")
+
+		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
+		defer cancel()
+		_, err := client.TTS(ctx, &pb.TTSRequest{Text: text, Dst: dst})
+		Expect(err).NotTo(HaveOccurred())
+
+		info, err := os.Stat(dst)
+		Expect(err).NotTo(HaveOccurred(), "TTS did not write a file at %s", dst)
+		Expect(info.Size()).To(BeNumerically(">", int64(1024)),
+			"TTS output too small: %d bytes", info.Size())
+		GinkgoWriter.Printf("TTS: wrote %s (%d bytes)\n", dst, info.Size())
+	})
+
+	It("streams PCM via TTSStream", func() {
+		if !caps[capTTS] {
+			Skip("tts capability not enabled")
+		}
+		text := os.Getenv("BACKEND_TEST_TTS_TEXT")
+		if text == "" {
+			text = defaultTTSText
+		}
+
+		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
+		defer cancel()
+		stream, err := client.TTSStream(ctx, &pb.TTSRequest{Text: text})
+		Expect(err).NotTo(HaveOccurred())
+
+		var chunks, totalBytes int
+		for {
+			reply, err := stream.Recv()
+			if err == io.EOF {
+				break
+			}
+			Expect(err).NotTo(HaveOccurred())
+			if audio := reply.GetAudio(); len(audio) > 0 {
+				chunks++
+				totalBytes += len(audio)
+			}
+		}
+		// Header + at least one PCM chunk proves real streaming (not emit-once).
+		Expect(chunks).To(BeNumerically(">=", 2),
+			"expected >=2 chunks (header + PCM), got %d (bytes=%d)", chunks, totalBytes)
+		Expect(totalBytes).To(BeNumerically(">", 1024),
+			"streamed audio too short: %d bytes", totalBytes)
+		GinkgoWriter.Printf("TTSStream: %d chunks, %d bytes\n", chunks, totalBytes)
+	})
 })

 // extractImage runs `docker create` + `docker export` to materialise the image
@@ -901,9 +1003,17 @@ func extractImage(image, dest string) {

 // downloadFile fetches url into dest using curl -L. Used for CI convenience;
 // local runs can use BACKEND_TEST_MODEL_FILE to skip downloading.
+// Retry flags guard against transient CI network hiccups (github.com in
+// particular has been flaky from GHA runners, timing out TCP connects).
 func downloadFile(url, dest string) {
 	GinkgoHelper()
-	cmd := exec.Command("curl", "-sSfL", "-o", dest, url)
+	cmd := exec.Command("curl", "-sSfL",
+		"--connect-timeout", "30",
+		"--max-time", "600",
+		"--retry", "5",
+		"--retry-delay", "5",
+		"--retry-all-errors",
+		"-o", dest, url)
 	cmd.Stdout = GinkgoWriter
 	cmd.Stderr = GinkgoWriter
 	Expect(cmd.Run()).To(Succeed(), "failed to download %s", url)
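Review note on `BACKEND_TEST_EXTRA_FILES`: the pipe-separated `<url>[#<local-name>]` format can be illustrated outside Ginkgo. This is a minimal sketch: `splitURLAndName` is copied from the patch, while `parseExtras`, the `extraFile` type, and the `example.com` URLs are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// extraFile is a hypothetical pair of download URL and on-disk file name.
type extraFile struct {
	URL  string
	Name string
}

// splitURLAndName mirrors the helper added in backend_test.go.
func splitURLAndName(entry, defaultName string) (url, name string) {
	if hash := strings.Index(entry, "#"); hash >= 0 {
		return entry[:hash], entry[hash+1:]
	}
	return entry, defaultName
}

// parseExtras splits a pipe-separated spec, skipping empty entries and
// defaulting the local name to the last path element of the URL.
func parseExtras(spec string) []extraFile {
	var out []extraFile
	for _, entry := range strings.Split(spec, "|") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue
		}
		url, name := splitURLAndName(entry, filepath.Base(entry))
		out = append(out, extraFile{URL: url, Name: name})
	}
	return out
}

func main() {
	spec := "https://example.com/m/encoder-v2.int8.onnx#encoder.int8.onnx | https://example.com/m/tokens.txt"
	for _, f := range parseExtras(spec) {
		fmt.Printf("%s -> %s\n", f.URL, f.Name)
	}
}
```

Because the split happens at the first `#`, a URL that itself carries a fragment cannot be expressed in this format; that looks acceptable for model download links.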
--- a/tests/e2e/e2e_suite_test.go
+++ b/tests/e2e/e2e_suite_test.go
@@ -212,6 +212,9 @@ var _ = BeforeSuite(func() {

 	// Import model configs from an external directory (e.g. real model YAMLs
 	// and weights mounted into a container). Symlinks avoid copying large files.
+	// Both files and directories are symlinked — multi-file backends like
+	// sherpa-onnx TTS expect their tokens.txt / lexicon.txt sidecars in the
+	// same directory as the .onnx, so we need whole-directory imports.
 	if rtModels := os.Getenv("REALTIME_MODELS_PATH"); rtModels != "" {
 		entries, err := os.ReadDir(rtModels)
 		Expect(err).ToNot(HaveOccurred())
@@ -221,9 +224,6 @@ var _ = BeforeSuite(func() {
 			if _, err := os.Stat(dst); err == nil {
 				continue // don't overwrite mock configs
 			}
-			if entry.IsDir() {
-				continue
-			}
 			Expect(os.Symlink(src, dst)).To(Succeed())
 		}
 	}
--- a/tests/e2e/realtime_ws_test.go
+++ b/tests/e2e/realtime_ws_test.go
@@ -1,15 +1,21 @@
 package e2e_test

 import (
+	"bytes"
 	"encoding/base64"
 	"encoding/json"
 	"fmt"
+	"io"
 	"math"
+	"net/http"
 	"net/url"
 	"os"
+	"strings"
 	"time"

 	"github.com/gorilla/websocket"
+	laudio "github.com/mudler/LocalAI/pkg/audio"
+	"github.com/mudler/LocalAI/pkg/sound"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
 )
@@ -72,6 +78,66 @@ func generatePCMBase64(freq float64, sampleRate, durationMs int) string {
 	return base64.StdEncoding.EncodeToString(pcm)
 }

+// padPCMBase64 prepends and appends the given milliseconds of silence to a
+// base64-encoded 16-bit LE PCM buffer. Used to give VAD a clear lead-in /
+// lead-out so Silero can reliably detect utterance boundaries.
+func padPCMBase64(pcmB64 string, sampleRate, leadingMs, trailingMs int) string {
+	raw, err := base64.StdEncoding.DecodeString(pcmB64)
+	ExpectWithOffset(1, err).ToNot(HaveOccurred())
+	lead := make([]byte, sampleRate*leadingMs/1000*2)
+	trail := make([]byte, sampleRate*trailingMs/1000*2)
+	padded := make([]byte, 0, len(lead)+len(raw)+len(trail))
+	padded = append(padded, lead...)
+	padded = append(padded, raw...)
+	padded = append(padded, trail...)
+	return base64.StdEncoding.EncodeToString(padded)
+}
+
+// ttsPCMBase64 drives the /v1/audio/speech endpoint to render `text` through
+// the given TTS model, strips the returned WAV header, resamples to the
+// requested sample rate if needed, and returns base64-encoded 16-bit LE PCM.
+// Fails the test on any transport / format error — there's no useful fallback.
+func ttsPCMBase64(model, text string, targetSampleRate int) string {
+	body, err := json.Marshal(map[string]any{
+		"model":  model,
+		"input":  text,
+		"format": "wav",
+	})
+	ExpectWithOffset(1, err).ToNot(HaveOccurred())
+
+	req, err := http.NewRequest(http.MethodPost, apiURL+"/audio/speech", bytes.NewReader(body))
+	ExpectWithOffset(1, err).ToNot(HaveOccurred())
+	req.Header.Set("Content-Type", "application/json")
+
+	resp, err := http.DefaultClient.Do(req)
+	ExpectWithOffset(1, err).ToNot(HaveOccurred())
+	defer resp.Body.Close()
+
+	wav, err := io.ReadAll(resp.Body)
+	ExpectWithOffset(1, err).ToNot(HaveOccurred())
+	ExpectWithOffset(1, resp.StatusCode).To(Equal(http.StatusOK),
+		"tts returned %d: %s", resp.StatusCode, string(wav))
+
+	pcm, srcRate := laudio.ParseWAV(wav)
+	ExpectWithOffset(1, srcRate).To(BeNumerically(">", 0),
+		"tts response is not a valid WAV (body=%d bytes)", len(wav))
+
+	if srcRate != targetSampleRate {
+		samples := sound.BytesToInt16sLE(pcm)
+		pcm = sound.Int16toBytesLE(sound.ResampleInt16(samples, srcRate, targetSampleRate))
+	}
+	return base64.StdEncoding.EncodeToString(pcm)
+}
+
+// isRealTTS returns true when REALTIME_TTS names a real backend-backed model,
+// as opposed to the default mock-tts. Used to gate test behavior that only
+// makes sense with a real TTS — e.g. driving the session with a real
+// utterance and asserting the transcription contains recognisable words.
+func isRealTTS() bool {
+	m := os.Getenv("REALTIME_TTS")
+	return m != "" && m != "mock-tts"
+}
+
 // pipelineModel returns the model name to use for realtime tests.
 func pipelineModel() string {
 	if m := os.Getenv("REALTIME_TEST_MODEL"); m != "" {
@@ -139,8 +205,19 @@ var _ = Describe("Realtime WebSocket API", Label("Realtime"), func() {
 			sendClientEvent(conn, disableVADEvent())
 			drainUntil(conn, "session.updated", 10*time.Second)

-			// Append 1 second of 440Hz sine wave at 24kHz (the default remote sample rate)
-			audio := generatePCMBase64(440, 24000, 1000)
+			// Real TTS: synthesise an utterance the ASR should be able to
+			// recognise, and pad with silence so Silero-VAD has a clear
+			// lead-in/out. Fallback: 1s of 440Hz sine wave — the mock
+			// transcriber returns a static string anyway, so this only
+			// needs to exercise the pipeline plumbing.
+			const inputText = "The quick brown fox jumps over the lazy dog."
+			var audio string
+			if isRealTTS() {
+				audio = ttsPCMBase64(os.Getenv("REALTIME_TTS"), inputText, 24000)
+				audio = padPCMBase64(audio, 24000, 500, 500)
+			} else {
+				audio = generatePCMBase64(440, 24000, 1000)
+			}
 			sendClientEvent(conn, map[string]any{
 				"type":  "input_audio_buffer.append",
 				"audio": audio,
@@ -161,9 +238,30 @@ var _ = Describe("Realtime WebSocket API", Label("Realtime"), func() {
 			committed := drainUntil(conn, "input_audio_buffer.committed", 30*time.Second)
 			Expect(committed).ToNot(BeNil())

-			// Wait for the full response cycle to complete
-			done := drainUntil(conn, "response.done", 60*time.Second)
-			Expect(done).ToNot(BeNil())
+			// Drain the response cycle, capturing the input transcription
+			// event as it arrives so we can sanity-check it alongside the
+			// real-TTS path.
+			var transcript string
+			done := false
+			deadline := time.Now().Add(90 * time.Second)
+			for !done && time.Now().Before(deadline) {
+				evt := readServerEvent(conn, time.Until(deadline))
+				if evt["type"] == "conversation.item.input_audio_transcription.completed" {
+					if t, ok := evt["transcript"].(string); ok {
+						transcript = t
+					}
+				}
+				done = evt["type"] == "response.done"
+			}
+			// Fail loudly if the deadline passed without a response.done event.
+			Expect(done).To(BeTrue(), "timed out waiting for response.done")
+
+			if isRealTTS() {
+				lower := strings.ToLower(transcript)
+				matched := strings.Contains(lower, "fox") || strings.Contains(lower, "dog")
+				Expect(matched).To(BeTrue(),
+					"expected real-TTS transcript to contain 'fox' or 'dog' (got %q)", transcript)
+			}
 		})
 	})

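Review note on `padPCMBase64`: the silence padding is plain byte arithmetic (sample rate × milliseconds / 1000 × 2 bytes per 16-bit sample). The sketch below works on raw bytes and leaves out the base64 step; `padPCM` is a hypothetical name and not part of the patch.

```go
package main

import "fmt"

// padPCM surrounds 16-bit mono PCM with silence. Zero bytes decode to
// zero-valued samples, so a zeroed slice is valid silence.
func padPCM(raw []byte, sampleRate, leadingMs, trailingMs int) []byte {
	const bytesPerSample = 2
	lead := sampleRate * leadingMs / 1000 * bytesPerSample
	trail := sampleRate * trailingMs / 1000 * bytesPerSample
	// Start with `lead` zero bytes and reserve room for the rest.
	padded := make([]byte, lead, lead+len(raw)+trail)
	padded = append(padded, raw...)
	return append(padded, make([]byte, trail)...)
}

func main() {
	raw := make([]byte, 48000) // 1 s of audio at 24 kHz, 2 bytes per sample
	out := padPCM(raw, 24000, 500, 500)
	fmt.Println(len(out)) // prints 96000: 24000 lead + 48000 audio + 24000 trail
}
```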
--- a/tests/e2e/run-realtime-sherpa.sh
+++ b/tests/e2e/run-realtime-sherpa.sh
@@ -0,0 +1,136 @@
+#!/bin/bash
+# Drives tests/e2e/realtime_ws_test.go against a realtime pipeline where
+# VAD, STT and TTS are served by the sherpa-onnx Docker backend, and the
+# LLM slot stays mocked by the in-repo mock-backend. Pre-requisites:
+#   - `make build-mock-backend` has produced tests/e2e/mock-backend/mock-backend
+#   - `make docker-build-sherpa-onnx` has produced local-ai-backend:sherpa-onnx
+#   - `make protogen-go` is up-to-date
+# Environment overrides:
+#   WORK_DIR   Where to stage the extracted backend + model files.
+#              Defaults to a mktemp'd directory (cleaned on exit).
+#   KEEP_WORK  Non-empty to preserve WORK_DIR after the test exits (useful for
+#              debugging; rerun with the same WORK_DIR to skip redownloads).
+set -euo pipefail
+
+ROOT=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")/../.." && pwd)
+IMAGE=${BACKEND_IMAGE:-local-ai-backend:sherpa-onnx}
+MODEL=${REALTIME_STT_MODEL:-omnilingual-0.3b-ctc-q8-sherpa}
+VAD_MODEL=${REALTIME_VAD_MODEL:-silero-vad-sherpa}
+TTS_MODEL=${REALTIME_TTS_MODEL:-vits-ljs-sherpa}
+
+WORK_DIR=${WORK_DIR:-$(mktemp -d -t localai-sherpa-realtime.XXXXXX)}
+if [[ -z "${KEEP_WORK:-}" ]]; then
+    trap 'rm -rf "$WORK_DIR"' EXIT
+fi
+echo "WORK_DIR=$WORK_DIR"
+
+BACKENDS_DIR="$WORK_DIR/backends"
+MODELS_DIR="$WORK_DIR/models"
+mkdir -p "$BACKENDS_DIR/sherpa-onnx" "$MODELS_DIR"
+
+# 1. Extract the sherpa-onnx backend image rootfs. Mirrors the pattern in
+# tests/e2e-backends/backend_test.go:extractImage — docker create + export
+# avoids having to pull and parse layer tarballs.
+if [[ ! -x "$BACKENDS_DIR/sherpa-onnx/run.sh" ]]; then
+    echo "Extracting $IMAGE rootfs into $BACKENDS_DIR/sherpa-onnx ..."
+    CID=$(docker create --entrypoint=/run.sh "$IMAGE")
+    trap 'docker rm -f "$CID" >/dev/null 2>&1 || true; [[ -z "${KEEP_WORK:-}" ]] && rm -rf "$WORK_DIR"' EXIT
+    docker export "$CID" | tar -xC "$BACKENDS_DIR/sherpa-onnx" \
+        --exclude='dev/*' --exclude='proc/*' --exclude='sys/*'
+    docker rm -f "$CID" >/dev/null
+fi
+
+# Make sure run.sh is executable (tar usually preserves this, but belt + braces).
+chmod +x "$BACKENDS_DIR/sherpa-onnx/run.sh" \
+         "$BACKENDS_DIR/sherpa-onnx/sherpa-onnx" 2>/dev/null || true
+
+# 2. Download model files. URLs + sha256s match gallery/index.yaml entries.
+download() {
+    local dst="$1" url="$2" sha="$3" actual
+    if [[ -f "$dst" ]]; then
+        actual=$(sha256sum "$dst" | awk '{print $1}')
+        if [[ "$actual" == "$sha" ]]; then
+            echo "cached: $dst"
+            return
+        fi
+    fi
+    mkdir -p "$(dirname "$dst")"
+    echo "downloading: $url -> $dst"
+    curl -sSfL "$url" -o "$dst"
+    actual=$(sha256sum "$dst" | awk '{print $1}')
+    if [[ "$actual" != "$sha" ]]; then
+        echo "sha256 mismatch for $dst: got $actual, expected $sha" >&2
+        exit 1
+    fi
+}
+
+# Silero VAD (single file)
+download "$MODELS_DIR/silero-vad/silero-vad.onnx" \
+    "https://huggingface.co/onnx-community/silero-vad/resolve/main/onnx/model.onnx" \
+    "a4a068cd6cf1ea8355b84327595838ca748ec29a25bc91fc82e6c299ccdc5808"
+
+# Omnilingual ASR (model + tokens)
+download "$MODELS_DIR/omnilingual-asr/model.int8.onnx" \
+    "https://huggingface.co/csukuangfj/sherpa-onnx-omnilingual-asr-1600-languages-300M-ctc-int8-2025-11-12/resolve/main/model.int8.onnx" \
+    "e7c4e54ee4c4c47829cc6667d5d00ed8ea7bef1dcfeef0fce766f77752a2726c"
+download "$MODELS_DIR/omnilingual-asr/tokens.txt" \
+    "https://huggingface.co/csukuangfj/sherpa-onnx-omnilingual-asr-1600-languages-300M-ctc-int8-2025-11-12/resolve/main/tokens.txt" \
+    "a7a044c52cb29cbe8b0dc1953e92cefd4ca16b0ed968177b6beab21f9a7d0b31"
+
+# VITS-LJS TTS (model + tokens + lexicon)
+download "$MODELS_DIR/vits-ljs/vits-ljs.onnx" \
+    "https://huggingface.co/csukuangfj/vits-ljs/resolve/main/vits-ljs.onnx" \
+    "5bbd273797a9ecf8d94bd6ec02ad16cb41cbb85f055ad98d528ced3e44c9b31a"
+download "$MODELS_DIR/vits-ljs/tokens.txt" \
+    "https://huggingface.co/csukuangfj/vits-ljs/resolve/main/tokens.txt" \
+    "5fee2c6b238d712287f2ecb08f34a8a8b413bcb7390862ef6fb6fd6f0f8d3a17"
+download "$MODELS_DIR/vits-ljs/lexicon.txt" \
+    "https://huggingface.co/csukuangfj/vits-ljs/resolve/main/lexicon.txt" \
+    "bdccfc6da71c45c48e2e0056fcf0aab760577c5f959f6c1b5eb3e3e916fd5a0e"
+
+# 3. Write model config YAMLs matching the gallery entries' shape. These are
+# what the realtime pipeline resolves via LoadModelConfigFileByName.
+cat > "$MODELS_DIR/$VAD_MODEL.yaml" <<EOF
+name: $VAD_MODEL
+backend: sherpa-onnx
+type: vad
+parameters:
+  model: silero-vad/silero-vad.onnx
+known_usecases:
+  - vad
+EOF
+
+cat > "$MODELS_DIR/$MODEL.yaml" <<EOF
+name: $MODEL
+backend: sherpa-onnx
+type: asr
+parameters:
+  model: omnilingual-asr/model.int8.onnx
+options:
+  - subtype=omnilingual
+known_usecases:
+  - transcript
+EOF
+
+cat > "$MODELS_DIR/$TTS_MODEL.yaml" <<EOF
+name: $TTS_MODEL
+backend: sherpa-onnx
+parameters:
+  model: vits-ljs/vits-ljs.onnx
+known_usecases:
+  - tts
+EOF
+
+# 4. Run the Ginkgo spec. REALTIME_TEST_MODEL=realtime-test-pipeline triggers
+# the e2e suite to auto-compose a pipeline YAML from the slot env vars.
+export REALTIME_TEST_MODEL=realtime-test-pipeline
+export REALTIME_VAD="$VAD_MODEL"
+export REALTIME_STT="$MODEL"
+export REALTIME_LLM=mock-llm
+export REALTIME_TTS="$TTS_MODEL"
+export REALTIME_MODELS_PATH="$MODELS_DIR"
+export REALTIME_BACKENDS_PATH="$BACKENDS_DIR"
+
+cd "$ROOT"
+go test -v -timeout 30m ./tests/e2e/... \
+    -ginkgo.focus="Manual audio commit"