Compare commits

16 Commits

Author SHA1 Message Date
Ettore Di Giacinto
9787bee48b fix(buun-llama-cpp): shim cudaMemcpy{To,From}Symbol + WARP_SIZE on fwht128 shuffles
Two more hipblas-only build failures in buun's fattn.cu, fixed under the
same patches/ infrastructure:

1. cudaMemcpyToSymbol / cudaMemcpyFromSymbol — buun's Q² calibration +
   TCQ codebook upload paths call the symbol variants of cudaMemcpy.
   ggml/src/ggml-cuda/vendors/hip.h aliases every other cudaMemcpy*
   name (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2DAsync, …) but the
   symbol pair was never added. 15+ "use of undeclared identifier"
   errors across fattn.cu lines 40, 54, 74-76, 94, 100-101, 371, 883,
   905, 954, 976, 1449, 1463. Add the two missing aliases alongside
   the existing memcpy block.

2. __shfl_xor_sync fwht128 calls — same 3-arg omission pattern as the
   earlier argmax top-K fix. Lines 512 (ggml_cuda_fwht128 intra-warp
   butterfly) and 536 (fwht128_store_half neighbor fetch) drop the
   width argument that hip.h:33 requires. Add WARP_SIZE.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 20:09:36 +00:00
Ettore Di Giacinto
42754d33b9 fix(buun-llama-cpp): pass WARP_SIZE to argmax __shfl_xor_sync calls
Two call sites in ggml/src/ggml-cuda/argmax.cu (the top-K intra-warp
merge added by buun) use the 3-arg CUDA form __shfl_xor_sync(mask, var,
laneMask), omitting the optional width parameter. The hipification shim
at ggml/src/ggml-cuda/vendors/hip.h:33 is a function-like macro that
requires all four arguments, so hipcc fails with:

    argmax.cu:265: too few arguments provided to function-like macro
      invocation
    note: macro '__shfl_xor_sync' defined here:
      #define __shfl_xor_sync(mask, var, laneMask, width) \
              __shfl_xor(var, laneMask, width)

Every other call in the same file already passes WARP_SIZE explicitly;
aligning these two with that convention fixes the hipblas build without
changing CUDA codegen (warpSize is the CUDA default).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 16:29:29 +00:00
Ettore Di Giacinto
7f2b7e4ace fix(buun-llama-cpp): shim atomicAdd(double*,double) for pre-sm_60 CUDA
Buun's Q² calibration path in ggml/src/ggml-cuda/fattn.cu calls
atomicAdd with a double* destination. Native double atomicAdd is only
available on CUDA compute capability 6.0 and later — LocalAI's CUDA 12
Docker image builds for the full published arch range (which includes
sm_50/sm_52), so nvcc fails with:

    fattn.cu:812: error: no instance of overloaded function "atomicAdd"
    matches the argument list, argument types are: (double *, double)

Add the canonical CAS-loop shim from the CUDA C Programming Guide
(B.15 Atomic Functions) guarded on __CUDA_ARCH__ < 600. On sm_60+ the
guard is false and nvcc picks up the native intrinsic as before.

Patch file lives under backend/cpp/buun-llama-cpp/patches/ and is
applied to the cloned fork tree by apply-patches.sh (the infrastructure
already put in place for exactly this class of backport).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 13:57:30 +00:00
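
The CAS-loop idea behind the `atomicAdd(double*, double)` shim in 7f2b7e4ace can be sketched outside CUDA. The following is a minimal Go analogue, not the patch itself: it stores the double as raw 64-bit integer bits and retries a compare-and-swap until no other writer has intervened. The names `atomicAddFloat64` and `sumConcurrently` are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"sync/atomic"
)

// atomicAddFloat64 adds delta to the float64 whose bit pattern lives in addr.
// It reads the old bits, computes old+delta, and retries the compare-and-swap
// until no concurrent writer changed the value in between.
// It returns the previous value, as atomicAdd does.
func atomicAddFloat64(addr *uint64, delta float64) float64 {
	for {
		oldBits := atomic.LoadUint64(addr)
		newBits := math.Float64bits(math.Float64frombits(oldBits) + delta)
		if atomic.CompareAndSwapUint64(addr, oldBits, newBits) {
			return math.Float64frombits(oldBits)
		}
	}
}

// sumConcurrently lets `workers` goroutines each add 1.0 `perWorker` times.
func sumConcurrently(workers, perWorker int) float64 {
	var bits uint64 // bit pattern of 0.0
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				atomicAddFloat64(&bits, 1.0)
			}
		}()
	}
	wg.Wait()
	return math.Float64frombits(atomic.LoadUint64(&bits))
}

func main() {
	fmt.Println(sumConcurrently(8, 1000))
}
```

The retry loop is what makes the update safe without a native double-precision atomic: a failed swap means another writer won the race, so the addition is recomputed from the fresh value.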
Ettore Di Giacinto
6233feb190 ci(buun-llama-cpp): wire backend into test-extra + build matrix
Adds the buun-llama-cpp backend to the same CI pipelines that turboquant
and sherpa-onnx already use:

- scripts/changed-backends.js: path resolution for Dockerfile.buun-llama-cpp,
  plus fork-of-fork detection (changes under backend/cpp/llama-cpp/ also
  retrigger the buun pipeline, mirroring how turboquant is handled).
- .github/workflows/test-extra.yml: detect-changes output and a new
  tests-buun-llama-cpp-grpc job that runs make test-extra-backend-buun-llama-cpp
  (turbo3 V-cache, same rationale as tests-turboquant-grpc).
- .github/workflows/backend.yml: 9 matrix entries (CUDA 12/13, L4T CUDA
  13 ARM64, ROCm, SYCL f32/f16, CPU, L4T ARM64, Vulkan) paired with each
  existing turboquant entry so image builds have platform parity.

Also updates .agents/ai-coding-assistants.md to clarify that AI agents
operating under the human submitter's git identity SHOULD emit
Signed-off-by via `git commit -s` (never inventing or guessing another
identity) — documents the workflow this PR is using.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 12:52:54 +00:00
Ettore Di Giacinto
d6bf3a4969 fix(buun-llama-cpp): drop logit_bias_eog arg from params_from_json_cmpl
Previous substitution kept the call as 5 args, but buun predates the
upstream refactor that also *added* the logit_bias_eog parameter to
params_from_json_cmpl — buun's signature is still the 4-arg form
  (const llama_vocab*, const common_params&, int, const json&)
and it still derives logit_bias_eog internally from the common_params.

Replace the substitution with a line-delete. Guard matches both the
original call (ctx_server.get_meta().logit_bias_eog) and the previously
substituted form (params_base.sampling.logit_bias_eog) so the script
stays safe across re-runs and whatever state the tree was left in.

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 12:52:53 +00:00
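
The re-run safety described in d6bf3a4969 comes from matching both spellings of the argument and deleting the line. The actual patch is a shell script; the sketch below only illustrates the idempotence property in Go. It assumes the argument sits on a line of its own, and `dropLogitBiasLines`, `mentionsAny`, and `sampleCall` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// Both spellings named in the commit message: the original call-site
// argument and the previously substituted one.
var logitBiasForms = []string{
	"ctx_server.get_meta().logit_bias_eog",
	"params_base.sampling.logit_bias_eog",
}

// mentionsAny reports whether text contains any of the needles.
func mentionsAny(text string, needles []string) bool {
	for _, n := range needles {
		if strings.Contains(text, n) {
			return true
		}
	}
	return false
}

// dropLogitBiasLines deletes every line that mentions either form and
// reports whether anything was removed.
func dropLogitBiasLines(src string) (string, bool) {
	var kept []string
	changed := false
	for _, line := range strings.Split(src, "\n") {
		if mentionsAny(line, logitBiasForms) {
			changed = true
			continue
		}
		kept = append(kept, line)
	}
	return strings.Join(kept, "\n"), changed
}

// sampleCall builds an illustrative (not verbatim) 5-arg call site.
func sampleCall(form string) string {
	return "params_from_json_cmpl(\n    vocab,\n    params_base,\n    n_ctx,\n    " +
		form + ",\n    data);"
}

func main() {
	once, changed1 := dropLogitBiasLines(sampleCall(logitBiasForms[0]))
	twice, changed2 := dropLogitBiasLines(once)
	fmt.Println(changed1, changed2, once == twice)
}
```

A second pass finds nothing to delete and returns the text unchanged, which is the property the guard is meant to provide.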
Ettore Di Giacinto
b27d38a53d fix(buun-llama-cpp): backport logit_bias_eog field to grpc-server copy
LocalAI's shared grpc-server.cpp reaches
ctx_server.get_meta().logit_bias_eog twice (the twin params_from_json_cmpl
callsites). That accessor was added to server_context_meta upstream after
buun's 2026-04-05 fork-point, so compiling against buun errors with
  'struct server_context_meta' has no member named 'logit_bias_eog'.

Rewrite the call sites — only in the buun grpc-server.cpp copy — to source
the vector from params_base.sampling.logit_bias_eog instead. That vector is
the underlying data the upstream meta accessor eventually returns (buun
still carries common_params_sampling::logit_bias_eog at common.h:280), so
the substitution yields identical behavior on both trees.

The sed is guarded by a grep for the call site, so this patch is
self-disabling once buun rebases past the upstream refactor.

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 12:52:53 +00:00
Ettore Di Giacinto
45756b19dc test(gallery): extend importer specs to cover buun-llama-cpp
Two additions that pair with the new backend:
- An Import()-side case that asserts preference buun-llama-cpp produces
  backend: buun-llama-cpp in the emitted YAML (mirrors the existing
  ik-llama-cpp and turboquant cases).
- AdditionalBackends() spec now asserts all three drop-in replacements
  are advertised, and verifies buun-llama-cpp's Modality/Description
  alongside the other two.

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 12:52:53 +00:00
Ettore Di Giacinto
cd6079b2f3 feat(backend): add buun-llama-cpp fork (DFlash + TCQ KV-cache)
spiritbuun/buun-llama-cpp is a fork of TheTom/llama-cpp-turboquant that adds
two independent features on top: DFlash block-diffusion speculative decoding
(via a dedicated DFlashDraftModel GGUF arch) and two extra TCQ KV-cache
variants (turbo2_tcq, turbo3_tcq) on top of TurboQuant's turbo2/turbo3/turbo4.

Follows the turboquant thin-wrapper pattern — reuses backend/cpp/llama-cpp
grpc-server sources verbatim, patches only the build copy to extend the KV
allow-list and wire up buun-exclusive tree_budget / draft_topk options.
DraftModel is already wired end-to-end (proto field 39 → params.speculative),
so DFlash activation only needs the existing options passthrough
(spec_type:dflash) plus the drafter path in draft_model.

CacheTypeOptions now surfaces the five turbo* values so the React UI dropdown
shows them — benefits turboquant too (previously users had to type them in
YAML manually).

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 12:52:53 +00:00
Richard Palethorpe
3db60b57e6 fix(realtime): consume ChatDeltas when C++ autoparser clears Response (#9538)
The llama.cpp C++-side chat autoparser clears Reply.Message and delivers
parsed content/reasoning/tool-calls via Reply.chat_deltas. chat.go handles
this (non-SSE path uses ToolCallsFromChatDeltas/ContentFromChatDeltas/
ReasoningFromChatDeltas), but realtime.go only read pred.Response, so any
model routed through the autoparser (Qwen2.5/3 and friends) produced a
silent reply: backend emitted N tokens, the session surface saw zero.

Mirror the non-SSE chat path in realtime's triggerResponse: when deltas
carry tool calls or content, use them directly; otherwise fall back to
the existing raw-text parsing.

Assisted-by: claude-opus-4-7-1M [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-24 14:41:38 +02:00
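
The fallback order described in 3db60b57e6 can be sketched as follows. This is a simplified illustration under assumed types, not the code in realtime.go: `Prediction`, `ChatDelta`, `ToolCall`, and `resolveReply` are hypothetical stand-ins, and the raw-text parsing step is reduced to returning the response string.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical stand-ins for the backend reply types.
type ToolCall struct{ Name, Arguments string }

type ChatDelta struct {
	Content   string
	ToolCalls []ToolCall
}

type Prediction struct {
	Response   string // raw text; empty when the autoparser cleared it
	ChatDeltas []ChatDelta
}

// resolveReply prefers parsed deltas when they carry content or tool calls
// and falls back to the raw response text otherwise.
func resolveReply(pred Prediction) (content string, calls []ToolCall, fromDeltas bool) {
	var sb strings.Builder
	for _, d := range pred.ChatDeltas {
		sb.WriteString(d.Content)
		calls = append(calls, d.ToolCalls...)
	}
	if sb.Len() > 0 || len(calls) > 0 {
		return sb.String(), calls, true
	}
	// Fallback path: the existing raw-text parsing would run here.
	return pred.Response, nil, false
}

func main() {
	parsed := Prediction{ChatDeltas: []ChatDelta{{Content: "Hel"}, {Content: "lo"}}}
	content, _, fromDeltas := resolveReply(parsed)
	fmt.Println(content, fromDeltas)

	raw := Prediction{Response: "plain text"}
	content, _, fromDeltas = resolveReply(raw)
	fmt.Println(content, fromDeltas)
}
```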
Richard Palethorpe
13734ae9fa feat: Add Sherpa ONNX backend for ASR and TTS (#8523)
feat(backend): Add Sherpa ONNX backend and Omnilingual ASR

Adds a new Go backend wrapping sherpa-onnx via purego (no cgo). Same
approach as opus/stablediffusion-ggml/whisper — a thin C shim
(csrc/shim.c + shim.h → libsherpa-shim.so) wraps the bits purego
can't reach directly: nested struct config writes, result-struct field
reads, and the streaming TTS callback trampoline. The Go side uses
opaque uintptr handles and purego.NewCallback for the TTS callback.

Supports:
- VAD via sherpa-onnx's Silero VAD
- Offline ASR: Whisper, Paraformer, SenseVoice, Omnilingual CTC
- Online/streaming ASR: zipformer transducer with endpoint detection
  (AudioTranscriptionStream emits delta events during decode)
- Offline TTS: VITS (LJS, etc.)
- Streaming TTS: sherpa-onnx's callback API → PCM chunks on a channel,
  prefixed by a streaming WAV header

Gallery entries: omnilingual-0.3b-ctc-q8-sherpa (1600-language offline
ASR), streaming-zipformer-en-sherpa (low-latency streaming ASR),
silero-vad-sherpa, vits-ljs-sherpa.

E2E coverage: tests/e2e-backends for offline + streaming ASR,
tests/e2e for the full realtime pipeline (VAD + STT + TTS).

Assisted-by: claude-opus-4-7-1M [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-24 14:40:06 +02:00
Ettore Di Giacinto
c0920f3273 fix(ik-llama-cpp): patch clip.cpp for new ggml_quantize_chunk signature (#9531)
Bumps ik_llama.cpp pin to 16996aeab7. Upstream 286ce32...16996ae adds a
trailing `const struct quantize_user_data *` parameter to
`ggml_quantize_chunk` (PR ikawrakow/ik_llama.cpp#1677) but leaves
`examples/llava/clip.cpp` unchanged because their build has moved to
`examples/mtmd/`. LocalAI's prepare.sh still copies from
`examples/llava/`, so the dead 7-arg call reaches the grpc-server
compile and fails. Patch the call site to pass `nullptr` for the new
param.

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash]
2026-04-24 13:07:26 +02:00
LocalAI [bot]
7c1934b183 chore: ⬆️ Update ggml-org/llama.cpp to 187a45637054881ecacf17f8e2f6f8f2ba7df1c7 (#9520)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-24 09:17:06 +02:00
Tai An
5e062b4d1f fix: use SetFunctionCallNameString when forcing a specific tool (3 sites) (#9526)
* fix(anthropic): use SetFunctionCallNameString for specific tool forcing

* fix(openai/realtime): use SetFunctionCallNameString for specific tool forcing

* fix(openresponses): use SetFunctionCallNameString for specific tool forcing
2026-04-24 09:06:42 +02:00
Ettore Di Giacinto
4906cbad04 feat: add biometrics UI (#9524)
* feat(react-ui): add Face & Voice Recognition pages

Expose the face and voice biometrics endpoints
(/v1/face/*, /v1/voice/*) through the React UI. Each page has four
tabs driving the six endpoints per modality: Analyze (demographics
with bounding boxes / waveform segments), Compare (verify with a
match gauge and live threshold slider), Enrollment (register /
identify / forget with a top-K matches view), Embedding (raw
vector inspector with sparkline + copy).

MediaInput supports file upload plus live capture: webcam
snap-to-canvas for face, MediaRecorder -> AudioContext ->
16-bit PCM mono WAV transcode for voice (libsndfile on the
backend only handles WAV/FLAC/OGG natively).

Sidebar gets a new Biometrics section feature-gated on
face_recognition / voice_recognition; routes are wrapped in
<RequireFeature>. No new dependencies -- Font Awesome icons
picked from the Free set.

Assisted-by: Claude:Opus 4.7

* fix(localai): accept data URI prefixes with codec/charset params

Browser MediaRecorder produces data URIs like
  data:audio/webm;codecs=opus;base64,...
so the pre-';base64,' section can carry multiple parameter
segments. The `^data:([^;]+);base64,` regex in pkg/utils/base64.go
and core/http/endpoints/localai/audio.go only matched exactly one
segment, so recordings straight from the React UI's live-capture
tab failed the strip and then tripped the base64 decoder on the
leading 'data:' literal, surfacing as
  "invalid audio base64: illegal base64 data at input byte 4"

Widened both regexes to `^data:[^,]+?;base64,` so any number of
';param=value' segments between the mime type and ';base64,' are
tolerated. Added a regression test covering the MediaRecorder
shape.

Assisted-by: Claude:Opus 4.7
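
The difference between the two patterns quoted above can be checked with a short Go program. The regular expressions are copied from the commit message; the helper `stripDataURI` and the sample payloads are assumptions made for the illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Old pattern: exactly one segment between "data:" and ";base64,".
	oldPrefix = regexp.MustCompile(`^data:([^;]+);base64,`)
	// Widened pattern: any number of ";param=value" segments.
	newPrefix = regexp.MustCompile(`^data:[^,]+?;base64,`)
)

// stripDataURI removes a matching data-URI prefix and leaves the input
// unchanged when the pattern does not match.
func stripDataURI(re *regexp.Regexp, s string) string {
	return re.ReplaceAllString(s, "")
}

func main() {
	recorded := "data:audio/webm;codecs=opus;base64,T2dnUw=="
	fmt.Println(stripDataURI(oldPrefix, recorded)) // prefix survives
	fmt.Println(stripDataURI(newPrefix, recorded)) // payload only
}
```

With the old pattern the prefix is left in place, so a base64 decoder would then be handed the literal `data:` text, which matches the failure described above.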

* fix(insightface): scope pack ONNX loading to known manifests

LocalAI's gallery extracts buffalo_* zips flat into the models
directory, which inevitably mixes with ONNX files from other
backends (opencv face engine, MiniFASNet antispoof, WeSpeaker
voice embedding) and older buffalo pack installs. Feeding those
foreign files into insightface's model_zoo.get_model() blows up
inside the router -- it assumes a 4-D NCHW input and indexes
`input_shape[2]` on tensors that aren't shaped like a face model,
raising IndexError mid-load and leaving the backend unusable.

The router's dispatch isn't amenable to per-file try/except alone
(first-file-wins picks det_10g.onnx from buffalo_l even when the
user asked for buffalo_sc -- alphabetical order happens to favour
the wrong pack). Instead, ship an explicit manifest of the
upstream v0.7 pack contents and scope the glob to that when the
requested pack is known. The manifest is small and stable; future
packs can be added alongside or fall through to the tolerance
loop, which also swallows any remaining IndexError / ValueError
from foreign files with a clear `[insightface] skipped` stderr
line for diagnostics.

Assisted-by: Claude:Opus 4.7

* fix(speaker-recognition): extract FBank features for rank-3 ONNX encoders

Pre-exported speaker-encoder ONNX graphs come in two shapes:

  rank-2  [batch, samples]           -- some 3D-Speaker exports,
                                        take raw waveform directly.
  rank-3  [batch, frames, n_mels]    -- WeSpeaker and most Kaldi-
                                        lineage encoders, expect
                                        pre-computed Kaldi FBank.

OnnxDirectEngine unconditionally fed `audio.reshape(1, -1)` --
correct for rank-2, but a rank mismatch on rank-3, which
surfaced to the UI as
  "Invalid rank for input: feats Got: 2 Expected: 3"

Detect the input rank at session init and run Kaldi FBank
(80-dim, 25ms/10ms frames, dither=0.0, per-utterance CMN) before
the forward pass when rank>=3. All knobs are configurable via
backend options for encoders that deviate from defaults.

torchaudio.compliance.kaldi is already in the backend's
requirements (SpeechBrain pulls torchaudio in), so no new
dependency.

Assisted-by: Claude:Opus 4.7

* fix(biometrics): isolate face and voice vector stores

Face (ArcFace, 512-D) and voice (ECAPA-TDNN 192-D / WeSpeaker
256-D) biometric embeddings were colliding inside a single
in-memory local-store instance. Enrolling one after the other
failed with
  "Try to add key with length N when existing length is M"
because local-store correctly refuses to mix dimensions in one
keyspace.

The registries were constructed with `storeName=""`, which in
StoreBackend() is just a WithModel() call. But ModelLoader's
cache is keyed on `modelID`, not `model` -- so both registries
collapsed to the same `modelID=""` slot and reused the same
backend process despite looking isolated on paper.

Three complementary fixes:

  1. application.go -- give each registry a distinct default
     namespace ("localai-face-biometrics" /
     "localai-voice-biometrics"). The comment claimed
     isolation; now it's actually enforced.

  2. stores.go -- pass the storeName as both WithModelID and
     WithModel so the ModelLoader cache key separates
     namespaces and the loader spawns distinct processes.

  3. local-store/store.go -- drop the Load() `opts.Model != ""`
     guard. It was there to prevent generic model-loading loops
     from picking up local-store by accident, but that auto-load
     path is being retired; the guard now just blocks legitimate
     namespace isolation. opts.Model is treated as a tag; the
     per-tuple process isolation upstream handles discrimination.

Assisted-by: Claude:Opus 4.7
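
The cache-key collision described above can be reproduced with a toy loader. This is a sketch under stated assumptions: `modelLoader`, `backend`, and `load` are hypothetical and only mimic a cache keyed on the model ID; the two namespace strings are the defaults named in the commit message.

```go
package main

import "fmt"

// backend stands in for a spawned backend process.
type backend struct{ pid int }

// modelLoader caches backends by modelID only, mirroring the behaviour
// the commit message describes.
type modelLoader struct {
	cache   map[string]*backend
	nextPID int
}

func newModelLoader() *modelLoader {
	return &modelLoader{cache: map[string]*backend{}}
}

// load returns the cached backend for modelID or spawns a new one.
// The model argument is deliberately ignored by the cache lookup.
func (l *modelLoader) load(modelID, model string) *backend {
	if b, ok := l.cache[modelID]; ok {
		return b
	}
	l.nextPID++
	b := &backend{pid: l.nextPID}
	l.cache[modelID] = b
	return b
}

func main() {
	// Only the model is set: both registries land in the "" slot.
	l := newModelLoader()
	face := l.load("", "localai-face-biometrics")
	voice := l.load("", "localai-voice-biometrics")
	fmt.Println("shared backend:", face == voice)

	// Name passed as both model ID and model: separate slots.
	l = newModelLoader()
	face = l.load("localai-face-biometrics", "localai-face-biometrics")
	voice = l.load("localai-voice-biometrics", "localai-voice-biometrics")
	fmt.Println("shared backend:", face == voice)
}
```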

* fix(gallery): stale-file cleanup and upgrade-tmp directory safety

Two related robustness fixes for backend install/upgrade:

pkg/downloader/uri.go
  OCI downloads passed through
      if filepath.Ext(filePath) != "" ...
          filePath = filepath.Dir(filePath)
  which was intended to redirect file-shaped download targets
  into their parent directory for OCI extraction. The heuristic
  misfires on directory-shaped paths with a dot-suffix --
  gallery.UpgradeBackend uses
      tmpPath = "<backendsPath>/<name>.upgrade-tmp"
  and Go's filepath.Ext treats ".upgrade-tmp" as an extension.
  The rewrite landed the extraction at "<backendsPath>/", which
  then **overwrote the real install** (backends/<name>/) with a
  flat-layout file and left a stray run.sh at the top level. The
  tmp dir itself stayed empty, so the validation step that
  checked "<tmpPath>/run.sh" predictably failed with
      "upgrade validation failed: run.sh not found in new backend"
  Every manual upgrade silently corrupted the backends tree this
  way. Guard the rewrite behind "target isn't already an existing
  directory" -- InstallBackend / UpgradeBackend both pre-create
  the target as a directory, so they get the correct behaviour;
  existing file-path callers with a genuine dot-extension still
  get the parent redirect.

core/gallery/backends.go
  InstallBackend's MkdirAll returned ENOTDIR when something at
  the target path was already a file (legacy dev builds dropped
  golang backend binaries directly at `<backendsPath>/<name>`
  instead of nesting them under their own subdir). That
  permanently blocked reinstall and upgrade for anyone carrying
  that state, since every retry hit the same error. Detect a
  pre-existing non-directory, warn, and remove it before the
  MkdirAll so the fresh install can write the correct nested
  layout with metadata.json + run.sh.

Assisted-by: Claude:Opus 4.7
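
The `filepath.Ext` misfire and the directory guard can be sketched in a few lines. The functions `naiveTarget`, `guardedTarget`, and `demo` are hypothetical names for this illustration and are not the code in pkg/downloader/uri.go; the backend name `mybackend` is made up.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// naiveTarget mirrors the old heuristic: anything with an "extension"
// is treated as a file and redirected to its parent directory.
func naiveTarget(p string) string {
	if filepath.Ext(p) != "" {
		return filepath.Dir(p)
	}
	return p
}

// guardedTarget keeps the path when it is already an existing directory
// and applies the parent redirect only otherwise.
func guardedTarget(p string) string {
	if info, err := os.Stat(p); err == nil && info.IsDir() {
		return p
	}
	return naiveTarget(p)
}

// demo pre-creates "<root>/mybackend.upgrade-tmp" as a directory and
// returns what each heuristic picks as the extraction target.
func demo() (naive, guarded, tmp, root string) {
	dir, err := os.MkdirTemp("", "backends-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	root = filepath.Clean(dir)
	tmp = filepath.Join(root, "mybackend.upgrade-tmp")
	if err := os.Mkdir(tmp, 0o755); err != nil {
		panic(err)
	}
	return naiveTarget(tmp), guardedTarget(tmp), tmp, root
}

func main() {
	fmt.Println(filepath.Ext("mybackend.upgrade-tmp")) // ".upgrade-tmp"
	naive, guarded, tmp, root := demo()
	fmt.Println(naive == root)  // old heuristic extracts into the backends root
	fmt.Println(guarded == tmp) // guard keeps the pre-created tmp directory
}
```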

* fix(galleryop): refresh upgrade cache after backend ops

UpgradeChecker caches the last upgrade-check result and only
refreshes on the 6-hour tick or after an auto-upgrade cycle.
Manual upgrades (POST /api/backends/upgrade/:name) go through
the async galleryop worker, which completes the upgrade
correctly but never tells UpgradeChecker to re-check -- so
/api/backends/upgrades continued to list a just-upgraded backend
as upgradeable, indistinguishable from a failed upgrade, for up
to six hours.

Add an optional `OnBackendOpCompleted func()` hook on
GalleryService that fires after every successful install /
upgrade / delete on the backend channel (async, so a slow
callback doesn't stall the queue). startup.go wires it to
UpgradeChecker.TriggerCheck after both services exist. Result:
the upgrade banner clears within milliseconds of the worker
finishing.

Assisted-by: Claude:Opus 4.7
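
The optional completion hook can be sketched as a nil-checked function field fired in its own goroutine. `GalleryService` and `OnBackendOpCompleted` are named in the commit message; the `finishOp` method and the `demo` helper are assumptions made for this illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// GalleryService carries an optional hook, as described in the commit.
type GalleryService struct {
	OnBackendOpCompleted func()
}

// finishOp is a hypothetical stand-in for the end of a backend operation.
// The hook fires only on success and runs asynchronously so that a slow
// callback cannot stall the queue.
func (g *GalleryService) finishOp(err error) {
	if err != nil || g.OnBackendOpCompleted == nil {
		return
	}
	go g.OnBackendOpCompleted()
}

// demo reports whether the hook fired after a successful and a failed op.
func demo() (afterSuccess, afterFailure bool) {
	triggered := make(chan struct{}, 1)
	svc := &GalleryService{OnBackendOpCompleted: func() { triggered <- struct{}{} }}

	svc.finishOp(nil)
	select {
	case <-triggered:
		afterSuccess = true
	case <-time.After(time.Second):
	}

	svc.finishOp(errors.New("upgrade failed"))
	select {
	case <-triggered:
		afterFailure = true
	case <-time.After(50 * time.Millisecond):
	}
	return afterSuccess, afterFailure
}

func main() {
	ok, failed := demo()
	fmt.Println(ok, failed)
}
```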

* build: prepend GOPATH/bin to PATH for protogen-go

install-go-tools runs `go install` for protoc-gen-go and
protoc-gen-go-grpc, which writes them into `go env GOPATH`/bin.
That directory isn't on every dev's PATH, and protoc resolves
its code-gen plugins via PATH, so the immediately-following
protoc invocation fails with
  "protoc-gen-go: program not found"
which in turn blocks `make build` and any
`make backends/%` target that depends on build.

Prepend `go env GOPATH`/bin to PATH for the protoc invocation
so the freshly-installed plugins are found without requiring a
shell-profile change.

Assisted-by: Claude:Opus 4.7

* refactor(ui-api): non-blocking backend upgrade handler with opcache

POST /api/backends/upgrade/:name used to send the ManagementOp
directly onto the unbuffered BackendGalleryChannel, which blocked
the HTTP request whenever the galleryop worker was busy with a
prior operation. The op also didn't show up in /api/operations,
so the Backends UI couldn't reflect upgrade progress on the
affected row.

Register the op in opcache immediately, wrap it in a cancellable
context, store the cancellation function on the GalleryService,
and push onto the channel from a goroutine so the handler
returns right away. Response gains a `jobID` field and a
`message` string so clients have a consistent handle regardless
of whether the op is queued or running.

Pairs with the OnBackendOpCompleted hook added in the galleryop
commit — together the UI sees the upgrade start, watches
progress via /api/operations, and drops the "upgradeable" flag
the moment the worker finishes.

Assisted-by: Claude:Opus 4.7
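
The non-blocking pattern described above (register first, then send from a goroutine) can be sketched like this. All names here (`galleryService`, `managementOp`, `upgrade`, `statusOf`) and the job ID scheme are hypothetical; the status map is only a stand-in for opcache.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

type managementOp struct {
	jobID string
	ctx   context.Context
}

type galleryService struct {
	backendCh chan managementOp // unbuffered, like the channel in the commit
	mu        sync.Mutex
	status    map[string]string             // stand-in for opcache
	cancels   map[string]context.CancelFunc // per-job cancellation
}

func newGalleryService() *galleryService {
	return &galleryService{
		backendCh: make(chan managementOp),
		status:    map[string]string{},
		cancels:   map[string]context.CancelFunc{},
	}
}

// upgrade registers the job, stores its cancel function, and hands the
// blocking channel send to a goroutine so the caller returns right away.
func (s *galleryService) upgrade(name string) (jobID, message string) {
	jobID = "upgrade-" + name // hypothetical ID scheme
	ctx, cancel := context.WithCancel(context.Background())
	s.mu.Lock()
	s.status[jobID] = "queued"
	s.cancels[jobID] = cancel
	s.mu.Unlock()
	go func() {
		select {
		case s.backendCh <- managementOp{jobID: jobID, ctx: ctx}:
		case <-ctx.Done(): // cancelled before a worker picked it up
		}
	}()
	return jobID, "backend upgrade queued"
}

func (s *galleryService) statusOf(jobID string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.status[jobID]
}

func main() {
	svc := newGalleryService()
	id, msg := svc.upgrade("llama-cpp") // returns although no worker is receiving yet
	fmt.Println(id, msg, svc.statusOf(id))
	op := <-svc.backendCh // a worker picks the op up later
	fmt.Println("worker received", op.jobID)
}
```

Selecting on `ctx.Done()` in the sender keeps the goroutine from leaking if the job is cancelled while the worker is still busy.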
2026-04-24 08:50:34 +02:00
LocalAI [bot]
c755cd5ab5 feat(swagger): update swagger (#9518)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-23 23:26:50 +02:00
LocalAI [bot]
0fb04f7ac3 chore(model-gallery): ⬆️ update checksum (#9522)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-23 23:26:27 +02:00
82 changed files with 7812 additions and 173 deletions

@@ -35,19 +35,33 @@ All contributions must comply with LocalAI's licensing requirements:
 ## Signed-off-by and Developer Certificate of Origin
-**AI agents MUST NOT add `Signed-off-by` tags.** Only humans can legally
-certify the Developer Certificate of Origin (DCO). The human submitter
-is responsible for:
+Only humans can certify the Developer Certificate of Origin (DCO). AI
+agents MUST NOT invent or guess a human identity for `Signed-off-by`
+doing so forges the DCO certification.
-- Reviewing all AI-generated code
+However, when a human operator explicitly directs the AI to commit on
+their behalf, the AI is acting as a typing tool — no different from an
+editor macro or `git commit -s`. In that case the AI SHOULD add
+`Signed-off-by:` using the **configured `user.name` / `user.email`** of
+the current git repository (i.e. the operator's own identity). The
+resulting trailer is the operator's signature; they take responsibility
+for it by reviewing and pushing the commit. The AI MUST NOT use any
+other identity and MUST NOT add its own name to the sign-off.
+When running `git commit`, prefer `git commit --signoff` (or `-s`) so
+the trailer is emitted by git itself from the configured identity,
+rather than hand-writing it in a heredoc — this guarantees the sign-off
+matches whatever identity the operator is currently using.
+The human submitter remains responsible for:
+- Reviewing all AI-generated code before it's pushed or merged
 - Ensuring compliance with licensing requirements
-- Adding their own `Signed-off-by` tag (when the project requires DCO)
-  to certify the contribution
 - Taking full responsibility for the contribution
-AI agents MUST NOT add `Co-Authored-By` trailers for themselves either.
-A human reviewer owns the contribution; the AI's involvement is recorded
-via `Assisted-by` (see below).
+AI agents MUST NOT add `Co-Authored-By` trailers for themselves. A human
+reviewer owns the contribution; the AI's involvement is recorded via
+`Assisted-by` (see below).
 ## Attribution
@@ -84,6 +98,12 @@ Assisted-by: Claude:claude-opus-4-7 golangci-lint
 Signed-off-by: Jane Developer <jane@example.com>
 ```
+The `Signed-off-by` line uses Jane's own identity because Jane is the
+submitter operating the AI. If Jane asks Claude to create the commit via
+`git commit -s`, git emits that exact trailer from Jane's configured
+identity — no separate human step is needed beyond Jane reviewing the
+diff before pushing.
 ## Scope and Responsibility
 Using an AI assistant does not reduce the contributor's responsibility.

@@ -399,6 +399,19 @@ jobs:
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-buun-llama-cpp'
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "buun-llama-cpp"
dockerfile: "./backend/Dockerfile.buun-llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -894,6 +907,19 @@ jobs:
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-buun-llama-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "buun-llama-cpp"
dockerfile: "./backend/Dockerfile.buun-llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -920,6 +946,19 @@ jobs:
backend: "turboquant"
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-buun-llama-cpp'
base-image: "ubuntu:24.04"
runs-on: 'ubuntu-24.04-arm'
ubuntu-version: '2404'
backend: "buun-llama-cpp"
dockerfile: "./backend/Dockerfile.buun-llama-cpp"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1454,6 +1493,19 @@ jobs:
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-buun-llama-cpp'
runs-on: 'ubuntu-latest'
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "buun-llama-cpp"
dockerfile: "./backend/Dockerfile.buun-llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
@@ -1703,6 +1755,19 @@ jobs:
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f32-buun-llama-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "buun-llama-cpp"
dockerfile: "./backend/Dockerfile.buun-llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
@@ -1729,6 +1794,19 @@ jobs:
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f16-buun-llama-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "buun-llama-cpp"
dockerfile: "./backend/Dockerfile.buun-llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'intel'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2134,6 +2212,19 @@ jobs:
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-buun-llama-cpp'
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "buun-llama-cpp"
dockerfile: "./backend/Dockerfile.buun-llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
@@ -2173,6 +2264,19 @@ jobs:
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2204'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64-buun-llama-cpp'
base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
runs-on: 'ubuntu-24.04-arm'
backend: "buun-llama-cpp"
dockerfile: "./backend/Dockerfile.buun-llama-cpp"
context: "./"
ubuntu-version: '2204'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2199,6 +2303,19 @@ jobs:
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-buun-llama-cpp'
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "buun-llama-cpp"
dockerfile: "./backend/Dockerfile.buun-llama-cpp"
context: "./"
ubuntu-version: '2404'
# Stablediffusion-ggml
- build-type: ''
cuda-major-version: ""
@@ -2877,6 +2994,49 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
# sherpa-onnx CPU
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-sherpa-onnx'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "sherpa-onnx"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# sherpa-onnx CUDA 12
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-sherpa-onnx'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "sherpa-onnx"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# sherpa-onnx CUDA 13 — requires onnxruntime 1.24.x+ for the
# gpu_cuda13 tarball; sherpa-onnx SHERPA_COMMIT pins to v1.12.39.
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-sherpa-onnx'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "sherpa-onnx"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
backend-jobs-darwin:
uses: ./.github/workflows/backend_build_darwin.yml
strategy:

@@ -32,6 +32,7 @@ jobs:
llama-cpp: ${{ steps.detect.outputs.llama-cpp }}
ik-llama-cpp: ${{ steps.detect.outputs.ik-llama-cpp }}
turboquant: ${{ steps.detect.outputs.turboquant }}
buun-llama-cpp: ${{ steps.detect.outputs['buun-llama-cpp'] }}
vllm: ${{ steps.detect.outputs.vllm }}
sglang: ${{ steps.detect.outputs.sglang }}
acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
@@ -40,6 +41,7 @@ jobs:
kokoros: ${{ steps.detect.outputs.kokoros }}
insightface: ${{ steps.detect.outputs.insightface }}
speaker-recognition: ${{ steps.detect.outputs.speaker-recognition }}
sherpa-onnx: ${{ steps.detect.outputs.sherpa-onnx }}
steps:
- name: Checkout repository
uses: actions/checkout@v6
@@ -506,6 +508,72 @@ jobs:
- name: Build llama-cpp backend image and run audio transcription gRPC e2e tests
run: |
make test-extra-backend-llama-cpp-transcription
# Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked LLM.
# Builds the sherpa-onnx Docker image, extracts the rootfs so the e2e suite
# can discover the backend binary + shared libs, downloads the three model
# bundles (silero-vad, omnilingual-asr, vits-ljs) and drives the realtime
# websocket spec end-to-end.
tests-sherpa-onnx-realtime:
needs: detect-changes
if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Setup Node.js
uses: actions/setup-node@v6
with:
node-version: '22'
- name: Build sherpa-onnx backend image and run realtime e2e tests
run: |
make test-extra-e2e-realtime-sherpa
# Streaming ASR via the sherpa-onnx online recognizer (zipformer
# transducer). Exercises both AudioTranscription (buffered) and
# AudioTranscriptionStream (real-time deltas) on the e2e-backends
# harness.
tests-sherpa-onnx-grpc-transcription:
needs: detect-changes
if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Build sherpa-onnx backend image and run streaming ASR gRPC e2e tests
run: |
make test-extra-backend-sherpa-onnx-transcription
# VITS TTS via the sherpa-onnx backend. Drives both TTS (file write) and
# TTSStream (PCM chunks) on the e2e-backends harness.
tests-sherpa-onnx-grpc-tts:
needs: detect-changes
if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Build sherpa-onnx backend image and run TTS gRPC e2e tests
run: |
make test-extra-backend-sherpa-onnx-tts
tests-ik-llama-cpp-grpc:
needs: detect-changes
if: needs.detect-changes.outputs.ik-llama-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
@@ -546,6 +614,30 @@ jobs:
- name: Build turboquant backend image and run gRPC e2e tests
run: |
make test-extra-backend-turboquant
tests-buun-llama-cpp-grpc:
needs: detect-changes
if: needs.detect-changes.outputs['buun-llama-cpp'] == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
# Exercises the buun-llama-cpp (fork-of-a-fork) backend with the
# fork-specific TurboQuant/TCQ KV-cache types. BACKEND_TEST_CACHE_TYPE_V
# is set to turbo3 so the test round-trips through the fork's KV
# allow-list — picking a stock llama.cpp type would only re-test the
# shared code path. DFlash speculative decoding is not exercised here
# because the one known public target/drafter pair (Qwen3.5-27B) is too
# large for CI.
- name: Build buun-llama-cpp backend image and run gRPC e2e tests
run: |
make test-extra-backend-buun-llama-cpp
# tests-vllm-grpc is currently disabled in CI.
#
# The prebuilt vllm CPU wheel is compiled with AVX-512 VNNI/BF16


@@ -195,7 +195,7 @@ jobs:
run: go version
- name: Dependencies
run: |
brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm opus
brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm opus ffmpeg
pip install --user --no-cache-dir grpcio-tools grpcio
- name: Setup Node.js
uses: actions/setup-node@v6


@@ -1,5 +1,5 @@
# Disable parallel execution for backend builds
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/buun-llama-cpp backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad backends/sherpa-onnx
GOCMD=go
GOTEST=$(GOCMD) test
@@ -394,7 +394,13 @@ protoc:
.PHONY: protogen-go
protogen-go: protoc install-go-tools
mkdir -p pkg/grpc/proto
./protoc --experimental_allow_proto3_optional -Ibackend/ --go_out=pkg/grpc/proto/ --go_opt=paths=source_relative --go-grpc_out=pkg/grpc/proto/ --go-grpc_opt=paths=source_relative \
# install-go-tools writes protoc-gen-go and protoc-gen-go-grpc into
# $(shell go env GOPATH)/bin, which isn't on every dev's PATH. protoc
# resolves its code-gen plugins via PATH, so without this prefix the
# generate step fails with "protoc-gen-go: program not found". Prepend
# GOPATH/bin so the freshly-installed plugins win without requiring a
# shell-profile change.
PATH="$$(go env GOPATH)/bin:$$PATH" ./protoc --experimental_allow_proto3_optional -Ibackend/ --go_out=pkg/grpc/proto/ --go_opt=paths=source_relative --go-grpc_out=pkg/grpc/proto/ --go-grpc_opt=paths=source_relative \
backend/backend.proto
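The effect of the `PATH` prefix can be reproduced in isolation. The sketch below is an illustration only, not part of the Makefile: `protoc-gen-demo` and the temporary directory are hypothetical stand-ins for a plugin installed into `$(go env GOPATH)/bin`. It shows that a program living outside `PATH` is not resolvable until its directory is prepended for that one command.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical stand-in for $(go env GOPATH)/bin: a directory that is not on PATH.
plugin_dir=$(mktemp -d)
printf '#!/bin/sh\necho plugin-ok\n' > "$plugin_dir/protoc-gen-demo"
chmod +x "$plugin_dir/protoc-gen-demo"

# lookup resolves the (hypothetical) plugin name using whatever PATH it is given.
lookup() {
    PATH="$1" bash -c 'command -v protoc-gen-demo || echo "not found"'
}

before=$(lookup "$PATH")               # plugin directory not on PATH
after=$(lookup "$plugin_dir:$PATH")    # plugin directory prepended, as in the recipe

echo "without prefix: $before"
echo "with prefix:    $after"
rm -rf "$plugin_dir"
```

Because the assignment is scoped to a single command, the caller's own `PATH` is left unchanged, which matches the "without requiring a shell-profile change" goal stated in the comment.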
core/config/inference_defaults.json: ## Fetch inference defaults from unsloth (only if missing)
@@ -539,6 +545,19 @@ test-extra-backend-turboquant: docker-build-turboquant
BACKEND_TEST_CACHE_TYPE_V=turbo3 \
$(MAKE) test-extra-backend
## buun-llama-cpp: exercises the fork-of-a-fork backend (spiritbuun/buun-llama-cpp)
## with the *TurboQuant/TCQ-specific* KV-cache types (turbo3 for V). Same rationale
## as turboquant above: picking a standard llama.cpp type would only re-test the
## shared code path. buun inherits turboquant's turbo2/turbo3/turbo4 and adds
## turbo2_tcq / turbo3_tcq on top. DFlash speculative decoding is not exercised
## here because no small DFlash drafter model exists (the known public pair is
## Qwen3.5-27B, ~54 GB).
test-extra-backend-buun-llama-cpp: docker-build-buun-llama-cpp
BACKEND_IMAGE=local-ai-backend:buun-llama-cpp \
BACKEND_TEST_CACHE_TYPE_K=q8_0 \
BACKEND_TEST_CACHE_TYPE_V=turbo3 \
$(MAKE) test-extra-backend
## Audio transcription wrapper for the llama-cpp backend.
## Drives the new AudioTranscription / AudioTranscriptionStream RPCs against
## ggml-org/Qwen3-ASR-0.6B-GGUF (a small ASR model that requires its mmproj
@@ -780,6 +799,44 @@ test-extra-backend-speaker-recognition-ecapa: docker-build-speaker-recognition
test-extra-backend-speaker-recognition-all: \
test-extra-backend-speaker-recognition-ecapa
## Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked
## LLM. Extracts the sherpa-onnx Docker image rootfs, downloads the three
## gallery-referenced model bundles (silero-vad, omnilingual-asr, vits-ljs),
## writes the corresponding model config YAMLs, and runs the realtime
## websocket spec in tests/e2e with REALTIME_* env vars wiring the sherpa
## slots into the pipeline. The LLM slot stays on the in-repo mock-backend
## registered unconditionally by tests/e2e/e2e_suite_test.go. See
## tests/e2e/run-realtime-sherpa.sh for the full orchestration.
test-extra-e2e-realtime-sherpa: build-mock-backend docker-build-sherpa-onnx protogen-go react-ui
bash tests/e2e/run-realtime-sherpa.sh
## Streaming ASR via the sherpa-onnx online recognizer. Uses the streaming
## zipformer English model (encoder/decoder/joiner int8 + tokens) from the
## sherpa-onnx gallery entry. Drives both AudioTranscription and
## AudioTranscriptionStream via the e2e-backends gRPC harness; streaming
## emits real partial deltas during decode. Each file is renamed on download
## to the shape sherpa-onnx's online loader expects (encoder.int8.onnx etc.).
test-extra-backend-sherpa-onnx-transcription: docker-build-sherpa-onnx
BACKEND_IMAGE=local-ai-backend:sherpa-onnx \
BACKEND_TEST_MODEL_URL='https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/encoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx#encoder.int8.onnx' \
BACKEND_TEST_EXTRA_FILES='https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/decoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx#decoder.int8.onnx|https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/joiner-epoch-99-avg-1-chunk-16-left-128.int8.onnx#joiner.int8.onnx|https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/tokens.txt' \
BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
BACKEND_TEST_CAPS=health,load,transcription \
BACKEND_TEST_OPTIONS=subtype=online \
$(MAKE) test-extra-backend
## VITS TTS via the sherpa-onnx backend. Pulls the individual files from
## HuggingFace (the vits-ljs release tarball lives on the k2-fsa github
## but is also mirrored as discrete files on HF). Exercises both
## TTS (write-to-file) and TTSStream (PCM chunks + WAV header) via the
## e2e-backends gRPC harness.
test-extra-backend-sherpa-onnx-tts: docker-build-sherpa-onnx
BACKEND_IMAGE=local-ai-backend:sherpa-onnx \
BACKEND_TEST_MODEL_URL='https://huggingface.co/csukuangfj/vits-ljs/resolve/main/vits-ljs.onnx#vits-ljs.onnx' \
BACKEND_TEST_EXTRA_FILES='https://huggingface.co/csukuangfj/vits-ljs/resolve/main/tokens.txt|https://huggingface.co/csukuangfj/vits-ljs/resolve/main/lexicon.txt' \
BACKEND_TEST_CAPS=health,load,tts \
$(MAKE) test-extra-backend
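The `BACKEND_TEST_MODEL_URL` and `BACKEND_TEST_EXTRA_FILES` values above appear to follow a `url#local-name` convention, with `|` separating entries. The harness code that consumes them is not shown in this diff, so the following is only a sketch of how such a spec could be split: `parse_extra_files` is a hypothetical helper, the fallback to the URL basename when no `#name` is given is an assumption, and the `example.invalid` URLs are placeholders.

```shell
#!/bin/bash
set -euo pipefail

# parse_extra_files (hypothetical): split a '|'-separated list of
# 'url#local-name' entries and print "url -> local-name" for each.
# Entries without '#name' fall back to the last path segment of the URL.
parse_extra_files() {
    local spec=$1 entry url name
    local IFS='|'
    local -a entries
    read -r -a entries <<< "$spec"
    for entry in "${entries[@]}"; do
        url=${entry%%#*}              # everything before the first '#'
        if [[ "$entry" == *"#"* ]]; then
            name=${entry##*#}         # everything after the last '#'
        else
            name=${url##*/}           # basename of the URL
        fi
        printf '%s -> %s\n' "$url" "$name"
    done
}

parse_extra_files 'https://example.invalid/m/decoder-epoch-99.int8.onnx#decoder.int8.onnx|https://example.invalid/m/tokens.txt'
```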
## sglang mirrors the vllm setup: HuggingFace model id, same tiny Qwen,
## tool-call extraction via sglang's native qwen parser. CPU builds use
## sglang's upstream pyproject_cpu.toml recipe (see backend/python/sglang/install.sh).
@@ -905,6 +962,11 @@ BACKEND_IK_LLAMA_CPP = ik-llama-cpp|ik-llama-cpp|.|false|false
# turboquant is a llama.cpp fork with TurboQuant KV-cache quantization.
# Reuses backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile.
BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
# buun-llama-cpp is a fork-of-a-fork (spiritbuun/buun-llama-cpp forks
# TheTom/llama-cpp-turboquant) that adds DFlash block-diffusion speculative
# decoding and extra TCQ KV-cache variants on top of TurboQuant. Same thin
# wrapper pattern as turboquant — reuses backend/cpp/llama-cpp grpc-server.
BACKEND_BUUN_LLAMA_CPP = buun-llama-cpp|buun-llama-cpp|.|false|false
# Golang backends
BACKEND_PIPER = piper|golang|.|false|true
@@ -917,6 +979,7 @@ BACKEND_VOXTRAL = voxtral|golang|.|false|true
BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
BACKEND_OPUS = opus|golang|.|false|true
BACKEND_SHERPA_ONNX = sherpa-onnx|golang|.|false|true
# Python backends with root context
BACKEND_RERANKERS = rerankers|python|.|false|true
@@ -984,6 +1047,7 @@ endef
$(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
$(eval $(call generate-docker-build-target,$(BACKEND_BUUN_LLAMA_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
$(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
$(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
@@ -1029,12 +1093,13 @@ $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_QUANTIZATION)))
$(eval $(call generate-docker-build-target,$(BACKEND_TINYGRAD)))
$(eval $(call generate-docker-build-target,$(BACKEND_KOKOROS)))
$(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
# Pattern rule for docker-save targets
docker-save-%: backend-images
docker save local-ai-backend:$* -o backend-images/$*.tar
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface docker-build-speaker-recognition
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-buun-llama-cpp docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
########################################################
### Mock Backend for E2E Tests


@@ -0,0 +1,290 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
# The grpc target does one thing: it builds and installs GRPC. This is in its own layer so that it can be effectively cached by CI.
# You probably don't need to change anything here, and if you do, make sure that CI is adjusted so that the cache continues to work.
FROM ${GRPC_BASE_IMAGE} AS grpc
# This is a bit of a hack, but it's required in order to be able to effectively cache this layer in CI
ARG GRPC_MAKEFLAGS="-j4 -Otarget"
ARG GRPC_VERSION=v1.65.0
ARG CMAKE_FROM_SOURCE=false
# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
ARG CMAKE_VERSION=3.31.10
ENV MAKEFLAGS=${GRPC_MAKEFLAGS}
WORKDIR /build
RUN apt-get update && \
apt-get install -y --no-install-recommends \
ca-certificates \
build-essential curl libssl-dev \
git wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Install CMake (the version in 22.04 is too old)
RUN <<EOT bash
if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
else
apt-get update && \
apt-get install -y \
cmake && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# We install GRPC to a different prefix here so that we can copy in only the build artifacts later.
# This saves several hundred MB on the final docker image size vs copying in the entire GRPC source tree
# and running make install in the target container
RUN git clone --recurse-submodules --jobs 4 -b ${GRPC_VERSION} --depth 1 --shallow-submodules https://github.com/grpc/grpc && \
mkdir -p /build/grpc/cmake/build && \
cd /build/grpc/cmake/build && \
sed -i "216i\ TESTONLY" "../../third_party/abseil-cpp/absl/container/CMakeLists.txt" && \
cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX:PATH=/opt/grpc ../.. && \
make && \
make install && \
rm -rf /build
FROM ${BASE_IMAGE} AS builder
ARG CMAKE_FROM_SOURCE=false
ARG CMAKE_VERSION=3.31.10
# We can target specific CUDA ARCHITECTURES like --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
ARG CUDA_DOCKER_ARCH
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
ARG CMAKE_ARGS
ENV CMAKE_ARGS=${CMAKE_ARGS}
ARG BACKEND=rerankers
ARG BUILD_TYPE
ENV BUILD_TYPE=${BUILD_TYPE}
ARG CUDA_MAJOR_VERSION
ARG CUDA_MINOR_VERSION
ARG SKIP_DRIVERS=false
ENV CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION}
ENV CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION}
ENV DEBIAN_FRONTEND=noninteractive
ARG TARGETARCH
ARG TARGETVARIANT
ARG GO_VERSION=1.25.4
ARG UBUNTU_VERSION=2404
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
ccache git \
ca-certificates \
make \
pkg-config libcurl4-openssl-dev \
curl unzip \
libssl-dev wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Cuda
ENV PATH=/usr/local/cuda/bin:${PATH}
# HipBLAS requirements
ENV PATH=/opt/rocm/bin:${PATH}
# Vulkan requirements
RUN <<EOT bash
if [ "${BUILD_TYPE}" = "vulkan" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common pciutils wget gpg-agent && \
apt-get install -y libglm-dev cmake libxcb-dri3-0 libxcb-present0 libpciaccess0 \
libpng-dev libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev g++ gcc \
libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
if [ "amd64" = "$TARGETARCH" ]; then
wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
rm vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
mkdir -p /opt/vulkan-sdk && \
mv 1.4.335.0 /opt/vulkan-sdk/ && \
cd /opt/vulkan-sdk/1.4.335.0 && \
./vulkansdk --no-deps --maxjobs \
vulkan-loader \
vulkan-validationlayers \
vulkan-extensionlayer \
vulkan-tools \
shaderc && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/bin/* /usr/bin/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/lib/* /usr/lib/x86_64-linux-gnu/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/include/* /usr/include/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/share/* /usr/share/ && \
rm -rf /opt/vulkan-sdk
fi
if [ "arm64" = "$TARGETARCH" ]; then
mkdir vulkan && cd vulkan && \
curl -L -o vulkan-sdk.tar.xz https://github.com/mudler/vulkan-sdk-arm/releases/download/1.4.335.0/vulkansdk-ubuntu-24.04-arm-1.4.335.0.tar.xz && \
tar -xvf vulkan-sdk.tar.xz && \
rm vulkan-sdk.tar.xz && \
cd 1.4.335.0 && \
cp -rfv aarch64/bin/* /usr/bin/ && \
cp -rfv aarch64/lib/* /usr/lib/aarch64-linux-gnu/ && \
cp -rfv aarch64/include/* /usr/include/ && \
cp -rfv aarch64/share/* /usr/share/ && \
cd ../.. && \
rm -rf vulkan
fi
ldconfig && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# CuBLAS requirements
RUN <<EOT bash
if ( [ "${BUILD_TYPE}" = "cublas" ] || [ "${BUILD_TYPE}" = "l4t" ] ) && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common pciutils
if [ "amd64" = "$TARGETARCH" ]; then
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/x86_64/cuda-keyring_1.1-1_all.deb
fi
if [ "arm64" = "$TARGETARCH" ]; then
if [ "${CUDA_MAJOR_VERSION}" = "13" ]; then
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/sbsa/cuda-keyring_1.1-1_all.deb
else
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/arm64/cuda-keyring_1.1-1_all.deb
fi
fi
dpkg -i cuda-keyring_1.1-1_all.deb && \
rm -f cuda-keyring_1.1-1_all.deb && \
apt-get update && \
apt-get install -y --no-install-recommends \
cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "$TARGETARCH" ]; then
apt-get install -y --no-install-recommends \
libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libcudnn9-cuda-${CUDA_MAJOR_VERSION} cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
fi
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
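The keyring download in the CuBLAS block picks one of three NVIDIA repository directories depending on `TARGETARCH` and `CUDA_MAJOR_VERSION`. That mapping can be sketched as a small function; `cuda_repo_arch` is a hypothetical name, and the explicit error for other architectures is an addition for illustration (the Dockerfile itself has no such branch).

```shell
#!/bin/bash
set -euo pipefail

# cuda_repo_arch (hypothetical): map TARGETARCH + CUDA major version to the
# repo directory used in the keyring URL, mirroring the if-blocks above.
cuda_repo_arch() {
    local targetarch=$1 cuda_major=$2
    case "$targetarch" in
        amd64)
            echo "x86_64"
            ;;
        arm64)
            # CUDA 13 on arm64 uses the sbsa repo; older majors use arm64.
            if [ "$cuda_major" = "13" ]; then
                echo "sbsa"
            else
                echo "arm64"
            fi
            ;;
        *)
            echo "unsupported TARGETARCH: $targetarch" >&2
            return 1
            ;;
    esac
}

UBUNTU_VERSION=2404
echo "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/$(cuda_repo_arch arm64 13)/cuda-keyring_1.1-1_all.deb"
```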
# https://github.com/NVIDIA/Isaac-GR00T/issues/343
RUN <<EOT bash
if [ "${BUILD_TYPE}" = "cublas" ] && [ "${TARGETARCH}" = "arm64" ]; then
wget https://developer.download.nvidia.com/compute/cudss/0.6.0/local_installers/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
dpkg -i cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
cp /var/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0/cudss-*-keyring.gpg /usr/share/keyrings/ && \
apt-get update && apt-get -y install cudss cudss-cuda-${CUDA_MAJOR_VERSION} && \
wget https://developer.download.nvidia.com/compute/nvpl/25.5/local_installers/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
dpkg -i nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
cp /var/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5/nvpl-*-keyring.gpg /usr/share/keyrings/ && \
apt-get update && apt-get install -y nvpl
fi
EOT
# If we are building with clblas support, we need the libraries for the builds
RUN if [ "${BUILD_TYPE}" = "clblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
apt-get update && \
apt-get install -y --no-install-recommends \
libclblast-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* \
; fi
RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
apt-get update && \
apt-get install -y --no-install-recommends \
hipblas-dev \
rocblas-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
# I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install, which results in local-ai and others not being able
# to locate the libraries. We run ldconfig ourselves to work around this packaging deficiency
ldconfig && \
# Log which GPU architectures have rocBLAS kernel support
echo "rocBLAS library data architectures:" && \
(ls /opt/rocm*/lib/rocblas/library/Kernels* 2>/dev/null || ls /opt/rocm*/lib64/rocblas/library/Kernels* 2>/dev/null) | grep -oP 'gfx[0-9a-z+-]+' | sort -u || \
echo "WARNING: No rocBLAS kernel data found" \
; fi
RUN echo "TARGETARCH: $TARGETARCH"
# We need protoc installed, and the version in 22.04 is too old. We will create one as part of installing the GRPC build below,
# but that will also bring in a newer version of absl which stablediffusion cannot compile with. This version of protoc is only
# here so that we can generate the grpc code for the stablediffusion build
RUN <<EOT bash
if [ "amd64" = "$TARGETARCH" ]; then
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-x86_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
fi
if [ "arm64" = "$TARGETARCH" ]; then
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-aarch_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
fi
EOT
# Install CMake (the version in 22.04 is too old)
RUN <<EOT bash
if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
else
apt-get update && \
apt-get install -y \
cmake && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
COPY --from=grpc /opt/grpc /usr/local
COPY . /LocalAI
RUN <<'EOT' bash
set -euxo pipefail
if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
rm -rf /LocalAI/backend/cpp/buun-llama-cpp-*-build
fi
cd /LocalAI/backend/cpp/buun-llama-cpp
if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
make buun-llama-cpp-fallback
make buun-llama-cpp-grpc
make buun-llama-cpp-rpc-server
else
make buun-llama-cpp-avx
make buun-llama-cpp-avx2
make buun-llama-cpp-avx512
make buun-llama-cpp-fallback
make buun-llama-cpp-grpc
make buun-llama-cpp-rpc-server
fi
EOT
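The `CUDA_ARCH_ESC` line relies on bash pattern substitution to put a backslash in front of every semicolon, presumably so the architecture list survives later shell and make invocations when it travels inside `CMAKE_ARGS`. A minimal sketch of just that transformation, using the example value from the `CUDA_DOCKER_ARCH` comment earlier in this Dockerfile:

```shell
#!/bin/bash
set -euo pipefail

CUDA_DOCKER_ARCH='75;86;89;120'

# ${var//;/\\;} replaces every ';' with '\;'. The backslash is doubled
# because the expansion sits inside double quotes.
CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"

echo "$CMAKE_ARGS"
```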
# Copy libraries using a script to handle architecture differences
RUN make -BC /LocalAI/backend/cpp/buun-llama-cpp package
FROM scratch
# Copy all available binaries (the build process only creates the appropriate ones for the target architecture)
COPY --from=builder /LocalAI/backend/cpp/buun-llama-cpp/package/. ./


@@ -0,0 +1,85 @@
# Pinned to the HEAD of master on https://github.com/spiritbuun/buun-llama-cpp.
# Auto-bumped nightly by .github/workflows/bump_deps.yaml.
BUUN_LLAMA_VERSION?=22464d0848b87c5d56b52fdf6af2e5da46bf803e
LLAMA_REPO?=https://github.com/spiritbuun/buun-llama-cpp
CMAKE_ARGS?=
BUILD_TYPE?=
NATIVE?=false
ONEAPI_VARS?=/opt/intel/oneapi/setvars.sh
TARGET?=--target grpc-server
JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)
ARCH?=$(shell uname -m)
CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
LLAMA_CPP_DIR := $(CURRENT_MAKEFILE_DIR)/../llama-cpp
GREEN := \033[0;32m
RESET := \033[0m
# buun-llama-cpp is a llama.cpp fork-of-a-fork (spiritbuun/buun-llama-cpp forked
# TheTom/llama-cpp-turboquant, which itself forked ggml-org/llama.cpp). Rather
# than duplicating grpc-server.cpp / CMakeLists.txt / prepare.sh we reuse the
# ones in backend/cpp/llama-cpp, and only swap which repo+sha the fetch step
# pulls. Each flavor target copies ../llama-cpp into a sibling
# ../buun-llama-cpp-<flavor>-build directory, then invokes llama-cpp's own
# build-llama-cpp-grpc-server with LLAMA_REPO/LLAMA_VERSION overridden to point
# at the fork.
PATCHES_DIR := $(CURRENT_MAKEFILE_DIR)/patches
# Each flavor target:
# 1. copies backend/cpp/llama-cpp/ (grpc-server.cpp + prepare.sh + CMakeLists.txt + Makefile)
# into a sibling buun-llama-cpp-<flavor>-build directory;
# 2. clones the buun fork into buun-llama-cpp-<flavor>-build/llama.cpp via the
# copy's own `llama.cpp` target, overriding LLAMA_REPO/LLAMA_VERSION;
# 3. applies patches from backend/cpp/buun-llama-cpp/patches/ to the cloned
# fork sources (for backporting upstream commits the fork hasn't pulled);
# 4. runs the copy's `grpc-server` target, which produces the binary we copy
# up as buun-llama-cpp-<flavor>.
define buun-llama-cpp-build
rm -rf $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build
cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build purge
# Augment the copied grpc-server.cpp's KV-cache allow-list with the
# fork's turbo2/turbo3/turbo4/turbo2_tcq/turbo3_tcq types and wire up the
# DFlash-specific option handlers (tree_budget / draft_topk). We patch the
# *copy*, never the original under backend/cpp/llama-cpp/, so the stock
# llama-cpp build stays compiling against vanilla upstream.
bash $(CURRENT_MAKEFILE_DIR)/patch-grpc-server.sh $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build/grpc-server.cpp
$(info $(GREEN)I buun-llama-cpp build info:$(1)$(RESET))
LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(BUUN_LLAMA_VERSION) \
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build llama.cpp
bash $(CURRENT_MAKEFILE_DIR)/apply-patches.sh $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build/llama.cpp $(PATCHES_DIR)
CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" \
LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(BUUN_LLAMA_VERSION) \
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build grpc-server
cp -rfv $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build/grpc-server buun-llama-cpp-$(1)
endef
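The key property of the macro above is that only the per-flavor copy is modified, never `backend/cpp/llama-cpp` itself. The sketch below demonstrates that copy-then-patch pattern on a throwaway directory. It is an illustration under stated assumptions: the file contents are invented, and the `sed` substitution is a stand-in for whatever `patch-grpc-server.sh` actually does, which this diff does not show.

```shell
#!/bin/bash
set -euo pipefail

# Throwaway tree standing in for backend/cpp/.
root=$(mktemp -d)
mkdir -p "$root/llama-cpp"
echo 'allow-list: f16 q8_0' > "$root/llama-cpp/grpc-server.cpp"

flavor=fallback
build_dir="$root/buun-llama-cpp-$flavor-build"

# Steps mirrored from the macro: wipe the old build dir, copy the shared sources.
rm -rf "$build_dir"
cp -rf "$root/llama-cpp" "$build_dir"

# Stand-in for patch-grpc-server.sh: edit the copy only.
sed 's/q8_0/q8_0 turbo3/' "$build_dir/grpc-server.cpp" > "$build_dir/grpc-server.cpp.new"
mv "$build_dir/grpc-server.cpp.new" "$build_dir/grpc-server.cpp"

original=$(cat "$root/llama-cpp/grpc-server.cpp")
patched=$(cat "$build_dir/grpc-server.cpp")
echo "original: $original"
echo "patched:  $patched"
rm -rf "$root"
```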
buun-llama-cpp-avx2:
$(call buun-llama-cpp-build,avx2,-DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server)
buun-llama-cpp-avx512:
$(call buun-llama-cpp-build,avx512,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server)
buun-llama-cpp-avx:
$(call buun-llama-cpp-build,avx,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
buun-llama-cpp-fallback:
$(call buun-llama-cpp-build,fallback,-DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
buun-llama-cpp-grpc:
$(call buun-llama-cpp-build,grpc,-DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server --target rpc-server)
buun-llama-cpp-rpc-server: buun-llama-cpp-grpc
cp -rf $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-grpc-build/llama.cpp/build/bin/rpc-server buun-llama-cpp-rpc-server
package:
bash package.sh
purge:
rm -rf $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-*-build
rm -rf buun-llama-cpp-* package
clean: purge


@@ -0,0 +1,50 @@
#!/bin/bash
# Apply the buun-llama-cpp patch series to a cloned buun-llama-cpp checkout.
#
# buun-llama-cpp is a fork-of-a-fork that branched off upstream llama.cpp
# before some API changes the shared backend/cpp/llama-cpp/grpc-server.cpp
# depends on. We carry those upstream commits as patch files under
# backend/cpp/buun-llama-cpp/patches/ and apply them here so the reused
# grpc-server source compiles against the fork unmodified.
#
# Drop the corresponding patch from patches/ whenever the fork catches up with
# upstream — the build will fail fast if a patch stops applying, which is the
# signal to retire it.
set -euo pipefail
if [[ $# -ne 2 ]]; then
echo "usage: $0 <llama.cpp-src-dir> <patches-dir>" >&2
exit 2
fi
SRC_DIR=$1
PATCHES_DIR=$2
if [[ ! -d "$SRC_DIR" ]]; then
echo "source dir does not exist: $SRC_DIR" >&2
exit 2
fi
if [[ ! -d "$PATCHES_DIR" ]]; then
echo "no patches dir at $PATCHES_DIR, nothing to apply"
exit 0
fi
shopt -s nullglob
patches=("$PATCHES_DIR"/*.patch)
shopt -u nullglob
if [[ ${#patches[@]} -eq 0 ]]; then
echo "no .patch files in $PATCHES_DIR, nothing to apply"
exit 0
fi
cd "$SRC_DIR"
for patch in "${patches[@]}"; do
echo "==> applying $patch"
git apply --verbose "$patch"
done
echo "all buun-llama-cpp patches applied successfully"
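The `nullglob` toggle is what lets the script tell "no patches" apart from a literal `*.patch` string. The sketch below isolates that collection step; `list_patches` is a hypothetical helper and the file names are invented. Since bash sorts pathname expansion results, a numeric prefix on each patch file is one way to control the order in which a loop like the one above would apply them.

```shell
#!/bin/bash
set -euo pipefail

# list_patches (hypothetical): print the basenames of *.patch files in a
# directory, in glob (sorted) order; print nothing if there are none.
list_patches() {
    local dir=$1
    shopt -s nullglob          # an unmatched glob expands to nothing
    local patches=("$dir"/*.patch)
    shopt -u nullglob
    if [[ ${#patches[@]} -eq 0 ]]; then
        return 0
    fi
    local p
    for p in "${patches[@]}"; do
        echo "${p##*/}"
    done
}

demo=$(mktemp -d)
touch "$demo/0002-second.patch" "$demo/0001-first.patch" "$demo/README.txt"
list_patches "$demo"
```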


@@ -0,0 +1,57 @@
#!/bin/bash
# Script to copy the appropriate libraries based on architecture
# This script runs in the builder stage of the Dockerfile (via `make package`); the final stage copies its output
set -e
CURDIR=$(dirname "$(realpath $0)")
REPO_ROOT="${CURDIR}/../../.."
# Create lib directory
mkdir -p $CURDIR/package/lib
cp -avrf $CURDIR/buun-llama-cpp-* $CURDIR/package/
cp -rfv $CURDIR/run.sh $CURDIR/package/
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
# ARM64 architecture
echo "Detected ARM64 architecture, copying ARM64 libraries..."
cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
else
echo "Error: Could not detect architecture"
exit 1
fi
# Package GPU libraries based on BUILD_TYPE
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
package_gpu_libs
fi
echo "Packaging completed successfully"
ls -liah $CURDIR/package/
ls -liah $CURDIR/package/lib/


@@ -0,0 +1,162 @@
#!/bin/bash
# Patch the shared backend/cpp/llama-cpp/grpc-server.cpp *copy* used by the
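# buun-llama-cpp build to account for gaps between upstream and the fork.
# Three are described here; a fourth, the logit_bias_eog argument drop, is
# documented inline next to its sed command below: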
#
# 1. Augment the kv_cache_types[] allow-list so `LoadModel` accepts the
# fork-specific `turbo2` / `turbo3` / `turbo4` cache types plus the buun
# additions `turbo2_tcq` / `turbo3_tcq`.
#
# 2. Wire up buun-exclusive speculative-decoding option handlers
# (tree_budget / draft_topk) alongside the existing spec_* handlers.
# These reference struct fields (common_params.speculative.tree_budget
# and .draft_topk) that only exist in buun's common/common.h — adding
# them to the shared backend/cpp/llama-cpp/grpc-server.cpp would break
# the stock llama-cpp build, so we inject them only into the buun copy.
#
# 3. Replace `get_media_marker()` (added upstream in ggml-org/llama.cpp#21962,
# server-side random per-instance marker) with the legacy "<__media__>"
# literal. The fork branched before that PR, so server-common.cpp has no
# get_media_marker symbol. The fork's mtmd_default_marker() still returns
# "<__media__>", and Go-side tooling falls back to that sentinel when the
# backend does not expose media_marker, so substituting the literal keeps
# behavior identical on the buun path.
#
# We patch the *copy* sitting in buun-llama-cpp-<flavor>-build/, never the
# original under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps
# compiling against vanilla upstream.
#
# Idempotent: skips each insertion if its marker is already present (so re-runs
# of the same build dir don't double-insert).
set -euo pipefail
if [[ $# -ne 1 ]]; then
echo "usage: $0 <grpc-server.cpp>" >&2
exit 2
fi
SRC=$1
if [[ ! -f "$SRC" ]]; then
echo "grpc-server.cpp not found at $SRC" >&2
exit 2
fi
if grep -q 'GGML_TYPE_TURBO2_TCQ' "$SRC"; then
echo "==> $SRC already has buun cache types, skipping KV allow-list patch"
else
echo "==> patching $SRC to allow turbo2/turbo3/turbo4/turbo2_tcq/turbo3_tcq KV-cache types"
# Insert the five TURBO entries right after the first ` GGML_TYPE_Q5_1,`
# line (the kv_cache_types[] allow-list). Using awk because the builder
# image does not ship python3, and GNU sed's multi-line `a\` quoting is
# awkward.
awk '
/^ GGML_TYPE_Q5_1,$/ && !done {
print
print " // buun-llama-cpp fork extras — added by patch-grpc-server.sh"
print " GGML_TYPE_TURBO2_0,"
print " GGML_TYPE_TURBO3_0,"
print " GGML_TYPE_TURBO4_0,"
print " GGML_TYPE_TURBO2_TCQ,"
print " GGML_TYPE_TURBO3_TCQ,"
done = 1
next
}
{ print }
END {
if (!done) {
print "patch-grpc-server.sh: anchor ` GGML_TYPE_Q5_1,` not found" > "/dev/stderr"
exit 1
}
}
' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> KV allow-list patch OK"
fi
if grep -q 'optname, "tree_budget"' "$SRC"; then
echo "==> $SRC already has DFlash option handlers, skipping"
else
echo "==> patching $SRC to add tree_budget / draft_topk option handlers"
# Insert two new `else if` handlers between the inner close-brace of the
# `spec_p_split` block and the next `} else if (…spec_ngram_size_n…)` line.
# Upstream writes each `} else if` as a single physical line, so we don't
# emit an outer `}` ourselves — the existing next line provides both the
# close of our `draft_topk` block and the open of `spec_ngram_size_n`.
# Anchor on the exact 3-line body of spec_p_split so we can't drift.
awk '
prev2 == " } else if (!strcmp(optname, \"spec_p_split\")) {" &&
prev1 ~ /^ +if \(optval != NULL\) \{$/ &&
$0 ~ /^ +try \{ params\.speculative\.p_split = std::stof\(optval_str\); \} catch \(\.\.\.\) \{\}$/ &&
!done {
print # print the try-line itself
getline inner_close # read " }" closing the inner if
print inner_close # print it — this closes spec_p_split body
print " // buun-llama-cpp DFlash options — added by patch-grpc-server.sh"
print " } else if (!strcmp(optname, \"tree_budget\")) {"
print " if (optval != NULL) {"
print " try { params.speculative.tree_budget = std::stoi(optval_str); } catch (...) {}"
print " }"
print " } else if (!strcmp(optname, \"draft_topk\")) {"
print " if (optval != NULL) {"
print " try { params.speculative.draft_topk = std::stoi(optval_str); } catch (...) {}"
print " }"
# The next source line (`} else if (…spec_ngram_size_n…) {`) closes
# our draft_topk block and continues the chain naturally; fall back
# into the main loop to emit it and everything after.
done = 1
prev2 = prev1
prev1 = inner_close
next
}
{ print; prev2 = prev1; prev1 = $0 }
END {
if (!done) {
print "patch-grpc-server.sh: spec_p_split anchor not found" > "/dev/stderr"
exit 1
}
}
' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> DFlash option-handler patch OK"
fi
if grep -qE 'ctx_server\.get_meta\(\)\.logit_bias_eog|params_base\.sampling\.logit_bias_eog,' "$SRC"; then
echo "==> patching $SRC to drop the logit_bias_eog arg from params_from_json_cmpl() callsites (buun still uses the pre-refactor 4-arg signature)"
# Upstream llama.cpp refactored params_from_json_cmpl to take a precomputed
# logit_bias_eog vector after buun's 2026-04-05 fork-point — simultaneously
# adding server_context_meta::logit_bias_eog as the supplier. Buun carries
# neither change: its params_from_json_cmpl is still 4-arg, and internally
# derives logit_bias_eog from the common_params it's passed. So we just
# delete the argument line entirely — the remaining 4 args match buun's
# signature and the resulting behavior matches upstream bit-for-bit
# (upstream's 5th arg is the same data buun derives internally).
#
# Guard is broad so this works whether the line has been run through this
# block before (leaving params_base.sampling.logit_bias_eog,) or not
# (leaving the original ctx_server.get_meta().logit_bias_eog,).
sed -E '/^[[:space:]]+(ctx_server\.get_meta\(\)\.logit_bias_eog|params_base\.sampling\.logit_bias_eog),$/d' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> logit_bias_eog arg drop OK"
else
echo "==> $SRC has no logit_bias_eog arg line, skipping"
fi
if grep -q 'get_media_marker()' "$SRC"; then
echo "==> patching $SRC to replace get_media_marker() with legacy \"<__media__>\" literal"
# Only one call site today (ModelMetadata), but replace all occurrences to
# stay robust if upstream adds more. Write through a temp file rather than
# relying on sed -i: the builder image uses GNU sed, but this keeps the
# step consistent with the awk blocks above.
sed 's/get_media_marker()/"<__media__>"/g' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> get_media_marker() substitution OK"
else
echo "==> $SRC has no get_media_marker() call, skipping media-marker patch"
fi
echo "==> all patches applied"
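
The anchor-and-marker approach used by the script above can be sketched outside awk. The following Go program is only an illustration under stated assumptions: `insertAfterAnchor` is a hypothetical helper (it does not exist in the repository), and the indentation of the sample source is made up. It shows the three behaviours the script relies on: insert once after the first anchor line, skip when the marker is already present, and fail loudly when the anchor is missing.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// insertAfterAnchor inserts extra after the first line equal to anchor.
// If marker already occurs in src the input is returned unchanged, which
// makes repeated runs safe. A missing anchor is an error, not a silent no-op.
// Hypothetical helper for illustration only.
func insertAfterAnchor(src, anchor, marker string, extra []string) (string, error) {
	if strings.Contains(src, marker) {
		return src, nil // already patched
	}
	lines := strings.Split(src, "\n")
	out := make([]string, 0, len(lines)+len(extra))
	done := false
	for _, line := range lines {
		out = append(out, line)
		if !done && line == anchor {
			out = append(out, extra...)
			done = true
		}
	}
	if !done {
		return "", errors.New("anchor not found: " + anchor)
	}
	return strings.Join(out, "\n"), nil
}

func main() {
	// Sample input; the real allow-list lives in grpc-server.cpp.
	src := "    GGML_TYPE_Q5_0,\n    GGML_TYPE_Q5_1,\n};"
	extra := []string{"    GGML_TYPE_TURBO2_0,", "    GGML_TYPE_TURBO2_TCQ,"}
	patched, err := insertAfterAnchor(src, "    GGML_TYPE_Q5_1,", "GGML_TYPE_TURBO2_TCQ", extra)
	if err != nil {
		panic(err)
	}
	fmt.Println(patched)
}
```

Checking for the marker before touching the file is what lets a rebuild reuse an already-patched build directory, and treating a missing anchor as an error surfaces upstream drift instead of hiding it.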


@@ -0,0 +1,46 @@
Subject: [PATCH] ggml-cuda/fattn: provide atomicAdd(double*,double) shim for pre-sm_60
Buun's Q² calibration path in ggml_cuda_turbo_scale_q calls
atomicAdd(&d_q_channel_sq_fattn[threadIdx.x], (double)(val * val));
but native double atomicAdd is only available on compute capability 6.0
and newer. Compiling against a CUDA arch list that includes older
architectures (LocalAI's CUDA 12 Docker image builds for the full
published arch range) fails with:
fattn.cu(812): error: no instance of overloaded function "atomicAdd"
matches the argument list, argument types are: (double *, double)
Add the canonical CUDA-programming-guide shim at the top of fattn.cu so
pre-sm_60 codegen has a definition to call. On sm_60+ the native CUDA
intrinsic is used and the shim is elided via __CUDA_ARCH__.
--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
@@ -7,6 +7,27 @@
#include <atomic>
+// Pre-sm_60 double atomicAdd shim. Native double atomicAdd(double*,double)
+// is only available on CUDA compute capability 6.0+ (see CUDA C Programming
+// Guide, B.15 Atomic Functions). Buun's Q² calibration path below calls
+// atomicAdd with a double*; without this definition, nvcc fails to find a
+// matching overload whenever the compile target list includes pre-sm_60
+// architectures. The standard CAS loop implementation below matches the
+// semantics of the native intrinsic.
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
+static __device__ double atomicAdd(double * address, double val) {
+ unsigned long long int * address_as_ull = (unsigned long long int *)address;
+ unsigned long long int old = *address_as_ull;
+ unsigned long long int assumed;
+ do {
+ assumed = old;
+ old = atomicCAS(address_as_ull, assumed,
+ __double_as_longlong(val + __longlong_as_double(assumed)));
+ } while (assumed != old);
+ return __longlong_as_double(old);
+}
+#endif
+
// InnerQ: update the fattn-side inverse scale array from host (all devices)
void turbo_innerq_update_fattn_scales(const float * scale_inv) {
int cur_device;
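
The compare-and-swap loop in the shim above can be tried on the host. This is a minimal sketch in Go, not the CUDA code: it assumes `atomic.CompareAndSwapUint64` as a stand-in for `atomicCAS` and `math.Float64bits` / `math.Float64frombits` as stand-ins for `__double_as_longlong` / `__longlong_as_double`. The function name `atomicAddFloat64` is hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"sync/atomic"
)

// atomicAddFloat64 adds val to the float64 stored (as raw bits) in addr and
// returns the previous value, mirroring the contract of CUDA's atomicAdd.
// It retries until no other writer changed the word between load and swap.
func atomicAddFloat64(addr *uint64, val float64) float64 {
	for {
		assumed := atomic.LoadUint64(addr)
		next := math.Float64bits(math.Float64frombits(assumed) + val)
		if atomic.CompareAndSwapUint64(addr, assumed, next) {
			return math.Float64frombits(assumed)
		}
	}
}

func main() {
	var acc uint64 // bit pattern of 0.0
	var wg sync.WaitGroup
	for i := 0; i < 64; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomicAddFloat64(&acc, 0.5)
		}()
	}
	wg.Wait()
	fmt.Println(math.Float64frombits(acc)) // prints 32
}
```

The loop compares bit patterns rather than floating-point values, which is also why the CUDA version compares the `unsigned long long` representations.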


@@ -0,0 +1,32 @@
Subject: [PATCH] ggml-cuda/argmax: pass WARP_SIZE to the top-K __shfl_xor_sync calls
Two __shfl_xor_sync calls in the top-K intra-warp merge drop the `width`
argument and rely on the CUDA default (warpSize). Every other call in
the same file already passes WARP_SIZE explicitly, and the HIP/ROCm
compatibility shim at ggml/src/ggml-cuda/vendors/hip.h:33 is a 4-arg
function-like macro — so the 3-arg form fails to preprocess when
building with hipcc against ROCm:
argmax.cu:265: error: too few arguments provided to function-like
macro invocation
note: macro '__shfl_xor_sync' defined here:
#define __shfl_xor_sync(mask, var, laneMask, width) \
__shfl_xor(var, laneMask, width)
Align the two call sites with the rest of the file by passing WARP_SIZE
explicitly. On CUDA the generated code is unchanged (warpSize is the
default); on HIP it now matches the macro's arity.
--- a/ggml/src/ggml-cuda/argmax.cu
+++ b/ggml/src/ggml-cuda/argmax.cu
@@ -262,8 +262,8 @@
// Each step: lane gets partner's min element, if it beats our min, replace and re-heapify
for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
for (int i = 0; i < K; i++) {
- float partner_val = __shfl_xor_sync(0xFFFFFFFF, heap_val[i], offset);
- int partner_idx = __shfl_xor_sync(0xFFFFFFFF, heap_idx[i], offset);
+ float partner_val = __shfl_xor_sync(0xFFFFFFFF, heap_val[i], offset, WARP_SIZE);
+ int partner_idx = __shfl_xor_sync(0xFFFFFFFF, heap_idx[i], offset, WARP_SIZE);
if (partner_val > heap_val[0]) {
heap_val[0] = partner_val;
heap_idx[0] = partner_idx;


@@ -0,0 +1,24 @@
Subject: [PATCH] ggml-cuda/vendors/hip: alias cudaMemcpy{To,From}Symbol to hip counterparts
Buun's Q² calibration + TCQ codebook upload paths in fattn.cu use
cudaMemcpyToSymbol / cudaMemcpyFromSymbol. The HIP-compat header in
ggml/src/ggml-cuda/vendors/hip.h already aliases the scalar cudaMemcpy
family (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2DAsync, …) but is
missing the symbol variants. Building with hipcc therefore fails with
15+ "use of undeclared identifier 'cudaMemcpyToSymbol'" errors.
Add the two missing aliases alongside the existing memcpy block. HIP
provides hipMemcpy{To,From}Symbol with the same signature as CUDA's
equivalents, so this is a straight name substitution.
--- a/ggml/src/ggml-cuda/vendors/hip.h
+++ b/ggml/src/ggml-cuda/vendors/hip.h
@@ -85,6 +85,8 @@
#define cudaMemcpyDeviceToDevice hipMemcpyDeviceToDevice
#define cudaMemcpyDeviceToHost hipMemcpyDeviceToHost
#define cudaMemcpyHostToDevice hipMemcpyHostToDevice
+#define cudaMemcpyToSymbol hipMemcpyToSymbol
+#define cudaMemcpyFromSymbol hipMemcpyFromSymbol
#define cudaMemcpyKind hipMemcpyKind
#define cudaMemset hipMemset
#define cudaMemsetAsync hipMemsetAsync


@@ -0,0 +1,36 @@
Subject: [PATCH] ggml-cuda/fattn: pass WARP_SIZE to fwht128 __shfl_xor_sync calls
Same issue as the argmax top-K fix: two __shfl_xor_sync call sites in
the FWHT-128 butterfly kernels (ggml_cuda_fwht128 and fwht128_store_half)
use the 3-arg CUDA form and omit the `width` argument that the HIP
function-like macro in vendors/hip.h:33 requires. Hipcc fails with:
fattn.cu:512: too few arguments provided to function-like macro
invocation
note: macro '__shfl_xor_sync' defined here:
#define __shfl_xor_sync(mask, var, laneMask, width) \
__shfl_xor(var, laneMask, width)
Add WARP_SIZE to both calls. CUDA codegen is unchanged (warpSize is the
default); HIP now matches the macro arity.
--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
@@ -509,7 +509,7 @@
// Intra-warp passes: shuffle xor with stride h, no smem, no sync.
#pragma unroll
for (int h = 1; h <= 16; h *= 2) {
- const float other = __shfl_xor_sync(0xFFFFFFFF, val, h);
+ const float other = __shfl_xor_sync(0xFFFFFFFF, val, h, WARP_SIZE);
val = (tid & h) ? (other - val) : (val + other);
}
@@ -533,7 +533,7 @@
static __device__ __forceinline__ void fwht128_store_half(
float val, half * dst_base) {
const int tid = threadIdx.x;
- const float neighbor = __shfl_xor_sync(0xFFFFFFFF, val, 1);
+ const float neighbor = __shfl_xor_sync(0xFFFFFFFF, val, 1, WARP_SIZE);
if ((tid & 1) == 0) {
const half2 packed = __floats2half2_rn(val, neighbor);
*((half2 *)(dst_base + tid)) = packed;
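
The intra-warp butterfly touched by the first hunk above can be modelled on the host to see what the shuffle computes. This Go sketch is an illustration, assuming a 32-lane warp; `warpFWHT32` is a hypothetical name, and indexing a snapshot at `tid ^ h` stands in for `__shfl_xor_sync(mask, val, h, WARP_SIZE)`.

```go
package main

import "fmt"

const warpSize = 32 // assumption: a 32-lane warp

// warpFWHT32 applies the five intra-warp butterfly passes (h = 1, 2, 4, 8, 16).
// Reading prev[tid^h] plays the role of the xor shuffle: every lane sees its
// partner's value from before the current pass.
func warpFWHT32(vals [warpSize]float32) [warpSize]float32 {
	for h := 1; h <= 16; h *= 2 {
		prev := vals // arrays copy by value, so this is a snapshot
		for tid := 0; tid < warpSize; tid++ {
			other := prev[tid^h]
			if tid&h != 0 {
				vals[tid] = other - prev[tid]
			} else {
				vals[tid] = prev[tid] + other
			}
		}
	}
	return vals
}

func main() {
	var impulse [warpSize]float32
	impulse[0] = 1
	// An impulse in lane 0 spreads to every lane with value 1.
	fmt.Println(warpFWHT32(impulse))
}
```

The explicit width argument does not change this arithmetic; it only has to be present so that the four-parameter HIP macro can expand.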


@@ -0,0 +1,65 @@
#!/bin/bash
set -ex
# Get the absolute current dir where the script is located
CURDIR=$(dirname "$(realpath $0)")
cd /
echo "CPU info:"
grep -e "model\sname" /proc/cpuinfo | head -1
grep -e "flags" /proc/cpuinfo | head -1
BINARY=buun-llama-cpp-fallback
if grep -q -e "\savx\s" /proc/cpuinfo ; then
echo "CPU: AVX found OK"
if [ -e $CURDIR/buun-llama-cpp-avx ]; then
BINARY=buun-llama-cpp-avx
fi
fi
if grep -q -e "\savx2\s" /proc/cpuinfo ; then
echo "CPU: AVX2 found OK"
if [ -e $CURDIR/buun-llama-cpp-avx2 ]; then
BINARY=buun-llama-cpp-avx2
fi
fi
# Check avx 512
if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
echo "CPU: AVX512F found OK"
if [ -e $CURDIR/buun-llama-cpp-avx512 ]; then
BINARY=buun-llama-cpp-avx512
fi
fi
if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then
if [ -e $CURDIR/buun-llama-cpp-grpc ]; then
BINARY=buun-llama-cpp-grpc
fi
fi
# Extend ld library path with the dir where this script is located/lib
if [ "$(uname)" == "Darwin" ]; then
export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
else
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
# Tell rocBLAS where to find TensileLibrary data (GPU kernel tuning files)
if [ -d "$CURDIR/lib/rocblas/library" ]; then
export ROCBLAS_TENSILE_LIBPATH=$CURDIR/lib/rocblas/library
fi
fi
# If there is a lib/ld.so, use it
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"
echo "Using binary: $BINARY"
exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
fi
echo "Using binary: $BINARY"
exec $CURDIR/$BINARY "$@"
# We should never reach this point, however just in case we do, run fallback
exec $CURDIR/buun-llama-cpp-fallback "$@"
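
The selection order in the launcher above is easy to misread because later checks overwrite earlier ones. The following Go sketch restates it as a pure function so the precedence can be tested. It is an illustration only: `pickBinary` and its `exists` callback are hypothetical, and the flag string stands in for the `flags` line of /proc/cpuinfo.

```go
package main

import (
	"fmt"
	"strings"
)

// pickBinary mirrors the selection order of the launcher script: start from
// the fallback build, upgrade through avx, avx2 and avx512 when the CPU
// advertises the flag and the matching binary is present, then let a
// non-empty gRPC server list override the choice.
func pickBinary(cpuFlags string, exists func(string) bool, grpcServers string) string {
	flags := map[string]bool{}
	for _, f := range strings.Fields(cpuFlags) {
		flags[f] = true // whole-word match, like grep "\savx\s"
	}
	binary := "buun-llama-cpp-fallback"
	steps := []struct{ flag, name string }{
		{"avx", "buun-llama-cpp-avx"},
		{"avx2", "buun-llama-cpp-avx2"},
		{"avx512f", "buun-llama-cpp-avx512"},
	}
	for _, s := range steps {
		if flags[s.flag] && exists(s.name) {
			binary = s.name
		}
	}
	if grpcServers != "" && exists("buun-llama-cpp-grpc") {
		binary = "buun-llama-cpp-grpc"
	}
	return binary
}

func main() {
	shipped := map[string]bool{"buun-llama-cpp-avx": true, "buun-llama-cpp-avx2": true}
	exists := func(name string) bool { return shipped[name] }
	// The CPU reports avx512f, but no avx512 binary is shipped in this example.
	fmt.Println(pickBinary("fpu sse2 avx avx2 avx512f", exists, "")) // prints buun-llama-cpp-avx2
}
```

Because each step only replaces the choice when the binary exists, a package that ships a subset of the variants still ends up on the best one available.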


@@ -1,5 +1,5 @@
-IK_LLAMA_VERSION?=286ce324baed17c95faec77792eaa6bdb1c7a5f5
+IK_LLAMA_VERSION?=16996aeab772c69b6473597038b2ef0b85297e8b
LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
CMAKE_ARGS?=


@@ -0,0 +1,11 @@
--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
@@ -2494,7 +2494,7 @@
}
new_data = work.data();
- new_size = ggml_quantize_chunk(new_type, f32_data, new_data, 0, n_elms/cur->ne[0], cur->ne[0], nullptr);
+ new_size = ggml_quantize_chunk(new_type, f32_data, new_data, 0, n_elms/cur->ne[0], cur->ne[0], nullptr, nullptr);
} else {
new_type = cur->type;
new_data = cur->data;


@@ -1,5 +1,5 @@
-LLAMA_VERSION?=0d0764dfd257c0ae862525c05778207f87b99b1c
+LLAMA_VERSION?=187a45637054881ecacf17f8e2f6f8f2ba7df1c7
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
CMAKE_ARGS?=


@@ -4,7 +4,6 @@ package main
// It is meant to be used by the main executable that is the server for the specific backend type (falcon, gpt3, etc)
import (
"container/heap"
- "errors"
"fmt"
"math"
"slices"
@@ -100,9 +99,16 @@ func sortIntoKeySlicese(keys []*pb.StoresKey) [][]float32 {
}
func (s *Store) Load(opts *pb.ModelOptions) error {
- if opts.Model != "" {
- return errors.New("not implemented")
- }
+ // local-store is an in-memory vector store with no on-disk artefact to
+ // load — opts.Model is just a namespace identifier. The old `!= ""` guard
+ // rejected any non-empty model name with "not implemented", which broke
+ // callers that pass a namespace to isolate embedding spaces (face vs.
+ // voice biometrics both go through local-store but need distinct stores
+ // so ArcFace 512-D and ECAPA-TDNN 192-D don't collide). Namespace
+ // isolation is already handled upstream: ModelLoader spawns a fresh
+ // local-store process per (backend, model) tuple, so each namespace is
+ // its own Store{} instance. Nothing to do here beyond accepting the load.
+ _ = opts
return nil
}

backend/go/sherpa-onnx/.gitignore vendored Normal file

@@ -0,0 +1,11 @@
.cache/
sources/
build*/
package/
backend-assets/
sherpa-onnx
*.so
compile_commands.json
sherpa-onnx-whisper-*
vits-ljs/
streaming-zipformer-en/


@@ -0,0 +1,120 @@
CURRENT_DIR=$(abspath ./)
GOCMD=go
ONNX_VERSION?=1.24.4
# v1.12.39 — includes upstream's onnxruntime 1.24.4 bump (#3501). Earlier
# pinned commits only support onnxruntime 1.23.2, which has no CUDA 13
# pre-built tarball, blocking the -gpu-nvidia-cuda-13 build matrix entry.
SHERPA_COMMIT?=7288d15e3e31a7bd589b2ba88828d521e7a6b140
ONNX_ARCH?=x64
ONNX_OS?=linux
ifneq (,$(findstring aarch64,$(shell uname -m)))
ONNX_ARCH=aarch64
endif
ifeq ($(OS),Darwin)
ONNX_OS=osx
ifneq (,$(findstring aarch64,$(shell uname -m)))
ONNX_ARCH=arm64
else ifneq (,$(findstring arm64,$(shell uname -m)))
ONNX_ARCH=arm64
else
ONNX_ARCH=x86_64
endif
endif
# Upstream onnxruntime ships CUDA 12 and CUDA 13 variants under different
# names: -gpu-<ver>.tgz for CUDA 12, -gpu_cuda13-<ver>.tgz for CUDA 13
# (note underscore vs dash). CUDA 13 tarballs only exist from 1.24.x onward.
ifeq ($(BUILD_TYPE),cublas)
SHERPA_GPU=ON
ONNX_PROVIDER=cuda
ifeq ($(CUDA_MAJOR_VERSION),13)
ONNX_VARIANT=-gpu_cuda13
else
ONNX_VARIANT=-gpu
endif
else
ONNX_VARIANT=
SHERPA_GPU=OFF
ONNX_PROVIDER=cpu
endif
JOBS?=$(shell nproc --ignore=1 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
sources/onnxruntime:
mkdir -p sources/onnxruntime
curl -L https://github.com/microsoft/onnxruntime/releases/download/v$(ONNX_VERSION)/onnxruntime-$(ONNX_OS)-$(ONNX_ARCH)$(ONNX_VARIANT)-$(ONNX_VERSION).tgz \
-o sources/onnxruntime/onnxruntime.tgz
cd sources/onnxruntime && tar -xf onnxruntime.tgz --strip-components=1 && rm onnxruntime.tgz
sources/sherpa-onnx: sources/onnxruntime
git clone https://github.com/k2-fsa/sherpa-onnx.git sources/sherpa-onnx
cd sources/sherpa-onnx && git checkout $(SHERPA_COMMIT)
mkdir -p sources/sherpa-onnx/build
# sherpa-onnx's cmake detects a pre-installed onnxruntime via the
# SHERPA_ONNXRUNTIME_{INCLUDE,LIB}_DIR env vars (not via -D flags).
# Point them at our locally-downloaded Microsoft tarball — without
# this, sherpa-onnx falls through to download_onnxruntime() which
# fetches from csukuangfj/onnxruntime-libs. For the GPU 1.24.4
# build that release mirror publishes `-patched.zip` instead of the
# expected `.tgz`, so the download 404s and the build fails.
cd sources/sherpa-onnx/build && \
SHERPA_ONNXRUNTIME_INCLUDE_DIR=$(CURRENT_DIR)/sources/onnxruntime/include \
SHERPA_ONNXRUNTIME_LIB_DIR=$(CURRENT_DIR)/sources/onnxruntime/lib \
cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_FLAGS="-Wno-error=format-security" \
-DCMAKE_CXX_FLAGS="-Wno-error=format-security" \
-DSHERPA_ONNX_ENABLE_GPU=$(SHERPA_GPU) \
-DSHERPA_ONNX_ENABLE_TTS=ON \
-DSHERPA_ONNX_ENABLE_BINARY=OFF \
-DSHERPA_ONNX_ENABLE_PYTHON=OFF \
-DSHERPA_ONNX_ENABLE_TESTS=OFF \
-DSHERPA_ONNX_ENABLE_C_API=ON \
-DBUILD_SHARED_LIBS=ON \
-DSHERPA_ONNX_USE_PRE_INSTALLED_ONNXRUNTIME_IF_AVAILABLE=ON \
..
cd sources/sherpa-onnx/build && make -j$(JOBS)
backend-assets/lib: sources/sherpa-onnx sources/onnxruntime
mkdir -p backend-assets/lib
cp -rfLv sources/onnxruntime/lib/* backend-assets/lib/
cp -rfLv sources/sherpa-onnx/build/lib/*.so* backend-assets/lib/ 2>/dev/null || true
cp -rfLv sources/sherpa-onnx/build/lib/*.dylib backend-assets/lib/ 2>/dev/null || true
# libsherpa-shim wraps sherpa-onnx's nested config structs and TTS
# callback plumbing behind a purego-friendly API: opaque handles plus
# fixed-signature setters/getters/trampoline. Plain C compile — no cgo.
SHIM_EXT=so
ifeq ($(OS),Darwin)
SHIM_EXT=dylib
endif
backend-assets/lib/libsherpa-shim.$(SHIM_EXT): csrc/shim.c csrc/shim.h backend-assets/lib
$(CC) -shared -fPIC -O2 \
-I$(CURRENT_DIR)/sources/sherpa-onnx/sherpa-onnx/c-api \
-o $@ csrc/shim.c \
-L$(CURRENT_DIR)/backend-assets/lib \
-lsherpa-onnx-c-api \
-Wl,-rpath,'$$ORIGIN'
sherpa-onnx: backend-assets/lib backend-assets/lib/libsherpa-shim.$(SHIM_EXT)
CGO_ENABLED=0 $(GOCMD) build \
-ldflags "$(LD_FLAGS) -X main.onnxProvider=$(ONNX_PROVIDER)" \
-tags "$(GO_TAGS)" -o sherpa-onnx ./
package:
bash package.sh
build: sherpa-onnx package
clean:
rm -rf sherpa-onnx sources/ backend-assets/ package/ vits-ljs/ sherpa-onnx-whisper-*/
test: sherpa-onnx
LD_LIBRARY_PATH=$(CURRENT_DIR)/backend-assets/lib \
bash test.sh
.PHONY: build package clean test
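
The tarball naming rules described in the Makefile comments above (dash for CUDA 12, underscore for CUDA 13) can be restated as two small functions. This Go sketch only mirrors the string assembly of the `sources/onnxruntime` rule; the function names are hypothetical, and it does not check whether a given tarball actually exists upstream.

```go
package main

import "fmt"

// onnxruntimeVariant follows the Makefile's branching: CPU builds use no
// suffix, cublas builds use "-gpu", and cublas with CUDA major 13 uses
// "-gpu_cuda13".
func onnxruntimeVariant(buildType, cudaMajor string) string {
	if buildType != "cublas" {
		return ""
	}
	if cudaMajor == "13" {
		return "-gpu_cuda13"
	}
	return "-gpu"
}

// onnxruntimeURL assembles the release tarball URL the same way the
// sources/onnxruntime rule does.
func onnxruntimeURL(version, osName, arch, variant string) string {
	return fmt.Sprintf(
		"https://github.com/microsoft/onnxruntime/releases/download/v%s/onnxruntime-%s-%s%s-%s.tgz",
		version, osName, arch, variant, version)
}

func main() {
	fmt.Println(onnxruntimeURL("1.24.4", "linux", "x64", onnxruntimeVariant("cublas", "13")))
}
```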


File diff suppressed because it is too large


@@ -0,0 +1,169 @@
package main
import (
"os"
"path/filepath"
"testing"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func TestSherpaBackend(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "Sherpa-ONNX Backend Suite")
}
// Load libsherpa-shim + libsherpa-onnx-c-api via purego before any spec
// runs — otherwise any Load/TTS/VAD/AudioTranscription call hits a nil
// function pointer. LD_LIBRARY_PATH must contain the directory holding
// both .so files; test.sh sets this.
var _ = BeforeSuite(func() {
Expect(loadSherpaLibs()).To(Succeed())
})
var _ = Describe("Sherpa-ONNX", func() {
Context("lifecycle", func() {
It("is locking (C API is not thread safe)", func() {
Expect((&SherpaBackend{}).Locking()).To(BeTrue())
})
It("errors loading a non-existent model", func() {
tmpDir, err := os.MkdirTemp("", "sherpa-test-nonexistent")
Expect(err).ToNot(HaveOccurred())
defer os.RemoveAll(tmpDir)
err = (&SherpaBackend{}).Load(&pb.ModelOptions{
ModelFile: filepath.Join(tmpDir, "non-existent-model.onnx"),
})
Expect(err).To(HaveOccurred())
})
It("errors loading a non-existent ASR model", func() {
tmpDir, err := os.MkdirTemp("", "sherpa-test-asr")
Expect(err).ToNot(HaveOccurred())
defer os.RemoveAll(tmpDir)
err = (&SherpaBackend{}).Load(&pb.ModelOptions{
ModelFile: filepath.Join(tmpDir, "model.onnx"),
Type: "asr",
})
Expect(err).To(HaveOccurred())
})
It("dispatches Load by Type", func() {
tmpDir, err := os.MkdirTemp("", "sherpa-test-dispatch")
Expect(err).ToNot(HaveOccurred())
defer os.RemoveAll(tmpDir)
modelFile := filepath.Join(tmpDir, "model.onnx")
for _, typ := range []string{"", "asr", "vad"} {
err := (&SherpaBackend{}).Load(&pb.ModelOptions{ModelFile: modelFile, Type: typ})
Expect(err).To(HaveOccurred(), "Type=%q", typ)
}
})
})
Context("method errors without loaded model", func() {
It("rejects TTS", func() {
tmpDir, err := os.MkdirTemp("", "sherpa-test-tts")
Expect(err).ToNot(HaveOccurred())
defer os.RemoveAll(tmpDir)
err = (&SherpaBackend{}).TTS(&pb.TTSRequest{
Text: "should fail — no model loaded",
Dst: filepath.Join(tmpDir, "output.wav"),
})
Expect(err).To(HaveOccurred())
})
It("rejects AudioTranscription", func() {
_, err := (&SherpaBackend{}).AudioTranscription(&pb.TranscriptRequest{
Dst: "/tmp/nonexistent.wav",
})
Expect(err).To(HaveOccurred())
})
It("rejects VAD", func() {
_, err := (&SherpaBackend{}).VAD(&pb.VADRequest{
Audio: []float32{0.1, 0.2, 0.3},
})
Expect(err).To(HaveOccurred())
})
})
Context("type detection", func() {
DescribeTable("isASRType",
func(input string, want bool) {
Expect(isASRType(input)).To(Equal(want))
},
Entry("asr", "asr", true),
Entry("ASR", "ASR", true),
Entry("Asr", "Asr", true),
Entry("transcription", "transcription", true),
Entry("Transcription", "Transcription", true),
Entry("transcribe", "transcribe", true),
Entry("Transcribe", "Transcribe", true),
Entry("tts", "tts", false),
Entry("empty", "", false),
Entry("other", "other", false),
Entry("vad", "vad", false),
)
DescribeTable("isVADType",
func(input string, want bool) {
Expect(isVADType(input)).To(Equal(want))
},
Entry("vad", "vad", true),
Entry("VAD", "VAD", true),
Entry("Vad", "Vad", true),
Entry("asr", "asr", false),
Entry("tts", "tts", false),
Entry("empty", "", false),
Entry("other", "other", false),
)
})
Context("option parsing", func() {
It("parses float options with fallback on bad input", func() {
opts := &pb.ModelOptions{Options: []string{
"vad.threshold=0.3",
"tts.length_scale=1.25",
"bad.number=not-a-float",
}}
Expect(findOptionFloat(opts, "vad.threshold=", 0.5)).To(BeNumerically("~", 0.3, 1e-6))
Expect(findOptionFloat(opts, "tts.length_scale=", 1.0)).To(BeNumerically("~", 1.25, 1e-6))
Expect(findOptionFloat(opts, "missing.key=", 0.7)).To(BeNumerically("~", 0.7, 1e-6))
Expect(findOptionFloat(opts, "bad.number=", 9.9)).To(BeNumerically("~", 9.9, 1e-6))
})
It("parses int options with fallback on bad input", func() {
opts := &pb.ModelOptions{Options: []string{
"asr.sample_rate=22050",
"online.chunk_samples=800",
"bad.int=4.2",
}}
Expect(findOptionInt(opts, "asr.sample_rate=", 16000)).To(Equal(int32(22050)))
Expect(findOptionInt(opts, "online.chunk_samples=", 1600)).To(Equal(int32(800)))
Expect(findOptionInt(opts, "missing.key=", 42)).To(Equal(int32(42)))
Expect(findOptionInt(opts, "bad.int=", 100)).To(Equal(int32(100)))
})
It("parses bool options (0/1, true/false, yes/no, on/off)", func() {
opts := &pb.ModelOptions{Options: []string{
"online.enable_endpoint=0",
"asr.sense_voice.use_itn=True",
"feature.on=yes",
"feature.off=Off",
"feature.bad=maybe",
}}
Expect(findOptionBool(opts, "online.enable_endpoint=", 1)).To(Equal(int32(0)))
Expect(findOptionBool(opts, "asr.sense_voice.use_itn=", 0)).To(Equal(int32(1)))
Expect(findOptionBool(opts, "feature.on=", 0)).To(Equal(int32(1)))
Expect(findOptionBool(opts, "feature.off=", 1)).To(Equal(int32(0)))
Expect(findOptionBool(opts, "feature.bad=", 1)).To(Equal(int32(1)))
Expect(findOptionBool(opts, "missing.key=", 1)).To(Equal(int32(1)))
})
})
})
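
The option-parsing specs above pin down a contract: match on a `key=` prefix, parse the remainder, and fall back to the default on a missing key or an unparsable value. Since the backend's own implementation is in the suppressed diff, the following Go sketch is a guess at one way to satisfy that contract, not the actual code. `optionFloat` and `optionBool` are hypothetical names, and they take a plain `[]string` instead of `*pb.ModelOptions` to stay self-contained.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// optionFloat returns the value of the first option starting with prefix,
// or def when the key is missing or the value does not parse as a float.
func optionFloat(options []string, prefix string, def float32) float32 {
	for _, o := range options {
		if !strings.HasPrefix(o, prefix) {
			continue
		}
		v, err := strconv.ParseFloat(strings.TrimPrefix(o, prefix), 32)
		if err != nil {
			return def
		}
		return float32(v)
	}
	return def
}

// optionBool accepts 0/1, true/false, yes/no and on/off in any letter case
// and returns 1 or 0; anything else yields def.
func optionBool(options []string, prefix string, def int32) int32 {
	for _, o := range options {
		if !strings.HasPrefix(o, prefix) {
			continue
		}
		switch strings.ToLower(strings.TrimPrefix(o, prefix)) {
		case "1", "true", "yes", "on":
			return 1
		case "0", "false", "no", "off":
			return 0
		}
		return def
	}
	return def
}

func main() {
	opts := []string{"tts.length_scale=1.25", "feature.off=Off", "feature.bad=maybe"}
	fmt.Println(optionFloat(opts, "tts.length_scale=", 1.0))
	fmt.Println(optionBool(opts, "feature.off=", 1))
	fmt.Println(optionBool(opts, "feature.bad=", 1))
}
```

Including the `=` in the prefix, as the specs do, keeps a key such as `vad.threshold` from matching a longer key that merely starts with the same characters.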


@@ -0,0 +1,325 @@
#include "shim.h"
#include "c-api.h"
#include <stdlib.h>
#include <string.h>
// Replace the char* field pointed to by `slot` with a strdup of `s`
// (or NULL if s is NULL). Frees any prior value. If strdup fails the
// field is left NULL, so the caller will see a Create* failure downstream.
static void shim_set_str(const char **slot, const char *s) {
free((char *)*slot);
*slot = s ? strdup(s) : NULL;
}
// ==================================================================
// VAD config
// ==================================================================
void *sherpa_shim_vad_config_new(void) {
return calloc(1, sizeof(SherpaOnnxVadModelConfig));
}
void sherpa_shim_vad_config_free(void *h) {
if (!h) return;
SherpaOnnxVadModelConfig *c = (SherpaOnnxVadModelConfig *)h;
free((char *)c->silero_vad.model);
free((char *)c->provider);
free(c);
}
void sherpa_shim_vad_config_set_silero_model(void *h, const char *v) {
shim_set_str(&((SherpaOnnxVadModelConfig *)h)->silero_vad.model, v);
}
void sherpa_shim_vad_config_set_silero_threshold(void *h, float v) {
((SherpaOnnxVadModelConfig *)h)->silero_vad.threshold = v;
}
void sherpa_shim_vad_config_set_silero_min_silence_duration(void *h, float v) {
((SherpaOnnxVadModelConfig *)h)->silero_vad.min_silence_duration = v;
}
void sherpa_shim_vad_config_set_silero_min_speech_duration(void *h, float v) {
((SherpaOnnxVadModelConfig *)h)->silero_vad.min_speech_duration = v;
}
void sherpa_shim_vad_config_set_silero_window_size(void *h, int32_t v) {
((SherpaOnnxVadModelConfig *)h)->silero_vad.window_size = v;
}
void sherpa_shim_vad_config_set_silero_max_speech_duration(void *h, float v) {
((SherpaOnnxVadModelConfig *)h)->silero_vad.max_speech_duration = v;
}
void sherpa_shim_vad_config_set_sample_rate(void *h, int32_t v) {
((SherpaOnnxVadModelConfig *)h)->sample_rate = v;
}
void sherpa_shim_vad_config_set_num_threads(void *h, int32_t v) {
((SherpaOnnxVadModelConfig *)h)->num_threads = v;
}
void sherpa_shim_vad_config_set_provider(void *h, const char *v) {
shim_set_str(&((SherpaOnnxVadModelConfig *)h)->provider, v);
}
void sherpa_shim_vad_config_set_debug(void *h, int32_t v) {
((SherpaOnnxVadModelConfig *)h)->debug = v;
}
void *sherpa_shim_create_vad(void *h, float buffer_size_seconds) {
return (void *)SherpaOnnxCreateVoiceActivityDetector(
(const SherpaOnnxVadModelConfig *)h, buffer_size_seconds);
}
// ==================================================================
// Offline TTS config (VITS)
// ==================================================================
void *sherpa_shim_tts_config_new(void) {
return calloc(1, sizeof(SherpaOnnxOfflineTtsConfig));
}
void sherpa_shim_tts_config_free(void *h) {
if (!h) return;
SherpaOnnxOfflineTtsConfig *c = (SherpaOnnxOfflineTtsConfig *)h;
free((char *)c->model.vits.model);
free((char *)c->model.vits.tokens);
free((char *)c->model.vits.lexicon);
free((char *)c->model.vits.data_dir);
free((char *)c->model.provider);
free(c);
}
void sherpa_shim_tts_config_set_vits_model(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.model, v);
}
void sherpa_shim_tts_config_set_vits_tokens(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.tokens, v);
}
void sherpa_shim_tts_config_set_vits_lexicon(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.lexicon, v);
}
void sherpa_shim_tts_config_set_vits_data_dir(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.data_dir, v);
}
void sherpa_shim_tts_config_set_vits_noise_scale(void *h, float v) {
((SherpaOnnxOfflineTtsConfig *)h)->model.vits.noise_scale = v;
}
void sherpa_shim_tts_config_set_vits_noise_scale_w(void *h, float v) {
((SherpaOnnxOfflineTtsConfig *)h)->model.vits.noise_scale_w = v;
}
void sherpa_shim_tts_config_set_vits_length_scale(void *h, float v) {
((SherpaOnnxOfflineTtsConfig *)h)->model.vits.length_scale = v;
}
void sherpa_shim_tts_config_set_num_threads(void *h, int32_t v) {
((SherpaOnnxOfflineTtsConfig *)h)->model.num_threads = v;
}
void sherpa_shim_tts_config_set_debug(void *h, int32_t v) {
((SherpaOnnxOfflineTtsConfig *)h)->model.debug = v;
}
void sherpa_shim_tts_config_set_provider(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.provider, v);
}
void sherpa_shim_tts_config_set_max_num_sentences(void *h, int32_t v) {
((SherpaOnnxOfflineTtsConfig *)h)->max_num_sentences = v;
}
void *sherpa_shim_create_offline_tts(void *h) {
return (void *)SherpaOnnxCreateOfflineTts(
(const SherpaOnnxOfflineTtsConfig *)h);
}
// ==================================================================
// Offline recognizer config
// ==================================================================
void *sherpa_shim_offline_recog_config_new(void) {
return calloc(1, sizeof(SherpaOnnxOfflineRecognizerConfig));
}
void sherpa_shim_offline_recog_config_free(void *h) {
if (!h) return;
SherpaOnnxOfflineRecognizerConfig *c = (SherpaOnnxOfflineRecognizerConfig *)h;
free((char *)c->model_config.provider);
free((char *)c->model_config.tokens);
free((char *)c->model_config.whisper.encoder);
free((char *)c->model_config.whisper.decoder);
free((char *)c->model_config.whisper.language);
free((char *)c->model_config.whisper.task);
free((char *)c->model_config.paraformer.model);
free((char *)c->model_config.sense_voice.model);
free((char *)c->model_config.sense_voice.language);
free((char *)c->model_config.omnilingual.model);
free((char *)c->decoding_method);
free(c);
}
void sherpa_shim_offline_recog_config_set_num_threads(void *h, int32_t v) {
((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.num_threads = v;
}
void sherpa_shim_offline_recog_config_set_debug(void *h, int32_t v) {
((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.debug = v;
}
void sherpa_shim_offline_recog_config_set_provider(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.provider, v);
}
void sherpa_shim_offline_recog_config_set_tokens(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.tokens, v);
}
void sherpa_shim_offline_recog_config_set_feat_sample_rate(void *h, int32_t v) {
((SherpaOnnxOfflineRecognizerConfig *)h)->feat_config.sample_rate = v;
}
void sherpa_shim_offline_recog_config_set_feat_feature_dim(void *h, int32_t v) {
((SherpaOnnxOfflineRecognizerConfig *)h)->feat_config.feature_dim = v;
}
void sherpa_shim_offline_recog_config_set_decoding_method(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->decoding_method, v);
}
void sherpa_shim_offline_recog_config_set_whisper_encoder(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.encoder, v);
}
void sherpa_shim_offline_recog_config_set_whisper_decoder(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.decoder, v);
}
void sherpa_shim_offline_recog_config_set_whisper_language(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.language, v);
}
void sherpa_shim_offline_recog_config_set_whisper_task(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.task, v);
}
void sherpa_shim_offline_recog_config_set_whisper_tail_paddings(void *h, int32_t v) {
((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.tail_paddings = v;
}
void sherpa_shim_offline_recog_config_set_paraformer_model(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.paraformer.model, v);
}
void sherpa_shim_offline_recog_config_set_sense_voice_model(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.sense_voice.model, v);
}
void sherpa_shim_offline_recog_config_set_sense_voice_language(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.sense_voice.language, v);
}
void sherpa_shim_offline_recog_config_set_sense_voice_use_itn(void *h, int32_t v) {
((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.sense_voice.use_itn = v;
}
void sherpa_shim_offline_recog_config_set_omnilingual_model(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.omnilingual.model, v);
}
void *sherpa_shim_create_offline_recognizer(void *h) {
return (void *)SherpaOnnxCreateOfflineRecognizer(
(const SherpaOnnxOfflineRecognizerConfig *)h);
}
// ==================================================================
// Online recognizer config
// ==================================================================
void *sherpa_shim_online_recog_config_new(void) {
return calloc(1, sizeof(SherpaOnnxOnlineRecognizerConfig));
}
void sherpa_shim_online_recog_config_free(void *h) {
if (!h) return;
SherpaOnnxOnlineRecognizerConfig *c = (SherpaOnnxOnlineRecognizerConfig *)h;
free((char *)c->model_config.transducer.encoder);
free((char *)c->model_config.transducer.decoder);
free((char *)c->model_config.transducer.joiner);
free((char *)c->model_config.tokens);
free((char *)c->model_config.provider);
free((char *)c->decoding_method);
free(c);
}
void sherpa_shim_online_recog_config_set_transducer_encoder(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.transducer.encoder, v);
}
void sherpa_shim_online_recog_config_set_transducer_decoder(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.transducer.decoder, v);
}
void sherpa_shim_online_recog_config_set_transducer_joiner(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.transducer.joiner, v);
}
void sherpa_shim_online_recog_config_set_tokens(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.tokens, v);
}
void sherpa_shim_online_recog_config_set_num_threads(void *h, int32_t v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.num_threads = v;
}
void sherpa_shim_online_recog_config_set_debug(void *h, int32_t v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.debug = v;
}
void sherpa_shim_online_recog_config_set_provider(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.provider, v);
}
void sherpa_shim_online_recog_config_set_feat_sample_rate(void *h, int32_t v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->feat_config.sample_rate = v;
}
void sherpa_shim_online_recog_config_set_feat_feature_dim(void *h, int32_t v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->feat_config.feature_dim = v;
}
void sherpa_shim_online_recog_config_set_decoding_method(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->decoding_method, v);
}
void sherpa_shim_online_recog_config_set_enable_endpoint(void *h, int32_t v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->enable_endpoint = v;
}
void sherpa_shim_online_recog_config_set_rule1_min_trailing_silence(void *h, float v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->rule1_min_trailing_silence = v;
}
void sherpa_shim_online_recog_config_set_rule2_min_trailing_silence(void *h, float v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->rule2_min_trailing_silence = v;
}
void sherpa_shim_online_recog_config_set_rule3_min_utterance_length(void *h, float v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->rule3_min_utterance_length = v;
}
void *sherpa_shim_create_online_recognizer(void *h) {
return (void *)SherpaOnnxCreateOnlineRecognizer(
(const SherpaOnnxOnlineRecognizerConfig *)h);
}
// ==================================================================
// Result-struct accessors
// ==================================================================
int32_t sherpa_shim_wave_sample_rate(const void *h) {
return ((const SherpaOnnxWave *)h)->sample_rate;
}
int32_t sherpa_shim_wave_num_samples(const void *h) {
return ((const SherpaOnnxWave *)h)->num_samples;
}
const float *sherpa_shim_wave_samples(const void *h) {
return ((const SherpaOnnxWave *)h)->samples;
}
const char *sherpa_shim_offline_result_text(const void *h) {
return ((const SherpaOnnxOfflineRecognizerResult *)h)->text;
}
const char *sherpa_shim_online_result_text(const void *h) {
return ((const SherpaOnnxOnlineRecognizerResult *)h)->text;
}
int32_t sherpa_shim_generated_audio_sample_rate(const void *h) {
return ((const SherpaOnnxGeneratedAudio *)h)->sample_rate;
}
int32_t sherpa_shim_generated_audio_n(const void *h) {
return ((const SherpaOnnxGeneratedAudio *)h)->n;
}
const float *sherpa_shim_generated_audio_samples(const void *h) {
return ((const SherpaOnnxGeneratedAudio *)h)->samples;
}
int32_t sherpa_shim_speech_segment_start(const void *h) {
return ((const SherpaOnnxSpeechSegment *)h)->start;
}
int32_t sherpa_shim_speech_segment_n(const void *h) {
return ((const SherpaOnnxSpeechSegment *)h)->n;
}
// ==================================================================
// TTS streaming callback trampoline
// ==================================================================
void *sherpa_shim_tts_generate_with_callback(
void *tts, const char *text, int32_t sid, float speed,
uintptr_t callback_ptr, uintptr_t user_data) {
SherpaOnnxGeneratedAudioCallbackWithArg cb =
(SherpaOnnxGeneratedAudioCallbackWithArg)callback_ptr;
return (void *)SherpaOnnxOfflineTtsGenerateWithCallbackWithArg(
(const SherpaOnnxOfflineTts *)tts, text, sid, speed, cb,
(void *)user_data);
}


@@ -0,0 +1,129 @@
#ifndef LOCALAI_SHERPA_ONNX_SHIM_H
#define LOCALAI_SHERPA_ONNX_SHIM_H
#include <stdint.h>
// libsherpa-shim: purego-friendly wrapper around sherpa-onnx's C API.
// Purego can't access C struct fields and can't route C callbacks to Go
// funcs directly. Every function here is a fixed-signature trampoline
// that replaces one field read/write or callback handoff that the Go
// backend would otherwise have to do through cgo.
//
// String lifetime: setters strdup; _free walks every owned string and
// frees it. Callers may discard their input buffers the moment a setter
// returns.
//
// Opaque handles are `void *` in both directions. Nothing here holds a
// reference across calls except config handles (freed via _free) and
// sherpa-allocated results (freed via sherpa's own Destroy* entry
// points, which Go calls through purego pass-through).
#ifdef __cplusplus
extern "C" {
#endif
// --- VAD config -----------------------------------------------------
void *sherpa_shim_vad_config_new(void);
void sherpa_shim_vad_config_free(void *cfg);
void sherpa_shim_vad_config_set_silero_model(void *cfg, const char *path);
void sherpa_shim_vad_config_set_silero_threshold(void *cfg, float v);
void sherpa_shim_vad_config_set_silero_min_silence_duration(void *cfg, float v);
void sherpa_shim_vad_config_set_silero_min_speech_duration(void *cfg, float v);
void sherpa_shim_vad_config_set_silero_window_size(void *cfg, int32_t v);
void sherpa_shim_vad_config_set_silero_max_speech_duration(void *cfg, float v);
void sherpa_shim_vad_config_set_sample_rate(void *cfg, int32_t v);
void sherpa_shim_vad_config_set_num_threads(void *cfg, int32_t v);
void sherpa_shim_vad_config_set_provider(void *cfg, const char *v);
void sherpa_shim_vad_config_set_debug(void *cfg, int32_t v);
void *sherpa_shim_create_vad(void *cfg, float buffer_size_seconds);
// --- Offline TTS config (VITS path — the only TTS family the backend uses) ---
void *sherpa_shim_tts_config_new(void);
void sherpa_shim_tts_config_free(void *cfg);
void sherpa_shim_tts_config_set_vits_model(void *cfg, const char *v);
void sherpa_shim_tts_config_set_vits_tokens(void *cfg, const char *v);
void sherpa_shim_tts_config_set_vits_lexicon(void *cfg, const char *v);
void sherpa_shim_tts_config_set_vits_data_dir(void *cfg, const char *v);
void sherpa_shim_tts_config_set_vits_noise_scale(void *cfg, float v);
void sherpa_shim_tts_config_set_vits_noise_scale_w(void *cfg, float v);
void sherpa_shim_tts_config_set_vits_length_scale(void *cfg, float v);
void sherpa_shim_tts_config_set_num_threads(void *cfg, int32_t v);
void sherpa_shim_tts_config_set_debug(void *cfg, int32_t v);
void sherpa_shim_tts_config_set_provider(void *cfg, const char *v);
void sherpa_shim_tts_config_set_max_num_sentences(void *cfg, int32_t v);
void *sherpa_shim_create_offline_tts(void *cfg);
// --- Offline recognizer config (Whisper / Paraformer / SenseVoice / Omnilingual) ---
void *sherpa_shim_offline_recog_config_new(void);
void sherpa_shim_offline_recog_config_free(void *cfg);
void sherpa_shim_offline_recog_config_set_num_threads(void *cfg, int32_t v);
void sherpa_shim_offline_recog_config_set_debug(void *cfg, int32_t v);
void sherpa_shim_offline_recog_config_set_provider(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_tokens(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_feat_sample_rate(void *cfg, int32_t v);
void sherpa_shim_offline_recog_config_set_feat_feature_dim(void *cfg, int32_t v);
void sherpa_shim_offline_recog_config_set_decoding_method(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_whisper_encoder(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_whisper_decoder(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_whisper_language(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_whisper_task(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_whisper_tail_paddings(void *cfg, int32_t v);
void sherpa_shim_offline_recog_config_set_paraformer_model(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_sense_voice_model(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_sense_voice_language(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_sense_voice_use_itn(void *cfg, int32_t v);
void sherpa_shim_offline_recog_config_set_omnilingual_model(void *cfg, const char *v);
void *sherpa_shim_create_offline_recognizer(void *cfg);
// --- Online recognizer config (streaming zipformer transducer) ---
void *sherpa_shim_online_recog_config_new(void);
void sherpa_shim_online_recog_config_free(void *cfg);
void sherpa_shim_online_recog_config_set_transducer_encoder(void *cfg, const char *v);
void sherpa_shim_online_recog_config_set_transducer_decoder(void *cfg, const char *v);
void sherpa_shim_online_recog_config_set_transducer_joiner(void *cfg, const char *v);
void sherpa_shim_online_recog_config_set_tokens(void *cfg, const char *v);
void sherpa_shim_online_recog_config_set_num_threads(void *cfg, int32_t v);
void sherpa_shim_online_recog_config_set_debug(void *cfg, int32_t v);
void sherpa_shim_online_recog_config_set_provider(void *cfg, const char *v);
void sherpa_shim_online_recog_config_set_feat_sample_rate(void *cfg, int32_t v);
void sherpa_shim_online_recog_config_set_feat_feature_dim(void *cfg, int32_t v);
void sherpa_shim_online_recog_config_set_decoding_method(void *cfg, const char *v);
void sherpa_shim_online_recog_config_set_enable_endpoint(void *cfg, int32_t v);
void sherpa_shim_online_recog_config_set_rule1_min_trailing_silence(void *cfg, float v);
void sherpa_shim_online_recog_config_set_rule2_min_trailing_silence(void *cfg, float v);
void sherpa_shim_online_recog_config_set_rule3_min_utterance_length(void *cfg, float v);
void *sherpa_shim_create_online_recognizer(void *cfg);
// --- Result accessors (sherpa-allocated; caller destroys via sherpa's own Destroy*) ---
int32_t sherpa_shim_wave_sample_rate(const void *wave);
int32_t sherpa_shim_wave_num_samples(const void *wave);
const float *sherpa_shim_wave_samples(const void *wave);
const char *sherpa_shim_offline_result_text(const void *result);
const char *sherpa_shim_online_result_text(const void *result);
int32_t sherpa_shim_generated_audio_sample_rate(const void *audio);
int32_t sherpa_shim_generated_audio_n(const void *audio);
const float *sherpa_shim_generated_audio_samples(const void *audio);
int32_t sherpa_shim_speech_segment_start(const void *seg);
int32_t sherpa_shim_speech_segment_n(const void *seg);
// --- TTS streaming callback trampoline -----------------------------
// Replaces the //export sherpaTtsGoCallback + callbacks.c bridge pattern.
// `callback_ptr` is the C-callable function pointer returned by
// purego.NewCallback. `user_data` is an integer the Go side uses to
// look up its state (sync.Map keyed by uint64).
//
// Returns the sherpa-allocated SherpaOnnxGeneratedAudio. Destroy with
// SherpaOnnxDestroyOfflineTtsGeneratedAudio (callable directly from
// Go via purego).
void *sherpa_shim_tts_generate_with_callback(
void *tts, const char *text, int32_t sid, float speed,
uintptr_t callback_ptr, uintptr_t user_data);
#ifdef __cplusplus
}
#endif
#endif
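The header's callback comment describes `user_data` as an integer the Go side uses to look up its state (a sync.Map keyed by uint64). The pattern can be sketched as below, written in Python purely for illustration. `HandleRegistry`, `on_audio_chunk` and the "return 1 to continue, 0 to stop" convention are assumptions of this sketch; they are not part of the shim or of sherpa-onnx.

```python
import itertools
import threading


class HandleRegistry:
    """Maps small integer handles to per-call state (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ids = itertools.count(1)  # 0 is reserved as "no handle"
        self._states = {}

    def register(self, state):
        # Hand out a fresh integer that can cross the C boundary as user_data.
        with self._lock:
            handle = next(self._ids)
            self._states[handle] = state
            return handle

    def lookup(self, handle):
        with self._lock:
            return self._states.get(handle)

    def release(self, handle):
        with self._lock:
            self._states.pop(handle, None)


registry = HandleRegistry()


def on_audio_chunk(samples, user_data):
    """Callback body: resolve the integer back to state, then append samples."""
    state = registry.lookup(user_data)
    if state is None:
        return 0  # assumed convention in this sketch: 0 stops generation
    state.extend(samples)
    return 1      # assumed convention in this sketch: 1 continues


chunks = []
handle = registry.register(chunks)
on_audio_chunk([0.1, 0.2], handle)
on_audio_chunk([0.3], handle)
registry.release(handle)
print(chunks)
```

Passing an integer rather than a pointer keeps garbage-collected objects out of C memory; the callback only ever sees a key it can resolve or reject.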


@@ -0,0 +1,23 @@
package main
import (
"flag"
grpc "github.com/mudler/LocalAI/pkg/grpc"
)
var (
addr = flag.String("addr", "localhost:50051", "the address to connect to")
)
func main() {
flag.Parse()
if err := loadSherpaLibs(); err != nil {
panic(err)
}
if err := grpc.StartServer(*addr, &SherpaBackend{}); err != nil {
panic(err)
}
}


@@ -0,0 +1,51 @@
#!/bin/bash
set -e
CURDIR=$(dirname "$(realpath $0)")
REPO_ROOT="${CURDIR}/../../.."
mkdir -p $CURDIR/package/lib
cp -avf $CURDIR/sherpa-onnx $CURDIR/package/
cp -avf $CURDIR/run.sh $CURDIR/package/
cp -rfLv $CURDIR/backend-assets/lib/* $CURDIR/package/lib/
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
echo "Detected ARM64 architecture, copying ARM64 libraries..."
cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ "$(uname -s)" = "Darwin" ]; then
echo "Detected Darwin"
else
echo "Error: Could not detect architecture"
exit 1
fi
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
package_gpu_libs
fi
echo "Packaging completed successfully"
ls -liah $CURDIR/package/
ls -liah $CURDIR/package/lib/

backend/go/sherpa-onnx/run.sh Executable file

@@ -0,0 +1,13 @@
#!/bin/bash
set -ex
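CURDIR=$(dirname "$(realpath "$0")")
# Prepend the bundled libraries. Skip the separator when LD_LIBRARY_PATH is
# unset, since an empty entry makes the loader search the current directory.
export LD_LIBRARY_PATH="$CURDIR/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
if [ -f "$CURDIR/lib/ld.so" ]; then
echo "Using lib/ld.so"
exec "$CURDIR/lib/ld.so" "$CURDIR/sherpa-onnx" "$@"
fi
exec "$CURDIR/sherpa-onnx" "$@"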

backend/go/sherpa-onnx/test.sh Executable file

@@ -0,0 +1,12 @@
#!/bin/bash
# Unit tests for the sherpa-onnx backend. Exercises error-path and
# dispatch logic via SherpaBackend directly (no gRPC). Integration
# coverage (gRPC TTS / streaming ASR / realtime pipeline) lives in
# tests/e2e-backends and tests/e2e and runs against the Docker image.
set -e
CURDIR=$(dirname "$(realpath $0)")
cd "$CURDIR"
PACKAGES=$(go list ./... | grep -v /sources/)
go test -v -timeout 60s $PACKAGES


@@ -1006,6 +1006,23 @@
    nvidia: "cuda12-neutts"
    amd: "rocm-neutts"
    nvidia-cuda-12: "cuda12-neutts"
- &sherpa-onnx
  name: "sherpa-onnx"
  alias: "sherpa-onnx"
  urls:
    - https://k2-fsa.github.io/sherpa/onnx/
  description: |
    Sherpa-ONNX backend for text-to-speech (VITS, Matcha, Kokoro), speech-to-text (Whisper, Paraformer, SenseVoice, Omnilingual ASR CTC), and voice activity detection via ONNX Runtime.
    Supports multi-speaker voices, 1600+ language ASR, and GPU acceleration.
  tags:
    - text-to-speech
    - TTS
    - speech-to-text
    - ASR
  capabilities:
    default: "cpu-sherpa-onnx"
    nvidia: "cuda12-sherpa-onnx"
    nvidia-cuda-12: "cuda12-sherpa-onnx"
- !!merge <<: *neutts
  name: "neutts-development"
  capabilities:
@@ -3834,3 +3851,30 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-speaker-recognition"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-speaker-recognition
## sherpa-onnx
- !!merge <<: *sherpa-onnx
  name: "sherpa-onnx-development"
  capabilities:
    default: "cpu-sherpa-onnx-development"
    nvidia: "cuda12-sherpa-onnx-development"
    nvidia-cuda-12: "cuda12-sherpa-onnx-development"
- !!merge <<: *sherpa-onnx
  name: "cpu-sherpa-onnx"
  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-sherpa-onnx"
  mirrors:
    - localai/localai-backends:latest-cpu-sherpa-onnx
- !!merge <<: *sherpa-onnx
  name: "cpu-sherpa-onnx-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-sherpa-onnx"
  mirrors:
    - localai/localai-backends:master-cpu-sherpa-onnx
- !!merge <<: *sherpa-onnx
  name: "cuda12-sherpa-onnx"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-sherpa-onnx"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-12-sherpa-onnx
- !!merge <<: *sherpa-onnx
  name: "cuda12-sherpa-onnx-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sherpa-onnx"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-sherpa-onnx


@@ -173,6 +173,30 @@ def _build_antispoofer(options: dict[str, str], model_dir: str | None) -> Antisp
# ─── InsightFaceEngine ────────────────────────────────────────────────
# Canonical ONNX manifest for each upstream insightface pack (v0.7 release
# at github.com/deepinsight/insightface/releases). LocalAI's gallery extracts
# these zips flat into the models directory, so when multiple packs or other
# backends drop their own ONNX files alongside, the glob-the-directory
# approach picks up foreign files and insightface's model_zoo.get_model()
# raises IndexError trying to index `input_shape[2]` on a tensor that isn't
# shaped like a face model. The manifest lets us pre-filter to only the
# files that actually belong to the requested pack — deterministic, correct
# pack choice, no crashes on neighbour ONNX files.
_KNOWN_PACK_MANIFESTS: dict[str, frozenset[str]] = {
    "buffalo_l": frozenset({
        "det_10g.onnx",
        "w600k_r50.onnx",
        "genderage.onnx",
        "2d106det.onnx",
        "1k3d68.onnx",
    }),
    "buffalo_sc": frozenset({
        "det_500m.onnx",
        "w600k_mbf.onnx",
    }),
}
class InsightFaceEngine:
"""Drives insightface's model_zoo directly — no FaceAnalysis wrapper.
@@ -222,6 +246,21 @@ class InsightFaceEngine:
)
        onnx_files = sorted(glob.glob(os.path.join(pack_dir, "*.onnx")))
        # When the pack extracts flat into a shared models directory it
        # mixes with ONNX files from other backends (opencv face engine,
        # MiniFASNet antispoof, WeSpeaker voice embedding, other buffalo
        # packs installed earlier). Feeding those into model_zoo.get_model()
        # blows up inside insightface's router, which assumes a 4-D NCHW
        # input and indexes `input_shape[2]` on tensors that aren't shaped
        # like a face model, raising IndexError. For the upstream packs we
        # know the exact ONNX manifest; scoping to it makes the load
        # deterministic (without it, det_10g.onnx from buffalo_l sorts
        # before det_500m.onnx from buffalo_sc and silently wins).
        manifest = _KNOWN_PACK_MANIFESTS.get(self.model_pack)
        if manifest is not None:
            scoped = [f for f in onnx_files if os.path.basename(f) in manifest]
            if scoped:
                onnx_files = scoped
        if not onnx_files:
            raise ValueError(f"no ONNX files in pack directory: {pack_dir}")
@@ -231,14 +270,31 @@ class InsightFaceEngine:
self._providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        self.models = {}
        skipped: list[tuple[str, str]] = []
        for onnx_file in onnx_files:
            try:
                m = model_zoo.get_model(onnx_file, providers=self._providers)
            except Exception as err:
                # Foreign ONNX (wrong rank/shape, non-insightface model):
                # older insightface versions raise IndexError / ValueError
                # instead of returning None. Keep loading the rest.
                skipped.append((os.path.basename(onnx_file), str(err)))
                continue
            if m is None:
                skipped.append((os.path.basename(onnx_file), "unknown taskname"))
                continue
            # First occurrence of each taskname wins (matches FaceAnalysis).
            if m.taskname not in self.models:
                self.models[m.taskname] = m
        if skipped:
            import sys
            print(
                f"[insightface] skipped {len(skipped)} non-pack ONNX file(s) in {pack_dir}: "
                + ", ".join(f"{n} ({why})" for n, why in skipped),
                file=sys.stderr,
            )
        if "detection" not in self.models:
            raise ValueError(f"no detector (taskname='detection') found in {pack_dir}")
        self.det_model = self.models["detection"]
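The manifest scoping in the hunks above can be exercised on its own. The following is a minimal sketch, assuming a shared models directory with one hypothetical foreign file (`voice_embedding.onnx`); `scope_to_pack` is an illustrative helper name, not a function in the backend.

```python
import os

# Same shape as _KNOWN_PACK_MANIFESTS in the diff; contents copied from it.
KNOWN_PACK_MANIFESTS = {
    "buffalo_l": frozenset({
        "det_10g.onnx", "w600k_r50.onnx", "genderage.onnx",
        "2d106det.onnx", "1k3d68.onnx",
    }),
    "buffalo_sc": frozenset({"det_500m.onnx", "w600k_mbf.onnx"}),
}


def scope_to_pack(onnx_files, model_pack):
    """Keep only files named in the pack's manifest.

    Falls back to the unfiltered list when the pack is unknown or when
    nothing matches, mirroring the `if scoped:` guard in the diff.
    """
    manifest = KNOWN_PACK_MANIFESTS.get(model_pack)
    if manifest is None:
        return list(onnx_files)
    scoped = [f for f in onnx_files if os.path.basename(f) in manifest]
    return scoped or list(onnx_files)


shared_dir = sorted([
    "/models/det_10g.onnx",          # belongs to buffalo_l
    "/models/det_500m.onnx",         # belongs to buffalo_sc
    "/models/w600k_mbf.onnx",        # belongs to buffalo_sc
    "/models/voice_embedding.onnx",  # hypothetical foreign file
])
print(scope_to_pack(shared_dir, "buffalo_sc"))
```

For `buffalo_sc` only the two manifest files survive, so `det_10g.onnx` can no longer win the "first detector" slot by sorting earlier.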


@@ -317,8 +317,23 @@ class OnnxDirectEngine:
        else:
            provider_list = ["CPUExecutionProvider"]
        self._session = ort.InferenceSession(onnx_path, providers=provider_list)
        input_meta = self._session.get_inputs()[0]
        self._input_name = input_meta.name
        # Pre-exported speaker encoders come in two shapes:
        #   rank-2 [batch, samples]: some 3D-Speaker exports take the raw waveform.
        #   rank-3 [batch, frames, n_mels]: WeSpeaker and most Kaldi-lineage encoders
        #   expect pre-computed Kaldi FBank features.
        # We detect this at load time and branch in embed(), because feeding raw audio
        # into a rank-3 graph is exactly what triggered
        # "Invalid rank for input: feats Got: 2 Expected: 3".
        self._input_rank = len(input_meta.shape) if input_meta.shape is not None else 2
        self._expected_sr = int(options.get("sample_rate", "16000"))
        self._fbank_mels = int(options.get("fbank_num_mel_bins", "80"))
        self._fbank_frame_length_ms = float(options.get("fbank_frame_length_ms", "25"))
        self._fbank_frame_shift_ms = float(options.get("fbank_frame_shift_ms", "10"))
        # Per-utterance cepstral mean normalisation; on for WeSpeaker by default,
        # toggleable for encoders that expect raw FBank.
        self._fbank_cmn = options.get("fbank_cmn", "true").lower() in ("1", "true", "yes")
        self._analysis = AnalysisHead(options)
def _load_waveform(self, path: str):
@@ -344,11 +359,37 @@ class OnnxDirectEngine:
        import numpy as np
        audio = self._load_waveform(audio_path)
        if self._input_rank >= 3:
            feats = self._extract_fbank(audio)  # [frames, n_mels]
            feed = feats[np.newaxis, :, :]      # [1, frames, n_mels]
        else:
            feed = audio.reshape(1, -1)         # [1, samples]
        out = self._session.run(None, {self._input_name: feed})
        vec = np.asarray(out[0]).reshape(-1)
        return [float(x) for x in vec]

    def _extract_fbank(self, audio):
        """Compute Kaldi-style FBank features (80 mel bins by default) for
        speaker encoders that expect pre-featurised input (WeSpeaker, most
        3D-Speaker exports).
        torchaudio is already a backend dependency for SpeechBrain, so no new
        package is required."""
        import numpy as np
        import torch  # type: ignore
        import torchaudio.compliance.kaldi as kaldi  # type: ignore
        tensor = torch.from_numpy(audio).unsqueeze(0)  # [1, samples]
        feats = kaldi.fbank(
            tensor,
            sample_frequency=self._expected_sr,
            num_mel_bins=self._fbank_mels,
            frame_length=self._fbank_frame_length_ms,
            frame_shift=self._fbank_frame_shift_ms,
            dither=0.0,
        )  # [frames, n_mels]
        if self._fbank_cmn:
            feats = feats - feats.mean(dim=0, keepdim=True)
        return feats.numpy().astype(np.float32)

    def compare(self, audio1: str, audio2: str) -> float:
        return _cosine_distance(self.embed(audio1), self.embed(audio2))
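The rank-based branch and the per-utterance mean normalisation can be illustrated without ONNX Runtime or torchaudio. This is a dependency-free sketch under stated assumptions: `toy_featurise` is a hypothetical stand-in for the FBank extractor (it just pairs up samples), and nested lists stand in for arrays.

```python
def cmn(feats):
    """Subtract the per-dimension mean over frames (per-utterance CMN)."""
    if not feats:
        return []
    n_frames = len(feats)
    n_dims = len(feats[0])
    means = [sum(frame[d] for frame in feats) / n_frames for d in range(n_dims)]
    return [[frame[d] - means[d] for d in range(n_dims)] for frame in feats]


def build_feed(input_rank, samples, featurise, apply_cmn=True):
    """Shape the model input according to the graph's declared input rank.

    rank >= 3 -> [1, frames, n_dims] features; otherwise [1, samples] waveform.
    """
    if input_rank >= 3:
        feats = featurise(samples)
        if apply_cmn:
            feats = cmn(feats)
        return [feats]
    return [list(samples)]


def toy_featurise(samples):
    # Hypothetical stand-in for FBank: one 2-dim "frame" per pair of samples.
    return [[samples[i], samples[i + 1]] for i in range(0, len(samples) - 1, 2)]


waveform = [1.0, 2.0, 3.0, 4.0]
print(build_feed(2, waveform, toy_featurise))
print(build_feed(3, waveform, toy_featurise))
```

After normalisation each feature dimension has zero mean across frames, which is the property the `fbank_cmn` option toggles.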


@@ -81,18 +81,30 @@ func newApplication(appConfig *config.ApplicationConfig) *Application {
// The resolver closes over the ModelLoader so the Registry stays
// decoupled from loader plumbing; swapping in a postgres-backed
// implementation later is a single construction change here.
//
// `faceStoreName` is the default namespace passed to StoreBackend when
// the request doesn't override it. Face and voice MUST use distinct
// namespaces — the local-store gRPC surface rejects mixed dimensions
// inside one namespace ("Try to add key with length N when existing
// length is M"). ArcFace buffalo_l produces 512-dim embeddings while
// ECAPA-TDNN produces 192-dim; enrolling one after the other into a
// shared namespace is exactly how we hit that error.
const (
faceStoreName = "localai-face-biometrics"
voiceStoreName = "localai-voice-biometrics"
)
faceStoreResolver := func(_ context.Context, storeName string) (pkggrpc.Backend, error) {
return corebackend.StoreBackend(ml, appConfig, storeName, "")
}
app.faceRegistry = facerecognition.NewStoreRegistry(faceStoreResolver, faceStoreName, faceEmbeddingDim)
// Voice (speaker) recognition registry: same plumbing, separate
// namespace so embedding spaces stay isolated (a face vector and a
// speaker vector are not comparable and differ in dimensionality).
voiceStoreResolver := func(_ context.Context, storeName string) (pkggrpc.Backend, error) {
return corebackend.StoreBackend(ml, appConfig, storeName, "")
}
app.voiceRegistry = voicerecognition.NewStoreRegistry(voiceStoreResolver, voiceStoreName, voiceEmbeddingDim)
return app
}
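The mixed-dimension failure described in the comments above can be reproduced with a toy store. This Python sketch only models the behaviour the comment reports (one fixed key length per namespace); `Store` and `store_for` are hypothetical names, and the real local-store backend is not shown here.

```python
class Store:
    """Toy vector store that fixes its key length on first insert."""

    def __init__(self):
        self.key_len = None
        self.items = []

    def add(self, key, value):
        if self.key_len is None:
            self.key_len = len(key)
        elif len(key) != self.key_len:
            # Message format taken from the error quoted in the diff comment.
            raise ValueError(
                f"Try to add key with length {len(key)} "
                f"when existing length is {self.key_len}"
            )
        self.items.append((list(key), value))


stores = {}


def store_for(namespace):
    """One Store per namespace, created on first use."""
    return stores.setdefault(namespace, Store())


face_vec = [0.0] * 512   # ArcFace buffalo_l dimensionality, per the diff
voice_vec = [0.0] * 192  # ECAPA-TDNN dimensionality, per the diff

# Distinct namespaces: both enrolments succeed.
store_for("localai-face-biometrics").add(face_vec, "alice-face")
store_for("localai-voice-biometrics").add(voice_vec, "alice-voice")

# Shared namespace: the second enrolment is rejected.
try:
    store_for("shared").add(face_vec, "bob-face")
    store_for("shared").add(voice_vec, "bob-voice")
except ValueError as err:
    print(err)
```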


@@ -242,6 +242,12 @@ func New(opts ...config.AppOption) (*Application, error) {
bmFn := func() galleryop.BackendManager { return application.GalleryService().BackendManager() }
uc := NewUpgradeChecker(options, application.ModelLoader(), application.distributedDB(), bmFn)
application.upgradeChecker = uc
// Refresh the upgrade cache the moment a backend op finishes — otherwise
// the UI keeps showing a just-upgraded backend as upgradeable until the
// next 6-hour tick. TriggerCheck is non-blocking.
if gs := application.GalleryService(); gs != nil {
gs.OnBackendOpCompleted = uc.TriggerCheck
}
go uc.Run(options.Context)
}


@@ -11,8 +11,17 @@ func StoreBackend(sl *model.ModelLoader, appConfig *config.ApplicationConfig, st
if backend == "" {
backend = model.LocalStoreBackend
}
// ModelLoader caches backend processes by `modelID`, not by the `model`
// passed via WithModel. Without a distinct modelID, every StoreBackend
// call collapses to the same `modelID=""` cache slot — face (512-D) and
// voice (192-D) biometrics would then share the same local-store process
// and the second enrollment would fail with
// Try to add key with length N when existing length is M
// Use the store namespace as modelID so each namespace gets its own
// process instance and its own in-memory Store{}.
sc := []model.Option{
model.WithBackendString(backend),
model.WithModelID(storeName),
model.WithModel(storeName),
}
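The cache-collapse problem that the comment describes comes down to which key the cache uses. A minimal sketch, assuming a loader that memoises backend instances by `model_id` only; `Loader` and its fields are hypothetical and much simpler than the real ModelLoader.

```python
class Loader:
    """Toy loader that caches one backend instance per model_id."""

    def __init__(self):
        self._cache = {}
        self.spawned = 0

    def load(self, model_id, model):
        # The cache key is model_id alone; `model` does not participate.
        if model_id not in self._cache:
            self.spawned += 1
            self._cache[model_id] = {"model": model, "instance": self.spawned}
        return self._cache[model_id]


# Without a distinct model_id, both namespaces hit the same cache slot.
collapsed = Loader()
a = collapsed.load("", "localai-face-biometrics")
b = collapsed.load("", "localai-voice-biometrics")
print(a is b, collapsed.spawned)

# Using the store namespace as model_id gives each its own instance.
isolated = Loader()
c = isolated.load("localai-face-biometrics", "localai-face-biometrics")
d = isolated.load("localai-voice-biometrics", "localai-voice-biometrics")
print(c is d, isolated.spawned)
```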


@@ -37,6 +37,14 @@ var CacheTypeOptions = []FieldOption{
{Value: "q4_1", Label: "Q4_1"},
{Value: "q5_0", Label: "Q5_0"},
{Value: "q5_1", Label: "Q5_1"},
// TurboQuant KV-cache types — accepted by the turboquant and
// buun-llama-cpp fork backends; stock llama-cpp will reject them at load.
{Value: "turbo2", Label: "Turbo2 (TurboQuant)"},
{Value: "turbo3", Label: "Turbo3 (TurboQuant)"},
{Value: "turbo4", Label: "Turbo4 (TurboQuant)"},
// Trellis-Coded Quantization variants — buun-llama-cpp only.
{Value: "turbo2_tcq", Label: "Turbo2 TCQ (buun-llama-cpp)"},
{Value: "turbo3_tcq", Label: "Turbo3 TCQ (buun-llama-cpp)"},
}
var DiffusersPipelineOptions = []FieldOption{

View File

@@ -767,7 +767,7 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
}
if (u & FLAG_VAD) == FLAG_VAD {
-if c.Backend != "silero-vad" && !(c.Backend == "whisper" && slices.Contains(c.Options, "vad_only")) {
+if c.Backend != "silero-vad" && c.Backend != "sherpa-onnx" && !(c.Backend == "whisper" && slices.Contains(c.Options, "vad_only")) {
return false
}
}

View File

@@ -194,6 +194,20 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
name := config.Name
backendPath := filepath.Join(systemState.Backend.BackendsPath, name)
// Clean up legacy flat-layout artefacts: earlier dev builds of the
// golang backends dropped the compiled binary directly at
// `<backendsPath>/<name>` (a plain file) instead of
// `<backendsPath>/<name>/<name>` (the nested layout the current code
// expects). MkdirAll below returns ENOTDIR when such a stale file
// exists, permanently blocking any reinstall or upgrade. Remove the
// file first so the install can proceed; the new install will write
// the correct nested layout, including metadata.json + run.sh.
if fi, statErr := os.Lstat(backendPath); statErr == nil && !fi.IsDir() {
xlog.Warn("removing stale non-directory backend artefact to make room for fresh install", "path", backendPath)
if rmErr := os.Remove(backendPath); rmErr != nil {
return fmt.Errorf("failed to remove stale backend artefact at %s: %w", backendPath, rmErr)
}
}
err = os.MkdirAll(backendPath, 0750)
if err != nil {
return fmt.Errorf("failed to create base path: %v", err)
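
The cleanup sequence in this hunk (lstat the path, remove it if it is not a directory, then create the directory) can be sketched outside Go. The Node.js version below is only an illustration under stated assumptions: `prepareBackendDir` is a hypothetical name, not part of LocalAI, and the error codes Node reports differ from the Go ones described in the comment above.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical helper mirroring the Go logic: if a plain file (or symlink)
// occupies the backend path, remove it so the directory can be created.
// Returns true when a stale artefact was removed.
function prepareBackendDir(backendPath) {
  let removedStale = false
  let st = null
  try {
    st = fs.lstatSync(backendPath)
  } catch (_) {
    // Like the Go code, a failed lstat is ignored here; mkdir below
    // surfaces any real problem with the path.
  }
  if (st && !st.isDirectory()) {
    fs.unlinkSync(backendPath)
    removedStale = true
  }
  fs.mkdirSync(backendPath, { recursive: true, mode: 0o750 })
  return removedStale
}
```

Calling it twice on the same path is harmless: the second call finds a directory and removes nothing.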

View File

@@ -34,6 +34,7 @@ func (i *LlamaCPPImporter) AdditionalBackends() []KnownBackendEntry {
return []KnownBackendEntry{
{Name: "ik-llama-cpp", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with ik-quants"},
{Name: "turboquant", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with TurboQuant optimizations"},
{Name: "buun-llama-cpp", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with DFlash speculative decoding and TurboQuant/TCQ KV-cache quantization"},
}
}
@@ -127,7 +128,7 @@ func (i *LlamaCPPImporter) Import(details Details) (gallery.ModelConfig, error)
backend := "llama-cpp"
if b, ok := preferencesMap["backend"].(string); ok {
switch b {
-case "ik-llama-cpp", "turboquant":
+case "ik-llama-cpp", "turboquant", "buun-llama-cpp":
backend = b
}
}

View File

@@ -181,6 +181,23 @@ var _ = Describe("LlamaCPPImporter", func() {
Expect(modelConfig.Files[0].Filename).To(Equal("my-model.gguf"))
})
It("swaps the emitted backend to buun-llama-cpp when preferred", func() {
preferences := json.RawMessage(`{"backend": "buun-llama-cpp"}`)
details := Details{
URI: "https://example.com/my-model.gguf",
Preferences: preferences,
}
modelConfig, err := importer.Import(details)
Expect(err).ToNot(HaveOccurred())
Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: buun-llama-cpp"), fmt.Sprintf("Model config: %+v", modelConfig))
Expect(modelConfig.ConfigFile).NotTo(ContainSubstring("backend: llama-cpp\n"), fmt.Sprintf("Model config: %+v", modelConfig))
Expect(modelConfig.ConfigFile).To(ContainSubstring("model: my-model.gguf"), fmt.Sprintf("Model config: %+v", modelConfig))
Expect(len(modelConfig.Files)).To(Equal(1))
Expect(modelConfig.Files[0].Filename).To(Equal("my-model.gguf"))
})
It("keeps backend: llama-cpp for unknown backend preferences", func() {
// Unknown backend values must not leak into the emitted YAML —
// we only honour the curated drop-in replacements (ik-llama-cpp, turboquant, buun-llama-cpp).
@@ -375,7 +392,7 @@ var _ = Describe("LlamaCPPImporter", func() {
})
Context("AdditionalBackends", func() {
-It("advertises ik-llama-cpp and turboquant as drop-in replacements", func() {
+It("advertises ik-llama-cpp, turboquant, and buun-llama-cpp as drop-in replacements", func() {
entries := importer.AdditionalBackends()
names := make([]string, 0, len(entries))
@@ -384,7 +401,7 @@ var _ = Describe("LlamaCPPImporter", func() {
names = append(names, e.Name)
byName[e.Name] = e
}
-Expect(names).To(ConsistOf("ik-llama-cpp", "turboquant"))
+Expect(names).To(ConsistOf("ik-llama-cpp", "turboquant", "buun-llama-cpp"))
ik := byName["ik-llama-cpp"]
Expect(ik.Modality).To(Equal("text"))
@@ -393,6 +410,10 @@ var _ = Describe("LlamaCPPImporter", func() {
tq := byName["turboquant"]
Expect(tq.Modality).To(Equal("text"))
Expect(tq.Description).NotTo(BeEmpty())
bn := byName["buun-llama-cpp"]
Expect(bn.Modality).To(Equal("text"))
Expect(bn.Description).NotTo(BeEmpty())
})
})
})

View File

@@ -880,7 +880,7 @@ func convertAnthropicTools(input *schema.AnthropicRequest, cfg *config.ModelConf
if tcType, ok := tc["type"].(string); ok && tcType == "tool" {
if name, ok := tc["name"].(string); ok {
// Force specific tool
-cfg.SetFunctionCallString(name)
+cfg.SetFunctionCallNameString(name)
}
}
}

View File

@@ -14,7 +14,13 @@ import (
"github.com/mudler/LocalAI/pkg/utils"
)
-var audioDataURIPattern = regexp.MustCompile(`^data:([^;]+);base64,`)
+// Match `data:<mime>[;param=value...];base64,` — MediaRecorder in the browser
+// produces data URIs like `data:audio/webm;codecs=opus;base64,...`, so the
+// pre-`;base64,` section can contain zero or more parameter segments. The
+// old `([^;]+)` form only matched exactly one segment and left recordings
+// from the React UI's live-capture tab unparsed, which then failed base64
+// decoding on the leading `data:` bytes.
+var audioDataURIPattern = regexp.MustCompile(`^data:[^,]+?;base64,`)
var audioDownloadClient = http.Client{Timeout: 30 * time.Second}
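
To make the difference between the two patterns concrete, here is a small JavaScript sketch that applies the same expressions to a MediaRecorder-style data URI. The helper name `stripDataURIPrefix` is hypothetical and only illustrates the prefix stripping; it is not the Go handler's actual code.

```javascript
// The old and new patterns from the hunk above, written as JS literals.
const oldAudioDataURIPattern = /^data:([^;]+);base64,/
const newAudioDataURIPattern = /^data:[^,]+?;base64,/

// Hypothetical helper: return the base64 payload, or the input unchanged
// when it does not start with a base64 data-URI prefix.
function stripDataURIPrefix(input, pattern = newAudioDataURIPattern) {
  return input.replace(pattern, '')
}

// A data URI with an extra `codecs` parameter, as produced by MediaRecorder.
const recorded = 'data:audio/webm;codecs=opus;base64,AAAA'
const strippedNew = stripDataURIPrefix(recorded)
const strippedOld = stripDataURIPrefix(recorded, oldAudioDataURIPattern)
```

With the old pattern the prefix is left in place, which is the input that then fails base64 decoding.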

View File

@@ -1315,13 +1315,35 @@ func triggerResponse(ctx context.Context, session *Session, conv *Conversation,
}
thinkingStartToken := reasoning.DetectThinkingStartToken(template, &config.ReasoningConfig)
-reasoningText, responseWithoutReasoning := reasoning.ExtractReasoningWithConfig(rawResponse, thinkingStartToken, config.ReasoningConfig)
+// When the C++ autoparser emitted ChatDeltas with actionable data,
+// prefer them — the backend clears Reply.Message in that path and
+// delivers parsed content/reasoning/tool-calls via the delta stream
+// (see pkg/functions/chat_deltas.go, mirrored from chat.go's non-SSE
+// handling). Without this, Response is empty and realtime would
+// synthesize silence for replies that actually produced tokens.
+var reasoningText, responseWithoutReasoning, textContent, cleanedResponse string
+var toolCalls []functions.FuncCallResults
+deltaToolCalls := functions.ToolCallsFromChatDeltas(pred.ChatDeltas)
+deltaContent := functions.ContentFromChatDeltas(pred.ChatDeltas)
+deltaReasoning := functions.ReasoningFromChatDeltas(pred.ChatDeltas)
+if len(deltaToolCalls) > 0 || deltaContent != "" {
+xlog.Debug("[ChatDeltas] realtime: using C++ autoparser deltas",
+"tool_calls", len(deltaToolCalls),
+"content_len", len(deltaContent),
+"reasoning_len", len(deltaReasoning))
+reasoningText = deltaReasoning
+responseWithoutReasoning = deltaContent
+textContent = deltaContent
+cleanedResponse = deltaContent
+toolCalls = deltaToolCalls
+} else {
+reasoningText, responseWithoutReasoning = reasoning.ExtractReasoningWithConfig(rawResponse, thinkingStartToken, config.ReasoningConfig)
+textContent = functions.ParseTextContent(responseWithoutReasoning, config.FunctionsConfig)
+cleanedResponse = functions.CleanupLLMResult(responseWithoutReasoning, config.FunctionsConfig)
+toolCalls = functions.ParseFunctionCall(cleanedResponse, config.FunctionsConfig)
+}
 xlog.Debug("LLM Response", "reasoning", reasoningText, "response_without_reasoning", responseWithoutReasoning)
-textContent := functions.ParseTextContent(responseWithoutReasoning, config.FunctionsConfig)
-cleanedResponse := functions.CleanupLLMResult(responseWithoutReasoning, config.FunctionsConfig)
-toolCalls := functions.ParseFunctionCall(cleanedResponse, config.FunctionsConfig)
xlog.Debug("Function call parsing", "textContent", textContent, "cleanedResponse", cleanedResponse, "toolCallsCount", len(toolCalls))
noActionName := "answer"

View File

@@ -168,7 +168,7 @@ func (m *wrappedModel) Predict(ctx context.Context, messages schema.Messages, im
}
} else if toolChoice.Function != nil {
// Specific function specified
-m.LLMConfig.SetFunctionCallString(toolChoice.Function.Name)
+m.LLMConfig.SetFunctionCallNameString(toolChoice.Function.Name)
}
}

View File

@@ -773,7 +773,7 @@ func convertORToolsToFunctions(input *schema.OpenResponsesRequest, cfg *config.M
case map[string]any:
if tcType, ok := tc["type"].(string); ok && tcType == "function" {
if name, ok := tc["name"].(string); ok {
-cfg.SetFunctionCallString(name)
+cfg.SetFunctionCallNameString(name)
}
}
}

View File

@@ -24,6 +24,18 @@ const sections = [
{ path: '/app/quantize', icon: 'fas fa-compress', label: 'Quantize (Experimental)', feature: 'quantization' },
],
},
{
id: 'biometrics',
title: 'Biometrics',
featureMap: {
'/app/face': 'face_recognition',
'/app/voice': 'voice_recognition',
},
items: [
{ path: '/app/face', icon: 'fas fa-face-smile', label: 'Face Recognition', feature: 'face_recognition' },
{ path: '/app/voice', icon: 'fas fa-microphone-lines', label: 'Voice Recognition', feature: 'voice_recognition' },
],
},
{
id: 'agents',
title: 'Agents',

View File

@@ -0,0 +1,63 @@
import { useEffect, useRef, useState } from 'react'
// BoundingBoxCanvas — overlay face-detection rectangles on the user-supplied image.
// boxes: [{ x, y, w, h, label?, sublabel?, tone? }]
// tone: 'default' | 'success' | 'warning' | 'error' | 'accent'
export default function BoundingBoxCanvas({ src, boxes = [], alt = '' }) {
const wrapRef = useRef(null)
const imgRef = useRef(null)
const [dims, setDims] = useState({ w: 0, h: 0, natW: 0, natH: 0 })
useEffect(() => {
const update = () => {
if (!wrapRef.current || !imgRef.current) return
const rect = imgRef.current.getBoundingClientRect()
setDims({
w: rect.width,
h: rect.height,
natW: imgRef.current.naturalWidth || 1,
natH: imgRef.current.naturalHeight || 1,
})
}
update()
const ro = new ResizeObserver(update)
if (imgRef.current) ro.observe(imgRef.current)
window.addEventListener('resize', update)
return () => {
ro.disconnect()
window.removeEventListener('resize', update)
}
}, [src])
const sx = dims.natW ? dims.w / dims.natW : 1
const sy = dims.natH ? dims.h / dims.natH : 1
return (
<div ref={wrapRef} className="biometrics-bbox">
{src && <img ref={imgRef} src={src} alt={alt} onLoad={(e) => {
setDims({
w: e.target.getBoundingClientRect().width,
h: e.target.getBoundingClientRect().height,
natW: e.target.naturalWidth,
natH: e.target.naturalHeight,
})
}} />}
{boxes.map((b, i) => (
<div key={i} className={`biometrics-bbox__box tone-${b.tone || 'accent'}`}
style={{
left: `${b.x * sx}px`,
top: `${b.y * sy}px`,
width: `${b.w * sx}px`,
height: `${b.h * sy}px`,
}}>
{(b.label || b.sublabel) && (
<div className="biometrics-bbox__tag">
{b.label && <strong>{b.label}</strong>}
{b.sublabel && <span>{b.sublabel}</span>}
</div>
)}
</div>
))}
</div>
)
}
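
The coordinate mapping used by BoundingBoxCanvas can be looked at in isolation. The sketch below pulls the `sx`/`sy` arithmetic into a standalone function; `scaleBox` is a hypothetical name introduced here for illustration, and it assumes, as the component does, that boxes arrive in natural-image pixels.

```javascript
// Hypothetical helper: map a box given in natural-image pixels to the
// rendered size of the <img>, using the same sx/sy factors as the component.
function scaleBox(box, dims) {
  // Fall back to a 1:1 scale until the natural size is known.
  const sx = dims.natW ? dims.w / dims.natW : 1
  const sy = dims.natH ? dims.h / dims.natH : 1
  return {
    left: box.x * sx,
    top: box.y * sy,
    width: box.w * sx,
    height: box.h * sy,
  }
}

// A 640x480 image rendered at 320x240 halves every coordinate.
const scaledBox = scaleBox(
  { x: 100, y: 50, w: 200, h: 100 },
  { w: 320, h: 240, natW: 640, natH: 480 },
)
```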

View File

@@ -0,0 +1,33 @@
// DistributionBars — one horizontal bar per label, width proportional to value.
// distribution: Record<string, number> (values are probabilities 0..1 or any positive scale).
// dominant: string — highlighted row.
export default function DistributionBars({ title, distribution, dominant, icon }) {
if (!distribution || Object.keys(distribution).length === 0) return null
const entries = Object.entries(distribution).sort((a, b) => b[1] - a[1])
const max = entries.reduce((m, [, v]) => Math.max(m, v), 0) || 1
return (
<div className="biometrics-dist card">
<div className="biometrics-dist__head">
{icon && <i className={icon} aria-hidden="true" />}
<h3>{title}</h3>
{dominant && <span className="biometrics-dist__dominant">{dominant}</span>}
</div>
<ul className="biometrics-dist__rows">
{entries.map(([label, value]) => {
const pct = (value / max) * 100
const isDominant = label === dominant
return (
<li key={label} className={`biometrics-dist__row ${isDominant ? 'dominant' : ''}`}>
<span className="biometrics-dist__label">{label}</span>
<div className="biometrics-dist__bar-wrap" aria-hidden="true">
<div className="biometrics-dist__bar" style={{ width: `${pct}%` }} />
</div>
<span className="biometrics-dist__value">{(value * 100).toFixed(1)}%</span>
</li>
)
})}
</ul>
</div>
)
}
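
One detail of DistributionBars is easy to miss: bar widths are relative to the largest value, while the printed percentage is the raw value times 100. The sketch below separates that row computation from the markup; `distributionRows` is a hypothetical name used only for this illustration.

```javascript
// Hypothetical helper: sort labels by value (descending) and compute each
// bar width as a percentage of the largest value, as the component does.
function distributionRows(distribution) {
  const entries = Object.entries(distribution).sort((a, b) => b[1] - a[1])
  // `|| 1` avoids dividing by zero when every value is 0.
  const max = entries.reduce((m, [, v]) => Math.max(m, v), 0) || 1
  return entries.map(([label, value]) => ({
    label,
    value,
    pct: (value / max) * 100,
  }))
}

const emotionRows = distributionRows({ sad: 0.25, happy: 0.5, neutral: 0.125 })
```

Here the top row fills the full bar even though its value is 0.5, so the bars compare labels with each other rather than showing absolute probability.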

View File

@@ -0,0 +1,89 @@
import { useMemo, useRef, useEffect, useState } from 'react'
// EmbeddingInspector — compact visualization of a raw vector returned by /v1/face|voice/embed.
// embedding: number[] (can be large). dim: int. model: string.
export default function EmbeddingInspector({ embedding, dim, model, elapsedMs }) {
const canvasRef = useRef(null)
const [copied, setCopied] = useState(false)
const summary = useMemo(() => {
if (!embedding || !embedding.length) return null
let sum = 0, sumSq = 0, min = Infinity, max = -Infinity
for (const v of embedding) {
sum += v
sumSq += v * v
if (v < min) min = v
if (v > max) max = v
}
const mean = sum / embedding.length
const norm = Math.sqrt(sumSq)
return { mean, norm, min, max }
}, [embedding])
useEffect(() => {
if (!canvasRef.current || !embedding?.length) return
const canvas = canvasRef.current
const dpr = window.devicePixelRatio || 1
const cssW = canvas.clientWidth
const cssH = 60
canvas.width = Math.floor(cssW * dpr)
canvas.height = Math.floor(cssH * dpr)
const ctx = canvas.getContext('2d')
ctx.scale(dpr, dpr)
ctx.clearRect(0, 0, cssW, cssH)
const COUNT = Math.min(embedding.length, 128)
const values = embedding.slice(0, COUNT)
const max = Math.max(...values.map(Math.abs)) || 1
const mid = cssH / 2
const barW = cssW / COUNT
const accent = getComputedStyle(canvas).getPropertyValue('--color-accent').trim() || '#e8a87c'
const accentMuted = getComputedStyle(canvas).getPropertyValue('--color-text-muted').trim() || '#6c7084'
ctx.strokeStyle = accentMuted
ctx.beginPath()
ctx.moveTo(0, mid + 0.5)
ctx.lineTo(cssW, mid + 0.5)
ctx.stroke()
ctx.fillStyle = accent
for (let i = 0; i < COUNT; i++) {
const v = values[i]
const h = (Math.abs(v) / max) * (cssH * 0.45)
if (v >= 0) ctx.fillRect(i * barW, mid - h, Math.max(0.5, barW - 0.5), h)
else ctx.fillRect(i * barW, mid, Math.max(0.5, barW - 0.5), h)
}
}, [embedding])
if (!embedding) return null
const copy = async () => {
try {
await navigator.clipboard.writeText(JSON.stringify(embedding))
setCopied(true)
setTimeout(() => setCopied(false), 1500)
} catch (_) {
/* clipboard gated */
}
}
return (
<div className="biometrics-embed card">
<div className="biometrics-embed__head">
<div>
<div className="biometrics-embed__title">Embedding vector</div>
<div className="biometrics-embed__meta">
{dim != null && <span><strong>{dim}</strong> dims</span>}
{summary && <span>L2 <strong>{summary.norm.toFixed(3)}</strong></span>}
{summary && <span>range <strong>[{summary.min.toFixed(3)}, {summary.max.toFixed(3)}]</strong></span>}
{model && <span>model <code>{model}</code></span>}
{elapsedMs != null && <span>{elapsedMs.toFixed(0)} ms</span>}
</div>
</div>
<button type="button" className="btn btn-secondary btn-sm" onClick={copy}>
<i className={`fas ${copied ? 'fa-check' : 'fa-copy'}`} aria-hidden="true" />
{copied ? ' Copied' : ' Copy JSON'}
</button>
</div>
<canvas ref={canvasRef} style={{ width: '100%', height: 60 }} aria-label="Embedding sparkline (first 128 dimensions)" />
</div>
)
}
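
As a worked example of the summary shown in the inspector header, the sketch below runs the same single-pass statistics on a tiny vector. `summarizeEmbedding` is a hypothetical standalone version of the `useMemo` body above, not an exported function.

```javascript
// Hypothetical standalone version of the component's summary computation:
// one pass over the vector collecting sum, sum of squares, min and max.
function summarizeEmbedding(embedding) {
  if (!embedding || !embedding.length) return null
  let sum = 0, sumSq = 0, min = Infinity, max = -Infinity
  for (const v of embedding) {
    sum += v
    sumSq += v * v
    if (v < min) min = v
    if (v > max) max = v
  }
  return {
    mean: sum / embedding.length,
    norm: Math.sqrt(sumSq), // L2 norm; close to 1 for unit-normalized embeddings
    min,
    max,
  }
}

// The vector [3, -4] has L2 norm 5, mean -0.5 and range [-4, 3].
const tinySummary = summarizeEmbedding([3, -4])
```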

View File

@@ -0,0 +1,65 @@
// EnrollmentList — grid of enrolled subjects (face or voice).
// entries: [{ id, name, labels?, thumbnail?, registeredAt?, sampleUrl? }]
// mode: 'image' | 'audio' — controls the card visual.
export default function EnrollmentList({ entries, onDelete, mode = 'image', highlightId }) {
if (!entries || entries.length === 0) {
return (
<div className="biometrics-enroll__empty">
<i className={`fas ${mode === 'image' ? 'fa-user-plus' : 'fa-microphone-lines'}`} aria-hidden="true" />
<p>No one enrolled yet. Add a sample using the form on the left to start building your identification store.</p>
</div>
)
}
return (
<ul className="biometrics-enroll__grid" role="list">
{entries.map((e) => {
const highlight = e.id === highlightId
return (
<li key={e.id} className={`biometrics-enroll__card ${highlight ? 'highlight' : ''}`}>
<div className="biometrics-enroll__media">
{mode === 'image' && e.thumbnail
? <img src={e.thumbnail} alt="" />
: mode === 'audio' && e.sampleUrl
? <audio controls src={e.sampleUrl} />
: <div className="biometrics-enroll__initials" aria-hidden="true">{initials(e.name)}</div>}
</div>
<div className="biometrics-enroll__body">
<div className="biometrics-enroll__name">{e.name}</div>
{e.labels && Object.keys(e.labels).length > 0 && (
<ul className="biometrics-enroll__labels" aria-label="labels">
{Object.entries(e.labels).slice(0, 3).map(([k, v]) => (
<li key={k}><span>{k}</span>{v}</li>
))}
</ul>
)}
{e.registeredAt && (
<div className="biometrics-enroll__meta">
<i className="fas fa-clock" aria-hidden="true" /> {formatTime(e.registeredAt)}
</div>
)}
</div>
<button type="button" className="biometrics-enroll__delete" onClick={() => onDelete(e)}
aria-label={`Forget ${e.name}`} title="Forget this enrollment">
<i className="fas fa-trash" aria-hidden="true" />
</button>
</li>
)
})}
</ul>
)
}
function initials(name) {
if (!name) return '?'
return name.trim().split(/\s+/).map(p => p[0] || '').join('').slice(0, 2).toUpperCase()
}
function formatTime(ts) {
try {
const d = new Date(ts)
return d.toLocaleString()
} catch (_) {
return ts
}
}

View File

@@ -0,0 +1,46 @@
// MatchGauge — distance vs threshold as a single horizontal meter.
// distance, threshold numeric (cosine distance, lower = closer).
// Scale is 0 → max, where max = max(1.0, 2× threshold), so the threshold sits at or below the midpoint.
export default function MatchGauge({ distance, threshold, confidence, verified, label }) {
const max = Math.max(1.0, (threshold || 0.3) * 2)
const clamp = (v) => Math.max(0, Math.min(max, v))
const tPct = (clamp(threshold || 0) / max) * 100
const dPct = distance == null ? null : (clamp(distance) / max) * 100
const tone = verified ? 'success' : 'error'
return (
<div className={`biometrics-gauge tone-${tone}`} role="img"
aria-label={`${label || 'Match'}: ${verified ? 'match' : 'no match'} at distance ${distance?.toFixed?.(3) ?? '?'} (threshold ${threshold?.toFixed?.(3) ?? '?'})`}>
<div className="biometrics-gauge__head">
<div className="biometrics-gauge__verdict">
<i className={`fas ${verified ? 'fa-circle-check' : 'fa-circle-xmark'}`} aria-hidden="true" />
<span>{verified ? 'Match' : 'No match'}</span>
</div>
{confidence != null && (
<div className="biometrics-gauge__confidence">
<strong>{typeof confidence === 'number' ? confidence.toFixed(1) : confidence}</strong>
<span>confidence</span>
</div>
)}
</div>
<div className="biometrics-gauge__track" aria-hidden="true">
<div className="biometrics-gauge__zone biometrics-gauge__zone--match"
style={{ width: `${tPct}%` }} />
<div className="biometrics-gauge__zone biometrics-gauge__zone--miss"
style={{ left: `${tPct}%`, width: `${100 - tPct}%` }} />
<div className="biometrics-gauge__threshold" style={{ left: `${tPct}%` }}>
<span>threshold</span>
</div>
{dPct != null && (
<div className="biometrics-gauge__marker" style={{ left: `${dPct}%` }}>
<span>distance</span>
</div>
)}
</div>
<div className="biometrics-gauge__footer">
<span><em>distance</em> <code>{distance?.toFixed?.(4) ?? '—'}</code></span>
<span><em>threshold</em> <code>{threshold?.toFixed?.(4) ?? '—'}</code></span>
</div>
</div>
)
}
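
The placement of the threshold and distance markers follows from a few lines of arithmetic. The sketch below isolates it so the effect of different thresholds is easy to check; `gaugePositions` is a hypothetical name, and the 0.3 fallback is copied from the component.

```javascript
// Hypothetical helper reproducing MatchGauge's scale: the track runs from
// 0 to max(1.0, 2 * threshold), and both markers are clamped onto it.
function gaugePositions(distance, threshold) {
  const max = Math.max(1.0, (threshold || 0.3) * 2)
  const clamp = (v) => Math.max(0, Math.min(max, v))
  const tPct = (clamp(threshold || 0) / max) * 100
  const dPct = distance == null ? null : (clamp(distance) / max) * 100
  return { max, tPct, dPct }
}

// Small thresholds sit left of centre; thresholds of 0.5 or more sit at 50%.
const smallThreshold = gaugePositions(0.5, 0.25)
const largeThreshold = gaugePositions(3, 0.75)
```

A distance beyond the end of the track, as in the second call, is pinned to the right edge rather than drawn outside the gauge.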

View File

@@ -0,0 +1,179 @@
import { useEffect, useRef, useState } from 'react'
import { useMediaCapture } from '../../hooks/useMediaCapture'
import { fileToBase64 } from '../../utils/api'
// MediaInput — one control, three ways to supply a sample.
// mode: 'image' | 'audio'. onChange receives null | { base64, dataUrl, mime, source }.
function UnsupportedNotice({ mode }) {
// Detect the likely cause so we can tell the user what to do, instead of just "not supported".
const isSecure = typeof window !== 'undefined' && (window.isSecureContext ?? true)
const hostname = typeof window !== 'undefined' ? window.location.hostname : ''
const origin = typeof window !== 'undefined' ? window.location.origin : ''
const thing = mode === 'image' ? 'webcam' : 'microphone'
if (!isSecure) {
return (
<div className="biometrics-mediainput__notice">
<i className="fas fa-lock" aria-hidden="true" />
<div>
<strong>{thing} needs a secure origin</strong>
<p>
Your browser only exposes <code>getUserMedia</code> over HTTPS, <code>localhost</code>,
or <code>127.0.0.1</code>. You're on <code>{origin || hostname}</code>. Reach the UI
via <code>http://localhost:&lt;port&gt;</code> (or put a TLS terminator in front) and the
live {thing} will light up. Upload still works fine from here.
</p>
</div>
</div>
)
}
return (
<div className="biometrics-mediainput__notice">
<i className="fas fa-circle-info" aria-hidden="true" />
<div>
<strong>Live {thing} not available</strong>
<p>
This browser doesn't expose <code>navigator.mediaDevices.getUserMedia</code>. Try another
browser, or use the upload tab; the backend accepts either.
</p>
</div>
</div>
)
}
export default function MediaInput({ mode, label, value, onChange, idPrefix = 'media' }) {
const [tab, setTab] = useState('file') // 'file' | 'live'
const fileRef = useRef(null)
const cap = useMediaCapture(mode)
// Release the device when switching away from the live tab.
useEffect(() => {
if (tab !== 'live' && cap.active) cap.stop()
}, [tab]) // eslint-disable-line react-hooks/exhaustive-deps
const handleFile = async (e) => {
const f = e.target.files?.[0]
if (!f) { onChange(null); return }
const base64 = await fileToBase64(f)
const dataUrl = await new Promise((resolve) => {
const reader = new FileReader()
reader.onload = () => resolve(reader.result)
reader.readAsDataURL(f)
})
onChange({ base64, dataUrl, mime: f.type, source: 'file', name: f.name })
}
const handleSnap = () => {
const shot = cap.snap()
if (shot) onChange({ ...shot, source: 'live' })
}
const handleRecordToggle = async () => {
if (cap.recording) {
cap.stopRecording()
} else {
const pending = cap.startRecording()
if (!pending) return
const result = await pending
onChange({ ...result, source: 'live' })
}
}
const clear = () => {
onChange(null)
if (fileRef.current) fileRef.current.value = ''
}
const inputId = `${idPrefix}-${mode}-file`
return (
<div className="biometrics-mediainput">
{label && <label className="form-label" htmlFor={inputId}>{label}</label>}
<div className="biometrics-mediainput__tabs" role="tablist" aria-label={`${label || 'Media'} source`}>
<button type="button" role="tab" aria-selected={tab === 'file'}
className={`biometrics-mediainput__tab ${tab === 'file' ? 'active' : ''}`}
onClick={() => setTab('file')}>
<i className="fas fa-upload" aria-hidden="true" /> Upload
</button>
<button type="button" role="tab" aria-selected={tab === 'live'}
className={`biometrics-mediainput__tab ${tab === 'live' ? 'active' : ''}`}
onClick={() => setTab('live')}>
<i className={`fas ${mode === 'image' ? 'fa-camera' : 'fa-microphone'}`} aria-hidden="true" />
{mode === 'image' ? ' Webcam' : ' Record'}
</button>
</div>
{tab === 'file' && (
<div className="biometrics-mediainput__body">
<input
ref={fileRef}
id={inputId}
type="file"
className="input"
accept={mode === 'image' ? 'image/*' : 'audio/*'}
onChange={handleFile}
/>
</div>
)}
{tab === 'live' && (
<div className="biometrics-mediainput__body">
{!cap.supported && <UnsupportedNotice mode={mode} />}
{cap.supported && !cap.active && (
<button type="button" className="btn btn-secondary btn-full" onClick={cap.start}>
<i className={`fas ${mode === 'image' ? 'fa-camera' : 'fa-microphone'}`} aria-hidden="true" />
{mode === 'image' ? ' Start webcam' : ' Enable microphone'}
</button>
)}
{cap.active && mode === 'image' && (
<div className="biometrics-mediainput__live">
<video ref={cap.videoRef} autoPlay muted playsInline className="biometrics-mediainput__video" />
<div className="biometrics-mediainput__controls">
<button type="button" className="btn btn-primary" onClick={handleSnap}>
<i className="fas fa-circle-dot" aria-hidden="true" /> Capture
</button>
<button type="button" className="btn btn-secondary" onClick={cap.stop}>Stop</button>
</div>
</div>
)}
{cap.active && mode === 'audio' && (
<div className="biometrics-mediainput__live">
<div className={`biometrics-mediainput__meter ${cap.recording ? 'recording' : ''}`}>
<i className="fas fa-microphone" aria-hidden="true" />
<span>{cap.recording ? `Recording… ${cap.elapsed.toFixed(1)}s` : 'Microphone ready'}</span>
</div>
<div className="biometrics-mediainput__controls">
<button type="button" className={`btn ${cap.recording ? 'btn-secondary' : 'btn-primary'}`} onClick={handleRecordToggle}>
<i className={`fas ${cap.recording ? 'fa-stop' : 'fa-circle'}`} aria-hidden="true" />
{cap.recording ? ' Stop' : ' Record'}
</button>
<button type="button" className="btn btn-secondary" onClick={cap.stop} disabled={cap.recording}>Close</button>
</div>
</div>
)}
{cap.error && (
<p className="biometrics-mediainput__error" role="alert">{cap.error}</p>
)}
</div>
)}
{value && (
<div className="biometrics-mediainput__preview">
{mode === 'image'
? <img src={value.dataUrl} alt="" />
: <audio controls src={value.dataUrl} />}
<div className="biometrics-mediainput__preview-meta">
<span className="biometrics-mediainput__source-pill">
<i className={`fas ${value.source === 'live' ? (mode === 'image' ? 'fa-camera' : 'fa-microphone') : 'fa-file'}`} aria-hidden="true" />
{value.source === 'live' ? ' Captured' : ` ${value.name || 'Uploaded'}`}
</span>
<button type="button" className="biometrics-mediainput__clear" onClick={clear} aria-label="Remove sample">
<i className="fas fa-xmark" aria-hidden="true" />
</button>
</div>
</div>
)}
</div>
)
}

View File

@@ -0,0 +1,22 @@
export default function TabSwitch({ tabs, value, onChange }) {
return (
<div className="biometrics-tabs" role="tablist">
{tabs.map(t => {
const active = t.id === value
return (
<button
key={t.id}
role="tab"
type="button"
aria-selected={active}
className={`biometrics-tab ${active ? 'active' : ''}`}
onClick={() => onChange(t.id)}
>
{t.icon && <i className={`${t.icon}`} aria-hidden="true" />}
<span>{t.label}</span>
</button>
)
})}
</div>
)
}

View File

@@ -0,0 +1,99 @@
import { useEffect, useRef, useState } from 'react'
// WaveformStrip — decode an audio source (data URL or blob URL) via AudioContext,
// render a mono waveform, and overlay colored segment regions.
// segments: [{ start: seconds, end: seconds, label?, tone? }]
export default function WaveformStrip({ src, segments = [], height = 120 }) {
const canvasRef = useRef(null)
const [duration, setDuration] = useState(0)
const [peaks, setPeaks] = useState(null)
const [err, setErr] = useState(null)
useEffect(() => {
setPeaks(null)
setDuration(0)
setErr(null)
if (!src) return
let cancelled = false
async function decode() {
try {
const response = await fetch(src)
const buf = await response.arrayBuffer()
const Ctx = window.AudioContext || window.webkitAudioContext
const ctx = new Ctx()
const audioBuf = await ctx.decodeAudioData(buf.slice(0))
if (cancelled) { ctx.close(); return }
const data = audioBuf.getChannelData(0)
const BUCKETS = 480
const step = Math.max(1, Math.floor(data.length / BUCKETS))
const result = new Float32Array(BUCKETS)
for (let i = 0; i < BUCKETS; i++) {
let peak = 0
const start = i * step
const end = Math.min(start + step, data.length)
for (let j = start; j < end; j++) {
const v = Math.abs(data[j])
if (v > peak) peak = v
}
result[i] = peak
}
setPeaks(result)
setDuration(audioBuf.duration)
ctx.close()
} catch (e) {
if (!cancelled) setErr(e?.message || 'Could not decode audio')
}
}
decode()
return () => { cancelled = true }
}, [src])
useEffect(() => {
if (!canvasRef.current || !peaks) return
const canvas = canvasRef.current
const dpr = window.devicePixelRatio || 1
const cssW = canvas.clientWidth
const cssH = height
canvas.width = Math.floor(cssW * dpr)
canvas.height = Math.floor(cssH * dpr)
const ctx = canvas.getContext('2d')
ctx.scale(dpr, dpr)
ctx.clearRect(0, 0, cssW, cssH)
// Waveform
const accent = getComputedStyle(canvas).getPropertyValue('--biometrics-wave').trim() || '#e8a87c'
ctx.fillStyle = accent
const mid = cssH / 2
const barW = Math.max(1, cssW / peaks.length)
for (let i = 0; i < peaks.length; i++) {
const h = Math.max(1, peaks[i] * (cssH * 0.9))
ctx.fillRect(i * barW, mid - h / 2, Math.max(0.5, barW - 0.5), h)
}
}, [peaks, height])
if (err) return <div className="biometrics-waveform biometrics-waveform--error">{err}</div>
if (!src) return null
return (
<div className="biometrics-waveform" style={{ height }}>
<canvas ref={canvasRef} style={{ width: '100%', height: '100%' }} />
{duration > 0 && segments.map((s, i) => {
const left = (Math.max(0, s.start) / duration) * 100
const right = (Math.min(duration, s.end) / duration) * 100
return (
<div key={i} className={`biometrics-waveform__segment tone-${s.tone || 'accent'}`}
style={{ left: `${left}%`, width: `${Math.max(0.5, right - left)}%` }}>
{s.label && <span className="biometrics-waveform__seglabel">{s.label}</span>}
</div>
)
})}
{duration > 0 && (
<div className="biometrics-waveform__duration" aria-hidden="true">{duration.toFixed(1)}s</div>
)}
{!peaks && (
<div className="biometrics-waveform__loading">Decoding</div>
)}
</div>
)
}
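
The peak reduction inside `decode()` can be exercised without an AudioContext. The sketch below lifts the bucketing loop into a pure function; `computePeaks` is a hypothetical name, and the bucket count is a parameter here rather than the fixed 480 used by the component.

```javascript
// Hypothetical pure version of the bucketing loop: split the samples into
// `buckets` equal steps and keep the largest absolute value of each step.
function computePeaks(data, buckets) {
  const step = Math.max(1, Math.floor(data.length / buckets))
  const result = new Float32Array(buckets)
  for (let i = 0; i < buckets; i++) {
    let peak = 0
    const start = i * step
    const end = Math.min(start + step, data.length)
    for (let j = start; j < end; j++) {
      const v = Math.abs(data[j])
      if (v > peak) peak = v
    }
    result[i] = peak
  }
  return result
}

// Eight samples into four buckets: each bucket covers two samples.
const demoPeaks = computePeaks(
  Float32Array.from([0.5, -1, 0.25, 0, -0.75, 0.125, 0, 0]),
  4,
)
```

Because `step` is rounded down, samples past `buckets * step` are not visited, so the last fraction of a clip may not appear in the drawn waveform. For clips shorter than the bucket count, the trailing buckets stay at zero.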

View File

@@ -0,0 +1,205 @@
import { useCallback, useEffect, useRef, useState } from 'react'
// Encode an AudioBuffer as a 16-bit PCM mono WAV blob. Libsndfile (which the
// SpeechBrain / ONNX voice backends use) reads this shape without extra
// decoders. We downmix to mono because speaker-encoder models expect a single
// channel and sample-rate resampling is handled server-side.
function audioBufferToWavBlob(audioBuffer) {
const sampleRate = audioBuffer.sampleRate
const numFrames = audioBuffer.length
const bitsPerSample = 16
const blockAlign = bitsPerSample / 8 // mono, 1 channel
const byteRate = sampleRate * blockAlign
const dataSize = numFrames * blockAlign
const out = new ArrayBuffer(44 + dataSize)
const view = new DataView(out)
const writeAscii = (offset, s) => {
for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i))
}
writeAscii(0, 'RIFF')
view.setUint32(4, 36 + dataSize, true)
writeAscii(8, 'WAVE')
writeAscii(12, 'fmt ')
view.setUint32(16, 16, true) // fmt chunk size
view.setUint16(20, 1, true) // PCM
view.setUint16(22, 1, true) // mono
view.setUint32(24, sampleRate, true)
view.setUint32(28, byteRate, true)
view.setUint16(32, blockAlign, true)
view.setUint16(34, bitsPerSample, true)
writeAscii(36, 'data')
view.setUint32(40, dataSize, true)
// Average all input channels into mono, then clamp + convert to int16.
const numChannels = audioBuffer.numberOfChannels
const channels = []
for (let c = 0; c < numChannels; c++) channels.push(audioBuffer.getChannelData(c))
let offset = 44
for (let i = 0; i < numFrames; i++) {
let sum = 0
for (let c = 0; c < numChannels; c++) sum += channels[c][i]
const mono = Math.max(-1, Math.min(1, sum / numChannels))
view.setInt16(offset, mono < 0 ? mono * 0x8000 : mono * 0x7FFF, true)
offset += 2
}
return new Blob([out], { type: 'audio/wav' })
}
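
Two pieces of the encoder above are worth checking by hand: the sizes written into the header and the float-to-int16 conversion. The sketch below restates both as small pure functions. `wavLayout` and `frameToInt16` are hypothetical names used for illustration; the `Math.trunc` call stands in for the truncation that `DataView.setInt16` applies to fractional values.

```javascript
// Hypothetical helper: sizes for a 16-bit mono PCM WAV with a 44-byte header.
function wavLayout(sampleRate, numFrames) {
  const blockAlign = 2 // 16 bits per sample, one channel
  const dataSize = numFrames * blockAlign
  return {
    byteRate: sampleRate * blockAlign,
    dataSize,
    riffSize: 36 + dataSize, // value stored at byte offset 4
    fileSize: 44 + dataSize,
  }
}

// Hypothetical helper: average one frame across channels, clamp to [-1, 1],
// and scale asymmetrically so -1 maps to -32768 and +1 maps to 32767.
function frameToInt16(samples) {
  let sum = 0
  for (const s of samples) sum += s
  const mono = Math.max(-1, Math.min(1, sum / samples.length))
  return Math.trunc(mono < 0 ? mono * 0x8000 : mono * 0x7FFF)
}

// One second of 16 kHz audio needs 32000 data bytes plus the header.
const oneSecond = wavLayout(16000, 16000)
```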
// useMediaCapture — wraps getUserMedia + MediaRecorder for the biometrics pages.
// mode: 'image' streams video-only for a snap-to-canvas; 'audio' records a clip via MediaRecorder.
// Consumers attach the returned videoRef to a <video autoPlay muted playsInline/> element.
export function useMediaCapture(mode) {
const [active, setActive] = useState(false)
const [recording, setRecording] = useState(false)
const [error, setError] = useState(null)
const [elapsed, setElapsed] = useState(0)
const streamRef = useRef(null)
const videoRef = useRef(null)
const recorderRef = useRef(null)
const chunksRef = useRef([])
const tickRef = useRef(null)
const resolveStopRef = useRef(null)
const supported = typeof navigator !== 'undefined' && !!navigator.mediaDevices?.getUserMedia
const stopStream = useCallback(() => {
if (tickRef.current) {
clearInterval(tickRef.current)
tickRef.current = null
}
if (streamRef.current) {
streamRef.current.getTracks().forEach(t => { try { t.stop() } catch (_) { /* ignore */ } })
streamRef.current = null
}
if (videoRef.current) {
try { videoRef.current.srcObject = null } catch (_) { /* ignore */ }
}
setActive(false)
setRecording(false)
setElapsed(0)
}, [])
const start = useCallback(async () => {
if (!supported) {
setError('Your browser does not support media capture.')
return
}
setError(null)
try {
const constraints = mode === 'audio'
? { audio: true }
: { video: { facingMode: 'user', width: { ideal: 640 }, height: { ideal: 480 } } }
const stream = await navigator.mediaDevices.getUserMedia(constraints)
streamRef.current = stream
// Attachment happens in the useEffect below — videoRef.current is still
// null at this point because the <video> element mounts only after React
// processes the setActive(true) state change.
setActive(true)
} catch (e) {
setError(e?.message || 'Could not access device')
stopStream()
}
}, [mode, supported, stopStream])
// Hook the stream into the <video> once both the stream and the element exist.
useEffect(() => {
if (mode !== 'image' || !active) return
const v = videoRef.current
const s = streamRef.current
if (!v || !s) return
if (v.srcObject !== s) v.srcObject = s
const playPromise = v.play()
if (playPromise && typeof playPromise.catch === 'function') {
playPromise.catch(() => { /* autoplay gated */ })
}
}, [active, mode])
// Snap a frame from the live video stream to a PNG base64 (image mode).
const snap = useCallback(() => {
if (mode !== 'image' || !videoRef.current || !streamRef.current) return null
const v = videoRef.current
const w = v.videoWidth || 640
const h = v.videoHeight || 480
const canvas = document.createElement('canvas')
canvas.width = w
canvas.height = h
const ctx = canvas.getContext('2d')
ctx.drawImage(v, 0, 0, w, h)
const dataUrl = canvas.toDataURL('image/png')
const base64 = dataUrl.split(',')[1] || ''
return { base64, dataUrl, mime: 'image/png' }
}, [mode])
// Start an audio recording — returns a promise that resolves with a WAV-encoded
// {base64, blob, dataUrl, mime} on stopRecording. Transcoding to 16-bit PCM mono
// WAV is necessary because the voice backends open the file via libsndfile, which
// doesn't handle WebM/Ogg-Opus containers — the browser's native MediaRecorder
// output — out of the box.
const startRecording = useCallback(() => {
if (mode !== 'audio' || !streamRef.current) return null
chunksRef.current = []
const recMime = (typeof MediaRecorder !== 'undefined' && MediaRecorder.isTypeSupported('audio/webm;codecs=opus'))
? 'audio/webm;codecs=opus'
: 'audio/webm'
let rec
try {
rec = new MediaRecorder(streamRef.current, { mimeType: recMime })
} catch (_) {
rec = new MediaRecorder(streamRef.current)
}
recorderRef.current = rec
rec.ondataavailable = (e) => { if (e.data && e.data.size > 0) chunksRef.current.push(e.data) }
const donePromise = new Promise((resolve, reject) => {
resolveStopRef.current = resolve
rec.onstop = async () => {
try {
const recBlob = new Blob(chunksRef.current, { type: rec.mimeType || recMime })
const arrayBuf = await recBlob.arrayBuffer()
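const Ctx = window.AudioContext || window.webkitAudioContext
const ctx = new Ctx()
let wavBlob
try {
const audioBuf = await ctx.decodeAudioData(arrayBuf.slice(0))
wavBlob = audioBufferToWavBlob(audioBuf)
} finally {
// Release the AudioContext even when decoding fails.
try { ctx.close() } catch (_) { /* ignore */ }
}
const dataUrl = await new Promise((res, rej) => {
const reader = new FileReader()
reader.onload = () => res(reader.result)
// Without an error handler a failed read would leave the promise pending forever.
reader.onerror = () => rej(reader.error)
reader.readAsDataURL(wavBlob)
})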
const base64 = typeof dataUrl === 'string' ? (dataUrl.split(',')[1] || '') : ''
resolve({ blob: wavBlob, base64, dataUrl, mime: 'audio/wav' })
} catch (err) {
reject(err)
} finally {
resolveStopRef.current = null
}
}
})
rec.start()
setRecording(true)
setElapsed(0)
const started = Date.now()
tickRef.current = setInterval(() => setElapsed((Date.now() - started) / 1000), 100)
return donePromise
}, [mode])
const stopRecording = useCallback(() => {
if (recorderRef.current && recorderRef.current.state !== 'inactive') {
recorderRef.current.stop()
}
if (tickRef.current) {
clearInterval(tickRef.current)
tickRef.current = null
}
setRecording(false)
}, [])
// Cleanup on unmount — always release the device.
useEffect(() => () => stopStream(), [stopStream])
return {
supported, active, recording, error, elapsed,
videoRef, start, stop: stopStream, snap, startRecording, stopRecording,
}
}

View File

@@ -0,0 +1,602 @@
import { useEffect, useMemo, useState } from 'react'
import { useOutletContext, useParams } from 'react-router-dom'
import ModelSelector from '../components/ModelSelector'
import LoadingSpinner from '../components/LoadingSpinner'
import ErrorWithTraceLink from '../components/ErrorWithTraceLink'
import TabSwitch from '../components/biometrics/TabSwitch'
import MediaInput from '../components/biometrics/MediaInput'
import BoundingBoxCanvas from '../components/biometrics/BoundingBoxCanvas'
import MatchGauge from '../components/biometrics/MatchGauge'
import DistributionBars from '../components/biometrics/DistributionBars'
import EnrollmentList from '../components/biometrics/EnrollmentList'
import EmbeddingInspector from '../components/biometrics/EmbeddingInspector'
import { CAP_FACE_RECOGNITION } from '../utils/capabilities'
import { faceApi } from '../utils/api'
const TABS = [
{ id: 'analyze', icon: 'fas fa-chart-column', label: 'Analyze' },
{ id: 'compare', icon: 'fas fa-people-arrows', label: 'Compare' },
{ id: 'enroll', icon: 'fas fa-id-card', label: 'Enrollment' },
{ id: 'embed', icon: 'fas fa-code', label: 'Embedding' },
]
const ENROLL_KEY = 'localai_face_enrollments'
function loadEnrollments() {
try {
const raw = localStorage.getItem(ENROLL_KEY)
if (!raw) return []
const parsed = JSON.parse(raw)
return Array.isArray(parsed) ? parsed : []
} catch (_) { return [] }
}
function saveEnrollments(list) {
try { localStorage.setItem(ENROLL_KEY, JSON.stringify(list.slice(0, 50))) } catch (_) { /* quota */ }
}
// Parse a textarea of "key: value" lines into a { key: value } object.
// Lines without a colon or with an empty key are skipped.
function parseLabels(text) {
const out = {}
if (!text) return out
for (const line of text.split('\n')) {
const idx = line.indexOf(':')
if (idx === -1) continue
const k = line.slice(0, idx).trim()
const v = line.slice(idx + 1).trim()
if (k) out[k] = v
}
return out
}
export default function FaceRecognition() {
const { model: urlModel } = useParams()
const { addToast } = useOutletContext()
const [model, setModel] = useState(urlModel || '')
const [tab, setTab] = useState('analyze')
return (
<div className="biometrics-page">
<header className="biometrics-page__header">
<div>
<h1 className="page-title"><i className="fas fa-face-smile" aria-hidden="true" /> Face Recognition</h1>
<p className="page-subtitle">Compare, identify, and analyze faces using any face model installed on this LocalAI instance. Samples never leave your machine; they go only to the running backend.</p>
</div>
<div className="biometrics-page__model">
<label className="form-label" htmlFor="face-model">Model</label>
<ModelSelector value={model} onChange={setModel} capability={CAP_FACE_RECOGNITION} />
</div>
</header>
<TabSwitch tabs={TABS} value={tab} onChange={setTab} />
<div className="biometrics-page__body">
{tab === 'analyze' && <AnalyzeTab model={model} addToast={addToast} />}
{tab === 'compare' && <CompareTab model={model} addToast={addToast} />}
{tab === 'enroll' && <EnrollTab model={model} addToast={addToast} />}
{tab === 'embed' && <EmbedTab model={model} addToast={addToast} />}
</div>
</div>
)
}
// ──────────────────────────── Analyze ────────────────────────────
function AnalyzeTab({ model, addToast }) {
const [img, setImg] = useState(null)
const [actions, setActions] = useState({ age: true, gender: true, emotion: true, race: true })
const [antiSpoofing, setAntiSpoofing] = useState(false)
const [loading, setLoading] = useState(false)
const [error, setError] = useState(null)
const [result, setResult] = useState(null)
const [focusIdx, setFocusIdx] = useState(0)
const submit = async (e) => {
e.preventDefault()
if (!model) { addToast('Select a face model first', 'warning'); return }
if (!img) { addToast('Add an image to analyze', 'warning'); return }
setLoading(true); setError(null); setResult(null); setFocusIdx(0)
try {
const body = {
model,
img: img.dataUrl,
actions: Object.entries(actions).filter(([, v]) => v).map(([k]) => k),
anti_spoofing: antiSpoofing,
}
const data = await faceApi.analyze(body)
setResult(data)
if (!data?.faces?.length) addToast('No face detected in the image', 'warning')
} catch (err) {
setError(err.message)
} finally {
setLoading(false)
}
}
const boxes = useMemo(() => (result?.faces || []).map((f, i) => ({
x: f.region.x, y: f.region.y, w: f.region.w, h: f.region.h,
label: f.dominant_emotion || f.dominant_gender || `Face ${i + 1}`,
sublabel: f.age ? `~${Math.round(f.age)}y` : null,
tone: i === focusIdx ? 'accent' : 'default',
})), [result, focusIdx])
const faces = result?.faces || []
const focus = faces[focusIdx]
return (
<form className="biometrics-twocol" onSubmit={submit}>
<aside className="biometrics-panel">
<h2 className="biometrics-panel__title">Analyze a face</h2>
<MediaInput mode="image" label="Source image" value={img} onChange={setImg} idPrefix="face-analyze" />
<fieldset className="biometrics-fieldset">
<legend>Attributes</legend>
<div className="biometrics-chipset" role="group">
{['age', 'gender', 'emotion', 'race'].map(k => (
<label key={k} className={`biometrics-chip ${actions[k] ? 'active' : ''}`}>
<input type="checkbox" checked={actions[k]} onChange={(e) => setActions(a => ({ ...a, [k]: e.target.checked }))} />
<span>{k}</span>
</label>
))}
</div>
</fieldset>
<div className="form-row">
<div className="form-row__label">
<span className="form-row__label-text">Anti-spoofing</span>
<span className="form-row__hint">Reject photos-of-photos (requires model support).</span>
</div>
<label className="biometrics-switch">
<input type="checkbox" checked={antiSpoofing} onChange={(e) => setAntiSpoofing(e.target.checked)} />
<span aria-hidden="true" />
</label>
</div>
<button type="submit" className="btn btn-primary btn-full" disabled={loading || !img}>
{loading ? <><LoadingSpinner size="sm" /> Analyzing</> : <><i className="fas fa-wand-magic-sparkles" /> Analyze</>}
</button>
</aside>
<section className="biometrics-results">
{loading && <div className="biometrics-empty"><LoadingSpinner size="lg" /></div>}
{error && <ErrorWithTraceLink message={error} />}
{!loading && !error && !result && (
<EmptyState icon="fas fa-face-smile"
title="Drop a portrait to analyze"
body="The backend will detect each face and return age, gender, emotion, and race distributions — with an optional liveness check." />
)}
{result && img && (
<>
<div className="biometrics-split">
<div className="biometrics-split__media">
<BoundingBoxCanvas src={img.dataUrl} boxes={boxes} alt="Analyzed source" />
{faces.length > 1 && (
<div className="biometrics-facepicker" role="group" aria-label="Select face">
{faces.map((_, i) => (
<button key={i} type="button"
className={`biometrics-facepicker__chip ${i === focusIdx ? 'active' : ''}`}
onClick={() => setFocusIdx(i)}
aria-pressed={i === focusIdx}>
Face {i + 1}
</button>
))}
</div>
)}
</div>
<div className="biometrics-split__aside">
{focus && (
<>
<div className="biometrics-summary card">
<div className="biometrics-summary__head">
<h3><i className="fas fa-user" /> Face {focusIdx + 1}</h3>
{antiSpoofing && <LivenessPill isReal={focus.is_real} score={focus.antispoof_score} />}
</div>
<dl className="biometrics-summary__grid">
{focus.age != null && <><dt>Age</dt><dd>~{Math.round(focus.age)}</dd></>}
{focus.dominant_gender && <><dt>Gender</dt><dd>{focus.dominant_gender}</dd></>}
{focus.dominant_emotion && <><dt>Emotion</dt><dd>{focus.dominant_emotion}</dd></>}
{focus.dominant_race && <><dt>Race</dt><dd>{focus.dominant_race}</dd></>}
{focus.face_confidence != null && <><dt>Detection</dt><dd>{(focus.face_confidence * 100).toFixed(1)}%</dd></>}
</dl>
</div>
<DistributionBars title="Gender" icon="fas fa-venus-mars" distribution={focus.gender} dominant={focus.dominant_gender} />
<DistributionBars title="Emotion" icon="fas fa-face-smile-beam" distribution={focus.emotion} dominant={focus.dominant_emotion} />
<DistributionBars title="Race" icon="fas fa-globe" distribution={focus.race} dominant={focus.dominant_race} />
</>
)}
</div>
</div>
<ResponseDetails data={result} />
</>
)}
</section>
</form>
)
}
// ──────────────────────────── Compare ────────────────────────────
function CompareTab({ model, addToast }) {
const [img1, setImg1] = useState(null)
const [img2, setImg2] = useState(null)
const [antiSpoofing, setAntiSpoofing] = useState(false)
const [threshold, setThreshold] = useState(null)
const [loading, setLoading] = useState(false)
const [error, setError] = useState(null)
const [result, setResult] = useState(null)
const submit = async (e) => {
e.preventDefault()
if (!model) { addToast('Select a face model first', 'warning'); return }
if (!img1 || !img2) { addToast('Add both images to compare', 'warning'); return }
setLoading(true); setError(null); setResult(null)
try {
const body = { model, img1: img1.dataUrl, img2: img2.dataUrl, anti_spoofing: antiSpoofing }
if (threshold != null) body.threshold = threshold
const data = await faceApi.verify(body)
setResult(data)
if (threshold == null && data?.threshold != null) setThreshold(data.threshold)
} catch (err) {
setError(err.message)
} finally {
setLoading(false)
}
}
// Re-compute verified locally when user drags the threshold slider post-response.
const effective = useMemo(() => {
if (!result) return null
const t = threshold ?? result.threshold
const verified = result.distance <= t
// Guard t === 0 (the slider minimum): 0 / 0 would otherwise yield NaN.
const confidence = t > 0 ? Math.max(0, Math.min(100, 100 * (1 - result.distance / t))) : 0
return { verified, confidence, threshold: t, distance: result.distance }
}, [result, threshold])
return (
<form className="biometrics-twocol" onSubmit={submit}>
<aside className="biometrics-panel">
<h2 className="biometrics-panel__title">Compare two faces</h2>
<MediaInput mode="image" label="First image" value={img1} onChange={setImg1} idPrefix="face-cmp-1" />
<MediaInput mode="image" label="Second image" value={img2} onChange={setImg2} idPrefix="face-cmp-2" />
<div className="form-row">
<div className="form-row__label">
<span className="form-row__label-text">Anti-spoofing</span>
<span className="form-row__hint">Flag photos-of-photos on either image.</span>
</div>
<label className="biometrics-switch">
<input type="checkbox" checked={antiSpoofing} onChange={(e) => setAntiSpoofing(e.target.checked)} />
<span aria-hidden="true" />
</label>
</div>
<button type="submit" className="btn btn-primary btn-full" disabled={loading || !img1 || !img2}>
{loading ? <><LoadingSpinner size="sm" /> Comparing</> : <><i className="fas fa-equals" /> Compare</>}
</button>
</aside>
<section className="biometrics-results">
{loading && <div className="biometrics-empty"><LoadingSpinner size="lg" /></div>}
{error && <ErrorWithTraceLink message={error} />}
{!loading && !error && !result && (
<EmptyState icon="fas fa-people-arrows"
title="Drop two images to compare"
body="The backend will extract an embedding for each face and report the cosine distance between them. A match is declared when the distance is at or below the threshold." />
)}
{result && effective && (
<>
<div className="biometrics-compare">
<div className="biometrics-compare__panel">
<div className="biometrics-compare__label">Image 1</div>
<BoundingBoxCanvas src={img1?.dataUrl}
boxes={result.img1_area ? [{ ...result.img1_area, label: result.img1_is_real === false ? 'Spoof' : null, tone: 'accent' }] : []} />
{antiSpoofing && result.img1_is_real != null && (
<LivenessPill isReal={result.img1_is_real} score={result.img1_antispoof_score} />
)}
</div>
<div className="biometrics-compare__center">
<MatchGauge
distance={effective.distance}
threshold={effective.threshold}
confidence={effective.confidence}
verified={effective.verified}
/>
<div className="biometrics-compare__threshold">
<label htmlFor="face-threshold">Threshold <code>{effective.threshold.toFixed(3)}</code></label>
<input id="face-threshold" type="range" min="0" max="1" step="0.005"
value={effective.threshold}
onChange={(e) => setThreshold(parseFloat(e.target.value))}
aria-describedby="face-threshold-hint" />
<p id="face-threshold-hint" className="biometrics-compare__hint">
Drag to see how the verdict changes. The backend default is <code>{result.threshold?.toFixed(3)}</code>.
</p>
</div>
</div>
<div className="biometrics-compare__panel">
<div className="biometrics-compare__label">Image 2</div>
<BoundingBoxCanvas src={img2?.dataUrl}
boxes={result.img2_area ? [{ ...result.img2_area, label: result.img2_is_real === false ? 'Spoof' : null, tone: 'accent' }] : []} />
{antiSpoofing && result.img2_is_real != null && (
<LivenessPill isReal={result.img2_is_real} score={result.img2_antispoof_score} />
)}
</div>
</div>
<ResponseDetails data={result} />
</>
)}
</section>
</form>
)
}
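// Illustrative sketch, not part of the original change: the local re-verification
// done in CompareTab, written as a pure function. The helper name is hypothetical,
// and treating a zero threshold as zero confidence is an assumption of this sketch.
function verdictForThresholdSketch(distance, threshold) {
  const verified = distance <= threshold
  // Distance as a fraction of the threshold: 0 is a perfect match, 1 is the cutoff.
  const ratio = threshold > 0 ? distance / threshold : Infinity
  const confidence = Math.max(0, Math.min(100, 100 * (1 - ratio)))
  return { verified, confidence }
}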
// ──────────────────────────── Enrollment (register / identify / forget) ────────────────────────────
function EnrollTab({ model, addToast }) {
const [enrolled, setEnrolled] = useState(loadEnrollments)
const [enrollName, setEnrollName] = useState('')
const [enrollLabels, setEnrollLabels] = useState('')
const [enrollImg, setEnrollImg] = useState(null)
const [enrolling, setEnrolling] = useState(false)
const [enrollErr, setEnrollErr] = useState(null)
const [lastEnrolled, setLastEnrolled] = useState(null)
const [probeImg, setProbeImg] = useState(null)
const [topK, setTopK] = useState(5)
const [threshold, setThreshold] = useState(0.35)
const [identifying, setIdentifying] = useState(false)
const [identifyErr, setIdentifyErr] = useState(null)
const [identifyResult, setIdentifyResult] = useState(null)
useEffect(() => { saveEnrollments(enrolled) }, [enrolled])
const enroll = async (e) => {
e.preventDefault()
if (!model) { addToast('Select a face model first', 'warning'); return }
if (!enrollName.trim()) { addToast('Give this person a name', 'warning'); return }
if (!enrollImg) { addToast('Add a sample image', 'warning'); return }
setEnrolling(true); setEnrollErr(null)
try {
const labels = parseLabels(enrollLabels)
const data = await faceApi.register({
model,
name: enrollName.trim(),
img: enrollImg.dataUrl,
labels,
})
const entry = {
id: data.id,
name: data.name,
labels,
thumbnail: enrollImg.dataUrl,
registeredAt: data.registered_at || new Date().toISOString(),
}
setEnrolled(prev => [entry, ...prev])
setLastEnrolled(entry.id)
setEnrollName(''); setEnrollLabels(''); setEnrollImg(null)
addToast(`Enrolled ${entry.name}`, 'success')
} catch (err) {
setEnrollErr(err.message)
} finally {
setEnrolling(false)
}
}
const forget = async (entry) => {
try {
await faceApi.forget({ id: entry.id })
setEnrolled(prev => prev.filter(e => e.id !== entry.id))
addToast(`Removed ${entry.name}`, 'info')
} catch (err) {
if (err.status === 404) {
setEnrolled(prev => prev.filter(e => e.id !== entry.id))
addToast(`${entry.name} was already gone from the backend store`, 'warning')
} else {
addToast(err.message, 'error')
}
}
}
const identify = async (e) => {
e.preventDefault()
if (!model) { addToast('Select a face model first', 'warning'); return }
if (!probeImg) { addToast('Add a probe image', 'warning'); return }
setIdentifying(true); setIdentifyErr(null); setIdentifyResult(null)
try {
const data = await faceApi.identify({
model,
img: probeImg.dataUrl,
top_k: topK,
threshold,
})
setIdentifyResult(data)
if (!data?.matches?.length) addToast('No matches above threshold', 'info')
} catch (err) {
setIdentifyErr(err.message)
} finally {
setIdentifying(false)
}
}
return (
<div className="biometrics-enrollgrid">
<section className="biometrics-enrollgrid__register card">
<h2 className="biometrics-panel__title"><i className="fas fa-user-plus" /> Enroll a face</h2>
<form onSubmit={enroll}>
<div className="form-group">
<label className="form-label" htmlFor="face-enroll-name">Name</label>
<input id="face-enroll-name" className="input" value={enrollName}
onChange={(e) => setEnrollName(e.target.value)} placeholder="e.g. Alice Johnson" />
</div>
<div className="form-group">
<label className="form-label" htmlFor="face-enroll-labels">Labels <span className="form-label__hint">(optional, one per line)</span></label>
<textarea id="face-enroll-labels" className="textarea" rows={2}
placeholder={"team: engineering\nfloor: 3"}
value={enrollLabels} onChange={(e) => setEnrollLabels(e.target.value)} />
</div>
<MediaInput mode="image" label="Sample image" value={enrollImg} onChange={setEnrollImg} idPrefix="face-enroll" />
<button type="submit" className="btn btn-primary btn-full" disabled={enrolling}>
{enrolling ? <><LoadingSpinner size="sm" /> Enrolling</> : <><i className="fas fa-plus" /> Enroll</>}
</button>
{enrollErr && <div className="biometrics-enrollgrid__err"><ErrorWithTraceLink message={enrollErr} /></div>}
</form>
</section>
<section className="biometrics-enrollgrid__identify card">
<h2 className="biometrics-panel__title"><i className="fas fa-magnifying-glass" /> Identify someone</h2>
<form onSubmit={identify}>
<MediaInput mode="image" label="Probe image" value={probeImg} onChange={setProbeImg} idPrefix="face-probe" />
<div className="form-grid-2col">
<div className="form-group">
<label className="form-label" htmlFor="face-topk">Top-K</label>
<input id="face-topk" type="number" min="1" max="25" className="input"
value={topK} onChange={(e) => setTopK(Math.min(25, Math.max(1, parseInt(e.target.value, 10) || 1)))} />
</div>
<div className="form-group">
<label className="form-label" htmlFor="face-threshold-id">Threshold</label>
<input id="face-threshold-id" type="number" min="0" max="1" step="0.01" className="input"
value={threshold} onChange={(e) => setThreshold(parseFloat(e.target.value) || 0)} />
</div>
</div>
<button type="submit" className="btn btn-primary btn-full" disabled={identifying || !probeImg}>
{identifying ? <><LoadingSpinner size="sm" /> Searching</> : <><i className="fas fa-magnifying-glass" /> Identify</>}
</button>
{identifyErr && <div className="biometrics-enrollgrid__err"><ErrorWithTraceLink message={identifyErr} /></div>}
{identifyResult && <MatchesList matches={identifyResult.matches || []} enrolled={enrolled} />}
</form>
</section>
<section className="biometrics-enrollgrid__list">
<div className="biometrics-enroll__head">
<h2 className="biometrics-panel__title"><i className="fas fa-id-card" /> Enrolled <span className="biometrics-enroll__count">{enrolled.length}</span></h2>
</div>
<EnrollmentList entries={enrolled} onDelete={forget} mode="image" highlightId={lastEnrolled} />
</section>
</div>
)
}
function MatchesList({ matches, enrolled }) {
if (!matches.length) {
return <div className="biometrics-matches__empty">No candidates above threshold.</div>
}
return (
<ul className="biometrics-matches" aria-label="Matches">
{matches.map((m, i) => {
const record = enrolled.find(e => e.id === m.id)
const conf = Math.max(0, Math.min(100, m.confidence ?? 0))
return (
<li key={m.id} className={`biometrics-matches__row ${m.match ? 'match' : 'miss'}`}>
<div className="biometrics-matches__rank">#{i + 1}</div>
<div className="biometrics-matches__avatar">
{record?.thumbnail
? <img src={record.thumbnail} alt="" />
: <span>{(m.name || '?').slice(0, 2).toUpperCase()}</span>}
</div>
<div className="biometrics-matches__body">
<div className="biometrics-matches__name">
<strong>{m.name || m.id}</strong>
{m.match ? <span className="biometrics-matches__badge match"><i className="fas fa-check" /> match</span>
: <span className="biometrics-matches__badge miss">below threshold</span>}
</div>
<div className="biometrics-matches__meter" aria-hidden="true">
<div className="biometrics-matches__fill" style={{ width: `${conf}%` }} />
</div>
<div className="biometrics-matches__meta">
<span>distance <code>{m.distance?.toFixed?.(4) ?? '—'}</code></span>
<span>confidence <code>{conf.toFixed(1)}%</code></span>
</div>
</div>
</li>
)
})}
</ul>
)
}
// ──────────────────────────── Embedding ────────────────────────────
function EmbedTab({ model, addToast }) {
const [img, setImg] = useState(null)
const [loading, setLoading] = useState(false)
const [error, setError] = useState(null)
const [result, setResult] = useState(null)
const [elapsedMs, setElapsedMs] = useState(null)
const submit = async (e) => {
e.preventDefault()
if (!model) { addToast('Select a face model first', 'warning'); return }
if (!img) { addToast('Add an image', 'warning'); return }
setLoading(true); setError(null); setResult(null)
const started = performance.now()
try {
const data = await faceApi.embed({ model, img: img.dataUrl })
setElapsedMs(performance.now() - started)
setResult(data)
} catch (err) {
setError(err.message)
} finally {
setLoading(false)
}
}
return (
<form className="biometrics-twocol" onSubmit={submit}>
<aside className="biometrics-panel">
<h2 className="biometrics-panel__title">Get a raw embedding</h2>
<p className="biometrics-panel__note">
Returns a single face embedding vector. This is the same representation the backend uses internally for verify, identify, and compare.
</p>
<MediaInput mode="image" label="Image" value={img} onChange={setImg} idPrefix="face-embed" />
<button type="submit" className="btn btn-primary btn-full" disabled={loading || !img}>
{loading ? <><LoadingSpinner size="sm" /> Embedding</> : <><i className="fas fa-code" /> Extract vector</>}
</button>
</aside>
<section className="biometrics-results">
{loading && <div className="biometrics-empty"><LoadingSpinner size="lg" /></div>}
{error && <ErrorWithTraceLink message={error} />}
{!loading && !error && !result && (
<EmptyState icon="fas fa-code"
title="Get a face embedding"
body="For developers — retrieve the raw vector for a face to store, search, or cluster outside of LocalAI." />
)}
{result && (
<EmbeddingInspector embedding={result.embedding} dim={result.dim} model={result.model} elapsedMs={elapsedMs} />
)}
</section>
</form>
)
}
// ──────────────────────────── Small shared bits ────────────────────────────
function LivenessPill({ isReal, score }) {
if (isReal == null) {
return <span className="biometrics-pill muted"><i className="fas fa-circle-question" /> Not checked</span>
}
return (
<span className={`biometrics-pill ${isReal ? 'good' : 'bad'}`}>
<i className={`fas ${isReal ? 'fa-user-shield' : 'fa-mask'}`} />
{isReal ? 'Real' : 'Spoof'}
{score != null && <small>{(score * 100).toFixed(0)}%</small>}
</span>
)
}
function EmptyState({ icon, title, body }) {
return (
<div className="biometrics-empty">
<i className={icon} aria-hidden="true" />
<h3>{title}</h3>
<p>{body}</p>
</div>
)
}
function ResponseDetails({ data }) {
return (
<details className="biometrics-response">
<summary><i className="fas fa-angle-right" aria-hidden="true" /> Raw response</summary>
<pre>{JSON.stringify(data, null, 2)}</pre>
</details>
)
}

View File

@@ -0,0 +1,543 @@
import { useEffect, useMemo, useState } from 'react'
import { useOutletContext, useParams } from 'react-router-dom'
import ModelSelector from '../components/ModelSelector'
import LoadingSpinner from '../components/LoadingSpinner'
import ErrorWithTraceLink from '../components/ErrorWithTraceLink'
import TabSwitch from '../components/biometrics/TabSwitch'
import MediaInput from '../components/biometrics/MediaInput'
import WaveformStrip from '../components/biometrics/WaveformStrip'
import MatchGauge from '../components/biometrics/MatchGauge'
import DistributionBars from '../components/biometrics/DistributionBars'
import EnrollmentList from '../components/biometrics/EnrollmentList'
import EmbeddingInspector from '../components/biometrics/EmbeddingInspector'
import { CAP_SPEAKER_RECOGNITION } from '../utils/capabilities'
import { voiceApi } from '../utils/api'
const TABS = [
{ id: 'analyze', icon: 'fas fa-wave-square', label: 'Analyze' },
{ id: 'compare', icon: 'fas fa-people-arrows', label: 'Compare' },
{ id: 'enroll', icon: 'fas fa-id-badge', label: 'Enrollment' },
{ id: 'embed', icon: 'fas fa-code', label: 'Embedding' },
]
const ENROLL_KEY = 'localai_voice_enrollments'
function loadEnrollments() {
try {
const raw = localStorage.getItem(ENROLL_KEY)
if (!raw) return []
const p = JSON.parse(raw)
return Array.isArray(p) ? p : []
} catch (_) { return [] }
}
function saveEnrollments(list) {
try { localStorage.setItem(ENROLL_KEY, JSON.stringify(list.slice(0, 50))) } catch (_) { /* quota */ }
}
// Parse a textarea of "key: value" lines into a { key: value } object.
function parseLabels(text) {
const out = {}
if (!text) return out
for (const line of text.split('\n')) {
const idx = line.indexOf(':')
if (idx === -1) continue
const k = line.slice(0, idx).trim()
const v = line.slice(idx + 1).trim()
if (k) out[k] = v
}
return out
}
const TONE_FOR_SEGMENT = ['accent', 'info', 'success', 'warning', 'data1', 'data2']
export default function VoiceRecognition() {
const { model: urlModel } = useParams()
const { addToast } = useOutletContext()
const [model, setModel] = useState(urlModel || '')
const [tab, setTab] = useState('analyze')
return (
<div className="biometrics-page">
<header className="biometrics-page__header">
<div>
<h1 className="page-title"><i className="fas fa-microphone-lines" aria-hidden="true" /> Voice Recognition</h1>
<p className="page-subtitle">
Compare, identify, and analyze speakers: the audio analog of face recognition. Record directly from your microphone or upload a clip.
</p>
</div>
<div className="biometrics-page__model">
<label className="form-label">Model</label>
<ModelSelector value={model} onChange={setModel} capability={CAP_SPEAKER_RECOGNITION} />
</div>
</header>
<TabSwitch tabs={TABS} value={tab} onChange={setTab} />
<div className="biometrics-page__body">
{tab === 'analyze' && <AnalyzeTab model={model} addToast={addToast} />}
{tab === 'compare' && <CompareTab model={model} addToast={addToast} />}
{tab === 'enroll' && <EnrollTab model={model} addToast={addToast} />}
{tab === 'embed' && <EmbedTab model={model} addToast={addToast} />}
</div>
</div>
)
}
// ──────────────────────────── Analyze ────────────────────────────
function AnalyzeTab({ model, addToast }) {
const [audio, setAudio] = useState(null)
const [actions, setActions] = useState({ age: true, gender: true, emotion: true })
const [loading, setLoading] = useState(false)
const [error, setError] = useState(null)
const [result, setResult] = useState(null)
const [focusIdx, setFocusIdx] = useState(0)
const submit = async (e) => {
e.preventDefault()
if (!model) { addToast('Select a speaker model first', 'warning'); return }
if (!audio) { addToast('Add an audio clip', 'warning'); return }
setLoading(true); setError(null); setResult(null); setFocusIdx(0)
try {
const data = await voiceApi.analyze({
model,
audio: audio.dataUrl,
actions: Object.entries(actions).filter(([, v]) => v).map(([k]) => k),
})
setResult(data)
if (!data?.segments?.length) addToast('No speech segments detected', 'warning')
} catch (err) {
setError(err.message)
} finally {
setLoading(false)
}
}
const segments = useMemo(() => result?.segments || [], [result])
const focus = segments[focusIdx]
const waveformSegments = useMemo(() => segments.map((s, i) => ({
start: s.start, end: s.end,
label: s.dominant_emotion || s.dominant_gender || `#${i + 1}`,
tone: i === focusIdx ? 'accent' : TONE_FOR_SEGMENT[i % TONE_FOR_SEGMENT.length],
})), [segments, focusIdx])
return (
<form className="biometrics-twocol" onSubmit={submit}>
<aside className="biometrics-panel">
<h2 className="biometrics-panel__title">Analyze a speaker</h2>
<MediaInput mode="audio" label="Audio clip" value={audio} onChange={setAudio} idPrefix="voice-analyze" />
<fieldset className="biometrics-fieldset">
<legend>Attributes</legend>
<div className="biometrics-chipset" role="group">
{['age', 'gender', 'emotion'].map(k => (
<label key={k} className={`biometrics-chip ${actions[k] ? 'active' : ''}`}>
<input type="checkbox" checked={actions[k]} onChange={(e) => setActions(a => ({ ...a, [k]: e.target.checked }))} />
<span>{k}</span>
</label>
))}
</div>
</fieldset>
<button type="submit" className="btn btn-primary btn-full" disabled={loading || !audio}>
{loading ? <><LoadingSpinner size="sm" /> Analyzing</> : <><i className="fas fa-wand-magic-sparkles" /> Analyze</>}
</button>
</aside>
<section className="biometrics-results">
{loading && <div className="biometrics-empty"><LoadingSpinner size="lg" /></div>}
{error && <ErrorWithTraceLink message={error} />}
{!loading && !error && !result && (
<EmptyState icon="fas fa-wave-square"
title="Record or upload a clip to analyze"
body="The backend will segment the audio by speaker turn and infer age, gender, and emotion per segment." />
)}
{result && audio && (
<>
<WaveformStrip src={audio.dataUrl} segments={waveformSegments} />
{segments.length > 1 && (
<div className="biometrics-facepicker" role="tablist" aria-label="Select segment">
{segments.map((s, i) => (
<button key={i} type="button"
className={`biometrics-facepicker__chip ${i === focusIdx ? 'active' : ''}`}
onClick={() => setFocusIdx(i)}
aria-pressed={i === focusIdx}>
#{i + 1} <small>{s.start.toFixed(1)}s–{s.end.toFixed(1)}s</small>
</button>
))}
</div>
)}
{focus && (
<div className="biometrics-split">
<div className="biometrics-split__aside" style={{ gridColumn: '1 / -1' }}>
<div className="biometrics-summary card">
<div className="biometrics-summary__head">
<h3><i className="fas fa-user" /> Segment {focusIdx + 1}
<small>· {focus.start.toFixed(2)}s – {focus.end.toFixed(2)}s</small>
</h3>
</div>
<dl className="biometrics-summary__grid">
{focus.age != null && <><dt>Age</dt><dd>~{Math.round(focus.age)}</dd></>}
{focus.dominant_gender && <><dt>Gender</dt><dd>{focus.dominant_gender}</dd></>}
{focus.dominant_emotion && <><dt>Emotion</dt><dd>{focus.dominant_emotion}</dd></>}
</dl>
</div>
<DistributionBars title="Gender" icon="fas fa-venus-mars" distribution={focus.gender} dominant={focus.dominant_gender} />
<DistributionBars title="Emotion" icon="fas fa-face-smile-beam" distribution={focus.emotion} dominant={focus.dominant_emotion} />
</div>
</div>
)}
<ResponseDetails data={result} />
</>
)}
</section>
</form>
)
}
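The `actions` state above is a map of attribute name to boolean, while the request body carries only the enabled names. A minimal sketch of that conversion, pulled out as a standalone function (the name `selectedActions` is hypothetical and does not appear in the diff):

```javascript
// Hypothetical helper mirroring the inline expression in AnalyzeTab's submit handler.
function selectedActions(actions) {
  return Object.entries(actions)
    .filter(([, enabled]) => enabled) // keep only the checked attributes
    .map(([name]) => name)            // drop the boolean, keep the key
}

// Keys come back in insertion order, so the list follows the state object.
console.log(selectedActions({ age: true, gender: false, emotion: true }))
```

Keeping this as a pure function would make the checkbox-to-payload mapping testable without rendering the component.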
// ──────────────────────────── Compare ────────────────────────────
function CompareTab({ model, addToast }) {
const [audio1, setAudio1] = useState(null)
const [audio2, setAudio2] = useState(null)
const [threshold, setThreshold] = useState(null)
const [loading, setLoading] = useState(false)
const [error, setError] = useState(null)
const [result, setResult] = useState(null)
const submit = async (e) => {
e.preventDefault()
if (!model) { addToast('Select a speaker model first', 'warning'); return }
if (!audio1 || !audio2) { addToast('Add both clips to compare', 'warning'); return }
setLoading(true); setError(null); setResult(null)
try {
const body = { model, audio1: audio1.dataUrl, audio2: audio2.dataUrl }
if (threshold != null) body.threshold = threshold
const data = await voiceApi.verify(body)
setResult(data)
if (threshold == null && data?.threshold) setThreshold(data.threshold)
} catch (err) {
setError(err.message)
} finally {
setLoading(false)
}
}
const effective = useMemo(() => {
if (!result) return null
const t = threshold ?? result.threshold
const verified = result.distance <= t
// Guard t <= 0 (the slider minimum) so 0 / 0 cannot produce NaN.
const confidence = t > 0 ? Math.max(0, Math.min(100, 100 * (1 - result.distance / t))) : (verified ? 100 : 0)
return { verified, confidence, threshold: t, distance: result.distance }
}, [result, threshold])
return (
<form className="biometrics-twocol" onSubmit={submit}>
<aside className="biometrics-panel">
<h2 className="biometrics-panel__title">Compare two voices</h2>
<MediaInput mode="audio" label="First clip" value={audio1} onChange={setAudio1} idPrefix="voice-cmp-1" />
<MediaInput mode="audio" label="Second clip" value={audio2} onChange={setAudio2} idPrefix="voice-cmp-2" />
<button type="submit" className="btn btn-primary btn-full" disabled={loading || !audio1 || !audio2}>
{loading ? <><LoadingSpinner size="sm" /> Comparing</> : <><i className="fas fa-equals" /> Compare</>}
</button>
</aside>
<section className="biometrics-results">
{loading && <div className="biometrics-empty"><LoadingSpinner size="lg" /></div>}
{error && <ErrorWithTraceLink message={error} />}
{!loading && !error && !result && (
<EmptyState icon="fas fa-people-arrows"
title="Drop two clips to compare"
body="We extract a speaker embedding for each clip and report the cosine distance — a match is declared when the distance is below the threshold." />
)}
{result && effective && (
<>
<div className="biometrics-compare biometrics-compare--voice">
<div className="biometrics-compare__panel">
<div className="biometrics-compare__label">Clip 1</div>
<WaveformStrip src={audio1?.dataUrl} height={80} />
</div>
<div className="biometrics-compare__center">
<MatchGauge
distance={effective.distance}
threshold={effective.threshold}
confidence={effective.confidence}
verified={effective.verified}
/>
<div className="biometrics-compare__threshold">
<label htmlFor="voice-threshold">Threshold <code>{effective.threshold.toFixed(3)}</code></label>
<input id="voice-threshold" type="range" min="0" max="1" step="0.005"
value={effective.threshold}
onChange={(e) => setThreshold(parseFloat(e.target.value))} />
<p className="biometrics-compare__hint">
Drag to see how the verdict changes. The backend default is <code>{result.threshold?.toFixed(3)}</code>.
</p>
</div>
</div>
<div className="biometrics-compare__panel">
<div className="biometrics-compare__label">Clip 2</div>
<WaveformStrip src={audio2?.dataUrl} height={80} />
</div>
</div>
<ResponseDetails data={result} />
</>
)}
</section>
</form>
)
}
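The verdict logic in `CompareTab` reduces to a small pure function of distance and threshold. The sketch below restates it outside React; the name `computeVerdict` and the handling of a non-positive threshold are assumptions for illustration, not code from the diff:

```javascript
// Hypothetical standalone version of the `effective` memo in CompareTab.
function computeVerdict(distance, threshold) {
  // A match is declared when the distance does not exceed the threshold.
  const verified = distance <= threshold
  // Confidence falls linearly from 100 at distance 0 to 0 at the threshold.
  // Assumption: for threshold <= 0 we skip the division and report 100 or 0.
  const raw = threshold > 0 ? 100 * (1 - distance / threshold) : (verified ? 100 : 0)
  const confidence = Math.max(0, Math.min(100, raw))
  return { verified, confidence, threshold, distance }
}

console.log(computeVerdict(0.1, 0.25)) // verified, confidence near 60
console.log(computeVerdict(0.4, 0.25)) // not verified, confidence clamped to 0
```

Because the slider only changes `threshold`, re-running this function is enough to update the gauge without another backend call.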
// ──────────────────────────── Enrollment ────────────────────────────
function EnrollTab({ model, addToast }) {
const [enrolled, setEnrolled] = useState(loadEnrollments)
const [enrollName, setEnrollName] = useState('')
const [enrollLabels, setEnrollLabels] = useState('')
const [enrollAudio, setEnrollAudio] = useState(null)
const [enrolling, setEnrolling] = useState(false)
const [enrollErr, setEnrollErr] = useState(null)
const [lastEnrolled, setLastEnrolled] = useState(null)
const [probeAudio, setProbeAudio] = useState(null)
const [topK, setTopK] = useState(5)
const [threshold, setThreshold] = useState(0.25)
const [identifying, setIdentifying] = useState(false)
const [identifyErr, setIdentifyErr] = useState(null)
const [identifyResult, setIdentifyResult] = useState(null)
useEffect(() => { saveEnrollments(enrolled) }, [enrolled])
const enroll = async (e) => {
e.preventDefault()
if (!model) { addToast('Select a speaker model first', 'warning'); return }
if (!enrollName.trim()) { addToast('Give this speaker a name', 'warning'); return }
if (!enrollAudio) { addToast('Add a sample clip', 'warning'); return }
setEnrolling(true); setEnrollErr(null)
try {
const data = await voiceApi.register({
model,
name: enrollName.trim(),
audio: enrollAudio.dataUrl,
labels: parseLabels(enrollLabels),
})
const entry = {
id: data.id,
name: data.name,
labels: parseLabels(enrollLabels),
sampleUrl: enrollAudio.dataUrl,
registeredAt: data.registered_at || new Date().toISOString(),
}
setEnrolled(prev => [entry, ...prev])
setLastEnrolled(entry.id)
setEnrollName(''); setEnrollLabels(''); setEnrollAudio(null)
addToast(`Enrolled ${entry.name}`, 'success')
} catch (err) {
setEnrollErr(err.message)
} finally {
setEnrolling(false)
}
}
const forget = async (entry) => {
try {
await voiceApi.forget({ id: entry.id })
setEnrolled(prev => prev.filter(e => e.id !== entry.id))
addToast(`Removed ${entry.name}`, 'info')
} catch (err) {
if (err.status === 404) {
setEnrolled(prev => prev.filter(e => e.id !== entry.id))
addToast(`${entry.name} was already gone from the backend store`, 'warning')
} else {
addToast(err.message, 'error')
}
}
}
const identify = async (e) => {
e.preventDefault()
if (!model) { addToast('Select a speaker model first', 'warning'); return }
if (!probeAudio) { addToast('Add a probe clip', 'warning'); return }
setIdentifying(true); setIdentifyErr(null); setIdentifyResult(null)
try {
const data = await voiceApi.identify({
model,
audio: probeAudio.dataUrl,
top_k: topK,
threshold,
})
setIdentifyResult(data)
if (!data?.matches?.length) addToast('No matches above threshold', 'info')
} catch (err) {
setIdentifyErr(err.message)
} finally {
setIdentifying(false)
}
}
return (
<div className="biometrics-enrollgrid">
<section className="biometrics-enrollgrid__register card">
<h2 className="biometrics-panel__title"><i className="fas fa-user-plus" /> Enroll a voice</h2>
<form onSubmit={enroll}>
<div className="form-group">
<label className="form-label" htmlFor="voice-enroll-name">Name</label>
<input id="voice-enroll-name" className="input" value={enrollName}
onChange={(e) => setEnrollName(e.target.value)} placeholder="e.g. Alice Johnson" />
</div>
<div className="form-group">
<label className="form-label" htmlFor="voice-enroll-labels">Labels <span className="form-label__hint">(optional, one per line)</span></label>
<textarea id="voice-enroll-labels" className="textarea" rows={2}
placeholder={"team: engineering\nrole: lead"}
value={enrollLabels} onChange={(e) => setEnrollLabels(e.target.value)} />
</div>
<MediaInput mode="audio" label="Sample clip" value={enrollAudio} onChange={setEnrollAudio} idPrefix="voice-enroll" />
<button type="submit" className="btn btn-primary btn-full" disabled={enrolling}>
{enrolling ? <><LoadingSpinner size="sm" /> Enrolling</> : <><i className="fas fa-plus" /> Enroll</>}
</button>
{enrollErr && <div className="biometrics-enrollgrid__err"><ErrorWithTraceLink message={enrollErr} /></div>}
</form>
</section>
<section className="biometrics-enrollgrid__identify card">
<h2 className="biometrics-panel__title"><i className="fas fa-magnifying-glass" /> Identify a speaker</h2>
<form onSubmit={identify}>
<MediaInput mode="audio" label="Probe clip" value={probeAudio} onChange={setProbeAudio} idPrefix="voice-probe" />
<div className="form-grid-2col">
<div className="form-group">
<label className="form-label" htmlFor="voice-topk">Top-K</label>
<input id="voice-topk" type="number" min="1" max="25" className="input"
value={topK} onChange={(e) => setTopK(parseInt(e.target.value, 10) || 1)} />
</div>
<div className="form-group">
<label className="form-label" htmlFor="voice-threshold-id">Threshold</label>
<input id="voice-threshold-id" type="number" min="0" max="1" step="0.01" className="input"
value={threshold} onChange={(e) => setThreshold(parseFloat(e.target.value) || 0)} />
</div>
</div>
<button type="submit" className="btn btn-primary btn-full" disabled={identifying || !probeAudio}>
{identifying ? <><LoadingSpinner size="sm" /> Searching</> : <><i className="fas fa-magnifying-glass" /> Identify</>}
</button>
{identifyErr && <div className="biometrics-enrollgrid__err"><ErrorWithTraceLink message={identifyErr} /></div>}
{identifyResult && <MatchesList matches={identifyResult.matches || []} enrolled={enrolled} />}
</form>
</section>
<section className="biometrics-enrollgrid__list">
<div className="biometrics-enroll__head">
<h2 className="biometrics-panel__title"><i className="fas fa-id-badge" /> Enrolled <span className="biometrics-enroll__count">{enrolled.length}</span></h2>
</div>
<EnrollmentList entries={enrolled} onDelete={forget} mode="audio" highlightId={lastEnrolled} />
</section>
</div>
)
}
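`EnrollTab` calls `parseLabels` on the textarea contents, but the helper itself is defined outside this hunk. Going by the placeholder (`team: engineering` / `role: lead`, one label per line), a plausible sketch is shown below; the return shape (a plain key/value object) and the handling of malformed lines are assumptions, and the real helper may differ:

```javascript
// Hypothetical sketch of parseLabels; the real helper is not shown in this diff.
function parseLabels(text) {
  const labels = {}
  for (const line of text.split('\n')) {
    const idx = line.indexOf(':')
    if (idx === -1) continue // assumption: skip lines without a "key: value" separator
    const key = line.slice(0, idx).trim()
    const value = line.slice(idx + 1).trim()
    if (key) labels[key] = value
  }
  return labels
}

console.log(parseLabels('team: engineering\nrole: lead'))
```

Splitting on the first colon only means values that themselves contain colons, such as URLs, are kept intact.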
function MatchesList({ matches, enrolled }) {
if (!matches.length) {
return <div className="biometrics-matches__empty">No candidates above threshold.</div>
}
return (
<ul className="biometrics-matches" aria-label="Matches">
{matches.map((m, i) => {
const record = enrolled.find(e => e.id === m.id)
const conf = Math.max(0, Math.min(100, m.confidence ?? 0))
return (
<li key={m.id} className={`biometrics-matches__row ${m.match ? 'match' : 'miss'}`}>
<div className="biometrics-matches__rank">#{i + 1}</div>
<div className="biometrics-matches__avatar">
<span>{(m.name || '?').slice(0, 2).toUpperCase()}</span>
</div>
<div className="biometrics-matches__body">
<div className="biometrics-matches__name">
<strong>{m.name || m.id}</strong>
{m.match ? <span className="biometrics-matches__badge match"><i className="fas fa-check" /> match</span>
: <span className="biometrics-matches__badge miss">below threshold</span>}
</div>
{record?.sampleUrl && (
<audio controls src={record.sampleUrl} className="biometrics-matches__preview" />
)}
<div className="biometrics-matches__meter" aria-hidden="true">
<div className="biometrics-matches__fill" style={{ width: `${conf}%` }} />
</div>
<div className="biometrics-matches__meta">
<span>distance <code>{m.distance?.toFixed?.(4) ?? '—'}</code></span>
<span>confidence <code>{conf.toFixed(1)}%</code></span>
</div>
</div>
</li>
)
})}
</ul>
)
}
// ──────────────────────────── Embedding ────────────────────────────
function EmbedTab({ model, addToast }) {
const [audio, setAudio] = useState(null)
const [loading, setLoading] = useState(false)
const [error, setError] = useState(null)
const [result, setResult] = useState(null)
const [elapsedMs, setElapsedMs] = useState(null)
const submit = async (e) => {
e.preventDefault()
if (!model) { addToast('Select a speaker model first', 'warning'); return }
if (!audio) { addToast('Add an audio clip', 'warning'); return }
setLoading(true); setError(null); setResult(null)
const started = performance.now()
try {
const data = await voiceApi.embed({ model, audio: audio.dataUrl })
setElapsedMs(performance.now() - started)
setResult(data)
} catch (err) {
setError(err.message)
} finally {
setLoading(false)
}
}
return (
<form className="biometrics-twocol" onSubmit={submit}>
<aside className="biometrics-panel">
<h2 className="biometrics-panel__title">Get a raw speaker embedding</h2>
<p className="biometrics-panel__note">
Returns a speaker-encoder vector: the same representation the backend uses internally for verify and identify.
</p>
<MediaInput mode="audio" label="Audio clip" value={audio} onChange={setAudio} idPrefix="voice-embed" />
<button type="submit" className="btn btn-primary btn-full" disabled={loading || !audio}>
{loading ? <><LoadingSpinner size="sm" /> Embedding</> : <><i className="fas fa-code" /> Extract vector</>}
</button>
</aside>
<section className="biometrics-results">
{loading && <div className="biometrics-empty"><LoadingSpinner size="lg" /></div>}
{error && <ErrorWithTraceLink message={error} />}
{!loading && !error && !result && (
<EmptyState icon="fas fa-code"
title="Get a speaker embedding"
body="For developers — retrieve the raw vector for a voice to store, search, or cluster outside of LocalAI." />
)}
{result && (
<EmbeddingInspector embedding={result.embedding} dim={result.dim} model={result.model} elapsedMs={elapsedMs} />
)}
</section>
</form>
)
}
function EmptyState({ icon, title, body }) {
return (
<div className="biometrics-empty">
<i className={icon} aria-hidden="true" />
<h3>{title}</h3>
<p>{body}</p>
</div>
)
}
function ResponseDetails({ data }) {
return (
<details className="biometrics-response">
<summary><i className="fas fa-angle-right" aria-hidden="true" /> Raw response</summary>
<pre>{JSON.stringify(data, null, 2)}</pre>
</details>
)
}


@@ -34,6 +34,8 @@ import Login from './pages/Login'
import FineTune from './pages/FineTune'
import Quantize from './pages/Quantize'
import Studio from './pages/Studio'
import FaceRecognition from './pages/FaceRecognition'
import VoiceRecognition from './pages/VoiceRecognition'
import Nodes from './pages/Nodes'
import NodeBackendLogs from './pages/NodeBackendLogs'
import NotFound from './pages/NotFound'
@@ -73,6 +75,10 @@ const appChildren = [
{ path: 'sound/:model', element: <Sound /> },
{ path: 'studio', element: <Studio /> },
{ path: 'talk', element: <Talk /> },
{ path: 'face', element: <Feature feature="face_recognition"><FaceRecognition /></Feature> },
{ path: 'face/:model', element: <Feature feature="face_recognition"><FaceRecognition /></Feature> },
{ path: 'voice', element: <Feature feature="voice_recognition"><VoiceRecognition /></Feature> },
{ path: 'voice/:model', element: <Feature feature="voice_recognition"><VoiceRecognition /></Feature> },
{ path: 'usage', element: <Usage /> },
{ path: 'account', element: <Account /> },
{ path: 'users', element: <Admin><Users /></Admin> },


@@ -259,6 +259,26 @@ export const audioApi = {
},
}
// Face biometrics — backend spec: core/http/endpoints/localai/face_*.go
export const faceApi = {
verify: (body) => postJSON(API_CONFIG.endpoints.faceVerify, body),
analyze: (body) => postJSON(API_CONFIG.endpoints.faceAnalyze, body),
embed: (body) => postJSON(API_CONFIG.endpoints.faceEmbed, body),
register: (body) => postJSON(API_CONFIG.endpoints.faceRegister, body),
identify: (body) => postJSON(API_CONFIG.endpoints.faceIdentify, body),
forget: (body) => postJSON(API_CONFIG.endpoints.faceForget, body),
}
// Voice biometrics — backend spec: core/http/endpoints/localai/voice_*.go
export const voiceApi = {
verify: (body) => postJSON(API_CONFIG.endpoints.voiceVerify, body),
analyze: (body) => postJSON(API_CONFIG.endpoints.voiceAnalyze, body),
embed: (body) => postJSON(API_CONFIG.endpoints.voiceEmbed, body),
register: (body) => postJSON(API_CONFIG.endpoints.voiceRegister, body),
identify: (body) => postJSON(API_CONFIG.endpoints.voiceIdentify, body),
forget: (body) => postJSON(API_CONFIG.endpoints.voiceForget, body),
}
// Realtime / WebRTC
export const realtimeApi = {
call: (body) => postJSON(API_CONFIG.endpoints.realtimeCalls, body),


@@ -73,6 +73,23 @@ export const API_CONFIG = {
audioTranscriptions: '/v1/audio/transcriptions',
soundGeneration: '/v1/sound-generation',
embeddings: '/v1/embeddings',
// Face biometrics
faceVerify: '/v1/face/verify',
faceAnalyze: '/v1/face/analyze',
faceEmbed: '/v1/face/embed',
faceRegister: '/v1/face/register',
faceIdentify: '/v1/face/identify',
faceForget: '/v1/face/forget',
// Voice biometrics
voiceVerify: '/v1/voice/verify',
voiceAnalyze: '/v1/voice/analyze',
voiceEmbed: '/v1/voice/embed',
voiceRegister: '/v1/voice/register',
voiceIdentify: '/v1/voice/identify',
voiceForget: '/v1/voice/forget',
modelsList: '/v1/models',
modelsCapabilities: '/api/models/capabilities',


@@ -1303,21 +1303,39 @@ func RegisterUIAPIRoutes(app *echo.Echo, cl *config.ModelConfigLoader, ml *model
})
}
uid, err := uuid.NewUUID()
id, err := uuid.NewUUID()
if err != nil {
return c.JSON(http.StatusInternalServerError, map[string]any{"error": err.Error()})
}
galleryService.BackendGalleryChannel <- galleryop.ManagementOp[gallery.GalleryBackend, any]{
ID: uid.String(),
uid := id.String()
// Register in opcache so the operation shows up in /api/operations
// and the Backends UI can reflect progress on the affected row.
opcache.SetBackend(backendName, uid)
ctx, cancelFunc := context.WithCancel(context.Background())
op := galleryop.ManagementOp[gallery.GalleryBackend, any]{
ID: uid,
GalleryElementName: backendName,
Galleries: appConfig.BackendGalleries,
Upgrade: true,
Context: ctx,
CancelFunc: cancelFunc,
}
// Store cancellation function immediately so queued operations can be cancelled
galleryService.StoreCancellation(uid, cancelFunc)
// Non-blocking send — BackendGalleryChannel is unbuffered and a direct
// send would hang the HTTP handler whenever the worker is busy.
go func() {
galleryService.BackendGalleryChannel <- op
}()
return c.JSON(200, map[string]any{
"uuid": uid.String(),
"statusUrl": fmt.Sprintf("/api/backends/job/%s", uid.String()),
"jobID": uid,
"uuid": uid,
"statusUrl": fmt.Sprintf("/api/backends/job/%s", uid),
"message": "Backend upgrade started",
})
}, adminMiddleware)


@@ -28,6 +28,13 @@ type GalleryService struct {
// Distributed mode (nil when not in distributed mode)
natsClient messaging.Publisher
galleryStore *distributed.GalleryStore
// OnBackendOpCompleted is fired after every successful install/upgrade/delete
// on the backend channel. The Application wires this to UpgradeChecker.TriggerCheck
// so `/api/backends/upgrades` stops surfacing a backend as upgradeable the moment
// the worker finishes — previously the cache only refreshed on the 6-hour tick,
// making manual upgrades look like they failed even when they hadn't.
OnBackendOpCompleted func()
}
func NewGalleryService(appConfig *config.ApplicationConfig, ml *model.ModelLoader) *GalleryService {
@@ -245,6 +252,11 @@ func (g *GalleryService) Start(c context.Context, cl *config.ModelConfigLoader,
err := g.backendHandler(&op, systemState)
if err != nil {
updateError(op.ID, err)
} else if g.OnBackendOpCompleted != nil {
// Let listeners (e.g. UpgradeChecker) refresh their view of
// installed state. Run off the worker goroutine so a slow
// callback doesn't stall the next queued operation.
go g.OnBackendOpCompleted()
}
g.removeCancellation(op.ID)


@@ -631,6 +631,83 @@ The `cache_type_k` / `cache_type_v` fields map to llama.cpp's `-ctk` / `-ctv` fl
- [Tracked branch: `feature/turboquant-kv-cache`](https://github.com/TheTom/llama-cpp-turboquant/tree/feature/turboquant-kv-cache)
### buun-llama-cpp (DFlash speculative decoding + TurboQuant/TCQ KV-cache)
[buun-llama-cpp](https://github.com/spiritbuun/buun-llama-cpp) is a fork-of-a-fork: spiritbuun forked `TheTom/llama-cpp-turboquant` (the `turboquant` backend above) and added two independent features on top:
1. **DFlash** — a block-diffusion speculative decoding scheme that uses a dedicated drafter model (new `DFlashDraftModel` GGUF architecture). On a target/drafter pair it emits a block of tokens per speculation step and can be combined with tree-structured verification ("DDTree") for multi-branch draft expansion.
2. **TCQ (Trellis-Coded Quantization)** — two additional KV-cache types (`turbo2_tcq`, `turbo3_tcq`) on top of the TurboQuant `turbo2` / `turbo3` / `turbo4` already shipped by the parent fork, delivering 10–44% KL reduction over scalar quantization at 2–3 bits per value.
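To get a feel for what 2 to 3 bits per value means in memory terms, the following back-of-the-envelope sketch compares an `f16` KV cache with a nominal 3-bit one. The model dimensions are hypothetical, and the estimate ignores any per-block scale or metadata overhead, so real `turbo*` caches will be somewhat larger than the figure it prints:

```javascript
// Rough KV-cache size estimate. All dimensions below are made up for illustration.
function kvCacheBytes({ layers, kvHeads, headDim, contextSize, bitsPerValue }) {
  // K and V each hold layers * kvHeads * headDim values per token.
  const valuesPerToken = 2 * layers * kvHeads * headDim
  return (valuesPerToken * contextSize * bitsPerValue) / 8
}

const dims = { layers: 32, kvHeads: 8, headDim: 128, contextSize: 8192 }
const f16Bytes = kvCacheBytes({ ...dims, bitsPerValue: 16 })
const threeBitBytes = kvCacheBytes({ ...dims, bitsPerValue: 3 })

console.log(`f16:   ${(f16Bytes / 2 ** 20).toFixed(0)} MiB`)
console.log(`3-bit: ${(threeBitBytes / 2 ** 20).toFixed(0)} MiB`)
console.log(`ratio: ${(f16Bytes / threeBitBytes).toFixed(2)}x`)
```

The ratio depends only on the bit widths (16 / 3), so it holds for any model shape under the same simplifying assumptions.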
Like `turboquant`, this backend shares LocalAI's stock `llama-cpp` gRPC server sources — so any GGUF model that runs on `llama-cpp` also runs on `buun-llama-cpp`. Pick it over `turboquant` specifically when you want DFlash speculative decoding or the newer TCQ KV-cache variants.
#### Features
- Drop-in GGUF compatibility with upstream `llama.cpp`.
- DFlash block-diffusion speculative decoding (CUDA/Metal; no CPU fallback).
- TurboQuant KV-cache types (`turbo2`, `turbo3`, `turbo4`) inherited from the parent `turboquant` fork, plus buun-exclusive `turbo2_tcq` and `turbo3_tcq` variants.
- Same feature surface as `llama-cpp`: text generation, embeddings, tool calls, multimodal via mmproj.
- Available on CPU (AVX/AVX2/AVX512/fallback), NVIDIA CUDA 12/13, AMD ROCm/HIP, Intel SYCL f32/f16, Vulkan, and NVIDIA L4T — but note that DFlash and `turbo*` KV types have no CPU fallback and error at model-load on CPU-only builds.
#### Setup
`buun-llama-cpp` ships as a separate container image in the LocalAI backend gallery. Install it like any other backend:
```bash
local-ai backends install buun-llama-cpp
```
Or pick a specific flavor for your hardware (example tags: `cpu-buun-llama-cpp`, `cuda12-buun-llama-cpp`, `cuda13-buun-llama-cpp`, `rocm-buun-llama-cpp`, `intel-sycl-f16-buun-llama-cpp`, `vulkan-buun-llama-cpp`).
#### YAML configuration — TCQ KV-cache
To run a model with TurboQuant/TCQ quantized KV-cache, set the backend and pick a `turbo*` cache type:
```yaml
name: my-model
backend: buun-llama-cpp
parameters:
model: file.gguf
# Accepted values for the two fork-aware backends include the stock llama.cpp
# types (f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1), the TurboQuant types
# (turbo2, turbo3, turbo4), and the buun-only TCQ variants (turbo2_tcq,
# turbo3_tcq). turbo3 / turbo4 / turbo*_tcq auto-enable flash_attention.
cache_type_k: turbo3
cache_type_v: turbo3_tcq
context_size: 8192
```
#### YAML configuration — DFlash speculative decoding
DFlash requires a **dedicated drafter model** in the new `DFlashDraftModel` GGUF architecture. At the time of writing, the only known public target/drafter pair is [`z-lab/Qwen3.5-27B`](https://huggingface.co/z-lab/Qwen3.5-27B) + [`z-lab/Qwen3.5-27B-DFlash`](https://huggingface.co/z-lab/Qwen3.5-27B-DFlash).
```yaml
name: qwen3-dflash
backend: buun-llama-cpp
parameters:
# Target model (quantized as usual)
model: Qwen3.5-27B-Q4_K_M.gguf
# Drafter model produced by buun's convert_hf_to_gguf.py from the
# DFlashDraftModel checkpoint. Resolved relative to the models path.
draft_model: Qwen3.5-27B-DFlash.gguf
options:
# Switches the speculative pipeline from the default draft-model mode to
# DFlash (block-diffusion). Required to activate the DFlash code path.
- spec_type:dflash
# Optional tuning:
# - tree_budget:0 # 0 = flat DFlash; >0 = DDTree verification budget
# - draft_topk:1 # drafter top-K per position (1 = argmax)
# - spec_n_max:16 # cap on draft tokens per speculation step
```
Under the hood LocalAI wires `draft_model` through to the grpc-server's `params.speculative.mparams_dft.path`, and `spec_type:dflash` is forwarded through the options passthrough to buun's `common_speculative_type_from_name("dflash")`. The `tree_budget` and `draft_topk` options are buun-exclusive; they reference struct fields that only exist in buun's fork, so they're surfaced on this backend only (passing them to stock `llama-cpp` is a no-op).
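The entries under `options:` are plain `key:value` strings. As an illustration of that format only, the sketch below splits each entry on its first colon. It is not LocalAI's parser (the real handling lives in the gRPC server and may differ), and both the function name `parseOptions` and the treatment of a bare key as `true` are assumptions:

```javascript
// Illustrative only: turns ["spec_type:dflash", "tree_budget:0"] into an object.
function parseOptions(options) {
  const parsed = {}
  for (const opt of options) {
    const idx = opt.indexOf(':')
    if (idx === -1) {
      parsed[opt] = true // assumption: a bare key acts as a boolean flag
      continue
    }
    // Values stay as strings; numeric conversion is left to the consumer.
    parsed[opt.slice(0, idx)] = opt.slice(idx + 1)
  }
  return parsed
}

console.log(parseOptions(['spec_type:dflash', 'tree_budget:0', 'draft_topk:1']))
```

Values such as `tree_budget:0` come out as the string `"0"`, which is a reminder that a consumer has to convert and validate them before use.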
#### Reference
- [spiritbuun/buun-llama-cpp](https://github.com/spiritbuun/buun-llama-cpp)
- [TCQ paper / dataset](https://huggingface.co/datasets/spiritbuun/turboquant-tcq-kv-cache) — *"Closing the Gap: Trellis-Coded Quantization for KV Cache at 2-3 Bits"*
- DFlash target/drafter pair: [`z-lab/Qwen3.5-27B`](https://huggingface.co/z-lab/Qwen3.5-27B) + [`z-lab/Qwen3.5-27B-DFlash`](https://huggingface.co/z-lab/Qwen3.5-27B-DFlash)
### vLLM
[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference.


@@ -20,6 +20,7 @@ LocalAI will attempt to automatically load models which are not explicitly confi
|---------|-------------|------------|------------|-----------|-------------|
| [llama.cpp](https://github.com/ggerganov/llama.cpp) | LLM inference in C/C++. Supports LLaMA, Mamba, RWKV, Falcon, Starcoder, GPT-2, [and many others](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#description) | GPT, Functions | yes | yes | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
| [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) | Hard fork of llama.cpp optimized for CPU/hybrid CPU+GPU with IQK quants, custom quant mixes, and MLA for DeepSeek | GPT | yes | yes | CPU (AVX2+) |
| [buun-llama-cpp](https://github.com/spiritbuun/buun-llama-cpp) | llama.cpp fork with DFlash block-diffusion speculative decoding and TurboQuant/TCQ KV-cache quantization (2–3 bits per value). Accelerated paths are CUDA/Metal only. | GPT, Functions | yes | yes | CUDA, Metal (CPU fallback for non-turbo/non-DFlash only) |
| [vLLM](https://github.com/vllm-project/vllm) | Fast LLM serving with PagedAttention | GPT, Functions | no | yes | CPU, CUDA 12, ROCm, Intel |
| [vLLM Omni](https://github.com/vllm-project/vllm) | Unified multimodal generation (text, image, video, audio) | Multimodal GPT, Functions | no | yes | CUDA 12, ROCm |
| [transformers](https://github.com/huggingface/transformers) | HuggingFace Transformers framework | GPT, Embeddings, Multimodal | yes | yes* | CPU, CUDA 12/13, ROCm, Intel, Metal |


@@ -3,40 +3,7 @@
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
urls:
- https://huggingface.co/KyleHessling1/Qwopus-GLM-18B-Merged-GGUF
description: |
# 🪐 Qwen3.5-9B-GLM5.1-Distill-v1
## 📌 Model Overview
**Model Name:** `Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1`
**Base Model:** Qwen3.5-9B
**Training Type:** Supervised Fine-Tuning (SFT, Distillation)
**Parameter Scale:** 9B
**Training Framework:** Unsloth
This model is a distilled variant of **Qwen3.5-9B**, trained on high-quality reasoning data derived from **GLM-5.1**.
The primary goals are to:
- Improve **structured reasoning ability**
- Enhance **instruction-following consistency**
- Activate **latent knowledge via better reasoning structure**
## 📊 Training Data
### Main Dataset
- `Jackrong/GLM-5.1-Reasoning-1M-Cleaned`
- Cleaned from the original `Kassadin88/GLM-5.1-1000000x` dataset.
- Generated from a **GLM-5.1 teacher model**
- Approximately **700x** the scale of `Qwen3.5-reasoning-700x`
- Training used a **filtered subset**, not the full source dataset.
### Auxiliary Dataset
- `Jackrong/Qwen3.5-reasoning-700x`
...
description: "# \U0001FA90 Qwen3.5-9B-GLM5.1-Distill-v1\n\n## \U0001F4CC Model Overview\n\n**Model Name:** `Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1`\n**Base Model:** Qwen3.5-9B\n**Training Type:** Supervised Fine-Tuning (SFT, Distillation)\n**Parameter Scale:** 9B\n**Training Framework:** Unsloth\n\nThis model is a distilled variant of **Qwen3.5-9B**, trained on high-quality reasoning data derived from **GLM-5.1**.\n\nThe primary goals are to:\n\n - Improve **structured reasoning ability**\n - Enhance **instruction-following consistency**\n - Activate **latent knowledge via better reasoning structure**\n\n## \U0001F4CA Training Data\n\n### Main Dataset\n\n - `Jackrong/GLM-5.1-Reasoning-1M-Cleaned`\n - Cleaned from the original `Kassadin88/GLM-5.1-1000000x` dataset.\n - Generated from a **GLM-5.1 teacher model**\n - Approximately **700x** the scale of `Qwen3.5-reasoning-700x`\n - Training used a **filtered subset**, not the full source dataset.\n\n### Auxiliary Dataset\n\n - `Jackrong/Qwen3.5-reasoning-700x`\n\n...\n"
license: "apache-2.0"
tags:
- llm
@@ -127,26 +94,7 @@
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
urls:
- https://huggingface.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF
description: |
# 🔥 Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled
A reasoning SFT fine-tune of `Qwen/Qwen3.6-35B-A3B` on chain-of-thought (CoT) distillation mostly sourced from Claude Opus 4.6. The goal is to preserve Qwen3.6's strong agentic coding and reasoning base while nudging the model toward structured Claude Opus-style reasoning traces and more stable long-form problem solving.
The training path is text-only. The Qwen3.6 base architecture includes a vision encoder, but this fine-tuning run did not train on image or video examples.
- **Developed by:** @hesamation
- **Base model:** `Qwen/Qwen3.6-35B-A3B`
- **License:** apache-2.0
This fine-tuning run is inspired by Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, including the notebook/training workflow style and Claude Opus reasoning-distillation direction.
[](https://x.com/Hesamation) [](https://discord.gg/vtJykN3t)
## Benchmark Results
The MMLU-Pro pass used 70 total questions per model: `--limit 5` across 14 MMLU-Pro subjects. Treat this as a smoke/comparative check, not a release-quality full benchmark.
...
description: "# \U0001F525 Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled\n\nA reasoning SFT fine-tune of `Qwen/Qwen3.6-35B-A3B` on chain-of-thought (CoT) distillation mostly sourced from Claude Opus 4.6. The goal is to preserve Qwen3.6's strong agentic coding and reasoning base while nudging the model toward structured Claude Opus-style reasoning traces and more stable long-form problem solving.\n\nThe training path is text-only. The Qwen3.6 base architecture includes a vision encoder, but this fine-tuning run did not train on image or video examples.\n\n - **Developed by:** @hesamation\n - **Base model:** `Qwen/Qwen3.6-35B-A3B`\n - **License:** apache-2.0\n\nThis fine-tuning run is inspired by Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, including the notebook/training workflow style and Claude Opus reasoning-distillation direction.\n\n[](https://x.com/Hesamation) [](https://discord.gg/vtJykN3t)\n\n## Benchmark Results\n\nThe MMLU-Pro pass used 70 total questions per model: `--limit 5` across 14 MMLU-Pro subjects. Treat this as a smoke/comparative check, not a release-quality full benchmark.\n\n...\n"
license: "apache-2.0"
tags:
- llm
@@ -182,40 +130,7 @@
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
urls:
- https://huggingface.co/Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1-GGUF
description: |
# 🪐 Qwen3.5-9B-GLM5.1-Distill-v1
## 📌 Model Overview
**Model Name:** `Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1`
**Base Model:** Qwen3.5-9B
**Training Type:** Supervised Fine-Tuning (SFT, Distillation)
**Parameter Scale:** 9B
**Training Framework:** Unsloth
This model is a distilled variant of **Qwen3.5-9B**, trained on high-quality reasoning data derived from **GLM-5.1**.
The primary goals are to:
- Improve **structured reasoning ability**
- Enhance **instruction-following consistency**
- Activate **latent knowledge via better reasoning structure**
## 📊 Training Data
### Main Dataset
- `Jackrong/GLM-5.1-Reasoning-1M-Cleaned`
- Cleaned from the original `Kassadin88/GLM-5.1-1000000x` dataset.
- Generated from a **GLM-5.1 teacher model**
- Approximately **700x** the scale of `Qwen3.5-reasoning-700x`
- Training used a **filtered subset**, not the full source dataset.
### Auxiliary Dataset
- `Jackrong/Qwen3.5-reasoning-700x`
...
description: "# \U0001FA90 Qwen3.5-9B-GLM5.1-Distill-v1\n\n## \U0001F4CC Model Overview\n\n**Model Name:** `Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1`\n**Base Model:** Qwen3.5-9B\n**Training Type:** Supervised Fine-Tuning (SFT, Distillation)\n**Parameter Scale:** 9B\n**Training Framework:** Unsloth\n\nThis model is a distilled variant of **Qwen3.5-9B**, trained on high-quality reasoning data derived from **GLM-5.1**.\n\nThe primary goals are to:\n\n - Improve **structured reasoning ability**\n - Enhance **instruction-following consistency**\n - Activate **latent knowledge via better reasoning structure**\n\n## \U0001F4CA Training Data\n\n### Main Dataset\n\n - `Jackrong/GLM-5.1-Reasoning-1M-Cleaned`\n - Cleaned from the original `Kassadin88/GLM-5.1-1000000x` dataset.\n - Generated from a **GLM-5.1 teacher model**\n - Approximately **700x** the scale of `Qwen3.5-reasoning-700x`\n - Training used a **filtered subset**, not the full source dataset.\n\n### Auxiliary Dataset\n\n - `Jackrong/Qwen3.5-reasoning-700x`\n\n...\n"
license: "apache-2.0"
tags:
- llm
@@ -1178,6 +1093,134 @@
- transcript
parameters:
model: tiny
- name: omnilingual-0.3b-ctc-q8-sherpa
license: apache-2.0
url: "github:mudler/LocalAI/gallery/sherpa-onnx-asr.yaml@master"
description: |
Omnilingual ASR CTC 300M (int8) is a multilingual automatic speech recognition model supporting 1,600+ languages. Based on Meta's omniASR_CTC_300M architecture (Wav2Vec2 with CTC head), quantized to int8 for efficient inference. Uses the sherpa-onnx backend with ONNX Runtime.
urls:
- https://huggingface.co/csukuangfj/sherpa-onnx-omnilingual-asr-1600-languages-300M-ctc-int8-2025-11-12
- https://k2-fsa.github.io/sherpa/onnx/omnilingual-asr/models.html
icon: https://avatars.githubusercontent.com/u/75781706
tags:
- stt
- speech-to-text
- asr
- audio-transcription
- multilingual
- omnilingual
- sherpa-onnx
- cpu
- gpu
overrides:
known_usecases:
- transcript
parameters:
model: omnilingual-asr/model.int8.onnx
files:
- filename: omnilingual-asr/model.int8.onnx
sha256: e7c4e54ee4c4c47829cc6667d5d00ed8ea7bef1dcfeef0fce766f77752a2726c
uri: https://huggingface.co/csukuangfj/sherpa-onnx-omnilingual-asr-1600-languages-300M-ctc-int8-2025-11-12/resolve/main/model.int8.onnx
- filename: omnilingual-asr/tokens.txt
sha256: a7a044c52cb29cbe8b0dc1953e92cefd4ca16b0ed968177b6beab21f9a7d0b31
uri: https://huggingface.co/csukuangfj/sherpa-onnx-omnilingual-asr-1600-languages-300M-ctc-int8-2025-11-12/resolve/main/tokens.txt
- name: streaming-zipformer-en-sherpa
license: apache-2.0
url: "github:mudler/LocalAI/gallery/sherpa-onnx-asr.yaml@master"
description: |
Streaming English ASR: sherpa-onnx zipformer transducer (int8, chunk-16 left-128). Low-latency real-time transcription with endpoint detection via sherpa-onnx's online recognizer. English-only; for multilingual offline ASR see omnilingual-0.3b-ctc-q8-sherpa.
urls:
- https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26
- https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/zipformer-transducer-models.html
icon: https://avatars.githubusercontent.com/u/75781706
tags:
- stt
- speech-to-text
- asr
- audio-transcription
- streaming
- real-time
- english
- zipformer
- sherpa-onnx
- cpu
- gpu
overrides:
known_usecases:
- transcript
parameters:
model: streaming-zipformer-en/encoder.int8.onnx
options:
- subtype=online
files:
- filename: streaming-zipformer-en/encoder.int8.onnx
sha256: 563fde436d16cf7607cf408cd6b30909819d03162652ef389c2450ced3f45ac1
uri: https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/encoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx
- filename: streaming-zipformer-en/decoder.int8.onnx
sha256: 98da299f471e38bb4e1a8df579b8cc9122d6039576a77e357b3c60f17dd83b02
uri: https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/decoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx
- filename: streaming-zipformer-en/joiner.int8.onnx
sha256: d944208d660d67c8d72cd2acaeac971fa5ceb8c80e76c1968148846fedd6e297
uri: https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/joiner-epoch-99-avg-1-chunk-16-left-128.int8.onnx
- filename: streaming-zipformer-en/tokens.txt
sha256: 49e3c2646595fd907228b3c6787069658f67b17377c60aeb8619c4551b2316fb
uri: https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/tokens.txt
- name: silero-vad-sherpa
license: mit
url: "github:mudler/LocalAI/gallery/sherpa-onnx-vad.yaml@master"
description: |
Silero VAD served through the sherpa-onnx backend. Uses the same ONNX weights as the dedicated silero-vad backend, loaded through sherpa-onnx's C VAD API. Pairs with the sherpa-onnx ASR entries for round-trip audio pipelines.
urls:
- https://github.com/snakers4/silero-vad
- https://huggingface.co/onnx-community/silero-vad
icon: https://github.com/snakers4/silero-models/raw/master/files/silero_logo.jpg
tags:
- vad
- voice-activity-detection
- sherpa-onnx
- cpu
- gpu
overrides:
known_usecases:
- vad
parameters:
model: silero-vad/silero-vad.onnx
files:
- filename: silero-vad/silero-vad.onnx
sha256: a4a068cd6cf1ea8355b84327595838ca748ec29a25bc91fc82e6c299ccdc5808
uri: https://huggingface.co/onnx-community/silero-vad/resolve/main/onnx/model.onnx
- name: vits-ljs-sherpa
license: mit
url: "github:mudler/LocalAI/gallery/sherpa-onnx-tts.yaml@master"
description: |
VITS-LJS English single-speaker TTS served through the sherpa-onnx backend. Trained on the LJSpeech corpus at 22.05 kHz. Pairs with the sherpa-onnx ASR entries for round-trip audio pipelines.
urls:
- https://github.com/k2-fsa/sherpa-onnx
- https://huggingface.co/csukuangfj/vits-ljs
icon: https://avatars.githubusercontent.com/u/75781706
tags:
- tts
- text-to-speech
- english
- vits
- sherpa-onnx
- cpu
- gpu
overrides:
known_usecases:
- tts
parameters:
model: vits-ljs/vits-ljs.onnx
files:
- filename: vits-ljs/vits-ljs.onnx
sha256: 5bbd273797a9ecf8d94bd6ec02ad16cb41cbb85f055ad98d528ced3e44c9b31a
uri: https://huggingface.co/csukuangfj/vits-ljs/resolve/main/vits-ljs.onnx
- filename: vits-ljs/tokens.txt
sha256: 5fee2c6b238d712287f2ecb08f34a8a8b413bcb7390862ef6fb6fd6f0f8d3a17
uri: https://huggingface.co/csukuangfj/vits-ljs/resolve/main/tokens.txt
- filename: vits-ljs/lexicon.txt
sha256: bdccfc6da71c45c48e2e0056fcf0aab760577c5f959f6c1b5eb3e3e916fd5a0e
uri: https://huggingface.co/csukuangfj/vits-ljs/resolve/main/lexicon.txt
- name: voxcpm-1.5
license: apache-2.0
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
@@ -3845,7 +3888,7 @@
cached in the models directory like any other managed model).
NON-COMMERCIAL RESEARCH USE ONLY. For commercial use see `insightface-opencv`.
tags: [face-recognition, face-verification, face-embedding, research-only, gpu, cpu]
urls: [https://github.com/deepinsight/insightface]
urls: ['https://github.com/deepinsight/insightface']
overrides:
backend: insightface
parameters: {model: insightface-buffalo-l}
@@ -3876,7 +3919,7 @@
cheaper detector — good balance on mid-range hardware.
NON-COMMERCIAL RESEARCH USE ONLY.
tags: [face-recognition, face-verification, face-embedding, research-only, gpu, cpu]
urls: [https://github.com/deepinsight/insightface]
urls: ['https://github.com/deepinsight/insightface']
overrides:
backend: insightface
parameters: {model: insightface-buffalo-m}
@@ -3906,7 +3949,7 @@
genderage, ~159MB). Good fit for mid-range CPU deployments.
NON-COMMERCIAL RESEARCH USE ONLY.
tags: [face-recognition, face-verification, face-embedding, research-only, edge, cpu]
urls: [https://github.com/deepinsight/insightface]
urls: ['https://github.com/deepinsight/insightface']
overrides:
backend: insightface
parameters: {model: insightface-buffalo-s}
@@ -3938,7 +3981,7 @@
only verification and embedding are needed.
NON-COMMERCIAL RESEARCH USE ONLY.
tags: [face-recognition, face-verification, face-embedding, research-only, edge, cpu]
urls: [https://github.com/deepinsight/insightface]
urls: ['https://github.com/deepinsight/insightface']
overrides:
backend: insightface
parameters: {model: insightface-buffalo-sc}
@@ -3969,7 +4012,7 @@
harder benchmarks; pays for it in GPU memory.
NON-COMMERCIAL RESEARCH USE ONLY.
tags: [face-recognition, face-verification, face-embedding, research-only, gpu]
urls: [https://github.com/deepinsight/insightface]
urls: ['https://github.com/deepinsight/insightface']
overrides:
backend: insightface
parameters: {model: insightface-antelopev2}
@@ -4001,7 +4044,7 @@
Weights are downloaded on install via LocalAI's gallery mechanism
(~40MB).
tags: [face-recognition, face-verification, face-embedding, commercial-ok, gpu, cpu]
urls: [https://github.com/opencv/opencv_zoo]
urls: ['https://github.com/opencv/opencv_zoo']
overrides:
backend: insightface
parameters: {model: face_detection_yunet_2023mar.onnx}
@@ -4035,7 +4078,7 @@
at comparable accuracy for face tasks. APACHE 2.0 — commercial-safe.
Weights are downloaded on install via LocalAI's gallery mechanism.
tags: [face-recognition, face-verification, face-embedding, commercial-ok, edge, cpu]
urls: [https://github.com/opencv/opencv_zoo]
urls: ['https://github.com/opencv/opencv_zoo']
overrides:
backend: insightface
parameters: {model: face_detection_yunet_2023mar_int8.onnx}
@@ -15923,6 +15966,7 @@
uri: "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors"
- filename: "umt5-xxl-encoder-Q8_0.gguf"
uri: "huggingface://city96/umt5-xxl-encoder-gguf/umt5-xxl-encoder-Q8_0.gguf"
sha256: 2521d4de0bf9e1cc6549866463ceae85e4ec3239bc6063f7488810be39033bbc
- filename: "clip_vision_h.safetensors"
uri: "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/clip_vision/clip_vision_h.safetensors"
- name: sd-1.5-ggml


@@ -0,0 +1,27 @@
---
name: "sherpa-onnx-asr"
config_file: |
backend: sherpa-onnx
type: asr
options:
# Feature extraction. Most shipped sherpa-onnx ASR models expect
# 16 kHz / 80-dim log-mel; derivatives trained at other rates
# should override these.
- asr.sample_rate=16000
- asr.feature_dim=80
- asr.decoding_method=greedy_search
# Whisper-family defaults (ignored by non-whisper models).
- asr.whisper.task=transcribe
- asr.whisper.tail_paddings=-1
# SenseVoice-family: inverse text normalization is off in upstream
# sherpa but on here — we want formatted transcription output
# ("100" not "one hundred"). Set to 0 for raw tokens.
- asr.sense_voice.use_itn=1
# Online (streaming zipformer) ASR. Endpoint detection is upstream-
# off but on here — streaming consumers need segment boundaries.
- online.enable_endpoint=1
- online.rule1_min_trailing_silence=2.4
- online.rule2_min_trailing_silence=1.2
- online.rule3_min_utterance_length=20.0
- online.chunk_samples=1600
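The options above are flat `key=value` strings. As a rough illustration of how such a list can be read and what the streaming chunk size means in time, here is a minimal Go sketch. `parseOptions` and `chunkMillis` are hypothetical helpers, not the backend's actual parser, and the sketch assumes `online.chunk_samples` is counted in samples at `asr.sample_rate`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseOptions splits "key=value" entries into a map.
// Hypothetical helper: the backend's real option parser is not shown here.
func parseOptions(opts []string) map[string]string {
	out := make(map[string]string, len(opts))
	for _, o := range opts {
		key, value, ok := strings.Cut(o, "=")
		if !ok {
			continue // ignore entries that are not key=value
		}
		out[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return out
}

// chunkMillis converts online.chunk_samples into milliseconds, assuming the
// chunk is counted in samples at asr.sample_rate.
func chunkMillis(opts map[string]string) (float64, error) {
	rate, err := strconv.ParseFloat(opts["asr.sample_rate"], 64)
	if err != nil {
		return 0, fmt.Errorf("asr.sample_rate: %w", err)
	}
	if rate <= 0 {
		return 0, fmt.Errorf("asr.sample_rate must be positive, got %v", rate)
	}
	samples, err := strconv.ParseFloat(opts["online.chunk_samples"], 64)
	if err != nil {
		return 0, fmt.Errorf("online.chunk_samples: %w", err)
	}
	return samples * 1000 / rate, nil
}

func main() {
	opts := parseOptions([]string{
		"asr.sample_rate=16000",
		"online.chunk_samples=1600",
	})
	ms, err := chunkMillis(opts)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.0f ms of audio per chunk\n", ms)
}
```

Under that assumption, the defaults work out to 100 ms of audio per streaming chunk.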


@@ -0,0 +1,14 @@
---
name: "sherpa-onnx-tts"
config_file: |
backend: sherpa-onnx
options:
# VITS inference knobs. Matches upstream sherpa-onnx defaults.
- tts.noise_scale=0.667
- tts.noise_scale_w=0.8
- tts.length_scale=1.0
- tts.max_num_sentences=1
# Speech rate multiplier. Applied at every TTS / TTSStream call
# since the TTSRequest proto has no speed field.
- tts.speed=1.0


@@ -0,0 +1,17 @@
---
name: "sherpa-onnx-vad"
config_file: |
backend: sherpa-onnx
type: vad
options:
# Silero VAD. Defaults mirror upstream sherpa-onnx. Override for
# faster turn-taking (lower min_silence) or different sample rate
# derivatives (8 kHz Silero variants).
- vad.threshold=0.5
- vad.min_silence=0.5
- vad.min_speech=0.25
- vad.window_size=512
- vad.max_speech=20.0
- vad.sample_rate=16000
- vad.buffer_size=60.0


@@ -375,9 +375,20 @@ func (uri URI) DownloadFileWithContext(ctx context.Context, filePath, sha string
if uri.LooksLikeOCI() {
// Only Ollama wants to download to the file, for the rest, we want to download to the directory
// so we check if filepath has any extension, otherwise we assume it's a directory
if filepath.Ext(filePath) != "" && !strings.HasPrefix(url, OllamaPrefix) {
filePath = filepath.Dir(filePath)
// so we check if filepath has any extension, otherwise we assume it's a directory.
// Caveat: `filepath.Ext` treats any dot-suffix as an extension, so paths like
// `backends/local-store.upgrade-tmp` (the tmp dir created by gallery.UpgradeBackend)
// look like a "file" to this heuristic and get rewritten to their parent — which
// then unpacks the image at `backends/` top level and clobbers the real install
// with a flat-layout file. Guard against that by short-circuiting when the caller
// has already created the target as a directory: OCI destinations are always dirs
// in that case, regardless of what their suffix looks like.
if !strings.HasPrefix(url, OllamaPrefix) {
if fi, statErr := os.Stat(filePath); statErr == nil && fi.IsDir() {
// Existing directory — use as-is.
} else if filepath.Ext(filePath) != "" {
filePath = filepath.Dir(filePath)
}
}
progressStatus := func(desc ocispec.Descriptor) io.Writer {
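The caveat in the comment above is easy to reproduce in isolation. The following sketch lifts the guarded heuristic into a standalone function; `resolveOCIDest`, `demo`, and the `image.tar` file name are hypothetical, chosen only for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolveOCIDest mirrors the guarded heuristic: an existing directory is used
// as-is; otherwise a path with a dot-suffix is treated as a file and replaced
// by its parent directory. Hypothetical name, not a LocalAI function.
func resolveOCIDest(p string) string {
	if fi, err := os.Stat(p); err == nil && fi.IsDir() {
		return p // existing directory, regardless of its suffix
	}
	if filepath.Ext(p) != "" {
		return filepath.Dir(p)
	}
	return p
}

// demo creates a dotted directory and reports how both kinds of path resolve.
func demo() (ext string, dirKept, fileRewritten bool, err error) {
	root, err := os.MkdirTemp("", "oci-dest")
	if err != nil {
		return "", false, false, err
	}
	defer os.RemoveAll(root)

	tmpDir := filepath.Join(root, "backends", "local-store.upgrade-tmp")
	if err := os.MkdirAll(tmpDir, 0o755); err != nil {
		return "", false, false, err
	}
	missing := filepath.Join(root, "backends", "image.tar") // does not exist

	ext = filepath.Ext(tmpDir)
	dirKept = resolveOCIDest(tmpDir) == tmpDir
	fileRewritten = resolveOCIDest(missing) == filepath.Dir(missing)
	return ext, dirKept, fileRewritten, nil
}

func main() {
	ext, dirKept, fileRewritten, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("ext=%q dirKept=%v fileRewritten=%v\n", ext, dirKept, fileRewritten)
}
```

`filepath.Ext` reports `.upgrade-tmp` for the temporary directory, which is why the directory check has to come before the extension check.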


@@ -16,7 +16,13 @@ var base64DownloadClient http.Client = http.Client{
Timeout: 30 * time.Second,
}
var dataURIPattern = regexp.MustCompile(`^data:([^;]+);base64,`)
// Match `data:<mime>[;param=value...];base64,` — browser-produced data URIs
// often carry codec/charset params between the mime type and `;base64,`
// (e.g. MediaRecorder's `data:audio/webm;codecs=opus;base64,...`). The old
// `([^;]+)` form only tolerated exactly one segment, so anything with
// extra params failed the strip and tripped the downstream base64 decoder
// on the `data:` literal.
var dataURIPattern = regexp.MustCompile(`^data:[^,]+?;base64,`)
// GetContentURIAsBase64 checks whether the string is a URL. If it is, the content is downloaded into memory, base64-encoded, and returned; otherwise the string is returned with any base64 data-URI header stripped.
func GetContentURIAsBase64(s string) (string, error) {
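To make the difference between the two patterns concrete, the sketch below runs both against a plain data URI and against the MediaRecorder-style URI mentioned in the comment. `oldPattern`, `newPattern`, and `stripPrefix` are names introduced here for illustration; only the two regular expressions are taken from the diff.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Previous pattern: exactly one segment between "data:" and ";base64,".
	oldPattern = regexp.MustCompile(`^data:([^;]+);base64,`)
	// New pattern: any run of non-comma characters before ";base64,".
	newPattern = regexp.MustCompile(`^data:[^,]+?;base64,`)
)

// stripPrefix removes a matching data-URI header and leaves the input
// untouched when the pattern does not match.
func stripPrefix(re *regexp.Regexp, s string) string {
	return re.ReplaceAllString(s, "")
}

func main() {
	inputs := []string{
		"data:image/png;base64,AAAA",
		"data:audio/webm;codecs=opus;base64,PAYLOAD",
	}
	for _, in := range inputs {
		fmt.Printf("old=%q new=%q\n", stripPrefix(oldPattern, in), stripPrefix(newPattern, in))
	}
}
```

With the old pattern the second input comes back unchanged, so the `data:` literal would reach the base64 decoder; with the new pattern only the payload remains.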


@@ -21,6 +21,15 @@ var _ = Describe("utils/base64 tests", func() {
Expect(err).To(BeNil())
Expect(b64).To(Equal("BAR"))
})
It("GetContentURIAsBase64 strips data URI prefixes with codec/charset params", func() {
// Browser MediaRecorder produces data URIs like
// `data:audio/webm;codecs=opus;base64,...` — the regex must accept
// any number of MIME parameters between the type and `;base64,`.
input := "data:audio/webm;codecs=opus;base64,PAYLOAD"
b64, err := GetContentURIAsBase64(input)
Expect(err).To(BeNil())
Expect(b64).To(Equal("PAYLOAD"))
})
It("GetImageURLAsBase64 returns an error for bogus data", func() {
input := "FOO"
b64, err := GetContentURIAsBase64(input)


@@ -2,10 +2,13 @@ package utils
import (
"fmt"
"io"
"os"
"os/exec"
"strings"
laudio "github.com/mudler/LocalAI/pkg/audio"
"github.com/go-audio/wav"
)
@@ -16,24 +19,61 @@ func ffmpegCommand(args []string) (string, error) {
return string(out), err
}
// AudioToWav converts audio to wav for transcribe.
// TODO: use https://github.com/mccoyst/ogg?
// AudioToWav converts audio to wav for transcribe (16 kHz mono s16le).
// WAV files already in the target format are passed through directly;
// everything else is converted via ffmpeg.
//
// The pass-through uses a hardlink (with a Copy fallback for cross-fs
// src/dst) rather than Rename — callers may invoke this twice against
// the same fixture (e.g. once for AudioTranscription and once for
// AudioTranscriptionStream) and expect the original file to remain.
func AudioToWav(src, dst string) error {
if strings.HasSuffix(src, ".wav") {
f, err := os.Open(src)
if err != nil {
return fmt.Errorf("open: %w", err)
}
dec := wav.NewDecoder(f)
dec.ReadInfo()
f.Close()
if dec.BitDepth == 16 && dec.NumChans == 1 && dec.SampleRate == 16000 {
os.Rename(src, dst)
return nil
}
if strings.HasSuffix(src, ".wav") && isTargetWav(src) {
return passthroughWAV(src, dst)
}
return convertWithFFmpeg(src, dst)
}
func passthroughWAV(src, dst string) error {
if err := os.Link(src, dst); err == nil {
return nil
}
// Fallback: copy. Hardlink fails across filesystems (e.g. src on a
// read-only mount, dst in /tmp) or when the destination already
// exists — both are fine; just copy bytes.
in, err := os.Open(src)
if err != nil {
return fmt.Errorf("open src: %w", err)
}
defer in.Close()
out, err := os.Create(dst)
if err != nil {
return fmt.Errorf("create dst: %w", err)
}
defer out.Close()
if _, err := io.Copy(out, in); err != nil {
return fmt.Errorf("copy: %w", err)
}
return nil
}
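A standalone sketch of the link-then-copy pattern follows. It is not the function from the diff: `linkOrCopy` and `demo` are hypothetical names, and the sketch adds an `os.SameFile` check that the diff does not have. The check is there because `os.Create` truncates its target, so if the destination is already a hard link to the source, the copy fallback would empty both names.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// linkOrCopy hard-links src to dst and falls back to a byte copy when the
// link fails (cross-filesystem, dst already exists, and so on).
func linkOrCopy(src, dst string) error {
	if err := os.Link(src, dst); err == nil {
		return nil
	}
	// Extra guard (an assumption of this sketch): if dst already refers to
	// the same file as src, there is nothing to do, and truncating dst
	// would destroy src as well.
	if si, err := os.Stat(src); err == nil {
		if di, err := os.Stat(dst); err == nil && os.SameFile(si, di) {
			return nil
		}
	}
	in, err := os.Open(src)
	if err != nil {
		return fmt.Errorf("open src: %w", err)
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		return fmt.Errorf("create dst: %w", err)
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return fmt.Errorf("copy: %w", err)
	}
	return out.Close()
}

// demo calls linkOrCopy twice on the same pair and returns what is left in src.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "passthrough")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	src := filepath.Join(dir, "input.wav")
	dst := filepath.Join(dir, "output.wav")
	if err := os.WriteFile(src, []byte("fixture"), 0o644); err != nil {
		return "", err
	}
	for i := 0; i < 2; i++ {
		if err := linkOrCopy(src, dst); err != nil {
			return "", err
		}
	}
	data, err := os.ReadFile(src)
	return string(data), err
}

func main() {
	left, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("source still contains %q\n", left)
}
```

Unlike `os.Rename`, both paths leave the source in place, which is the property the doc comment on `AudioToWav` asks for.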
// isTargetWav returns true when src is a valid WAV already in the
// target format (16 kHz, mono, 16-bit PCM).
func isTargetWav(src string) bool {
f, err := os.Open(src)
if err != nil {
return false
}
defer f.Close()
dec := wav.NewDecoder(f)
if !dec.IsValidFile() {
return false
}
return dec.BitDepth == 16 && dec.NumChans == 1 && dec.SampleRate == 16000
}
func convertWithFFmpeg(src, dst string) error {
commandArgs := []string{"-i", src, "-format", "s16le", "-ar", "16000", "-ac", "1", "-acodec", "pcm_s16le", dst}
out, err := ffmpegCommand(commandArgs)
if err != nil {
@@ -85,3 +125,18 @@ func AudioConvert(src string, format string) (string, error) {
}
return dst, nil
}
// WriteWav16kFromReader reads all PCM data from r and writes a 16 kHz mono
// 16-bit WAV to w. Useful when the PCM length is not known in advance.
func WriteWav16kFromReader(w io.Writer, r io.Reader) error {
pcm, err := io.ReadAll(r)
if err != nil {
return fmt.Errorf("read pcm: %w", err)
}
hdr := laudio.NewWAVHeader(uint32(len(pcm)))
if err := hdr.Write(w); err != nil {
return fmt.Errorf("write wav header: %w", err)
}
_, err = w.Write(pcm)
return err
}
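For readers unfamiliar with the header that precedes the PCM bytes, here is a self-contained sketch of a canonical 44-byte PCM WAV header for 16 kHz mono 16-bit audio. The `wavHeader` struct and `encodeWAV16kMono` are hypothetical stand-ins; the field names follow the `laudio.WAVHeader` literal used in the tests of this change, but the real `laudio.NewWAVHeader` may be implemented differently.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// wavHeader is a stand-in for a canonical 44-byte PCM WAV header.
type wavHeader struct {
	ChunkID       [4]byte
	ChunkSize     uint32
	Format        [4]byte
	Subchunk1ID   [4]byte
	Subchunk1Size uint32
	AudioFormat   uint16
	NumChannels   uint16
	SampleRate    uint32
	ByteRate      uint32
	BlockAlign    uint16
	BitsPerSample uint16
	Subchunk2ID   [4]byte
	Subchunk2Size uint32
}

// encodeWAV16kMono prepends a 16 kHz mono 16-bit PCM header to raw samples.
func encodeWAV16kMono(pcm []byte) ([]byte, error) {
	const (
		sampleRate    = 16000
		channels      = 1
		bitsPerSample = 16
	)
	dataSize := uint32(len(pcm))
	hdr := wavHeader{
		ChunkID:       [4]byte{'R', 'I', 'F', 'F'},
		ChunkSize:     36 + dataSize, // header bytes after this field + data
		Format:        [4]byte{'W', 'A', 'V', 'E'},
		Subchunk1ID:   [4]byte{'f', 'm', 't', ' '},
		Subchunk1Size: 16,
		AudioFormat:   1, // PCM
		NumChannels:   channels,
		SampleRate:    sampleRate,
		ByteRate:      sampleRate * channels * bitsPerSample / 8,
		BlockAlign:    channels * bitsPerSample / 8,
		BitsPerSample: bitsPerSample,
		Subchunk2ID:   [4]byte{'d', 'a', 't', 'a'},
		Subchunk2Size: dataSize,
	}
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.LittleEndian, hdr); err != nil {
		return nil, fmt.Errorf("write wav header: %w", err)
	}
	buf.Write(pcm)
	return buf.Bytes(), nil
}

func main() {
	out, err := encodeWAV16kMono(make([]byte, 3200)) // 100 ms of silence
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes total, magic %q\n", len(out), out[:4])
}
```

Because the header carries the data size up front, a writer that does not know the PCM length in advance has to buffer the stream first, which is why the function above reads everything with `io.ReadAll` before writing.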

pkg/utils/ffmpeg_test.go

@@ -0,0 +1,150 @@
package utils
import (
"encoding/binary"
"os"
"path/filepath"
"testing"
laudio "github.com/mudler/LocalAI/pkg/audio"
)
// generateTestWav creates a 16-bit PCM WAV file filled with a repeating
// synthetic pattern (not a true sine) at the given sample rate and channel
// count. numSamples is the number of frames per channel.
func generateTestWav(t *testing.T, path string, sampleRate uint32, numChannels uint16, numSamples int) {
t.Helper()
f, err := os.Create(path)
if err != nil {
t.Fatal(err)
}
defer f.Close()
bitsPerSample := uint16(16)
blockAlign := numChannels * (bitsPerSample / 8)
byteRate := sampleRate * uint32(blockAlign)
totalSamples := numSamples * int(numChannels)
dataSize := uint32(totalSamples) * uint32(bitsPerSample/8)
hdr := laudio.WAVHeader{
ChunkID: [4]byte{'R', 'I', 'F', 'F'},
ChunkSize: 36 + dataSize,
Format: [4]byte{'W', 'A', 'V', 'E'},
Subchunk1ID: [4]byte{'f', 'm', 't', ' '},
Subchunk1Size: 16,
AudioFormat: 1,
NumChannels: numChannels,
SampleRate: sampleRate,
ByteRate: byteRate,
BlockAlign: blockAlign,
BitsPerSample: bitsPerSample,
Subchunk2ID: [4]byte{'d', 'a', 't', 'a'},
Subchunk2Size: dataSize,
}
if err := binary.Write(f, binary.LittleEndian, &hdr); err != nil {
t.Fatal(err)
}
for i := 0; i < totalSamples; i++ {
sample := int16(1000 * (i % 100))
if err := binary.Write(f, binary.LittleEndian, sample); err != nil {
t.Fatal(err)
}
}
}
func TestAudioToWav_AlreadyCorrectFormat(t *testing.T) {
dir := t.TempDir()
src := filepath.Join(dir, "input.wav")
dst := filepath.Join(dir, "output.wav")
generateTestWav(t, src, 16000, 1, 1600)
if err := AudioToWav(src, dst); err != nil {
t.Fatalf("AudioToWav failed: %v", err)
}
info, err := os.Stat(dst)
if err != nil {
t.Fatalf("output not found: %v", err)
}
if info.Size() == 0 {
t.Fatal("output file is empty")
}
}
func TestAudioToWav_ResampleFrom22050(t *testing.T) {
dir := t.TempDir()
src := filepath.Join(dir, "input.wav")
dst := filepath.Join(dir, "output.wav")
generateTestWav(t, src, 22050, 1, 22050)
if err := AudioToWav(src, dst); err != nil {
t.Fatalf("AudioToWav failed: %v", err)
}
info, err := os.Stat(dst)
if err != nil {
t.Fatalf("output not found: %v", err)
}
if info.Size() == 0 {
t.Fatal("output file is empty")
}
verifyWavFormat(t, dst, 16000, 1)
}
func TestAudioToWav_StereoDownmix(t *testing.T) {
dir := t.TempDir()
src := filepath.Join(dir, "input.wav")
dst := filepath.Join(dir, "output.wav")
generateTestWav(t, src, 16000, 2, 1600)
if err := AudioToWav(src, dst); err != nil {
t.Fatalf("AudioToWav failed: %v", err)
}
verifyWavFormat(t, dst, 16000, 1)
}
func TestAudioToWav_StereoAndResample(t *testing.T) {
dir := t.TempDir()
src := filepath.Join(dir, "input.wav")
dst := filepath.Join(dir, "output.wav")
generateTestWav(t, src, 44100, 2, 44100)
if err := AudioToWav(src, dst); err != nil {
t.Fatalf("AudioToWav failed: %v", err)
}
verifyWavFormat(t, dst, 16000, 1)
}
func verifyWavFormat(t *testing.T, path string, expectedRate uint32, expectedChannels uint16) {
t.Helper()
f, err := os.Open(path)
if err != nil {
t.Fatalf("open: %v", err)
}
defer f.Close()
var hdr laudio.WAVHeader
if err := binary.Read(f, binary.LittleEndian, &hdr); err != nil {
t.Fatalf("read header: %v", err)
}
if hdr.SampleRate != expectedRate {
t.Errorf("sample rate = %d, want %d", hdr.SampleRate, expectedRate)
}
if hdr.NumChannels != expectedChannels {
t.Errorf("channels = %d, want %d", hdr.NumChannels, expectedChannels)
}
if hdr.BitsPerSample != 16 {
t.Errorf("bit depth = %d, want 16", hdr.BitsPerSample)
}
if hdr.AudioFormat != 1 {
t.Errorf("audio format = %d, want 1 (PCM)", hdr.AudioFormat)
}
}


@@ -32,6 +32,12 @@ function inferBackendPath(item) {
// via a thin wrapper Makefile. Changes to either dir should retrigger it.
return `backend/cpp/turboquant/`;
}
if (item.dockerfile.endsWith("buun-llama-cpp")) {
// buun-llama-cpp is a fork-of-a-fork (spiritbuun/buun-llama-cpp forks
// TheTom/llama-cpp-turboquant) that reuses backend/cpp/llama-cpp sources
// the same way turboquant does. Changes to either dir retrigger it.
return `backend/cpp/buun-llama-cpp/`;
}
if (item.dockerfile.endsWith("llama-cpp")) {
return `backend/cpp/llama-cpp/`;
}
@@ -138,9 +144,10 @@ async function getChangedFiles() {
// Per-backend boolean outputs
for (const [backend, pathPrefix] of allBackendPaths) {
let changed = changedFiles.some(file => file.startsWith(pathPrefix));
// turboquant reuses backend/cpp/llama-cpp sources via a thin wrapper;
// changes to either directory should retrigger its pipeline.
if (backend === "turboquant" && !changed) {
// turboquant and buun-llama-cpp reuse backend/cpp/llama-cpp sources via
// thin wrapper Makefiles; changes to that directory should retrigger
// their pipelines too.
if ((backend === "turboquant" || backend === "buun-llama-cpp") && !changed) {
changed = changedFiles.some(file => file.startsWith("backend/cpp/llama-cpp/"));
}
fs.appendFileSync(process.env.GITHUB_OUTPUT, `${backend}=${changed ? 'true' : 'false'}\n`);


@@ -3442,6 +3442,7 @@ const docTemplate = `{
}
},
"is_real": {
"description": "Liveness fields — see FaceVerifyResponse for why these are pointers.",
"type": "boolean"
},
"race": {
@@ -3656,12 +3657,25 @@ const docTemplate = `{
"distance": {
"type": "number"
},
"img1_antispoof_score": {
"type": "number"
},
"img1_area": {
"$ref": "#/definitions/schema.FacialArea"
},
"img1_is_real": {
"description": "Liveness fields are only populated when the request set\nanti_spoofing=true. Pointers keep them fully absent from the\nJSON response otherwise, so callers can tell \"not checked\"\napart from \"checked and fake\" (which would collapse to zero\nvalues with plain bool+omitempty).",
"type": "boolean"
},
"img2_antispoof_score": {
"type": "number"
},
"img2_area": {
"$ref": "#/definitions/schema.FacialArea"
},
"img2_is_real": {
"type": "boolean"
},
"model": {
"type": "string"
},


@@ -3439,6 +3439,7 @@
}
},
"is_real": {
"description": "Liveness fields — see FaceVerifyResponse for why these are pointers.",
"type": "boolean"
},
"race": {
@@ -3653,12 +3654,25 @@
"distance": {
"type": "number"
},
"img1_antispoof_score": {
"type": "number"
},
"img1_area": {
"$ref": "#/definitions/schema.FacialArea"
},
"img1_is_real": {
"description": "Liveness fields are only populated when the request set\nanti_spoofing=true. Pointers keep them fully absent from the\nJSON response otherwise, so callers can tell \"not checked\"\napart from \"checked and fake\" (which would collapse to zero\nvalues with plain bool+omitempty).",
"type": "boolean"
},
"img2_antispoof_score": {
"type": "number"
},
"img2_area": {
"$ref": "#/definitions/schema.FacialArea"
},
"img2_is_real": {
"type": "boolean"
},
"model": {
"type": "string"
},


@@ -640,6 +640,7 @@ definitions:
type: number
type: object
is_real:
description: Liveness fields — see FaceVerifyResponse for why these are pointers.
type: boolean
race:
additionalProperties:
@@ -780,10 +781,24 @@ definitions:
type: number
distance:
type: number
img1_antispoof_score:
type: number
img1_area:
$ref: '#/definitions/schema.FacialArea'
img1_is_real:
description: |-
Liveness fields are only populated when the request set
anti_spoofing=true. Pointers keep them fully absent from the
JSON response otherwise, so callers can tell "not checked"
apart from "checked and fake" (which would collapse to zero
values with plain bool+omitempty).
type: boolean
img2_antispoof_score:
type: number
img2_area:
$ref: '#/definitions/schema.FacialArea'
img2_is_real:
type: boolean
model:
type: string
processing_time_ms:
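The reasoning in the `img1_is_real` description can be checked with a few lines of Go. The struct names below are hypothetical and carry only two fields; they are not the actual `schema.FaceVerifyResponse` definition.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// plainResult uses a plain bool with omitempty: false is the zero value,
// so "checked and fake" is dropped from the JSON just like "not checked".
type plainResult struct {
	Model  string `json:"model"`
	IsReal bool   `json:"img1_is_real,omitempty"`
}

// pointerResult uses *bool with omitempty: only a nil pointer is omitted,
// so an explicit false survives serialization.
type pointerResult struct {
	Model  string `json:"model"`
	IsReal *bool  `json:"img1_is_real,omitempty"`
}

// mustJSON marshals v and panics on error (acceptable in a sketch).
func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fake := false
	fmt.Println(mustJSON(plainResult{Model: "m", IsReal: false}))   // field absent
	fmt.Println(mustJSON(pointerResult{Model: "m"}))                // field absent
	fmt.Println(mustJSON(pointerResult{Model: "m", IsReal: &fake})) // field present, false
}
```

Only the pointer variant lets a client distinguish a missing liveness check from a failed one.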


@@ -40,6 +40,12 @@ import (
// to download alongside the main model — required for
// multimodal models like Qwen3-ASR-0.6B-GGUF.
// BACKEND_TEST_MMPROJ_FILE Path to an already-available mmproj file.
// BACKEND_TEST_EXTRA_FILES Pipe-separated list of companion files to download
// next to the main model. Each entry is "<url>" or
// "<url>#<local-name>" (the optional suffix renames
// the file on disk — useful for sherpa-onnx models
// whose loader expects specific names like
// encoder.int8.onnx).
// BACKEND_TEST_AUDIO_URL HTTP(S) URL of a sample audio file used by the
// transcription specs.
// BACKEND_TEST_AUDIO_FILE Path to an already-available sample audio file.
@@ -71,6 +77,9 @@ import (
// (default: "What's the weather like in Paris, France?").
// BACKEND_TEST_TOOL_NAME Override the function name expected in the tool call
// (default: "get_weather").
// BACKEND_TEST_TTS_TEXT Override the text synthesized by the tts/ttsstream
// specs (default: "The quick brown fox jumps over the
// lazy dog.").
//
// The suite is intentionally model-format-agnostic: it only ever passes the
// file path to LoadModel, so GGUF, ONNX, safetensors, .bin etc. all work so
@@ -83,6 +92,7 @@ const (
capEmbeddings = "embeddings"
capTools = "tools"
capTranscription = "transcription"
capTTS = "tts"
capImage = "image"
capFaceDetect = "face_detect"
capFaceEmbed = "face_embed"
@@ -100,6 +110,7 @@ const (
defaultImagePrompt = "a photograph of an astronaut riding a horse"
defaultImageSteps = 4
defaultVerifyDistanceCeil = float32(0.6) // upper bound for same-person pairs; SFace runs closer to 0.5, ArcFace closer to 0.35.
defaultTTSText = "The quick brown fox jumps over the lazy dog."
)
func defaultCaps() map[string]bool {
@@ -111,6 +122,17 @@ func defaultCaps() map[string]bool {
}
}
// splitURLAndName parses a "<url>#<local-name>" entry. The #name suffix is
// optional — if absent, defaultName is returned. Used by the main-model
// and extras download paths so a test can rename downloaded files to the
// shape the backend's loader expects.
func splitURLAndName(entry, defaultName string) (url, name string) {
if hash := strings.Index(entry, "#"); hash >= 0 {
return entry[:hash], entry[hash+1:]
}
return entry, defaultName
}
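As an illustration of the `<url>#<local-name>` convention, the sketch below applies `splitURLAndName` to a pipe-separated list in the shape `BACKEND_TEST_EXTRA_FILES` expects. `parseExtraFiles`, the `download` type, and the `example.invalid` URLs are hypothetical; the sketch also uses `path.Base` rather than `filepath.Base` for the default name, since the entries are URLs.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// splitURLAndName is copied from the diff: it splits "<url>#<local-name>"
// and falls back to defaultName when no "#" is present.
func splitURLAndName(entry, defaultName string) (url, name string) {
	if hash := strings.Index(entry, "#"); hash >= 0 {
		return entry[:hash], entry[hash+1:]
	}
	return entry, defaultName
}

// download pairs a source URL with the file name to use on disk.
type download struct {
	URL  string
	Name string
}

// parseExtraFiles turns a pipe-separated spec into a list of downloads,
// skipping empty entries.
func parseExtraFiles(spec string) []download {
	var out []download
	for _, entry := range strings.Split(spec, "|") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue
		}
		url, name := splitURLAndName(entry, path.Base(entry))
		out = append(out, download{URL: url, Name: name})
	}
	return out
}

func main() {
	spec := "https://example.invalid/m/decoder-epoch-99.int8.onnx#decoder.int8.onnx|" +
		"https://example.invalid/m/tokens.txt"
	for _, d := range parseExtraFiles(spec) {
		fmt.Printf("%s -> %s\n", d.URL, d.Name)
	}
}
```

The first entry is renamed on disk, while the second keeps the last path segment of its URL.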
// parseCaps reads BACKEND_TEST_CAPS and returns the enabled capability set.
// An empty/unset value falls back to defaultCaps().
func parseCaps() map[string]bool {
@@ -205,9 +227,33 @@ var _ = Describe("Backend container", Ordered, func() {
Expect(filepath.Join(binaryDir, "run.sh")).To(BeAnExistingFile())
// Download the model once if not provided and no HF name given.
// BACKEND_TEST_MODEL_URL accepts an optional "#<local-name>" suffix
// for cases where the backend expects the model file to have a
// specific name (e.g. sherpa-onnx's online recognizer finds
// encoder/decoder/joiner by filename substring).
if modelFile == "" && modelName == "" {
modelFile = filepath.Join(workDir, "model.bin")
downloadFile(modelURL, modelFile)
url, name := splitURLAndName(modelURL, "model.bin")
modelFile = filepath.Join(workDir, name)
downloadFile(url, modelFile)
}
// Multi-file models (sherpa-onnx streaming zipformer, sherpa-onnx
// Omnilingual, any split encoder/decoder/joiner bundle) need
// companion files next to the main model. BACKEND_TEST_EXTRA_FILES
// is a pipe-separated list of "<url>[#<local-name>]" entries; each
// is downloaded into the same directory as modelFile. The optional
// <local-name> renames the saved file (useful when upstream URLs
// have stamp/version suffixes the loader doesn't recognise).
if extraSpec := strings.TrimSpace(os.Getenv("BACKEND_TEST_EXTRA_FILES")); extraSpec != "" && modelFile != "" {
modelDir := filepath.Dir(modelFile)
for _, entry := range strings.Split(extraSpec, "|") {
entry = strings.TrimSpace(entry)
if entry == "" {
continue
}
url, name := splitURLAndName(entry, filepath.Base(entry))
downloadFile(url, filepath.Join(modelDir, name))
}
}
// Multimodal projector (mmproj): required by audio/vision-capable
@@ -869,6 +915,62 @@ var _ = Describe("Backend container", Ordered, func() {
}
GinkgoWriter.Printf("voice_analyze: %d segments\n", len(res.GetSegments()))
})
It("synthesizes speech via TTS", func() {
if !caps[capTTS] {
Skip("tts capability not enabled")
}
text := os.Getenv("BACKEND_TEST_TTS_TEXT")
if text == "" {
text = defaultTTSText
}
dst := filepath.Join(workDir, "tts.wav")
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
defer cancel()
_, err := client.TTS(ctx, &pb.TTSRequest{Text: text, Dst: dst})
Expect(err).NotTo(HaveOccurred())
info, err := os.Stat(dst)
Expect(err).NotTo(HaveOccurred(), "TTS did not write a file at %s", dst)
Expect(info.Size()).To(BeNumerically(">", int64(1024)),
"TTS output too small: %d bytes", info.Size())
GinkgoWriter.Printf("TTS: wrote %s (%d bytes)\n", dst, info.Size())
})
It("streams PCM via TTSStream", func() {
if !caps[capTTS] {
Skip("tts capability not enabled")
}
text := os.Getenv("BACKEND_TEST_TTS_TEXT")
if text == "" {
text = defaultTTSText
}
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
defer cancel()
stream, err := client.TTSStream(ctx, &pb.TTSRequest{Text: text})
Expect(err).NotTo(HaveOccurred())
var chunks, totalBytes int
for {
reply, err := stream.Recv()
if err == io.EOF {
break
}
Expect(err).NotTo(HaveOccurred())
if audio := reply.GetAudio(); len(audio) > 0 {
chunks++
totalBytes += len(audio)
}
}
// Header + at least one PCM chunk proves real streaming (not emit-once).
Expect(chunks).To(BeNumerically(">=", 2),
"expected >=2 chunks (header + PCM), got %d (bytes=%d)", chunks, totalBytes)
Expect(totalBytes).To(BeNumerically(">", 1024),
"streamed audio too short: %d bytes", totalBytes)
GinkgoWriter.Printf("TTSStream: %d chunks, %d bytes\n", chunks, totalBytes)
})
})
// extractImage runs `docker create` + `docker export` to materialise the image
@@ -901,9 +1003,17 @@ func extractImage(image, dest string) {
// downloadFile fetches url into dest using curl -L. Used for CI convenience;
// local runs can use BACKEND_TEST_MODEL_FILE to skip downloading.
// Retry flags guard against transient CI network hiccups (github.com in
// particular has been flaky from GHA runners, timing out TCP connects).
func downloadFile(url, dest string) {
GinkgoHelper()
- cmd := exec.Command("curl", "-sSfL", "-o", dest, url)
+ cmd := exec.Command("curl", "-sSfL",
+ "--connect-timeout", "30",
+ "--max-time", "600",
+ "--retry", "5",
+ "--retry-delay", "5",
+ "--retry-all-errors",
+ "-o", dest, url)
cmd.Stdout = GinkgoWriter
cmd.Stderr = GinkgoWriter
Expect(cmd.Run()).To(Succeed(), "failed to download %s", url)
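
The retry policy those curl flags ask for (one initial attempt, up to five retries, a fixed five-second pause between them) can be sketched in plain Go. This is only an illustration: `retry` and `errTransient` are hypothetical names that do not exist in the test suite, and the demo uses a millisecond delay so it finishes quickly.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a hypothetical stand-in for a flaky network failure.
var errTransient = errors.New("transient connect timeout")

// retry is a hypothetical helper, not part of the test suite. It calls fn
// once and then up to `retries` more times, sleeping `delay` between
// attempts, and returns nil on the first success or the last error seen.
func retry(retries int, delay time.Duration, fn func() error) error {
	var err error
	for attempt := 0; attempt <= retries; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if attempt < retries {
			time.Sleep(delay)
		}
	}
	return err
}

func main() {
	calls := 0
	err := retry(5, time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient // first two attempts fail
		}
		return nil
	})
	fmt.Println(calls, err) // prints "3 <nil>"
}
```

Shelling out to curl keeps the helper short; a loop like this would only be worth having if the download ever moved to `net/http`.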

@@ -212,6 +212,9 @@ var _ = BeforeSuite(func() {
// Import model configs from an external directory (e.g. real model YAMLs
// and weights mounted into a container). Symlinks avoid copying large files.
// Both files and directories are symlinked — multi-file backends like
// sherpa-onnx TTS expect their tokens.txt / lexicon.txt sidecars in the
// same directory as the .onnx, so we need whole-directory imports.
if rtModels := os.Getenv("REALTIME_MODELS_PATH"); rtModels != "" {
entries, err := os.ReadDir(rtModels)
Expect(err).ToNot(HaveOccurred())
@@ -221,9 +224,6 @@ var _ = BeforeSuite(func() {
if _, err := os.Stat(dst); err == nil {
continue // don't overwrite mock configs
}
- if entry.IsDir() {
- continue
- }
Expect(os.Symlink(src, dst)).To(Succeed())
}
}
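
A minimal sketch of the whole-directory import described in the comment above, assuming a flat source directory whose entries are either config files or model directories. `linkEntries` and `demo` are hypothetical names used only for this illustration; the suite does the same work inline in `BeforeSuite`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// linkEntries is a hypothetical helper, not the suite's code. It symlinks
// every top-level entry of srcDir (files and directories alike) into dstDir,
// skips names that already exist there, and returns how many links it made.
func linkEntries(srcDir, dstDir string) (int, error) {
	entries, err := os.ReadDir(srcDir)
	if err != nil {
		return 0, err
	}
	linked := 0
	for _, entry := range entries {
		src := filepath.Join(srcDir, entry.Name())
		dst := filepath.Join(dstDir, entry.Name())
		if _, err := os.Stat(dst); err == nil {
			continue // an existing config wins
		}
		if err := os.Symlink(src, dst); err != nil {
			return linked, err
		}
		linked++
	}
	return linked, nil
}

// demo stages a throwaway source tree and reports the number of links made
// and whether a sidecar inside a linked directory is reachable via dstDir.
func demo() (int, bool, error) {
	srcDir, err := os.MkdirTemp("", "models-src")
	if err != nil {
		return 0, false, err
	}
	defer os.RemoveAll(srcDir)
	dstDir, err := os.MkdirTemp("", "models-dst")
	if err != nil {
		return 0, false, err
	}
	defer os.RemoveAll(dstDir)

	if err := os.MkdirAll(filepath.Join(srcDir, "vits-ljs"), 0o755); err != nil {
		return 0, false, err
	}
	files := []string{
		filepath.Join(srcDir, "vits-ljs", "tokens.txt"),
		filepath.Join(srcDir, "real-tts.yaml"),
		filepath.Join(srcDir, "mock-tts.yaml"),
		filepath.Join(dstDir, "mock-tts.yaml"), // pre-existing, must survive
	}
	for _, f := range files {
		if err := os.WriteFile(f, []byte("placeholder\n"), 0o644); err != nil {
			return 0, false, err
		}
	}

	linked, err := linkEntries(srcDir, dstDir)
	if err != nil {
		return linked, false, err
	}
	_, statErr := os.Stat(filepath.Join(dstDir, "vits-ljs", "tokens.txt"))
	return linked, statErr == nil, nil
}

func main() {
	linked, sidecarVisible, err := demo()
	fmt.Println(linked, sidecarVisible, err) // prints "2 true <nil>"
}
```

Linking the directory itself, rather than each file inside it, is what keeps `tokens.txt` and `lexicon.txt` next to the `.onnx` file from the backend's point of view.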

@@ -1,15 +1,21 @@
package e2e_test
import (
"bytes"
"encoding/base64"
"encoding/json"
"fmt"
"io"
"math"
"net/http"
"net/url"
"os"
"strings"
"time"
"github.com/gorilla/websocket"
laudio "github.com/mudler/LocalAI/pkg/audio"
"github.com/mudler/LocalAI/pkg/sound"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
@@ -72,6 +78,66 @@ func generatePCMBase64(freq float64, sampleRate, durationMs int) string {
return base64.StdEncoding.EncodeToString(pcm)
}
// padPCMBase64 prepends and appends the given milliseconds of silence to a
// base64-encoded 16-bit LE PCM buffer. Used to give VAD a clear lead-in /
// lead-out so Silero can reliably detect utterance boundaries.
func padPCMBase64(pcmB64 string, sampleRate, leadingMs, trailingMs int) string {
raw, err := base64.StdEncoding.DecodeString(pcmB64)
ExpectWithOffset(1, err).ToNot(HaveOccurred())
lead := make([]byte, sampleRate*leadingMs/1000*2)
trail := make([]byte, sampleRate*trailingMs/1000*2)
padded := make([]byte, 0, len(lead)+len(raw)+len(trail))
padded = append(padded, lead...)
padded = append(padded, raw...)
padded = append(padded, trail...)
return base64.StdEncoding.EncodeToString(padded)
}
// ttsPCMBase64 drives the /v1/audio/speech endpoint to render `text` through
// the given TTS model, strips the returned WAV header, resamples to the
// requested sample rate if needed, and returns base64-encoded 16-bit LE PCM.
// Fails the test on any transport / format error — there's no useful fallback.
func ttsPCMBase64(model, text string, targetSampleRate int) string {
body, err := json.Marshal(map[string]any{
"model": model,
"input": text,
"format": "wav",
})
ExpectWithOffset(1, err).ToNot(HaveOccurred())
req, err := http.NewRequest(http.MethodPost, apiURL+"/audio/speech", bytes.NewReader(body))
ExpectWithOffset(1, err).ToNot(HaveOccurred())
req.Header.Set("Content-Type", "application/json")
resp, err := http.DefaultClient.Do(req)
ExpectWithOffset(1, err).ToNot(HaveOccurred())
defer resp.Body.Close()
wav, err := io.ReadAll(resp.Body)
ExpectWithOffset(1, err).ToNot(HaveOccurred())
ExpectWithOffset(1, resp.StatusCode).To(Equal(http.StatusOK),
"tts returned %d: %s", resp.StatusCode, string(wav))
pcm, srcRate := laudio.ParseWAV(wav)
ExpectWithOffset(1, srcRate).To(BeNumerically(">", 0),
"tts response is not a valid WAV (body=%d bytes)", len(wav))
if srcRate != targetSampleRate {
samples := sound.BytesToInt16sLE(pcm)
pcm = sound.Int16toBytesLE(sound.ResampleInt16(samples, srcRate, targetSampleRate))
}
return base64.StdEncoding.EncodeToString(pcm)
}
// isRealTTS returns true when REALTIME_TTS names a real backend-backed model,
// as opposed to the default mock-tts. Used to gate test behavior that only
// makes sense with a real TTS — e.g. driving the session with a real
// utterance and asserting the transcription contains recognisable words.
func isRealTTS() bool {
m := os.Getenv("REALTIME_TTS")
return m != "" && m != "mock-tts"
}
// pipelineModel returns the model name to use for realtime tests.
func pipelineModel() string {
if m := os.Getenv("REALTIME_TEST_MODEL"); m != "" {
@@ -139,8 +205,19 @@ var _ = Describe("Realtime WebSocket API", Label("Realtime"), func() {
sendClientEvent(conn, disableVADEvent())
drainUntil(conn, "session.updated", 10*time.Second)
- // Append 1 second of 440Hz sine wave at 24kHz (the default remote sample rate)
- audio := generatePCMBase64(440, 24000, 1000)
// Real TTS: synthesise an utterance the ASR should be able to
// recognise, and pad with silence so Silero-VAD has a clear
// lead-in/out. Fallback: 1s of 440Hz sine wave — the mock
// transcriber returns a static string anyway, so this only
// needs to exercise the pipeline plumbing.
const inputText = "The quick brown fox jumps over the lazy dog."
var audio string
if isRealTTS() {
audio = ttsPCMBase64(os.Getenv("REALTIME_TTS"), inputText, 24000)
audio = padPCMBase64(audio, 24000, 500, 500)
} else {
audio = generatePCMBase64(440, 24000, 1000)
}
sendClientEvent(conn, map[string]any{
"type": "input_audio_buffer.append",
"audio": audio,
@@ -161,9 +238,30 @@ var _ = Describe("Realtime WebSocket API", Label("Realtime"), func() {
committed := drainUntil(conn, "input_audio_buffer.committed", 30*time.Second)
Expect(committed).ToNot(BeNil())
- // Wait for the full response cycle to complete
- done := drainUntil(conn, "response.done", 60*time.Second)
- Expect(done).ToNot(BeNil())
// Drain the response cycle, capturing the input transcription
// event as it arrives so we can sanity-check it alongside the
// real-TTS path.
var transcript string
deadline := time.Now().Add(90 * time.Second)
for time.Now().Before(deadline) {
evt := readServerEvent(conn, time.Until(deadline))
if evt["type"] == "conversation.item.input_audio_transcription.completed" {
if t, ok := evt["transcript"].(string); ok {
transcript = t
}
}
if evt["type"] == "response.done" {
Expect(evt).ToNot(BeNil())
break
}
}
if isRealTTS() {
lower := strings.ToLower(transcript)
matched := strings.Contains(lower, "fox") || strings.Contains(lower, "dog")
Expect(matched).To(BeTrue(),
"expected real-TTS transcript to contain 'fox' or 'dog' (got %q)", transcript)
}
})
})
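
The silence padding used for the real-TTS path is plain byte arithmetic: 16-bit PCM takes 2 bytes per sample, so `sampleRate*ms/1000*2` zero bytes represent `ms` milliseconds. The sketch below works on raw bytes rather than base64 and assumes mono audio, since the formula in `padPCMBase64` has no channel factor. `silenceBytes` and `padPCM` are hypothetical names for illustration only.

```go
package main

import "fmt"

// silenceBytes returns how many zero bytes encode ms milliseconds of silence
// as 16-bit mono PCM at sampleRate Hz (mono is an assumption of this sketch).
func silenceBytes(sampleRate, ms int) int {
	return sampleRate * ms / 1000 * 2
}

// padPCM returns raw with leadingMs of silence before it and trailingMs of
// silence after it. make() zero-fills, and zero samples are digital silence.
func padPCM(raw []byte, sampleRate, leadingMs, trailingMs int) []byte {
	lead := silenceBytes(sampleRate, leadingMs)
	trail := silenceBytes(sampleRate, trailingMs)
	padded := make([]byte, lead, lead+len(raw)+trail)
	padded = append(padded, raw...)
	padded = append(padded, make([]byte, trail)...)
	return padded
}

func main() {
	// One second of 24 kHz audio, padded by 500 ms on each side as in the test.
	raw := make([]byte, silenceBytes(24000, 1000))
	padded := padPCM(raw, 24000, 500, 500)
	fmt.Println(silenceBytes(24000, 500), len(raw), len(padded)) // prints "24000 48000 96000"
}
```

At 24 kHz each 500 ms pad is 24000 bytes, so the padding doubles the size of a one-second utterance.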

tests/e2e/run-realtime-sherpa.sh Executable file
@@ -0,0 +1,136 @@
#!/bin/bash
# Drives tests/e2e/realtime_ws_test.go against a realtime pipeline where
# VAD, STT and TTS are served by the sherpa-onnx Docker backend, and the
# LLM slot stays mocked by the in-repo mock-backend. Pre-requisites:
# - `make build-mock-backend` has produced tests/e2e/mock-backend/mock-backend
# - `make docker-build-sherpa-onnx` has produced local-ai-backend:sherpa-onnx
# - `make protogen-go` is up-to-date
# Environment overrides:
#   WORK_DIR   Where to stage the extracted backend + model files.
#              Defaults to a mktemp'd directory (cleaned on exit).
#   KEEP_WORK  Set to any non-empty value to preserve WORK_DIR after the test
#              exits. Useful for debugging: with a fixed WORK_DIR, repeated
#              runs skip files that are already downloaded.
set -euo pipefail
ROOT=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")/../.." && pwd)
IMAGE=${BACKEND_IMAGE:-local-ai-backend:sherpa-onnx}
MODEL=${REALTIME_STT_MODEL:-omnilingual-0.3b-ctc-q8-sherpa}
VAD_MODEL=${REALTIME_VAD_MODEL:-silero-vad-sherpa}
TTS_MODEL=${REALTIME_TTS_MODEL:-vits-ljs-sherpa}
WORK_DIR=${WORK_DIR:-$(mktemp -d -t localai-sherpa-realtime.XXXXXX)}
if [[ -z "${KEEP_WORK:-}" ]]; then
trap 'rm -rf "$WORK_DIR"' EXIT
fi
echo "WORK_DIR=$WORK_DIR"
BACKENDS_DIR="$WORK_DIR/backends"
MODELS_DIR="$WORK_DIR/models"
mkdir -p "$BACKENDS_DIR/sherpa-onnx" "$MODELS_DIR"
# 1. Extract the sherpa-onnx backend image rootfs. Mirrors the pattern in
#    tests/e2e-backends/backend_test.go:extractImage; docker create + export
#    avoids having to pull and parse layer tarballs.
if [[ ! -x "$BACKENDS_DIR/sherpa-onnx/run.sh" ]]; then
echo "Extracting $IMAGE rootfs into $BACKENDS_DIR/sherpa-onnx ..."
CID=$(docker create --entrypoint=/run.sh "$IMAGE")
trap 'docker rm -f "$CID" >/dev/null 2>&1 || true; [[ -z "${KEEP_WORK:-}" ]] && rm -rf "$WORK_DIR"' EXIT
docker export "$CID" | tar -xC "$BACKENDS_DIR/sherpa-onnx" \
--exclude='dev/*' --exclude='proc/*' --exclude='sys/*'
docker rm -f "$CID" >/dev/null
fi
# Make sure run.sh is executable (tar usually preserves this, but belt + braces).
chmod +x "$BACKENDS_DIR/sherpa-onnx/run.sh" \
"$BACKENDS_DIR/sherpa-onnx/sherpa-onnx" 2>/dev/null || true
# 2. Download model files. URLs + sha256s match gallery/index.yaml entries.
download() {
local dst="$1" url="$2" sha="$3"
if [[ -f "$dst" ]]; then
actual=$(sha256sum "$dst" | awk '{print $1}')
if [[ "$actual" == "$sha" ]]; then
echo "cached: $dst"
return
fi
fi
mkdir -p "$(dirname "$dst")"
echo "downloading: $url -> $dst"
curl -sSfL "$url" -o "$dst"
actual=$(sha256sum "$dst" | awk '{print $1}')
if [[ "$actual" != "$sha" ]]; then
echo "sha256 mismatch for $dst: got $actual, expected $sha" >&2
exit 1
fi
}
# Silero VAD (single file)
download "$MODELS_DIR/silero-vad/silero-vad.onnx" \
"https://huggingface.co/onnx-community/silero-vad/resolve/main/onnx/model.onnx" \
"a4a068cd6cf1ea8355b84327595838ca748ec29a25bc91fc82e6c299ccdc5808"
# Omnilingual ASR (model + tokens)
download "$MODELS_DIR/omnilingual-asr/model.int8.onnx" \
"https://huggingface.co/csukuangfj/sherpa-onnx-omnilingual-asr-1600-languages-300M-ctc-int8-2025-11-12/resolve/main/model.int8.onnx" \
"e7c4e54ee4c4c47829cc6667d5d00ed8ea7bef1dcfeef0fce766f77752a2726c"
download "$MODELS_DIR/omnilingual-asr/tokens.txt" \
"https://huggingface.co/csukuangfj/sherpa-onnx-omnilingual-asr-1600-languages-300M-ctc-int8-2025-11-12/resolve/main/tokens.txt" \
"a7a044c52cb29cbe8b0dc1953e92cefd4ca16b0ed968177b6beab21f9a7d0b31"
# VITS-LJS TTS (model + tokens + lexicon)
download "$MODELS_DIR/vits-ljs/vits-ljs.onnx" \
"https://huggingface.co/csukuangfj/vits-ljs/resolve/main/vits-ljs.onnx" \
"5bbd273797a9ecf8d94bd6ec02ad16cb41cbb85f055ad98d528ced3e44c9b31a"
download "$MODELS_DIR/vits-ljs/tokens.txt" \
"https://huggingface.co/csukuangfj/vits-ljs/resolve/main/tokens.txt" \
"5fee2c6b238d712287f2ecb08f34a8a8b413bcb7390862ef6fb6fd6f0f8d3a17"
download "$MODELS_DIR/vits-ljs/lexicon.txt" \
"https://huggingface.co/csukuangfj/vits-ljs/resolve/main/lexicon.txt" \
"bdccfc6da71c45c48e2e0056fcf0aab760577c5f959f6c1b5eb3e3e916fd5a0e"
# 3. Write model config YAMLs matching the gallery entries' shape. These are
#    what the realtime pipeline resolves via LoadModelConfigFileByName.
cat > "$MODELS_DIR/$VAD_MODEL.yaml" <<EOF
name: $VAD_MODEL
backend: sherpa-onnx
type: vad
parameters:
  model: silero-vad/silero-vad.onnx
known_usecases:
- vad
EOF
cat > "$MODELS_DIR/$MODEL.yaml" <<EOF
name: $MODEL
backend: sherpa-onnx
type: asr
parameters:
  model: omnilingual-asr/model.int8.onnx
options:
- subtype=omnilingual
known_usecases:
- transcript
EOF
cat > "$MODELS_DIR/$TTS_MODEL.yaml" <<EOF
name: $TTS_MODEL
backend: sherpa-onnx
parameters:
  model: vits-ljs/vits-ljs.onnx
known_usecases:
- tts
EOF
# 4. Run the Ginkgo spec. REALTIME_TEST_MODEL=realtime-test-pipeline triggers
#    the e2e suite to auto-compose a pipeline YAML from the slot env vars.
export REALTIME_TEST_MODEL=realtime-test-pipeline
export REALTIME_VAD="$VAD_MODEL"
export REALTIME_STT="$MODEL"
export REALTIME_LLM=mock-llm
export REALTIME_TTS="$TTS_MODEL"
export REALTIME_MODELS_PATH="$MODELS_DIR"
export REALTIME_BACKENDS_PATH="$BACKENDS_DIR"
cd "$ROOT"
go test -v -timeout 30m ./tests/e2e/... \
-ginkgo.focus="Manual audio commit"
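
For comparison with the Go helpers elsewhere in this change, the cache check inside the script's `download()` function (reuse a file only when its sha256 matches) could be sketched in Go as follows. `fileSHA256`, `isCached` and `demoCache` are hypothetical names; the demo hashes the three bytes `abc`, whose SHA-256 digest is a published test vector.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// fileSHA256 streams the file at path through SHA-256 and returns the
// lowercase hex digest, the same format `sha256sum` prints.
func fileSHA256(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// isCached reports whether path exists and hashes to want. A missing or
// unreadable file counts as a cache miss, so the caller downloads it again.
func isCached(path, want string) bool {
	got, err := fileSHA256(path)
	return err == nil && got == want
}

// demoCache exercises the three cases: matching file, stale file, no file.
func demoCache() (hit, stale, missing bool) {
	dir, err := os.MkdirTemp("", "sha-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "model.onnx")
	if err := os.WriteFile(path, []byte("abc"), 0o644); err != nil {
		panic(err)
	}
	const abcSHA = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
	hit = isCached(path, abcSHA)
	stale = isCached(path, "not-the-right-digest")
	missing = isCached(filepath.Join(dir, "absent.onnx"), abcSHA)
	return hit, stale, missing
}

func main() {
	fmt.Println(demoCache()) // prints "true false false"
}
```

As in the script, a mismatch after a fresh download should still be treated as fatal; this sketch covers only the decision to skip the download.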