Compare commits

90 Commits

Author SHA1 Message Date
LocalAI [bot]
4d3d54d61b test(e2e): live-server voice-recognition gate test (#10324)
Add mock-backend VoiceEmbed/VoiceVerify (deterministic DC-offset speaker
discrimination) and a verify-mode gated realtime pipeline, then drive the
real HTTP/WS stack: an authorized speaker reaches response.done while an
unauthorized one is dropped before the LLM with a speaker_not_authorized
event.


Assisted-by: Claude:opus-4.8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 23:54:27 +02:00
LocalAI [bot]
36e3419203 chore: ⬆️ Update vllm-project/vllm cu130 wheel to 0.23.0 (#10314)
⬆️ Update vllm-project/vllm cu130 wheel

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-13 23:39:10 +02:00
LocalAI [bot]
4ec6e3221e feat(realtime): gate realtime pipeline voice models behind voice recognition (#10319)
* feat(realtime): add pipeline voice_recognition gate config schema

Add the PipelineVoiceRecognition config block that gates a realtime
pipeline behind speaker verification (identify against the voice
registry, or verify against reference audios), with Normalize defaults
and Validate enum/shape checks. Register the new fields in the config
meta registry so the UI renders them with proper labels/components
(required by the registry-coverage gate).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* fix(realtime): range-check voice gate threshold and floor UI min

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* feat(realtime): add cosineDistance helper for voice gate

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* feat(realtime): add voiceGate identify-mode authorization

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* test(realtime): cover voice gate fail-closed error paths

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* feat(realtime): add voiceGate verify-mode authorization

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* feat(realtime): add voiceGate decide policy helper

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* feat(realtime): add newVoiceGate constructor

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* feat(realtime): gate pipeline responses behind voice recognition

Run speaker verification concurrently with transcription and join on a
hard barrier before generateResponse, so unauthorized utterances never
reach the LLM, tools, or TTS. Supports identify (registry) and verify
(reference) modes with multiple authorized speakers, per-utterance or
first-utterance checking, and drop-with-event or silent-drop on reject.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* fix(realtime): harden voice gate goroutine lifecycle

Only launch the verification goroutine on the transcription path and
drain it before the temp WAV is removed on the transcription-error
return, so an in-flight backend read never races the deferred cleanup.
Drop the write-only voiceMatched field; log the matched speaker instead.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* docs(realtime): document the voice_recognition pipeline gate

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* fix(realtime): fail closed on an incomplete voice_recognition block

A present voice_recognition block with no model previously disabled the
gate silently, authorizing every speaker. Treat block presence as the
intent signal and reject an empty model in Validate, so the session is
refused instead of running unprotected.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* test(realtime): integration-test the voice gate through commitUtterance

Drive the real commitUtterance path (gate goroutine, hard join before the
LLM, reject event, when:first session trust) with the existing
transport/model doubles: authorized speakers reach a full response,
unauthorized ones are dropped before the LLM with a speaker_not_authorized
event, backend errors fail closed, drop_silent stays quiet, and when:first
trusts the session after one match.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 23:38:08 +02:00
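Illustration for 4ec6e3221e: the commit describes running speaker verification concurrently with transcription and joining on a hard barrier before the response is generated, failing closed on backend errors. A minimal Go sketch of that shape follows. The names `cosineDistance` and `commitUtterance` come from the commit message, but the signatures, the injected callbacks, the error value, and the 0.25 threshold are assumptions made for illustration, not the project's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// errSpeakerNotAuthorized is a hypothetical sentinel mirroring the
// speaker_not_authorized event named in the commit message.
var errSpeakerNotAuthorized = errors.New("speaker_not_authorized")

// cosineDistance returns 1 - cos(a, b). Mismatched or zero-norm inputs
// return an error so the caller can fail closed (assumed behavior).
func cosineDistance(a, b []float32) (float64, error) {
	if len(a) == 0 || len(a) != len(b) {
		return 0, errors.New("embedding length mismatch")
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0, errors.New("zero-norm embedding")
	}
	return 1 - dot/math.Sqrt(na*nb), nil
}

// commitUtterance (hypothetical signature) starts verification in a
// goroutine, transcribes in the foreground, and always drains the
// goroutine before deciding, so nothing is left in flight on any return.
func commitUtterance(
	transcribe func() (string, error),
	verify func() (bool, error),
	respond func(text string),
) error {
	type gateResult struct {
		ok  bool
		err error
	}
	done := make(chan gateResult, 1) // buffered: the goroutine never blocks
	go func() {
		ok, err := verify()
		done <- gateResult{ok, err}
	}()

	text, terr := transcribe()
	res := <-done // hard join, also taken on the transcription-error path
	if terr != nil {
		return terr
	}
	if res.err != nil || !res.ok {
		return errSpeakerNotAuthorized // fail closed: respond is never called
	}
	respond(text)
	return nil
}

func main() {
	reference := []float32{1, 2}
	verify := func() (bool, error) {
		d, err := cosineDistance([]float32{2, 4}, reference)
		return err == nil && d <= 0.25, err // 0.25 is an assumed threshold
	}
	err := commitUtterance(
		func() (string, error) { return "hello", nil },
		verify,
		func(text string) { fmt.Println("respond to:", text) },
	)
	fmt.Println("err:", err)
}
```

Draining the channel before checking the transcription error is what keeps a still-running backend read from racing any cleanup that happens on return.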
LocalAI [bot]
4bb592cf91 feat(qwen3-tts-cpp): migrate to ServeurpersoCom/qwentts.cpp (streaming, speakers, voice design) (#10316)
* feat(qwen3-tts-cpp): repoint upstream to ServeurpersoCom/qwentts.cpp

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(qwen3-tts-cpp): flatten qt_* ABI into qt3_* purego shim

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(qwen3-tts-cpp): build shim against upstream qwen-core static lib

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(qwen3-tts-cpp): add option/language/voice/sampling parsing

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(qwen3-tts-cpp): add 24kHz WAV encode/decode/stream-header helpers

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(qwen3-tts-cpp): purego backend with streaming, speakers, voice design

Map TTSRequest onto qwentts.cpp: instructions->instruct, voice->named
speaker or clone-reference path, params map->ref_text + sampling. Add
TTSStream over the qt chunk callback.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* test(qwen3-tts-cpp): unit specs + build-gated TTS/TTSStream e2e

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(qwen3-tts-cpp): close defensive PCM-free gap on zero-sample result

Register CppPCMFree before the n<=0 guard so a non-null buffer with zero
samples cannot leak (the C contract returns NULL on failure, so this is
defensive). Raised in code review.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(qwen3-tts-cpp): advertise TTSStream capability

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(qwen3-tts-cpp): update backend index metadata for qwentts.cpp

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(gallery): qwentts.cpp models - base/customvoice/voicedesign, Q8_0 & Q4_K_M

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* docs(qwen3-tts-cpp): release note for qwentts.cpp migration

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* test(qwen3-tts-cpp): cover audio_path voice-cloning fallback

Add resolveRequest unit specs (config audio_path used as the clone
reference when Voice is empty; per-request audio Voice overrides it; a
named-speaker Voice does not trigger cloning) plus a real-inference e2e
that clones from audio_path (confirmed ref_spk_emb=yes in the pipeline).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(qwen3-tts-cpp): drop the release-note doc

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 23:09:59 +02:00
Ettore Di Giacinto
3e838c0cff docs: add realtime voice demo example and refresh README news
Add the localai-org/localai-realtime-demo Go client to the README
Examples list and to the realtime docs (integrations + realtime feature
page). Refresh the Latest News section with June 2026 highlights pulled
from history since v4.3.0: realtime pipeline streaming, the parakeet.cpp
and CrispASR speech work, new backends (locate-anything.cpp, Ideogram4,
llama.cpp video input), and distributed-mode hardening.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
2026-06-13 20:10:22 +00:00
moduvoice
36b4a81d1e feat(i18n): add Korean (ko) translation (#10312)
Add a full Korean locale (core/http/react-ui/public/locales/ko/, 13 namespaces,
840 keys, full parity with en/) and register ko in SUPPORTED_LANGUAGES
(core/http/react-ui/src/i18n/index.js). All i18next {{interpolation}} and
_one/_other plural keys preserved; brand/model names kept untranslated.

Assisted-by: Claude:claude-opus-4-8

Signed-off-by: moduvoice <moduvoicr77@gmail.com>
2026-06-13 21:58:50 +02:00
LocalAI [bot]
0854932a25 feat(omnivoice-cpp): add OmniVoice TTS backend (file + streaming, voice cloning + voice design) (#10310)
* feat(omnivoice-cpp): add C wrapper + CMake/Makefile build over OmniVoice ov_* ABI

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(omnivoice-cpp): add option/language parsing + WAV framing helpers with tests

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(omnivoice-cpp): wire purego binding with TTS + streaming TTSStream

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* build(omnivoice-cpp): wire backend into root Makefile

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(omnivoice-cpp): add build matrix entries + dep-bump registration

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(omnivoice-cpp): register backend meta + image entries

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(omnivoice-cpp): expose as preference-only importable backend

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add omnivoice-cpp TTS models (Q8_0 default + BF16 HQ)

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(omnivoice-cpp): document the OmniVoice TTS backend

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(omnivoice-cpp): add env-gated e2e for TTS + streaming

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(omnivoice-cpp): honor tts.audio_path/tts.voice config as default cloning reference

The model config tts.audio_path (ModelOptions.AudioPath) and tts.voice now
provide a default voice-cloning reference used when a request omits Voice, so a
cloned voice can be pinned in the model YAML instead of passed per request. A
per-request voice still overrides. Paths resolve relative to the model dir.

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(omnivoice-cpp): add missing omnivoice-cpp-development backend meta

Mirrors the whisper/vibevoice convention: a -development meta aggregating the
master-tagged image variants (the production meta and per-variant prod+dev image
entries already existed; only the development meta aggregator was missing).

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 21:28:46 +02:00
LocalAI [bot]
203410871b feat(sherpa-onnx): add Kokoro TTS + multilingual Piper voices (#10309)
Wire the Kokoro model family into the sherpa-onnx backend (which only
supported VITS/Piper before) and add gallery voices for Italian, English,
Spanish, French and German plus a multilingual Kokoro model.

- csrc/shim.{c,h}: kokoro_* config setters (model/voices/tokens/data_dir/
  dict_dir/lexicon/lang/length_scale) mirroring the VITS path, with the
  matching frees in tts_config_free.
- backend.go: loadTTS now detects a Kokoro model (a voices.bin beside the
  ONNX) and routes to configureKokoroTTS, otherwise configureVitsTTS.
  Kokoro picks up espeak-ng-data, the jieba dict and the per-language
  lexicons (only one English variant, to avoid tens of thousands of
  duplicate-word warnings at load); the language= option hints the lang.
- backend_test.go: functional test for isKokoroModel detection.
- gallery: 5 Piper VITS voices (it_IT-paola, en_US-amy, es_ES-davefx,
  fr_FR-siwis, de_DE-thorsten) + kokoro-multi-lang-v1.0, served through
  sherpa-onnx-tts.yaml with native streaming TTS.

Verified by building the backend and synthesizing with a real Piper and
Kokoro model (31/31 specs pass, including real-model synth smokes).


Assisted-by: Claude:claude-opus-4-8 gofmt golangci-lint go-test

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 21:27:27 +02:00
LocalAI [bot]
7637f8cf1b feat(distributed): declarative per-model scheduling via env/args (#10308)
* feat(distributed): add SpreadAll column and authoritative scheduling seeding

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): parse declarative model scheduling config (env/file)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): reconcile spread_all to one replica per matching node

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): wire LOCALAI_MODEL_SCHEDULING env/args and startup seeding

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): expose spread_all on the scheduling API endpoint

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): add spread-to-all-nodes mode to the scheduling UI

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document LOCALAI_MODEL_SCHEDULING env/args

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): clarify replica modes and all-nodes spread in scheduling config

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 18:31:06 +02:00
LocalAI [bot]
f0e001b7f8 fix(xsysinfo): container-aware total RAM detection (cgroup/lxcfs) (#8059) (#10288)
fix(xsysinfo): make reported system RAM total cgroup/lxcfs-aware (#8059)

GetSystemRAMInfo derived Total from memory.TotalMemory(), which on Linux
uses syscall.Sysinfo().Totalram - the HOST kernel total. lxcfs/LXD does
NOT virtualize that value, while MemAvailable (used for Free/Available)
IS virtualized. Inside an LXD/container with a 128Gi host but a ~10Gi
container view this produced Total=128Gi, Available=10Gi => Used=118Gi,
reporting ~92% RAM usage on an idle container.

Derive Total instead from the minimum of all non-zero, non-unlimited
candidates: cgroup v2 memory.max, cgroup v1 memory.limit_in_bytes (the
kernel unlimited sentinel is ignored), /proc/meminfo MemTotal (which
lxcfs virtualizes), and the syscall.Sysinfo total as the bare-metal
fallback. On bare metal every candidate is unlimited or equals the host
total, so behavior is unchanged.

The selection/parsing lives in a pure function chooseTotalMemory(...)
taking file CONTENTS, unit-tested without a real LXD host; OS file
reads stay in a thin wrapper.

Assisted-by: claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 18:13:06 +02:00
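Illustration for f0e001b7f8: the commit picks the total as the minimum of all non-zero, non-unlimited candidates, using a pure function that takes file contents. A minimal Go sketch under assumptions: the name `chooseTotalMemory` is from the commit message, while the signature, the helper names, and the `1<<62` cutoff used to recognise the cgroup v1 "unlimited" sentinel are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// unlimitedThreshold is an assumed cutoff: cgroup v1 reports "no limit" as a
// page-aligned value close to the int64 maximum, so anything this large is
// treated as unlimited.
const unlimitedThreshold = uint64(1) << 62

// parseLimit parses the contents of a cgroup limit file. It returns 0 for
// "max", empty, unparseable, or effectively unlimited values.
func parseLimit(contents string) uint64 {
	s := strings.TrimSpace(contents)
	if s == "" || s == "max" {
		return 0
	}
	v, err := strconv.ParseUint(s, 10, 64)
	if err != nil || v >= unlimitedThreshold {
		return 0
	}
	return v
}

// parseMemTotal extracts MemTotal (reported in kB) from /proc/meminfo
// contents and returns it in bytes, or 0 if the field is missing.
func parseMemTotal(meminfo string) uint64 {
	for _, line := range strings.Split(meminfo, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[0] == "MemTotal:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0
			}
			return kb * 1024
		}
	}
	return 0
}

// chooseTotalMemory returns the smallest usable candidate. sysinfoTotal is
// the bare-metal fallback and is used as-is.
func chooseTotalMemory(cgroupV2Max, cgroupV1Limit, meminfo string, sysinfoTotal uint64) uint64 {
	candidates := []uint64{
		parseLimit(cgroupV2Max),
		parseLimit(cgroupV1Limit),
		parseMemTotal(meminfo),
		sysinfoTotal,
	}
	var best uint64
	for _, c := range candidates {
		if c != 0 && (best == 0 || c < best) {
			best = c
		}
	}
	return best
}

func main() {
	// A container whose lxcfs view is 10 GiB on a 128 GiB host.
	total := chooseTotalMemory("max\n", "", "MemTotal:       10485760 kB\n", 128<<30)
	fmt.Printf("total = %d GiB\n", total>>30)
}
```

Because the function only sees strings and one integer, the container case can be unit-tested without an LXD host, which is the design point the commit calls out.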
pos-ei-don
cf9debf4eb model: fix case-insensitive suffix matching and skip .bak files in ListFilesInModelPath (#10306)
model: skip .bak files and fix case-insensitive suffix matching in ListFilesInModelPath
2026-06-13 17:46:46 +02:00
LocalAI [bot]
e1556aa1dc fix(react-ui): make agent chat timestamps format-agnostic (#9867) (#10290)
fix(agents): make React agent chat timestamps format-agnostic

The agent SSE bridge emits the json_message timestamp in three different
encodings depending on deploy mode: an RFC3339 string (standalone agent
pool), Unix milliseconds (local dispatcher), and Unix nanoseconds (the
older NATS path). The React AgentChat handler passed data.timestamp
straight through, so the standalone string and any numeric value outside
the millisecond range rendered as "Invalid Timestamp" or a constant
epoch-ish time.

Add a small pure helper, normalizeTimestampMs, that accepts an RFC3339
string or a numeric epoch in s/ms/us/ns and returns JS milliseconds,
falling back to Date.now() on null/empty/unparseable input. Use it in
the json_message handler so the rendered time is correct regardless of
which backend path produced it.

Fixes #9867


Assisted-by: claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 11:05:21 +02:00
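Illustration for e1556aa1dc: the actual helper, `normalizeTimestampMs`, lives in the React UI and is JavaScript. The sketch below renders the same idea in Go, to keep one language across these examples. The magnitude cutoffs used to tell seconds, milliseconds, microseconds, and nanoseconds apart are assumptions, and the current time is passed in as `nowMs` instead of being read from the clock so the fallback is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// normalizeTimestampMs accepts an RFC3339 string or a numeric epoch in
// s/ms/us/ns (float64, as produced by JSON decoding) and returns Unix
// milliseconds. Anything else falls back to nowMs.
func normalizeTimestampMs(v any, nowMs int64) int64 {
	switch t := v.(type) {
	case string:
		if parsed, err := time.Parse(time.RFC3339, t); err == nil {
			return parsed.UnixMilli()
		}
	case float64:
		switch {
		case t <= 0:
			// Not a usable epoch: fall through to the nowMs fallback below.
		case t < 1e11: // assumed cutoff: seconds
			return int64(t * 1000)
		case t < 1e14: // assumed cutoff: milliseconds
			return int64(t)
		case t < 1e17: // assumed cutoff: microseconds
			return int64(t / 1e3)
		default: // nanoseconds
			return int64(t / 1e6)
		}
	}
	return nowMs
}

func main() {
	now := time.Now().UnixMilli()
	inputs := []any{"2026-06-13T09:05:21Z", 1.7e9, 1.7e12, 1.7e18, nil}
	for _, in := range inputs {
		fmt.Println(normalizeTimestampMs(in, now))
	}
}
```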
LocalAI [bot]
53cbb578a9 chore(model gallery): 🤖 add 1 new models via gallery agent (#10304)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-13 11:03:03 +02:00
LocalAI [bot]
99c8205740 fix(react-ui): stop Talk pipeline overflow and center collapsed-rail avatar (#10305)
Two small visual fixes in the React UI:

- Talk page pipeline summary: the four-column grid used
  `repeat(4, 1fr)`, which resolves to `minmax(auto, 1fr)` so each track
  refuses to shrink below the min-content width of its `nowrap` model
  name. Long names (e.g. a verbose GGUF LLM id) blew the grid out past
  the container despite the per-cell ellipsis styling. Switching to
  `minmax(0, 1fr)` lets the tracks shrink and the ellipsis take effect.

- Sidebar user avatar: the desktop collapsed look centers the avatar via
  `.sidebar.collapsed .sidebar-user{-link}` rules, but the tablet
  icon-rail (640-1023px) collapses visually through `.sidebar:not(.open)`
  without necessarily carrying the `.collapsed` class, so the avatar kept
  its left-aligned negative margins and looked misaligned. Mirror the
  centering rules under `.sidebar:not(.open)`.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 11:02:48 +02:00
LocalAI [bot]
d7162b9f89 ci(darwin): build the ds4 backend for darwin/arm64 (metal) (#10303)
The gallery has metal-ds4 / metal-ds4-development entries, and the build
recipe exists (make backends/ds4-darwin, special-cased in
backend_build_darwin.yml), but ds4 was never listed in the darwin matrix,
so no metal-darwin-arm64-ds4 image was ever published and the entries
dangled.

- Add ds4 to the darwin matrix (includeDarwin), mirroring the llama-cpp
  form (the reusable workflow builds it via 'make backends/ds4-darwin').
- Fix inferBackendPathDarwin in scripts/changed-backends.js to map ds4 to
  backend/cpp/ds4/ (like llama-cpp): ds4 is C++ but the matrix entry carries
  lang=go, so without this its darwin build would only ever run on a release
  (FORCE_ALL), never incrementally when backend/cpp/ds4 changes.

sherpa-onnx and speaker-recognition are already in the darwin matrix on
master and are not changed here.

Assisted-by: claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 11:02:32 +02:00
LocalAI [bot]
3351b62c91 chore(model gallery): 🤖 add 1 new models via gallery agent (#10302)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-13 10:59:23 +02:00
LocalAI [bot]
0eca930b8d fix(gallery): correct meta-backend definitions for platform auto-selection (#10299)
fix(gallery): correct meta-backend definitions in backend/index.yaml

Backends that ship per-platform images must be meta backends (a capabilities
map and NO uri) so the right variant is auto-selected per platform - mirroring
llama-cpp/whisper. Several entries were misdefined; fixed here:

- Concrete base + metal sibling (could not select the Apple Silicon variant):
  silero-vad, piper, kitten-tts, local-store (+ their -development). Converted
  each anchor to a meta and added the cpu-<name> concrete.
- mlx family (mlx, mlx-vlm, mlx-audio, mlx-distributed + -development): anchor
  had both a uri AND a capabilities map, so IsMeta() was false and the map was
  ignored (always resolved to the metal-darwin image); the metal-<name> target
  did not exist. Removed the uri and added the missing metal-<name> concretes.
- Dangling capability targets: diffusers/kokoro nvidia-l4t-cuda-12 repointed to
  the existing nvidia-l4t-<name> concrete; coqui nvidia-cuda-13 key removed
  (no cuda13-coqui image).
- locate-anything: the meta existed but its concrete entries were never added,
  so it was un-installable on every platform. Added the full concrete set plus
  the locate-anything-development meta, mirroring rfdetr-cpp. Image tags grounded
  against the published quay.io tags.
- trl (cuda12/13): repointed the stale 'cublas-cuda12/13-trl' image tags to the
  actually-published 'gpu-nvidia-cuda-12/13-trl' tags (fixes #9236).

Assisted-by: claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 10:43:14 +02:00
LocalAI [bot]
81ab62e874 chore(model gallery): 🤖 add 1 new models via gallery agent (#10298)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-13 09:58:11 +02:00
LocalAI [bot]
0413fc03f8 fix(gallery): make opus a meta backend for platform auto-selection (#9813) (#10291)
fix(gallery): make opus a meta backend so the platform variant is auto-selected (#9813)

The realtime/WebRTC path loads the "opus" codec backend by name, but on
macOS arm64 only "metal-opus" is installable, so Load("opus") failed with
"opus backend not available".

The root cause: unlike llama-cpp and whisper, the opus entry was a concrete
CPU backend (it carried a uri and no capabilities map) rather than a meta
backend, so nothing mapped "opus" to the platform-appropriate variant.

Restructure opus to mirror llama-cpp/whisper: "opus" becomes a meta backend
with a capabilities map (default -> cpu-opus, metal -> metal-opus) and no
uri; the CPU image moves to a new "cpu-opus" concrete (and its dev variant
to "cpu-opus-development"). Installing "opus" now resolves to metal-opus on
Apple Silicon and cpu-opus elsewhere, and Load("opus") works on every
platform via the meta pointer - so the realtime endpoint needs no special
casing. This reverts the realtime_webrtc.go resolution helper from the
earlier approach in favor of the gallery-level fix.

Assisted-by: claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 09:51:02 +02:00
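Illustration for 0413fc03f8 (and the related 0eca930b8d): a meta backend carries a capabilities map and no uri, and resolves to a concrete variant per platform. A minimal Go sketch of that lookup follows. `IsMeta` is named in the commit messages; the struct layout, `resolveBackend`, the capability strings passed in, and the placeholder URIs are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// backendEntry is a hypothetical, trimmed-down gallery entry.
type backendEntry struct {
	URI          string
	Capabilities map[string]string
}

// IsMeta reports whether the entry is a pure pointer: a capabilities map
// and no uri. An entry with both is treated as concrete, so its map is
// ignored, which is the misdefinition the commits describe.
func (b backendEntry) IsMeta() bool {
	return b.URI == "" && len(b.Capabilities) > 0
}

// resolveBackend maps a requested name to the concrete backend for the
// given capability, falling back to the "default" key.
func resolveBackend(index map[string]backendEntry, name, capability string) (string, error) {
	entry, ok := index[name]
	if !ok {
		return "", errors.New("backend not found: " + name)
	}
	if !entry.IsMeta() {
		return name, nil
	}
	target, ok := entry.Capabilities[capability]
	if !ok {
		target, ok = entry.Capabilities["default"]
	}
	if !ok {
		return "", errors.New("no variant of " + name + " for " + capability)
	}
	if _, exists := index[target]; !exists {
		return "", errors.New("dangling capability target: " + target)
	}
	return target, nil
}

func main() {
	index := map[string]backendEntry{
		"opus":       {Capabilities: map[string]string{"default": "cpu-opus", "metal": "metal-opus"}},
		"cpu-opus":   {URI: "placeholder://cpu-opus"},
		"metal-opus": {URI: "placeholder://metal-opus"},
	}
	for _, capability := range []string{"metal", "nvidia"} {
		resolved, err := resolveBackend(index, "opus", capability)
		fmt.Println(capability, "->", resolved, err)
	}
}
```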
LocalAI [bot]
7088572f75 fix(neutts): pin torchaudio to match torch (fixes undefined symbol) (#9798) (#10292)
fix(neutts): pin torchaudio to match torch to avoid ABI mismatch (#9798)

neucodec pulls torchaudio transitively but it was unpinned, so an
incompatible torchaudio could be resolved against the pinned torch==2.8.0,
producing the 'undefined symbol: torch_library_impl' load failure. Pin
torchaudio==2.8.0 alongside torch in the cpu and cublas12 requirements.

Assisted-by: claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 09:28:41 +02:00
LocalAI [bot]
c1e8440f5b fix(deps): bump cogito to fix MCP image-result panic (#10101) (#10294)
fix(mcp): bump cogito to handle non-text tool result content

Fixes #10101: the API panicked with "interface conversion: mcp.Content
is *mcp.ImageContent, not *mcp.TextContent" when an MCP tool returned an
image. Upstream cogito PR #50 replaced the unchecked TextContent
assertion in the tool-result loop with a contentToString type-switch
that handles image (and other non-text) content blocks gracefully.

Bump github.com/mudler/cogito to v0.10.1-0.20260609212329-bf4010d31047,
which includes the fix.


Assisted-by: claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 09:28:25 +02:00
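Illustration for c1e8440f5b: the panic came from an unchecked type assertion on a tool result, and the upstream fix replaced it with a type switch. The Go sketch below shows the pattern with local stand-in types. It does not use the real `mcp` package, and the body of `contentToString`, including the placeholder strings it returns, is an assumption; only the function name and the text/image distinction come from the commit message.

```go
package main

import "fmt"

// Content, TextContent and ImageContent are local stand-ins for the MCP
// content types mentioned in the commit message.
type Content interface{ isContent() }

type TextContent struct{ Text string }

type ImageContent struct {
	MIMEType string
	Data     string // base64 payload
}

func (*TextContent) isContent()  {}
func (*ImageContent) isContent() {}

// contentToString never panics: a type switch handles each known block
// and the default branch covers anything else, including nil.
func contentToString(c Content) string {
	switch v := c.(type) {
	case *TextContent:
		return v.Text
	case *ImageContent:
		return fmt.Sprintf("[image content: %s, %d base64 bytes]", v.MIMEType, len(v.Data))
	default:
		return fmt.Sprintf("[unsupported content type %T]", c)
	}
}

func main() {
	results := []Content{
		&TextContent{Text: "42 files found"},
		&ImageContent{MIMEType: "image/png", Data: "AAAA"},
	}
	for _, r := range results {
		// The unchecked form r.(*TextContent) would panic on the image.
		fmt.Println(contentToString(r))
	}
}
```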
LocalAI [bot]
8f0059123b feat(gallery): add 60 piper TTS voices across 42 languages (Phase 2) (#10296)
Extends the piper voice set with a couple of voices per language for 42 more
languages (Arabic, Bulgarian, Catalan, Czech, Welsh, Danish, Greek, Spanish,
Basque, Persian, Finnish, French, Hindi, Hungarian, Indonesian, Icelandic,
Georgian, Kazakh, Luxembourgish, Latvian, Malayalam, Nepali, Dutch, Norwegian,
Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Albanian, Swedish,
Swahili, Telugu, Turkish, Ukrainian, Urdu, Vietnamese, Chinese, ...), run
through the crispasr backend's backend:piper engine and hosted at
LocalAI-Community/piper-voices-GGUF.

All converted from rhasspy/piper-voices with CrispASR's convert-piper-to-gguf.py
and screened end-to-end on the pinned engine. Only single-speaker low/medium
voices are included; high-quality decoders and multi-speaker models segfault and
are excluded (e.g. zh_CN-chaowen dropped, huayan kept).


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 09:19:21 +02:00
LocalAI [bot]
a906438a69 fix(config): backend-gate the top_k=40 sampler default (#6632) (#10285)
fix(config): gate top_k=40 default on backend family (#6632)

SetDefaults injected top_k=40 (llama.cpp's sampling default) for every
model config regardless of backend. That value is wrong for backends
whose native default differs: mlx_lm's intended default is top_k=0
(disabled) and mlx does not remap 0->40, so a client that omits top_k
silently got 40 shipped to mlx, changing sampling. The mlx backend's own
getattr(request,'TopK',0) fallback is dead because proto3 int32 is always
present.

Gate the injection on backend family via UsesLlamaSamplerDefaults: keep
top_k=40 for the llama.cpp family and for the empty/auto backend (the GGUF
auto-detect path resolves to llama.cpp, so existing behavior is preserved),
but leave TopK nil for the known non-llama backends (mlx, mlx-vlm,
mlx-distributed). gRPCPredictOpts now sends 0 when TopK is nil, which is
the value mlx actually wants.

Only TopK is gated - the confirmed bug. The sibling sampler defaults
(top_p, temperature, min_p) are left global to avoid widening scope and
introducing nil-deref risk; revisit per-backend if needed.

Assisted-by: claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 09:04:25 +02:00
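Illustration for a906438a69: the default `top_k=40` is injected only for the llama.cpp family and the empty/auto backend, and a nil `TopK` is sent as 0. A minimal Go sketch follows. `UsesLlamaSamplerDefaults` is named in the commit message (lower-cased here); `defaultTopK`, `predictTopK`, and the pointer-based representation are assumptions for illustration.

```go
package main

import "fmt"

// usesLlamaSamplerDefaults reports whether llama.cpp's sampler defaults
// apply. The three excluded names are the ones listed in the commit.
func usesLlamaSamplerDefaults(backend string) bool {
	switch backend {
	case "mlx", "mlx-vlm", "mlx-distributed":
		return false
	default:
		return true // includes "" (auto-detect) and the llama.cpp family
	}
}

// defaultTopK fills in 40 only when the client omitted top_k and the
// backend follows llama.cpp defaults. An explicit value is never changed.
func defaultTopK(backend string, topK *int) *int {
	if topK != nil || !usesLlamaSamplerDefaults(backend) {
		return topK
	}
	v := 40
	return &v
}

// predictTopK converts the optional setting to the wire value: nil
// becomes 0, which the commit says is what mlx expects for "disabled".
func predictTopK(topK *int) int32 {
	if topK == nil {
		return 0
	}
	return int32(*topK)
}

func main() {
	for _, backend := range []string{"", "llama-cpp", "mlx"} {
		fmt.Printf("%q -> top_k=%d\n", backend, predictTopK(defaultTopK(backend, nil)))
	}
}
```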
LocalAI [bot]
d28a5b6da1 chore: ⬆️ Update mudler/locate-anything.cpp to 92c1682da792c1e8a5dec91acc2be4b02c742ded (#10282)
⬆️ Update mudler/locate-anything.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-13 09:01:17 +02:00
LocalAI [bot]
edeacf22c4 fix(realtime): keep transcription model on a language-only session.update (#10295)
A transcription session.update that carries only a language (no model) —
e.g. a client forcing the STT input language — has an empty
Transcription.Model. updateSession unconditionally copied that into
session.ModelConfig.Pipeline.Transcription, blanking the pipeline's
configured transcription backend. The next utterance then transcribed
against an empty model and the backend RPC failed with "unimplemented"
(surfaced to the client as transcription_failed), so transcription
silently stopped whenever a language was selected.

Only adopt the incoming transcription model when it is non-empty, and
preserve the existing model otherwise (mirroring updateTransSession).

Signed-off-by: mudler <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 01:01:36 +02:00
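Illustration for edeacf22c4: the fix adopts the incoming transcription model only when it is non-empty. A minimal Go sketch of that merge rule, with a hypothetical `transcriptionConfig` type and `mergeTranscription` helper; handling the language field the same way is an assumption, not something the commit states.

```go
package main

import "fmt"

// transcriptionConfig is a hypothetical, trimmed-down view of the
// transcription settings carried by a session.update.
type transcriptionConfig struct {
	Model    string
	Language string
}

// mergeTranscription applies an update on top of the current settings.
// Empty fields in the update mean "not specified" and leave the current
// value in place, so a language-only update cannot blank the model.
func mergeTranscription(current, incoming transcriptionConfig) transcriptionConfig {
	if incoming.Model != "" {
		current.Model = incoming.Model
	}
	if incoming.Language != "" {
		current.Language = incoming.Language
	}
	return current
}

func main() {
	session := transcriptionConfig{Model: "configured-stt-model"}
	// A client forcing the input language without naming a model.
	session = mergeTranscription(session, transcriptionConfig{Language: "it"})
	fmt.Printf("%+v\n", session)
}
```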
Aniruddh Jha
51f4f67c47 fix(agents): emit chat event timestamps in milliseconds (#9867) (#10243)
Agent chat replies rendered a broken timestamp in the web UI
("Invalid Timestamp" / "12:00 AM", identical for every reply) because
the SSE timestamp unit was inconsistent across producers.

EventBridge.PublishEvent emitted Unix nanoseconds while the local
dispatcher (dispatcher.go) already emitted Unix milliseconds, and the
React UI fed the value straight into `new Date(ts)` after dividing by
1e6. Nanosecond values (~1.7e18) also exceed JS's safe-integer range (~9.0e15).

Standardize on Unix milliseconds: switch PublishEvent to UnixMilli and
drop the /1e6 conversion in AgentChat.jsx so both SSE paths agree and
match the React UI's expectation. Add a regression test asserting the
published timestamp is in milliseconds.
2026-06-12 23:18:44 +02:00
LocalAI [bot]
cf71e291b4 fix(darwin): fix vibevoice-cpp build linkage + fail-safe go backend packaging (#10276)
* fix(darwin): never package a go backend build tree as a working image

The darwin/arm64 vibevoice-cpp image shipped the source tree with a
half-built CMake directory (build-libgovibevoicecpp-fallback.so/) and no
backend binary, so the backend could never start: run.sh exec'd a
vibevoice-cpp binary that was not in the package and LocalAI timed out
waiting for the gRPC service.

Two durable, backend-agnostic defenses:

- backend/go/vibevoice-cpp/Makefile: mirror whisper's cleanup discipline so a
  partial CMake tree cannot survive into packaging. Run `make purge` before
  each variant build and `rm -rfv build*` after. The old recipe only removed
  its build dir after a successful `mv`, so a failed build left the half-built
  tree behind.

- scripts/build/golang-darwin.sh: before creating the OCI image, remove any
  stray build-* directory and assert that the binary run.sh launches actually
  exists. A build that produced no binary now fails the job loudly instead of
  publishing a source tree as a working backend. The binary name is derived
  from run.sh's `exec $CURDIR/<binary>` line (parakeet-cpp launches
  parakeet-cpp-grpc, so it is not always ${BACKEND}) with a ${BACKEND}
  fallback.

The underlying native build failure that left vibevoice-cpp half-built still
needs to be reproduced and fixed on Apple Silicon; this change ensures such a
failure can never again be published as a working image.

Refs #10267

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(vibevoice-cpp): build libvibevoice.a on darwin (link target, not path)

The darwin build failed with:

    No rule to make target 'vibevoice/libvibevoice.a', needed by
    'libgovibevoicecpp.so'.  Stop.

The upstream vibevoice project is added with add_subdirectory(... EXCLUDE_FROM_ALL),
so its `vibevoice` static-library target is only built when something links it
as a target. The Apple branch linked only `$<TARGET_FILE:vibevoice>` - a bare
archive path with no target reference - so CMake never emitted a rule to build
libvibevoice.a, while the Linux branch worked because it passes the `vibevoice`
target name inside the --whole-archive flags.

Link the `vibevoice` target on Apple (establishing the build dependency) and
apply -force_load as a separate link option to keep whole-archive semantics so
purego can dlsym the vv_capi_* symbols.

Refs #10267

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-12 23:13:50 +02:00
LocalAI [bot]
a7a7bd646b fix(mlx): route vision-language models to the mlx-vlm backend (#10274)
Vision-language checkpoints such as mlx-community/gemma-4-E4B-it-qat-4bit
declare the "image-text-to-text" pipeline tag on HuggingFace. The mlx
importer hardcoded backend "mlx" for every mlx-community model, so these
VLMs were served by the text-only mlx-lm backend whose tokenizer does not
carry the processor chat template. The template was never applied and the
model produced degenerate, looping output that echoed the prompt.

Detect the "image-text-to-text" pipeline tag in the importer and route those
models to mlx-vlm, which applies the processor-aware chat template. An
explicit backend preference still wins.

As a defensive backstop, the mlx backend now warns loudly when the loaded
model has no chat template, so a misrouted VLM surfaces the problem instead
of silently looping.
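
A minimal sketch of the routing rule described above. The function name
backendFor and its two string parameters are hypothetical; the importer's
real signature is not shown in this message.

```go
package main

import "fmt"

// backendFor picks a backend for an mlx-community model.
// Hypothetical helper: an explicit preference wins, otherwise the
// "image-text-to-text" pipeline tag selects mlx-vlm, and everything
// else stays on the text-only mlx backend.
func backendFor(pipelineTag, preferred string) string {
	if preferred != "" {
		return preferred
	}
	if pipelineTag == "image-text-to-text" {
		return "mlx-vlm"
	}
	return "mlx"
}

func main() {
	fmt.Println(backendFor("image-text-to-text", ""))
	fmt.Println(backendFor("text-generation", ""))
	fmt.Println(backendFor("image-text-to-text", "mlx"))
}
```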

Fixes #10269


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-12 23:12:42 +02:00
LocalAI [bot]
cec93d2e00 docs: ⬆️ update docs version mudler/LocalAI (#10279)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-12 23:12:30 +02:00
LocalAI [bot]
722bdb87e9 chore: ⬆️ Update mudler/parakeet.cpp to b8012f11e5269126eddb7f4fd02f891a2ccc29b0 (#10281)
* ⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(parakeet-cpp): close streaming segments on <EOB> after ABI v5 eou/eob split

parakeet.cpp ABI v5 (the pin this PR bumps to) splits the streaming JSON
"eou" flag: in v4 "eou":1 fired for either <EOU> (end of utterance) or
<EOB> (backchannel); in v5 "eou" means <EOU> only, with a new separate
"eob" field for the backchannel token.

The streamSegmenter closed a segment on "eou" alone, so after the bump a
backchannel token would silently stop ending a segment and merge into the
next utterance. Read the new "eob" field and flush on either signal to
preserve the v4 segmentation boundaries. The flat stream_feed eou_out path
is unaffected: its mask is still non-zero for either event.
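
A minimal sketch of the flush rule, assuming the streaming JSON carries
integer "eou" and "eob" fields as described above. The streamEvent type and
shouldFlush helper are hypothetical names, not the backend's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// streamEvent is a hypothetical view of one streaming JSON record.
// Only the two flags discussed above are modelled; a record without
// "eob" (ABI v4 style) decodes that field to 0.
type streamEvent struct {
	EOU int `json:"eou"`
	EOB int `json:"eob"`
}

// shouldFlush reports whether the current segment should be closed:
// on <EOU> or on <EOB>, which preserves the v4 boundaries.
func shouldFlush(raw []byte) (bool, error) {
	var ev streamEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return false, err
	}
	return ev.EOU != 0 || ev.EOB != 0, nil
}

func main() {
	for _, rec := range []string{`{"eou":1}`, `{"eou":0,"eob":1}`, `{"eou":0,"eob":0}`} {
		flush, err := shouldFlush([]byte(rec))
		fmt.Println(rec, flush, err)
	}
}
```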

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-12 23:12:04 +02:00
LocalAI [bot]
50dea8c983 feat(crispasr): bundle espeak-ng and add piper TTS voices to the gallery (#10283)
CrispASR's piper backend phonemizes non-English text via espeak-ng (dlopen,
the MIT-clean path; English uses a built-in G2P). The FROM scratch crispasr
image shipped none of it, so non-English piper voices loaded but failed
synthesis with "phonemization failed". Bundle the espeak-ng runtime so they
work:

- Dockerfile.golang: install espeak-ng-data + libespeak-ng1 and its libpcaudio0
  / libsonic0 deps in the crispasr builder (espeak's dlopen fails without the
  latter two).
- package.sh: copy libespeak-ng.so.1, libpcaudio.so.0, libsonic.so.0 into
  package/lib/ and the espeak-ng-data dir into the package root.
- run.sh: export CRISPASR_ESPEAK_DATA_PATH so the bundled data is found.

Add 9 single-speaker piper voices (de/en/it, incl. Italian paola + riccardo) to
the gallery, run through backend:piper, hosted at
LocalAI-Community/piper-voices-GGUF (converted from rhasspy/piper-voices with
CrispASR's convert-piper-to-gguf.py). Only single-speaker low/medium voices are
included; the engine does not yet support multi-speaker or high-quality piper
decoders.

All 9 verified end-to-end: each synthesizes a WAV at the model's native sample
rate using only the image-bundled espeak payload.


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-12 23:10:30 +02:00
LocalAI [bot]
46ba70632b fix(crispasr): write piper TTS WAV at the model's native sample rate (#10277)
CrispASR's piper backend returns PCM at the voice's native rate (from the GGUF
piper.sample_rate key: 16 kHz for x_low/low, 22.05 kHz for medium/high) and does
not resample, but the Go WAV encoder hardcoded 24000 Hz. Every piper voice was
therefore written with a wrong header and played back at the wrong pitch/speed.

Read piper.sample_rate from the model's GGUF metadata at Load via the vendored
gguf-parser-go and use it for the WAV header, falling back to the 24 kHz default
for the other CrispASR TTS engines (vibevoice/orpheus/chatterbox/qwen3-tts) that
emit 24 kHz and carry no such key.

Adds unit specs (minimal crafted GGUFs + WAV-header decode) and an env-gated
end-to-end spec (CRISPASR_PIPER_MODEL_PATH). Verified e2e: en_GB-cori-medium
synthesizes a 22050 Hz WAV through backend:piper.


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-12 23:10:17 +02:00
LocalAI [bot]
60facc7252 fix(darwin): publish sherpa-onnx and speaker-recognition images for darwin/arm64 (#10275)
Neither the sherpa-onnx nor the speaker-recognition backend had a
darwin/arm64 image, so `local-ai backends install` failed with "no child
with platform darwin/arm64" on macOS. This left /v1/audio/diarization (the
sherpa-onnx path) and /v1/voice/embed without any usable backend on Apple
Silicon.

Both backends build on darwin/arm64:
- sherpa-onnx (Go) already fetches the onnxruntime osx-arm64 runtime in its
  Makefile; it only needed a darwin matrix entry (build-type metal, lang go,
  like whisper and silero-vad).
- speaker-recognition (Python) needed a requirements-mps.txt so the mps build
  installs plain onnxruntime (which ships a macOS arm64 wheel) instead of the
  onnxruntime-gpu pulled by its base requirements (which does not).

Add both to the includeDarwin build matrix, wire the metal capability and
metal image aliases into the gallery, and add the speaker-recognition
requirements-mps.txt.

Fixes #10268


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-12 22:32:42 +02:00
LocalAI [bot]
8c8204d3c4 feat(parakeet-cpp): enable GGML_CUDA_GRAPHS in the cublas build (#10273)
ggml leaves GGML_CUDA_GRAPHS off by default. Passing -DGGML_CUDA_GRAPHS=ON
for cublas builds lets the CUDA backend capture and replay the compute
graph for a small free speedup (about 1% measured on a GB10, never
negative). It is not gated by parakeet.cpp's CMake options, so it passes
straight through to ggml.

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-12 18:47:36 +02:00
LocalAI [bot]
4ce0f6102a chore(model gallery): 🤖 add 1 new model via gallery agent (#10270)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-12 16:21:35 +02:00
Richard Palethorpe
085fc53bbc fix(router): production-ready request router + auto-size batch for embedding/rerank (#10104)
* fix(router): score classifier production-readiness

Conversation trimming runs through the classifier model's chat template
and trims by exact token count, sized to the model's n_batch which is
now scaled to context so long probes can't crash the backend. Missing
chat_message templates are a hard error at router build time. Router-
facing factories (Embedder/Scorer/Reranker/TokenCounter) re-resolve
ModelConfig per call so a model installed post-startup doesn't bind a
stub Backend="" config and silently fall into the loader's auto-
iterate path.

New 'vector_store' backend trace recorded inside localVectorStore on
every Search/Insert — including the backend-load-failure path that
previously vanished into an xlog.Warn — with outcome tagging
(hit/miss/empty_store/backend_load_error/find_error/insert_error/ok).
Companion cleanup drops misleading similarity:0 and input_tokens_count:0
from non-hit and text-mode traces.

Gallery local-store-development aliases to 'local-store' so the master
image satisfies pkg/model.LocalStoreBackend lookups from the embedding
cache.

Misc: llama-cpp TokenizeString reads the correct 'prompt' JSON key
(the original bug); ModelTokenize nil-guard; non-fatal mitm proxy
startup; PII 'route_local' renamed to 'allow' with docs/UI in sync;
model-editor footer no longer eats the edit area on small screens;
several config-editor template/dropdown/section fixes.

Tests: e2e router specs (casual/code-hint + long-conversation trim),
vector_store trace specs, lazy-factory specs, gallery dev-alias
resolution, Playwright trace badge + scroll regression.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(backend): auto-size batch to context for embedding and rerank models

Embedding and rerank models pool over the whole input in a single physical batch (n_ubatch). With batch left at the 512 default, the backend rejects longer inputs with "input is too large to process", silently capping a large-context embedder (e.g. 8k/32k) at 512 tokens. Size n_batch to the context for these single-pass use cases, mirroring the existing FLAG_SCORE behaviour; an explicit batch: setting still wins.

Extracts EffectiveContextSize/EffectiveBatchSize from grpcModelOpts so the effective decode window has one home for other callers to reuse.

Adds an e2e-aio regression test that embeds a >512-token input. The AIO embedding model is switched to nomic-embed-text-v1.5 (2048 context) because the previous granite model was capped at 512 tokens and could not exercise the larger batch.
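
A minimal sketch of the sizing rule, under stated assumptions: effectiveBatch
and its parameters are hypothetical (the extracted EffectiveBatchSize may
differ), and a value of 0 stands for "not set".

```go
package main

import "fmt"

// defaultBatch is the 512 default mentioned above.
const defaultBatch = 512

// effectiveBatch is a hypothetical illustration of the rule:
//   - an explicit batch always wins;
//   - single-pass use cases (embedding, rerank, score) get the context size;
//   - everything else keeps the default.
func effectiveBatch(explicitBatch, contextSize int, singlePass bool) int {
	if explicitBatch > 0 {
		return explicitBatch
	}
	if singlePass && contextSize > 0 {
		return contextSize
	}
	return defaultBatch
}

func main() {
	fmt.Println(effectiveBatch(0, 8192, true))   // embedder with 8k context
	fmt.Println(effectiveBatch(0, 8192, false))  // ordinary chat model
	fmt.Println(effectiveBatch(256, 8192, true)) // explicit batch set
}
```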

Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(gallery): raise arch-router scoring output cap via parallel:64

Scoring decodes the whole prompt+candidate in a single llama_decode and
reads one logit row per candidate token. The vendored llama.cpp server
caps causal output rows at n_parallel, so the default of 1 aborts with
GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) on multi-token route
labels. Set options: [parallel:64] on both arch-router quant entries to
lift the cap; kv_unified (the grpc-server default) keeps the full context
per sequence, so this does not split the KV cache.

Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-12 16:21:15 +02:00
LocalAI [bot]
56cc4f63fc feat(backend): locate-anything-cpp (open-vocabulary object detection via ggml) (#10264)
* feat(backend): add locate-anything-cpp backend (open-vocab detection via la_capi)

A Go/purego backend wrapping locate-anything.cpp's la_capi C ABI, implementing
the gRPC Detect RPC: image + open-vocabulary text prompt -> labeled boxes.
Mirrors backend/go/rfdetr-cpp; static-links ggml into a per-CPU-variant .so.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(backend): register locate-anything-cpp in build matrix

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): locate-anything gallery entry + model importer

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(backend): locate-anything-cpp Load+Detect wire test

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add locate-anything-3b model to the gallery index

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(backend): register locate-anything.cpp in bump_deps auto-bump

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: mudler <mudler@localai.io>

* ci(test): e2e smoke for locate-anything-cpp in test-extra (loads the 3B + image, runs Detect)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: mudler <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: mudler <mudler@localai.io>
Co-authored-by: mudler <mudler@localai.io>
2026-06-12 14:59:07 +02:00
LocalAI [bot]
a53f34e78f chore: ⬆️ Update ggml-org/llama.cpp to 4c6595503fe45d5a39f88d194e270f64c7424677 (#10261)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-12 14:57:52 +02:00
Dedy F. Setyawan
1cea96f09f feat(react-ui): add Indonesian language support (#10266)
Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com>
2026-06-12 10:08:58 +02:00
LocalAI [bot]
006a9d38c7 chore: ⬆️ Update mudler/parakeet.cpp to 9db92be63179a27201d3b88d5d40c545b2ac48ae (#10263)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-12 09:18:21 +02:00
LocalAI [bot]
892ce951ce chore: ⬆️ Update antirez/ds4 to d881f2a05e8ff6bec001315a36b794b4aa310173 (#10262)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-12 09:18:07 +02:00
LocalAI [bot]
7cda221d36 docs: ⬆️ update docs version mudler/LocalAI (#10259)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-12 09:17:49 +02:00
LocalAI [bot]
9a88eb81e7 chore: ⬆️ Update CrispStrobe/CrispASR to d745bda4386ae0f9d1d2f23fff8ec95d76428221 (#10260)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-12 09:17:34 +02:00
pos-ei-don
58cdc050e9 fix(cuda): install cuda-nvrtc-dev alongside the other CUDA dev packages (#10257)
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
2026-06-11 23:57:00 +02:00
pos-ei-don
b962f4a192 fix(vllm): parse tool_call function arguments before applying the chat template (#10256)
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
2026-06-11 23:55:38 +02:00
LocalAI [bot]
b6fcb3e1db chore: ⬆️ Update CrispStrobe/CrispASR to 4b27392ffd0991a857594652cbb8b57e585bcd7b (#10241)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-11 18:33:58 +02:00
LocalAI [bot]
ff09683d84 chore: ⬆️ Update ggml-org/llama.cpp to ac4cddeb0dbd778f650bf568f6f08344a06abe3a (#10239)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-11 18:33:38 +02:00
LocalAI [bot]
f618636c71 docs: fix broken relref to realtime page (#10255)
Hugo fails the gh-pages build with REF_NOT_FOUND because the relref
in model-configuration.md uses the 'docs/' prefix; refs are resolved
relative to content/, so the page lives at 'features/openai-realtime'
(as the other ref in the same file already uses).


Assisted-by: Claude Code:claude-fable-5

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-11 18:32:50 +02:00
LocalAI [bot]
892fc49949 feat(realtime): stream the LLM / TTS / transcription pipeline stages (#10176)
* feat(realtime): pipeline streaming + disable_thinking config

Add a nested pipeline.streaming.{llm,tts,transcription} block plus
pipeline.disable_thinking, with StreamLLM/StreamTTS/StreamTranscription/
ThinkingDisabled helpers. Pointer-bools so unset keeps the unary path;
existing configs are unaffected. Wiring into the realtime handler follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): sentence segmenter for streamed LLM->TTS pipelining

streamSegmenter accumulates streamed LLM tokens and emits complete
sentence/clause segments (terminator+whitespace, or newline) so TTS can
synthesize each segment as it completes instead of waiting for the whole
reply. Pure helper; the streaming handler wiring consumes it next.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): streaming TTS/transcription methods on Model interface

Add TTSStream and TranscribeStream to the realtime Model interface and
implement them on wrappedModel (delegating to backend.ModelTTSStream /
ModelTranscriptionStream) and transcriptOnlyModel. ttsStream adapts the
backend's WAV-framed stream (44-byte header carrying the sample rate, then
PCM) into raw PCM + sample rate for the realtime transports. Handler wiring
that consumes these (flag-gated) follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): emitSpeech with flag-gated streaming TTS

emitSpeech synthesizes a piece of text and forwards audio to the client,
streaming one output_audio.delta per backend PCM chunk when the pipeline
sets streaming.tts, or one delta for the whole utterance otherwise. WebRTC
gets raw PCM (it resamples internally); WebSocket gets base64 PCM at the
session rate. It emits no transcript/audio-done events so a streamed reply
can be split into multiple spoken segments sharing one response.

Adds fakeModel/fakeTransport test doubles for the realtime Model/Transport
interfaces, driving streaming assertions deterministically.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): route response audio through emitSpeech (streaming TTS)

Replace the inline unary TTS block in the response handler with emitSpeech,
which streams a response.output_audio.delta per backend PCM chunk when
pipeline.streaming.tts is set and otherwise preserves the single-delta unary
behaviour. emitSpeech returns the accumulated base64 audio, stored on the
conversation item as before. Transcript and audio-done events stay in the
handler so later per-segment streaming can reuse emitSpeech.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): streaming transcription text deltas

Add emitTranscription and route commitUtterance through it. With
pipeline.streaming.transcription set it streams each transcript fragment as
a conversation.item.input_audio_transcription.delta via TranscribeStream
then a completed event; otherwise it preserves the single completed-event
unary behaviour. Returns the final transcript for response generation.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): pipeline disable_thinking maps to enable_thinking off

applyPipelineThinking forces the LLM's ReasoningConfig.DisableReasoning when
pipeline.disable_thinking is set, which gRPCPredictOpts turns into the
enable_thinking=false backend metadata. Applied at newModel construction on
the per-session LLM config copy, so it doesn't leak to other model users and
needs no realtime-specific request plumbing.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): speechStreamer for token-streamed LLM->TTS

emitSpeech now returns raw PCM (caller base64-encodes) so streamed segments
accumulate correctly. speechStreamer consumes streamed LLM tokens: it strips
reasoning via the streaming ReasoningExtractor, emits a transcript delta per
content fragment, and sentence-pipes content into emitSpeech so each sentence
is synthesized as soon as it's ready. Handler wiring (plain-content turns)
follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): wire streamLLMResponse for token-streamed replies

triggerResponseAtTurn takes a streamed path when pipeline.streaming.llm is
set, the turn has no tools, and audio is requested: streamLLMResponse
announces the assistant item, drives the LLM token callback through a
speechStreamer (reasoning-stripped transcript deltas + sentence-piped TTS),
and emits the terminal events. Tool turns and non-streaming pipelines keep
the existing buffered path unchanged, so this is strictly opt-in.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(realtime): document pipeline streaming + disable_thinking

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(realtime): register pipeline streaming/thinking config fields

TestAllFieldsHaveRegistryEntries (core/config/meta) requires every config
field to have a meta registry entry. The four new pipeline fields
(disable_thinking, streaming.{llm,tts,transcription}) had none, failing
tests-linux/tests-apple. Add toggle entries for them.

Also handle the os.Remove return in realtime_speech_test.go to satisfy
errcheck (golangci-lint).

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(realtime): always strip reasoning from spoken output

disable_thinking maps to ReasoningConfig.DisableReasoning=true on the LLM
config, which the backend reads as enable_thinking=false. But the realtime
handler reads that SAME config to drive reasoning extraction, and there
DisableReasoning=true means "skip stripping". PredictConfig() returns this
LLM config, so both the streamed (speechStreamer) and buffered realtime
paths stopped stripping <think>…</think> exactly when disable_thinking was
on — leaking raw reasoning to the client whenever the model ignored the
enable_thinking hint (e.g. lfm2.5).

Add spokenReasoningConfig() which clears DisableReasoning for extraction
(keeping custom tokens/tag pairs) and route both realtime paths through it.
Spoken output now always strips reasoning, independent of the backend
suppression hint.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(realtime): clean TTS temp path before read (gosec G304)

emitSpeech reads the WAV file the TTS backend wrote. The read moved here
from realtime.go, so code-scanning flagged it as a new G304 alert even
though the path is backend-controlled (a temp file), not user input.
Wrap it in filepath.Clean — a real path normalization that also clears
the alert, keeping with the repo's no-#nosec convention.

Assisted-by: Claude:claude-opus-4-8 gosec, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(realtime): buffer whole message for TTS, drop sentence segmenter

Per review (richiejp): the sentence segmenter pipelined unary TTS by
splitting on ASCII .!?/newline, which does nothing for languages without
those boundaries (CJK/Thai) — there it already degraded to buffering the
whole message anyway.

Replace it with a uniform model: stream the LLM transcript live, buffer the
full message, then synthesize it once. emitSpeech already streams the audio
chunks when the backend implements TTSStream and falls back to a single
unary delta otherwise, so this is real streaming TTS where supported and a
clean whole-message synthesis elsewhere — no per-sentence emulation, no
language assumptions. speechStreamer becomes transcriptStreamer (transcript
deltas only); the whole-message synthesis moves into streamLLMResponse.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): stream tool-call turns via tokenizer-template autoparser

Per review (richiejp): tool-call deltas exist, so streaming should work with
tools too. It does — for models that use their tokenizer template. The C++
autoparser then clears reply.Message and delivers content + tool calls via
ChatDeltas, so the streamed transcript carries only spoken content (no
tool-call JSON leak) and the tool calls are parsed from the final response.

- Drop the len(tools)==0 gate; stream when no tools OR use_tokenizer_template
  (grammar-based function calling still buffers, since its call is emitted as
  JSON in the token stream and would leak into the transcript).
- streamLLMResponse takes tools/toolChoice/toolTurn, reads ChatDelta content
  in the token callback, parses tool calls from the final ChatDeltas, and
  creates the assistant content item lazily so a content-less tool turn emits
  only the tool calls.
- Extract emitToolCallItems from the buffered path so both paths finalize tool
  calls, response.done, and server-side assistant-tool follow-ups identically.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): script-aware clause chunking + streamed-reply fixes

Opt-in pipeline.streaming.clause_chunking splits the streamed LLM reply
into speakable clauses and synthesizes each as soon as it completes,
lowering time-to-first-audio instead of buffering the whole message. The
splitter is script-aware (rivo/uniseg, pure Go): UAX#29 sentence
segmentation handles CJK 。!? with no whitespace, CJK clause
punctuation (,、;:) and Thai/Lao spaces give finer cuts, and a UAX#14
line-break cap bounds an over-long punctuation-less run. Unlike the old
ASCII .!?/newline segmenter (dropped in 076dcdbe) it does not degrade to
whole-message buffering for CJK/Thai; scripts needing a dictionary
(Khmer/Burmese) stay buffered until a space or end-of-message. Clauses
are synthesized synchronously in the token callback (the LLM keeps
generating into the gRPC stream meanwhile), so audio still starts
mid-generation. Off by default — the whole-message path is unchanged.

Also fix the streamed-reply path and the Talk page:

- Don't swallow streamed autoparser content as reasoning: the
  tokenizer-template path already delivers reasoning-free content via
  ChatDeltas, so prefilling the thinking start token re-tagged it as an
  unclosed reasoning block, leaving no spoken reply. Disable the prefill
  on that path; closed tag pairs are still stripped (#9985).

- Generate collision-free realtime IDs (16 random bytes) instead of a
  constant, so per-item bookkeeping (cancel, conversation.item.retrieve)
  works.

- Key the Talk transcript by the server item_id and upsert entries.
  Realtime events arrive over a WebRTC data channel — outside React's
  event system — so React defers the setTranscript updaters while
  synchronous ref writes in handler bodies run first; the old
  index-tracking ref rendered a duplicate assistant bubble on
  completion. Upserts by item_id are idempotent and order-independent.

- Drop the partial assistant bubble on a cancelled response (barge-in):
  the server discards the interrupted item and sends response.done with
  status "cancelled"; mirror that in the UI so the regenerated reply
  isn't rendered as a second assistant message.
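
For the realtime ID fix listed above, a minimal sketch of generating an ID
from 16 random bytes. The newID name, the prefix argument and the hex
encoding are assumptions made for illustration only.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newID returns prefix + "_" + 32 hex characters drawn from 16 random bytes.
// Hypothetical helper: the real ID format is not stated in this message.
func newID(prefix string) (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return prefix + "_" + hex.EncodeToString(b), nil
}

func main() {
	id, err := newID("item")
	fmt.Println(id, err)
}
```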

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Assisted-by: Claude:claude-fable-5 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Richard Palethorpe <io@richiejp.com>
2026-06-11 08:43:12 +01:00
pos-ei-don
228a6dfe79 fix(vllm): restore compatibility with vLLM >= 0.22 (get_tokenizer moved to vllm.tokenizers) (#10252)
fix(vllm): restore compatibility with vLLM >= 0.22 (get_tokenizer moved)

vLLM 0.22 moved get_tokenizer from vllm.transformers_utils.tokenizer
to vllm.tokenizers. Since the backend requirements install vllm
unpinned, freshly built/installed vllm backends currently fail to
start with ModuleNotFoundError: No module named
'vllm.transformers_utils.tokenizer' (surfacing as 'grpc service not
ready' when loading a model).

Use the same try/except version-compat import pattern already used
elsewhere in this file: try the new vllm.tokenizers location first and
fall back to the pre-0.22 path.

Tested on a DGX Spark (GB10, ARM64) with the
cuda13-nvidia-l4t-arm64-vllm backend and vllm 0.22.0: model load, chat
completions and tool calls all work with this patch applied.

Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 09:05:23 +02:00
LocalAI [bot]
51a92b6093 chore: ⬆️ Update antirez/ds4 to 8384adf0f9fa0f3bb342dd925372de778b95b263 (#10242)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-11 00:10:34 +02:00
LocalAI [bot]
b5964d385d docs: ⬆️ update docs version mudler/LocalAI (#10245)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-11 00:10:10 +02:00
LocalAI [bot]
fba8c9c498 fix(distributed): track in-flight for non-LLM inference methods (VAD, diarize, voice, ...) (#10238)
fix(distributed): track in-flight for non-LLM inference methods

InFlightTrackingClient only wrapped a subset of the grpc.Backend
inference methods (Predict, Embeddings, TTS, AudioTranscription, Detect,
Rerank, ...). Methods like VAD were left as embedded passthrough, so
track() never ran for them.

In distributed mode every model is loaded with in_flight=1 as a
reservation; that reservation is only released by the OnFirstComplete
callback, which fires after the first *tracked* inference call completes.
A VAD-only model (e.g. silero-vad) never calls a tracked method, so the
reservation is never released and in-flight stays pinned at 1 forever -
which also blocks the router's idle-eviction logic.

Wrap the remaining unary inference methods (VAD, Diarize, Face*, Voice*,
TokenClassify, Score, AudioEncode, AudioDecode, AudioTransform) with the
same track()/reconcile() pattern. The three bidi-stream constructors
(AudioTransformStream, AudioToAudioStream, Forward) are deliberately left
as passthrough - their inference spans the stream lifetime, not the
constructor call, so track() there would fire onFirstComplete before any
data flows.
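
A minimal sketch of the reservation/track() interplay described above. The
tracker type and its fields are hypothetical stand-ins; the real
InFlightTrackingClient and its reconcile() step are not reproduced.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// tracker is a hypothetical stand-in for the in-flight bookkeeping.
type tracker struct {
	inFlight atomic.Int64
	first    sync.Once
}

// newTracker starts with in_flight=1, the load-time reservation.
func newTracker() *tracker {
	t := &tracker{}
	t.inFlight.Store(1)
	return t
}

// track counts one call and returns a func to run when it completes.
// The first completion also releases the load-time reservation.
func (t *tracker) track() func() {
	t.inFlight.Add(1)
	return func() {
		t.inFlight.Add(-1)
		t.first.Do(func() { t.inFlight.Add(-1) })
	}
}

// VAD shows a unary method wrapped with track(); without the wrapper the
// reservation would never be released for a VAD-only model.
func (t *tracker) VAD(run func() error) error {
	defer t.track()()
	return run()
}

func main() {
	t := newTracker()
	fmt.Println("after load:", t.inFlight.Load())
	_ = t.VAD(func() error { return nil })
	fmt.Println("after first VAD:", t.inFlight.Load())
}
```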

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:29:50 +02:00
LocalAI [bot]
6b2badb837 chore: ⬆️ Update CrispStrobe/CrispASR to c29f6653a516a3001d923944dad8892072cc7334 (#10236)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 16:16:24 +02:00
LocalAI [bot]
8b8506d01a chore: ⬆️ Update ggml-org/llama.cpp to 039e20a2db9e87b2477c76cc04905f3e1acad77f (#10223)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 12:22:03 +02:00
LocalAI [bot]
6910a0bb48 chore: ⬆️ Update antirez/ds4 to 91bafb5acd5a6cf00b1e55ef68bf40ddd207bee7 (#10234)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 12:08:19 +02:00
LocalAI [bot]
cffd03b522 chore: ⬆️ Update ikawrakow/ik_llama.cpp to e6f8112f3ba126eed3ff5b30cdd08085414a7516 (#10233)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 12:07:49 +02:00
LocalAI [bot]
bf448d3794 chore: ⬆️ Update ggml-org/whisper.cpp to df7638d8229a243af8a4b5a8ae557e0d74e0a0ae (#10220)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:16:29 +02:00
LocalAI [bot]
1d4a12f7c0 chore: ⬆️ Update CrispStrobe/CrispASR to 97cad527d247edefc904e6c40c4cf5ee78bed055 (#10221)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:16:17 +02:00
LocalAI [bot]
186d62801d chore: ⬆️ Update leejet/stable-diffusion.cpp to 19bdfe22d255d5b4dff39d449318b9bc5ea2317f (#10222)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:16:06 +02:00
LocalAI [bot]
da4ed05429 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 2768b6251548b78b6610e95edad13f888ad95982 (#10219)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:15:54 +02:00
LocalAI [bot]
ec1eea4f45 chore: ⬆️ Update antirez/ds4 to 512d07cb08f234b704b5a5959aa9e2d4c466eeb0 (#10224)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:15:42 +02:00
LocalAI [bot]
b203b32e57 feat(realtime): make WebRTC ICE candidates configurable (#10231)
The /v1/realtime WebRTC handler created the peer connection with a bare
webrtc.Configuration and no SettingEngine, so pion gathered a host ICE
candidate for every local interface. Under Docker host networking that
includes bridge addresses (docker0/veth, 172.x) a remote browser cannot
route to; the call establishes on a good pair and then drops once ICE
consent freshness checks fail on the unreachable candidates.

Add two opt-in knobs, applied via a pion SettingEngine:
- LOCALAI_WEBRTC_NAT_1TO1_IPS: advertise these IPs as the host candidates
  (e.g. the host LAN IP)
- LOCALAI_WEBRTC_ICE_INTERFACES: restrict ICE gathering to these interfaces

Defaults are unchanged (empty => current all-interface behavior).
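
A rough sketch of how the two knobs might be parsed before being handed to the SettingEngine. The comma-separated format and the helper names `splitList` and `interfaceFilter` are assumptions for illustration; the commit message does not state the value syntax, and the actual wiring into pion is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// splitList parses a comma-separated value, trimming whitespace and
// dropping empty entries. The comma separator is an assumption.
func splitList(raw string) []string {
	var out []string
	for _, part := range strings.Split(raw, ",") {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// interfaceFilter returns a predicate that keeps only the allowed
// interface names. It returns nil for an empty list, which the caller
// can treat as "no restriction" (the unchanged default).
func interfaceFilter(allowed []string) func(string) bool {
	if len(allowed) == 0 {
		return nil
	}
	set := make(map[string]struct{}, len(allowed))
	for _, name := range allowed {
		set[name] = struct{}{}
	}
	return func(name string) bool {
		_, ok := set[name]
		return ok
	}
}

func main() {
	// In a real process these strings would come from os.Getenv.
	ips := splitList("192.168.1.10, ")
	filter := interfaceFilter(splitList("eth0,wlan0"))
	fmt.Println(ips, filter("eth0"), filter("docker0"))
}
```

With both variables empty, `splitList` yields no entries and `interfaceFilter` returns nil, so nothing would be applied and gathering stays on all interfaces.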

Assisted-by: Claude:claude-opus-4-8

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-09 22:28:03 +02:00
Ching
48a8ce98aa fix(cli): handle chat output errors (#10229)
Propagate terminal write errors from the chat prompt and explicitly ignore stream close errors during cleanup.

Update chat tests to assert response writer errors so errcheck passes without hiding failed writes.

Tests:
- go test -count=1 ./core/cli/chat
- go test -count=1 ./core/cli

Assisted-by: Codex:GPT-5

Signed-off-by: Ching Kao <0980124jim@gmail.com>
2026-06-09 19:10:24 +02:00
Ching
8344d1c865 feat(cli): add interactive chat mode (#10226)
Add an opt-in `local-ai chat` command for testing chat models directly from the terminal without manually sending curl requests.

The command connects to a running LocalAI server, lists available models through the existing OpenAI-compatible API, streams chat completions, and supports interactive commands such as `/models`, `/model`, `/clear`, and `/exit`.

Keep `local-ai run` focused on the server lifecycle so the web UI, API clients, and multiple chat terminals can coexist against the same server.

Document the new command and terminal workflow in the README and CLI docs.

Tests:
- go test -count=1 ./core/cli/chat
- go test -count=1 ./core/cli

Assisted-by: Codex:GPT-5

Signed-off-by: Ching Kao <0980124jim@gmail.com>
2026-06-09 14:58:44 +00:00
Pete
d2e6b93369 feat(agents): surface KB source citations in RAG responses (#10228)
* dev knowledge.go structure

Signed-off-by: Pete Chen <petechentw@gmail.com>

* feat(agents): append KB source citations to responses

Render structured KB citations as a Sources block after agent responses, linking each source to the existing raw collection entry endpoint.

Keep long-term memory writes on the original model response so citation blocks do not get stored back into the knowledge base.

Tested with: go test ./core/services/agents

Assisted-by: Codex:gpt-5
Signed-off-by: Pete Chen <petechentw@gmail.com>

* Collect KB citations from tool searches

Signed-off-by: Pete Chen <petechentw@gmail.com>

* fix(agents): append KB sources in local chats

Apply the shared KB citation post-processing to standalone LocalAGI chat responses so the React agent chat receives the same clickable Sources block as the native executor path. Also fix the run target to use the current cmd/local-ai entrypoint.

Assisted-by: Codex:gpt-5
Signed-off-by: Pete Chen <petechentw@gmail.com>

---------

Signed-off-by: Pete Chen <petechentw@gmail.com>
Co-authored-by: shihyunhuang <shihyunhuang88@gmail.com>
Co-authored-by: TLoE419 <tloemizuchizu@gmail.com>
Co-authored-by: Ching Kao <0980124jim@gmail.com>
2026-06-09 16:32:56 +02:00
LocalAI [bot]
e1ec03d33f fix(reasoning): stop prefilled <think> from swallowing tag-less answers (#10225)
* fix(reasoning): stop prefilled <think> from swallowing tag-less answers

When a chat template injects the thinking start token into the prompt (so
DetectThinkingStartToken returns e.g. "<think>"), the model's output begins
inside a reasoning block and carries only the closing tag. The non-jinja
autoparser fallback (peg-native "pure content" mode, issue #9985) prepends the
start token so the extractor can pair it with the model's </think>.

But on a COMPLETE response that contains no closing tag, the model answered
directly with no reasoning at all. Prepending the start token there manufactures
an unclosed block that swallows the entire answer into reasoning, leaving the
OpenAI `content` field empty. This breaks short/direct answers — session names,
JSON summaries, any terse completion where the model skips the think block —
which come back with empty content. Regression surfaced by #9991, which added
the defensive prefill extraction to the complete-response paths.

Add reasoning.ExtractReasoningComplete: it only honors a prefilled start token
when the response actually contains the matching closing tag (proof a reasoning
block exists). Genuine reasoning tags already in the content still extract;
tag-less content stays content. Apply it at every complete-response site
(applyAutoparserOverride, realtime, openresponses). The streaming per-token
extractor is intentionally left on ExtractReasoningWithConfig — mid-stream an
as-yet-unclosed block is legitimate and must surface as reasoning deltas.

Also adds reasoning.ClosingTokenForStart and hoists the default reasoning tag
pairs to package scope so both helpers share one source of truth.
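
The gate described above can be illustrated with a small sketch. This is not the reasoning package's code: `extractComplete` and `tagPairs` are hypothetical names, only one tag pair is listed, and the handling of whitespace is an assumption. It shows the core rule: on a complete response, a prefilled start token is honored only when the matching closing tag is present.

```go
package main

import (
	"fmt"
	"strings"
)

// tagPairs maps a thinking start token to its closing token.
// A single pair is enough for the illustration.
var tagPairs = map[string]string{"<think>": "</think>"}

// extractComplete splits a COMPLETE response into reasoning and content.
// Without the closing tag there is no proof a reasoning block exists,
// so the whole response is returned as content.
func extractComplete(response, startToken string) (reasoning, content string) {
	closing, known := tagPairs[startToken]
	if !known || !strings.Contains(response, closing) {
		return "", response
	}
	before, after, _ := strings.Cut(response, closing)
	// The start token may be absent (prefilled in the prompt) or present.
	before = strings.TrimPrefix(strings.TrimSpace(before), startToken)
	return strings.TrimSpace(before), strings.TrimSpace(after)
}

func main() {
	for _, resp := range []string{
		"hello",                          // tag-less direct answer
		"plan the reply</think>hello",    // prefilled start, closing tag only
		"<think>plan</think>hello there", // both tags in the content
	} {
		r, c := extractComplete(resp, "<think>")
		fmt.Printf("reasoning=%q content=%q\n", r, c)
	}
}
```

A streaming extractor cannot apply the same rule, since mid-stream the closing tag may simply not have arrived yet.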

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(reasoning): cover the enable_thinking=false non-thinking-mode regression

Adds the end-to-end case that actually broke session summaries / auto-titles
and was not covered before: a request with enable_thinking=false against a
<think>-capable model. In non-thinking mode the model emits no reasoning block,
so llama.cpp's autoparser returns ChatDeltas with content set and
reasoning_content empty (verified against stock llama-server: same model with
chat_template_kwargs.enable_thinking=false returns reasoning_content=null,
content="hello"). thinkingStartToken is still "<think>" because it is detected
per-model from the enable_thinking=true render, so the old code prepended it and
swallowed the answer. The test fails without the ExtractReasoningComplete gate.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 09:02:04 +02:00
LocalAI [bot]
9323f4b5ca feat(llama-cpp): video input support (mtmd #24269) (#10216)
* chore(llama-cpp): bump to 8f83d6c for mtmd video input support

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): forward video input to mtmd (template + non-template paths)

Wire request->videos() into grpc-server.cpp mirroring the existing image
and audio handling: a video_data build + non-template files extraction, and
input_video chat chunks on the tokenizer-template path. allow_video is
auto-set at model load by the vendored upstream chat_params.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): add video attachment support to the chat UI

Mirror the image/audio attachment path for video: emit video_url content
parts, accept video/* in the picker, keep video files as base64, show a
film icon badge, and render attached video inline with a <video> player.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(llama-cpp): patch mtmd video stdin double-close (heap crash)

Upstream mtmd video input (ggml-org/llama.cpp#24269) double-fcloses the
ffmpeg/ffprobe stdin FILE: feed_stdin() fclose()s the FILE returned by
subprocess_stdin() (which is sp->stdin_file), then subprocess_destroy()
fclose()s the same pointer again -> heap corruption that aborts the
backend on any base64 input_video request (the CLI --video file path is
unaffected). Vendor a one-line fix (null sp->stdin_file after fclose)
via prepare.sh's patches/ until upstream merges it.

Verified e2e with gemma-4-e2b-it-qat-q4_0: video frames decode via
ffmpeg and the model answers correctly (red clip -> 'Red', blue -> 'Blue').

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(llama-cpp): re-pin to upstream #24316, drop vendored stdin patch

Upstream replaced the ad-hoc video stdin handling with a proper RAII
refactor (ggml-org/llama.cpp#24316, "mtmd: refactor video subproc
handling"), which includes the same `sp->stdin_file = nullptr` guard our
patch added (plus join-before-destroy ordering). Re-pin LLAMA_VERSION to
that branch head and drop patches/0001 - it's now redundant.

Verified e2e with gemma-4-e2b-it-qat-q4_0: no crash, video frames decode
and the model answers correctly (red clip -> "Red", blue -> "Blue").

NOTE: #24316 is not yet merged, so this pins to its branch-head commit
(28ca1e60). Re-pin to the squash-merge commit on master once it lands,
otherwise `git fetch` may lose the commit after the branch is deleted.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-08 23:17:50 +02:00
LocalAI [bot]
c20225fc13 chore: ⬆️ Update CrispStrobe/CrispASR to f7838a306687f22c281d29c250f879a4ab3df2d7 (#10177)
* ⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(crispasr): link crispasr-lib CMake target instead of crispasr

The dependency-bump regeneration of this branch reset CMakeLists.txt to
master and dropped the prior link-target fix, reintroducing the
`cannot find -lcrispasr` failure. Upstream CrispASR (f7838a3) defines the
library as the CMake target `crispasr-lib` (with OUTPUT_NAME crispasr);
there is no target named `crispasr`, so target_link_libraries falls back
to a bare `-lcrispasr` linker flag that cannot be resolved. Point the link
at the real target name.

Verified locally: CPU cmake-configure of the bumped source generates a
gocrispasr link line referencing sources/CrispASR/src/libcrispasr.a with no
dangling -lcrispasr.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-08 16:01:19 +02:00
LocalAI [bot]
337acc4c37 chore: ⬆️ Update antirez/ds4 to c463029c205c2ec8d7ab6c0df4a3f52979091286 (#10189)
* ⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(ds4): link ds4_ssd.o into the backend build

Upstream antirez/ds4 splits the SSD expert-cache into its own ds4_ssd.c
translation unit, whose symbols (ds4_ssd_memory_lock_acquire/release,
ds4_ssd_cache_experts_for_byte_budget, ds4_ssd_auto_cache_plan) are
referenced by ds4.c/ds4_cpu.o. The dependency-bump automation regenerated
this branch from clean master and dropped the prior linkage fix, so the
cpu-ds4 / cublas-ds4 backend builds fail again with undefined references.

Re-apply the ds4_ssd.o linkage GPU-agnostically (mirroring ds4_distributed.o)
in both the backend Makefile (DS4_OBJ_TARGET + the engine-object build rule
for every GPU mode) and CMakeLists.txt (list(APPEND DS4_OBJS ds4_ssd.o)).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-08 11:15:32 +02:00
LocalAI [bot]
618e90cd13 feat(gallery): add Gemma 4 QAT family + MTP speculative-decoding pairs (#10215)
Add the remaining official Google Gemma 4 QAT Q4_0 GGUFs (E2B, E4B,
26B-A4B, 31B) next to the existing 12B entry, each shipping its
multimodal mmproj.

Also add three MTP (Multi-Token Prediction) speculative-decoding bundles
that pair each QAT target with a QAT-matched assistant/drafter head:

  - 12B       <- Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF
  - 26B-A4B   <- boxwrench/gemma-4-qat-mtp-assistant-heads
  - 31B       <- boxwrench/gemma-4-qat-mtp-assistant-heads

The assistant heads use the gemma4_assistant architecture and are not
standalone chat models, so each entry bundles the target + draft and
sets draft_model together with the draft-mtp spec options
(spec_type:draft-mtp / spec_n_max:6 / spec_p_min:0.75), matching
MTPSpecOptions() in core/config/mtp.go. QAT-matched heads raise draft
acceptance substantially over generic non-QAT heads.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-08 10:26:42 +02:00
LocalAI [bot]
92dea961c2 fix: distributed backend reinstall/upgrade UI stuck on 'reinstalling' (#10214)
* fix(galleryop): self-evict terminal ops from OpCache.GetStatus

The processingBackends map (the UI 'reinstalling' spinner source) only cleared
an op when a client polled /api/backends/job/:uid. The Manage-page Reinstall and
Upgrade buttons never poll, so completed installs leaked into processingBackends
forever and the backend card spun 'reinstalling' even though the install had
finished. Evict terminal ops on the list read instead; DeleteUUID already
broadcasts the eviction so peer replicas converge.

Reproduced on a live 5-node distributed cluster: 5 backends sat in
processingBackends with underlying jobs reporting completed:true,progress:100.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): clear pending backend ops behind offline/draining nodes

ListDuePendingBackendOps filters status=healthy, so a backend op queued against
a node that went offline (stale heartbeat) or draining (admin action) was never
retried, aged out, or deleted - it leaked forever and kept the UI operation
spinning. Add DeleteStalePendingBackendOps and run it each reconcile pass:
draining nodes are cleared immediately (model rows already purged), offline
nodes once their heartbeat is older than a grace window (blip protection).

Reproduced on a live cluster: orphaned llama-cpp install rows targeting an
offline (nvidia-thor) and a draining (mac-mini-m4) node sat at attempts=0
indefinitely.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): stream per-node progress during backend upgrade

The install dispatch subscribed to a per-op progress subject and streamed
per-node download ticks; the upgrade dispatch did a bare 15-minute blocking
NATS round-trip with no subscription, so the UI showed progress:0 the whole
time (the 'reinstalling but nothing happens' report on a slow node).

Thread the op ID through BackendManager.UpgradeBackend -> the distributed
manager -> the adapter, and have the adapter subscribe to the per-op progress
subject before the request (extracted into a shared subscribeProgress helper
reused by install/upgrade/force-fallback). The worker's upgradeBackend now
creates the same DebouncedInstallProgressPublisher installBackend uses. An
upgrade is a force-reinstall, so it reuses SubjectNodeBackendInstallProgress
rather than minting a new subject - no new NATS permission, no new
rolling-update compat surface. Reconciler-driven retries pass empty
opID/onProgress and stay on the silent path.

Reproduced on a live cluster: upgrade of llama-cpp-development on agx-orin-slow
sat at progress:0 for 4+ minutes with no per-node feedback.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(galleryop): persist cancellation + periodically reap orphaned ops

Two distributed gaps surfaced when a replica was killed mid-upgrade on a live
cluster, leaving the backend stuck 'processing' in the UI forever:

1. CancelOperation flipped the in-memory status to cancelled and broadcast a
   NATS event but never persisted the terminal status. On the next replica
   restart the still-active row re-hydrated straight back into
   processingBackends and the UI spun again. It now calls store.Cancel(id) so
   the cancel survives a restart.

2. CleanStale (which marks abandoned active ops failed) only ran once on
   startup, so an op orphaned AFTER startup - its owning replica's foreground
   handler goroutine gone - was never reaped until the next restart. Add
   GalleryService.ReapStaleOperations and run it on a 15m ticker (CleanStale
   now returns the reaped count for observability).

Neither is covered by the OpCache self-evict fix: an orphaned op never reaches
Processed, so it would never self-evict.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(review): address self-review findings on the distributed install fixes

Three findings from an adversarial review of this branch:

1. CRITICAL - OpCache.GetStatus crashed under concurrent load. m.Map() returns
   the live internal map by reference, so deleting from it on the read path was
   an unsynchronized write to a map four HTTP handlers poll every ~1s -> a
   'concurrent map writes' fatal. Rewritten to iterate a Keys() snapshot, build
   a fresh result map, and apply evictions via the locked DeleteUUID after the
   loop. Added a -race concurrency regression guard.

2. HIGH - GetStatus evicted failed ops too, hiding them from /api/operations
   and breaking the dismiss-failed-op flow (the panel keeps Error != nil ops so
   the admin can read the error and click Dismiss). Eviction now fires only for
   terminal ops with Error == nil (success/cancelled); failures are retained.

3. MEDIUM - DeleteStalePendingBackendOps missed StatusUnhealthy nodes. A node
   marked unhealthy on a NATS ErrNoResponders never transitions to offline
   (health.go skips re-marking it), so its pending ops leaked exactly like the
   offline case. Unhealthy is now reaped via the same stale-heartbeat grace path
   (a fresh-heartbeat node is recovering and keeps its op).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(review-2): don't evict the still-installing soft-path; don't spin on failed ops

Second review pass found two issues:

1. MEDIUM (Go) - OpCache.GetStatus evicted the ErrWorkerStillInstalling
   soft-path op. That op is deliberately Processed=true with no error to show a
   yellow in-progress state when a worker timed out the NATS round-trip but is
   still installing in the background; the reconciler confirms the real outcome
   later. Evicting it (and broadcasting OpEnd + marking the DB completed) hid an
   install that may still fail. Eviction is now scoped to a clean success
   (progress 100 + 'completed', matching the job-poll's historical condition) or
   a cancellation - the soft-path (progress != 100) and failures are kept.

2. MEDIUM (React) - the Backends gallery card rendered ANY operation as an
   'Installing...' spinner, so a failed op (now intentionally kept in the list
   for the OperationsBar error + Dismiss) spun forever. Exclude errored ops from
   the card spinner, mirroring Models.jsx (isInstalling already excludes
   op.error). The error + Dismiss still surface in the global OperationsBar.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ui): refresh Manage backends table when an operation settles

The Manage backends table fetched installed backends only on mount/after delete
and checked upgrades only on tab activation. After a reinstall/upgrade completed
neither re-ran, so the installed-version cell and the 'update available' badge
stayed stale until the user switched tabs - the op looked like it 'did nothing'.

Watch the operations list (via useOperations) and re-fetch installed backends +
available upgrades whenever the count settles, mirroring the operations.length
watch Backends.jsx already uses. Consolidates the prior tab-activation upgrades
check into the same effect.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-08 10:03:02 +02:00
LocalAI [bot]
2e93186043 chore: ⬆️ Update ggml-org/llama.cpp to 9e3b928fd8c9d14dbf15a8768b9fdd7e5c721d66 (#10210)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-08 09:35:17 +02:00
LocalAI [bot]
d07037e817 chore: ⬆️ Update leejet/stable-diffusion.cpp to b3d56d0ba1bd437886079e339118e8e75bb79ee7 (#10211)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-08 09:03:57 +02:00
LocalAI [bot]
f6cc90d258 chore: ⬆️ Update mudler/parakeet.cpp to e270af73b94c9a5c37ec516230219ed4580e1db6 (#10212)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-07 23:52:44 +02:00
Adira
2c804bef5a fix(config): skip vocab arrays and mmap GGUF headers to speed up startup (#10213)
When the models directory holds many GGUF files, startup parsed every
model's full GGUF — including the tokenizer vocab arrays
(tokenizer.ggml.tokens/scores/merges, often >100k entries) — once per
model while guessing defaults. On slow storage (e.g. a models directory
on a Docker volume) those hundreds of thousands of tiny reads dominate
boot time before the HTTP server comes up.

The default-guessing path and the VRAM metadata reader only consume
scalar metadata and array lengths, never the array contents. Parse with
SkipLargeMetadata (seek past large arrays) and UseMMap (fault in a few
header pages instead of issuing per-element read() syscalls). For a
256k-token vocab this cuts the parse from ~524k read() syscalls to 8.
The mapping is released when ParseGGUFFile returns.

Fixes #9790

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
2026-06-07 23:33:52 +02:00
LocalAI [bot]
6070402477 chore(model gallery): 🤖 add 1 new model via gallery agent (#10209)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-07 22:09:32 +02:00
LocalAI [bot]
67f80a152b fix(mtp): don't auto-enable self-spec MTP for draft-only assistant GGUFs (#10208)
Gemma4 MTP (ggml-org/llama.cpp#23398) registers the prediction head as a
separate `gemma4-assistant` architecture. That assistant GGUF still carries
`<arch>.nextn_predict_layers`, so the architecture-agnostic detection in
HasEmbeddedMTPHead matched it and appended the `spec_type:draft-mtp` defaults.

Unlike the DeepSeek/Qwen embedded-head models, an assistant checkpoint cannot
self-speculate: it is a draft model that requires a paired target context
(`ctx_other`) and throws if loaded alone. Auto-applying the self-spec defaults
to a standalone assistant import therefore produces a broken config.

Guard the detection against draft-only assistant architectures (the `-assistant`
suffix is upstream's naming convention) so importing one no longer yields a
self-speculation config. Two-model target+draft pairing remains expressible
manually via `draft_model:` and is left to a follow-up.
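
A minimal sketch of the guard, assuming the architecture name and the `nextn_predict_layers` count have already been read from the GGUF metadata. The function names here are hypothetical and do not mirror HasEmbeddedMTPHead's real signature.

```go
package main

import (
	"fmt"
	"strings"
)

// isDraftOnlyAssistant reports whether the architecture follows the
// upstream "-assistant" naming convention for draft-only heads.
func isDraftOnlyAssistant(arch string) bool {
	return strings.HasSuffix(arch, "-assistant")
}

// shouldAutoEnableSelfSpecMTP decides whether the self-speculation
// defaults should be appended for an imported model.
func shouldAutoEnableSelfSpecMTP(arch string, nextnPredictLayers uint32) bool {
	return nextnPredictLayers > 0 && !isDraftOnlyAssistant(arch)
}

func main() {
	// The layer counts below are made-up values for illustration.
	fmt.Println(shouldAutoEnableSelfSpecMTP("deepseek2", 1))
	fmt.Println(shouldAutoEnableSelfSpecMTP("gemma4-assistant", 1))
}
```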


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-07 22:09:02 +02:00
LocalAI [bot]
a7cb587d96 feat(parakeet-cpp): real segment timestamps (NeMo-faithful) (#10207)
* feat(parakeet-cpp): real segment timestamps (NeMo-faithful)

Offline: replace the single synthetic whole-clip segment with multiple
segments grouped exactly like NeMo's get_segment_offsets - a new segment
after sentence-ending punctuation ('. ? !'), each carrying start/end and
its time-window token ids. The optional model option segment_gap_threshold
(NeMo's unit: encoder FRAMES, default 0=off) adds NeMo's silence-gap split,
converted to seconds via the JSON frame_sec the engine now reports.
Per-segment words are still gated behind timestamp_granularities=["word"];
a zero-word document falls back to a single text segment.
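
The grouping rules can be sketched as below. This is an illustration only: the `word` and `segment` types and `splitWords` are hypothetical, the `>=` comparison for the gap is an assumption, and the sketch does not attempt to reproduce NeMo's exact branch order or the zero-word fallback. The threshold is given in encoder frames and converted to seconds with the frame duration.

```go
package main

import (
	"fmt"
	"strings"
)

type word struct {
	Text       string
	Start, End float64 // seconds
}

type segment struct {
	Text       string
	Start, End float64
}

// endsSentence reports whether a word ends with '.', '?' or '!'.
func endsSentence(w string) bool {
	return strings.HasSuffix(w, ".") || strings.HasSuffix(w, "?") || strings.HasSuffix(w, "!")
}

// splitWords groups words into segments. gapFrames == 0 disables the
// silence-gap rule; frameSec converts frames to seconds.
func splitWords(words []word, gapFrames int, frameSec float64) []segment {
	var segs []segment
	var cur []word
	flush := func() {
		if len(cur) == 0 {
			return
		}
		texts := make([]string, len(cur))
		for i, w := range cur {
			texts[i] = w.Text
		}
		segs = append(segs, segment{
			Text:  strings.Join(texts, " "),
			Start: cur[0].Start,
			End:   cur[len(cur)-1].End,
		})
		cur = nil
	}
	gapSec := float64(gapFrames) * frameSec
	for i, w := range words {
		// Gap rule: close the open segment if the silence before this word is long enough.
		if gapFrames > 0 && len(cur) > 0 && w.Start-words[i-1].End >= gapSec {
			flush()
		}
		cur = append(cur, w)
		// Punctuation rule: a sentence-ending word closes the segment.
		if endsSentence(w.Text) {
			flush()
		}
	}
	flush()
	return segs
}

func main() {
	words := []word{
		{"Hi.", 0, 0.5}, {"How", 0.6, 0.8}, {"are", 0.9, 1.0}, {"you?", 1.1, 1.4},
	}
	for _, s := range splitWords(words, 0, 0.08) {
		fmt.Printf("%.1f-%.1f %s\n", s.Start, s.End, s.Text)
	}
}
```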

Streaming: when libparakeet.so exposes the ABI v4 JSON entry points
(probed), drive parakeet_capi_stream_feed_json / _finalize_json and
accumulate the streamed per-word timestamps into per-utterance segments
(EOU stays the boundary), so streaming FinalResult segments now carry
start/end. Falls back to the text-only feed against an older library.

Pure-Go specs cover splitWordsIntoSegments (punctuation + gap rules, NeMo
elif order, fallback), transcriptResultFromDoc (multi-segment, token
windows, word-granularity gate), and the streaming segmenter.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(audio): document parakeet-cpp segment timestamps + segment_gap_threshold

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(parakeet-cpp): update model-gated specs for multi-segment output

The offline AudioTranscription specs asserted the old single synthetic
segment (Segments HaveLen(1), Segments[0].Text == res.Text). With
NeMo-faithful segmentation a multi-sentence clip now yields multiple
punctuation-delimited segments, so assert the new contract instead:
one-or-more time-ordered segments, each with text and (under word
granularity) per-segment words whose span tracks the segment start/end.
Caught by running the model-gated suite on the dgx (GB10) against the
real tdt_ctc-110m + realtime_eou models.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-07 22:08:24 +02:00
LocalAI [bot]
f7c74ad2da chore: ⬆️ Update ggml-org/llama.cpp to 31e82494c0a3913c919c1027fa70500fbf4c07dd (#10191)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-07 10:43:17 +02:00
LocalAI [bot]
7402d1fd20 chore(turboquant): bump to 7d9715f1 + fix compilation against rebased fork (#10205)
* chore(turboquant): bump TheTom/llama-cpp-turboquant to 7d9715f1

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(turboquant): drop obsolete legacy-spec shim after fork rebased

The TheTom/llama-cpp-turboquant fork (pin c9aa86a) rebased past the
upstream common_params_speculative refactor (ggml-org/llama.cpp
#22397/#22838/#22964), the model_tgt rename (#22838) and get_media_marker
(#21962). The old fork-compat shim forced now-wrong legacy code paths,
breaking the build with errors like 'struct common_params_speculative has
no member named mparams_dft / type' and 'server_context_impl has no member
named model'.

Remove the obsolete LOCALAI_LEGACY_LLAMA_CPP_SPEC branches from the shared
grpc-server.cpp (stock llama-cpp and the modern fork both take the modern
path now), and narrow the one remaining gap (the fork still lacks
common_params::checkpoint_min_step) to a dedicated
LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP guard injected by
patch-grpc-server.sh. The patch script now only adds the turbo2/3/4
KV-cache types and injects that one macro.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(turboquant): HIP-port the fork's CUDA additions (copy2d 3D-peer + cudaEventCreate)

The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs that
ggml's HIP/MUSA shim does not provide, breaking the -gpu-rocm-hipblas-turboquant
build. patches/0001-hip-guard-copy2d-peer-fastpath.patch (applied by
apply-patches.sh) ports them:

- Guard ggml_cuda_copy2d_across_devices's 3D-peer copy fast path with
  #if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) so HIP/MUSA fall through
  to the existing cudaMemcpyAsync staging fallback (HIP genuinely lacks
  cudaMemcpy3DPeerAsync, per the fork's own comment).
- Create the device event in ggml_backend_cuda_device_event_new with the
  HIP-aliased cudaEventCreateWithFlags(.., cudaEventDisableTiming) instead of the
  un-aliased plain cudaEventCreate, matching this file's own usage elsewhere.

CUDA builds are unaffected.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* ci(turboquant): drop the ROCm/hipblas build flavor

The TheTom/llama-cpp-turboquant fork is not ROCm-clean at the current pin:
beyond the CUDA-API gaps already patched (3D-peer copy, cudaEventCreate),
its llama.cpp base fails to compile the flash-attention MMA f16 kernels for
head-dim 640 under HIP (cols_per_warp evaluates to 0 -> division-by-zero /
non-constant static asserts in fattn-mma-f16.cuh). That is a deep
ggml-on-ROCm kernel issue, not something a small fork patch can paper over.

Drop -gpu-rocm-hipblas-turboquant from the build matrix so turboquant still
ships for cpu / cublas / vulkan / sycl. Re-add it once the fork's HIP path
compiles (or upstream ggml fixes the large-head-dim MMA kernels for ROCm).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-07 10:42:06 +02:00
LocalAI [bot]
8c42695ef8 chore: ⬆️ Update ggml-org/whisper.cpp to a8ec021f2750a473ff4a8f3883bc9fdf5feafa84 (#10202)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-07 08:37:42 +02:00
LocalAI [bot]
72e3241431 chore: ⬆️ Update mudler/parakeet.cpp to abd0087dcc92ec5ad1f96f9fd86c49eb26a5ce67 (#10204)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-07 00:37:28 +02:00
LocalAI [bot]
cd2bf95862 fix(docs): use relearn notice shortcode instead of unsupported alert (#10206)
The Hugo relearn theme does not provide an "alert" shortcode, so the
docs deploy failed at the Build site step:

  failed to extract shortcode: template for shortcode "alert" not found
  docs/content/features/distributed-mode.md:136

Convert the warning block to the theme-supported notice shortcode used
everywhere else in the docs.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-07 00:37:12 +02:00
LocalAI [bot]
f64b72dd7d feat: support Ideogram4 in stablediffusion-ggml backend + gallery (#10201)
* feat(stablediffusion-ggml): support Ideogram4 unconditional diffusion model

Bump stable-diffusion.cpp from 1f9ee88 to b9254dd, the upstream commit that
adds Ideogram4 support (leejet/stable-diffusion.cpp#1609). Ideogram4 derives
its classifier-free guidance from a separate unconditional diffusion model,
exposed upstream through the new sd_ctx_params_t.uncond_diffusion_model_path
field.

Wire that field into the gosd wrapper via a new uncond_diffusion_model_path
option. The _path suffix is deliberate: the Go loader only resolves options
whose name contains "path" to an absolute path under the model directory, so
this keeps the option consistent with diffusion_model_path and
high_noise_diffusion_model_path.
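
As an illustration of the naming rule above, here is a minimal Go sketch of a loader that treats any option whose name contains "path" as a file under the model directory. The `name:value` option format, the `resolveOptions` helper, the file names and the `sampler` option are assumptions made for this example; this is not the actual LocalAI loader code.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// resolveOptions is a hypothetical helper. It rewrites the value of every
// "name:value" option whose name contains "path" so that a relative value
// becomes a path under modelDir. All other options pass through unchanged.
func resolveOptions(modelDir string, opts []string) []string {
	out := make([]string, 0, len(opts))
	for _, opt := range opts {
		name, value, found := strings.Cut(opt, ":")
		if found && strings.Contains(name, "path") && !filepath.IsAbs(value) {
			opt = name + ":" + filepath.Join(modelDir, value)
		}
		out = append(out, opt)
	}
	return out
}

func main() {
	opts := []string{
		"diffusion_model_path:ideogram4.gguf",               // hypothetical file name
		"uncond_diffusion_model_path:ideogram4-uncond.gguf", // hypothetical file name
		"sampler:euler", // hypothetical option, no "path" in the name
	}
	for _, o := range resolveOptions("/models", opts) {
		fmt.Println(o)
	}
}
```

Under this rule an option named `uncond_diffusion_model` (without the suffix) would be passed to the backend as a bare file name, which is why the suffix matters.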

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(gallery): add Ideogram4 stablediffusion-ggml models

Single-file GGUF weights for Ideogram4 are now published
(stduhpf/ideogram-4-gguf), so add the model to the gallery. Ideogram4 is a
text-to-image model with strong, accurate in-image text rendering, driven by
a Qwen3-VL-8B text encoder and real classifier-free guidance from a separate
unconditional diffusion model (the uncond_diffusion_model_path support added
in the preceding commit).

Two index entries, both built on gallery/virtual.yaml with the full config
inlined in overrides (same pattern as the other models, no dedicated template
file):
- ideogram-4-iq4nl-ggml (4-bit, ~11.6GB diffusion)
- ideogram-4-q8_0-ggml  (8-bit, ~20GB diffusion)

Each bundles the diffusion + unconditional GGUF (stduhpf), the
Qwen3-VL-8B-Instruct text encoder (unsloth), and the FLUX.2 VAE (Comfy-Org
mirror, non-gated). cfg_scale is 7 to match the upstream Ideogram4 default,
since it performs real CFG unlike the guidance-distilled Flux/Z-Image models.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-06 22:50:12 +02:00
LocalAI [bot]
03c84cff28 feat(parakeet-cpp): nemotron-3.5-asr multilingual streaming model + request language support (#10199)
* feat(parakeet-cpp): honor request language (multilingual nemotron) on batched + streaming paths

Reads opts.GetLanguage() and threads it through to the new
parakeet_capi_transcribe_pcm_batch_json_lang and parakeet_capi_stream_begin_lang
C-API entry points, both probed with Dlsym so the backend still loads against an
older libparakeet.so (falling back to the non-lang paths, i.e. model default).

parakeet.cpp's batched C-API takes a single target_lang for the whole batch, so
the dispatcher only coalesces same-language requests: a request whose language
differs from the batch leader is held as a single carry-over and becomes the
leader of the next batch, never dropped and never left waiting (including on
shutdown). A new batcher test asserts no dispatched batch is ever mixed-language
and that every submitted request still receives a reply.
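
The carry-over rule described above can be sketched roughly as follows. This is a simplified, single-goroutine model over a slice, assuming a hypothetical `request` type and `drainBatches` function; the real dispatcher works on channels and timers, which are left out here.

```go
package main

import "fmt"

// request is a hypothetical stand-in for a queued transcription request.
type request struct {
	ID   int
	Lang string
}

// drainBatches groups queued requests into batches of at most maxBatch items.
// Every batch shares the language of its leader. The first request whose
// language differs from the leader closes the batch and is carried over as
// the leader of the next one, so no request is dropped or reordered.
func drainBatches(queue []request, maxBatch int) [][]request {
	var batches [][]request
	var carry request
	hasCarry := false
	i := 0
	for i < len(queue) || hasCarry {
		var batch []request
		if hasCarry {
			batch = append(batch, carry)
			hasCarry = false
		} else {
			batch = append(batch, queue[i])
			i++
		}
		for len(batch) < maxBatch && i < len(queue) {
			next := queue[i]
			i++
			if next.Lang != batch[0].Lang {
				carry, hasCarry = next, true
				break
			}
			batch = append(batch, next)
		}
		batches = append(batches, batch)
	}
	return batches
}

func main() {
	queue := []request{{1, "en"}, {2, "en"}, {3, "de"}, {4, "en"}}
	for _, b := range drainBatches(queue, 4) {
		fmt.Println(b)
	}
}
```

Because the outer loop keeps running while a carry-over is pending, the held request is flushed even when the queue is already empty, which mirrors the shutdown guarantee stated in the commit message.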

Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(gallery): add parakeet-cpp-nemotron-3.5-asr-streaming-0.6b; bump parakeet.cpp pin

Adds the multilingual prompt-conditioned streaming model to the gallery (q8_0
default, OpenMDW-1.1) and bumps the parakeet-cpp backend pin to the parakeet.cpp
commit that ships nemotron support plus batched causal subsampling and the
batched target_lang C-API.

Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-06 13:53:10 +02:00
LocalAI [bot]
9bc69c9e5f chore(model gallery): 🤖 add 1 new models via gallery agent (#10200)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-06 13:52:46 +02:00
LocalAI [bot]
1e6c9cfd60 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 6b9de3dbaa21ae95ea80638e5ee836795cc48c93 (#10190)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-06 09:42:43 +02:00
LocalAI [bot]
0e6712f734 chore: ⬆️ Update mudler/parakeet.cpp to 843600590f96a31467a5199f827c253f34c110f7 (#10198)
chore(parakeet-cpp): bump pin to banded long-audio attention (843600590)

Update PARAKEET_VERSION to mudler/parakeet.cpp@843600590f
(merge of parakeet.cpp#9). Brings NeMo rel_pos_local_attn banded/Longformer
attention with the chunk-matmul construction: long audio now uses O(T*window)
attention instead of global O(T^2), fixing the encoder OOM on long clips
(~16.6-min clip: 54GB->9.4GB peak, ~4x faster) at NeMo's full [128,128] window.
Short clips are unchanged (global path). No C-ABI change.
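
To make the O(T*window) versus O(T^2) difference concrete, the sketch below counts attention score entries for both schemes. It is a back-of-the-envelope model with hypothetical function names, assuming a window of 128 frames to each side; it does not model the chunk-matmul construction, and the memory and speed figures quoted above are the measured ones, not derived from this count.

```go
package main

import "fmt"

// globalScores counts score entries when every frame attends to every frame.
func globalScores(t int) int {
	return t * t
}

// bandedScores counts score entries when frame i only attends to frames in
// [i-left, i+right], clipped to the sequence boundaries.
func bandedScores(t, left, right int) int {
	total := 0
	for i := 0; i < t; i++ {
		lo := i - left
		if lo < 0 {
			lo = 0
		}
		hi := i + right
		if hi > t-1 {
			hi = t - 1
		}
		total += hi - lo + 1
	}
	return total
}

func main() {
	// Sequence lengths are arbitrary illustration values.
	for _, t := range []int{1000, 10000, 100000} {
		fmt.Printf("T=%d global=%d banded=%d\n", t, globalScores(t), bandedScores(t, 128, 128))
	}
}
```

The banded count grows linearly with T (about 257 entries per frame), while the global count grows quadratically.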


Assisted-by: Claude:claude-opus-4-8

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-06 09:25:25 +02:00
LocalAI [bot]
0e4cee9a97 chore: bump LocalAGI + localrecall (fix pgvector hybrid search seqscan, #10186) (#10192)
chore: bump LocalAGI and localrecall (index-backed RRF hybrid search)

Bumps the agent stack to pull in the PostgreSQL hybrid-search fix:

- mudler/localrecall -> v0.6.3-...-9a3b3321a9cd (mudler/LocalRecall#46, merged)
- mudler/LocalAGI    -> ...-14aed1ae4336 (mudler/LocalAGI#477, merged)

localrecall's hybrid search previously sorted on a wrapped scalar
similarity expression, which blinded the planner into a full sequential
scan over every row and exceeded the statement timeout on large
collections, returning an empty result set. It now uses the canonical
Reciprocal Rank Fusion pattern (index-backed candidate retrieval + FULL
OUTER JOIN + weighted RRF).
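
For readers unfamiliar with the fusion step, here is a minimal in-memory Go sketch of weighted Reciprocal Rank Fusion over two ranked candidate lists. The function name, the weights and the constant k=60 (a commonly used default) are assumptions for illustration; localrecall performs the equivalent in SQL, and its exact query is not reproduced here.

```go
package main

import (
	"fmt"
	"sort"
)

// fuseRRF merges two ranked ID lists. Each list contributes
// weight / (k + rank) to a document's score, with rank starting at 1.
// A document missing from one list simply gets no contribution from it,
// which plays the role of the FULL OUTER JOIN in the SQL formulation.
func fuseRRF(vectorHits, keywordHits []string, wVector, wKeyword, k float64) []string {
	scores := make(map[string]float64)
	for rank, id := range vectorHits {
		scores[id] += wVector / (k + float64(rank+1))
	}
	for rank, id := range keywordHits {
		scores[id] += wKeyword / (k + float64(rank+1))
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	// Highest fused score first; ties are broken by ID for determinism.
	sort.Slice(ids, func(i, j int) bool {
		if scores[ids[i]] != scores[ids[j]] {
			return scores[ids[i]] > scores[ids[j]]
		}
		return ids[i] < ids[j]
	})
	return ids
}

func main() {
	vectorHits := []string{"a", "b", "c"} // e.g. top results of an index-backed vector search
	keywordHits := []string{"b", "d"}     // e.g. top results of an index-backed text search
	fmt.Println(fuseRRF(vectorHits, keywordHits, 1.0, 1.0, 60))
}
```

Since only ranks are combined, each candidate list can be retrieved with its own index-backed ORDER BY ... LIMIT query, which is what avoids the sequential scan.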

Fixes #10186

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-06 09:16:59 +02:00
311 changed files with 20208 additions and 1634 deletions

@@ -703,6 +703,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-locate-anything-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "locate-anything-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -768,6 +781,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-omnivoice-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "omnivoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -1543,6 +1569,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-locate-anything-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "locate-anything-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1569,6 +1608,19 @@ include:
backend: "rfdetr-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-locate-anything-cpp'
base-image: "ubuntu:24.04"
ubuntu-version: '2404'
runs-on: 'ubuntu-24.04-arm'
backend: "locate-anything-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1673,6 +1725,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-omnivoice-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "omnivoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1712,6 +1777,19 @@ include:
backend: "qwen3-tts-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-omnivoice-cpp'
base-image: "ubuntu:24.04"
ubuntu-version: '2404'
runs-on: 'ubuntu-24.04-arm'
backend: "omnivoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1766,20 +1844,6 @@ include:
dockerfile: "./backend/Dockerfile.llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-turboquant'
builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-rocm-amd64'
runs-on: 'ubuntu-latest'
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "turboquant"
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2820,6 +2884,74 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# locate-anything-cpp
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-locate-anything-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "locate-anything-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f32-locate-anything-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "locate-anything-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f16-locate-anything-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "locate-anything-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-locate-anything-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "locate-anything-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-locate-anything-cpp'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "locate-anything-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2913,6 +3045,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64-locate-anything-cpp'
base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
runs-on: 'ubuntu-24.04-arm'
backend: "locate-anything-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
# whisper
- build-type: ''
cuda-major-version: ""
@@ -3377,6 +3522,35 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# omnivoice-cpp
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-omnivoice-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "omnivoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-omnivoice-cpp'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "omnivoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
@@ -3390,6 +3564,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f32-omnivoice-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "omnivoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
@@ -3403,6 +3590,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f16-omnivoice-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "omnivoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
@@ -3417,6 +3617,20 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-omnivoice-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "omnivoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
@@ -3431,6 +3645,20 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-omnivoice-cpp'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "omnivoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
@@ -3444,6 +3672,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64-omnivoice-cpp'
base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
runs-on: 'ubuntu-24.04-arm'
backend: "omnivoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
@@ -3457,6 +3698,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-omnivoice-cpp'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
runs-on: 'ubuntu-latest'
skip-drivers: 'false'
backend: "omnivoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# vibevoice-cpp
- build-type: ''
cuda-major-version: ""
@@ -4287,6 +4541,10 @@ includeDarwin:
tag-suffix: "-metal-darwin-arm64-qwen3-tts-cpp"
build-type: "metal"
lang: "go"
- backend: "omnivoice-cpp"
tag-suffix: "-metal-darwin-arm64-omnivoice-cpp"
build-type: "metal"
lang: "go"
- backend: "vibevoice-cpp"
tag-suffix: "-metal-darwin-arm64-vibevoice-cpp"
build-type: "metal"
@@ -4355,6 +4613,10 @@ includeDarwin:
tag-suffix: "-metal-darwin-arm64-silero-vad"
build-type: "metal"
lang: "go"
- backend: "sherpa-onnx"
tag-suffix: "-metal-darwin-arm64-sherpa-onnx"
build-type: "metal"
lang: "go"
- backend: "local-store"
tag-suffix: "-metal-darwin-arm64-local-store"
build-type: "metal"
@@ -4362,3 +4624,9 @@ includeDarwin:
- backend: "llama-cpp-quantization"
tag-suffix: "-metal-darwin-arm64-llama-cpp-quantization"
build-type: "mps"
- backend: "speaker-recognition"
tag-suffix: "-metal-darwin-arm64-speaker-recognition"
build-type: "mps"
- backend: "ds4"
tag-suffix: "-metal-darwin-arm64-ds4"
lang: "go"

@@ -62,10 +62,18 @@ jobs:
variable: "RFDETR_VERSION"
branch: "main"
file: "backend/go/rfdetr-cpp/Makefile"
- repository: "mudler/locate-anything.cpp"
variable: "LOCATEANYTHING_VERSION"
branch: "master"
file: "backend/go/locate-anything-cpp/Makefile"
- repository: "predict-woo/qwen3-tts.cpp"
variable: "QWEN3TTS_CPP_VERSION"
branch: "main"
file: "backend/go/qwen3-tts-cpp/Makefile"
- repository: "ServeurpersoCom/omnivoice.cpp"
variable: "OMNIVOICE_VERSION"
branch: "master"
file: "backend/go/omnivoice-cpp/Makefile"
- repository: "localai-org/vibevoice.cpp"
variable: "VIBEVOICE_CPP_VERSION"
branch: "master"

@@ -38,6 +38,7 @@ jobs:
acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
rfdetr-cpp: ${{ steps.detect.outputs.rfdetr-cpp }}
locate-anything-cpp: ${{ steps.detect.outputs.locate-anything-cpp }}
vibevoice-cpp: ${{ steps.detect.outputs.vibevoice-cpp }}
localvqe: ${{ steps.detect.outputs.localvqe }}
voxtral: ${{ steps.detect.outputs.voxtral }}
@@ -563,7 +564,7 @@ jobs:
- name: Run e2e-backends smoke
env:
BACKEND_IMAGE: quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp
BACKEND_TEST_CAPS: health,load,predict,stream,logprobs,logit_bias
BACKEND_TEST_CAPS: health,load,predict,stream,logprobs,logit_bias,tokenize
run: |
make test-extra-backend
# Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked LLM.
@@ -901,6 +902,45 @@ jobs:
- name: Test rfdetr-cpp
run: |
make --jobs=5 --output-sync=target -C backend/go/rfdetr-cpp test
# Per-backend e2e for locate-anything-cpp: builds the .so + Go binary and
# runs `make -C backend/go/locate-anything-cpp test`. test.sh fetches the
# locate-anything-q8_0 GGUF (~6.3 GB, NVIDIA LocateAnything-3B) from the
# published mudler/locate-anything.cpp-gguf HF repo + a COCO image, then the
# Go wire test loads the model and runs an open-vocabulary Detect, asserting
# at least one labeled box. Heavier than the other Go backends (it is a 3B),
# so it is gated to changes under backend/go/locate-anything-cpp/.
tests-locate-anything-cpp:
needs: detect-changes
if: needs.detect-changes.outputs.locate-anything-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y build-essential cmake curl libopenblas-dev
- name: Setup Go
uses: actions/setup-go@v5
- name: Display Go version
run: go version
- name: Proto Dependencies
run: |
# Install protoc
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
PATH="$PATH:$HOME/go/bin" make protogen-go
- name: Build locate-anything-cpp
run: |
make --jobs=5 --output-sync=target -C backend/go/locate-anything-cpp
- name: Test locate-anything-cpp
run: |
make --jobs=5 --output-sync=target -C backend/go/locate-anything-cpp test
# Per-backend smoke for vibevoice-cpp: builds the .so + Go binary and
# runs `make -C backend/go/vibevoice-cpp test`. test.sh auto-downloads
# the published mudler/vibevoice.cpp-models bundle (TTS Q8_0 + ASR Q4_K

@@ -108,6 +108,7 @@ RUN <<EOT bash
apt-get update && \
apt-get install -y --no-install-recommends \
cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
cuda-nvrtc-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \

@@ -1,5 +1,5 @@
# Disable parallel execution for backend builds
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
GOCMD=go
GOTEST=$(GOCMD) test
@@ -180,7 +180,7 @@ osx-signed: build
## Run
run: ## run local-ai
CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./
CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./cmd/local-ai
prepare-test: protogen-go build-mock-backend
@@ -566,6 +566,7 @@ prepare-test-extra: protogen-python
$(MAKE) -C backend/python/speaker-recognition
$(MAKE) -C backend/rust/kokoros kokoros-grpc
$(MAKE) -C backend/go/rfdetr-cpp
$(MAKE) -C backend/go/locate-anything-cpp
test-extra: prepare-test-extra
$(MAKE) -C backend/python/transformers test
@@ -593,6 +594,7 @@ test-extra: prepare-test-extra
$(MAKE) -C backend/python/speaker-recognition test
$(MAKE) -C backend/rust/kokoros test
$(MAKE) -C backend/go/rfdetr-cpp test
$(MAKE) -C backend/go/locate-anything-cpp test
##
## End-to-end gRPC tests that exercise a built backend container image.
@@ -1174,6 +1176,7 @@ BACKEND_PARAKEET_CPP = parakeet-cpp|golang|.|false|true
BACKEND_VOXTRAL = voxtral|golang|.|false|true
BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
BACKEND_OMNIVOICE_CPP = omnivoice-cpp|golang|.|false|true
BACKEND_VIBEVOICE_CPP = vibevoice-cpp|golang|.|false|true
BACKEND_LOCALVQE = localvqe|golang|.|false|true
BACKEND_OPUS = opus|golang|.|false|true
@@ -1292,6 +1295,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_WHISPERX)))
$(eval $(call generate-docker-build-target,$(BACKEND_ACE_STEP)))
$(eval $(call generate-docker-build-target,$(BACKEND_ACESTEP_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_QWEN3_TTS_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_OMNIVOICE_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_LOCALVQE)))
$(eval $(call generate-docker-build-target,$(BACKEND_MLX)))
@@ -1309,7 +1313,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
docker-save-%: backend-images
docker save local-ai-backend:$* -o backend-images/$*.tar
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-omnivoice-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy
########################################################
### Mock Backend for E2E Tests

@@ -149,12 +149,26 @@ local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
local-ai run oci://localai/phi-2:latest
```
To test a running LocalAI server from the terminal, open an interactive chat session from another shell. Inside the prompt, `/models` lists installed models and `/model <name>` switches between them.
```bash
# Terminal 1
local-ai run llama-3.2-1b-instruct:q4_k_m
# Terminal 2
local-ai chat --model llama-3.2-1b-instruct:q4_k_m
```
> **Automatic Backend Detection**: LocalAI automatically detects your GPU capabilities and downloads the appropriate backend. For advanced options, see [GPU Acceleration](https://localai.io/features/gpu-acceleration/).
For more details, see the [Getting Started guide](https://localai.io/basics/getting_started/).
## Latest News
- **June 2026**: New [realtime voice assistant demo](https://github.com/localai-org/localai-realtime-demo) (a tiny Go client for the Realtime API with a full talk-back voice loop and tool calling), plus [streaming of the realtime LLM / TTS / transcription pipeline stages](https://github.com/mudler/LocalAI/pull/10176) and [configurable WebRTC ICE candidates](https://github.com/mudler/LocalAI/pull/10231).
- **June 2026**: Big speech push: the [parakeet.cpp](https://github.com/mudler/parakeet.cpp) ASR engine gains [NeMo-faithful segment timestamps](https://github.com/mudler/LocalAI/pull/10207), a [multilingual streaming Nemotron-3.5 model](https://github.com/mudler/LocalAI/pull/10199), [dynamic batching for concurrent transcription](https://github.com/mudler/LocalAI/pull/10112) and [CUDA graphs](https://github.com/mudler/LocalAI/pull/10273); the new [CrispASR backend](https://github.com/mudler/LocalAI/pull/10099) adds multi-architecture ASR + TTS, and [60 Piper TTS voices across 42 languages](https://github.com/mudler/LocalAI/pull/10296) land in the gallery (plus [per-request TTS instructions and params](https://github.com/mudler/LocalAI/pull/10172)).
- **June 2026**: New backends and models: [locate-anything.cpp](https://github.com/mudler/LocalAI/pull/10264) for open-vocabulary object detection via ggml, [Ideogram4 image generation](https://github.com/mudler/LocalAI/pull/10201) in stablediffusion-ggml, [llama.cpp video input](https://github.com/mudler/LocalAI/pull/10216), and the [Gemma 4 QAT family with MTP speculative-decoding pairs](https://github.com/mudler/LocalAI/pull/10215). Plus an [interactive CLI chat mode](https://github.com/mudler/LocalAI/pull/10226) and [RAG source citations in agent responses](https://github.com/mudler/LocalAI/pull/10228).
- **June 2026**: Distributed mode hardening: [prefix-cache-aware routing](https://github.com/mudler/LocalAI/pull/10071), a [production-ready request router with auto-sized embedding/rerank batches](https://github.com/mudler/LocalAI/pull/10104), [ds4 layer-split distributed inference](https://github.com/mudler/LocalAI/pull/10098), [NATS JWT auth + TLS/mTLS](https://github.com/mudler/LocalAI/pull/10159), and [resumable file uploads](https://github.com/mudler/LocalAI/pull/10109).
- **May 2026**: **LocalAI 4.3.0** - `llama.cpp` [prompt cache on by default](https://github.com/mudler/LocalAI/pull/9925) (repeated system prompts collapse from minutes to seconds), [keyless cosign signing of backend OCI images](https://github.com/mudler/LocalAI/pull/9823), [per-API-key + per-user usage attribution](https://github.com/mudler/LocalAI/pull/9920), Distributed v3 with [per-request replica routing](https://github.com/mudler/LocalAI/pull/9968). [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.3.0)
- **May 2026**: **LocalAI 4.2.0** - LocalAI sees and hears: [voice recognition](https://github.com/mudler/LocalAI/pull/9500), [face recognition + antispoofing liveness](https://github.com/mudler/LocalAI/pull/9480), speaker diarization. Plus [drop-in Ollama API](https://github.com/mudler/LocalAI/pull/9284), [video generation](https://github.com/mudler/LocalAI/pull/9420), redesigned UI with i18n + admin-configurable branding, vLLM at feature parity with llama.cpp, and 11 new backends. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.2.0)
- **April 2026**: **LocalAI 4.1.0** - LocalAI becomes a control tower: distributed cluster mode with VRAM-aware smart routing + autoscaling, multi-user platform with OIDC and API keys, per-user quotas with predictive analytics, in-UI fine-tuning with TRL (auto-export to GGUF), on-the-fly quantization backend, visual pipeline editor. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.1.0)
@@ -207,7 +221,7 @@ See the full [Backend & Model Compatibility Table](https://localai.io/model-comp
- [Integrations & community projects](https://localai.io/docs/integrations/)
- [Installation video walkthrough](https://www.youtube.com/watch?v=cMVNnlqwfw4)
- [Media & blog posts](https://localai.io/basics/news/#media-blogs-social)
- [Examples](https://github.com/mudler/LocalAI-examples)
- [Examples](https://github.com/mudler/LocalAI-examples) — including the [realtime voice assistant demo](https://github.com/localai-org/localai-realtime-demo) (Go client for the Realtime API with tool calling)
## Team


@@ -206,6 +206,16 @@ RUN if [ "${BACKEND}" = "opus" ]; then \
apt-get clean && rm -rf /var/lib/apt/lists/*; \
fi
# CrispASR's piper TTS backend dlopens libespeak-ng at runtime to phonemize
# non-English text (the MIT-clean path; English uses a built-in G2P). Install
# the espeak-ng runtime + its libpcaudio/libsonic deps + voice data so
# package.sh can bundle them into the FROM scratch image.
RUN if [ "${BACKEND}" = "crispasr" ]; then \
apt-get update && apt-get install -y --no-install-recommends \
espeak-ng-data libespeak-ng1 libpcaudio0 libsonic0 && \
apt-get clean && rm -rf /var/lib/apt/lists/*; \
fi
COPY . /LocalAI
RUN git config --global --add safe.directory /LocalAI


@@ -126,6 +126,7 @@ RUN <<EOT bash
apt-get update && \
apt-get install -y --no-install-recommends \
cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
cuda-nvrtc-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \


@@ -60,10 +60,12 @@ elseif(DS4_GPU STREQUAL "cpu")
set(DS4_OBJS "${DS4_DIR}/ds4_cpu.o")
endif()
# ds4.c now references ds4_distributed.c (distributed inference was split into
# its own translation unit upstream). It is a single GPU-agnostic object shared
# by every GPU mode, so link it in regardless of DS4_GPU.
# ds4.c now references ds4_distributed.c (distributed inference) and ds4_ssd.c
# (SSD expert-cache), each split into its own translation unit upstream. Both
# are GPU-agnostic objects shared by every GPU mode, so link them in regardless
# of DS4_GPU.
list(APPEND DS4_OBJS "${DS4_DIR}/ds4_distributed.o")
list(APPEND DS4_OBJS "${DS4_DIR}/ds4_ssd.o")
add_executable(${TARGET}
grpc-server.cpp


@@ -1,10 +1,10 @@
# ds4 backend Makefile.
#
# Upstream pin lives below as DS4_VERSION?=477c0e82e2699b35a65fd0a1ed6fe66b41087dfe
# Upstream pin lives below as DS4_VERSION?=d881f2a05e8ff6bec001315a36b794b4aa310173
# (.github/bump_deps.sh) can find and update it - matches the
# llama-cpp / ik-llama-cpp / turboquant convention.
DS4_VERSION?=477c0e82e2699b35a65fd0a1ed6fe66b41087dfe
DS4_VERSION?=d881f2a05e8ff6bec001315a36b794b4aa310173
DS4_REPO?=https://github.com/antirez/ds4
CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
@@ -18,19 +18,20 @@ UNAME_S := $(shell uname -s)
CMAKE_ARGS ?= -DCMAKE_BUILD_TYPE=Release
# ds4_distributed.o is a GPU-agnostic translation unit that ds4.c/ds4_cpu.o now
# reference (upstream split distributed inference into its own .c). The same
# object is shared by every GPU mode, so it is appended unconditionally below.
# ds4_distributed.o and ds4_ssd.o are GPU-agnostic translation units that
# ds4.c/ds4_cpu.o now reference (upstream split distributed inference and the
# SSD expert-cache into their own .c files). Both objects are shared by every
# GPU mode, so they are appended unconditionally below.
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS += -DDS4_GPU=cuda
DS4_OBJ_TARGET := ds4.o ds4_cuda.o ds4_distributed.o
DS4_OBJ_TARGET := ds4.o ds4_cuda.o ds4_distributed.o ds4_ssd.o
else ifeq ($(UNAME_S),Darwin)
CMAKE_ARGS += -DDS4_GPU=metal
DS4_OBJ_TARGET := ds4.o ds4_metal.o ds4_distributed.o
DS4_OBJ_TARGET := ds4.o ds4_metal.o ds4_distributed.o ds4_ssd.o
else
# CPU reference path (Linux only - macOS CPU path is broken by VM bug per ds4 README).
CMAKE_ARGS += -DDS4_GPU=cpu
DS4_OBJ_TARGET := ds4_cpu.o ds4_distributed.o
DS4_OBJ_TARGET := ds4_cpu.o ds4_distributed.o ds4_ssd.o
endif
ifneq ($(NATIVE),true)
@@ -55,11 +56,11 @@ ds4:
# the right per-platform compile flags (Objective-C/Metal on Darwin, nvcc on Linux+CUDA).
ds4/ds4.o: ds4
ifeq ($(BUILD_TYPE),cublas)
+$(MAKE) -C ds4 ds4.o ds4_cuda.o ds4_distributed.o
+$(MAKE) -C ds4 ds4.o ds4_cuda.o ds4_distributed.o ds4_ssd.o
else ifeq ($(UNAME_S),Darwin)
+$(MAKE) -C ds4 ds4.o ds4_metal.o ds4_distributed.o
+$(MAKE) -C ds4 ds4.o ds4_metal.o ds4_distributed.o ds4_ssd.o
else
+$(MAKE) -C ds4 ds4_cpu.o ds4_distributed.o
+$(MAKE) -C ds4 ds4_cpu.o ds4_distributed.o ds4_ssd.o
endif
grpc-server: ds4/ds4.o


@@ -1,5 +1,5 @@
IK_LLAMA_VERSION?=1520eda980564241434b791ce2bbbd128c4be9ea
IK_LLAMA_VERSION?=e6f8112f3ba126eed3ff5b30cdd08085414a7516
LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
CMAKE_ARGS?=


@@ -1,5 +1,5 @@
LLAMA_VERSION?=7c158fbb4aec1bdc9c81d6ca0e785139f4826fae
LLAMA_VERSION?=4c6595503fe45d5a39f88d194e270f64c7424677
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
CMAKE_ARGS?=


@@ -381,6 +381,15 @@ json parse_options(bool streaming, const backend::PredictOptions* predict, const
});
}
// for each video in the request, add the video data
for (int i = 0; i < predict->videos_size(); i++) {
data["video_data"].push_back(json
{
{"id", i},
{"data", predict->videos(i)},
});
}
data["stop"] = predict->stopprompts();
// data["n_probs"] = predict->nprobs();
//TODO: images,
@@ -482,23 +491,13 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
if (!request->draftmodel().empty()) {
params.speculative.draft.mparams.path = request->draftmodel();
// Default to draft type if a draft model is set but no explicit type.
// Upstream (post ggml-org/llama.cpp#22838) made the speculative type a
// vector; the turboquant fork still uses the legacy scalar. The
// LOCALAI_LEGACY_LLAMA_CPP_SPEC macro is injected by
// backend/cpp/turboquant/patch-grpc-server.sh for fork builds only.
// Upstream renamed COMMON_SPECULATIVE_TYPE_DRAFT -> ..._DRAFT_SIMPLE
// in ggml-org/llama.cpp#22964; the fork still uses the old name.
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) {
params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT;
}
#else
// Upstream made the speculative type a vector (ggml-org/llama.cpp#22838)
// and renamed COMMON_SPECULATIVE_TYPE_DRAFT -> ..._DRAFT_SIMPLE (#22964).
const bool no_spec_type = params.speculative.types.empty() ||
(params.speculative.types.size() == 1 && params.speculative.types[0] == COMMON_SPECULATIVE_TYPE_NONE);
if (no_spec_type) {
params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE };
}
#endif
}
// params.model_alias ??
@@ -574,9 +573,10 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
// tokens (0 disables the minimum). Match upstream's default (256). This
// field was renamed from `checkpoint_every_nt` in llama.cpp; the semantics
// also shifted from a fixed cadence to a minimum spacing. The turboquant
// fork branched before the field existed, so skip it on the legacy path
// (LOCALAI_LEGACY_LLAMA_CPP_SPEC is injected by patch-grpc-server.sh).
#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC
// fork still lacks common_params::checkpoint_min_step, so skip it there
// (LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP is injected by
// backend/cpp/turboquant/patch-grpc-server.sh).
#ifndef LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP
params.checkpoint_min_step = 256;
#endif
@@ -752,7 +752,7 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
params.cache_idle_slots = false;
}
#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC
#ifndef LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP
// --- minimum context-checkpoint spacing (upstream -cms / --checkpoint-min-step) ---
// 0 disables the minimum-spacing gate. Old option names (`checkpoint_every_nt`,
// `checkpoint_every_n_tokens`) are kept as aliases for backward compatibility
@@ -906,17 +906,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
// Speculative decoding options
} else if (!strcmp(optname, "spec_type") || !strcmp(optname, "speculative_type")) {
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
// Fork only knows a single scalar `type`. Take the first comma-
// separated value and assign it via the singular helper.
std::string first = optval_str;
const auto comma = first.find(',');
if (comma != std::string::npos) first = first.substr(0, comma);
auto type = common_speculative_type_from_name(first);
if (type != COMMON_SPECULATIVE_TYPE_COUNT) {
params.speculative.type = type;
}
#else
// Upstream switched to a vector of types (comma-separated for multi-type
// chaining via common_speculative_types_from_names). We keep accepting a
// single value here, but also tolerate comma-separated lists.
@@ -945,7 +934,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
if (!parsed.empty()) {
params.speculative.types = parsed;
}
#endif
} else if (!strcmp(optname, "spec_n_max") || !strcmp(optname, "draft_max")) {
if (optval != NULL) {
try { params.speculative.draft.n_max = std::stoi(optval_str); } catch (...) {}
@@ -983,21 +971,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
// shares the target context size. Accept the option for backward
// compatibility but silently ignore it.
// Everything below relies on struct shape introduced in ggml-org/llama.cpp#22838
// (parallel drafting): `ngram_mod`, `ngram_map_k`, `ngram_map_k4v`,
// `ngram_cache`, and the `draft.{cache_type_*, cpuparams*, tensor_buft_overrides}`
// fields. The turboquant fork branched before that, so its build defines
// LOCALAI_LEGACY_LLAMA_CPP_SPEC via patch-grpc-server.sh and these option
// keys become unrecognized (silently dropped, like any unknown opt) for it.
//
// The `#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC` / `#else` split below sits at the
// closing-brace position of the `draft_ctx_size` branch on purpose: in the
// legacy build the chain ends here (the brace closes draft_ctx_size), and in
// the modern build the chain continues with `} else if (...)` instead, so the
// brace count stays balanced under both branches of the preprocessor.
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
}
#else
// --- ngram_mod family (upstream --spec-ngram-mod-*) ---
} else if (!strcmp(optname, "spec_ngram_mod_n_min")) {
if (optval != NULL) {
@@ -1127,7 +1100,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
}
if (!cur.empty()) flush(cur);
}
#endif // LOCALAI_LEGACY_LLAMA_CPP_SPEC — closes the `else`/`#ifdef` opened at draft_ctx_size
}
// Set params.n_parallel from environment variable if not set via options (fallback)
@@ -1177,15 +1149,11 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
params.tensor_buft_overrides.push_back({nullptr, nullptr});
}
}
// The draft tensor_buft_overrides are only populated under the modern
// (post-#22838) layout, whose population code is itself gated by
// LOCALAI_LEGACY_LLAMA_CPP_SPEC above. The turboquant fork lacks
// common_params_speculative::draft entirely, so skip the sentinel there too.
#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC
// Terminate the draft tensor_buft_overrides list with a sentinel, mirroring
// the main-model handling above.
if (!params.speculative.draft.tensor_buft_overrides.empty()) {
params.speculative.draft.tensor_buft_overrides.push_back({nullptr, nullptr});
}
#endif
// TODO: Add yarn
@@ -1544,7 +1512,7 @@ public:
msg_json["role"] = msg.role();
bool is_last_user_msg = (i == last_user_msg_idx);
bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0);
bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0 || request->videos_size() > 0);
// Handle content - can be string, null, or array
// For multimodal content, we'll embed images/audio from separate fields
@@ -1595,6 +1563,16 @@ public:
content_array.push_back(audio_chunk);
}
}
if (request->videos_size() > 0) {
for (int j = 0; j < request->videos_size(); j++) {
json video_chunk;
video_chunk["type"] = "input_video";
json input_video;
input_video["data"] = request->videos(j);
video_chunk["input_video"] = input_video;
content_array.push_back(video_chunk);
}
}
msg_json["content"] = content_array;
} else {
// Use content as-is (already array or not last user message)
@@ -1629,6 +1607,16 @@ public:
content_array.push_back(audio_chunk);
}
}
if (request->videos_size() > 0) {
for (int j = 0; j < request->videos_size(); j++) {
json video_chunk;
video_chunk["type"] = "input_video";
json input_video;
input_video["data"] = request->videos(j);
video_chunk["input_video"] = input_video;
content_array.push_back(video_chunk);
}
}
msg_json["content"] = content_array;
} else if (msg.role() == "tool") {
// Tool role messages must have content field set, even if empty
@@ -2080,6 +2068,16 @@ public:
files.push_back(decoded_data);
}
}
const auto &video_data = data.find("video_data");
if (video_data != data.end() && video_data->is_array())
{
for (const auto &video : *video_data)
{
auto decoded_data = base64_decode(video["data"].get<std::string>());
files.push_back(decoded_data);
}
}
}
const bool has_mtmd = ctx_server.impl->mctx != nullptr;
@@ -2332,7 +2330,7 @@ public:
}
bool is_last_user_msg = (i == last_user_msg_idx);
bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0);
bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0 || request->videos_size() > 0);
// Handle content - can be string, null, or array
// For multimodal content, we'll embed images/audio from separate fields
@@ -2385,6 +2383,16 @@ public:
content_array.push_back(audio_chunk);
}
}
if (request->videos_size() > 0) {
for (int j = 0; j < request->videos_size(); j++) {
json video_chunk;
video_chunk["type"] = "input_video";
json input_video;
input_video["data"] = request->videos(j);
video_chunk["input_video"] = input_video;
content_array.push_back(video_chunk);
}
}
msg_json["content"] = content_array;
} else {
// Use content as-is (already array or not last user message)
@@ -2424,6 +2432,16 @@ public:
content_array.push_back(audio_chunk);
}
}
if (request->videos_size() > 0) {
for (int j = 0; j < request->videos_size(); j++) {
json video_chunk;
video_chunk["type"] = "input_video";
json input_video;
input_video["data"] = request->videos(j);
video_chunk["input_video"] = input_video;
content_array.push_back(video_chunk);
}
}
msg_json["content"] = content_array;
SRV_INF("[CONTENT DEBUG] Predict: Message %d created content array with media\n", i);
} else if (!msg.tool_calls().empty()) {
@@ -2886,6 +2904,16 @@ public:
files.push_back(decoded_data);
}
}
const auto &video_data = data.find("video_data");
if (video_data != data.end() && video_data->is_array())
{
for (const auto &video : *video_data)
{
auto decoded_data = base64_decode(video["data"].get<std::string>());
files.push_back(decoded_data);
}
}
}
// process files
@@ -3458,7 +3486,7 @@ public:
if (body.count("prompt") != 0) {
const bool add_special = json_value(body, "add_special", false);
llama_tokens tokens = tokenize_mixed(ctx_server.impl->vocab, body.at("content"), add_special, true);
llama_tokens tokens = tokenize_mixed(ctx_server.impl->vocab, body.at("prompt"), add_special, true);
for (const auto& token : tokens) {


@@ -1,7 +1,7 @@
# Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
# Auto-bumped nightly by .github/workflows/bump_deps.yaml.
TURBOQUANT_VERSION?=5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403
TURBOQUANT_VERSION?=7d9715f1f071fa07c7b2ad3dbfd320b314139e65
LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant
CMAKE_ARGS?=


@@ -4,21 +4,19 @@
#
# 1. Augment the kv_cache_types[] allow-list so `LoadModel` accepts the
# fork-specific `turbo2` / `turbo3` / `turbo4` cache types.
# 2. Replace `get_media_marker()` (added upstream in ggml-org/llama.cpp#21962,
# server-side random per-instance marker) with the legacy "<__media__>"
# literal. The fork branched before that PR, so server-common.cpp has no
# get_media_marker symbol. The fork's mtmd_default_marker() still returns
# "<__media__>", and Go-side tooling falls back to that sentinel when the
# backend does not expose media_marker, so substituting the literal keeps
# behavior identical on the turboquant path.
# 3. Revert the `common_params_speculative` field references to the
# pre-refactor flat layout. Upstream ggml-org/llama.cpp#22397 split the
# struct into nested `draft` / `ngram_simple` / `ngram_mod` / etc. members;
# the turboquant fork branched before that PR and still exposes the flat
# `n_max`, `mparams_dft`, `ngram_size_n`, ... fields. The substitutions
# below map the new nested paths back to the legacy flat names so the
# shared grpc-server.cpp keeps compiling against the fork's common.h.
# Drop this block once the fork rebases past #22397.
# 2. Define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP at the top of the file
# so the grpc-server option parser skips the two references to
# common_params::checkpoint_min_step (the default and the option handler).
# That field does not exist in the fork yet; drop this once it does.
#
# The fork used to lag upstream on the whole common_params_speculative refactor
# (ggml-org/llama.cpp#22397/#22838/#22964), the model_tgt rename (#22838) and
# get_media_marker (#21962), which required a much larger compat shim here
# (flat-field sed renames + a coarse LOCALAI_LEGACY_LLAMA_CPP_SPEC define). The
# fork has since rebased past all of those, so the only remaining gap is
# checkpoint_min_step. If a future bump reintroduces a divergence, add a narrow
# guard in grpc-server.cpp keyed on a fork-specific macro and inject it here
# rather than resurrecting the coarse one.
#
# We patch the *copy* sitting in turboquant-<flavor>-build/, never the original
# under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps compiling
@@ -72,72 +70,20 @@ else
echo "==> KV allow-list patch OK"
fi
if grep -q 'get_media_marker()' "$SRC"; then
echo "==> patching $SRC to replace get_media_marker() with legacy \"<__media__>\" literal"
# Only one call site today (ModelMetadata), but replace all occurrences to
# stay robust if upstream adds more. Use a temp file to avoid relying on
# sed -i portability (the builder image uses GNU sed, but keeping this
# consistent with the awk block above).
sed 's/get_media_marker()/"<__media__>"/g' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> get_media_marker() substitution OK"
# 2. Define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP at the top of the file so
# the grpc-server option parser skips the two references to
# common_params::checkpoint_min_step (the default assignment and the option
# handler). That field does not exist in the fork yet. Drop this block once
# the fork rebases past the bump that added checkpoint_min_step.
if grep -q '^#define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP' "$SRC"; then
echo "==> $SRC already defines LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP, skipping"
else
echo "==> $SRC has no get_media_marker() call, skipping media-marker patch"
fi
if grep -q 'params\.speculative\.draft\.\|params\.speculative\.ngram_simple\.' "$SRC"; then
echo "==> patching $SRC to revert common_params_speculative refs to pre-#22397 flat layout"
# Each substitution is the exact post-refactor path → legacy flat field.
# Order doesn't matter because the source paths are disjoint, but we keep
# the most-specific (mparams.path) first for readability.
sed -E \
-e 's/params\.speculative\.draft\.mparams\.path/params.speculative.mparams_dft.path/g' \
-e 's/params\.speculative\.draft\.n_max/params.speculative.n_max/g' \
-e 's/params\.speculative\.draft\.n_min/params.speculative.n_min/g' \
-e 's/params\.speculative\.draft\.p_min/params.speculative.p_min/g' \
-e 's/params\.speculative\.draft\.p_split/params.speculative.p_split/g' \
-e 's/params\.speculative\.draft\.n_gpu_layers/params.speculative.n_gpu_layers/g' \
-e 's/params\.speculative\.draft\.n_ctx/params.speculative.n_ctx/g' \
-e 's/params\.speculative\.ngram_simple\.size_n/params.speculative.ngram_size_n/g' \
-e 's/params\.speculative\.ngram_simple\.size_m/params.speculative.ngram_size_m/g' \
-e 's/params\.speculative\.ngram_simple\.min_hits/params.speculative.ngram_min_hits/g' \
"$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> speculative field rename OK"
else
echo "==> $SRC has no post-#22397 speculative field refs, skipping spec rename patch"
fi
# 4. Revert the `ctx_server.impl->model_tgt` rename introduced by upstream
# ggml-org/llama.cpp#22838 (parallel drafting). The turboquant fork still
# exposes the field as `model` on `server_context_impl`. The two call sites
# are in the Rerank and ModelMetadata RPC handlers.
if grep -q 'ctx_server\.impl->model_tgt' "$SRC"; then
echo "==> patching $SRC to revert ctx_server.impl->model_tgt -> ctx_server.impl->model"
sed -E 's/ctx_server\.impl->model_tgt/ctx_server.impl->model/g' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> model_tgt rename OK"
else
echo "==> $SRC has no ctx_server.impl->model_tgt refs, skipping model_tgt rename patch"
fi
# 5. Define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top of the file so the
# grpc-server option parser skips the new option-handler blocks (ngram_mod,
# ngram_map_k, ngram_map_k4v, ngram_cache, draft.cache_type_*, draft.cpuparams*,
# draft.tensor_buft_overrides) introduced for the post-#22838 layout, the
# draft.tensor_buft_overrides sentinel termination, and the
# common_params::checkpoint_min_step default/option (added with the
# 35c9b1f3 bump). Those blocks reference struct fields that simply do not
# exist in the fork.
if grep -q '^#define LOCALAI_LEGACY_LLAMA_CPP_SPEC' "$SRC"; then
echo "==> $SRC already defines LOCALAI_LEGACY_LLAMA_CPP_SPEC, skipping"
else
echo "==> patching $SRC to define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top"
# Insert the define before the very first `#include` so it precedes all the
# speculative-decoding code paths.
echo "==> patching $SRC to define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP at the top"
# Insert the define before the very first `#include` so it precedes the
# checkpoint_min_step references.
awk '
!done && /^#include/ {
print "#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1"
print "#define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP 1"
print "// ^ injected by backend/cpp/turboquant/patch-grpc-server.sh"
print ""
done = 1
@@ -145,13 +91,13 @@ else
{ print }
END {
if (!done) {
print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_LEGACY_LLAMA_CPP_SPEC" > "/dev/stderr"
print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP" > "/dev/stderr"
exit 1
}
}
' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> LOCALAI_LEGACY_LLAMA_CPP_SPEC define OK"
echo "==> LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP define OK"
fi
echo "==> all patches applied"
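
The remaining patch step above is idempotent: it skips the file when the define is already present, otherwise inserts it before the first `#include`, and fails when no anchor exists. The same logic can be sketched in Go for readers who find the awk block hard to follow. This is an illustration under assumptions, not the script itself: `injectDefine` is a hypothetical name, it works on an in-memory string rather than a file, and it omits the "injected by" comment line that the awk program also prints.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// guardMacro is the macro name taken from the patch script above.
const guardMacro = "LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP"

// injectDefine (hypothetical helper) returns src with the guard define
// inserted before the first line starting with "#include". If the define
// is already there, src is returned unchanged. A missing anchor is an error.
func injectDefine(src string) (string, error) {
	lines := strings.Split(src, "\n")

	// Idempotence check, equivalent to the grep for '^#define <macro>'.
	for _, l := range lines {
		if strings.HasPrefix(l, "#define "+guardMacro) {
			return src, nil
		}
	}

	for i, l := range lines {
		if strings.HasPrefix(l, "#include") {
			out := make([]string, 0, len(lines)+2)
			out = append(out, lines[:i]...)
			out = append(out, "#define "+guardMacro+" 1", "")
			out = append(out, lines[i:]...)
			return strings.Join(out, "\n"), nil
		}
	}
	return "", errors.New("no #include anchor found to insert " + guardMacro)
}

func main() {
	patched, err := injectDefine("// header\n#include <string>\nint main() {}")
	if err != nil {
		panic(err)
	}
	fmt.Println(patched)
}
```

Failing hard on a missing anchor matches the script's behavior: a silent no-op would let the turboquant build proceed and break later on the missing `checkpoint_min_step` field.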


@@ -0,0 +1,55 @@
hip: port the turboquant CUDA additions that ggml's HIP shim doesn't cover
The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs
that ggml's HIP (and MUSA) compatibility layer does not provide, breaking
the -gpu-rocm-hipblas-turboquant build:
1. ggml_cuda_copy2d_across_devices() (host-staged cross-device copy for
split mul_mat output) uses the CUDA 3D-peer copy APIs
cudaMemcpy3DPeerParms / make_cudaPitchedPtr / make_cudaExtent /
cudaMemcpy3DPeerAsync. HIP genuinely does not support these (see the
fork's own comment "HIP does not support cudaMemcpy3DPeerAsync"), so
guard the peer fast path with #if !defined(GGML_USE_HIP) &&
!defined(GGML_USE_MUSA) -- matching how the fork already guards the
same API for the sibling 2D copy -- and fall through to the existing
cudaMemcpyAsync staging fallback below (functionally identical,
slightly slower on multi-GPU ROCm).
2. ggml_backend_cuda_device_event_new() creates its event with plain
cudaEventCreate, which ggml's HIP shim does not alias (it only aliases
cudaEventCreateWithFlags). Use cudaEventCreateWithFlags(...,
cudaEventDisableTiming) -- exactly what the rest of this file already
does (cf. lines ~1034, ~3461) and HIP-safe.
CUDA builds are unaffected. Drop the relevant hunk once the fork HIP-ports
these; apply-patches.sh fails fast if an anchor goes stale.
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 0427e6b..6352e6a 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -1933,6 +1933,7 @@ static cudaError_t ggml_cuda_copy2d_across_devices(
size_t width, size_t height, cudaStream_t dst_stream, cudaStream_t src_stream) {
const auto & info = ggml_cuda_info();
+#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) // 3D-peer copy types unmapped by ggml's HIP/MUSA shim; use staging fallback below
if (info.peer_access[src_device][dst_device]) {
cudaMemcpy3DPeerParms p = {};
p.dstDevice = dst_device;
@@ -1942,6 +1943,7 @@ static cudaError_t ggml_cuda_copy2d_across_devices(
p.extent = make_cudaExtent(width, height, 1);
return cudaMemcpy3DPeerAsync(&p, dst_stream);
}
+#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
// Fallback: stage all rows through a single contiguous pinned buffer
int prev_device = ggml_cuda_get_device();
@@ -5714,7 +5716,7 @@ static ggml_backend_event_t ggml_backend_cuda_device_event_new(ggml_backend_dev_
ggml_cuda_set_device(dev_ctx->device);
cudaEvent_t event;
- CUDA_CHECK(cudaEventCreate(&event));
+ CUDA_CHECK(cudaEventCreateWithFlags(&event, cudaEventDisableTiming));
return new ggml_backend_event {
/* .device = */ dev,


@@ -14,7 +14,7 @@ target_include_directories(gocrispasr PRIVATE
# whisper. crispasr is the referencer; the backend static libs supply the
# per-architecture symbols; ggml is the math/runtime base.
target_link_libraries(gocrispasr PRIVATE
crispasr
crispasr-lib
parakeet canary canary_ctc cohere granite_speech granite_nle
voxtral voxtral4b qwen3_asr qwen3_tts orpheus chatterbox indextts
kokoro voxcpm2_tts m2m100 t5_translate wav2vec2-ggml vibevoice


@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# CrispASR version (release tag)
CRISPASR_REPO?=https://github.com/CrispStrobe/CrispASR
CRISPASR_VERSION?=13d54e110e1538e0f0bc3af0680b9ab246cfb48d
CRISPASR_VERSION?=d745bda4386ae0f9d1d2f23fff8ec95d76428221
SO_TARGET?=libgocrispasr.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF


@@ -11,6 +11,7 @@ import (
"github.com/go-audio/audio"
"github.com/go-audio/wav"
gguf "github.com/gpustack/gguf-parser-go"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/utils"
@@ -37,6 +38,39 @@ var (
type CrispASR struct {
base.SingleThread
// sampleRate is the output rate (Hz) of the loaded TTS engine's PCM, used to
// write a correct WAV header. Most CrispASR TTS backends emit 24 kHz, but
// piper returns its model's native rate (16 kHz for x_low/low voices,
// 22.05 kHz for medium/high), so it is read from the GGUF metadata at Load.
sampleRate int
}
// defaultTTSSampleRate is the output rate assumed for CrispASR TTS engines that
// don't advertise one in GGUF metadata (vibevoice/orpheus/chatterbox/qwen3-tts
// all emit 24 kHz). piper is the exception and carries piper.sample_rate.
const defaultTTSSampleRate = 24000
// piperSampleRate reads the piper.sample_rate metadata key from a GGUF model.
// CrispASR's piper backend returns PCM at the model's native rate without
// resampling, so the WAV header must match it. Returns ok=false for non-piper
// models (key absent) or an unreadable file, letting the caller fall back to
// defaultTTSSampleRate.
func piperSampleRate(modelPath string) (int, bool) {
// Only scalar architecture keys are read, so skip the large array metadata
// (phoneme map) and mmap the header - same rationale as pkg/vram's reader.
f, err := gguf.ParseGGUFFile(modelPath, gguf.UseMMap(), gguf.SkipLargeMetadata())
if err != nil {
return 0, false
}
kv, ok := f.Header.MetadataKV.Get("piper.sample_rate")
if !ok || kv.ValueType != gguf.GGUFMetadataValueTypeUint32 {
return 0, false
}
rate := int(kv.ValueUint32())
if rate <= 0 {
return 0, false
}
return rate, true
}
// splitOption splits a "prefix:value" model option into its key and value,
@@ -103,6 +137,14 @@ func (w *CrispASR) Load(opts *pb.ModelOptions) error {
return fmt.Errorf("Failed to load CrispASR transcription model")
}
// Determine the TTS output sample rate for the WAV header. piper voices
// carry their native rate in GGUF metadata and CrispASR does not resample;
// every other engine emits the 24 kHz default.
w.sampleRate = defaultTTSSampleRate
if rate, ok := piperSampleRate(opts.ModelFile); ok {
w.sampleRate = rate
}
// Load the companion file (codec/tokenizer/s3gen) after the session is open.
// rc==0 means success or "not applicable" for the active backend; only a
// negative code is fatal.
@@ -390,7 +432,7 @@ func (w *CrispASR) synthesize(text string) ([]float32, error) {
}
defer CppTTSFree(ptr)
src := unsafe.Slice((*float32)(unsafe.Pointer(ptr)), int(n)) //nolint:govet // ptr addresses C-allocated PCM returned across the purego boundary; copied out immediately below, before tts_free.
out := make([]float32, int(n)) // copy out of C memory before free
out := make([]float32, int(n)) // copy out of C memory before free
copy(out, src)
return out, nil
}
@@ -417,7 +459,7 @@ func (w *CrispASR) TTS(req *pb.TTSRequest) error {
if err != nil {
return err
}
return writeWAV24k(req.Dst, pcm)
return writeWAV(req.Dst, pcm, w.sampleRate)
}
// TTSStream is the streaming counterpart to TTS. CrispASR has no progressive
@@ -447,7 +489,7 @@ func (w *CrispASR) TTSStream(req *pb.TTSRequest, results chan []byte) error {
}
defer func() { _ = os.Remove(dst) }()
if err := writeWAV24k(dst, pcm); err != nil {
if err := writeWAV(dst, pcm, w.sampleRate); err != nil {
return err
}
@@ -459,14 +501,14 @@ func (w *CrispASR) TTSStream(req *pb.TTSRequest, results chan []byte) error {
return nil
}
// writeWAV24k writes pcm as a 24000 Hz, mono, 16-bit PCM WAV at dst.
func writeWAV24k(dst string, pcm []float32) error {
// writeWAV writes pcm as a sampleRate Hz, mono, 16-bit PCM WAV at dst.
func writeWAV(dst string, pcm []float32, sampleRate int) error {
f, err := os.Create(dst)
if err != nil {
return fmt.Errorf("crispasr: create %q: %w", dst, err)
}
enc := wav.NewEncoder(f, 24000, 16, 1, 1)
enc := wav.NewEncoder(f, sampleRate, 16, 1, 1)
ints := make([]int, len(pcm))
for i, s := range pcm {
if s > 1 {
@@ -477,7 +519,7 @@ func writeWAV24k(dst string, pcm []float32) error {
ints[i] = int(s * 32767)
}
buf := &audio.IntBuffer{
Format: &audio.Format{NumChannels: 1, SampleRate: 24000},
Format: &audio.Format{NumChannels: 1, SampleRate: sampleRate},
Data: ints,
SourceBitDepth: 16,
}
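
The clamp-and-scale step and the header sample rate that this hunk parameterizes can be sketched without any audio library. The following is a minimal illustration only: `floatToPCM16`, `encodeWAV` and `headerSampleRate` are hypothetical names that do not exist in the backend, which uses go-audio/wav for the real encoding. It assumes the canonical 44-byte PCM WAV header, where the sample rate sits at byte offset 24.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// floatToPCM16 clamps each sample to [-1, 1] and scales it to signed 16-bit.
func floatToPCM16(pcm []float32) []int16 {
	out := make([]int16, len(pcm))
	for i, s := range pcm {
		if s > 1 {
			s = 1
		} else if s < -1 {
			s = -1
		}
		out[i] = int16(s * 32767)
	}
	return out
}

// encodeWAV builds a mono, 16-bit PCM WAV in memory with a 44-byte header.
// Hypothetical helper for illustration; not the backend's writeWAV.
func encodeWAV(pcm []float32, sampleRate int) []byte {
	samples := floatToPCM16(pcm)
	dataLen := uint32(len(samples) * 2)
	var b bytes.Buffer
	b.WriteString("RIFF")
	_ = binary.Write(&b, binary.LittleEndian, 36+dataLen) // RIFF chunk size
	b.WriteString("WAVEfmt ")
	_ = binary.Write(&b, binary.LittleEndian, uint32(16))           // fmt chunk size
	_ = binary.Write(&b, binary.LittleEndian, uint16(1))            // format: PCM
	_ = binary.Write(&b, binary.LittleEndian, uint16(1))            // channels: mono
	_ = binary.Write(&b, binary.LittleEndian, uint32(sampleRate))   // sample rate
	_ = binary.Write(&b, binary.LittleEndian, uint32(sampleRate*2)) // byte rate
	_ = binary.Write(&b, binary.LittleEndian, uint16(2))            // block align
	_ = binary.Write(&b, binary.LittleEndian, uint16(16))           // bits per sample
	b.WriteString("data")
	_ = binary.Write(&b, binary.LittleEndian, dataLen)
	_ = binary.Write(&b, binary.LittleEndian, samples)
	return b.Bytes()
}

// headerSampleRate reads the sample-rate field at byte offset 24.
func headerSampleRate(wav []byte) int {
	return int(binary.LittleEndian.Uint32(wav[24:28]))
}

func main() {
	wav := encodeWAV(make([]float32, 220), 22050)
	fmt.Println(len(wav), headerSampleRate(wav)) // 484 22050
}
```

If the header rate and the PCM's true rate disagree, players reinterpret the same samples at the wrong speed, which is why the diff threads the model's native rate through to the encoder.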


@@ -0,0 +1,164 @@
package main
import (
"bytes"
"encoding/binary"
"os"
"path/filepath"
"github.com/go-audio/wav"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// GGUF metadata value type tags (subset) from the GGUF spec.
const (
ggufTypeUint32 uint32 = 4
ggufTypeString uint32 = 8
)
type ggufKV struct {
key string
vtype uint32
val any
}
// writeMinimalGGUF emits a valid, tensor-less GGUF file carrying only the given
// metadata key-values. Enough for the header-only parse path piperSampleRate
// uses; avoids pulling a real multi-MB voice into the test.
func writeMinimalGGUF(path string, kvs []ggufKV) error {
var b bytes.Buffer
b.WriteString("GGUF") // magic
_ = binary.Write(&b, binary.LittleEndian, uint32(3)) // version
_ = binary.Write(&b, binary.LittleEndian, uint64(0)) // tensor count
_ = binary.Write(&b, binary.LittleEndian, uint64(len(kvs)))
for _, kv := range kvs {
_ = binary.Write(&b, binary.LittleEndian, uint64(len(kv.key)))
b.WriteString(kv.key)
_ = binary.Write(&b, binary.LittleEndian, kv.vtype)
switch v := kv.val.(type) {
case uint32:
_ = binary.Write(&b, binary.LittleEndian, v)
case string:
_ = binary.Write(&b, binary.LittleEndian, uint64(len(v)))
b.WriteString(v)
}
}
return os.WriteFile(path, b.Bytes(), 0o644)
}
// wavSampleRate decodes the WAV header at path and returns its sample rate.
func wavSampleRate(path string) (int, error) {
f, err := os.Open(path)
if err != nil {
return 0, err
}
defer func() { _ = f.Close() }()
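dec := wav.NewDecoder(f)
dec.ReadInfo()
// ReadInfo does not return an error; surface a header parse failure via Err.
if err := dec.Err(); err != nil {
return 0, err
}
return int(dec.SampleRate), nil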
}
var _ = Describe("piper sample rate", func() {
Context("piperSampleRate", func() {
It("reads piper.sample_rate from a piper GGUF (medium = 22050)", func() {
p := filepath.Join(GinkgoT().TempDir(), "voice.gguf")
Expect(writeMinimalGGUF(p, []ggufKV{
{key: "general.architecture", vtype: ggufTypeString, val: "piper"},
{key: "piper.sample_rate", vtype: ggufTypeUint32, val: uint32(22050)},
})).To(Succeed())
rate, ok := piperSampleRate(p)
Expect(ok).To(BeTrue(), "piper.sample_rate should be found")
Expect(rate).To(Equal(22050))
})
It("reads the low-quality rate (16000)", func() {
p := filepath.Join(GinkgoT().TempDir(), "voice.gguf")
Expect(writeMinimalGGUF(p, []ggufKV{
{key: "piper.sample_rate", vtype: ggufTypeUint32, val: uint32(16000)},
})).To(Succeed())
rate, ok := piperSampleRate(p)
Expect(ok).To(BeTrue())
Expect(rate).To(Equal(16000))
})
It("returns ok=false for a non-piper GGUF (no piper.sample_rate key)", func() {
p := filepath.Join(GinkgoT().TempDir(), "other.gguf")
Expect(writeMinimalGGUF(p, []ggufKV{
{key: "general.architecture", vtype: ggufTypeString, val: "vibevoice"},
})).To(Succeed())
_, ok := piperSampleRate(p)
Expect(ok).To(BeFalse())
})
It("returns ok=false for an unreadable/non-GGUF file", func() {
p := filepath.Join(GinkgoT().TempDir(), "garbage.gguf")
Expect(os.WriteFile(p, []byte("not a gguf"), 0o644)).To(Succeed())
_, ok := piperSampleRate(p)
Expect(ok).To(BeFalse())
})
})
// End-to-end through the built .so. Gated on CRISPASR_PIPER_MODEL_PATH (a
// real piper voice GGUF) like the other model-backed specs; never runs in
// default CI. Proves CrispASR's piper backend output rate flows into the
// WAV header instead of the hardcoded 24 kHz default.
Context("piper TTS end-to-end", func() {
It("writes the WAV at the model's native piper.sample_rate", func() {
model := os.Getenv("CRISPASR_PIPER_MODEL_PATH")
if model == "" {
Skip("set CRISPASR_PIPER_MODEL_PATH to run the piper e2e spec")
}
ensureLibLoaded()
expected, ok := piperSampleRate(model)
Expect(ok).To(BeTrue(), "model should carry piper.sample_rate metadata")
w := &CrispASR{}
Expect(w.Load(&pb.ModelOptions{
ModelFile: model,
Options: []string{"backend:piper"},
Threads: 4,
})).To(Succeed())
dst := filepath.Join(GinkgoT().TempDir(), "piper.wav")
Expect(w.TTS(&pb.TTSRequest{Text: "Hello from CrispASR piper.", Dst: dst})).To(Succeed())
info, err := os.Stat(dst)
Expect(err).ToNot(HaveOccurred())
Expect(info.Size()).To(BeNumerically(">", 1024), "expected a non-trivial WAV")
rate, err := wavSampleRate(dst)
Expect(err).ToNot(HaveOccurred())
Expect(rate).To(Equal(expected),
"WAV header rate must equal the model's native piper.sample_rate, not the 24k default")
})
})
Context("writeWAV", func() {
It("writes the WAV header at the given sample rate (22050 for piper, not the 24k default)", func() {
dst := filepath.Join(GinkgoT().TempDir(), "out.wav")
pcm := make([]float32, 220) // 10 ms of silence is enough for a header
Expect(writeWAV(dst, pcm, 22050)).To(Succeed())
rate, err := wavSampleRate(dst)
Expect(err).ToNot(HaveOccurred())
Expect(rate).To(Equal(22050))
})
It("writes a 16000 Hz header for low-quality piper voices", func() {
dst := filepath.Join(GinkgoT().TempDir(), "out.wav")
pcm := make([]float32, 160)
Expect(writeWAV(dst, pcm, 16000)).To(Succeed())
rate, err := wavSampleRate(dst)
Expect(err).ToNot(HaveOccurred())
Expect(rate).To(Equal(16000))
})
})
})
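
The header layout that writeMinimalGGUF emits can be read back with the standard library alone, which makes the byte order of the format explicit. This is a hedged sketch, not how piperSampleRate works (that goes through the gguf parser package): `lookupUint32`, `readString` and `sample` are hypothetical names, only the uint32 and string value types are handled, and the 1 MiB string cap is an arbitrary choice for the example.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

const (
	typeUint32 uint32 = 4
	typeString uint32 = 8
)

// readString reads a GGUF string: uint64 length followed by raw bytes.
func readString(r io.Reader) (string, error) {
	var n uint64
	if err := binary.Read(r, binary.LittleEndian, &n); err != nil || n > 1<<20 {
		return "", errors.New("gguf: bad string length")
	}
	buf := make([]byte, n)
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

// lookupUint32 scans the metadata for key; ok is false if absent or not uint32.
func lookupUint32(data []byte, key string) (uint32, bool, error) {
	r := bytes.NewReader(data)
	magic := make([]byte, 4)
	if _, err := io.ReadFull(r, magic); err != nil || string(magic) != "GGUF" {
		return 0, false, errors.New("gguf: bad magic")
	}
	var hdr struct {
		Version uint32
		Tensors uint64
		KVCount uint64
	}
	if err := binary.Read(r, binary.LittleEndian, &hdr); err != nil {
		return 0, false, err
	}
	for i := uint64(0); i < hdr.KVCount; i++ {
		k, err := readString(r)
		if err != nil {
			return 0, false, err
		}
		var vtype uint32
		if err := binary.Read(r, binary.LittleEndian, &vtype); err != nil {
			return 0, false, err
		}
		switch vtype {
		case typeUint32:
			var v uint32
			if err := binary.Read(r, binary.LittleEndian, &v); err != nil {
				return 0, false, err
			}
			if k == key {
				return v, true, nil
			}
		case typeString:
			if _, err := readString(r); err != nil {
				return 0, false, err
			}
			if k == key {
				return 0, false, nil // present, but not a uint32
			}
		default:
			return 0, false, fmt.Errorf("gguf: unsupported value type %d", vtype)
		}
	}
	return 0, false, nil
}

// sample builds a tiny tensor-less GGUF v3 header with two key-values.
func sample() []byte {
	var b bytes.Buffer
	putStr := func(s string) {
		_ = binary.Write(&b, binary.LittleEndian, uint64(len(s)))
		b.WriteString(s)
	}
	b.WriteString("GGUF")
	_ = binary.Write(&b, binary.LittleEndian, struct {
		V    uint32
		T, N uint64
	}{3, 0, 2}) // version, tensor count, KV count
	putStr("general.architecture")
	_ = binary.Write(&b, binary.LittleEndian, typeString)
	putStr("piper")
	putStr("piper.sample_rate")
	_ = binary.Write(&b, binary.LittleEndian, typeUint32)
	_ = binary.Write(&b, binary.LittleEndian, uint32(22050))
	return b.Bytes()
}

func main() {
	rate, ok, err := lookupUint32(sample(), "piper.sample_rate")
	fmt.Println(rate, ok, err) // 22050 true <nil>
}
```

The three outcomes (found, absent or wrong type, unparseable) line up with the cases the specs above exercise against piperSampleRate.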


@@ -51,6 +51,32 @@ else
exit 1
fi
# Bundle espeak-ng (+ its libpcaudio/libsonic runtime deps) and its voice data so
# the piper TTS backend can phonemize non-English text. CrispASR dlopens
# libespeak-ng.so.1 at runtime (the MIT-clean path); that dlopen fails unless
# libpcaudio and libsonic can also be resolved, so all three .so files are
# required. run.sh points CRISPASR_ESPEAK_DATA_PATH at the bundled data dir.
# Best-effort: only copied when present, so a local dev build without espeak-ng
# installed still packages the rest (English voices keep working).
ESPEAK_LIBDIR=""
for d in /usr/lib/x86_64-linux-gnu /usr/lib/aarch64-linux-gnu; do
if [ -f "$d/libespeak-ng.so.1" ]; then
ESPEAK_LIBDIR="$d"
break
fi
done
if [ -n "$ESPEAK_LIBDIR" ]; then
echo "Bundling espeak-ng from $ESPEAK_LIBDIR ..."
cp -arfLv "$ESPEAK_LIBDIR/libespeak-ng.so.1" $CURDIR/package/lib/
cp -arfLv "$ESPEAK_LIBDIR/libpcaudio.so.0" $CURDIR/package/lib/
cp -arfLv "$ESPEAK_LIBDIR/libsonic.so.0" $CURDIR/package/lib/
if [ -d "$ESPEAK_LIBDIR/espeak-ng-data" ]; then
cp -arfLv "$ESPEAK_LIBDIR/espeak-ng-data" $CURDIR/package/
fi
else
echo "espeak-ng not found; non-English piper voices will not phonemize"
fi
# Package GPU libraries based on BUILD_TYPE
# The GPU library packaging script will detect BUILD_TYPE and copy appropriate GPU libraries
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"


@@ -41,6 +41,11 @@ fi
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
export CRISPASR_LIBRARY=$LIBRARY
# Point piper's espeak-ng phonemizer at the bundled voice data. The variable
# names the directory CONTAINING espeak-ng-data (package.sh drops it next to
# this script). Harmless when espeak-ng wasn't bundled.
export CRISPASR_ESPEAK_DATA_PATH=$CURDIR
# If there is a lib/ld.so, use it
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"


@@ -0,0 +1,7 @@
sources/
build*/
package/
liblocateanythingcpp*.so
locate-anything-cpp
test-models/
test-data/


@@ -0,0 +1,57 @@
cmake_minimum_required(VERSION 3.18)
project(liblocateanythingcpp LANGUAGES C CXX)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
# Static-link ggml + locate_anything so the resulting .so has no runtime
# dependency on extra ggml/locate_anything shared libraries — only on
# libc/libstdc++/libgomp, which the LocalAI package step bundles into the
# docker image.
set(BUILD_SHARED_LIBS OFF CACHE BOOL "Build static libraries" FORCE)
# locate-anything.cpp build switches: skip CLI/tests, keep static lib.
set(LA_BUILD_CLI OFF CACHE BOOL "Disable locate-anything CLI" FORCE)
set(LA_BUILD_TESTS OFF CACHE BOOL "Disable locate-anything tests" FORCE)
set(LA_SHARED OFF CACHE BOOL "Build locate_anything as static lib" FORCE)
# Unlike rt-detr.cpp, locate-anything.cpp ships no in-tree ggml patches, so
# there is no apply_ggml_patches.sh hook to shim here.
add_subdirectory(./sources/locate-anything.cpp)
# locate-anything.cpp's top-level CMakeLists points its own target's include
# dirs at ${CMAKE_SOURCE_DIR}/{include,src,third_party,...}. CMAKE_SOURCE_DIR
# is the *top-level* source dir of the whole CMake tree, so when we pull it in
# via add_subdirectory it resolves to OUR directory, not theirs, and the
# locate_anything target fails to find its own headers (la_capi.h, stb_image.h,
# la_gguf_keys.h). Re-add the correct, subdir-relative include paths to the
# already-defined target so it compiles regardless of where it's nested.
set(LA_SRC ${CMAKE_CURRENT_SOURCE_DIR}/sources/locate-anything.cpp)
target_include_directories(locate_anything PRIVATE
${LA_SRC}/include
${LA_SRC}/src
${LA_SRC}/third_party
${LA_SRC}/third_party/stb)
# locate-anything.cpp's C-API symbols already live inside liblocate_anything
# (src/la_capi.cpp is compiled into the lib). We re-export them via a MODULE
# library that links locate_anything so the symbols are visible at dlopen time.
add_library(locateanythingcpp MODULE
sources/locate-anything.cpp/src/la_capi.cpp)
target_include_directories(locateanythingcpp PRIVATE
sources/locate-anything.cpp/include
sources/locate-anything.cpp/src
sources/locate-anything.cpp/third_party
sources/locate-anything.cpp/third_party/stb
)
target_link_libraries(locateanythingcpp PRIVATE locate_anything ggml)
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)
target_link_libraries(locateanythingcpp PRIVATE stdc++fs)
endif()
set_property(TARGET locateanythingcpp PROPERTY CXX_STANDARD 17)
set_target_properties(locateanythingcpp PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})


@@ -0,0 +1,134 @@
CMAKE_ARGS?=
BUILD_TYPE?=
NATIVE?=false
GOCMD?=go
GO_TAGS?=
JOBS?=$(shell nproc --ignore=1)
# locate-anything.cpp. Pinned to a specific commit for a reproducible build;
# override LOCATEANYTHING_VERSION with `master` to always pick up the latest
# C-API surface (incl. the per-detection accessor functions used by
# golocateanythingcpp.go).
LOCATEANYTHING_REPO?=https://github.com/mudler/locate-anything.cpp.git
LOCATEANYTHING_VERSION?=92c1682da792c1e8a5dec91acc2be4b02c742ded
ifeq ($(NATIVE),false)
CMAKE_ARGS+=-DGGML_NATIVE=OFF
endif
# Forward LocalAI's BUILD_TYPE to the matching ggml backend switch.
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS+=-DGGML_CUDA=ON -DLA_GGML_CUDA=ON
else ifeq ($(BUILD_TYPE),openblas)
CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
else ifeq ($(BUILD_TYPE),clblas)
CMAKE_ARGS+=-DGGML_CLBLAST=ON
else ifeq ($(BUILD_TYPE),hipblas)
ROCM_HOME ?= /opt/rocm
ROCM_PATH ?= /opt/rocm
export CXX=$(ROCM_HOME)/llvm/bin/clang++
export CC=$(ROCM_HOME)/llvm/bin/clang
AMDGPU_TARGETS?=gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1200,gfx1201
CMAKE_ARGS+=-DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=ON -DLA_GGML_VULKAN=ON
else ifeq ($(OS),Darwin)
ifneq ($(BUILD_TYPE),metal)
CMAKE_ARGS+=-DGGML_METAL=OFF
else
CMAKE_ARGS+=-DGGML_METAL=ON
CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
CMAKE_ARGS+=-DLA_GGML_METAL=ON
endif
endif
ifeq ($(BUILD_TYPE),sycl_f16)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx \
-DGGML_SYCL_F16=ON
endif
ifeq ($(BUILD_TYPE),sycl_f32)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx
endif
sources/locate-anything.cpp:
mkdir -p sources && \
git clone --recursive $(LOCATEANYTHING_REPO) sources/locate-anything.cpp && \
cd sources/locate-anything.cpp && \
git checkout $(LOCATEANYTHING_VERSION) && \
git submodule update --init --recursive --depth 1 --single-branch
# Detect OS
UNAME_S := $(shell uname -s)
# Only build CPU variants on Linux
ifeq ($(UNAME_S),Linux)
VARIANT_TARGETS = liblocateanythingcpp-avx.so liblocateanythingcpp-avx2.so liblocateanythingcpp-avx512.so liblocateanythingcpp-fallback.so
else
# On non-Linux (e.g., Darwin), build only fallback variant
VARIANT_TARGETS = liblocateanythingcpp-fallback.so
endif
locate-anything-cpp: main.go golocateanythingcpp.go $(VARIANT_TARGETS)
CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o locate-anything-cpp ./
package: locate-anything-cpp
bash package.sh
build: package
clean: purge
rm -rf liblocateanythingcpp*.so locate-anything-cpp package sources
purge:
rm -rf build*
# Build all variants (Linux only)
ifeq ($(UNAME_S),Linux)
liblocateanythingcpp-avx.so: sources/locate-anything.cpp
rm -rfv build-$@
$(info ${GREEN}I locate-anything-cpp build info:avx${RESET})
SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) liblocateanythingcpp-custom
rm -rfv build-$@
liblocateanythingcpp-avx2.so: sources/locate-anything.cpp
rm -rfv build-$@
$(info ${GREEN}I locate-anything-cpp build info:avx2${RESET})
SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) liblocateanythingcpp-custom
rm -rfv build-$@
liblocateanythingcpp-avx512.so: sources/locate-anything.cpp
rm -rfv build-$@
$(info ${GREEN}I locate-anything-cpp build info:avx512${RESET})
SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) liblocateanythingcpp-custom
rm -rfv build-$@
endif
# Build fallback variant (all platforms)
liblocateanythingcpp-fallback.so: sources/locate-anything.cpp
rm -rfv build-$@
$(info ${GREEN}I locate-anything-cpp build info:fallback${RESET})
SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) liblocateanythingcpp-custom
rm -rfv build-$@
liblocateanythingcpp-custom: CMakeLists.txt
mkdir -p build-$(SO_TARGET) && \
cd build-$(SO_TARGET) && \
cmake .. $(CMAKE_ARGS) && \
cmake --build . --config Release -j$(JOBS) && \
cd .. && \
mv build-$(SO_TARGET)/liblocateanythingcpp.so ./$(SO_TARGET)
all: locate-anything-cpp package
# `test` is invoked by the top-level Makefile's `test-extra` target. It builds
# the backend binary + the fallback shared library (needed for dlopen at
# runtime), then runs test.sh which downloads the q8_0 GGUF + COCO image and
# exercises the gRPC Load/Detect wire path via the Go smoke test in
# main_test.go.
test: locate-anything-cpp liblocateanythingcpp-fallback.so
bash test.sh


@@ -0,0 +1,174 @@
package main
// golocateanythingcpp.go - gRPC handlers (Load, Detect) for the
// locate-anything-cpp backend.
//
// Embeds base.SingleThread to default unimplemented RPCs to "not supported"
// while we only implement open-vocabulary object detection (Detect).
import (
"encoding/base64"
"fmt"
"os"
"path/filepath"
"unsafe"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
)
// la_ctx* is an opaque handle. la_capi_load returns it directly (0 == failure),
// unlike rfdetr's out-parameter convention.
var (
// la_capi_load(const char* gguf_path, int n_threads) -> la_ctx* (0 = fail)
CapiLoad func(gguf string, nThreads int32) uintptr
// la_capi_free(la_ctx* ctx)
CapiFree func(handle uintptr)
// la_capi_locate_path(ctx, image_path, prompt, mode) -> char* json (0 = err)
CapiLocatePath func(handle uintptr, imagePath string, prompt string, mode int32) uintptr
// la_capi_locate_buffer(ctx, bytes, len, prompt, mode) -> char* json (0 = err)
CapiLocateBuffer func(handle uintptr, bytes uintptr, length uintptr, prompt string, mode int32) uintptr
// la_capi_get_n_detections(ctx) -> int
CapiGetNDetections func(handle uintptr) int32
// la_capi_get_detection_box(ctx, i, out_xyxy[4]) -> int (0 on success)
CapiGetDetectionBox func(handle uintptr, i int32, outXYXY uintptr) int32
// la_capi_get_detection_label(ctx, i, buf, buf_size) -> int (required size incl NUL; two-call sizing)
CapiGetDetectionLabel func(handle uintptr, i int32, buf uintptr, bufSize int32) int32
// la_capi_free_string(char* s)
CapiFreeString func(s uintptr)
// la_capi_last_error(ctx) -> const char* (owned by ctx, "" if none / null ctx).
// purego marshals the returned C string into a Go string (a copy), so we
// never free it and avoid raw pointer arithmetic.
CapiLastError func(handle uintptr) string
)
type LocateAnythingCpp struct {
base.SingleThread
handle uintptr
}
// Load loads the GGUF model at opts.ModelFile (joined with opts.ModelPath if
// relative) and stores the la_ctx handle for later Detect calls.
func (r *LocateAnythingCpp) Load(opts *pb.ModelOptions) error {
modelFile := opts.ModelFile
if modelFile == "" {
modelFile = opts.Model
}
if modelFile == "" {
return fmt.Errorf("locate-anything-cpp: ModelFile is empty")
}
var modelPath string
if filepath.IsAbs(modelFile) {
modelPath = modelFile
} else {
modelPath = filepath.Join(opts.ModelPath, modelFile)
}
if _, err := os.Stat(modelPath); err != nil {
return fmt.Errorf("locate-anything-cpp: model file not found: %s: %w", modelPath, err)
}
threads := opts.Threads
if threads <= 0 {
threads = 4
}
// Release previous model if any (re-Load).
if r.handle != 0 {
CapiFree(r.handle)
r.handle = 0
}
h := CapiLoad(modelPath, threads)
if h == 0 {
// la_capi_last_error needs a ctx; on a failed load we have none (it
// returns "" for a null ctx), so the text is best-effort. Surface it
// when present.
if msg := CapiLastError(0); msg != "" {
return fmt.Errorf("locate-anything-cpp: la_capi_load failed for %s: %s", modelPath, msg)
}
return fmt.Errorf("locate-anything-cpp: la_capi_load failed for %s", modelPath)
}
r.handle = h
return nil
}
// Detect runs open-vocabulary detection on the base64-encoded image in opts.Src
// using the required text prompt in opts.Prompt, returning one pb.Detection per
// located object with its predicted label as ClassName.
func (r *LocateAnythingCpp) Detect(opts *pb.DetectOptions) (pb.DetectResponse, error) {
if r.handle == 0 {
return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: model not loaded")
}
// Open-vocabulary detection is prompt-driven; without a prompt there is
// nothing to locate.
prompt := opts.Prompt
if prompt == "" {
return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: a text prompt is required (open-vocabulary detection)")
}
// Decode base64 image and write to temp file.
imgData, err := base64.StdEncoding.DecodeString(opts.Src)
if err != nil {
return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: failed to decode base64 image: %w", err)
}
tmpFile, err := os.CreateTemp("", "locate-anything-*.img")
if err != nil {
return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: failed to create temp file: %w", err)
}
defer func() { _ = os.Remove(tmpFile.Name()) }()
if _, err := tmpFile.Write(imgData); err != nil {
_ = tmpFile.Close()
return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: failed to write temp file: %w", err)
}
if err := tmpFile.Close(); err != nil {
return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: failed to close temp file: %w", err)
}
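// mode 0 = hybrid (Parallel Box Decoding). The JSON return value is unused:
// structured detections are read via the accessor functions. A null return
// signals an error; otherwise the returned string must still be freed.
jsonPtr := CapiLocatePath(r.handle, tmpFile.Name(), prompt, 0)
if jsonPtr == 0 {
return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: la_capi_locate_path failed: %s", CapiLastError(r.handle))
}
CapiFreeString(jsonPtr)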
n := CapiGetNDetections(r.handle)
if n < 0 {
return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: invalid n_detections=%d", n)
}
detections := make([]*pb.Detection, 0, n)
for i := int32(0); i < n; i++ {
var xyxy [4]float32 // x1, y1, x2, y2
if CapiGetDetectionBox(r.handle, i, uintptr(unsafe.Pointer(&xyxy[0]))) != 0 {
continue
}
// Two-call sizing for the label string.
label := ""
need := CapiGetDetectionLabel(r.handle, i, 0, 0)
if need > 0 {
buf := make([]byte, need)
CapiGetDetectionLabel(r.handle, i, uintptr(unsafe.Pointer(&buf[0])), need)
label = string(buf[:need-1])
}
detections = append(detections, &pb.Detection{
X: xyxy[0],
Y: xyxy[1],
Width: xyxy[2] - xyxy[0],
Height: xyxy[3] - xyxy[1],
Confidence: 1.0,
ClassName: label,
})
}
return pb.DetectResponse{
Detections: detections,
}, nil
}
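
The two-call sizing used for the label and the xyxy-to-width/height conversion in Detect can be shown in isolation. This is a sketch under assumptions: `fakeGetLabel` is a hypothetical pure-Go stand-in for a C accessor that returns the required size including the trailing NUL, and `readLabel` and `xyxyToXYWH` are illustrative names that are not part of the backend.

```go
package main

import "fmt"

// fakeGetLabel is a hypothetical stand-in for a C accessor with two-call
// sizing: it always returns the required size (including the trailing NUL)
// and only writes when buf is large enough.
func fakeGetLabel(label string, buf []byte) int {
	need := len(label) + 1
	if len(buf) >= need {
		copy(buf, label)
		buf[len(label)] = 0
	}
	return need
}

// readLabel performs the two calls: a size query, then the fill.
func readLabel(get func(buf []byte) int) string {
	need := get(nil)
	if need <= 1 {
		return "" // nothing but the NUL terminator
	}
	buf := make([]byte, need)
	if get(buf) != need {
		return "" // size changed between calls; treat as unreadable
	}
	return string(buf[:need-1]) // drop the trailing NUL
}

// xyxyToXYWH converts corner coordinates to an origin plus a size.
func xyxyToXYWH(b [4]float32) (x, y, w, h float32) {
	return b[0], b[1], b[2] - b[0], b[3] - b[1]
}

func main() {
	label := readLabel(func(buf []byte) int { return fakeGetLabel("person", buf) })
	x, y, w, h := xyxyToXYWH([4]float32{10, 20, 110, 70})
	fmt.Println(label, x, y, w, h) // person 10 20 100 50
}
```

Checking the second call's return value is an extra guard this sketch adds; the handler above ignores it.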


@@ -0,0 +1,59 @@
package main
// main.go - entry point for the locate-anything-cpp gRPC backend.
//
// Dlopens liblocateanythingcpp-<variant>.so via purego at the path in
// LOCATEANYTHING_LIBRARY (set by run.sh based on /proc/cpuinfo), registers
// the la_capi_* C ABI symbols, then starts the gRPC server.
import (
"flag"
"os"
"github.com/ebitengine/purego"
grpc "github.com/mudler/LocalAI/pkg/grpc"
)
var (
addr = flag.String("addr", "localhost:50051", "the address to listen on")
)
type LibFuncs struct {
FuncPtr any
Name string
}
func main() {
// Get library name from environment variable, default to fallback
libName := os.Getenv("LOCATEANYTHING_LIBRARY")
if libName == "" {
libName = "./liblocateanythingcpp-fallback.so"
}
lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
panic(err)
}
libFuncs := []LibFuncs{
{&CapiLoad, "la_capi_load"},
{&CapiFree, "la_capi_free"},
{&CapiLocatePath, "la_capi_locate_path"},
{&CapiLocateBuffer, "la_capi_locate_buffer"},
{&CapiGetNDetections, "la_capi_get_n_detections"},
{&CapiGetDetectionBox, "la_capi_get_detection_box"},
{&CapiGetDetectionLabel, "la_capi_get_detection_label"},
{&CapiFreeString, "la_capi_free_string"},
{&CapiLastError, "la_capi_last_error"},
}
for _, lf := range libFuncs {
purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
}
flag.Parse()
if err := grpc.StartServer(*addr, &LocateAnythingCpp{}); err != nil {
panic(err)
}
}


@@ -0,0 +1,176 @@
package main
// main_test.go - end-to-end smoke test for the locate-anything-cpp gRPC backend.
//
// Spawns the compiled locate-anything-cpp binary on a free local port, dials it
// via gRPC, and exercises LoadModel + Detect against the test fixtures
// downloaded by test.sh: the q8_0 GGUF of nvidia/LocateAnything-3B and a real
// COCO image with people + cars. Asserts that open-vocabulary detection driven
// by a text prompt returns at least one detection, each carrying a non-empty
// class name and a bounding box of non-zero size.
//
// The spec Skip()s cleanly if its fixtures (the ~6.3 GB model, the test image,
// the built binary, or the fallback .so) are missing, so the test target stays
// usable on a fresh checkout / on CI runners where the large model hasn't been
// downloaded.
import (
"context"
"encoding/base64"
"fmt"
"net"
"os"
"os/exec"
"path/filepath"
"testing"
"time"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
)
func TestDetect(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "locate-anything-cpp backend smoke suite")
}
// freePort grabs an ephemeral TCP port and immediately releases it so the
// spawned backend can bind to it. There is a tiny TOCTOU window here but in
// practice it's adequate for a smoke test on a quiet runner.
func freePort() int {
l, err := net.Listen("tcp", "127.0.0.1:0")
Expect(err).ToNot(HaveOccurred(), "freePort listen")
port := l.Addr().(*net.TCPAddr).Port
Expect(l.Close()).To(Succeed())
return port
}
// startBackend spawns the locate-anything-cpp binary on the given port and
// waits until it accepts TCP connections (up to 10s). It mirrors how main.go
// resolves the purego library: the LOCATEANYTHING_LIBRARY env var points the
// dlopen at the freshly built fallback .so, and the la_capi_* symbols are
// registered there. The returned cleanup func kills the process and reaps it.
func startBackend(port int) func() {
binary, err := filepath.Abs("./locate-anything-cpp")
Expect(err).ToNot(HaveOccurred())
if _, err := os.Stat(binary); err != nil {
Skip(fmt.Sprintf("backend binary not built: %s (run `make locate-anything-cpp` first)", binary))
}
libPath, err := filepath.Abs("./liblocateanythingcpp-fallback.so")
Expect(err).ToNot(HaveOccurred())
if _, err := os.Stat(libPath); err != nil {
Skip(fmt.Sprintf("fallback library not built: %s (run `make liblocateanythingcpp-fallback.so` first)", libPath))
}
addr := fmt.Sprintf("127.0.0.1:%d", port)
cmd := exec.Command(binary, "--addr", addr)
cmd.Env = append(os.Environ(), "LOCATEANYTHING_LIBRARY="+libPath)
cmd.Stdout = os.Stderr
cmd.Stderr = os.Stderr
Expect(cmd.Start()).To(Succeed())
cleanup := func() {
if cmd.Process != nil {
_ = cmd.Process.Kill()
_, _ = cmd.Process.Wait()
}
}
deadline := time.Now().Add(10 * time.Second)
for time.Now().Before(deadline) {
c, err := net.DialTimeout("tcp", addr, 200*time.Millisecond)
if err == nil {
_ = c.Close()
return cleanup
}
time.Sleep(200 * time.Millisecond)
}
cleanup()
Fail(fmt.Sprintf("backend did not become ready on %s within 10s", addr))
return func() {}
}
// loadTestImage reads the COCO test image downloaded by test.sh and returns its
// base64-encoded content (the wire format accepted by the Detect RPC).
func loadTestImage() string {
imgPath, err := filepath.Abs("test-data/test.jpg")
Expect(err).ToNot(HaveOccurred())
imgBytes, err := os.ReadFile(imgPath)
if err != nil {
Skip(fmt.Sprintf("test image not present: %s (run test.sh first)", imgPath))
}
return base64.StdEncoding.EncodeToString(imgBytes)
}
// dialBackend opens a gRPC client connection to the spawned backend.
func dialBackend(port int) (pb.BackendClient, func()) {
addr := fmt.Sprintf("127.0.0.1:%d", port)
conn, err := grpc.NewClient(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
Expect(err).ToNot(HaveOccurred())
return pb.NewBackendClient(conn), func() { _ = conn.Close() }
}
// modelPathOrSkip resolves the model file under ./test-models/ and Skip()s the
// current spec if it's missing (the ~6.3 GB GGUF is not present on a fresh
// checkout / on CI runners without the download).
func modelPathOrSkip(name string) string {
modelDir, err := filepath.Abs("test-models")
Expect(err).ToNot(HaveOccurred())
modelPath := filepath.Join(modelDir, name)
if _, err := os.Stat(modelPath); err != nil {
Skip(fmt.Sprintf("model not present: %s (run test.sh first)", modelPath))
}
return modelPath
}
var _ = Describe("locate-anything-cpp backend", func() {
It("runs open-vocabulary detection against a known-good COCO image", func() {
modelPath := modelPathOrSkip("locate-anything-q8_0.gguf")
imgB64 := loadTestImage()
port := freePort()
cleanup := startBackend(port)
defer cleanup()
client, closeConn := dialBackend(port)
defer closeConn()
// The q8_0 model is ~6.3 GB and hybrid Parallel Box Decoding on CPU is
// not cheap, so give LoadModel + Detect a generous deadline.
ctx, cancel := context.WithTimeout(context.Background(), 20*time.Minute)
defer cancel()
loadResp, err := client.LoadModel(ctx, &pb.ModelOptions{
Model: "locate-anything-q8_0.gguf",
ModelFile: modelPath,
Threads: 4,
})
Expect(err).ToNot(HaveOccurred(), "LoadModel")
Expect(loadResp.GetSuccess()).To(BeTrue(), "LoadModel reported failure: %s", loadResp.GetMessage())
// Open-vocabulary detection is prompt-driven; the prompt names the
// classes to locate (people + cars), separated by the </c> control token.
detResp, err := client.Detect(ctx, &pb.DetectOptions{
Src: imgB64,
Prompt: "Locate all the instances that matches the following description: person</c>car.",
})
Expect(err).ToNot(HaveOccurred(), "Detect")
Expect(detResp.GetDetections()).ToNot(BeEmpty(), "no detections returned on a known-good COCO image")
_, _ = fmt.Fprintf(GinkgoWriter, "detection OK: %d detections\n", len(detResp.GetDetections()))
for i, d := range detResp.GetDetections() {
Expect(d.GetClassName()).ToNot(BeEmpty(), "detection %d has empty class_name", i)
Expect(d.GetWidth()).To(BeNumerically(">", float32(0)),
"detection %d has non-positive width", i)
Expect(d.GetHeight()).To(BeNumerically(">", float32(0)),
"detection %d has non-positive height", i)
_, _ = fmt.Fprintf(GinkgoWriter, " [%d] %s box=(%.1f,%.1f,%.1fx%.1f)\n",
i, d.GetClassName(), d.GetX(), d.GetY(), d.GetWidth(), d.GetHeight())
}
})
})


@@ -0,0 +1,59 @@
#!/bin/bash
# Script to copy the appropriate libraries based on architecture
set -e
CURDIR=$(dirname "$(realpath $0)")
REPO_ROOT="${CURDIR}/../../.."
# Create lib directory
mkdir -p $CURDIR/package/lib
cp -avf $CURDIR/liblocateanythingcpp-*.so $CURDIR/package/
cp -avf $CURDIR/locate-anything-cpp $CURDIR/package/
cp -fv $CURDIR/run.sh $CURDIR/package/
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
# ARM64 architecture
echo "Detected ARM64 architecture, copying ARM64 libraries..."
cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ $(uname -s) = "Darwin" ]; then
echo "Detected Darwin"
else
echo "Error: Could not detect architecture"
exit 1
fi
# Package GPU libraries based on BUILD_TYPE
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
package_gpu_libs
fi
echo "Packaging completed successfully"
ls -liah $CURDIR/package/
ls -liah $CURDIR/package/lib/


@@ -0,0 +1,52 @@
#!/bin/bash
set -ex
# Get the absolute current dir where the script is located
CURDIR=$(dirname "$(realpath $0)")
cd /
echo "CPU info:"
if [ "$(uname)" != "Darwin" ]; then
grep -e "model\sname" /proc/cpuinfo | head -1
grep -e "flags" /proc/cpuinfo | head -1
fi
LIBRARY="$CURDIR/liblocateanythingcpp-fallback.so"
if [ "$(uname)" != "Darwin" ]; then
if grep -q -e "\savx\s" /proc/cpuinfo ; then
echo "CPU: AVX found OK"
if [ -e $CURDIR/liblocateanythingcpp-avx.so ]; then
LIBRARY="$CURDIR/liblocateanythingcpp-avx.so"
fi
fi
if grep -q -e "\savx2\s" /proc/cpuinfo ; then
echo "CPU: AVX2 found OK"
if [ -e $CURDIR/liblocateanythingcpp-avx2.so ]; then
LIBRARY="$CURDIR/liblocateanythingcpp-avx2.so"
fi
fi
# Check avx 512
if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
echo "CPU: AVX512F found OK"
if [ -e $CURDIR/liblocateanythingcpp-avx512.so ]; then
LIBRARY="$CURDIR/liblocateanythingcpp-avx512.so"
fi
fi
fi
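# Only append the previous LD_LIBRARY_PATH when it is non-empty: a trailing
# colon would add an empty entry, which the loader treats as the current dir.
export LD_LIBRARY_PATH="$CURDIR/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export LOCATEANYTHING_LIBRARY="$LIBRARY"
# If there is a lib/ld.so, use it
if [ -f "$CURDIR/lib/ld.so" ]; then
echo "Using lib/ld.so"
echo "Using library: $LIBRARY"
exec "$CURDIR/lib/ld.so" "$CURDIR/locate-anything-cpp" "$@"
fi
echo "Using library: $LIBRARY"
exec "$CURDIR/locate-anything-cpp" "$@"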

@@ -0,0 +1,47 @@
#!/bin/bash
set -e
CURDIR=$(dirname "$(realpath $0)")
echo "Running locate-anything-cpp backend tests..."
# Test model from the mudler/locate-anything.cpp-gguf HuggingFace repo. This is
# the q8_0 quantization of nvidia/LocateAnything-3B (~6.3 GB), so the download
# is the slow step. It is resumed with `curl -C -` and skipped entirely if the
# file is already present.
LOCATEANYTHING_MODEL_DIR="${LOCATEANYTHING_MODEL_DIR:-$CURDIR/test-models}"
LOCATEANYTHING_MODEL_FILE="${LOCATEANYTHING_MODEL_FILE:-locate-anything-q8_0.gguf}"
LOCATEANYTHING_MODEL_URL="${LOCATEANYTHING_MODEL_URL:-https://huggingface.co/mudler/locate-anything.cpp-gguf/resolve/main/locate-anything-q8_0.gguf}"
mkdir -p "$LOCATEANYTHING_MODEL_DIR"
if [ ! -f "$LOCATEANYTHING_MODEL_DIR/$LOCATEANYTHING_MODEL_FILE" ]; then
echo "Downloading locate-anything q8_0 model (~6.3 GB, this is slow)..."
# -C - resumes a partial download so an interrupted run doesn't restart from 0.
curl -L -C - -o "$LOCATEANYTHING_MODEL_DIR/$LOCATEANYTHING_MODEL_FILE" "$LOCATEANYTHING_MODEL_URL" --progress-bar
fi
# Use a real COCO test image (people + cars) from the upstream rf-detr.cpp repo
# (~46 KB). Open-vocabulary detection needs real content to locate, so a
# synthetic image would trivially yield zero detections.
TEST_IMAGE_DIR="$CURDIR/test-data"
TEST_IMAGE_FILE="$TEST_IMAGE_DIR/test.jpg"
TEST_IMAGE_URL="${TEST_IMAGE_URL:-https://raw.githubusercontent.com/mudler/rf-detr.cpp/main/tests/fixtures/ci/test_image.jpg}"
mkdir -p "$TEST_IMAGE_DIR"
if [ ! -f "$TEST_IMAGE_FILE" ]; then
echo "Downloading COCO test image..."
curl -L -o "$TEST_IMAGE_FILE" "$TEST_IMAGE_URL" --progress-bar
fi
echo "locate-anything-cpp test setup complete."
echo " model: $LOCATEANYTHING_MODEL_DIR/$LOCATEANYTHING_MODEL_FILE"
echo " test image: $TEST_IMAGE_FILE"
# Run the Go smoke test: spawns the backend binary on a free port, calls
# LoadModel + Detect via gRPC against the downloaded GGUF + COCO image.
echo ""
echo "Running Go smoke test..."
cd "$CURDIR"
go test -v -timeout 30m ./...

backend/go/omnivoice-cpp/.gitignore

@@ -0,0 +1,17 @@
# Fetched upstream sources
sources/
# CMake build directories
build*/
# Compiled shared libraries
*.so
# Compiled backend binary
omnivoice-cpp
# Packaging output
package/
# Downloaded e2e models
omnivoice-models/

@@ -0,0 +1,53 @@
cmake_minimum_required(VERSION 3.14)
project(gomnivoicecpp LANGUAGES C CXX)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
set(OMNIVOICE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/sources/omnivoice.cpp)
# Override upstream's CMAKE_CUDA_ARCHITECTURES before add_subdirectory.
if(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
set(CMAKE_CUDA_ARCHITECTURES "75-virtual;80-virtual;86-real;89-real")
endif()
# Add the upstream project. Its own CMakeLists adds ggml + builds
# omnivoice-core (STATIC, contains src/omnivoice.cpp i.e. the ov_* impl).
# EXCLUDE_FROM_ALL keeps its CLI tools/tests from building unless referenced.
add_subdirectory(${OMNIVOICE_DIR} omnivoice EXCLUDE_FROM_ALL)
# Upstream generates version.h into its own CMAKE_CURRENT_BINARY_DIR and adds
# the top-level ${CMAKE_BINARY_DIR} to omnivoice-core's include path. When the
# project is nested under add_subdirectory those two directories differ
# (<build>/omnivoice vs <build>), so omnivoice.cpp cannot find version.h. Point
# omnivoice-core at the subproject binary dir where version.h is actually
# generated. (Fix lives here, never in the fetched upstream checkout.)
target_include_directories(omnivoice-core PRIVATE ${CMAKE_BINARY_DIR}/omnivoice)
add_library(gomnivoicecpp MODULE cpp/gomnivoicecpp.cpp)
target_link_libraries(gomnivoicecpp PRIVATE omnivoice-core)
target_include_directories(gomnivoicecpp PRIVATE ${OMNIVOICE_DIR}/src)
target_include_directories(gomnivoicecpp SYSTEM PRIVATE ${OMNIVOICE_DIR}/ggml/include)
# Link GPU backends if the upstream ggml created them.
foreach(backend blas cuda metal vulkan sycl)
if(TARGET ggml-${backend})
target_link_libraries(gomnivoicecpp PRIVATE ggml-${backend})
if(backend STREQUAL "cuda")
find_package(CUDAToolkit QUIET)
if(CUDAToolkit_FOUND)
target_link_libraries(gomnivoicecpp PRIVATE CUDA::cudart)
endif()
endif()
endif()
endforeach()
if(MSVC)
target_compile_options(gomnivoicecpp PRIVATE /W4 /wd4100 /wd4505)
else()
target_compile_options(gomnivoicecpp PRIVATE -Wall -Wextra
-Wno-unused-parameter -Wno-unused-function)
endif()
set_property(TARGET gomnivoicecpp PROPERTY CXX_STANDARD 17)
set_target_properties(gomnivoicecpp PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})

@@ -0,0 +1,122 @@
CMAKE_ARGS?=
BUILD_TYPE?=
NATIVE?=false
GOCMD?=go
GO_TAGS?=
JOBS?=$(shell nproc --ignore=1)
# omnivoice.cpp version
OMNIVOICE_REPO?=https://github.com/ServeurpersoCom/omnivoice.cpp
OMNIVOICE_VERSION?=2603355a5dfacae5cfc33531d5d0933221843509
SO_TARGET?=libgomnivoicecpp.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
ifeq ($(NATIVE),false)
CMAKE_ARGS+=-DGGML_NATIVE=OFF
endif
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS+=-DGGML_CUDA=ON
else ifeq ($(BUILD_TYPE),openblas)
CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
else ifeq ($(BUILD_TYPE),clblas)
CMAKE_ARGS+=-DGGML_CLBLAST=ON -DCLBlast_DIR=/some/path
else ifeq ($(BUILD_TYPE),hipblas)
CMAKE_ARGS+=-DGGML_HIPBLAS=ON
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=ON
else ifeq ($(OS),Darwin)
ifneq ($(BUILD_TYPE),metal)
CMAKE_ARGS+=-DGGML_METAL=OFF
else
CMAKE_ARGS+=-DGGML_METAL=ON
CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
endif
endif
ifeq ($(BUILD_TYPE),sycl_f16)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx \
-DGGML_SYCL_F16=ON
endif
ifeq ($(BUILD_TYPE),sycl_f32)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx
endif
sources/omnivoice.cpp:
mkdir -p sources/omnivoice.cpp
cd sources/omnivoice.cpp && \
git init && \
git remote add origin $(OMNIVOICE_REPO) && \
git fetch origin && \
git checkout $(OMNIVOICE_VERSION) && \
git submodule update --init --recursive --depth 1 --single-branch
# Detect OS
UNAME_S := $(shell uname -s)
# Only build CPU variants on Linux
ifeq ($(UNAME_S),Linux)
VARIANT_TARGETS = libgomnivoicecpp-avx.so libgomnivoicecpp-avx2.so libgomnivoicecpp-avx512.so libgomnivoicecpp-fallback.so
else
VARIANT_TARGETS = libgomnivoicecpp-fallback.so
endif
omnivoice-cpp: main.go gomnivoicecpp.go $(VARIANT_TARGETS)
CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o omnivoice-cpp ./
package: omnivoice-cpp
bash package.sh
build: package
clean: purge
rm -rf libgomnivoicecpp*.so package sources/omnivoice.cpp omnivoice-cpp
purge:
rm -rf build*
.NOTPARALLEL:
ifeq ($(UNAME_S),Linux)
libgomnivoicecpp-avx.so: sources/omnivoice.cpp
$(info ${GREEN}I omnivoice-cpp build info:avx${RESET})
SO_TARGET=libgomnivoicecpp-avx.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgomnivoicecpp-custom
rm -rf build-libgomnivoicecpp-avx.so
libgomnivoicecpp-avx2.so: sources/omnivoice.cpp
$(info ${GREEN}I omnivoice-cpp build info:avx2${RESET})
SO_TARGET=libgomnivoicecpp-avx2.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgomnivoicecpp-custom
rm -rf build-libgomnivoicecpp-avx2.so
libgomnivoicecpp-avx512.so: sources/omnivoice.cpp
$(info ${GREEN}I omnivoice-cpp build info:avx512${RESET})
SO_TARGET=libgomnivoicecpp-avx512.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgomnivoicecpp-custom
rm -rf build-libgomnivoicecpp-avx512.so
endif
libgomnivoicecpp-fallback.so: sources/omnivoice.cpp
$(info ${GREEN}I omnivoice-cpp build info:fallback${RESET})
SO_TARGET=libgomnivoicecpp-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgomnivoicecpp-custom
rm -rf build-libgomnivoicecpp-fallback.so
libgomnivoicecpp-custom: CMakeLists.txt cpp/gomnivoicecpp.cpp cpp/gomnivoicecpp.h
mkdir -p build-$(SO_TARGET) && \
cd build-$(SO_TARGET) && \
cmake .. $(CMAKE_ARGS) && \
cmake --build . --config Release -j$(JOBS) --target gomnivoicecpp && \
cd .. && \
mv build-$(SO_TARGET)/libgomnivoicecpp.so ./$(SO_TARGET)
test: omnivoice-cpp
@echo "Running omnivoice-cpp tests..."
bash test.sh
@echo "omnivoice-cpp tests completed."
all: omnivoice-cpp package

@@ -0,0 +1,129 @@
package main
import (
"bytes"
"encoding/binary"
"fmt"
"os"
"runtime"
"github.com/go-audio/audio"
"github.com/go-audio/wav"
)
const omnivoiceSampleRate = 24000
// wavHeader24k returns a 44-byte WAV header for a streaming 24 kHz mono 16-bit
// PCM stream, with placeholder (0xFFFFFFFF) sizes since the total length is
// unknown up front. Emitted as the first chunk of TTSStream so the HTTP layer
// receives a self-describing WAV (the gRPC TTSStream path never sets Message,
// so the backend owns the header - see core/backend/tts.go:ModelTTSStream).
func wavHeader24k() []byte {
var buf bytes.Buffer
w := func(v any) { _ = binary.Write(&buf, binary.LittleEndian, v) }
buf.WriteString("RIFF")
w(uint32(0xFFFFFFFF))
buf.WriteString("WAVE")
buf.WriteString("fmt ")
w(uint32(16)) // Subchunk1Size
w(uint16(1)) // PCM
w(uint16(1)) // mono
w(uint32(omnivoiceSampleRate)) // sample rate
w(uint32(omnivoiceSampleRate * 2)) // byte rate = SR * blockAlign
w(uint16(2)) // block align (16-bit mono)
w(uint16(16)) // bits per sample
buf.WriteString("data")
w(uint32(0xFFFFFFFF))
return buf.Bytes()
}
// floatToPCM16LE clamps each sample to [-1,1] and encodes it as little-endian
// signed 16-bit PCM.
func floatToPCM16LE(samples []float32) []byte {
out := make([]byte, len(samples)*2)
for i, s := range samples {
if s > 1 {
s = 1
} else if s < -1 {
s = -1
}
v := int16(s * 32767)
out[i*2] = byte(v)
out[i*2+1] = byte(v >> 8)
}
return out
}
// writeWAV24k writes samples as a finalized 24 kHz mono 16-bit WAV at dst.
func writeWAV24k(dst string, samples []float32) error {
f, err := os.Create(dst)
if err != nil {
return fmt.Errorf("omnivoice: create %q: %w", dst, err)
}
enc := wav.NewEncoder(f, omnivoiceSampleRate, 16, 1, 1)
ints := make([]int, len(samples))
for i, s := range samples {
if s > 1 {
s = 1
} else if s < -1 {
s = -1
}
ints[i] = int(s * 32767)
}
b := &audio.IntBuffer{
Format: &audio.Format{NumChannels: 1, SampleRate: omnivoiceSampleRate},
Data: ints,
SourceBitDepth: 16,
}
if err := enc.Write(b); err != nil {
_ = enc.Close()
_ = f.Close()
return fmt.Errorf("omnivoice: encode WAV: %w", err)
}
if err := enc.Close(); err != nil {
_ = f.Close()
return fmt.Errorf("omnivoice: finalize WAV: %w", err)
}
return f.Close()
}
// readWAVAsFloat decodes a WAV file to a mono float32 slice in [-1,1] for use
// as reference audio, downmixing multi-channel input by averaging. It does not
// resample: OmniVoice expects 24 kHz, so callers should supply 24 kHz clips.
func readWAVAsFloat(path string) ([]float32, error) {
f, err := os.Open(path)
if err != nil {
return nil, fmt.Errorf("omnivoice: open ref %q: %w", path, err)
}
defer func() { _ = f.Close() }()
dec := wav.NewDecoder(f)
buf, err := dec.FullPCMBuffer()
if err != nil {
return nil, fmt.Errorf("omnivoice: decode ref %q: %w", path, err)
}
ch := int(buf.Format.NumChannels)
if ch < 1 {
ch = 1
}
bitDepth := int(buf.SourceBitDepth)
if bitDepth == 0 {
bitDepth = 16
}
scale := float32(int64(1) << uint(bitDepth-1))
n := len(buf.Data) / ch
out := make([]float32, n)
for i := 0; i < n; i++ {
// Downmix to mono by averaging channels.
var acc int
for c := 0; c < ch; c++ {
acc += buf.Data[i*ch+c]
}
out[i] = float32(acc) / float32(ch) / scale
}
return out, nil
}
// runtimeKeepAlive prevents the GC from reclaiming the reference-audio slice
// while its backing pointer is in use across the C call.
func runtimeKeepAlive(v any) { runtime.KeepAlive(v) }
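
Review note: the streaming header above writes 0xFFFFFFFF into both size fields, so a consumer cannot learn the payload length from the header and has to read until the stream ends. The following is a minimal sketch of how a client might decode such a header, assuming the canonical 44-byte layout with a 16-byte `fmt ` chunk. The names `wavInfo`, `streamingHeader` and `parseHeader` are hypothetical and not part of this change.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// wavInfo is a hypothetical helper type for this sketch only.
type wavInfo struct {
	Channels      uint16
	SampleRate    uint32
	BitsPerSample uint16
	Streaming     bool // data size is the 0xFFFFFFFF "unknown length" placeholder
}

// streamingHeader builds a canonical 44-byte mono 16-bit PCM header with
// placeholder sizes, using the same field order as wavHeader24k.
func streamingHeader(sampleRate uint32) []byte {
	h := make([]byte, 44)
	copy(h[0:], "RIFF")
	binary.LittleEndian.PutUint32(h[4:], 0xFFFFFFFF) // RIFF chunk size: unknown
	copy(h[8:], "WAVE")
	copy(h[12:], "fmt ")
	binary.LittleEndian.PutUint32(h[16:], 16)           // fmt chunk size
	binary.LittleEndian.PutUint16(h[20:], 1)            // audio format: PCM
	binary.LittleEndian.PutUint16(h[22:], 1)            // channels: mono
	binary.LittleEndian.PutUint32(h[24:], sampleRate)   // sample rate
	binary.LittleEndian.PutUint32(h[28:], sampleRate*2) // byte rate
	binary.LittleEndian.PutUint16(h[32:], 2)            // block align
	binary.LittleEndian.PutUint16(h[34:], 16)           // bits per sample
	copy(h[36:], "data")
	binary.LittleEndian.PutUint32(h[40:], 0xFFFFFFFF) // data size: unknown
	return h
}

// parseHeader decodes the canonical layout and rejects anything else.
func parseHeader(h []byte) (wavInfo, error) {
	var info wavInfo
	if len(h) < 44 {
		return info, fmt.Errorf("header too short: %d bytes", len(h))
	}
	if string(h[0:4]) != "RIFF" || string(h[8:12]) != "WAVE" ||
		string(h[12:16]) != "fmt " || string(h[36:40]) != "data" {
		return info, fmt.Errorf("not a canonical PCM WAV header")
	}
	info.Channels = binary.LittleEndian.Uint16(h[22:24])
	info.SampleRate = binary.LittleEndian.Uint32(h[24:28])
	info.BitsPerSample = binary.LittleEndian.Uint16(h[34:36])
	info.Streaming = binary.LittleEndian.Uint32(h[40:44]) == 0xFFFFFFFF
	return info, nil
}

func main() {
	info, err := parseHeader(streamingHeader(24000))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", info)
}
```

A client that sees `Streaming == true` would treat every byte after the header as PCM until the connection closes, rather than trusting the size field.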

@@ -0,0 +1,166 @@
#include "gomnivoicecpp.h"
#include "ggml-backend.h"
#include "omnivoice.h"
#include <cstdio>
#include <cstdlib>
#include <cstring>
static ov_context *g_ctx = nullptr;
static void ggml_log_cb(enum ggml_log_level level, const char *log,
void * /*data*/) {
if (!log)
return;
const char *lvl = "?????";
switch (level) {
case GGML_LOG_LEVEL_DEBUG: lvl = "DEBUG"; break;
case GGML_LOG_LEVEL_INFO: lvl = "INFO"; break;
case GGML_LOG_LEVEL_WARN: lvl = "WARN"; break;
case GGML_LOG_LEVEL_ERROR: lvl = "ERROR"; break;
default: break;
}
fprintf(stderr, "[%-5s] %s", lvl, log);
fflush(stderr);
}
int omni_load(const char *model_path, const char *codec_path, int use_fa,
int clamp_fp16) {
ggml_log_set(ggml_log_cb, nullptr);
ggml_backend_load_all();
if (!model_path || model_path[0] == '\0') {
fprintf(stderr, "[omnivoice-cpp] ERROR: model_path is required\n");
return 1;
}
if (!codec_path || codec_path[0] == '\0') {
fprintf(stderr, "[omnivoice-cpp] ERROR: codec_path is required\n");
return 2;
}
ov_init_params p;
ov_init_default_params(&p);
p.model_path = model_path;
p.codec_path = codec_path;
p.use_fa = use_fa != 0;
p.clamp_fp16 = clamp_fp16 != 0;
fprintf(stderr, "[omnivoice-cpp] Loading model=%s codec=%s\n", model_path,
codec_path);
g_ctx = ov_init(&p);
if (!g_ctx) {
fprintf(stderr, "[omnivoice-cpp] FATAL: ov_init failed: %s\n",
ov_last_error());
return 3;
}
fprintf(stderr, "[omnivoice-cpp] Model loaded (%s)\n", ov_version());
return 0;
}
// Fill an ov_tts_params from the flat wrapper arguments.
static void fill_params(ov_tts_params *tp, const char *text, const char *lang,
const char *instruct, const float *ref_samples,
int ref_n, const char *ref_text, long long seed,
int denoise) {
ov_tts_default_params(tp);
tp->text = text ? text : "";
tp->lang = lang ? lang : "";
if (instruct && instruct[0] != '\0')
tp->instruct = instruct;
if (ref_samples && ref_n > 0) {
tp->ref_audio_24k = ref_samples;
tp->ref_n_samples = ref_n;
if (ref_text && ref_text[0] != '\0')
tp->ref_text = ref_text;
tp->denoise = denoise != 0;
}
if (seed >= 0)
tp->mg_seed = (uint64_t)seed;
}
float *omni_tts(const char *text, const char *lang, const char *instruct,
const float *ref_samples, int ref_n, const char *ref_text,
long long seed, int denoise, int *out_n) {
if (out_n)
*out_n = 0;
if (!g_ctx) {
fprintf(stderr, "[omnivoice-cpp] ERROR: model not loaded\n");
return nullptr;
}
if (!text || text[0] == '\0') {
fprintf(stderr, "[omnivoice-cpp] ERROR: text is required\n");
return nullptr; // omni_tts: out_n already 0
}
ov_tts_params tp;
fill_params(&tp, text, lang, instruct, ref_samples, ref_n, ref_text, seed,
denoise);
ov_audio out = {0};
enum ov_status rc = ov_synthesize(g_ctx, &tp, &out);
if (rc != OV_STATUS_OK || out.n_samples <= 0 || !out.samples) {
fprintf(stderr, "[omnivoice-cpp] ERROR: synthesize failed (rc=%d): %s\n",
(int)rc, ov_last_error());
ov_audio_free(&out);
return nullptr;
}
// Copy into a plain malloc buffer the Go side can free symmetrically via
// omni_pcm_free; then release the ov_audio-owned buffer.
size_t bytes = (size_t)out.n_samples * sizeof(float);
float *buf = (float *)malloc(bytes);
if (!buf) {
fprintf(stderr, "[omnivoice-cpp] ERROR: malloc(%zu) failed\n", bytes);
ov_audio_free(&out);
return nullptr;
}
memcpy(buf, out.samples, bytes);
if (out_n)
*out_n = out.n_samples;
ov_audio_free(&out);
return buf;
}
int omni_tts_stream(const char *text, const char *lang, const char *instruct,
const float *ref_samples, int ref_n, const char *ref_text,
long long seed, int denoise, omni_pcm_chunk_cb cb,
void *user_data) {
if (!g_ctx) {
fprintf(stderr, "[omnivoice-cpp] ERROR: model not loaded\n");
return 1;
}
if (!cb) {
fprintf(stderr, "[omnivoice-cpp] ERROR: stream callback is null\n");
return 2;
}
if (!text || text[0] == '\0') {
fprintf(stderr, "[omnivoice-cpp] ERROR: text is required\n");
return 4;
}
ov_tts_params tp;
fill_params(&tp, text, lang, instruct, ref_samples, ref_n, ref_text, seed,
denoise);
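// ov_audio_chunk_cb takes the same arguments as omni_pcm_chunk_cb but returns
// bool instead of int. The cast relies on the two return types being
// ABI-compatible on the supported targets, so the callback should return
// exactly 0 (abort) or 1 (continue).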
tp.on_chunk = (ov_audio_chunk_cb)cb;
tp.on_chunk_user_data = user_data;
ov_audio out = {0}; // stays empty in streaming mode
enum ov_status rc = ov_synthesize(g_ctx, &tp, &out);
ov_audio_free(&out);
if (rc != OV_STATUS_OK && rc != OV_STATUS_CANCELLED) {
fprintf(stderr, "[omnivoice-cpp] ERROR: stream synth failed (rc=%d): %s\n",
(int)rc, ov_last_error());
return 3;
}
return 0;
}
void omni_pcm_free(float *p) { free(p); }
void omni_unload(void) {
if (g_ctx) {
ov_free(g_ctx);
g_ctx = nullptr;
}
}

@@ -0,0 +1,38 @@
#pragma once
#include <cstdint>
extern "C" {
// Streaming PCM chunk callback. samples is mono float PCM at 24 kHz, valid
// only for the duration of the call. Return non-zero to continue, 0 to abort.
typedef int (*omni_pcm_chunk_cb)(const float *samples, int n_samples,
void *user_data);
// Load the LM (model_path) + codec (codec_path) GGUFs. use_fa / clamp_fp16
// map to ov_init_params. Returns 0 on success, non-zero on failure.
int omni_load(const char *model_path, const char *codec_path, int use_fa,
int clamp_fp16);
// Synthesize to a malloc'd float PCM buffer (caller frees via omni_pcm_free).
// ref_samples != null && ref_n > 0 => voice cloning (ref_text optional).
// instruct != null && non-empty => voice design. seed < 0 keeps the default
// MaskGIT seed. denoise toggles the <|denoise|> marker (only with a reference).
// Writes the sample count to *out_n. Returns NULL on failure (out_n set to 0).
float *omni_tts(const char *text, const char *lang, const char *instruct,
const float *ref_samples, int ref_n, const char *ref_text,
long long seed, int denoise, int *out_n);
// Streaming synthesis: cb is invoked per PCM chunk as audio is produced.
// Same reference/design/seed semantics as omni_tts. Returns 0 on success.
int omni_tts_stream(const char *text, const char *lang, const char *instruct,
const float *ref_samples, int ref_n, const char *ref_text,
long long seed, int denoise, omni_pcm_chunk_cb cb,
void *user_data);
// Free a buffer returned by omni_tts.
void omni_pcm_free(float *p);
// Release the OmniVoice context.
void omni_unload(void);
}

@@ -0,0 +1,74 @@
package main
import (
"os"
"strings"
"github.com/ebitengine/purego"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func ttsReq(text, voice string, lang *string, dst string) *pb.TTSRequest {
return &pb.TTSRequest{Text: text, Voice: voice, Language: lang, Dst: dst}
}
var _ = Describe("OmniVoice e2e", Label("e2e"), func() {
var loaded bool
BeforeEach(func() {
modelPath := os.Getenv("OMNIVOICE_MODEL")
codecPath := os.Getenv("OMNIVOICE_CODEC")
if modelPath == "" || codecPath == "" {
Skip("OMNIVOICE_MODEL / OMNIVOICE_CODEC not set; skipping e2e")
}
if !loaded {
lib := os.Getenv("OMNIVOICE_LIBRARY")
if lib == "" {
lib = "./libgomnivoicecpp-fallback.so"
}
h, err := purego.Dlopen(lib, purego.RTLD_NOW|purego.RTLD_GLOBAL)
Expect(err).ToNot(HaveOccurred())
purego.RegisterLibFunc(&CppLoad, h, "omni_load")
purego.RegisterLibFunc(&CppTTS, h, "omni_tts")
purego.RegisterLibFunc(&CppTTSStream, h, "omni_tts_stream")
purego.RegisterLibFunc(&CppPCMFree, h, "omni_pcm_free")
purego.RegisterLibFunc(&CppUnload, h, "omni_unload")
Expect(CppLoad(modelPath, codecPath, 0, 0)).To(Equal(0))
loaded = true
}
})
It("synthesizes a WAV file via TTS", func() {
b := &OmnivoiceCpp{opts: loadOptions{seed: 42, denoise: true}}
dst := GinkgoT().TempDir() + "/out.wav"
lang := "en"
err := b.TTS(ttsReq("Hello world.", "", &lang, dst))
Expect(err).ToNot(HaveOccurred())
fi, err := os.Stat(dst)
Expect(err).ToNot(HaveOccurred())
Expect(fi.Size()).To(BeNumerically(">", int64(44)))
})
It("streams audio chunks via TTSStream", func() {
b := &OmnivoiceCpp{opts: loadOptions{seed: 42, denoise: true}}
results := make(chan []byte, 1024)
lang := "en"
done := make(chan error, 1)
go func() { done <- b.TTSStream(ttsReq("Hello there, streaming test.", "", &lang, ""), results) }()
var chunks int
var first []byte
for c := range results {
if chunks == 0 {
first = c
}
chunks++
}
Expect(<-done).ToNot(HaveOccurred())
Expect(chunks).To(BeNumerically(">=", 2))
Expect(string(first[0:4])).To(Equal("RIFF"))
Expect(strings.HasPrefix(string(first[8:12]), "WAVE")).To(BeTrue())
})
})

@@ -0,0 +1,246 @@
package main
import (
"fmt"
"os"
"path/filepath"
"strings"
"sync"
"unsafe"
"github.com/ebitengine/purego"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
)
var (
// omni_load(model_path, codec_path, use_fa, clamp_fp16) int
CppLoad func(modelPath, codecPath string, useFA, clampFP16 int) int
// omni_tts(text, lang, instruct, ref_samples, ref_n, ref_text, seed, denoise, out_n) -> float* (uintptr)
CppTTS func(text, lang, instruct string, refSamples unsafe.Pointer, refN int,
refText string, seed int64, denoise int, outN unsafe.Pointer) uintptr
// omni_tts_stream(text, lang, instruct, ref_samples, ref_n, ref_text, seed, denoise, cb, user) int
CppTTSStream func(text, lang, instruct string, refSamples unsafe.Pointer, refN int,
refText string, seed int64, denoise int, cb uintptr, user uintptr) int
CppPCMFree func(ptr uintptr)
CppUnload func()
)
type OmnivoiceCpp struct {
base.SingleThread
opts loadOptions
// audioPath is the model-config reference voice (tts.audio_path), used as
// the default voice-cloning reference when a request does not set Voice.
audioPath string
}
func (o *OmnivoiceCpp) Load(opts *pb.ModelOptions) error {
model := opts.ModelFile
if model == "" {
model = opts.ModelPath
}
if !filepath.IsAbs(model) && opts.ModelPath != "" {
model = filepath.Join(opts.ModelPath, model)
}
o.opts = parseOptions(opts.Options)
// Resolve the codec/tokenizer GGUF: explicit option, else auto-discover a
// *tokenizer*.gguf sibling of the base model.
codec := o.opts.codecPath
if codec != "" && !filepath.IsAbs(codec) {
codec = filepath.Join(filepath.Dir(model), codec)
}
if codec == "" {
codec = discoverTokenizer(filepath.Dir(model))
}
if codec == "" {
return fmt.Errorf("omnivoice: no codec/tokenizer GGUF found; set option 'tokenizer:<file>'")
}
o.opts.codecPath = codec
// tts.audio_path (ModelOptions.AudioPath) is the config-level voice-cloning
// reference: a default reference WAV used when a request omits Voice.
// Resolved relative to the model directory like the codec.
o.audioPath = opts.AudioPath
if o.audioPath != "" && !filepath.IsAbs(o.audioPath) {
o.audioPath = filepath.Join(filepath.Dir(model), o.audioPath)
}
useFA := boolToInt(o.opts.useFA)
clamp := boolToInt(o.opts.clampFP16)
fmt.Fprintf(os.Stderr, "[omnivoice-cpp] Load model=%s codec=%s use_fa=%d clamp_fp16=%d\n",
model, codec, useFA, clamp)
if rc := CppLoad(model, codec, useFA, clamp); rc != 0 {
return fmt.Errorf("omnivoice: failed to load model (rc=%d)", rc)
}
return nil
}
// discoverTokenizer returns the first *tokenizer*.gguf in dir, or "".
func discoverTokenizer(dir string) string {
entries, err := os.ReadDir(dir)
if err != nil {
return ""
}
for _, e := range entries {
name := strings.ToLower(e.Name())
if strings.Contains(name, "tokenizer") && strings.HasSuffix(name, ".gguf") {
return filepath.Join(dir, e.Name())
}
}
return ""
}
func boolToInt(b bool) int {
if b {
return 1
}
return 0
}
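// refAudio loads the reference WAV (voice cloning) if voice points to a file.
// It returns nil, nil when voice is empty or does not name an existing file;
// in that case no cloning is done and voice design via Instructions still
// applies. A mistyped path is therefore ignored rather than reported.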
func (o *OmnivoiceCpp) refAudio(voice string) ([]float32, error) {
v := strings.TrimSpace(voice)
if v == "" {
return nil, nil
}
if _, err := os.Stat(v); err != nil {
return nil, nil
}
return readWAVAsFloat(v)
}
// refAudioFor resolves the cloning reference for a request: the per-request
// Voice takes precedence, falling back to the model-config audio_path. Empty
// result means no cloning (voice design via Instructions still applies).
func (o *OmnivoiceCpp) refAudioFor(req *pb.TTSRequest) ([]float32, error) {
voice := strings.TrimSpace(req.Voice)
if voice == "" {
voice = o.audioPath
}
return o.refAudio(voice)
}
func reqParam(req *pb.TTSRequest, key string) string {
if req.Params == nil {
return ""
}
return req.Params[key]
}
func (o *OmnivoiceCpp) seedFor(req *pb.TTSRequest) int64 {
if s := reqParam(req, "seed"); s != "" {
var n int64
if _, err := fmt.Sscan(s, &n); err == nil {
return n
}
}
return o.opts.seed
}
func optStr(p *string) string {
if p == nil {
return ""
}
return *p
}
func (o *OmnivoiceCpp) TTS(req *pb.TTSRequest) error {
if req.Dst == "" {
return fmt.Errorf("omnivoice: TTS requires a destination path")
}
lang := normalizeLanguage(optStr(req.Language))
instruct := optStr(req.Instructions)
refText := reqParam(req, "ref_text")
seed := o.seedFor(req)
ref, err := o.refAudioFor(req)
if err != nil {
return err
}
var refPtr unsafe.Pointer
if len(ref) > 0 {
refPtr = unsafe.Pointer(&ref[0])
}
var n int32
ptr := CppTTS(req.Text, lang, instruct, refPtr, len(ref), refText, seed,
boolToInt(o.opts.denoise), unsafe.Pointer(&n))
runtimeKeepAlive(ref)
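if ptr == 0 {
return fmt.Errorf("omnivoice: synthesis failed")
}
// Free the C buffer on every path from here on, including an empty result.
defer CppPCMFree(ptr)
if n <= 0 {
return fmt.Errorf("omnivoice: synthesis returned no samples")
}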
src := unsafe.Slice((*float32)(unsafe.Pointer(ptr)), int(n)) //nolint:govet // C-allocated PCM, copied out before free
out := make([]float32, int(n))
copy(out, src)
return writeWAV24k(req.Dst, out)
}
// streamState carries the active TTSStream channel to the single shared C
// callback. base.SingleThread serializes TTS/TTSStream, so one global slot is
// safe and avoids leaking a purego callback per request (purego callbacks
// cannot be freed and are capped).
var (
streamMu sync.Mutex
streamChan chan []byte
streamCbOnce sync.Once
streamCbPtr uintptr
)
// streamCallback is registered once and forwards each PCM chunk to streamChan.
func streamCallback(samples *float32, nSamples int32, _ uintptr) uintptr {
if nSamples <= 0 || samples == nil || streamChan == nil {
return 1 // continue
}
src := unsafe.Slice(samples, int(nSamples))
cp := make([]float32, int(nSamples)) // copy out of C memory before returning
copy(cp, src)
streamChan <- floatToPCM16LE(cp)
return 1 // continue
}
func (o *OmnivoiceCpp) TTSStream(req *pb.TTSRequest, results chan []byte) error {
defer close(results)
if req.Text == "" {
return fmt.Errorf("omnivoice: TTSStream requires text")
}
streamCbOnce.Do(func() {
streamCbPtr = purego.NewCallback(streamCallback)
})
lang := normalizeLanguage(optStr(req.Language))
instruct := optStr(req.Instructions)
refText := reqParam(req, "ref_text")
seed := o.seedFor(req)
ref, err := o.refAudioFor(req)
if err != nil {
return err
}
var refPtr unsafe.Pointer
if len(ref) > 0 {
refPtr = unsafe.Pointer(&ref[0])
}
// Emit the WAV header first so the HTTP layer gets a self-describing stream.
results <- wavHeader24k()
streamMu.Lock()
streamChan = results
rc := CppTTSStream(req.Text, lang, instruct, refPtr, len(ref), refText, seed,
boolToInt(o.opts.denoise), streamCbPtr, 0)
streamChan = nil
streamMu.Unlock()
runtimeKeepAlive(ref)
if rc != 0 {
return fmt.Errorf("omnivoice: streaming synthesis failed (rc=%d)", rc)
}
return nil
}
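
Review note: the single-slot callback pattern used by `TTSStream` can be sketched without purego. In the sketch below a plain Go function value stands in for the registered C callback and `produce` stands in for the blocking native call; all names (`slotChan`, `registerOnce`, `produce`, `collect`) are hypothetical. It assumes, as the backend does via `base.SingleThread`, that only one stream is active at a time.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	slotMu          sync.Mutex
	slotChan        chan []byte // destination of the currently active stream
	cbOnce          sync.Once
	cb              func(chunk []byte) int
	cbRegistrations int
)

// registerOnce creates the shared callback a single time, mirroring how the
// backend registers one native callback for the whole process.
func registerOnce() {
	cbOnce.Do(func() {
		cbRegistrations++
		cb = func(chunk []byte) int {
			if slotChan == nil || len(chunk) == 0 {
				return 1 // nothing to forward, keep going
			}
			cp := make([]byte, len(chunk)) // the producer reuses its buffer
			copy(cp, chunk)
			slotChan <- cp
			return 1
		}
	})
}

// produce stands in for the blocking native synthesis call: it invokes the
// callback n times, reusing one buffer the way a C producer might.
func produce(n int) {
	buf := make([]byte, 2)
	for i := 0; i < n; i++ {
		buf[0], buf[1] = byte(i), byte(i>>8)
		cb(buf)
	}
}

// stream points the shared slot at results for the duration of one call.
func stream(n int, results chan []byte) {
	defer close(results)
	registerOnce()
	slotMu.Lock()
	slotChan = results
	produce(n)
	slotChan = nil
	slotMu.Unlock()
}

// collect runs one stream and gathers every chunk it emitted.
func collect(n int) [][]byte {
	results := make(chan []byte, n)
	go stream(n, results)
	var out [][]byte
	for c := range results {
		out = append(out, c)
	}
	return out
}

func main() {
	a := collect(3)
	b := collect(2)
	fmt.Println(len(a), len(b), cbRegistrations)
}
```

The copy inside the callback matters: without it every queued chunk would alias the producer's reused buffer, which is the same reason `streamCallback` copies out of C memory before returning.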

@@ -0,0 +1,90 @@
package main
import (
"bytes"
"encoding/binary"
"testing"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func TestOmnivoiceCpp(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "omnivoice-cpp suite")
}
var _ = Describe("normalizeLanguage", func() {
DescribeTable("maps caller language to OmniVoice codes",
func(in, want string) {
Expect(normalizeLanguage(in)).To(Equal(want))
},
Entry("empty stays empty", "", ""),
Entry("english full name", "English", "en"),
Entry("chinese full name", "Chinese", "zh"),
Entry("locale suffix stripped", "en-US", "en"),
Entry("underscore locale", "zh_CN", "zh"),
Entry("already a code", "en", "en"),
Entry("unknown passes through normalized", "xx", "xx"),
)
})
var _ = Describe("parseOptions", func() {
It("extracts codec, use_fa, clamp_fp16, seed, denoise", func() {
o := parseOptions([]string{
"tokenizer:tok.gguf",
"use_fa:true",
"clamp_fp16:true",
"seed:7",
"denoise:false",
"unknown:ignored",
})
Expect(o.codecPath).To(Equal("tok.gguf"))
Expect(o.useFA).To(BeTrue())
Expect(o.clampFP16).To(BeTrue())
Expect(o.seed).To(Equal(int64(7)))
Expect(o.denoise).To(BeFalse())
})
It("accepts codec: as an alias for tokenizer:", func() {
o := parseOptions([]string{"codec:c.gguf"})
Expect(o.codecPath).To(Equal("c.gguf"))
})
It("defaults seed to -1 and denoise to true", func() {
o := parseOptions(nil)
Expect(o.seed).To(Equal(int64(-1)))
Expect(o.denoise).To(BeTrue())
})
})
var _ = Describe("wavHeader24k", func() {
It("emits a 44-byte streaming WAV header at 24 kHz mono 16-bit", func() {
h := wavHeader24k()
Expect(h).To(HaveLen(44))
Expect(string(h[0:4])).To(Equal("RIFF"))
Expect(string(h[8:12])).To(Equal("WAVE"))
Expect(string(h[12:16])).To(Equal("fmt "))
Expect(string(h[36:40])).To(Equal("data"))
var sampleRate uint32
Expect(binary.Read(bytes.NewReader(h[24:28]), binary.LittleEndian, &sampleRate)).To(Succeed())
Expect(sampleRate).To(Equal(uint32(24000)))
})
})
var _ = Describe("floatToPCM16LE", func() {
It("clamps and converts float PCM to little-endian int16 bytes", func() {
b := floatToPCM16LE([]float32{0, 1.0, -1.0, 2.0, -2.0})
Expect(b).To(HaveLen(10)) // 5 samples * 2 bytes
read := func(off int) int16 {
var v int16
_ = binary.Read(bytes.NewReader(b[off:off+2]), binary.LittleEndian, &v)
return v
}
Expect(read(0)).To(Equal(int16(0)))
Expect(read(2)).To(Equal(int16(32767)))
Expect(read(4)).To(Equal(int16(-32767)))
Expect(read(6)).To(Equal(int16(32767))) // clamped from 2.0
Expect(read(8)).To(Equal(int16(-32767))) // clamped from -2.0
})
})

@@ -0,0 +1,48 @@
package main
// Note: this is started internally by LocalAI and a server is allocated for each model
import (
"flag"
"os"
"github.com/ebitengine/purego"
grpc "github.com/mudler/LocalAI/pkg/grpc"
)
var (
addr = flag.String("addr", "localhost:50051", "the address to listen on")
)
type LibFuncs struct {
FuncPtr any
Name string
}
func main() {
libName := os.Getenv("OMNIVOICE_LIBRARY")
if libName == "" {
libName = "./libgomnivoicecpp-fallback.so"
}
lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
panic(err)
}
libFuncs := []LibFuncs{
{&CppLoad, "omni_load"},
{&CppTTS, "omni_tts"},
{&CppTTSStream, "omni_tts_stream"},
{&CppPCMFree, "omni_pcm_free"},
{&CppUnload, "omni_unload"},
}
for _, lf := range libFuncs {
purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
}
flag.Parse()
if err := grpc.StartServer(*addr, &OmnivoiceCpp{}); err != nil {
panic(err)
}
}

@@ -0,0 +1,74 @@
package main
import (
"strconv"
"strings"
)
// loadOptions holds the parsed model-level options for OmniVoice.
type loadOptions struct {
codecPath string
useFA bool
clampFP16 bool
seed int64
denoise bool
}
func splitOption(o string) (key, value string, ok bool) {
i := strings.Index(o, ":")
if i < 0 {
return "", "", false
}
return strings.TrimSpace(o[:i]), strings.TrimSpace(o[i+1:]), true
}
// parseOptions reads the backend "key:value" option slice. Unknown keys are
// ignored. Defaults: seed -1 (engine default), denoise true.
func parseOptions(opts []string) loadOptions {
o := loadOptions{seed: -1, denoise: true}
for _, oo := range opts {
key, value, ok := splitOption(oo)
if !ok {
continue
}
switch key {
case "tokenizer", "codec":
o.codecPath = value
case "use_fa":
o.useFA = value == "true" || value == "1"
case "clamp_fp16":
o.clampFP16 = value == "true" || value == "1"
case "seed":
if n, err := strconv.ParseInt(value, 10, 64); err == nil {
o.seed = n
}
case "denoise":
o.denoise = value == "true" || value == "1"
}
}
return o
}
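To make the parsing rules concrete, here is a small runnable sketch. `splitOption` and `parseOptions` are copied from the diff above. Only the option strings in `main` are invented for illustration, including the hypothetical tokenizer path.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type loadOptions struct {
	codecPath string
	useFA     bool
	clampFP16 bool
	seed      int64
	denoise   bool
}

// splitOption splits on the FIRST colon only, so values may contain colons.
func splitOption(o string) (key, value string, ok bool) {
	i := strings.Index(o, ":")
	if i < 0 {
		return "", "", false
	}
	return strings.TrimSpace(o[:i]), strings.TrimSpace(o[i+1:]), true
}

func parseOptions(opts []string) loadOptions {
	o := loadOptions{seed: -1, denoise: true}
	for _, oo := range opts {
		key, value, ok := splitOption(oo)
		if !ok {
			continue
		}
		switch key {
		case "tokenizer", "codec":
			o.codecPath = value
		case "use_fa":
			o.useFA = value == "true" || value == "1"
		case "clamp_fp16":
			o.clampFP16 = value == "true" || value == "1"
		case "seed":
			if n, err := strconv.ParseInt(value, 10, 64); err == nil {
				o.seed = n
			}
		case "denoise":
			o.denoise = value == "true" || value == "1"
		}
	}
	return o
}

func main() {
	o := parseOptions([]string{
		"codec: /models/tokenizer.gguf", // hypothetical path; surrounding spaces are trimmed
		"use_fa:1",
		"seed:not-a-number", // parse error: seed keeps its default of -1
		"denoise:false",
		"no-colon-entry", // skipped: not in key:value form
	})
	fmt.Printf("%+v\n", o)
}
```

Boolean options are true only for the literal values "true" and "1". Anything else, for example "denoise:yes", sets the flag to false instead of leaving the default in place.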
// languageNameAliases maps full language names to OmniVoice codes. OmniVoice's
// lang hint accepts "" (auto), "en", "zh" per the upstream convention; other
// codes pass through and the engine treats unknown hints as auto.
var languageNameAliases = map[string]string{
"english": "en",
"chinese": "zh",
}
// normalizeLanguage lowercases, trims, strips a region/locale suffix, and
// resolves common full names. Empty stays empty so the engine auto-detects.
func normalizeLanguage(lang string) string {
lang = strings.ToLower(strings.TrimSpace(lang))
if lang == "" {
return ""
}
if i := strings.IndexAny(lang, "-_."); i >= 0 {
lang = lang[:i]
}
if code, ok := languageNameAliases[lang]; ok {
return code
}
return lang
}
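A short worked example of the normalisation order (lowercase and trim, strip the suffix at the first `-`, `_` or `.`, then resolve full names). The function and alias map are copied from the diff; the input strings are illustrative only.

```go
package main

import (
	"fmt"
	"strings"
)

var languageNameAliases = map[string]string{
	"english": "en",
	"chinese": "zh",
}

func normalizeLanguage(lang string) string {
	lang = strings.ToLower(strings.TrimSpace(lang))
	if lang == "" {
		return ""
	}
	// Cut at the first region/locale separator: "zh_CN.UTF-8" becomes "zh".
	if i := strings.IndexAny(lang, "-_."); i >= 0 {
		lang = lang[:i]
	}
	if code, ok := languageNameAliases[lang]; ok {
		return code
	}
	return lang
}

func main() {
	for _, in := range []string{"en-US", " English ", "zh_CN.UTF-8", "Chinese", "pt-BR", ""} {
		fmt.Printf("%q -> %q\n", in, normalizeLanguage(in))
	}
}
```

Codes that are not in the alias map, such as "pt", pass through unchanged. According to the comment in the diff, the engine then treats unknown hints as auto.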


@@ -0,0 +1,64 @@
#!/bin/bash
# Script to copy the appropriate libraries based on architecture
# This script is used in the final stage of the Dockerfile
set -e
CURDIR=$(dirname "$(realpath $0)")
REPO_ROOT="${CURDIR}/../../.."
# Create lib directory
mkdir -p $CURDIR/package/lib
cp -avf $CURDIR/omnivoice-cpp $CURDIR/package/
cp -fv $CURDIR/libgomnivoicecpp-*.so $CURDIR/package/
cp -fv $CURDIR/run.sh $CURDIR/package/
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
# ARM64 architecture
echo "Detected ARM64 architecture, copying ARM64 libraries..."
cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ $(uname -s) = "Darwin" ]; then
echo "Detected Darwin"
else
echo "Error: Could not detect architecture"
exit 1
fi
# Package GPU libraries based on BUILD_TYPE
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
package_gpu_libs
fi
echo "Packaging completed successfully"
ls -liah $CURDIR/package/
ls -liah $CURDIR/package/lib/

backend/go/omnivoice-cpp/run.sh Executable file

@@ -0,0 +1,52 @@
#!/bin/bash
set -ex
# Get the absolute current dir where the script is located
CURDIR=$(dirname "$(realpath $0)")
cd /
echo "CPU info:"
if [ "$(uname)" != "Darwin" ]; then
grep -e "model\sname" /proc/cpuinfo | head -1
grep -e "flags" /proc/cpuinfo | head -1
fi
LIBRARY="$CURDIR/libgomnivoicecpp-fallback.so"
if [ "$(uname)" != "Darwin" ]; then
if grep -q -e "\savx\s" /proc/cpuinfo ; then
echo "CPU: AVX found OK"
if [ -e $CURDIR/libgomnivoicecpp-avx.so ]; then
LIBRARY="$CURDIR/libgomnivoicecpp-avx.so"
fi
fi
if grep -q -e "\savx2\s" /proc/cpuinfo ; then
echo "CPU: AVX2 found OK"
if [ -e $CURDIR/libgomnivoicecpp-avx2.so ]; then
LIBRARY="$CURDIR/libgomnivoicecpp-avx2.so"
fi
fi
# Check avx 512
if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
echo "CPU: AVX512F found OK"
if [ -e $CURDIR/libgomnivoicecpp-avx512.so ]; then
LIBRARY="$CURDIR/libgomnivoicecpp-avx512.so"
fi
fi
fi
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
export OMNIVOICE_LIBRARY=$LIBRARY
# If there is a lib/ld.so, use it
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"
echo "Using library: $LIBRARY"
exec $CURDIR/lib/ld.so $CURDIR/omnivoice-cpp "$@"
fi
echo "Using library: $LIBRARY"
exec $CURDIR/omnivoice-cpp "$@"
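The selection order in run.sh (start from the fallback build, then let each later CPU feature override the earlier choice if its library file exists) can be modelled in Go as below. This is only a sketch of the decision logic. `pickLibrary` and the `exists` callback are hypothetical names, and the flag matching uses whitespace-separated fields instead of the script's grep patterns.

```go
package main

import (
	"fmt"
	"strings"
)

// pickLibrary mirrors the script's order: fallback first, then avx, avx2 and
// avx512f in turn. A later match wins, but only when its library is present.
func pickLibrary(cpuFlags string, exists func(name string) bool) string {
	have := map[string]bool{}
	for _, f := range strings.Fields(cpuFlags) {
		have[f] = true
	}
	lib := "libgomnivoicecpp-fallback.so"
	candidates := []struct{ flag, lib string }{
		{"avx", "libgomnivoicecpp-avx.so"},
		{"avx2", "libgomnivoicecpp-avx2.so"},
		{"avx512f", "libgomnivoicecpp-avx512.so"},
	}
	for _, c := range candidates {
		if have[c.flag] && exists(c.lib) {
			lib = c.lib
		}
	}
	return lib
}

func main() {
	all := func(string) bool { return true }
	onlyAVX := func(name string) bool { return name == "libgomnivoicecpp-avx.so" }

	fmt.Println(pickLibrary("fpu sse2 avx avx2", all))            // AVX2 build
	fmt.Println(pickLibrary("fpu sse2 avx avx2 avx512f", onlyAVX)) // AVX build: the others are not shipped
	fmt.Println(pickLibrary("fpu sse2", all))                      // fallback build
}
```

The second call shows why the script checks for the file before switching: a CPU that supports AVX512F still ends up on the AVX build when that is the only optimised library packaged.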


@@ -0,0 +1,30 @@
#!/bin/bash
set -e
CURDIR=$(dirname "$(realpath $0)")
cd "$CURDIR"
echo "Running omnivoice-cpp backend tests..."
if [ -z "$OMNIVOICE_MODEL" ]; then
MODEL_DIR="./omnivoice-models"
mkdir -p "$MODEL_DIR"
REPO_ID="Serveurperso/OmniVoice-GGUF"
BASE_URL="https://huggingface.co/${REPO_ID}/resolve/main"
FILES=( "omnivoice-base-Q4_K_M.gguf" "omnivoice-tokenizer-Q4_K_M.gguf" )
for file in "${FILES[@]}"; do
dest="${MODEL_DIR}/${file}"
if [ -f "${dest}" ]; then
echo " [skip] ${file}"
else
echo " [download] ${file}..."
curl -L -o "${dest}" "${BASE_URL}/${file}" --progress-bar
fi
done
export OMNIVOICE_MODEL="${MODEL_DIR}/omnivoice-base-Q4_K_M.gguf"
export OMNIVOICE_CODEC="${MODEL_DIR}/omnivoice-tokenizer-Q4_K_M.gguf"
fi
go test -v -timeout 1200s .
echo "All omnivoice-cpp e2e tests passed."


@@ -1,6 +1,6 @@
# parakeet-cpp backend Makefile.
#
# Upstream pin lives below as PARAKEET_VERSION?=b11fe5bca78ad8b342dd559a43d76df3984bb447
# Upstream pin lives below as PARAKEET_VERSION?=b8012f11e5269126eddb7f4fd02f891a2ccc29b0
# (.github/bump_deps.sh) can find and update it - matches the
# whisper.cpp / ds4 / vibevoice-cpp convention.
#
@@ -15,7 +15,7 @@
# That's what the L0 smoke test uses. The default target below does the
# proper clone-at-pin + cmake build so CI doesn't need a side-checkout.
PARAKEET_VERSION?=b11fe5bca78ad8b342dd559a43d76df3984bb447
PARAKEET_VERSION?=b8012f11e5269126eddb7f4fd02f891a2ccc29b0
PARAKEET_REPO?=https://github.com/mudler/parakeet.cpp
GOCMD?=go
@@ -39,7 +39,10 @@ endif
# is overwritten back to OFF and the build silently falls back to CPU. Forward the
# PARAKEET_GGML_* options instead. (openblas is not gated, so -DGGML_BLAS passes through.)
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS+=-DPARAKEET_GGML_CUDA=ON
# GGML_CUDA_GRAPHS is OFF by ggml default; enabling it gives a small free
# speedup (~1% measured on GB10, never negative) by capturing/replaying the
# CUDA graph. Not gated by parakeet.cpp, so it passes straight through to ggml.
CMAKE_ARGS+=-DPARAKEET_GGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON
else ifeq ($(BUILD_TYPE),openblas)
CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
else ifeq ($(BUILD_TYPE),hipblas)


@@ -7,8 +7,12 @@ import "time"
type batchRequest struct {
pcm []float32
decoder int32
tag string
reply chan batchReply
// language is the per-request target locale ("" means the model default).
// parakeet.cpp's batched C-API takes ONE target_lang for the whole batch,
// so the dispatcher only coalesces requests that share a language.
language string
tag string
reply chan batchReply
}
// batchReply carries one per-item JSON object string (an element of the C-API's
@@ -43,13 +47,25 @@ func newBatcher(maxSize int, maxWait time.Duration, runBatch func([]*batchReques
// run is the dispatcher loop: accumulate submitted requests until either maxSize
// is reached or maxWait elapses since the first queued request, then dispatch.
// Exits when stop is closed (draining any partially-filled batch first).
//
// A batch carries ONE language (parakeet.cpp's batched C-API takes a single
// target_lang), so a request whose language differs from the batch leader is
// not coalesced: it is held in carry and becomes the leader of the next batch.
// carry is therefore never dropped and its caller never deadlocks: every batch
// (including a lone carry on stop) is dispatched, and runBatch replies to all.
func (b *batcher) run(stop <-chan struct{}) {
var carry *batchRequest
for {
var first *batchRequest
select {
case first = <-b.submit:
case <-stop:
return
if carry != nil {
// A mismatched request from the previous fill leads this batch.
first, carry = carry, nil
} else {
select {
case first = <-b.submit:
case <-stop:
return
}
}
batch := []*batchRequest{first}
@@ -64,12 +80,22 @@ func (b *batcher) run(stop <-chan struct{}) {
for len(batch) < b.maxSize {
select {
case r := <-b.submit:
if r.language != first.language {
// Different language: carry it to the next batch so this
// batch stays single-language, then dispatch what we have.
carry = r
break fill
}
batch = append(batch, r)
case <-timer.C:
break fill
case <-stop:
timer.Stop()
b.runBatch(batch)
// Don't strand a carried request's caller on shutdown.
if carry != nil {
b.runBatch([]*batchRequest{carry})
}
return
}
}
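The carry-over rule is easier to see without channels and timers. The sketch below is a simplified, synchronous model, not the dispatcher itself. It assumes every request is already queued in a known order and that the wait window never expires, and the helper name `groupByLeaderLanguage` is hypothetical.

```go
package main

import "fmt"

// groupByLeaderLanguage models the dispatcher's batching rule on an ordered
// list of request languages: the first request of a batch sets its language,
// and a request with a different language (or a full batch) closes the batch
// and leads the next one.
func groupByLeaderLanguage(langs []string, maxSize int) [][]string {
	var batches [][]string
	var cur []string
	for _, l := range langs {
		if len(cur) > 0 && (l != cur[0] || len(cur) >= maxSize) {
			batches = append(batches, cur)
			cur = nil
		}
		cur = append(cur, l) // l joins the open batch or leads a new one
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	langs := []string{"en", "en", "de", "de", "en", "fr", "fr"}
	fmt.Println(groupByLeaderLanguage(langs, 16))
	// prints [[en en] [de de] [en] [fr fr]]
}
```

Note that the rule only coalesces adjacent same-language requests. The lone "en" in the middle becomes its own batch and is not merged with the earlier "en" pair. In the real dispatcher the arrival order depends on goroutine scheduling, so the exact batch boundaries vary from run to run.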


@@ -105,4 +105,60 @@ var _ = Describe("batcher", func() {
go func() { <-rep }()
Eventually(dispatched, "2s").Should(Receive(Equal(1)))
})
It("never coalesces requests with different languages into one batch", func() {
// parakeet.cpp's batched C-API takes ONE target_lang per batch, so the
// dispatcher must keep every dispatched batch single-language. Submit a
// mix of languages and assert (a) no batch ever carries more than one
// distinct language and (b) every submitted request still gets a reply
// (the mismatched carry-over is never dropped).
var mu sync.Mutex
var langsPerBatch [][]string
run := func(reqs []*batchRequest) {
seen := map[string]struct{}{}
var distinct []string
for _, r := range reqs {
if _, ok := seen[r.language]; !ok {
seen[r.language] = struct{}{}
distinct = append(distinct, r.language)
}
}
mu.Lock()
langsPerBatch = append(langsPerBatch, distinct)
mu.Unlock()
echoReply(reqs)
}
// Large window + size so the fill loop stays open across submits and the
// language constraint (not the timer) is what splits the batches.
b := newBatcher(16, 200*time.Millisecond, run)
stop := make(chan struct{})
go b.run(stop)
defer close(stop)
langs := []string{"en", "en", "de", "de", "en", "fr", "fr"}
const N = 7
var wg sync.WaitGroup
got := make([]string, N)
for i := 0; i < N; i++ {
wg.Add(1)
go func(i int) {
defer wg.Done()
rep := make(chan batchReply, 1)
b.submit <- &batchRequest{tag: string(rune('a' + i)), language: langs[i], reply: rep}
got[i] = (<-rep).json
}(i)
}
wg.Wait()
mu.Lock()
defer mu.Unlock()
// Invariant: every dispatched batch is single-language.
for _, distinct := range langsPerBatch {
Expect(len(distinct)).To(Equal(1), "a batch coalesced more than one language: %v", distinct)
}
// Liveness: every request got a reply (carry-over never stranded).
for i := 0; i < N; i++ {
Expect(got[i]).To(Equal(string(rune('a' + i))))
}
})
})


@@ -48,6 +48,13 @@ var (
// side reads them as const float*/const int*.
CppTranscribePcmBatchJSON func(ctx uintptr, samplesConcat []float32, nSamples []int32, nClips int32, sampleRate int32, decoder int32) uintptr
// CppTranscribePcmBatchJSONLang is the multilingual variant of the batched
// JSON entry point: identical, plus a trailing target_lang. "" (the model
// default, "auto") is passed for non-prompt models, which ignore it; an
// unknown locale on a prompt model returns 0 and sets last_error. Present
// only in newer libparakeet.so; nil falls back to CppTranscribePcmBatchJSON.
CppTranscribePcmBatchJSONLang func(ctx uintptr, samplesConcat []float32, nSamples []int32, nClips int32, sampleRate int32, decoder int32, targetLang string) uintptr
// Cache-aware streaming (RNN-T) entry points. stream_begin returns 0 for
// non-streaming models. feed/finalize return a malloc'd char* (uintptr,
// freed via CppFreeString); feed writes 1 to *eouOut on an <EOU>/<EOB>.
@@ -55,6 +62,18 @@ var (
CppStreamFeed func(s uintptr, pcm []float32, nSamples int32, eouOut unsafe.Pointer) uintptr
CppStreamFinalize func(s uintptr) uintptr
CppStreamFree func(s uintptr)
// CppStreamBeginLang is the multilingual variant of stream_begin: identical,
// plus a trailing target_lang ("" means the model default). Present only in
// newer libparakeet.so; nil falls back to CppStreamBegin.
CppStreamBeginLang func(ctx uintptr, targetLang string) uintptr
// Streaming JSON variants (ABI v4): feed/finalize returning a malloc'd char*
// JSON document {text,eou,frame_sec,words} (uintptr, freed via CppFreeString)
// so streaming segments can carry per-word timestamps. Present only in newer
// libparakeet.so; nil falls back to the text-only CppStreamFeed/Finalize path.
CppStreamFeedJSON func(s uintptr, pcm []float32, nSamples int32) uintptr
CppStreamFinalizeJSON func(s uintptr) uintptr
)
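Several of these variables follow the same "optional entry point" convention: the function variable stays nil when the symbol is missing from libparakeet.so, and callers fall back to the older variant. A minimal sketch of that pattern, using plain Go function variables in place of the purego-bound ones (all names here are hypothetical stand-ins, and the fake implementations exist only for the demo):

```go
package main

import "fmt"

// Hypothetical stand-ins for the purego-bound function variables.
var (
	streamBegin     func(ctx uintptr) uintptr              // always registered
	streamBeginLang func(ctx uintptr, lang string) uintptr // nil when the symbol is absent
)

// beginStream prefers the language-aware entry point and falls back to the
// plain one when the loaded library does not export it.
func beginStream(ctx uintptr, lang string) uintptr {
	if streamBeginLang != nil {
		return streamBeginLang(ctx, lang)
	}
	return streamBegin(ctx)
}

func main() {
	// Older library: only the plain entry point exists.
	streamBegin = func(ctx uintptr) uintptr { return 1 }
	fmt.Println(beginStream(42, "de"))

	// Newer library: the language-aware variant is registered as well.
	streamBeginLang = func(ctx uintptr, lang string) uintptr { return 2 }
	fmt.Println(beginStream(42, "de"))
}
```

The nil check keeps the call sites independent of the library version: the requested language is silently ignored on an older library instead of failing the call.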
// streamChunkSamples is how much 16 kHz mono PCM we hand to stream_feed per
@@ -72,9 +91,30 @@ const streamChunkSamples = 16000
//
// "start"/"end"/"t" are seconds; "conf" is confidence in (0,1].
type transcriptJSON struct {
Text string `json:"text"`
Words []transcriptWord `json:"words"`
Tokens []transcriptToken `json:"tokens"`
Text string `json:"text"`
FrameSec float64 `json:"frame_sec"`
Words []transcriptWord `json:"words"`
Tokens []transcriptToken `json:"tokens"`
}
// streamFeedJSON mirrors the document returned by
// parakeet_capi_stream_feed_json / parakeet_capi_stream_finalize_json (ABI v5):
//
// {"text":"...","eou":0,"eob":0,"frame_sec":0.080000,
// "words":[{"w":"...","start":0.480,"end":0.640,"conf":0.9100}, ...]}
//
// "text" is the newly-finalized text since the last call; "eou" is 1 when an
// <EOU> (end of utterance) fired this feed and "eob" is 1 when an <EOB>
// (backchannel) fired. ABI v4 conflated the two into "eou"; v5 split them, so
// we read both and treat either as an utterance boundary for segmentation.
// "words" are the words finalized this call with absolute (stream-relative)
// start/end seconds.
type streamFeedJSON struct {
Text string `json:"text"`
Eou int `json:"eou"`
Eob int `json:"eob"`
FrameSec float64 `json:"frame_sec"`
Words []transcriptWord `json:"words"`
}
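As an illustration of why both fields are read, the sketch below decodes a v4-style document (no "eob" key) and a v5-style one with the same struct and treats either flag as a boundary. The struct mirrors the one in the diff. The word struct's JSON tags are assumed from the comment's example document, and the sample payloads are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// word's tags are assumed from the documented shape
// {"w":"...","start":...,"end":...,"conf":...}.
type word struct {
	W     string  `json:"w"`
	Start float64 `json:"start"`
	End   float64 `json:"end"`
	Conf  float64 `json:"conf"`
}

type streamFeed struct {
	Text     string  `json:"text"`
	Eou      int     `json:"eou"`
	Eob      int     `json:"eob"`
	FrameSec float64 `json:"frame_sec"`
	Words    []word  `json:"words"`
}

// decodeFeed parses one feed/finalize document and reports whether it marks
// an utterance boundary (either <EOU> or <EOB>).
func decodeFeed(raw string) (streamFeed, bool, error) {
	var doc streamFeed
	if err := json.Unmarshal([]byte(raw), &doc); err != nil {
		return doc, false, err
	}
	return doc, doc.Eou != 0 || doc.Eob != 0, nil
}

func main() {
	v4 := `{"text":"hello","eou":1,"frame_sec":0.08,"words":[{"w":"hello","start":0.48,"end":0.64,"conf":0.91}]}`
	v5 := `{"text":"mm-hm","eou":0,"eob":1,"frame_sec":0.08,"words":[]}`
	for _, raw := range []string{v4, v5} {
		doc, boundary, err := decodeFeed(raw)
		fmt.Println(doc.Text, boundary, err)
	}
}
```

A missing "eob" key simply leaves `Eob` at zero, so the same struct decodes documents from both ABI versions.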
type transcriptWord struct {
@@ -103,6 +143,10 @@ type ParakeetCpp struct {
engineMu sync.Mutex // sole guard of the one C engine (dispatcher + streaming)
bat *batcher
batStop chan struct{}
// segmentGapFrames is NeMo's segment_gap_threshold in ENCODER FRAMES (model
// YAML option, default 0=off). When >0 it adds NeMo's silence-gap split on
// top of the punctuation split; converted to seconds via the JSON frame_sec.
segmentGapFrames int
}
// Load is the LocalAI gRPC entry point for LoadModel: it calls
@@ -132,6 +176,11 @@ func (p *ParakeetCpp) Load(opts *pb.ModelOptions) error {
if maxWaitMs < 0 {
maxWaitMs = 0
}
// NeMo's segment_gap_threshold (encoder frames, default 0=off). Off by
// default matches NeMo's default (punctuation-only segments); when set it
// additionally splits segments on inter-word silence (see transcriptResultFromDoc).
p.segmentGapFrames = optInt(opts, "segment_gap_threshold", 0)
if CppTranscribePcmBatchJSON != nil {
p.batStop = make(chan struct{})
p.bat = newBatcher(maxSize, time.Duration(maxWaitMs)*time.Millisecond, p.runBatch)
@@ -187,8 +236,19 @@ func (p *ParakeetCpp) runBatch(reqs []*batchRequest) {
if len(reqs) > 0 {
dec = reqs[0].decoder
}
// All requests in a batch share one language (the batcher coalesces only
// same-language requests), so any element's language describes the batch.
lang := ""
if len(reqs) > 0 {
lang = reqs[0].language
}
p.engineMu.Lock()
cstr := CppTranscribePcmBatchJSON(p.ctxPtr, concat, nSamples, int32(len(reqs)), 16000, dec)
var cstr uintptr
if CppTranscribePcmBatchJSONLang != nil {
cstr = CppTranscribePcmBatchJSONLang(p.ctxPtr, concat, nSamples, int32(len(reqs)), 16000, dec, lang)
} else {
cstr = CppTranscribePcmBatchJSON(p.ctxPtr, concat, nSamples, int32(len(reqs)), 16000, dec)
}
p.engineMu.Unlock()
if cstr == 0 {
err := fmt.Errorf("parakeet-cpp: batch transcribe failed: %s", CppLastError(p.ctxPtr))
@@ -226,8 +286,9 @@ func (p *ParakeetCpp) runBatch(reqs []*batchRequest) {
// OpenAI API, whose default is segment-level); token ids always populate
// Segment.Tokens.
//
// translate/diarize/prompt/temperature/language/threads are not applicable to
// parakeet and are ignored; streaming is handled by AudioTranscriptionStream
// translate/diarize/prompt/temperature/threads are not applicable to parakeet
// and are ignored; language is honored on the batched + streaming paths (see
// opts.GetLanguage() below); streaming is handled by AudioTranscriptionStream
// (L2).
func (p *ParakeetCpp) AudioTranscription(ctx context.Context, opts *pb.TranscriptRequest) (pb.TranscriptResult, error) {
if p.ctxPtr == 0 {
@@ -259,7 +320,7 @@ func (p *ParakeetCpp) AudioTranscription(ctx context.Context, opts *pb.Transcrip
if err := json.Unmarshal([]byte(raw), &doc); err != nil {
return pb.TranscriptResult{}, fmt.Errorf("parakeet-cpp: decode transcript json: %w", err)
}
return transcriptResultFromDoc(doc, opts), nil
return transcriptResultFromDoc(doc, opts, p.segmentGapFrames), nil
}
// Batched path: decode to PCM, submit to the batcher, wait for this request's
@@ -271,7 +332,7 @@ func (p *ParakeetCpp) AudioTranscription(ctx context.Context, opts *pb.Transcrip
}
rep := make(chan batchReply, 1)
select {
case p.bat.submit <- &batchRequest{pcm: pcm, decoder: 0, reply: rep}:
case p.bat.submit <- &batchRequest{pcm: pcm, decoder: 0, language: opts.GetLanguage(), reply: rep}:
case <-ctx.Done():
return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
}
@@ -288,34 +349,172 @@ func (p *ParakeetCpp) AudioTranscription(ctx context.Context, opts *pb.Transcrip
if err := json.Unmarshal([]byte(res.json), &doc); err != nil {
return pb.TranscriptResult{}, fmt.Errorf("parakeet-cpp: decode transcript json: %w", err)
}
return transcriptResultFromDoc(doc, opts), nil
return transcriptResultFromDoc(doc, opts, p.segmentGapFrames), nil
}
// segmentSeparators is NeMo's default segment_seperators (sentence-ending
// punctuation). Splitting on these matches NeMo's default segment timestamps.
var segmentSeparators = []rune{'.', '?', '!'}
// transcriptResultFromDoc maps a decoded transcriptJSON to a TranscriptResult,
// synthesising a single whole-clip segment and attaching word timings only when
// the caller requested word granularity. Shared by the batched and direct paths.
func transcriptResultFromDoc(doc transcriptJSON, opts *pb.TranscriptRequest) pb.TranscriptResult {
// grouping words into NeMo-faithful segments (see splitWordsIntoSegments). The
// optional gapFrames (NeMo's segment_gap_threshold, in encoder FRAMES; 0=off)
// additionally splits on inter-word silence; it is converted to a seconds gap
// with the document's frame_sec. Per-segment word timings are attached only when
// the caller requested word granularity; token ids populate each segment's
// Tokens by time-window membership. Shared by the batched and direct paths.
func transcriptResultFromDoc(doc transcriptJSON, opts *pb.TranscriptRequest, gapFrames int) pb.TranscriptResult {
text := strings.TrimSpace(doc.Text)
words := make([]*pb.TranscriptWord, 0, len(doc.Words))
for _, w := range doc.Words {
words = append(words, &pb.TranscriptWord{Start: secondsToNanos(w.Start), End: secondsToNanos(w.End), Text: w.W})
// Frame-unit gap threshold -> seconds (NeMo segment_gap_threshold). 0 = off.
gapSeconds := 0.0
if gapFrames > 0 {
if doc.FrameSec > 0 {
gapSeconds = float64(gapFrames) * doc.FrameSec
} else {
xlog.Warn("parakeet-cpp: segment_gap_threshold set but libparakeet.so " +
"did not report frame_sec; falling back to punctuation-only segments")
}
}
tokens := make([]int32, 0, len(doc.Tokens))
for _, t := range doc.Tokens {
tokens = append(tokens, t.ID)
groups := splitWordsIntoSegments(doc.Words, segmentSeparators, gapSeconds)
if len(groups) == 0 {
// No words (edge case): single whole-clip text segment.
return pb.TranscriptResult{
Text: text,
Segments: []*pb.TranscriptSegment{{Id: 0, Text: text}},
}
}
var segStart, segEnd int64
if len(words) > 0 {
segStart = words[0].Start
segEnd = words[len(words)-1].End
wantWords := wordsRequested(opts.TimestampGranularities)
segments := make([]*pb.TranscriptSegment, 0, len(groups))
for id, group := range groups {
parts := make([]string, len(group))
for i, gw := range group {
parts[i] = gw.W
}
seg := &pb.TranscriptSegment{
Id: int32(id),
Start: secondsToNanos(group[0].Start),
End: secondsToNanos(group[len(group)-1].End),
Text: strings.TrimSpace(strings.Join(parts, " ")),
Tokens: tokensInWindow(doc.Tokens, group[0].Start, group[len(group)-1].End),
}
if wantWords {
ws := make([]*pb.TranscriptWord, len(group))
for i, gw := range group {
ws[i] = &pb.TranscriptWord{Start: secondsToNanos(gw.Start), End: secondsToNanos(gw.End), Text: gw.W}
}
seg.Words = ws
}
segments = append(segments, seg)
}
seg := &pb.TranscriptSegment{Id: 0, Start: segStart, End: segEnd, Text: text, Tokens: tokens}
if wordsRequested(opts.TimestampGranularities) {
seg.Words = words
}
return pb.TranscriptResult{Text: text, Segments: []*pb.TranscriptSegment{seg}}
return pb.TranscriptResult{Text: text, Segments: segments}
}
// splitWordsIntoSegments groups words into segments exactly as NeMo's
// get_segment_offsets does (nemo/collections/asr/parts/utils/timestamp_utils.py).
// Walking the words, it applies one of two rules to each word:
// (1) Gap rule, active when gapSeconds > 0 and the current segment already has
// words: if the gap from the previous word's end to this word's start is
// >= gapSeconds, the segment is closed and the current word STARTS a new one.
// (2) Separator rule, checked only when the gap rule is not active (NeMo's
// elif): a word that ends with (or is) a separator closes the segment
// INCLUDING that word.
// Trailing words flush into a final segment. gapSeconds <= 0 disables the gap
// rule, matching NeMo's default segment_gap_threshold=None (punctuation-only
// segments).
func splitWordsIntoSegments(words []transcriptWord, separators []rune, gapSeconds float64) [][]transcriptWord {
var segments [][]transcriptWord
var cur []transcriptWord
for i, word := range words {
gapActive := gapSeconds > 0 && len(cur) > 0
if gapActive && (word.Start-words[i-1].End) >= gapSeconds {
segments = append(segments, cur)
cur = []transcriptWord{word}
continue
}
if !gapActive && endsWithSeparator(word.W, separators) {
cur = append(cur, word)
segments = append(segments, cur)
cur = nil
continue
}
cur = append(cur, word)
}
if len(cur) > 0 {
segments = append(segments, cur)
}
return segments
}
// endsWithSeparator reports whether w's last rune is in separators (matching
// NeMo's `word[-1] in delims or word in delims`).
func endsWithSeparator(w string, separators []rune) bool {
r := []rune(strings.TrimSpace(w))
if len(r) == 0 {
return false
}
last := r[len(r)-1]
for _, s := range separators {
if last == s {
return true
}
}
return false
}
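A worked example helps with the interaction between the two rules. The two functions below are copied from the diff. The `transcriptWord` type is reduced to the three fields they use, and the words and timings are invented. As the code is written, the separator check is skipped whenever the gap rule is active, so turning on a gap threshold can also change where punctuation splits land.

```go
package main

import (
	"fmt"
	"strings"
)

type transcriptWord struct {
	W          string
	Start, End float64
}

func splitWordsIntoSegments(words []transcriptWord, separators []rune, gapSeconds float64) [][]transcriptWord {
	var segments [][]transcriptWord
	var cur []transcriptWord
	for i, word := range words {
		gapActive := gapSeconds > 0 && len(cur) > 0
		if gapActive && (word.Start-words[i-1].End) >= gapSeconds {
			segments = append(segments, cur)
			cur = []transcriptWord{word}
			continue
		}
		if !gapActive && endsWithSeparator(word.W, separators) {
			cur = append(cur, word)
			segments = append(segments, cur)
			cur = nil
			continue
		}
		cur = append(cur, word)
	}
	if len(cur) > 0 {
		segments = append(segments, cur)
	}
	return segments
}

func endsWithSeparator(w string, separators []rune) bool {
	r := []rune(strings.TrimSpace(w))
	if len(r) == 0 {
		return false
	}
	last := r[len(r)-1]
	for _, s := range separators {
		if last == s {
			return true
		}
	}
	return false
}

// texts joins each segment's words for printing.
func texts(segs [][]transcriptWord) []string {
	out := make([]string, len(segs))
	for i, seg := range segs {
		parts := make([]string, len(seg))
		for j, w := range seg {
			parts[j] = w.W
		}
		out[i] = strings.Join(parts, " ")
	}
	return out
}

func main() {
	seps := []rune{'.', '?', '!'}
	words := []transcriptWord{
		{"hi.", 0.0, 0.3}, {"there", 0.35, 0.6}, {"ok.", 0.65, 0.9},
		{"fine", 0.95, 1.2}, {"bye", 3.0, 3.3},
	}
	fmt.Printf("%q\n", texts(splitWordsIntoSegments(words, seps, 0)))   // ["hi." "there ok." "fine bye"]
	fmt.Printf("%q\n", texts(splitWordsIntoSegments(words, seps, 1.0))) // ["hi." "there ok. fine" "bye"]
}
```

With the gap rule off, "ok." closes its segment. With a 1.0 s threshold, "ok." arrives while the segment already holds "there", so only the gap is tested and the segment stays open until the long silence before "bye".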
// tokensInWindow returns the ids of tokens whose timestamp t falls in
// [start, end] (inclusive), assigning each token to the segment that spans its
// time. The last segment's end is the last word end, so the final token is
// included.
func tokensInWindow(tokens []transcriptToken, start, end float64) []int32 {
var ids []int32
for _, t := range tokens {
if t.T >= start && t.T <= end {
ids = append(ids, t.ID)
}
}
return ids
}
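Because both bounds are inclusive and the windows come from word start/end times, two edge cases follow from the comparison as written: a token whose timestamp equals a shared boundary is returned for both neighbouring segments, and a token that falls in the silence between two segments is returned for neither. The sketch below copies the function from the diff, with `transcriptToken` reduced to the two fields it uses. The timestamps are invented, and whether the engine ever emits such timestamps is not shown in this diff.

```go
package main

import "fmt"

type transcriptToken struct {
	ID int32
	T  float64
}

func tokensInWindow(tokens []transcriptToken, start, end float64) []int32 {
	var ids []int32
	for _, t := range tokens {
		if t.T >= start && t.T <= end {
			ids = append(ids, t.ID)
		}
	}
	return ids
}

func main() {
	toks := []transcriptToken{{1, 0.1}, {2, 0.8}, {3, 0.9}, {4, 1.0}}

	// Contiguous segments sharing the boundary 0.8: token 2 lands in both.
	fmt.Println(tokensInWindow(toks, 0.0, 0.8)) // [1 2]
	fmt.Println(tokensInWindow(toks, 0.8, 1.3)) // [2 3 4]

	// Segments separated by silence (0.8 to 1.0): token 3 lands in neither.
	fmt.Println(tokensInWindow(toks, 1.0, 1.3)) // [4]
}
```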
// streamSegmenter accumulates streaming words into per-utterance segments. The
// model's own turn signals, <EOU> (end of utterance) or <EOB> (backchannel),
// close the current segment; each closed segment takes its start/end from its
// first/last accumulated word.
type streamSegmenter struct {
segs []*pb.TranscriptSegment
cur []transcriptWord
nextID int32
}
func (s *streamSegmenter) add(doc streamFeedJSON) {
s.cur = append(s.cur, doc.Words...)
// Close the segment on either turn signal: <EOU> (end of utterance) or
// <EOB> (backchannel). ABI v4 reported both via "eou"; v5 split them, so we
// OR them here to keep the v4 segmentation boundaries.
if doc.Eou != 0 || doc.Eob != 0 {
s.flush()
}
}
func (s *streamSegmenter) flush() {
if len(s.cur) == 0 {
return
}
parts := make([]string, len(s.cur))
for i, w := range s.cur {
parts[i] = w.W
}
s.segs = append(s.segs, &pb.TranscriptSegment{
Id: s.nextID,
Start: secondsToNanos(s.cur[0].Start),
End: secondsToNanos(s.cur[len(s.cur)-1].End),
Text: strings.TrimSpace(strings.Join(parts, " ")),
})
s.nextID++
s.cur = nil
}
func (s *streamSegmenter) segments() []*pb.TranscriptSegment { return s.segs }
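The segmenter's behaviour over a sequence of feeds can be traced with a simplified, self-contained version. This sketch keeps the same add/flush logic but swaps the protobuf segment for a plain struct and keeps times in seconds. The type names and the sample feeds are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

type word struct {
	W          string
	Start, End float64
}

// feedDoc carries only the fields the segmenter looks at.
type feedDoc struct {
	Words    []word
	Eou, Eob int
}

type segment struct {
	ID         int32
	Start, End float64
	Text       string
}

type segmenter struct {
	segs   []segment
	cur    []word
	nextID int32
}

// add accumulates the feed's words and closes the segment on <EOU> or <EOB>.
func (s *segmenter) add(doc feedDoc) {
	s.cur = append(s.cur, doc.Words...)
	if doc.Eou != 0 || doc.Eob != 0 {
		s.flush()
	}
}

// flush emits the accumulated words as one segment; a no-op when empty.
func (s *segmenter) flush() {
	if len(s.cur) == 0 {
		return
	}
	parts := make([]string, len(s.cur))
	for i, w := range s.cur {
		parts[i] = w.W
	}
	s.segs = append(s.segs, segment{
		ID:    s.nextID,
		Start: s.cur[0].Start,
		End:   s.cur[len(s.cur)-1].End,
		Text:  strings.Join(parts, " "),
	})
	s.nextID++
	s.cur = nil
}

func main() {
	var s segmenter
	s.add(feedDoc{Words: []word{{"hello", 0.0, 0.4}}})
	s.add(feedDoc{Words: []word{{"world", 0.5, 0.9}}, Eou: 1}) // closes "hello world"
	s.add(feedDoc{Words: []word{{"mm-hm", 1.2, 1.5}}, Eob: 1}) // backchannel closes too
	s.add(feedDoc{Eou: 1})                                     // boundary with no words: nothing emitted
	s.add(feedDoc{Words: []word{{"bye", 2.0, 2.3}}})
	s.flush() // trailing words that never saw a boundary
	for _, seg := range s.segs {
		fmt.Printf("%d [%.1f, %.1f] %s\n", seg.ID, seg.Start, seg.End, seg.Text)
	}
}
```

The final explicit flush matters: without it, words fed after the last boundary would never appear in a segment, which is why the streaming path calls it once more after finalize.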
// wordsRequested reports whether the caller asked for word-level timestamps.
// The OpenAI transcription API gates word timings behind
// timestamp_granularities[] containing "word" and defaults to segment-level
@@ -361,7 +560,12 @@ func (p *ParakeetCpp) AudioTranscriptionStream(ctx context.Context, opts *pb.Tra
return status.Error(codes.Canceled, "transcription cancelled")
}
stream := CppStreamBegin(p.ctxPtr)
var stream uintptr
if CppStreamBeginLang != nil {
stream = CppStreamBeginLang(p.ctxPtr, opts.GetLanguage())
} else {
stream = CppStreamBegin(p.ctxPtr)
}
if stream == 0 {
// Not a cache-aware streaming model: run a normal offline
// transcription and emit it as one delta + a closing final result.
@@ -390,6 +594,14 @@ func (p *ParakeetCpp) AudioTranscriptionStream(ctx context.Context, opts *pb.Tra
return err
}
// ABI v4: when the streaming JSON entry points are present, drive them so the
// per-utterance segments carry per-word start/end timestamps. Falls through to
// the text-only loop below against an older libparakeet.so. Runs under the
// engineMu already held above.
if CppStreamFeedJSON != nil {
return p.streamJSON(ctx, stream, data, duration, results)
}
var (
full strings.Builder
segText strings.Builder
@@ -466,6 +678,72 @@ func (p *ParakeetCpp) AudioTranscriptionStream(ctx context.Context, opts *pb.Tra
return nil
}
// streamJSON drives the streaming JSON entry points (present since ABI v4): each
// feed/finalize returns a {text,eou,eob,frame_sec,words} document. The
// newly-finalized text is emitted as a delta (unchanged streaming contract)
// while words are accumulated into per-utterance segments (closed on <EOU> or
// <EOB>) so the closing FinalResult carries timestamped segments. Runs under
// engineMu (already held by the caller).
func (p *ParakeetCpp) streamJSON(ctx context.Context, stream uintptr, data []float32,
duration float32, results chan *pb.TranscriptStreamResponse) error {
var (
full strings.Builder
seg streamSegmenter
)
// consume frees the malloc'd char* (a 0 return is an error), parses the JSON,
// emits the delta, and routes words through the segmenter.
consume := func(ret uintptr) error {
if ret == 0 {
msg := CppLastError(p.ctxPtr)
if msg == "" {
msg = "unknown error"
}
return fmt.Errorf("parakeet-cpp: stream feed/finalize failed: %s", msg)
}
raw := goStringFromCPtr(ret)
CppFreeString(ret)
var doc streamFeedJSON
if err := json.Unmarshal([]byte(raw), &doc); err != nil {
return fmt.Errorf("parakeet-cpp: decode stream json: %w", err)
}
if doc.Text != "" {
full.WriteString(doc.Text)
results <- &pb.TranscriptStreamResponse{Delta: doc.Text}
}
seg.add(doc)
return nil
}
for off := 0; off < len(data); off += streamChunkSamples {
if err := ctx.Err(); err != nil {
return status.Error(codes.Canceled, "transcription cancelled")
}
end := min(off+streamChunkSamples, len(data))
chunk := data[off:end]
if err := consume(CppStreamFeedJSON(stream, chunk, int32(len(chunk)))); err != nil {
return err
}
}
if err := consume(CppStreamFinalizeJSON(stream)); err != nil {
return err
}
seg.flush() // close any trailing utterance that never saw an EOU
text := strings.TrimSpace(full.String())
segments := seg.segments()
if len(segments) == 0 && text != "" {
segments = append(segments, &pb.TranscriptSegment{Id: 0, Text: text})
}
results <- &pb.TranscriptStreamResponse{
FinalResult: &pb.TranscriptResult{
Text: text,
Segments: segments,
Duration: duration,
},
}
return nil
}
// decodeWavMono16k converts any input audio to 16 kHz mono PCM and returns the
// float samples plus the clip duration in seconds. Mirrors the whisper
// backend: utils.AudioToWav (ffmpeg) normalises rate/channels, go-audio


@@ -53,6 +53,10 @@ func ensureLibLoaded() {
purego.RegisterLibFunc(&CppStreamFeed, lib, "parakeet_capi_stream_feed")
purego.RegisterLibFunc(&CppStreamFinalize, lib, "parakeet_capi_stream_finalize")
purego.RegisterLibFunc(&CppStreamFree, lib, "parakeet_capi_stream_free")
if sym, err := purego.Dlsym(lib, "parakeet_capi_stream_feed_json"); err == nil && sym != 0 {
purego.RegisterLibFunc(&CppStreamFeedJSON, lib, "parakeet_capi_stream_feed_json")
purego.RegisterLibFunc(&CppStreamFinalizeJSON, lib, "parakeet_capi_stream_finalize_json")
}
purego.RegisterLibFunc(&CppFreeString, lib, "parakeet_capi_free_string")
purego.RegisterLibFunc(&CppLastError, lib, "parakeet_capi_last_error")
})
@@ -107,13 +111,22 @@ var _ = Describe("ParakeetCpp", func() {
Expect(err).ToNot(HaveOccurred())
Expect(strings.TrimSpace(res.Text)).ToNot(BeEmpty(),
"expected non-empty transcript for %s", audioPath)
Expect(res.Segments).To(HaveLen(1),
"synthesises a single whole-clip segment")
Expect(res.Segments[0].Text).To(Equal(res.Text),
"single segment text must equal the top-level text")
// Default (no granularities) is segment-level: no per-word timings.
Expect(res.Segments[0].Words).To(BeEmpty(),
"word timings are opt-in via timestamp_granularities")
// NeMo-faithful segmentation: one or more punctuation-delimited
// segments, each with text and a monotonically-advancing time span.
Expect(res.Segments).ToNot(BeEmpty(), "expected at least one segment")
var prevEnd int64
for i, seg := range res.Segments {
Expect(strings.TrimSpace(seg.Text)).ToNot(BeEmpty(),
"segment %d must have text", i)
Expect(seg.End).To(BeNumerically(">=", seg.Start),
"segment %d end must not precede its start", i)
Expect(seg.Start).To(BeNumerically(">=", prevEnd),
"segments must be in time order")
prevEnd = seg.End
// Default (no granularities) is segment-level: no per-word timings.
Expect(seg.Words).To(BeEmpty(),
"word timings are opt-in via timestamp_granularities")
}
})
It("emits word-level timestamps when granularity=word", func() {
@@ -129,15 +142,28 @@ var _ = Describe("ParakeetCpp", func() {
TimestampGranularities: []string{"word"},
})
Expect(err).ToNot(HaveOccurred())
-Expect(res.Segments).To(HaveLen(1))
-seg := res.Segments[0]
-Expect(seg.Words).ToNot(BeEmpty(),
-"expected per-word timestamps with granularity=word")
-// Monotonic, non-negative timings spanning the segment.
-Expect(seg.Words[0].Start).To(BeNumerically(">=", int64(0)))
-Expect(seg.End).To(BeNumerically(">=", seg.Start))
-Expect(seg.Words[len(seg.Words)-1].End).To(Equal(seg.End),
-"segment end tracks the last word")
+Expect(res.Segments).ToNot(BeEmpty())
+// With word granularity every segment carries its own words, and each
+// segment's span tracks its first/last word; word starts advance
+// monotonically across the whole transcript.
+totalWords := 0
+var prevStart int64 = -1
+for i, seg := range res.Segments {
+Expect(seg.Words).ToNot(BeEmpty(),
+"segment %d must carry per-word timestamps with granularity=word", i)
+Expect(seg.Start).To(Equal(seg.Words[0].Start),
+"segment %d start tracks its first word", i)
+Expect(seg.End).To(Equal(seg.Words[len(seg.Words)-1].End),
+"segment %d end tracks its last word", i)
+for _, w := range seg.Words {
+Expect(w.End).To(BeNumerically(">=", w.Start))
+Expect(w.Start).To(BeNumerically(">=", prevStart))
+prevStart = w.Start
+totalWords++
+}
+}
+Expect(totalWords).To(BeNumerically(">", 0))
+Expect(res.Segments[0].Words[0].Start).To(BeNumerically(">=", int64(0)))
})
})


@@ -65,6 +65,25 @@ func main() {
purego.RegisterLibFunc(&CppTranscribePcmBatchJSON, lib, "parakeet_capi_transcribe_pcm_batch_json")
}
// Per-request language variants (multilingual nemotron). Same probe pattern:
// present only in libparakeet.so built with multilingual support, so the
// backend still loads against an older library and falls back to the
// non-lang batched + streaming entry points (model default / "auto").
if sym, err := purego.Dlsym(lib, "parakeet_capi_transcribe_pcm_batch_json_lang"); err == nil && sym != 0 {
purego.RegisterLibFunc(&CppTranscribePcmBatchJSONLang, lib, "parakeet_capi_transcribe_pcm_batch_json_lang")
}
if sym, err := purego.Dlsym(lib, "parakeet_capi_stream_begin_lang"); err == nil && sym != 0 {
purego.RegisterLibFunc(&CppStreamBeginLang, lib, "parakeet_capi_stream_begin_lang")
}
// Streaming JSON entry points (ABI v4): surface per-word timestamps on the
// streaming path. Same probe pattern; absent in older libparakeet.so, where
// the backend falls back to the text-only streaming feed.
if sym, err := purego.Dlsym(lib, "parakeet_capi_stream_feed_json"); err == nil && sym != 0 {
purego.RegisterLibFunc(&CppStreamFeedJSON, lib, "parakeet_capi_stream_feed_json")
purego.RegisterLibFunc(&CppStreamFinalizeJSON, lib, "parakeet_capi_stream_finalize_json")
}
fmt.Fprintf(os.Stderr, "[parakeet-cpp] ABI=%d\n", CppAbiVersion())
flag.Parse()
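
The probe pattern used in this hunk (look the symbol up first, register it only when it resolves) can be illustrated without a real shared library. The sketch below is hypothetical: `symbolTable` and `bindOptional` are stand-ins invented for illustration and are not part of the backend or of purego.

```go
package main

import "errors"

// symbolTable is a hypothetical stand-in for a loaded shared library:
// it maps exported symbol names to (fake) addresses.
type symbolTable map[string]uintptr

var errNoSymbol = errors.New("symbol not found")

// lookup mimics a dlsym-style probe: it returns an error when the symbol
// is absent instead of aborting the program.
func (t symbolTable) lookup(name string) (uintptr, error) {
	addr, ok := t[name]
	if !ok {
		return 0, errNoSymbol
	}
	return addr, nil
}

// bindOptional calls bind only when the symbol resolves to a non-zero
// address, and reports whether the optional entry point is available.
// Callers keep using the older entry point when it returns false.
func bindOptional(t symbolTable, name string, bind func(addr uintptr)) bool {
	addr, err := t.lookup(name)
	if err != nil || addr == 0 {
		return false
	}
	bind(addr)
	return true
}
```

The point of the pattern is that a missing optional symbol becomes a boolean the caller can branch on, so a backend built against a newer ABI can still load an older library.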


@@ -0,0 +1,140 @@
package main
import (
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func tw(text string, start, end float64) transcriptWord {
return transcriptWord{W: text, Start: start, End: end}
}
var _ = Describe("splitWordsIntoSegments (NeMo get_segment_offsets parity)", func() {
seps := []rune{'.', '?', '!'}
It("splits on sentence-ending punctuation, including the delimiter word", func() {
words := []transcriptWord{tw("hello", 0, 0.4), tw("world.", 0.4, 0.8), tw("bye", 1.0, 1.3)}
segs := splitWordsIntoSegments(words, seps, 0)
Expect(segs).To(HaveLen(2))
Expect(segs[0]).To(HaveLen(2))
Expect(segs[0][1].W).To(Equal("world."))
Expect(segs[1]).To(HaveLen(1))
Expect(segs[1][0].W).To(Equal("bye"))
})
It("keeps a single segment with no terminal punctuation and gap off", func() {
words := []transcriptWord{tw("a", 0, 0.2), tw("b", 0.2, 0.4), tw("c", 5.0, 5.2)}
segs := splitWordsIntoSegments(words, seps, 0)
Expect(segs).To(HaveLen(1))
})
It("splits on the gap rule when enabled, the gapped word starting the next segment", func() {
words := []transcriptWord{tw("a", 0, 0.2), tw("b", 0.2, 0.4), tw("c", 5.0, 5.2)}
segs := splitWordsIntoSegments(words, seps, 1.0) // c is 4.6s after b
Expect(segs).To(HaveLen(2))
Expect(segs[0]).To(HaveLen(2)) // a b
Expect(segs[1]).To(HaveLen(1)) // c
Expect(segs[1][0].W).To(Equal("c"))
})
It("checks the gap rule before punctuation (NeMo elif order)", func() {
// "b." would terminate, but c is far after it -> gap closes [a b.] at b.
words := []transcriptWord{tw("a", 0, 0.2), tw("b.", 0.2, 0.4), tw("c", 9.0, 9.2)}
segs := splitWordsIntoSegments(words, seps, 1.0)
Expect(segs).To(HaveLen(2))
Expect(segs[0]).To(HaveLen(2))
Expect(segs[1][0].W).To(Equal("c"))
})
It("still splits on punctuation when the gap rule is enabled but does not fire", func() {
words := []transcriptWord{tw("hi.", 0, 0.4), tw("bye", 0.4, 0.8)}
segs := splitWordsIntoSegments(words, seps, 5.0) // gap never reached
Expect(segs).To(HaveLen(2))
Expect(segs[0][0].W).To(Equal("hi."))
})
It("returns nothing for empty input", func() {
Expect(splitWordsIntoSegments(nil, seps, 0)).To(BeEmpty())
})
})
var _ = Describe("transcriptResultFromDoc (multi-segment)", func() {
doc := transcriptJSON{
Text: "hello world. bye now",
FrameSec: 0.08,
Words: []transcriptWord{
{W: "hello", Start: 0.0, End: 0.4},
{W: "world.", Start: 0.4, End: 0.8},
{W: "bye", Start: 1.0, End: 1.3},
{W: "now", Start: 1.3, End: 1.6},
},
Tokens: []transcriptToken{{ID: 1, T: 0.1}, {ID: 2, T: 0.5}, {ID: 3, T: 1.1}, {ID: 4, T: 1.4}},
}
It("emits one segment per punctuation-delimited group with start/end", func() {
res := transcriptResultFromDoc(doc, &pb.TranscriptRequest{}, 0)
Expect(res.Segments).To(HaveLen(2))
Expect(res.Segments[0].Text).To(Equal("hello world."))
Expect(res.Segments[0].Start).To(Equal(int64(0)))
Expect(res.Segments[0].End).To(Equal(secondsToNanos(0.8)))
Expect(res.Segments[1].Text).To(Equal("bye now"))
Expect(res.Segments[1].Start).To(Equal(secondsToNanos(1.0)))
Expect(res.Segments[1].Id).To(Equal(int32(1)))
})
It("assigns tokens to the segment whose time window contains them", func() {
res := transcriptResultFromDoc(doc, &pb.TranscriptRequest{}, 0)
Expect(res.Segments[0].Tokens).To(Equal([]int32{1, 2}))
Expect(res.Segments[1].Tokens).To(Equal([]int32{3, 4}))
})
It("attaches per-segment words only when word granularity requested", func() {
plain := transcriptResultFromDoc(doc, &pb.TranscriptRequest{}, 0)
Expect(plain.Segments[0].Words).To(BeEmpty())
withWords := transcriptResultFromDoc(doc, &pb.TranscriptRequest{TimestampGranularities: []string{"word"}}, 0)
Expect(withWords.Segments[0].Words).To(HaveLen(2))
})
It("falls back to a single text segment when there are no words", func() {
res := transcriptResultFromDoc(transcriptJSON{Text: "hi"}, &pb.TranscriptRequest{}, 0)
Expect(res.Segments).To(HaveLen(1))
Expect(res.Segments[0].Text).To(Equal("hi"))
})
})
var _ = Describe("streaming segment assembly", func() {
It("closes a segment with start/end from its words on EOU", func() {
acc := &streamSegmenter{}
acc.add(streamFeedJSON{Text: "hello world", Eou: 1, Words: []transcriptWord{
{W: "hello", Start: 0.0, End: 0.4}, {W: "world", Start: 0.4, End: 0.9},
}})
segs := acc.segments()
Expect(segs).To(HaveLen(1))
Expect(segs[0].Text).To(Equal("hello world"))
Expect(segs[0].Start).To(Equal(int64(0)))
Expect(segs[0].End).To(Equal(secondsToNanos(0.9)))
})
It("buffers words across feeds until EOU", func() {
acc := &streamSegmenter{}
acc.add(streamFeedJSON{Text: "hi", Eou: 0, Words: []transcriptWord{{W: "hi", Start: 0, End: 0.3}}})
Expect(acc.segments()).To(BeEmpty())
acc.add(streamFeedJSON{Text: "there", Eou: 1, Words: []transcriptWord{{W: "there", Start: 0.3, End: 0.7}}})
Expect(acc.segments()).To(HaveLen(1))
Expect(acc.segments()[0].Text).To(Equal("hi there"))
})
// ABI v5 split <EOB> (backchannel) out of the "eou" flag into its own "eob"
// field; a backchannel must still close the segment as it did in v4.
It("closes a segment on EOB (backchannel) too", func() {
acc := &streamSegmenter{}
acc.add(streamFeedJSON{Text: "uh huh", Eou: 0, Eob: 1, Words: []transcriptWord{
{W: "uh", Start: 0.0, End: 0.2}, {W: "huh", Start: 0.2, End: 0.5},
}})
segs := acc.segments()
Expect(segs).To(HaveLen(1))
Expect(segs[0].Text).To(Equal("uh huh"))
Expect(segs[0].End).To(Equal(secondsToNanos(0.5)))
})
})
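
The splitting rules these specs pin down (gap rule checked first, then sentence-ending punctuation, with the delimiter word kept in the segment it closes) can be sketched as follows. This is a simplified illustration written from the test expectations only; `word` and `splitSegments` are hypothetical names, and the real `splitWordsIntoSegments` may differ in edge cases.

```go
package main

import (
	"strings"
	"unicode/utf8"
)

// word is a hypothetical stand-in for the backend's transcriptWord.
type word struct {
	W          string
	Start, End float64 // seconds
}

// splitSegments groups words into segments. When gap > 0 and a word starts at
// least gap seconds after the previous word in the open segment ended, the
// open segment is closed and the word starts a new one. After that, a word
// whose last rune is in seps closes the segment it belongs to.
func splitSegments(words []word, seps string, gap float64) [][]word {
	var segs [][]word
	var cur []word
	for _, w := range words {
		// Gap rule first; it only applies while a segment is open.
		if gap > 0 && len(cur) > 0 && w.Start-cur[len(cur)-1].End >= gap {
			segs = append(segs, cur)
			cur = nil
		}
		cur = append(cur, w)
		// Punctuation rule: the delimiter word stays inside the closed segment.
		last, _ := utf8.DecodeLastRuneInString(w.W)
		if strings.ContainsRune(seps, last) {
			segs = append(segs, cur)
			cur = nil
		}
	}
	if len(cur) > 0 {
		segs = append(segs, cur) // flush the trailing, unterminated segment
	}
	return segs
}
```

With `gap` set to 0 the gap rule is disabled, so a long pause alone never splits a segment.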


@@ -3,35 +3,36 @@ project(goqwen3ttscpp LANGUAGES C CXX)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
-set(QWEN3TTS_DIR ${CMAKE_CURRENT_SOURCE_DIR}/sources/qwen3-tts.cpp)
+set(QWENTTS_DIR ${CMAKE_CURRENT_SOURCE_DIR}/sources/qwentts.cpp)
# Override upstream's CMAKE_CUDA_ARCHITECTURES before add_subdirectory.
if(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
set(CMAKE_CUDA_ARCHITECTURES "75-virtual;80-virtual;86-real;89-real")
endif()
-# Build ggml from the upstream's submodule FIRST, so that ggml/ggml-base/ggml-cpu
-# CMake targets exist when the upstream project references them by name.
-# The upstream CMakeLists.txt uses target_link_libraries(... ggml ggml-base ggml-cpu)
-# with target_link_directories pointing at a pre-built ggml/build/. By adding ggml
-# as a subdirectory here, CMake resolves those names as targets instead.
-add_subdirectory(${QWEN3TTS_DIR}/ggml ggml EXCLUDE_FROM_ALL)
+# Add the upstream project. Its own CMakeLists adds ggml + cpp-httplib + yyjson
+# and builds qwen-core (STATIC, the qt_* impl). EXCLUDE_FROM_ALL keeps its CLI
+# tools / tts-server / tests from building unless referenced.
+add_subdirectory(${QWENTTS_DIR} qwentts EXCLUDE_FROM_ALL)
-# Now add the upstream project
-add_subdirectory(${QWEN3TTS_DIR} qwen3tts EXCLUDE_FROM_ALL)
+# Upstream generates version.h into its own CMAKE_CURRENT_BINARY_DIR and adds
+# the top-level ${CMAKE_BINARY_DIR} to qwen-core's include path. Under
+# add_subdirectory those two dirs differ (<build>/qwentts vs <build>), so
+# qwen.cpp cannot find version.h. Point qwen-core at the subproject binary dir
+# where version.h is actually generated. (Fix lives here, never in the fetched
+# upstream checkout.)
+target_include_directories(qwen-core PRIVATE ${CMAKE_BINARY_DIR}/qwentts)
add_library(goqwen3ttscpp MODULE cpp/goqwen3ttscpp.cpp)
-target_link_libraries(goqwen3ttscpp PRIVATE qwen3_tts)
+target_link_libraries(goqwen3ttscpp PRIVATE qwen-core)
-target_include_directories(goqwen3ttscpp PRIVATE ${QWEN3TTS_DIR}/src)
-target_include_directories(goqwen3ttscpp SYSTEM PRIVATE ${QWEN3TTS_DIR}/ggml/include)
+target_include_directories(goqwen3ttscpp PRIVATE ${QWENTTS_DIR}/src)
+target_include_directories(goqwen3ttscpp SYSTEM PRIVATE ${QWENTTS_DIR}/ggml/include)
-# Link GPU backends if available
-foreach(backend blas cuda metal vulkan)
+# Link GPU backends if the upstream ggml created them.
+foreach(backend blas cuda metal vulkan sycl)
if(TARGET ggml-${backend})
target_link_libraries(goqwen3ttscpp PRIVATE ggml-${backend})
string(TOUPPER ${backend} BACKEND_UPPER)
target_compile_definitions(goqwen3ttscpp PRIVATE QWEN3TTS_HAVE_${BACKEND_UPPER})
if(backend STREQUAL "cuda")
find_package(CUDAToolkit QUIET)
if(CUDAToolkit_FOUND)
@@ -44,12 +45,8 @@ endforeach()
if(MSVC)
target_compile_options(goqwen3ttscpp PRIVATE /W4 /wd4100 /wd4505)
else()
-target_compile_options(goqwen3ttscpp PRIVATE -Wall -Wextra -Wshadow -Wconversion
--Wno-unused-parameter -Wno-unused-function -Wno-sign-conversion)
-endif()
-if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)
-target_link_libraries(goqwen3ttscpp PRIVATE stdc++fs)
+target_compile_options(goqwen3ttscpp PRIVATE -Wall -Wextra
+-Wno-unused-parameter -Wno-unused-function)
endif()
set_property(TARGET goqwen3ttscpp PROPERTY CXX_STANDARD 17)


@@ -6,9 +6,9 @@ GOCMD?=go
GO_TAGS?=
JOBS?=$(shell nproc --ignore=1)
-# qwen3-tts.cpp version
-QWEN3TTS_REPO?=https://github.com/predict-woo/qwen3-tts.cpp
-QWEN3TTS_CPP_VERSION?=136e5d36c17083da0321fd96512dc7b263f94a44
+# qwentts.cpp version
+QWEN3TTS_REPO?=https://github.com/ServeurpersoCom/qwentts.cpp
+QWEN3TTS_CPP_VERSION?=0bf4a18b22e8bb8718d95294e9f7f45c0d4270a4
SO_TARGET?=libgoqwen3ttscpp.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
@@ -49,9 +49,9 @@ ifeq ($(BUILD_TYPE),sycl_f32)
-DCMAKE_CXX_COMPILER=icpx
endif
-sources/qwen3-tts.cpp:
-mkdir -p sources/qwen3-tts.cpp
-cd sources/qwen3-tts.cpp && \
+sources/qwentts.cpp:
+mkdir -p sources/qwentts.cpp
+cd sources/qwentts.cpp && \
git init && \
git remote add origin $(QWEN3TTS_REPO) && \
git fetch origin && \
@@ -78,7 +78,7 @@ package: qwen3-tts-cpp
build: package
clean: purge
-rm -rf libgoqwen3ttscpp*.so package sources/qwen3-tts.cpp qwen3-tts-cpp
+rm -rf libgoqwen3ttscpp*.so package sources/qwentts.cpp qwen3-tts-cpp
purge:
rm -rf build*
@@ -88,24 +88,24 @@ purge:
# Build all variants (Linux only)
ifeq ($(UNAME_S),Linux)
-libgoqwen3ttscpp-avx.so: sources/qwen3-tts.cpp
+libgoqwen3ttscpp-avx.so: sources/qwentts.cpp
$(info ${GREEN}I qwen3-tts-cpp build info:avx${RESET})
SO_TARGET=libgoqwen3ttscpp-avx.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgoqwen3ttscpp-custom
rm -rf build-libgoqwen3ttscpp-avx.so
-libgoqwen3ttscpp-avx2.so: sources/qwen3-tts.cpp
+libgoqwen3ttscpp-avx2.so: sources/qwentts.cpp
$(info ${GREEN}I qwen3-tts-cpp build info:avx2${RESET})
SO_TARGET=libgoqwen3ttscpp-avx2.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgoqwen3ttscpp-custom
rm -rf build-libgoqwen3ttscpp-avx2.so
-libgoqwen3ttscpp-avx512.so: sources/qwen3-tts.cpp
+libgoqwen3ttscpp-avx512.so: sources/qwentts.cpp
$(info ${GREEN}I qwen3-tts-cpp build info:avx512${RESET})
SO_TARGET=libgoqwen3ttscpp-avx512.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgoqwen3ttscpp-custom
rm -rf build-libgoqwen3ttscpp-avx512.so
endif
# Build fallback variant (all platforms)
-libgoqwen3ttscpp-fallback.so: sources/qwen3-tts.cpp
+libgoqwen3ttscpp-fallback.so: sources/qwentts.cpp
$(info ${GREEN}I qwen3-tts-cpp build info:fallback${RESET})
SO_TARGET=libgoqwen3ttscpp-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgoqwen3ttscpp-custom
rm -rf build-libgoqwen3ttscpp-fallback.so


@@ -0,0 +1,128 @@
package main
import (
"bytes"
"encoding/binary"
"fmt"
"os"
"runtime"
"github.com/go-audio/audio"
"github.com/go-audio/wav"
)
const qwen3ttsSampleRate = 24000
// wavHeader24k returns a 44-byte WAV header for a streaming 24 kHz mono 16-bit
// PCM stream, with placeholder (0xFFFFFFFF) sizes since the total length is
// unknown up front. Emitted as the first chunk of TTSStream so the HTTP layer
// receives a self-describing WAV (the gRPC TTSStream path never sets Message,
// so the backend owns the header - see core/backend/tts.go:ModelTTSStream).
func wavHeader24k() []byte {
var buf bytes.Buffer
w := func(v any) { _ = binary.Write(&buf, binary.LittleEndian, v) }
buf.WriteString("RIFF")
w(uint32(0xFFFFFFFF))
buf.WriteString("WAVE")
buf.WriteString("fmt ")
w(uint32(16)) // Subchunk1Size
w(uint16(1)) // PCM
w(uint16(1)) // mono
w(uint32(qwen3ttsSampleRate)) // sample rate
w(uint32(qwen3ttsSampleRate * 2)) // byte rate = SR * blockAlign
w(uint16(2)) // block align (16-bit mono)
w(uint16(16)) // bits per sample
buf.WriteString("data")
w(uint32(0xFFFFFFFF))
return buf.Bytes()
}
// floatToPCM16LE clamps each sample to [-1,1] and encodes it as little-endian
// signed 16-bit PCM.
func floatToPCM16LE(samples []float32) []byte {
out := make([]byte, len(samples)*2)
for i, s := range samples {
if s > 1 {
s = 1
} else if s < -1 {
s = -1
}
v := int16(s * 32767)
out[i*2] = byte(v)
out[i*2+1] = byte(v >> 8)
}
return out
}
// writeWAV24k writes samples as a finalized 24 kHz mono 16-bit WAV at dst.
func writeWAV24k(dst string, samples []float32) error {
f, err := os.Create(dst)
if err != nil {
return fmt.Errorf("qwen3-tts: create %q: %w", dst, err)
}
enc := wav.NewEncoder(f, qwen3ttsSampleRate, 16, 1, 1)
ints := make([]int, len(samples))
for i, s := range samples {
if s > 1 {
s = 1
} else if s < -1 {
s = -1
}
ints[i] = int(s * 32767)
}
b := &audio.IntBuffer{
Format: &audio.Format{NumChannels: 1, SampleRate: qwen3ttsSampleRate},
Data: ints,
SourceBitDepth: 16,
}
if err := enc.Write(b); err != nil {
_ = enc.Close()
_ = f.Close()
return fmt.Errorf("qwen3-tts: encode WAV: %w", err)
}
if err := enc.Close(); err != nil {
_ = f.Close()
return fmt.Errorf("qwen3-tts: finalize WAV: %w", err)
}
return f.Close()
}
// readWAVAsFloat decodes a WAV file to a mono float32 slice in [-1,1] for use
// as cloning reference audio. Multi-channel input is downmixed by averaging
// the channels. The sample rate is not converted: qwentts expects 24 kHz, so
// callers should supply 24 kHz reference clips.
func readWAVAsFloat(path string) ([]float32, error) {
f, err := os.Open(path)
if err != nil {
return nil, fmt.Errorf("qwen3-tts: open ref %q: %w", path, err)
}
defer func() { _ = f.Close() }()
dec := wav.NewDecoder(f)
buf, err := dec.FullPCMBuffer()
if err != nil {
return nil, fmt.Errorf("qwen3-tts: decode ref %q: %w", path, err)
}
ch := int(buf.Format.NumChannels)
if ch < 1 {
ch = 1
}
bitDepth := int(buf.SourceBitDepth)
if bitDepth == 0 {
bitDepth = 16
}
scale := float32(int64(1) << uint(bitDepth-1))
n := len(buf.Data) / ch
out := make([]float32, n)
for i := 0; i < n; i++ {
var acc int
for c := 0; c < ch; c++ {
acc += buf.Data[i*ch+c]
}
out[i] = float32(acc) / float32(ch) / scale
}
return out, nil
}
// runtimeKeepAlive prevents the GC from reclaiming the reference-audio slice
// while its backing pointer is in use across the C call.
func runtimeKeepAlive(v any) { runtime.KeepAlive(v) }


@@ -0,0 +1,54 @@
package main
import (
"path/filepath"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// These specs pin the voice-selection logic in resolveRequest, in particular
// the config-level audio_path (tts.audio_path -> ModelOptions.AudioPath) being
// used as the default voice-cloning reference. No model/C library is needed:
// resolveRequest only reads the reference WAV via readWAVAsFloat (pure Go).
var _ = Describe("resolveRequest voice/clone selection", func() {
var dir, refWav string
BeforeEach(func() {
dir = GinkgoT().TempDir()
refWav = filepath.Join(dir, "ref.wav")
// 0.5s of non-silent 24kHz mono audio as a clone reference.
samples := make([]float32, qwen3ttsSampleRate/2)
for i := range samples {
samples[i] = 0.1
}
Expect(writeWAV24k(refWav, samples)).To(Succeed())
})
It("uses the config audio_path as the clone reference when Voice is empty", func() {
q := &Qwen3TtsCpp{audioPath: refWav}
_, _, speaker, _, ref, _, err := q.resolveRequest(&pb.TTSRequest{Text: "hi"})
Expect(err).ToNot(HaveOccurred())
Expect(speaker).To(BeEmpty())
Expect(len(ref)).To(Equal(qwen3ttsSampleRate / 2))
})
It("lets a per-request audio Voice override audio_path", func() {
other := filepath.Join(dir, "other.wav")
Expect(writeWAV24k(other, make([]float32, 100))).To(Succeed())
q := &Qwen3TtsCpp{audioPath: refWav}
_, _, speaker, _, ref, _, err := q.resolveRequest(&pb.TTSRequest{Text: "hi", Voice: other})
Expect(err).ToNot(HaveOccurred())
Expect(speaker).To(BeEmpty())
Expect(len(ref)).To(Equal(100))
})
It("does not trigger audio_path cloning for a named-speaker Voice", func() {
q := &Qwen3TtsCpp{audioPath: refWav}
_, _, speaker, _, ref, _, err := q.resolveRequest(&pb.TTSRequest{Text: "hi", Voice: "serena"})
Expect(err).ToNot(HaveOccurred())
Expect(speaker).To(Equal("serena"))
Expect(ref).To(BeNil())
})
})
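
The precedence these specs describe (empty Voice falls back to the config `audio_path`, an audio-file Voice overrides it, a named speaker never triggers cloning) can be sketched as a small pure function. How `resolveRequest` actually tells an audio path from a speaker name is not shown in this diff; the `.wav` suffix check below is an assumption made for illustration, and `selection`, `looksLikeAudio`, and `selectVoice` are hypothetical names.

```go
package main

import "strings"

// selection is a hypothetical result type: RefPath is set when a cloning
// reference should be loaded, Speaker when a named speaker was requested.
type selection struct {
	Speaker string
	RefPath string
}

// looksLikeAudio is an assumed heuristic, not taken from the backend:
// a Voice value ending in ".wav" is treated as a reference-audio path.
func looksLikeAudio(voice string) bool {
	return strings.HasSuffix(strings.ToLower(voice), ".wav")
}

// selectVoice applies the precedence exercised by the specs above.
func selectVoice(voice, configAudioPath string) selection {
	switch {
	case voice == "":
		// No per-request voice: use the config-level reference, if any.
		return selection{RefPath: configAudioPath}
	case looksLikeAudio(voice):
		// A per-request audio file overrides the config-level reference.
		return selection{RefPath: voice}
	default:
		// A named speaker: never fall back to audio_path cloning.
		return selection{Speaker: voice}
	}
}
```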


@@ -1,161 +1,191 @@
#include "goqwen3ttscpp.h"
#include "ggml-backend.h"
#include "qwen3_tts.h"
#include "qwen.h"
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>
using namespace qwen3_tts;
static qt_context *g_ctx = nullptr;
// Global engine (loaded once, reused across requests)
static Qwen3TTS *g_engine = nullptr;
static bool g_loaded = false;
static int g_threads = 4;
static void ggml_log_cb(enum ggml_log_level level, const char *log, void *data) {
const char *level_str;
static void ggml_log_cb(enum ggml_log_level level, const char *log,
void * /*data*/) {
if (!log)
return;
const char *lvl = "?????";
switch (level) {
case GGML_LOG_LEVEL_DEBUG:
level_str = "DEBUG";
break;
case GGML_LOG_LEVEL_INFO:
level_str = "INFO";
break;
case GGML_LOG_LEVEL_WARN:
level_str = "WARN";
break;
case GGML_LOG_LEVEL_ERROR:
level_str = "ERROR";
break;
default:
level_str = "?????";
break;
case GGML_LOG_LEVEL_DEBUG: lvl = "DEBUG"; break;
case GGML_LOG_LEVEL_INFO: lvl = "INFO"; break;
case GGML_LOG_LEVEL_WARN: lvl = "WARN"; break;
case GGML_LOG_LEVEL_ERROR: lvl = "ERROR"; break;
default: break;
}
fprintf(stderr, "[%-5s] ", level_str);
fputs(log, stderr);
fprintf(stderr, "[%-5s] %s", lvl, log);
fflush(stderr);
}
// Map language string to language_id token used by the model
static int language_to_id(const char *lang) {
if (!lang || lang[0] == '\0')
return 2050; // default: English
std::string l(lang);
if (l == "en")
return 2050;
if (l == "ru")
return 2069;
if (l == "zh")
return 2055;
if (l == "ja")
return 2058;
if (l == "ko")
return 2064;
if (l == "de")
return 2053;
if (l == "fr")
return 2061;
if (l == "es")
return 2054;
if (l == "it")
return 2056;
if (l == "pt")
return 2057;
fprintf(stderr, "[qwen3-tts-cpp] Unknown language '%s', defaulting to English\n",
lang);
return 2050;
}
int load_model(const char *model_dir, int n_threads) {
int qt3_load(const char *talker_path, const char *codec_path, int use_fa,
int clamp_fp16) {
ggml_log_set(ggml_log_cb, nullptr);
ggml_backend_load_all();
if (n_threads <= 0)
n_threads = 4;
g_threads = n_threads;
fprintf(stderr, "[qwen3-tts-cpp] Loading models from %s (threads=%d)\n",
model_dir, n_threads);
g_engine = new Qwen3TTS();
if (!g_engine->load_models(model_dir)) {
fprintf(stderr, "[qwen3-tts-cpp] FATAL: failed to load models from %s\n",
model_dir);
delete g_engine;
g_engine = nullptr;
if (!talker_path || talker_path[0] == '\0') {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: talker_path is required\n");
return 1;
}
g_loaded = true;
fprintf(stderr, "[qwen3-tts-cpp] Models loaded successfully\n");
return 0;
}
int synthesize(const char *text, const char *ref_audio_path, const char *dst,
const char *language, float temperature, float top_p,
int top_k, float repetition_penalty, int max_audio_tokens,
int n_threads) {
if (!g_loaded || !g_engine) {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: models not loaded\n");
return 1;
}
if (!text || !dst) {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: text and dst are required\n");
if (!codec_path || codec_path[0] == '\0') {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: codec_path is required\n");
return 2;
}
tts_params params;
params.max_audio_tokens = max_audio_tokens > 0 ? max_audio_tokens : 4096;
params.temperature = temperature;
params.top_p = top_p;
params.top_k = top_k;
params.repetition_penalty = repetition_penalty;
params.n_threads = n_threads > 0 ? n_threads : g_threads;
params.language_id = language_to_id(language);
qt_init_params p;
qt_init_default_params(&p);
p.talker_path = talker_path;
p.codec_path = codec_path;
p.use_fa = use_fa != 0;
p.clamp_fp16 = clamp_fp16 != 0;
fprintf(stderr, "[qwen3-tts-cpp] Synthesizing: text='%.50s%s', lang_id=%d, "
"temp=%.2f, threads=%d\n",
text, (strlen(text) > 50 ? "..." : ""), params.language_id,
temperature, params.n_threads);
fprintf(stderr, "[qwen3-tts-cpp] Loading talker=%s codec=%s\n", talker_path,
codec_path);
tts_result result;
bool has_ref = ref_audio_path && ref_audio_path[0] != '\0';
if (has_ref) {
fprintf(stderr, "[qwen3-tts-cpp] Voice cloning with ref: %s\n",
ref_audio_path);
result = g_engine->synthesize_with_voice(text, ref_audio_path, params);
} else {
result = g_engine->synthesize(text, params);
}
if (!result.success) {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: synthesis failed: %s\n",
result.error_msg.c_str());
g_ctx = qt_init(&p);
if (!g_ctx) {
fprintf(stderr, "[qwen3-tts-cpp] FATAL: qt_init failed: %s\n",
qt_last_error());
return 3;
}
int n_samples = (int)result.audio.size();
if (n_samples == 0) {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: synthesis produced no samples\n");
return 4;
}
fprintf(stderr,
"[qwen3-tts-cpp] Synthesis done: %d samples (%.2fs @ 24kHz)\n",
n_samples, (float)n_samples / 24000.0f);
if (!save_audio_file(dst, result.audio, result.sample_rate)) {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: failed to write %s\n", dst);
return 5;
}
fprintf(stderr, "[qwen3-tts-cpp] Wrote %s\n", dst);
fprintf(stderr, "[qwen3-tts-cpp] Model loaded (%s)\n", qt_version());
return 0;
}
// Fill a qt_tts_params from the flat wrapper arguments. Unset/zero scalars keep
// the qt defaults (temperature 0.9, top_k 50, top_p 1.0, rep 1.05, max 2048).
static void fill_params(qt_tts_params *tp, const char *text, const char *lang,
const char *instruct, const char *speaker,
const float *ref_samples, int ref_n,
const char *ref_text, long long seed, float temperature,
int top_k, float top_p, float repetition_penalty,
int max_new_tokens) {
qt_tts_default_params(tp);
tp->text = text ? text : "";
if (lang && lang[0] != '\0')
tp->lang = lang; // else keep default NULL -> auto
if (instruct && instruct[0] != '\0')
tp->instruct = instruct;
if (speaker && speaker[0] != '\0')
tp->speaker = speaker;
if (ref_samples && ref_n > 0) {
tp->ref_audio_24k = ref_samples;
tp->ref_n_samples = ref_n;
if (ref_text && ref_text[0] != '\0')
tp->ref_text = ref_text;
}
if (seed >= 0)
tp->seed = (int64_t)seed; // else default -1 (random)
if (temperature > 0.0f)
tp->temperature = temperature;
if (top_k > 0)
tp->top_k = top_k;
if (top_p > 0.0f)
tp->top_p = top_p;
if (repetition_penalty > 0.0f)
tp->repetition_penalty = repetition_penalty;
if (max_new_tokens > 0)
tp->max_new_tokens = max_new_tokens;
}
float *qt3_tts(const char *text, const char *lang, const char *instruct,
const char *speaker, const float *ref_samples, int ref_n,
const char *ref_text, long long seed, float temperature,
int top_k, float top_p, float repetition_penalty,
int max_new_tokens, int *out_n) {
if (out_n)
*out_n = 0;
if (!g_ctx) {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: model not loaded\n");
return nullptr;
}
if (!text || text[0] == '\0') {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: text is required\n");
return nullptr;
}
qt_tts_params tp;
fill_params(&tp, text, lang, instruct, speaker, ref_samples, ref_n,
ref_text, seed, temperature, top_k, top_p, repetition_penalty,
max_new_tokens);
qt_audio out = {0};
enum qt_status rc = qt_synthesize(g_ctx, &tp, &out);
if (rc != QT_STATUS_OK || out.n_samples <= 0 || !out.samples) {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: synthesize failed (rc=%d): %s\n",
(int)rc, qt_last_error());
qt_audio_free(&out);
return nullptr;
}
// Copy into a plain malloc buffer the Go side frees via qt3_pcm_free.
size_t bytes = (size_t)out.n_samples * sizeof(float);
float *buf = (float *)malloc(bytes);
if (!buf) {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: malloc(%zu) failed\n", bytes);
qt_audio_free(&out);
return nullptr;
}
memcpy(buf, out.samples, bytes);
if (out_n)
*out_n = out.n_samples;
qt_audio_free(&out);
return buf;
}
int qt3_tts_stream(const char *text, const char *lang, const char *instruct,
const char *speaker, const float *ref_samples, int ref_n,
const char *ref_text, long long seed, float temperature,
int top_k, float top_p, float repetition_penalty,
int max_new_tokens, qt3_chunk_cb cb, void *user_data) {
if (!g_ctx) {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: model not loaded\n");
return 1;
}
if (!cb) {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: stream callback is null\n");
return 2;
}
if (!text || text[0] == '\0') {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: text is required\n");
return 4;
}
qt_tts_params tp;
fill_params(&tp, text, lang, instruct, speaker, ref_samples, ref_n,
ref_text, seed, temperature, top_k, top_p, repetition_penalty,
max_new_tokens);
// qt_audio_chunk_cb takes the same parameters as qt3_chunk_cb; only the
// return type differs (bool vs int), which this cast treats as ABI-compatible
// (non-zero == true).
tp.on_chunk = (qt_audio_chunk_cb)cb;
tp.on_chunk_user_data = user_data;
qt_audio out = {0}; // stays empty in streaming mode
enum qt_status rc = qt_synthesize(g_ctx, &tp, &out);
qt_audio_free(&out);
if (rc != QT_STATUS_OK && rc != QT_STATUS_CANCELLED) {
fprintf(stderr, "[qwen3-tts-cpp] ERROR: stream synth failed (rc=%d): %s\n",
(int)rc, qt_last_error());
return 3;
}
return 0;
}
void qt3_pcm_free(float *p) { free(p); }
void qt3_unload(void) {
if (g_ctx) {
qt_free(g_ctx);
g_ctx = nullptr;
}
}
int qt3_n_speakers(void) { return g_ctx ? qt_n_speakers(g_ctx) : 0; }
const char *qt3_speaker_name(int i) {
return g_ctx ? qt_speaker_name(g_ctx, i) : nullptr;
}
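
The "unset keeps the default" rule in `fill_params` can be sketched in Go as below. The default values are the ones quoted in the `fill_params` comment (temperature 0.9, top_k 50, top_p 1.0, repetition penalty 1.05, max 2048, seed -1 for random); `samplingParams`, `defaultParams`, and `applyOverrides` are hypothetical names used only for this illustration.

```go
package main

// samplingParams is a hypothetical Go mirror of the flat wrapper arguments.
type samplingParams struct {
	Temperature       float32
	TopK              int
	TopP              float32
	RepetitionPenalty float32
	MaxNewTokens      int
	Seed              int64
}

// defaultParams returns the defaults quoted in the fill_params comment.
func defaultParams() samplingParams {
	return samplingParams{
		Temperature:       0.9,
		TopK:              50,
		TopP:              1.0,
		RepetitionPenalty: 1.05,
		MaxNewTokens:      2048,
		Seed:              -1, // random
	}
}

// applyOverrides starts from the defaults and copies over only the fields the
// request actually set: positive scalars, and a non-negative seed.
func applyOverrides(req samplingParams) samplingParams {
	p := defaultParams()
	if req.Seed >= 0 {
		p.Seed = req.Seed
	}
	if req.Temperature > 0 {
		p.Temperature = req.Temperature
	}
	if req.TopK > 0 {
		p.TopK = req.TopK
	}
	if req.TopP > 0 {
		p.TopP = req.TopP
	}
	if req.RepetitionPenalty > 0 {
		p.RepetitionPenalty = req.RepetitionPenalty
	}
	if req.MaxNewTokens > 0 {
		p.MaxNewTokens = req.MaxNewTokens
	}
	return p
}
```

One consequence of treating zero as "unset" is that a caller cannot explicitly request a temperature of exactly 0 through this path; it silently becomes the default.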


@@ -1,12 +1,47 @@
#pragma once
#include <cstddef>
#include <cstdint>
extern "C" {
int load_model(const char *model_dir, int n_threads);
int synthesize(const char *text, const char *ref_audio_path, const char *dst,
const char *language, float temperature, float top_p,
int top_k, float repetition_penalty, int max_audio_tokens,
int n_threads);
// Streaming PCM chunk callback. samples is mono float PCM at 24 kHz, valid
// only for the duration of the call. Return non-zero to continue, 0 to abort.
typedef int (*qt3_chunk_cb)(const float *samples, int n_samples,
void *user_data);
// Load the talker + codec/tokenizer GGUFs. use_fa / clamp_fp16 map to
// qt_init_params (the qt ABI exposes no thread count; ggml uses its own
// default). Returns 0 on success, non-zero on failure.
int qt3_load(const char *talker_path, const char *codec_path, int use_fa,
int clamp_fp16);
// Synthesize to a malloc'd float PCM buffer (caller frees via qt3_pcm_free).
// The synthesis mode (base / custom_voice / voice_design) is auto-detected by
// qt from the talker GGUF; speaker is honoured only for custom_voice, instruct
// for voice_design / custom_voice, and ref_samples (+ optional ref_text) drive
// base-mode cloning. qt enforces the rules and we surface qt_last_error() on
// QT_STATUS_MODE_INVALID. Writes the sample count to *out_n. Returns NULL on
// failure (out_n set to 0).
float *qt3_tts(const char *text, const char *lang, const char *instruct,
const char *speaker, const float *ref_samples, int ref_n,
const char *ref_text, long long seed, float temperature,
int top_k, float top_p, float repetition_penalty,
int max_new_tokens, int *out_n);
// Streaming synthesis: cb is invoked per PCM chunk as audio is produced. Same
// param semantics as qt3_tts. Returns 0 on success.
int qt3_tts_stream(const char *text, const char *lang, const char *instruct,
const char *speaker, const float *ref_samples, int ref_n,
const char *ref_text, long long seed, float temperature,
int top_k, float top_p, float repetition_penalty,
int max_new_tokens, qt3_chunk_cb cb, void *user_data);
// Free a buffer returned by qt3_tts.
void qt3_pcm_free(float *p);
// Release the qt context.
void qt3_unload(void);
// Named-speaker introspection (custom_voice models). Returns 0 / NULL when no
// model is loaded or the index is out of range.
int qt3_n_speakers(void);
const char *qt3_speaker_name(int i);
}
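
The `qt3_chunk_cb` contract (samples are valid only during the call; return non-zero to continue, 0 to abort) implies that a consumer must copy each chunk before returning. A minimal Go-side sketch of such a consumer follows; `chunkCollector` is a hypothetical type, and the wiring to the C callback (for example via a purego callback) is deliberately left out.

```go
package main

// chunkCollector is a hypothetical Go-side sink for streamed PCM chunks.
type chunkCollector struct {
	samples    []float32
	maxSamples int // abort once this many samples were collected; 0 = no limit
}

// onChunk follows the callback contract: it copies the chunk immediately
// (the caller may reuse the buffer after the call returns) and returns 1 to
// continue or 0 to ask the producer to stop.
func (c *chunkCollector) onChunk(chunk []float32) int {
	// append copies the elements, so c.samples never aliases chunk.
	c.samples = append(c.samples, chunk...)
	if c.maxSamples > 0 && len(c.samples) >= c.maxSamples {
		return 0
	}
	return 1
}
```

Returning 0 is how a consumer would implement cancellation, which matches the wrapper treating `QT_STATUS_CANCELLED` as a non-error outcome.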


@@ -0,0 +1,95 @@
package main
import (
"math"
"os"
"strings"
"github.com/ebitengine/purego"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func ttsReq(text, voice string, lang *string, dst string) *pb.TTSRequest {
return &pb.TTSRequest{Text: text, Voice: voice, Language: lang, Dst: dst}
}
var _ = Describe("qwen3-tts-cpp e2e", Label("e2e"), func() {
var loaded bool
BeforeEach(func() {
modelPath := os.Getenv("QWEN3TTS_MODEL")
codecPath := os.Getenv("QWEN3TTS_CODEC")
if modelPath == "" || codecPath == "" {
Skip("QWEN3TTS_MODEL / QWEN3TTS_CODEC not set; skipping e2e")
}
if !loaded {
lib := os.Getenv("QWEN3TTS_LIBRARY")
if lib == "" {
lib = "./libgoqwen3ttscpp-fallback.so"
}
h, err := purego.Dlopen(lib, purego.RTLD_NOW|purego.RTLD_GLOBAL)
Expect(err).ToNot(HaveOccurred())
purego.RegisterLibFunc(&CppLoad, h, "qt3_load")
purego.RegisterLibFunc(&CppTTS, h, "qt3_tts")
purego.RegisterLibFunc(&CppTTSStream, h, "qt3_tts_stream")
purego.RegisterLibFunc(&CppPCMFree, h, "qt3_pcm_free")
purego.RegisterLibFunc(&CppUnload, h, "qt3_unload")
Expect(CppLoad(modelPath, codecPath, 1, 0)).To(Equal(0))
loaded = true
}
})
It("synthesizes a WAV file via TTS", func() {
b := &Qwen3TtsCpp{opts: loadOptions{seed: 42, useFA: true}}
dst := GinkgoT().TempDir() + "/out.wav"
lang := "english"
err := b.TTS(ttsReq("Hello world.", "", &lang, dst))
Expect(err).ToNot(HaveOccurred())
fi, err := os.Stat(dst)
Expect(err).ToNot(HaveOccurred())
Expect(fi.Size()).To(BeNumerically(">", int64(44)))
})
It("streams audio chunks via TTSStream", func() {
b := &Qwen3TtsCpp{opts: loadOptions{seed: 42, useFA: true}}
results := make(chan []byte, 1024)
lang := "english"
done := make(chan error, 1)
go func() { done <- b.TTSStream(ttsReq("Hello there, streaming test.", "", &lang, ""), results) }()
var chunks int
var first []byte
for c := range results {
if chunks == 0 {
first = c
}
chunks++
}
Expect(<-done).ToNot(HaveOccurred())
Expect(chunks).To(BeNumerically(">=", 2))
Expect(string(first[0:4])).To(Equal("RIFF"))
Expect(strings.HasPrefix(string(first[8:12]), "WAVE")).To(BeTrue())
})
It("clones a voice from the config audio_path reference", func() {
// 1s of 24kHz mono audio as a clone reference; the base model carries
// a speaker encoder, so audio_path drives x-vector voice cloning.
ref := GinkgoT().TempDir() + "/ref.wav"
samples := make([]float32, qwen3ttsSampleRate)
for i := range samples {
samples[i] = float32(0.05 * math.Sin(float64(i)*0.06))
}
Expect(writeWAV24k(ref, samples)).To(Succeed())
b := &Qwen3TtsCpp{opts: loadOptions{seed: 42, useFA: true}, audioPath: ref}
dst := GinkgoT().TempDir() + "/clone.wav"
lang := "english"
// Empty Voice -> the config audio_path is used as the clone reference.
Expect(b.TTS(ttsReq("Cloned voice test.", "", &lang, dst))).To(Succeed())
fi, err := os.Stat(dst)
Expect(err).ToNot(HaveOccurred())
Expect(fi.Size()).To(BeNumerically(">", int64(44)))
})
})


@@ -5,108 +5,225 @@ import (
"os"
"path/filepath"
"strings"
"sync"
"unsafe"
"github.com/ebitengine/purego"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
)
var (
CppLoadModel func(modelDir string, nThreads int) int
CppSynthesize func(text, refAudioPath, dst, language string,
temperature, topP float32, topK int,
repetitionPenalty float32, maxAudioTokens, nThreads int) int
// qt3_load(talker_path, codec_path, use_fa, clamp_fp16) int
CppLoad func(talkerPath, codecPath string, useFA, clampFP16 int) int
// qt3_tts(text, lang, instruct, speaker, ref_samples, ref_n, ref_text,
// seed, temperature, top_k, top_p, rep_pen, max_new, out_n) -> float*
CppTTS func(text, lang, instruct, speaker string, refSamples unsafe.Pointer,
refN int, refText string, seed int64, temperature float32, topK int,
topP, repPen float32, maxNew int, outN unsafe.Pointer) uintptr
// qt3_tts_stream(..., cb, user) int
CppTTSStream func(text, lang, instruct, speaker string, refSamples unsafe.Pointer,
refN int, refText string, seed int64, temperature float32, topK int,
topP, repPen float32, maxNew int, cb uintptr, user uintptr) int
CppPCMFree func(ptr uintptr)
CppUnload func()
)
type Qwen3TtsCpp struct {
base.SingleThread
threads int
}
// languageNameAliases maps common full language names to the canonical
// two-letter code understood by the C++ language_to_id table.
var languageNameAliases = map[string]string{
"english": "en",
"russian": "ru",
"chinese": "zh",
"japanese": "ja",
"korean": "ko",
"german": "de",
"french": "fr",
"spanish": "es",
"italian": "it",
"portuguese": "pt",
}
// normalizeLanguage coerces a caller-supplied language into the canonical code
// the model expects. It lowercases, trims, strips any region/locale suffix
// (en-US, en_US, ja.JP -> en/ja), and resolves common full names (english -> en).
// An empty input stays empty so the C++ side applies its English default; an
// unrecognized value is returned normalized so C++ can log it and default.
func normalizeLanguage(lang string) string {
lang = strings.ToLower(strings.TrimSpace(lang))
if lang == "" {
return ""
}
// Strip region/locale suffix: keep the segment before the first separator.
if i := strings.IndexAny(lang, "-_."); i >= 0 {
lang = lang[:i]
}
if code, ok := languageNameAliases[lang]; ok {
return code
}
return lang
opts loadOptions
// audioPath is the model-config reference voice (tts.audio_path), the
// default clone reference when a request omits an audio Voice.
audioPath string
}
func (q *Qwen3TtsCpp) Load(opts *pb.ModelOptions) error {
// ModelFile is the model directory path (containing GGUF files)
modelDir := opts.ModelFile
if modelDir == "" {
modelDir = opts.ModelPath
model := opts.ModelFile
if model == "" {
model = opts.ModelPath
}
if !filepath.IsAbs(model) && opts.ModelPath != "" {
model = filepath.Join(opts.ModelPath, model)
}
// Resolve relative paths
if !filepath.IsAbs(modelDir) && opts.ModelPath != "" {
modelDir = filepath.Join(opts.ModelPath, modelDir)
q.opts = parseOptions(opts.Options)
// Resolve the codec/tokenizer GGUF: explicit option, else auto-discover a
// *tokenizer*.gguf sibling of the talker model.
codec := q.opts.codecPath
if codec != "" && !filepath.IsAbs(codec) {
codec = filepath.Join(filepath.Dir(model), codec)
}
if codec == "" {
codec = discoverTokenizer(filepath.Dir(model))
}
if codec == "" {
return fmt.Errorf("qwen3-tts: no codec/tokenizer GGUF found; set option 'tokenizer:<file>'")
}
q.opts.codecPath = codec
q.audioPath = opts.AudioPath
if q.audioPath != "" && !filepath.IsAbs(q.audioPath) {
q.audioPath = filepath.Join(filepath.Dir(model), q.audioPath)
}
threads := int(opts.Threads)
if threads <= 0 {
threads = 4
useFA := boolToInt(q.opts.useFA)
clamp := boolToInt(q.opts.clampFP16)
fmt.Fprintf(os.Stderr, "[qwen3-tts-cpp] Load talker=%s codec=%s use_fa=%d clamp_fp16=%d\n",
model, codec, useFA, clamp)
if rc := CppLoad(model, codec, useFA, clamp); rc != 0 {
return fmt.Errorf("qwen3-tts: failed to load model (rc=%d)", rc)
}
q.threads = threads
fmt.Fprintf(os.Stderr, "[qwen3-tts-cpp] Loading models from: %s (threads=%d)\n", modelDir, threads)
if ret := CppLoadModel(modelDir, threads); ret != 0 {
return fmt.Errorf("failed to load qwen3-tts model (error code: %d)", ret)
}
return nil
}
// discoverTokenizer returns the first *tokenizer*.gguf in dir, or "".
func discoverTokenizer(dir string) string {
entries, err := os.ReadDir(dir)
if err != nil {
return ""
}
for _, e := range entries {
name := strings.ToLower(e.Name())
if strings.Contains(name, "tokenizer") && strings.HasSuffix(name, ".gguf") {
return filepath.Join(dir, e.Name())
}
}
return ""
}
func boolToInt(b bool) int {
if b {
return 1
}
return 0
}
func optStr(p *string) string {
if p == nil {
return ""
}
return *p
}
// resolveRequest derives the synthesis inputs from a TTSRequest:
// language, instruct, speaker, ref-audio samples, ref-text and sampling.
func (q *Qwen3TtsCpp) resolveRequest(req *pb.TTSRequest) (lang, instruct, speaker, refText string, ref []float32, s sampling, err error) {
lang = normalizeLanguage(optStr(req.Language))
instruct = optStr(req.Instructions)
var refPath string
speaker, refPath = resolveVoice(req.Voice)
if refPath == "" && speaker == "" && q.audioPath != "" {
// No per-request voice: fall back to the config clone reference.
refPath = q.audioPath
}
if refPath != "" {
ref, err = readWAVAsFloat(refPath)
if err != nil {
return
}
}
if req.Params != nil {
refText = req.Params["ref_text"]
}
s = parseSampling(req.Params, q.opts.seed)
return
}
func (q *Qwen3TtsCpp) TTS(req *pb.TTSRequest) error {
text := req.Text
voice := req.Voice // reference audio path for voice cloning (empty = no cloning)
dst := req.Dst
language := ""
if req.Language != nil {
language = normalizeLanguage(*req.Language)
if req.Dst == "" {
return fmt.Errorf("qwen3-tts: TTS requires a destination path")
}
if req.Text == "" {
return fmt.Errorf("qwen3-tts: TTS requires text")
}
lang, instruct, speaker, refText, ref, s, err := q.resolveRequest(req)
if err != nil {
return err
}
var refPtr unsafe.Pointer
if len(ref) > 0 {
refPtr = unsafe.Pointer(&ref[0])
}
// Synthesis parameters with sensible defaults
temperature := float32(0.9)
topP := float32(0.8)
topK := 50
repetitionPenalty := float32(1.05)
maxAudioTokens := 4096
var n int32
ptr := CppTTS(req.Text, lang, instruct, speaker, refPtr, len(ref), refText,
s.seed, s.temperature, s.topK, s.topP, s.repPen, s.maxNew, unsafe.Pointer(&n))
runtimeKeepAlive(ref)
if ptr == 0 {
return fmt.Errorf("qwen3-tts: synthesis failed")
}
// Register the free as soon as we own a non-null buffer, so the n<=0 guard
// below cannot leak it (defensive: the C contract returns NULL on failure).
defer CppPCMFree(ptr)
if n <= 0 {
return fmt.Errorf("qwen3-tts: synthesis produced no samples")
}
src := unsafe.Slice((*float32)(unsafe.Pointer(ptr)), int(n)) //nolint:govet // C-allocated PCM, copied out before free
out := make([]float32, int(n))
copy(out, src)
return writeWAV24k(req.Dst, out)
}
if ret := CppSynthesize(text, voice, dst, language,
temperature, topP, topK, repetitionPenalty,
maxAudioTokens, q.threads); ret != 0 {
return fmt.Errorf("failed to synthesize audio (error code: %d)", ret)
// streamState carries the active TTSStream channel to the single shared C
// callback. base.SingleThread serializes TTS/TTSStream, so one global slot is
// safe and avoids leaking a purego callback per request (purego callbacks
// cannot be freed and are capped).
var (
streamMu sync.Mutex
streamChan chan []byte
streamCbOnce sync.Once
streamCbPtr uintptr
)
// streamCallback is registered once and forwards each PCM chunk to streamChan.
func streamCallback(samples *float32, nSamples int32, _ uintptr) uintptr {
if nSamples <= 0 || samples == nil || streamChan == nil {
return 1 // continue
}
src := unsafe.Slice(samples, int(nSamples))
cp := make([]float32, int(nSamples)) // copy out of C memory before returning
copy(cp, src)
streamChan <- floatToPCM16LE(cp)
return 1 // continue
}
func (q *Qwen3TtsCpp) TTSStream(req *pb.TTSRequest, results chan []byte) error {
defer close(results)
if req.Text == "" {
return fmt.Errorf("qwen3-tts: TTSStream requires text")
}
streamCbOnce.Do(func() {
streamCbPtr = purego.NewCallback(streamCallback)
})
lang, instruct, speaker, refText, ref, s, err := q.resolveRequest(req)
if err != nil {
return err
}
var refPtr unsafe.Pointer
if len(ref) > 0 {
refPtr = unsafe.Pointer(&ref[0])
}
// Emit the WAV header first so the HTTP layer gets a self-describing stream.
results <- wavHeader24k()
streamMu.Lock()
streamChan = results
rc := CppTTSStream(req.Text, lang, instruct, speaker, refPtr, len(ref), refText,
s.seed, s.temperature, s.topK, s.topP, s.repPen, s.maxNew, streamCbPtr, 0)
streamChan = nil
streamMu.Unlock()
runtimeKeepAlive(ref)
if rc != 0 {
return fmt.Errorf("qwen3-tts: streaming synthesis failed (rc=%d)", rc)
}
return nil
}
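The streaming path above relies on one process-wide callback slot guarded by a mutex, with each chunk copied before the callback returns. The following is a minimal sketch of that pattern in plain Go, without purego. Every name in it (`slot`, `forward`, `fakeEngine`, `stream`) is hypothetical, and `fakeEngine` merely stands in for the C streaming entry point under the assumption that the engine reuses its output buffer between callbacks.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Hypothetical names throughout; this is an illustration, not the backend's code.
var (
	slotMu sync.Mutex  // held for the whole engine call
	slot   chan []byte // destination of the call in flight, nil otherwise
)

// forward plays the role of the single registered callback. The engine owns
// chunk and may overwrite it after we return, so copy before sending.
func forward(chunk []byte) bool {
	if slot == nil || len(chunk) == 0 {
		return true // nothing to deliver; keep generating
	}
	slot <- append([]byte(nil), chunk...)
	return true
}

// fakeEngine stands in for the C streaming entry point (an assumption): it
// reuses one buffer across callbacks, which is why forward must copy.
func fakeEngine(cb func([]byte) bool) int {
	buf := make([]byte, 2)
	for i := 0; i < 3; i++ {
		buf[0], buf[1] = byte(i), byte(i)
		if !cb(buf) {
			return 1
		}
	}
	return 0
}

// stream publishes results in the shared slot for the duration of one call.
func stream(results chan []byte) error {
	defer close(results)
	slotMu.Lock()
	slot = results
	rc := fakeEngine(forward)
	slot = nil
	slotMu.Unlock()
	if rc != 0 {
		return errors.New("streaming failed")
	}
	return nil
}

func main() {
	// Buffered so this demo needs no consumer goroutine.
	results := make(chan []byte, 8)
	if err := stream(results); err != nil {
		fmt.Println("error:", err)
		return
	}
	for c := range results {
		fmt.Println(c)
	}
}
```

Because the callback blocks on the channel send while the mutex is held, a real caller would drain `results` concurrently, as the e2e spec does with its goroutine.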


@@ -1,53 +0,0 @@
package main
import (
"testing"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func TestLanguageNormalization(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "qwen3-tts-cpp language normalization")
}
var _ = Describe("normalizeLanguage", func() {
DescribeTable("maps caller input to the canonical model language code",
func(input, expected string) {
Expect(normalizeLanguage(input)).To(Equal(expected))
},
// Canonical codes pass through unchanged
Entry("canonical en", "en", "en"),
Entry("canonical zh", "zh", "zh"),
Entry("canonical pt", "pt", "pt"),
// Case-insensitive
Entry("uppercase", "EN", "en"),
Entry("mixed case", "Ja", "ja"),
// Surrounding whitespace
Entry("trims whitespace", " en ", "en"),
// Region/locale stripping
Entry("BCP-47 region", "en-US", "en"),
Entry("underscore region", "en_US", "en"),
Entry("dotted locale", "ja.JP", "ja"),
Entry("region + case", "ZH-CN", "zh"),
// Full-name aliases
Entry("english name", "english", "en"),
Entry("chinese name cased", "Chinese", "zh"),
Entry("japanese name", "japanese", "ja"),
Entry("russian name", "russian", "ru"),
Entry("portuguese name", "portuguese", "pt"),
// Empty stays empty (C++ applies the English default)
Entry("empty", "", ""),
Entry("whitespace only", " ", ""),
// Unknown values pass through normalized so C++ can log + default
Entry("unknown code", "klingon", "klingon"),
Entry("unknown with region", "xx-YY", "xx"),
)
})


@@ -19,24 +19,25 @@ type LibFuncs struct {
}
func main() {
// Get library name from environment variable, default to fallback
libName := os.Getenv("QWEN3TTS_LIBRARY")
if libName == "" {
libName = "./libgoqwen3ttscpp-fallback.so"
}
gosd, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
panic(err)
}
libFuncs := []LibFuncs{
{&CppLoadModel, "load_model"},
{&CppSynthesize, "synthesize"},
{&CppLoad, "qt3_load"},
{&CppTTS, "qt3_tts"},
{&CppTTSStream, "qt3_tts_stream"},
{&CppPCMFree, "qt3_pcm_free"},
{&CppUnload, "qt3_unload"},
}
for _, lf := range libFuncs {
purego.RegisterLibFunc(lf.FuncPtr, gosd, lf.Name)
purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
}
flag.Parse()


@@ -0,0 +1,161 @@
package main
import (
"strconv"
"strings"
)
// loadOptions holds the parsed model-level options.
type loadOptions struct {
codecPath string
useFA bool
clampFP16 bool
seed int64
}
// sampling holds per-request generation parameters with qt defaults applied.
type sampling struct {
temperature float32
topK int
topP float32
repPen float32
maxNew int
seed int64
}
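// splitOption splits a "key:value" option at the first colon and trims both
// halves; ok is false when the option contains no colon.
func splitOption(o string) (key, value string, ok bool) {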
i := strings.Index(o, ":")
if i < 0 {
return "", "", false
}
return strings.TrimSpace(o[:i]), strings.TrimSpace(o[i+1:]), true
}
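// parseBool accepts only the exact strings "true" and "1"; any other value,
// including "True" or "yes", is treated as false.
func parseBool(v string) bool { return v == "true" || v == "1" }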
// parseOptions reads the backend "key:value" option slice. Unknown keys are
// ignored. Defaults: use_fa true (qt default; CPU still uses the F32 chain),
// seed -1 (engine random).
func parseOptions(opts []string) loadOptions {
o := loadOptions{useFA: true, seed: -1}
for _, oo := range opts {
key, value, ok := splitOption(oo)
if !ok {
continue
}
switch key {
case "tokenizer", "codec":
o.codecPath = value
case "use_fa":
o.useFA = parseBool(value)
case "clamp_fp16":
o.clampFP16 = parseBool(value)
case "seed":
if n, err := strconv.ParseInt(value, 10, 64); err == nil {
o.seed = n
}
}
}
return o
}
// languageAliases maps codes / locales / full names to the upstream qwentts
// language names. "auto" (and empty) map to "" so the engine auto-detects.
var languageAliases = map[string]string{
"en": "english", "english": "english",
"zh": "chinese", "chinese": "chinese", "mandarin": "chinese",
"ja": "japanese", "japanese": "japanese",
"ko": "korean", "korean": "korean",
"de": "german", "german": "german",
"fr": "french", "french": "french",
"es": "spanish", "spanish": "spanish",
"it": "italian", "italian": "italian",
"pt": "portuguese", "portuguese": "portuguese",
"ru": "russian", "russian": "russian",
"auto": "",
}
// normalizeLanguage lowercases, trims, strips a region/locale suffix
// (en-US -> en), and resolves to the qwentts language name. Empty stays empty
// (engine auto-detects); an unknown value passes through normalized.
func normalizeLanguage(lang string) string {
lang = strings.ToLower(strings.TrimSpace(lang))
if lang == "" {
return ""
}
if i := strings.IndexAny(lang, "-_."); i >= 0 {
lang = lang[:i]
}
if v, ok := languageAliases[lang]; ok {
return v
}
return lang
}
var refAudioExts = []string{".wav", ".flac", ".mp3", ".ogg", ".m4a"}
// resolveVoice interprets the request Voice field: a value ending in a known
// audio extension is a clone-reference path; anything else is a named speaker
// (custom_voice). Empty input yields no speaker and no reference.
func resolveVoice(voice string) (speaker, refPath string) {
v := strings.TrimSpace(voice)
if v == "" {
return "", ""
}
lower := strings.ToLower(v)
for _, ext := range refAudioExts {
if strings.HasSuffix(lower, ext) {
return "", v
}
}
return v, ""
}
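// parseFloat32, parseInt and parseInt64 return def when v is empty or does not
// parse, so a malformed request parameter falls back to the default silently.
func parseFloat32(v string, def float32) float32 {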
if v == "" {
return def
}
f, err := strconv.ParseFloat(v, 32)
if err != nil {
return def
}
return float32(f)
}
func parseInt(v string, def int) int {
if v == "" {
return def
}
n, err := strconv.Atoi(v)
if err != nil {
return def
}
return n
}
func parseInt64(v string, def int64) int64 {
if v == "" {
return def
}
n, err := strconv.ParseInt(v, 10, 64)
if err != nil {
return def
}
return n
}
// parseSampling reads per-request sampling params from the TTSRequest params
// map, applying qt defaults (matching qt_tts_default_params).
func parseSampling(params map[string]string, defaultSeed int64) sampling {
s := sampling{temperature: 0.9, topK: 50, topP: 1.0, repPen: 1.05, maxNew: 2048, seed: defaultSeed}
if params == nil {
return s
}
s.temperature = parseFloat32(params["temperature"], s.temperature)
s.topK = parseInt(params["top_k"], s.topK)
s.topP = parseFloat32(params["top_p"], s.topP)
s.repPen = parseFloat32(params["repetition_penalty"], s.repPen)
s.maxNew = parseInt(params["max_new_tokens"], s.maxNew)
s.seed = parseInt64(params["seed"], s.seed)
return s
}
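Since the option parser splits at the first colon only, values that themselves contain colons survive intact. The sketch below illustrates that behavior with `strings.Cut`; `splitKV` is a hypothetical stand-in for `splitOption`, not the function from this change, and the sample option strings are made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// splitKV is a hypothetical stand-in for splitOption, written with
// strings.Cut. It splits at the first colon only and trims both halves.
func splitKV(o string) (key, value string, ok bool) {
	k, v, found := strings.Cut(o, ":")
	if !found {
		return "", "", false
	}
	return strings.TrimSpace(k), strings.TrimSpace(v), true
}

func main() {
	// Illustrative inputs: a padded pair, a value containing a colon,
	// and an entry with no colon at all (which the parser would skip).
	for _, o := range []string{" seed : 7 ", "tokenizer:C:/models/tok.gguf", "use_fa"} {
		k, v, ok := splitKV(o)
		fmt.Printf("%q -> key=%q value=%q ok=%v\n", o, k, v, ok)
	}
}
```

An entry such as `use_fa` with no colon is reported as not ok, which matches the parser's choice to ignore it and keep the default.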


@@ -1,173 +1,136 @@
package main
import (
"context"
"os"
"os/exec"
"path/filepath"
"bytes"
"encoding/binary"
"testing"
"time"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
const (
testAddr = "localhost:50051"
startupWait = 5 * time.Second
)
func skipIfNoModel(t *testing.T) string {
t.Helper()
modelDir := os.Getenv("QWEN3TTS_MODEL_DIR")
if modelDir == "" {
t.Skip("QWEN3TTS_MODEL_DIR not set, skipping test (set to directory with GGUF models)")
}
if _, err := os.Stat(filepath.Join(modelDir, "qwen3-tts-0.6b-f16.gguf")); os.IsNotExist(err) {
t.Skipf("TTS model file not found in %s, skipping", modelDir)
}
if _, err := os.Stat(filepath.Join(modelDir, "qwen3-tts-tokenizer-f16.gguf")); os.IsNotExist(err) {
t.Skipf("Tokenizer model file not found in %s, skipping", modelDir)
}
return modelDir
func TestQwen3TtsCpp(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "qwen3-tts-cpp suite")
}
func startServer(t *testing.T) *exec.Cmd {
t.Helper()
binary := os.Getenv("QWEN3TTS_BINARY")
if binary == "" {
binary = "./qwen3-tts-cpp"
}
if _, err := os.Stat(binary); os.IsNotExist(err) {
t.Skipf("Backend binary not found at %s, skipping", binary)
}
cmd := exec.Command(binary, "--addr", testAddr)
cmd.Stdout = os.Stderr
cmd.Stderr = os.Stderr
if err := cmd.Start(); err != nil {
t.Fatalf("Failed to start server: %v", err)
}
time.Sleep(startupWait)
return cmd
}
func stopServer(cmd *exec.Cmd) {
if cmd != nil && cmd.Process != nil {
cmd.Process.Kill()
cmd.Wait()
}
}
func dialGRPC(t *testing.T) *grpc.ClientConn {
t.Helper()
conn, err := grpc.Dial(testAddr,
grpc.WithTransportCredentials(insecure.NewCredentials()),
grpc.WithDefaultCallOptions(
grpc.MaxCallRecvMsgSize(50*1024*1024),
grpc.MaxCallSendMsgSize(50*1024*1024),
),
var _ = Describe("normalizeLanguage", func() {
DescribeTable("maps caller language to qwentts language names",
func(in, want string) {
Expect(normalizeLanguage(in)).To(Equal(want))
},
Entry("empty stays empty", "", ""),
Entry("auto maps to empty", "auto", ""),
Entry("english full name", "English", "english"),
Entry("english code", "en", "english"),
Entry("locale suffix stripped", "en-US", "english"),
Entry("underscore locale", "zh_CN", "chinese"),
Entry("mandarin alias", "mandarin", "chinese"),
Entry("japanese already full", "japanese", "japanese"),
Entry("unknown passes through normalized", "xx", "xx"),
)
if err != nil {
t.Fatalf("Failed to dial gRPC: %v", err)
}
return conn
}
})
func TestServerHealth(t *testing.T) {
cmd := startServer(t)
defer stopServer(cmd)
conn := dialGRPC(t)
defer conn.Close()
client := pb.NewBackendClient(conn)
resp, err := client.Health(context.Background(), &pb.HealthMessage{})
if err != nil {
t.Fatalf("Health check failed: %v", err)
}
if string(resp.Message) != "OK" {
t.Fatalf("Expected OK, got %s", string(resp.Message))
}
}
func TestLoadModel(t *testing.T) {
modelDir := skipIfNoModel(t)
cmd := startServer(t)
defer stopServer(cmd)
conn := dialGRPC(t)
defer conn.Close()
client := pb.NewBackendClient(conn)
resp, err := client.LoadModel(context.Background(), &pb.ModelOptions{
ModelFile: modelDir,
Threads: 4,
var _ = Describe("resolveVoice", func() {
It("treats a bare token as a named speaker", func() {
sp, ref := resolveVoice("serena")
Expect(sp).To(Equal("serena"))
Expect(ref).To(BeEmpty())
})
if err != nil {
t.Fatalf("LoadModel failed: %v", err)
}
if !resp.Success {
t.Fatalf("LoadModel returned failure: %s", resp.Message)
}
}
func TestTTS(t *testing.T) {
modelDir := skipIfNoModel(t)
tmpDir, err := os.MkdirTemp("", "qwen3tts-test")
if err != nil {
t.Fatal(err)
}
t.Cleanup(func() { os.RemoveAll(tmpDir) })
outputFile := filepath.Join(tmpDir, "output.wav")
cmd := startServer(t)
defer stopServer(cmd)
conn := dialGRPC(t)
defer conn.Close()
client := pb.NewBackendClient(conn)
// Load models
loadResp, err := client.LoadModel(context.Background(), &pb.ModelOptions{
ModelFile: modelDir,
Threads: 4,
It("treats an audio path as a clone reference (case-insensitive ext)", func() {
sp, ref := resolveVoice("/x/ref.WAV")
Expect(sp).To(BeEmpty())
Expect(ref).To(Equal("/x/ref.WAV"))
})
if err != nil {
t.Fatalf("LoadModel failed: %v", err)
}
if !loadResp.Success {
t.Fatalf("LoadModel returned failure: %s", loadResp.Message)
}
// Synthesize speech
language := "en"
_, err = client.TTS(context.Background(), &pb.TTSRequest{
Text: "Hello, this is a test of the Qwen3 text to speech system.",
Dst: outputFile,
Language: &language,
It("recognizes mp3/flac/ogg/m4a", func() {
for _, p := range []string{"a.mp3", "b.flac", "c.ogg", "d.m4a"} {
sp, ref := resolveVoice(p)
Expect(sp).To(BeEmpty())
Expect(ref).To(Equal(p))
}
})
if err != nil {
t.Fatalf("TTS failed: %v", err)
}
It("returns empty for empty input", func() {
sp, ref := resolveVoice(" ")
Expect(sp).To(BeEmpty())
Expect(ref).To(BeEmpty())
})
})
// Verify output file exists and has content
info, err := os.Stat(outputFile)
if os.IsNotExist(err) {
t.Fatal("Output audio file was not created")
}
if err != nil {
t.Fatalf("Failed to stat output file: %v", err)
}
var _ = Describe("parseOptions", func() {
It("extracts codec, use_fa, clamp_fp16, seed", func() {
o := parseOptions([]string{
"tokenizer:tok.gguf", "use_fa:false", "clamp_fp16:true",
"seed:7", "unknown:ignored",
})
Expect(o.codecPath).To(Equal("tok.gguf"))
Expect(o.useFA).To(BeFalse())
Expect(o.clampFP16).To(BeTrue())
Expect(o.seed).To(Equal(int64(7)))
})
It("accepts codec: as an alias for tokenizer:", func() {
Expect(parseOptions([]string{"codec:c.gguf"}).codecPath).To(Equal("c.gguf"))
})
It("defaults use_fa true and seed -1", func() {
o := parseOptions(nil)
Expect(o.useFA).To(BeTrue())
Expect(o.seed).To(Equal(int64(-1)))
})
})
t.Logf("Output file size: %d bytes", info.Size())
var _ = Describe("parseSampling", func() {
It("applies qt defaults when params are absent", func() {
s := parseSampling(nil, -1)
Expect(s.temperature).To(BeNumerically("~", 0.9, 1e-6))
Expect(s.topK).To(Equal(50))
Expect(s.topP).To(BeNumerically("~", 1.0, 1e-6))
Expect(s.repPen).To(BeNumerically("~", 1.05, 1e-6))
Expect(s.maxNew).To(Equal(2048))
Expect(s.seed).To(Equal(int64(-1)))
})
It("reads overrides and falls back to default seed", func() {
s := parseSampling(map[string]string{
"temperature": "0.5", "top_k": "10", "top_p": "0.8",
"repetition_penalty": "1.2", "max_new_tokens": "512",
}, 99)
Expect(s.temperature).To(BeNumerically("~", 0.5, 1e-6))
Expect(s.topK).To(Equal(10))
Expect(s.topP).To(BeNumerically("~", 0.8, 1e-6))
Expect(s.repPen).To(BeNumerically("~", 1.2, 1e-6))
Expect(s.maxNew).To(Equal(512))
Expect(s.seed).To(Equal(int64(99)))
})
It("reads an explicit seed override", func() {
Expect(parseSampling(map[string]string{"seed": "123"}, -1).seed).To(Equal(int64(123)))
})
})
// WAV header is 44 bytes minimum; any real audio should be much larger
if info.Size() < 1000 {
t.Errorf("Output file too small (%d bytes), expected real audio data", info.Size())
}
}
var _ = Describe("wavHeader24k", func() {
It("emits a 44-byte streaming WAV header at 24 kHz mono 16-bit", func() {
h := wavHeader24k()
Expect(h).To(HaveLen(44))
Expect(string(h[0:4])).To(Equal("RIFF"))
Expect(string(h[8:12])).To(Equal("WAVE"))
Expect(string(h[12:16])).To(Equal("fmt "))
Expect(string(h[36:40])).To(Equal("data"))
var sampleRate uint32
Expect(binary.Read(bytes.NewReader(h[24:28]), binary.LittleEndian, &sampleRate)).To(Succeed())
Expect(sampleRate).To(Equal(uint32(24000)))
})
})
var _ = Describe("floatToPCM16LE", func() {
It("clamps and converts float PCM to little-endian int16 bytes", func() {
b := floatToPCM16LE([]float32{0, 1.0, -1.0, 2.0, -2.0})
Expect(b).To(HaveLen(10))
read := func(off int) int16 {
var v int16
_ = binary.Read(bytes.NewReader(b[off:off+2]), binary.LittleEndian, &v)
return v
}
Expect(read(0)).To(Equal(int16(0)))
Expect(read(2)).To(Equal(int16(32767)))
Expect(read(4)).To(Equal(int16(-32767)))
Expect(read(6)).To(Equal(int16(32767))) // clamped from 2.0
Expect(read(8)).To(Equal(int16(-32767))) // clamped from -2.0
})
})
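The two specs above pin down the observable behavior of `wavHeader24k` and `floatToPCM16LE`, whose bodies are not part of this excerpt. The following is a sketch of helpers that would satisfy those assertions. The names `streamingWAVHeader` and `pcm16LE` are hypothetical, and the `0xFFFFFFFF` size placeholder for an open-ended stream is an assumption; the real helpers may use a different value.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const sampleRate = 24000 // the 24 kHz rate asserted by the header spec

// streamingWAVHeader builds a 44-byte PCM WAV header for mono 16-bit audio.
// The total length is unknown while streaming, so both size fields carry a
// 0xFFFFFFFF placeholder (an assumption, not confirmed by the source).
func streamingWAVHeader() []byte {
	h := make([]byte, 44)
	copy(h[0:4], "RIFF")
	binary.LittleEndian.PutUint32(h[4:8], 0xFFFFFFFF) // RIFF size placeholder
	copy(h[8:12], "WAVE")
	copy(h[12:16], "fmt ")
	binary.LittleEndian.PutUint32(h[16:20], 16)           // fmt chunk size
	binary.LittleEndian.PutUint16(h[20:22], 1)            // format tag: PCM
	binary.LittleEndian.PutUint16(h[22:24], 1)            // channels: mono
	binary.LittleEndian.PutUint32(h[24:28], sampleRate)   // sample rate
	binary.LittleEndian.PutUint32(h[28:32], sampleRate*2) // byte rate
	binary.LittleEndian.PutUint16(h[32:34], 2)            // block align
	binary.LittleEndian.PutUint16(h[34:36], 16)           // bits per sample
	copy(h[36:40], "data")
	binary.LittleEndian.PutUint32(h[40:44], 0xFFFFFFFF) // data size placeholder
	return h
}

// pcm16LE clamps each sample to [-1, 1] and scales by 32767, so full scale
// maps to +32767 and -32767 as the conversion spec expects.
func pcm16LE(samples []float32) []byte {
	out := make([]byte, 2*len(samples))
	for i, s := range samples {
		if s > 1 {
			s = 1
		} else if s < -1 {
			s = -1
		}
		binary.LittleEndian.PutUint16(out[2*i:], uint16(int16(s*32767)))
	}
	return out
}

func main() {
	h := streamingWAVHeader()
	p := pcm16LE([]float32{0, 1.0, -1.0, 2.0, -2.0})
	fmt.Println(len(h), string(h[0:4]), string(h[8:12]), len(p))
}
```

Scaling by 32767 rather than 32768 keeps the range symmetric, which is what the `-32767` expectations in the spec imply.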


@@ -2,51 +2,30 @@
set -e
CURDIR=$(dirname "$(realpath "$0")")
cd "$CURDIR"
echo "Running qwen3-tts-cpp backend tests..."
# The test requires:
# - QWEN3TTS_MODEL_DIR: path to directory containing GGUF model files
# - QWEN3TTS_BINARY: path to the qwen3-tts-cpp binary (defaults to ./qwen3-tts-cpp)
#
# Tests that require the model will be skipped if QWEN3TTS_MODEL_DIR is not set
# or the directory does not contain the required model files.
cd "$CURDIR"
# Only auto-download models when QWEN3TTS_MODEL_DIR is not explicitly set
if [ -z "$QWEN3TTS_MODEL_DIR" ]; then
export QWEN3TTS_MODEL_DIR="./qwen3-tts-models"
if [ ! -d "$QWEN3TTS_MODEL_DIR" ]; then
echo "Creating qwen3-tts-models directory for tests..."
mkdir -p "$QWEN3TTS_MODEL_DIR"
REPO_ID="endo5501/qwen3-tts.cpp"
echo "Repository: ${REPO_ID}"
echo ""
# Files to download (smallest model for testing)
FILES=(
"qwen3-tts-0.6b-f16.gguf"
"qwen3-tts-tokenizer-f16.gguf"
)
BASE_URL="https://huggingface.co/${REPO_ID}/resolve/main"
for file in "${FILES[@]}"; do
dest="${QWEN3TTS_MODEL_DIR}/${file}"
if [ -f "${dest}" ]; then
echo " [skip] ${file} (already exists)"
else
echo " [download] ${file}..."
curl -L -o "${dest}" "${BASE_URL}/${file}" --progress-bar
echo " [done] ${file}"
fi
done
fi
# Auto-download a small model pair only when QWEN3TTS_MODEL is not set.
if [ -z "$QWEN3TTS_MODEL" ]; then
MODEL_DIR="./qwen3-tts-models"
mkdir -p "$MODEL_DIR"
REPO_ID="Serveurperso/Qwen3-TTS-GGUF"
BASE_URL="https://huggingface.co/${REPO_ID}/resolve/main"
FILES=( "qwen-talker-0.6b-base-Q4_K_M.gguf" "qwen-tokenizer-12hz-Q4_K_M.gguf" )
for file in "${FILES[@]}"; do
dest="${MODEL_DIR}/${file}"
if [ -f "${dest}" ]; then
echo " [skip] ${file}"
else
echo " [download] ${file}..."
curl -L -o "${dest}" "${BASE_URL}/${file}" --progress-bar
fi
done
export QWEN3TTS_MODEL="${MODEL_DIR}/qwen-talker-0.6b-base-Q4_K_M.gguf"
export QWEN3TTS_CODEC="${MODEL_DIR}/qwen-tokenizer-12hz-Q4_K_M.gguf"
fi
# Run Go tests
go test -v -timeout 600s .
go test -v -timeout 1200s .
echo "All qwen3-tts-cpp tests passed."


@@ -62,7 +62,7 @@ var (
shimVadConfigSetDebug func(uintptr, int32)
shimCreateVad func(uintptr, float32) uintptr
// TTS (offline, VITS) config
// TTS (offline, VITS/Piper and Kokoro) config
shimTtsConfigNew func() uintptr
shimTtsConfigFree func(uintptr)
shimTtsConfigSetVitsModel func(uintptr, string)
@@ -76,6 +76,14 @@ var (
shimTtsConfigSetDebug func(uintptr, int32)
shimTtsConfigSetProvider func(uintptr, string)
shimTtsConfigSetMaxNumSentences func(uintptr, int32)
shimTtsConfigSetKokoroModel func(uintptr, string)
shimTtsConfigSetKokoroVoices func(uintptr, string)
shimTtsConfigSetKokoroTokens func(uintptr, string)
shimTtsConfigSetKokoroDataDir func(uintptr, string)
shimTtsConfigSetKokoroDictDir func(uintptr, string)
shimTtsConfigSetKokoroLexicon func(uintptr, string)
shimTtsConfigSetKokoroLang func(uintptr, string)
shimTtsConfigSetKokoroLengthScale func(uintptr, float32)
shimCreateOfflineTts func(uintptr) uintptr
// Offline recognizer config
@@ -101,37 +109,37 @@ var (
shimCreateOfflineRecognizer func(uintptr) uintptr
// Online recognizer config
shimOnlineRecogConfigNew func() uintptr
shimOnlineRecogConfigFree func(uintptr)
shimOnlineRecogConfigSetTransducerEncoder func(uintptr, string)
shimOnlineRecogConfigSetTransducerDecoder func(uintptr, string)
shimOnlineRecogConfigSetTransducerJoiner func(uintptr, string)
shimOnlineRecogConfigSetTokens func(uintptr, string)
shimOnlineRecogConfigSetNumThreads func(uintptr, int32)
shimOnlineRecogConfigSetDebug func(uintptr, int32)
shimOnlineRecogConfigSetProvider func(uintptr, string)
shimOnlineRecogConfigSetFeatSampleRate func(uintptr, int32)
shimOnlineRecogConfigSetFeatFeatureDim func(uintptr, int32)
shimOnlineRecogConfigSetDecodingMethod func(uintptr, string)
shimOnlineRecogConfigSetEnableEndpoint func(uintptr, int32)
shimOnlineRecogConfigNew func() uintptr
shimOnlineRecogConfigFree func(uintptr)
shimOnlineRecogConfigSetTransducerEncoder func(uintptr, string)
shimOnlineRecogConfigSetTransducerDecoder func(uintptr, string)
shimOnlineRecogConfigSetTransducerJoiner func(uintptr, string)
shimOnlineRecogConfigSetTokens func(uintptr, string)
shimOnlineRecogConfigSetNumThreads func(uintptr, int32)
shimOnlineRecogConfigSetDebug func(uintptr, int32)
shimOnlineRecogConfigSetProvider func(uintptr, string)
shimOnlineRecogConfigSetFeatSampleRate func(uintptr, int32)
shimOnlineRecogConfigSetFeatFeatureDim func(uintptr, int32)
shimOnlineRecogConfigSetDecodingMethod func(uintptr, string)
shimOnlineRecogConfigSetEnableEndpoint func(uintptr, int32)
shimOnlineRecogConfigSetRule1MinTrailingSilence func(uintptr, float32)
shimOnlineRecogConfigSetRule2MinTrailingSilence func(uintptr, float32)
shimOnlineRecogConfigSetRule3MinUtteranceLength func(uintptr, float32)
shimCreateOnlineRecognizer func(uintptr) uintptr
shimCreateOnlineRecognizer func(uintptr) uintptr
// Result accessors. Pointer returns use unsafe.Pointer so Go's
// vet checker doesn't flag them — the returned memory is C-owned,
// not subject to Go GC motion.
shimWaveSampleRate func(uintptr) int32
shimWaveNumSamples func(uintptr) int32
shimWaveSamples func(uintptr) unsafe.Pointer
shimOfflineResultText func(uintptr) unsafe.Pointer
shimOnlineResultText func(uintptr) unsafe.Pointer
shimGeneratedAudioSampleRate func(uintptr) int32
shimGeneratedAudioN func(uintptr) int32
shimGeneratedAudioSamples func(uintptr) unsafe.Pointer
shimSpeechSegmentStart func(uintptr) int32
shimSpeechSegmentN func(uintptr) int32
shimWaveSampleRate func(uintptr) int32
shimWaveNumSamples func(uintptr) int32
shimWaveSamples func(uintptr) unsafe.Pointer
shimOfflineResultText func(uintptr) unsafe.Pointer
shimOnlineResultText func(uintptr) unsafe.Pointer
shimGeneratedAudioSampleRate func(uintptr) int32
shimGeneratedAudioN func(uintptr) int32
shimGeneratedAudioSamples func(uintptr) unsafe.Pointer
shimSpeechSegmentStart func(uintptr) int32
shimSpeechSegmentN func(uintptr) int32
// TTS streaming callback trampoline
shimTtsGenerateWithCallback func(tts uintptr, text string, sid int32, speed float32, cb uintptr, ud uintptr) uintptr
@@ -161,13 +169,13 @@ var (
// pointer returned by the shim or `unsafe.Pointer(&slice[0])` from Go.
var (
// VAD
sherpaVadAcceptWaveform func(vad uintptr, samples unsafe.Pointer, n int32)
sherpaVadReset func(vad uintptr)
sherpaVadFlush func(vad uintptr)
sherpaVadEmpty func(vad uintptr) int32
sherpaVadFront func(vad uintptr) uintptr
sherpaVadPop func(vad uintptr)
sherpaDestroySpeechSegment func(seg uintptr)
sherpaVadAcceptWaveform func(vad uintptr, samples unsafe.Pointer, n int32)
sherpaVadReset func(vad uintptr)
sherpaVadFlush func(vad uintptr)
sherpaVadEmpty func(vad uintptr) int32
sherpaVadFront func(vad uintptr) uintptr
sherpaVadPop func(vad uintptr)
sherpaDestroySpeechSegment func(seg uintptr)
// Wave IO
sherpaReadWave func(filename string) uintptr
@@ -175,11 +183,11 @@ var (
sherpaWriteWave func(samples unsafe.Pointer, n int32, sampleRate int32, filename string) int32
// Offline ASR
sherpaCreateOfflineStream func(rec uintptr) uintptr
sherpaDestroyOfflineStream func(stream uintptr)
sherpaAcceptWaveformOffline func(stream uintptr, sr int32, samples unsafe.Pointer, n int32)
sherpaDecodeOfflineStream func(rec uintptr, stream uintptr)
sherpaGetOfflineStreamResult func(stream uintptr) uintptr
sherpaCreateOfflineStream func(rec uintptr) uintptr
sherpaDestroyOfflineStream func(stream uintptr)
sherpaAcceptWaveformOffline func(stream uintptr, sr int32, samples unsafe.Pointer, n int32)
sherpaDecodeOfflineStream func(rec uintptr, stream uintptr)
sherpaGetOfflineStreamResult func(stream uintptr) uintptr
sherpaDestroyOfflineRecognizerResult func(result uintptr)
// Online ASR
@@ -195,21 +203,21 @@ var (
sherpaOnlineStreamInputFinished func(stream uintptr)
// TTS
sherpaOfflineTtsGenerate func(tts uintptr, text string, sid int32, speed float32) uintptr
sherpaOfflineTtsGenerate func(tts uintptr, text string, sid int32, speed float32) uintptr
sherpaDestroyOfflineTtsGeneratedAudio func(audio uintptr)
sherpaOfflineTtsSampleRate func(tts uintptr) int32
sherpaOfflineTtsSampleRate func(tts uintptr) int32
// Offline speaker diarization. Result handle owns the segment-array
// pointer returned by ResultSortByStartTime; destroy the segment
// array first, then the result, then (at backend Free()) the diarizer.
sherpaDestroyOfflineSpeakerDiarization func(sd uintptr)
sherpaOfflineSpeakerDiarizationGetSampleRate func(sd uintptr) int32
sherpaOfflineSpeakerDiarizationProcess func(sd uintptr, samples unsafe.Pointer, n int32) uintptr
sherpaOfflineSpeakerDiarizationResultGetNumSegments func(result uintptr) int32
sherpaOfflineSpeakerDiarizationResultGetNumSpeakers func(result uintptr) int32
sherpaOfflineSpeakerDiarizationResultSortByStartTime func(result uintptr) uintptr
sherpaOfflineSpeakerDiarizationDestroySegment func(segs uintptr)
sherpaDestroyOfflineSpeakerDiarizationResult func(result uintptr)
)
var (
@@ -278,6 +286,14 @@ func loadSherpaLibsOnce() error {
{&shimTtsConfigSetDebug, "sherpa_shim_tts_config_set_debug"},
{&shimTtsConfigSetProvider, "sherpa_shim_tts_config_set_provider"},
{&shimTtsConfigSetMaxNumSentences, "sherpa_shim_tts_config_set_max_num_sentences"},
{&shimTtsConfigSetKokoroModel, "sherpa_shim_tts_config_set_kokoro_model"},
{&shimTtsConfigSetKokoroVoices, "sherpa_shim_tts_config_set_kokoro_voices"},
{&shimTtsConfigSetKokoroTokens, "sherpa_shim_tts_config_set_kokoro_tokens"},
{&shimTtsConfigSetKokoroDataDir, "sherpa_shim_tts_config_set_kokoro_data_dir"},
{&shimTtsConfigSetKokoroDictDir, "sherpa_shim_tts_config_set_kokoro_dict_dir"},
{&shimTtsConfigSetKokoroLexicon, "sherpa_shim_tts_config_set_kokoro_lexicon"},
{&shimTtsConfigSetKokoroLang, "sherpa_shim_tts_config_set_kokoro_lang"},
{&shimTtsConfigSetKokoroLengthScale, "sherpa_shim_tts_config_set_kokoro_length_scale"},
{&shimCreateOfflineTts, "sherpa_shim_create_offline_tts"},
{&shimOfflineRecogConfigNew, "sherpa_shim_offline_recog_config_new"},
@@ -688,21 +704,14 @@ func (s *SherpaBackend) loadTTS(opts *pb.ModelOptions) error {
cfg := shimTtsConfigNew()
defer shimTtsConfigFree(cfg)
shimTtsConfigSetVitsModel(cfg, modelFile)
if tokensPath := filepath.Join(modelDir, "tokens.txt"); fileExists(tokensPath) {
shimTtsConfigSetVitsTokens(cfg, tokensPath)
// Kokoro models ship a voices style file alongside the ONNX, whereas
// VITS/Piper voices do not. That presence is what tells the two model
// families apart, since both arrive as a plain *.onnx in modelDir.
if isKokoroModel(modelDir) {
s.configureKokoroTTS(cfg, opts, modelFile, modelDir)
} else {
s.configureVitsTTS(cfg, opts, modelFile, modelDir)
}
if lexiconPath := filepath.Join(modelDir, "lexicon.txt"); fileExists(lexiconPath) {
shimTtsConfigSetVitsLexicon(cfg, lexiconPath)
}
if dataDir := filepath.Join(modelDir, "espeak-ng-data"); dirExists(dataDir) {
shimTtsConfigSetVitsDataDir(cfg, dataDir)
}
shimTtsConfigSetVitsNoiseScale(cfg, findOptionFloat(opts, optionTtsNoiseScale, 0.667))
shimTtsConfigSetVitsNoiseScaleW(cfg, findOptionFloat(opts, optionTtsNoiseScaleW, 0.8))
shimTtsConfigSetVitsLengthScale(cfg, findOptionFloat(opts, optionTtsLengthScale, 1.0))
threads := int32(1)
if opts.Threads != 0 {
@@ -723,6 +732,80 @@ func (s *SherpaBackend) loadTTS(opts *pb.ModelOptions) error {
return nil
}
// kokoroVoicesFile is the speaker-style bank that ships with Kokoro models and
// is absent from VITS/Piper voices; its presence is how loadTTS tells them apart.
const kokoroVoicesFile = "voices.bin"
// isKokoroModel reports whether modelDir holds a Kokoro model (a voices file
// next to the ONNX) rather than a VITS/Piper single-speaker model.
func isKokoroModel(modelDir string) bool {
return fileExists(filepath.Join(modelDir, kokoroVoicesFile))
}
// configureVitsTTS wires a VITS/Piper single-speaker model into cfg: the ONNX
// plus the optional tokens, lexicon and espeak-ng-data found beside it.
func (s *SherpaBackend) configureVitsTTS(cfg uintptr, opts *pb.ModelOptions, modelFile, modelDir string) {
shimTtsConfigSetVitsModel(cfg, modelFile)
if tokensPath := filepath.Join(modelDir, "tokens.txt"); fileExists(tokensPath) {
shimTtsConfigSetVitsTokens(cfg, tokensPath)
}
if lexiconPath := filepath.Join(modelDir, "lexicon.txt"); fileExists(lexiconPath) {
shimTtsConfigSetVitsLexicon(cfg, lexiconPath)
}
if dataDir := filepath.Join(modelDir, "espeak-ng-data"); dirExists(dataDir) {
shimTtsConfigSetVitsDataDir(cfg, dataDir)
}
shimTtsConfigSetVitsNoiseScale(cfg, findOptionFloat(opts, optionTtsNoiseScale, 0.667))
shimTtsConfigSetVitsNoiseScaleW(cfg, findOptionFloat(opts, optionTtsNoiseScaleW, 0.8))
shimTtsConfigSetVitsLengthScale(cfg, findOptionFloat(opts, optionTtsLengthScale, 1.0))
}
// configureKokoroTTS wires a Kokoro model into cfg: the ONNX, its voices bank,
// tokens, and the optional espeak-ng-data / jieba dict / lexicon assets the
// multi-lingual packs ship. A language hint comes from the `language=` option.
func (s *SherpaBackend) configureKokoroTTS(cfg uintptr, opts *pb.ModelOptions, modelFile, modelDir string) {
shimTtsConfigSetKokoroModel(cfg, modelFile)
shimTtsConfigSetKokoroVoices(cfg, filepath.Join(modelDir, kokoroVoicesFile))
if tokensPath := filepath.Join(modelDir, "tokens.txt"); fileExists(tokensPath) {
shimTtsConfigSetKokoroTokens(cfg, tokensPath)
}
if dataDir := filepath.Join(modelDir, "espeak-ng-data"); dirExists(dataDir) {
shimTtsConfigSetKokoroDataDir(cfg, dataDir)
}
if dictDir := filepath.Join(modelDir, "dict"); dirExists(dictDir) {
shimTtsConfigSetKokoroDictDir(cfg, dictDir)
}
// Multi-lingual Kokoro ships per-language lexicons; the C API takes them as
// a single comma-separated list. US and GB English overlap almost entirely,
// so pass only one (US preferred) to avoid tens of thousands of "duplicated
// word" warnings at load; non-English lexicons (e.g. zh) are additive.
var lexicons []string
addLexicon := func(name string) {
if p := filepath.Join(modelDir, name); fileExists(p) {
lexicons = append(lexicons, p)
}
}
if fileExists(filepath.Join(modelDir, "lexicon-us-en.txt")) {
addLexicon("lexicon-us-en.txt")
} else {
addLexicon("lexicon-gb-en.txt")
}
addLexicon("lexicon-zh.txt")
addLexicon("lexicon.txt")
if len(lexicons) > 0 {
shimTtsConfigSetKokoroLexicon(cfg, strings.Join(lexicons, ","))
}
if lang := findOptionValue(opts, optionLanguage, ""); lang != "" {
shimTtsConfigSetKokoroLang(cfg, lang)
}
shimTtsConfigSetKokoroLengthScale(cfg, findOptionFloat(opts, optionTtsLengthScale, 1.0))
}
func fileExists(p string) bool {
info, err := os.Stat(p)
return err == nil && !info.IsDir()
@@ -1252,7 +1335,7 @@ type ttsStreamState struct {
var (
ttsStates sync.Map // uint64 → *ttsStreamState
ttsNextID atomic.Uint64
ttsCallbackPtr uintptr // purego.NewCallback return; registered in loadSherpaLibs
)
// ttsStreamCallback is invoked by sherpa-onnx for each PCM chunk VITS

@@ -124,6 +124,20 @@ var _ = Describe("Sherpa-ONNX", func() {
Entry("empty", "", false),
Entry("other", "other", false),
)
It("isKokoroModel detects a voices file beside the ONNX", func() {
dir, err := os.MkdirTemp("", "sherpa-kokoro-*")
Expect(err).NotTo(HaveOccurred())
defer func() { _ = os.RemoveAll(dir) }()
// A bare VITS/Piper directory (ONNX only) is not Kokoro.
Expect(os.WriteFile(filepath.Join(dir, "model.onnx"), []byte("x"), 0o600)).To(Succeed())
Expect(isKokoroModel(dir)).To(BeFalse())
// Adding the Kokoro voices bank flips detection on.
Expect(os.WriteFile(filepath.Join(dir, kokoroVoicesFile), []byte("x"), 0o600)).To(Succeed())
Expect(isKokoroModel(dir)).To(BeTrue())
})
})
Context("option parsing", func() {

@@ -79,6 +79,13 @@ void sherpa_shim_tts_config_free(void *h) {
free((char *)c->model.vits.tokens);
free((char *)c->model.vits.lexicon);
free((char *)c->model.vits.data_dir);
free((char *)c->model.kokoro.model);
free((char *)c->model.kokoro.voices);
free((char *)c->model.kokoro.tokens);
free((char *)c->model.kokoro.data_dir);
free((char *)c->model.kokoro.dict_dir);
free((char *)c->model.kokoro.lexicon);
free((char *)c->model.kokoro.lang);
free((char *)c->model.provider);
free(c);
}
@@ -117,6 +124,34 @@ void sherpa_shim_tts_config_set_max_num_sentences(void *h, int32_t v) {
((SherpaOnnxOfflineTtsConfig *)h)->max_num_sentences = v;
}
// Kokoro multi-speaker / multi-lingual TTS. Distinct ONNX + a voices style
// file (voices.bin) instead of VITS' single-speaker graph; espeak-ng-data,
// lexicon and a language hint are optional refinements.
void sherpa_shim_tts_config_set_kokoro_model(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.kokoro.model, v);
}
void sherpa_shim_tts_config_set_kokoro_voices(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.kokoro.voices, v);
}
void sherpa_shim_tts_config_set_kokoro_tokens(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.kokoro.tokens, v);
}
void sherpa_shim_tts_config_set_kokoro_data_dir(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.kokoro.data_dir, v);
}
void sherpa_shim_tts_config_set_kokoro_dict_dir(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.kokoro.dict_dir, v);
}
void sherpa_shim_tts_config_set_kokoro_lexicon(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.kokoro.lexicon, v);
}
void sherpa_shim_tts_config_set_kokoro_lang(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.kokoro.lang, v);
}
void sherpa_shim_tts_config_set_kokoro_length_scale(void *h, float v) {
((SherpaOnnxOfflineTtsConfig *)h)->model.kokoro.length_scale = v;
}
void *sherpa_shim_create_offline_tts(void *h) {
return (void *)SherpaOnnxCreateOfflineTts(
(const SherpaOnnxOfflineTtsConfig *)h);

@@ -37,7 +37,7 @@ void sherpa_shim_vad_config_set_provider(void *cfg, const char *v);
void sherpa_shim_vad_config_set_debug(void *cfg, int32_t v);
void *sherpa_shim_create_vad(void *cfg, float buffer_size_seconds);
// --- Offline TTS config (VITS path — the only TTS family the backend uses) ---
// --- Offline TTS config (VITS/Piper and Kokoro model families) ---
void *sherpa_shim_tts_config_new(void);
void sherpa_shim_tts_config_free(void *cfg);
void sherpa_shim_tts_config_set_vits_model(void *cfg, const char *v);
@@ -51,6 +51,14 @@ void sherpa_shim_tts_config_set_num_threads(void *cfg, int32_t v);
void sherpa_shim_tts_config_set_debug(void *cfg, int32_t v);
void sherpa_shim_tts_config_set_provider(void *cfg, const char *v);
void sherpa_shim_tts_config_set_max_num_sentences(void *cfg, int32_t v);
void sherpa_shim_tts_config_set_kokoro_model(void *cfg, const char *v);
void sherpa_shim_tts_config_set_kokoro_voices(void *cfg, const char *v);
void sherpa_shim_tts_config_set_kokoro_tokens(void *cfg, const char *v);
void sherpa_shim_tts_config_set_kokoro_data_dir(void *cfg, const char *v);
void sherpa_shim_tts_config_set_kokoro_dict_dir(void *cfg, const char *v);
void sherpa_shim_tts_config_set_kokoro_lexicon(void *cfg, const char *v);
void sherpa_shim_tts_config_set_kokoro_lang(void *cfg, const char *v);
void sherpa_shim_tts_config_set_kokoro_length_scale(void *cfg, float v);
void *sherpa_shim_create_offline_tts(void *cfg);
// --- Offline recognizer config (Whisper / Paraformer / SenseVoice / Omnilingual) ---

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# stablediffusion.cpp (ggml)
STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
STABLEDIFFUSION_GGML_VERSION?=1f9ee88e09c258053fa59d5e05e23dfb10fa0b13
STABLEDIFFUSION_GGML_VERSION?=19bdfe22d255d5b4dff39d449318b9bc5ea2317f
CMAKE_ARGS+=-DGGML_MAX_NAME=128

@@ -386,6 +386,7 @@ int load_model(const char *model, char *model_path, char* options[], int threads
const char *llm_vision_path = "";
const char *diffusion_model_path = stableDiffusionModel;
const char *high_noise_diffusion_model_path = "";
const char *uncond_diffusion_model_path = "";
const char *taesd_path = "";
const char *control_net_path = "";
const char *embedding_dir = "";
@@ -472,6 +473,7 @@ int load_model(const char *model, char *model_path, char* options[], int threads
if (!strcmp(optname, "llm_vision_path")) llm_vision_path = strdup(optval);
if (!strcmp(optname, "diffusion_model_path")) diffusion_model_path = strdup(optval);
if (!strcmp(optname, "high_noise_diffusion_model_path")) high_noise_diffusion_model_path = strdup(optval);
if (!strcmp(optname, "uncond_diffusion_model_path")) uncond_diffusion_model_path = strdup(optval);
if (!strcmp(optname, "taesd_path")) taesd_path = strdup(optval);
if (!strcmp(optname, "control_net_path")) control_net_path = strdup(optval);
if (!strcmp(optname, "embedding_dir")) {
@@ -571,6 +573,7 @@ int load_model(const char *model, char *model_path, char* options[], int threads
ctx_params.llm_vision_path = llm_vision_path;
ctx_params.diffusion_model_path = diffusion_model_path;
ctx_params.high_noise_diffusion_model_path = high_noise_diffusion_model_path;
ctx_params.uncond_diffusion_model_path = uncond_diffusion_model_path;
ctx_params.vae_path = vae_path;
ctx_params.audio_vae_path = audio_vae_path;
ctx_params.embeddings_connectors_path = embeddings_connectors_path;

@@ -26,8 +26,16 @@ add_library(govibevoicecpp MODULE cpp/govibevoicecpp.cpp)
# vv_capi_* symbols (purego dlopens them by name, nothing in our
# translation unit references them). Force the static archive's
# entire contents into the MODULE so dlsym finds vv_capi_load etc.
#
# Link the `vibevoice` TARGET (not a bare archive path) so CMake builds
# libvibevoice.a first and tracks the dependency: the upstream project is added
# with EXCLUDE_FROM_ALL, so without a target-level link there is no rule to
# build it. Passing only $<TARGET_FILE:vibevoice> as a path on Apple left the
# build with "No rule to make target 'vibevoice/libvibevoice.a'" (issue #10267).
# force_load is then applied as a separate link option.
if(APPLE)
target_link_libraries(govibevoicecpp PRIVATE -Wl,-force_load $<TARGET_FILE:vibevoice>)
target_link_libraries(govibevoicecpp PRIVATE vibevoice)
target_link_options(govibevoicecpp PRIVATE "-Wl,-force_load,$<TARGET_FILE:vibevoice>")
elseif(MSVC)
target_link_libraries(govibevoicecpp PRIVATE vibevoice)
set_property(TARGET govibevoicecpp APPEND PROPERTY LINK_FLAGS "/WHOLEARCHIVE:vibevoice")

@@ -94,26 +94,30 @@ purge:
# Build all variants (Linux only)
ifeq ($(UNAME_S),Linux)
libgovibevoicecpp-avx.so: sources/vibevoice.cpp
$(MAKE) purge
$(info ${GREEN}I vibevoice-cpp build info:avx${RESET})
SO_TARGET=libgovibevoicecpp-avx.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgovibevoicecpp-custom
rm -rf build-libgovibevoicecpp-avx.so
rm -rfv build*
libgovibevoicecpp-avx2.so: sources/vibevoice.cpp
$(MAKE) purge
$(info ${GREEN}I vibevoice-cpp build info:avx2${RESET})
SO_TARGET=libgovibevoicecpp-avx2.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgovibevoicecpp-custom
rm -rf build-libgovibevoicecpp-avx2.so
rm -rfv build*
libgovibevoicecpp-avx512.so: sources/vibevoice.cpp
$(MAKE) purge
$(info ${GREEN}I vibevoice-cpp build info:avx512${RESET})
SO_TARGET=libgovibevoicecpp-avx512.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgovibevoicecpp-custom
rm -rf build-libgovibevoicecpp-avx512.so
rm -rfv build*
endif
# Build fallback variant (all platforms)
libgovibevoicecpp-fallback.so: sources/vibevoice.cpp
$(MAKE) purge
$(info ${GREEN}I vibevoice-cpp build info:fallback${RESET})
SO_TARGET=libgovibevoicecpp-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgovibevoicecpp-custom
rm -rf build-libgovibevoicecpp-fallback.so
rm -rfv build*
libgovibevoicecpp-custom: CMakeLists.txt cpp/govibevoicecpp.cpp cpp/govibevoicecpp.h
mkdir -p build-$(SO_TARGET) && \

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# whisper.cpp version
WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
WHISPER_CPP_VERSION?=99613cb720b65036237d44b52f753b51f75c2797
WHISPER_CPP_VERSION?=df7638d8229a243af8a4b5a8ae557e0d74e0a0ae
SO_TARGET?=libgowhisper.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF

@@ -337,6 +337,127 @@
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-rfdetr-cpp"
intel: "intel-sycl-f32-rfdetr-cpp"
vulkan: "vulkan-rfdetr-cpp"
- &locateanything
name: "locate-anything"
alias: "locate-anything"
license: apache-2.0
description: |
Open-vocabulary object detection and visual grounding (NVIDIA
LocateAnything-3B) in C/C++ using GGML. Loads pre-built GGUF weights
and, given an image and a free-form text prompt, returns bounding
boxes, class labels, and confidence scores for the referred objects.
urls:
- https://github.com/mudler/locate-anything.cpp
- https://huggingface.co/nvidia/LocateAnything-3B
tags:
- object-detection
- visual-grounding
- open-vocabulary
- locate-anything
- gpu
- cpu
capabilities:
default: "cpu-locate-anything-cpp"
nvidia: "cuda12-locate-anything-cpp"
nvidia-cuda-12: "cuda12-locate-anything-cpp"
nvidia-cuda-13: "cuda13-locate-anything-cpp"
nvidia-l4t: "nvidia-l4t-arm64-locate-anything-cpp"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-locate-anything-cpp"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-locate-anything-cpp"
intel: "intel-sycl-f32-locate-anything-cpp"
vulkan: "vulkan-locate-anything-cpp"
- !!merge <<: *locateanything
name: "locate-anything-development"
capabilities:
default: "cpu-locate-anything-cpp-development"
nvidia: "cuda12-locate-anything-cpp-development"
nvidia-cuda-12: "cuda12-locate-anything-cpp-development"
nvidia-cuda-13: "cuda13-locate-anything-cpp-development"
nvidia-l4t: "nvidia-l4t-arm64-locate-anything-cpp-development"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-locate-anything-cpp-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-locate-anything-cpp-development"
intel: "intel-sycl-f32-locate-anything-cpp-development"
vulkan: "vulkan-locate-anything-cpp-development"
- !!merge <<: *locateanything
name: "cpu-locate-anything-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-locate-anything-cpp"
mirrors:
- localai/localai-backends:latest-cpu-locate-anything-cpp
- !!merge <<: *locateanything
name: "cpu-locate-anything-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-locate-anything-cpp"
mirrors:
- localai/localai-backends:master-cpu-locate-anything-cpp
- !!merge <<: *locateanything
name: "cuda12-locate-anything-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-locate-anything-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-locate-anything-cpp
- !!merge <<: *locateanything
name: "cuda12-locate-anything-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-locate-anything-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-locate-anything-cpp
- !!merge <<: *locateanything
name: "cuda13-locate-anything-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-locate-anything-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-locate-anything-cpp
- !!merge <<: *locateanything
name: "cuda13-locate-anything-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-locate-anything-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-locate-anything-cpp
- !!merge <<: *locateanything
name: "nvidia-l4t-arm64-locate-anything-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-locate-anything-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-locate-anything-cpp
- !!merge <<: *locateanything
name: "nvidia-l4t-arm64-locate-anything-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-locate-anything-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-locate-anything-cpp
- !!merge <<: *locateanything
name: "cuda13-nvidia-l4t-arm64-locate-anything-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-locate-anything-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-locate-anything-cpp
- !!merge <<: *locateanything
name: "cuda13-nvidia-l4t-arm64-locate-anything-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-locate-anything-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-locate-anything-cpp
- !!merge <<: *locateanything
name: "intel-sycl-f32-locate-anything-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-locate-anything-cpp"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f32-locate-anything-cpp
- !!merge <<: *locateanything
name: "intel-sycl-f32-locate-anything-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-locate-anything-cpp"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f32-locate-anything-cpp
- !!merge <<: *locateanything
name: "intel-sycl-f16-locate-anything-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-locate-anything-cpp"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f16-locate-anything-cpp
- !!merge <<: *locateanything
name: "intel-sycl-f16-locate-anything-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-locate-anything-cpp"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f16-locate-anything-cpp
- !!merge <<: *locateanything
name: "vulkan-locate-anything-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-locate-anything-cpp"
mirrors:
- localai/localai-backends:latest-gpu-vulkan-locate-anything-cpp
- !!merge <<: *locateanything
name: "vulkan-locate-anything-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-locate-anything-cpp"
mirrors:
- localai/localai-backends:master-gpu-vulkan-locate-anything-cpp
- &vllm
name: "vllm"
license: apache-2.0
@@ -426,12 +547,9 @@
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm-omni"
- &mlx
name: "mlx"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-mlx"
icon: https://avatars.githubusercontent.com/u/102832242?s=200&v=4
urls:
- https://github.com/ml-explore/mlx-lm
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-mlx
license: MIT
description: |
Run LLMs with MLX
@@ -450,12 +568,9 @@
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-mlx"
- &mlx-vlm
name: "mlx-vlm"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-mlx-vlm"
icon: https://avatars.githubusercontent.com/u/102832242?s=200&v=4
urls:
- https://github.com/Blaizzy/mlx-vlm
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-mlx-vlm
license: MIT
description: |
Run Vision-Language Models with MLX
@@ -476,12 +591,9 @@
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-mlx-vlm"
- &mlx-audio
name: "mlx-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-mlx-audio"
icon: https://avatars.githubusercontent.com/u/102832242?s=200&v=4
urls:
- https://github.com/Blaizzy/mlx-audio
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-mlx-audio
license: MIT
description: |
Run Audio Models with MLX
@@ -502,12 +614,9 @@
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-mlx-audio"
- &mlx-distributed
name: "mlx-distributed"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-mlx-distributed"
icon: https://avatars.githubusercontent.com/u/102832242?s=200&v=4
urls:
- https://github.com/ml-explore/mlx-lm
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-mlx-distributed
license: MIT
description: |
Run distributed LLM inference with MLX across multiple Apple Silicon Macs
@@ -603,7 +712,7 @@
default: "cpu-diffusers"
nvidia-cuda-13: "cuda13-diffusers"
nvidia-cuda-12: "cuda12-diffusers"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-diffusers"
nvidia-l4t-cuda-12: "nvidia-l4t-diffusers"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-diffusers"
- &ace-step
name: "ace-step"
@@ -659,14 +768,17 @@
- &qwen3ttscpp
name: "qwen3-tts-cpp"
description: |
Qwen3-TTS C++ backend using GGML. Native C++ text-to-speech with voice cloning support.
Generates 24kHz mono audio from text with optional reference audio for voice cloning via ECAPA-TDNN speaker embeddings.
Qwen3-TTS C++ backend using GGML (qwentts.cpp). Native C++ text-to-speech
with streaming output, named speakers, voice design, and zero-shot voice
cloning. 24kHz mono, 11 languages with Mandarin dialects. 0.6B and 1.7B
models in Q8_0 / Q4_K_M.
urls:
- https://github.com/predict-woo/qwen3-tts.cpp
- https://github.com/ServeurpersoCom/qwentts.cpp
tags:
- text-to-speech
- tts
- voice-cloning
- streaming
alias: "qwen3-tts-cpp"
capabilities:
default: "cpu-qwen3-tts-cpp"
@@ -680,6 +792,33 @@
nvidia-l4t: "nvidia-l4t-arm64-qwen3-tts-cpp"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-qwen3-tts-cpp"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-qwen3-tts-cpp"
- &omnivoicecpp
name: "omnivoice-cpp"
description: |
OmniVoice C++ backend using GGML. Native text-to-speech with voice cloning
(reference audio + transcript) and voice design (attribute keywords: gender,
age, pitch, style, volume, emotion). 24kHz mono output, 646 languages.
Supports streaming synthesis.
urls:
- https://github.com/ServeurpersoCom/omnivoice.cpp
tags:
- text-to-speech
- tts
- voice-cloning
- voice-design
alias: "omnivoice-cpp"
capabilities:
default: "cpu-omnivoice-cpp"
nvidia: "cuda12-omnivoice-cpp"
nvidia-cuda-13: "cuda13-omnivoice-cpp"
nvidia-cuda-12: "cuda12-omnivoice-cpp"
intel: "intel-sycl-f16-omnivoice-cpp"
metal: "metal-omnivoice-cpp"
amd: "rocm-omnivoice-cpp"
vulkan: "vulkan-omnivoice-cpp"
nvidia-l4t: "nvidia-l4t-arm64-omnivoice-cpp"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-omnivoice-cpp"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-omnivoice-cpp"
- &vibevoicecpp
name: "vibevoice-cpp"
description: |
@@ -825,7 +964,7 @@
metal: "metal-kokoro"
nvidia-cuda-13: "cuda13-kokoro"
nvidia-cuda-12: "cuda12-kokoro"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-kokoro"
nvidia-l4t-cuda-12: "nvidia-l4t-kokoro"
- &kokoros
icon: https://avatars.githubusercontent.com/u/166769057?v=4
description: |
@@ -868,7 +1007,6 @@
intel: "intel-coqui"
amd: "rocm-coqui"
metal: "metal-coqui"
nvidia-cuda-13: "cuda13-coqui"
nvidia-cuda-12: "cuda12-coqui"
icon: https://avatars.githubusercontent.com/u/1338804?s=200&v=4
- &outetts
@@ -1118,27 +1256,27 @@
icon: https://avatars.githubusercontent.com/u/151010778?s=200&v=4
- &piper
name: "piper"
uri: "quay.io/go-skynet/local-ai-backends:latest-piper"
icon: https://github.com/OHF-Voice/piper1-gpl/raw/main/etc/logo.png
urls:
- https://github.com/rhasspy/piper
- https://github.com/mudler/go-piper
mirrors:
- localai/localai-backends:latest-piper
license: MIT
description: |
A fast, local neural text to speech system
tags:
- text-to-speech
- TTS
capabilities:
default: "cpu-piper"
metal: "metal-piper"
- &opus
name: "opus"
alias: "opus"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-opus"
capabilities:
default: "cpu-opus"
metal: "metal-opus"
urls:
- https://opus-codec.org/
mirrors:
- localai/localai-backends:latest-cpu-opus
license: BSD-3-Clause
description: |
Opus audio codec backend for encoding and decoding audio.
@@ -1148,15 +1286,19 @@
- opus
- WebRTC
- realtime
- CPU
- !!merge <<: *opus
name: "opus-development"
capabilities:
default: "cpu-opus-development"
metal: "metal-opus-development"
- &silero-vad
name: "silero-vad"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-silero-vad"
icon: https://user-images.githubusercontent.com/12515440/89997349-b3523080-dc94-11ea-9906-ca2e8bc50535.png
urls:
- https://github.com/snakers4/silero-vad
mirrors:
- localai/localai-backends:latest-cpu-silero-vad
capabilities:
default: "cpu-silero-vad"
metal: "metal-silero-vad"
description: |
Silero VAD: pre-trained enterprise-grade Voice Activity Detector.
Silero VAD is a voice activity detection model that can be used to detect whether a given audio contains speech or not.
@@ -1167,9 +1309,6 @@
- CPU
- &local-store
name: "local-store"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-local-store"
mirrors:
- localai/localai-backends:latest-cpu-local-store
urls:
- https://github.com/mudler/LocalAI
description: |
@@ -1180,11 +1319,11 @@
- open-source
- CPU
license: MIT
capabilities:
default: "cpu-local-store"
metal: "metal-local-store"
- &kitten-tts
name: "kitten-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-kitten-tts"
mirrors:
- localai/localai-backends:latest-kitten-tts
urls:
- https://github.com/KittenML/KittenTTS
description: |
@@ -1193,6 +1332,9 @@
- text-to-speech
- TTS
license: apache-2.0
capabilities:
default: "cpu-kitten-tts"
metal: "metal-kitten-tts"
- &neutts
name: "neutts"
urls:
@@ -1225,6 +1367,7 @@
default: "cpu-sherpa-onnx"
nvidia: "cuda12-sherpa-onnx"
nvidia-cuda-12: "cuda12-sherpa-onnx"
metal: "metal-sherpa-onnx"
- !!merge <<: *neutts
name: "neutts-development"
capabilities:
@@ -1317,25 +1460,89 @@
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-neutts
- !!merge <<: *mlx
name: "mlx-development"
name: "metal-mlx"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-mlx"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-mlx
- !!merge <<: *mlx
name: "metal-mlx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-mlx"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-mlx
- !!merge <<: *mlx
name: "mlx-development"
capabilities:
default: "cpu-mlx-development"
nvidia: "cuda12-mlx-development"
metal: "metal-mlx-development"
nvidia-cuda-12: "cuda12-mlx-development"
nvidia-cuda-13: "cuda13-mlx-development"
nvidia-l4t: "nvidia-l4t-mlx-development"
nvidia-l4t-cuda-12: "nvidia-l4t-mlx-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-mlx-development"
- !!merge <<: *mlx-vlm
name: "mlx-vlm-development"
name: "metal-mlx-vlm"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-mlx-vlm"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-mlx-vlm
- !!merge <<: *mlx-vlm
name: "metal-mlx-vlm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-mlx-vlm"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-mlx-vlm
- !!merge <<: *mlx-vlm
name: "mlx-vlm-development"
capabilities:
default: "cpu-mlx-vlm-development"
nvidia: "cuda12-mlx-vlm-development"
metal: "metal-mlx-vlm-development"
nvidia-cuda-12: "cuda12-mlx-vlm-development"
nvidia-cuda-13: "cuda13-mlx-vlm-development"
nvidia-l4t: "nvidia-l4t-mlx-vlm-development"
nvidia-l4t-cuda-12: "nvidia-l4t-mlx-vlm-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-mlx-vlm-development"
- !!merge <<: *mlx-audio
name: "mlx-audio-development"
name: "metal-mlx-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-mlx-audio"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-mlx-audio
- !!merge <<: *mlx-audio
name: "metal-mlx-audio-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-mlx-audio"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-mlx-audio
- !!merge <<: *mlx-audio
name: "mlx-audio-development"
capabilities:
default: "cpu-mlx-audio-development"
nvidia: "cuda12-mlx-audio-development"
metal: "metal-mlx-audio-development"
nvidia-cuda-12: "cuda12-mlx-audio-development"
nvidia-cuda-13: "cuda13-mlx-audio-development"
nvidia-l4t: "nvidia-l4t-mlx-audio-development"
nvidia-l4t-cuda-12: "nvidia-l4t-mlx-audio-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-mlx-audio-development"
- !!merge <<: *mlx-distributed
name: "mlx-distributed-development"
name: "metal-mlx-distributed"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-mlx-distributed"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-mlx-distributed
- !!merge <<: *mlx-distributed
name: "metal-mlx-distributed-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-mlx-distributed"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-mlx-distributed
- !!merge <<: *mlx-distributed
name: "mlx-distributed-development"
capabilities:
default: "cpu-mlx-distributed-development"
nvidia: "cuda12-mlx-distributed-development"
metal: "metal-mlx-distributed-development"
nvidia-cuda-12: "cuda12-mlx-distributed-development"
nvidia-cuda-13: "cuda13-mlx-distributed-development"
nvidia-l4t: "nvidia-l4t-mlx-distributed-development"
nvidia-l4t-cuda-12: "nvidia-l4t-mlx-distributed-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-mlx-distributed-development"
## mlx
- !!merge <<: *mlx
name: "cpu-mlx"
@@ -1541,10 +1748,20 @@
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-mlx-distributed
- !!merge <<: *kitten-tts
name: "kitten-tts-development"
name: "cpu-kitten-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-kitten-tts"
mirrors:
- localai/localai-backends:latest-kitten-tts
- !!merge <<: *kitten-tts
name: "cpu-kitten-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-kitten-tts"
mirrors:
- localai/localai-backends:master-kitten-tts
- !!merge <<: *kitten-tts
name: "kitten-tts-development"
capabilities:
default: "cpu-kitten-tts-development"
metal: "metal-kitten-tts-development"
- !!merge <<: *kitten-tts
name: "metal-kitten-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-kitten-tts"
@@ -1556,10 +1773,23 @@
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-kitten-tts
- !!merge <<: *local-store
name: "local-store-development"
name: "cpu-local-store"
alias: "local-store"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-local-store"
mirrors:
- localai/localai-backends:latest-cpu-local-store
- !!merge <<: *local-store
name: "cpu-local-store-development"
alias: "local-store"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-local-store"
mirrors:
- localai/localai-backends:master-cpu-local-store
- !!merge <<: *local-store
name: "local-store-development"
alias: "local-store"
capabilities:
default: "cpu-local-store-development"
metal: "metal-local-store-development"
- !!merge <<: *local-store
name: "metal-local-store"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-local-store"
@@ -1567,11 +1797,17 @@
- localai/localai-backends:latest-metal-darwin-arm64-local-store
- !!merge <<: *local-store
name: "metal-local-store-development"
alias: "local-store"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-local-store"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-local-store
- !!merge <<: *opus
name: "opus-development"
name: "cpu-opus"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-opus"
mirrors:
- localai/localai-backends:latest-cpu-opus
- !!merge <<: *opus
name: "cpu-opus-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-opus"
mirrors:
- localai/localai-backends:master-cpu-opus
@@ -1586,10 +1822,20 @@
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-opus
- !!merge <<: *silero-vad
name: "silero-vad-development"
name: "cpu-silero-vad"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-silero-vad"
mirrors:
- localai/localai-backends:latest-cpu-silero-vad
- !!merge <<: *silero-vad
name: "cpu-silero-vad-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-silero-vad"
mirrors:
- localai/localai-backends:master-cpu-silero-vad
- !!merge <<: *silero-vad
name: "silero-vad-development"
capabilities:
default: "cpu-silero-vad-development"
metal: "metal-silero-vad-development"
- !!merge <<: *silero-vad
name: "metal-silero-vad"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-silero-vad"
@@ -1601,10 +1847,20 @@
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-silero-vad
- !!merge <<: *piper
name: "piper-development"
name: "cpu-piper"
uri: "quay.io/go-skynet/local-ai-backends:latest-piper"
mirrors:
- localai/localai-backends:latest-piper
- !!merge <<: *piper
name: "cpu-piper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-piper"
mirrors:
- localai/localai-backends:master-piper
- !!merge <<: *piper
name: "piper-development"
capabilities:
default: "cpu-piper-development"
metal: "metal-piper-development"
- !!merge <<: *piper
name: "metal-piper"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-piper"
@@ -3247,6 +3503,121 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-qwen3-tts-cpp
## omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "omnivoice-cpp-development"
capabilities:
default: "cpu-omnivoice-cpp-development"
nvidia: "cuda12-omnivoice-cpp-development"
nvidia-cuda-13: "cuda13-omnivoice-cpp-development"
nvidia-cuda-12: "cuda12-omnivoice-cpp-development"
intel: "intel-sycl-f16-omnivoice-cpp-development"
metal: "metal-omnivoice-cpp-development"
amd: "rocm-omnivoice-cpp-development"
vulkan: "vulkan-omnivoice-cpp-development"
nvidia-l4t: "nvidia-l4t-arm64-omnivoice-cpp-development"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-omnivoice-cpp-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-omnivoice-cpp-development"
- !!merge <<: *omnivoicecpp
name: "nvidia-l4t-arm64-omnivoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-omnivoice-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "nvidia-l4t-arm64-omnivoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-omnivoice-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "cuda13-nvidia-l4t-arm64-omnivoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-omnivoice-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "cuda13-nvidia-l4t-arm64-omnivoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-omnivoice-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "cpu-omnivoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-omnivoice-cpp"
mirrors:
- localai/localai-backends:latest-cpu-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "metal-omnivoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-omnivoice-cpp"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "metal-omnivoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-omnivoice-cpp"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "cpu-omnivoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-omnivoice-cpp"
mirrors:
- localai/localai-backends:master-cpu-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "cuda12-omnivoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-omnivoice-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "rocm-omnivoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-omnivoice-cpp"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "intel-sycl-f32-omnivoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-omnivoice-cpp"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f32-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "intel-sycl-f16-omnivoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-omnivoice-cpp"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f16-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "vulkan-omnivoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-omnivoice-cpp"
mirrors:
- localai/localai-backends:latest-gpu-vulkan-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "vulkan-omnivoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-omnivoice-cpp"
mirrors:
- localai/localai-backends:master-gpu-vulkan-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "cuda12-omnivoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-omnivoice-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "rocm-omnivoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-omnivoice-cpp"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "intel-sycl-f32-omnivoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-omnivoice-cpp"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f32-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "intel-sycl-f16-omnivoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-omnivoice-cpp"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f16-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "cuda13-omnivoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-omnivoice-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-omnivoice-cpp
- !!merge <<: *omnivoicecpp
name: "cuda13-omnivoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-omnivoice-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-omnivoice-cpp
## vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "nvidia-l4t-arm64-vibevoice-cpp"
@@ -4577,24 +4948,24 @@
- localai/localai-backends:master-cpu-trl
- !!merge <<: *trl
name: "cuda12-trl"
uri: "quay.io/go-skynet/local-ai-backends:latest-cublas-cuda12-trl"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-trl"
mirrors:
- localai/localai-backends:latest-cublas-cuda12-trl
- localai/localai-backends:latest-gpu-nvidia-cuda-12-trl
- !!merge <<: *trl
name: "cuda12-trl-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cublas-cuda12-trl"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-trl"
mirrors:
- localai/localai-backends:master-cublas-cuda12-trl
- localai/localai-backends:master-gpu-nvidia-cuda-12-trl
- !!merge <<: *trl
name: "cuda13-trl"
uri: "quay.io/go-skynet/local-ai-backends:latest-cublas-cuda13-trl"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-trl"
mirrors:
- localai/localai-backends:latest-cublas-cuda13-trl
- localai/localai-backends:latest-gpu-nvidia-cuda-13-trl
- !!merge <<: *trl
name: "cuda13-trl-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cublas-cuda13-trl"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-trl"
mirrors:
- localai/localai-backends:master-cublas-cuda13-trl
- localai/localai-backends:master-gpu-nvidia-cuda-13-trl
## llama.cpp quantization backend
- &llama-cpp-quantization
name: "llama-cpp-quantization"
@@ -4685,12 +5056,14 @@
default: "cpu-speaker-recognition"
nvidia: "cuda12-speaker-recognition"
nvidia-cuda-12: "cuda12-speaker-recognition"
metal: "metal-speaker-recognition"
- !!merge <<: *speakerrecognition
name: "speaker-recognition-development"
capabilities:
default: "cpu-speaker-recognition-development"
nvidia: "cuda12-speaker-recognition-development"
nvidia-cuda-12: "cuda12-speaker-recognition-development"
metal: "metal-speaker-recognition-development"
- !!merge <<: *speakerrecognition
name: "cpu-speaker-recognition"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-speaker-recognition"
@@ -4711,6 +5084,16 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-speaker-recognition"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-speaker-recognition
- !!merge <<: *speakerrecognition
name: "metal-speaker-recognition"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-speaker-recognition"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-speaker-recognition
- !!merge <<: *speakerrecognition
name: "metal-speaker-recognition-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-speaker-recognition"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-speaker-recognition
## sherpa-onnx
- !!merge <<: *sherpa-onnx
name: "sherpa-onnx-development"
@@ -4718,6 +5101,7 @@
default: "cpu-sherpa-onnx-development"
nvidia: "cuda12-sherpa-onnx-development"
nvidia-cuda-12: "cuda12-sherpa-onnx-development"
metal: "metal-sherpa-onnx-development"
- !!merge <<: *sherpa-onnx
name: "cpu-sherpa-onnx"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-sherpa-onnx"
@@ -4738,3 +5122,13 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sherpa-onnx"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-sherpa-onnx
- !!merge <<: *sherpa-onnx
name: "metal-sherpa-onnx"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-sherpa-onnx"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-sherpa-onnx
- !!merge <<: *sherpa-onnx
name: "metal-sherpa-onnx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-sherpa-onnx"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-sherpa-onnx

View File

@@ -407,6 +407,24 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
if not request.Prompt and request.UseTokenizerTemplate and request.Messages:
messages = messages_to_dicts(request.Messages)
# The mlx-lm tokenizer only carries a text-LM chat template. A
# vision-language checkpoint (e.g. gemma-4 E4B) loaded here has no
# usable template, so apply_chat_template silently passes the raw
# text through and the model just echoes/loops (issue #10269).
# Warn loudly so the misroute is visible; such models belong on the
# mlx-vlm backend.
chat_template = getattr(self.tokenizer, "chat_template", None)
if not chat_template:
underlying = getattr(self.tokenizer, "_tokenizer", None)
chat_template = getattr(underlying, "chat_template", None)
if not chat_template:
print(
"WARNING: this model has no chat template; output may be "
"degenerate. Vision-language models (e.g. gemma-4 E4B) must "
"use the 'mlx-vlm' backend instead of 'mlx'.",
file=sys.stderr,
)
kwargs = {"tokenize": False, "add_generation_prompt": True}
if request.Tools:
try:

View File

@@ -1,6 +1,7 @@
--extra-index-url https://download.pytorch.org/whl/cpu
accelerate
torch==2.8.0
torchaudio==2.8.0
transformers==4.56.1
librosa==0.11.0
neucodec>=0.0.4

View File

@@ -3,6 +3,7 @@ neucodec>=0.0.4
phonemizer==3.3.0
soundfile==0.13.1
torch==2.8.0
torchaudio==2.8.0
transformers==4.56.1
resemble-perth==1.0.1
accelerate

View File

@@ -1,4 +1,4 @@
transformers
accelerate
torch==2.7.1+xpu
torch==2.7.1
rerankers[transformers]

View File

@@ -1,4 +1,4 @@
transformers
accelerate
torch==2.7.1+xpu
torch==2.7.1
rerankers[transformers]

View File

@@ -1,5 +1,5 @@
--extra-index-url https://download.pytorch.org/whl/cu130
transformers
accelerate
torch==2.7.1+xpu
torch==2.9.1
rerankers[transformers]

View File

@@ -1,5 +1,5 @@
--extra-index-url https://download.pytorch.org/whl/rocm7.0
transformers
accelerate
torch==2.7.1+xpu
torch==2.10.0+rocm7.0
rerankers[transformers]

View File

@@ -1,4 +1,4 @@
torch==2.7.1+xpu
torch==2.7.1
transformers
accelerate
rerankers[transformers]

View File

@@ -0,0 +1,5 @@
torch
torchaudio
speechbrain
transformers
onnxruntime

View File

@@ -26,7 +26,10 @@ from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid
from vllm.transformers_utils.tokenizer import get_tokenizer
try:
from vllm.tokenizers import get_tokenizer # vLLM >= 0.22
except ImportError:
from vllm.transformers_utils.tokenizer import get_tokenizer # vLLM < 0.22
from vllm.multimodal.utils import fetch_image
from vllm.assets.video import VideoAsset
import base64
@@ -147,9 +150,24 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
d["reasoning_content"] = msg.reasoning_content
if msg.tool_calls:
try:
d["tool_calls"] = json.loads(msg.tool_calls)
tool_calls = json.loads(msg.tool_calls)
except json.JSONDecodeError:
pass
else:
# OpenAI wire format carries function.arguments as a
# JSON-encoded string, but chat templates (e.g. Qwen3)
# iterate over it as a mapping. vLLM's own OpenAI server
# parses arguments before applying the template, so do
# the same here.
if isinstance(tool_calls, list):
for tc in tool_calls:
func = tc.get("function") if isinstance(tc, dict) else None
if isinstance(func, dict) and isinstance(func.get("arguments"), str):
try:
func["arguments"] = json.loads(func["arguments"])
except json.JSONDecodeError:
pass
d["tool_calls"] = tool_calls
result.append(d)
return result
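
The argument decoding in this hunk can be sketched as a standalone function. This is an illustration under assumptions: `decode_tool_calls` is a name invented here, and the sample payload (`get_weather`, `city`) is made up. The logic follows the diff: decode the outer JSON, then decode each string-valued `function.arguments`, leaving anything malformed untouched.

```python
import json


def decode_tool_calls(raw_tool_calls):
    """Parse a JSON-encoded tool_calls list and turn each string-valued
    function.arguments into a mapping. Returns None if the outer JSON is
    invalid, so the caller can skip setting the field."""
    try:
        tool_calls = json.loads(raw_tool_calls)
    except json.JSONDecodeError:
        return None
    if isinstance(tool_calls, list):
        for tc in tool_calls:
            func = tc.get("function") if isinstance(tc, dict) else None
            if isinstance(func, dict) and isinstance(func.get("arguments"), str):
                try:
                    func["arguments"] = json.loads(func["arguments"])
                except json.JSONDecodeError:
                    # Leave malformed arguments as the original string.
                    pass
    return tool_calls


# Hypothetical payload in OpenAI wire format: arguments is a JSON string.
raw = json.dumps([
    {
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Rome\"}"},
    }
])
decoded = decode_tool_calls(raw)
```

After decoding, `decoded[0]["function"]["arguments"]` is a dict, so a chat template that iterates over it as a mapping has something to iterate over.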

View File

@@ -3,5 +3,5 @@
# on a cu130 host. Pull the cu130-flavoured wheel from vLLM's per-tag index
# instead — the cublas13 case in install.sh adds --index-strategy=unsafe-best-match
# so uv consults this index alongside PyPI.
--extra-index-url https://wheels.vllm.ai/0.22.1/cu130
vllm==0.22.1
--extra-index-url https://wheels.vllm.ai/0.23.0/cu130
vllm==0.23.0

View File

@@ -161,6 +161,21 @@ func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB, configLoade
}
xlog.Info("Node registry initialized")
// Seed declarative per-model scheduling config (LOCALAI_MODEL_SCHEDULING /
// LOCALAI_MODEL_SCHEDULING_CONFIG). Authoritative: overwrites matching models
// on every boot. Runs before the reconciler starts so the first tick already
// sees the desired state. Models not listed are left untouched.
if cfg.Distributed.ModelSchedulingJSON != "" || cfg.Distributed.ModelSchedulingConfigPath != "" {
schedConfigs, err := nodes.ParseSchedulingSeed(cfg.Distributed.ModelSchedulingJSON, cfg.Distributed.ModelSchedulingConfigPath)
if err != nil {
return nil, fmt.Errorf("parsing declarative model scheduling config: %w", err)
}
if err := registry.SeedModelScheduling(context.Background(), schedConfigs); err != nil {
return nil, fmt.Errorf("seeding declarative model scheduling config: %w", err)
}
xlog.Info("Applied declarative model scheduling config", "models", len(schedConfigs))
}
// Collect SmartRouter option values; the router itself is created after all
// dependencies (including FileStager and Unloader) are ready.
var routerAuthToken string

View File

@@ -11,6 +11,29 @@ import (
"github.com/mudler/xlog"
)
// startMITMIfConfigured brings up the cloudproxy MITM listener when an
// address is configured, treating any startup failure as non-fatal.
//
// The listener is opt-in middleware whose address is persisted in runtime
// settings (/api/settings → runtime_settings.json) and replayed on every
// boot. A bad value — e.g. a host the process can't bind, like a LAN IP
// inside a container — must NOT abort the whole server: doing so crash-loops
// with no way out, because the Settings UI used to correct the address can't
// load if startup never completes. So on failure we log loudly and carry on;
// the admin fixes the address via /api/settings, which calls RestartMITM.
func startMITMIfConfigured(app *Application, options *config.ApplicationConfig) {
if options.MITMListen == "" {
return
}
if err := startMITMProxy(app, options); err != nil {
xlog.Error("mitm: cloudproxy listener failed to start — continuing without it",
"listen", options.MITMListen,
"error", err,
"hint", "fix the address via Settings (e.g. \":8082\" to bind all interfaces) and the listener will restart",
)
}
}
func startMITMProxy(app *Application, options *config.ApplicationConfig) error {
app.mitmMutex.Lock()
defer app.mitmMutex.Unlock()
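
The "log and carry on" behaviour described in the comment above can be sketched as follows. It is written in Python rather than Go purely as an illustration. `start_listener`, `start_if_configured` and the injected `start` / `log` parameters are hypothetical names, not LocalAI APIs.

```python
import socket


def start_listener(addr):
    """Bind and listen on (host, port). Raises OSError when the address
    cannot be bound."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(addr)
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


def start_if_configured(addr, start=start_listener, log=print):
    """Start the optional listener. A startup failure is logged and
    swallowed so the rest of the process still comes up."""
    if not addr:
        return None  # feature disabled: nothing to start
    try:
        return start(addr)
    except OSError as err:
        log(f"listener failed to start, continuing without it: listen={addr} error={err}")
        return None
```

The design point is that the caller never sees the bind error: an optional component whose address is replayed from persisted settings should not be able to stop the server that hosts the UI used to correct that address.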

View File

@@ -0,0 +1,58 @@
package application
import (
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/system"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// minimal Application wired enough for startMITMProxy: an empty model
// config loader (no host claims), CA written under a temp DataPath.
func newMITMTestApp(dataPath string) (*Application, *config.ApplicationConfig) {
state, err := system.GetSystemState()
Expect(err).NotTo(HaveOccurred())
state.Model.ModelsPath = dataPath
opts := config.NewApplicationConfig(
config.WithSystemState(state),
config.WithDataPath(dataPath),
)
return newApplication(opts), opts
}
var _ = Describe("startMITMIfConfigured", func() {
It("does nothing when no listen address is configured", func() {
app, opts := newMITMTestApp(GinkgoT().TempDir())
opts.MITMListen = ""
Expect(func() { startMITMIfConfigured(app, opts) }).NotTo(Panic())
Expect(app.mitmServer.Load()).To(BeNil(), "no listener should be stored when disabled")
})
// Regression: a persisted-but-unbindable MITM address (e.g. a LAN host
// inside a container) must not abort startup. startMITMIfConfigured
// swallows the bind error so the rest of LocalAI still comes up and the
// admin can fix the address via the Settings UI.
It("logs and continues when the listen address cannot be bound", func() {
app, opts := newMITMTestApp(GinkgoT().TempDir())
// 192.0.2.1 is TEST-NET-1 (RFC 5737): guaranteed not assigned to any
// local interface, so bind fails deterministically without DNS.
opts.MITMListen = "192.0.2.1:8082"
Expect(func() { startMITMIfConfigured(app, opts) }).NotTo(Panic())
Expect(app.mitmServer.Load()).To(BeNil(), "failed listener must not be stored")
})
It("starts and stores the listener on a bindable address", func() {
app, opts := newMITMTestApp(GinkgoT().TempDir())
opts.MITMListen = "127.0.0.1:0" // OS-assigned free port
startMITMIfConfigured(app, opts)
srv := app.mitmServer.Load()
Expect(srv).NotTo(BeNil(), "listener should be stored on success")
DeferCleanup(srv.Stop)
Expect(srv.Addr()).NotTo(BeEmpty())
})
})

View File

@@ -1,63 +1,120 @@
package application
import (
"context"
"fmt"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
)
// adapterConfig resolves a model name to its runtime ModelConfig, or
// nil when the name is unknown. Shared by the router-facing factories
// below and by ModelConfigLookup.
// adapterConfig resolves a model name to its runtime ModelConfig, or nil when
// unknown. LoadModelConfigFileByNameDefaultOptions never returns nil — for an
// unknown name it returns a defaults-filled stub with an empty Name (the YAML
// `name:` field is required by Validate), which is how we tell the two apart.
func (a *Application) adapterConfig(modelName string) *config.ModelConfig {
cfg, err := a.backendLoader.LoadModelConfigFileByNameDefaultOptions(modelName, a.applicationConfig)
if err != nil || cfg == nil {
if err != nil || cfg == nil || cfg.Name == "" {
return nil
}
return cfg
}
// ModelConfigLookup is the lookup function the router middleware's
// classifier validator uses to confirm classifier_model declares
// FLAG_SCORE before binding it.
// ModelConfigLookup is the lookup the router middleware's classifier validator
// uses to confirm classifier_model declares FLAG_SCORE before binding it.
func (a *Application) ModelConfigLookup() func(modelName string) *config.ModelConfig {
return a.adapterConfig
}
// Scorer returns a backend.Scorer bound to the named model, or nil
// when the model is unknown. Used as a method value (app.Scorer) by
// router.ClassifierDeps — no factory-of-factory wrapper needed.
// The router-facing factories below (Scorer, Embedder, Reranker, TokenCounter)
// bind a model NAME at construction and re-resolve the CONFIG on every call.
// Capturing the config at construction would bake in whatever state
// adapterConfig saw first — including a stub returned before the YAML reached
// bcl.configs (e.g. /import-model or gallery install racing startup). The
// classifier registry caches factories by router-config fingerprint, so a
// once-stale capture stays stale until the router config is edited.
func (a *Application) Scorer(modelName string) backend.Scorer {
cfg := a.adapterConfig(modelName)
if cfg == nil {
if a.adapterConfig(modelName) == nil {
return nil
}
return backend.NewScorer(a.modelLoader, *cfg, a.applicationConfig)
return &lazyScorer{app: a, modelName: modelName}
}
type lazyScorer struct {
app *Application
modelName string
}
func (l *lazyScorer) Score(ctx context.Context, prompt string, candidates []string) ([]backend.CandidateScore, error) {
cfg := l.app.adapterConfig(l.modelName)
if cfg == nil {
return nil, fmt.Errorf("scorer: model %q no longer available", l.modelName)
}
return backend.NewScorer(l.app.modelLoader, *cfg, l.app.applicationConfig).Score(ctx, prompt, candidates)
}
// TokenCounter returns a func so the middleware's literal field type accepts
// it as a method value without importing core/http/middleware from here.
func (a *Application) TokenCounter(modelName string) func(string) (int, error) {
if a.adapterConfig(modelName) == nil {
return nil
}
return func(text string) (int, error) {
cfg := a.adapterConfig(modelName)
if cfg == nil {
return 0, fmt.Errorf("token counter: model %q no longer available", modelName)
}
resp, err := backend.ModelTokenize(text, a.modelLoader, *cfg, a.applicationConfig)
if err != nil {
return 0, err
}
return len(resp.Tokens), nil
}
}
// Reranker returns a backend.Reranker bound to the named model, or
// nil when unknown. The reranker model's `type:` (e.g. "colbert")
// selects the scoring head inside the rerankers backend.
func (a *Application) Reranker(modelName string) backend.Reranker {
cfg := a.adapterConfig(modelName)
if cfg == nil {
if a.adapterConfig(modelName) == nil {
return nil
}
return backend.NewReranker(a.modelLoader, *cfg, a.applicationConfig)
return &lazyReranker{app: a, modelName: modelName}
}
type lazyReranker struct {
app *Application
modelName string
}
func (l *lazyReranker) Rerank(ctx context.Context, query string, documents []string) ([]backend.RerankResult, error) {
cfg := l.app.adapterConfig(l.modelName)
if cfg == nil {
return nil, fmt.Errorf("reranker: model %q no longer available", l.modelName)
}
return backend.NewReranker(l.app.modelLoader, *cfg, l.app.applicationConfig).Rerank(ctx, query, documents)
}
// Embedder returns a backend.Embedder bound to the named model, or
// nil when unknown. Used by the router's L2 embedding cache.
func (a *Application) Embedder(modelName string) backend.Embedder {
cfg := a.adapterConfig(modelName)
if cfg == nil {
if a.adapterConfig(modelName) == nil {
return nil
}
return backend.NewEmbedder(a.modelLoader, *cfg, a.applicationConfig)
return &lazyEmbedder{app: a, modelName: modelName}
}
// VectorStore returns a backend.VectorStore for the named collection,
// or nil when the name is empty. Each router model gets its own
// backend process via the model loader's cache keyed by storeName.
type lazyEmbedder struct {
app *Application
modelName string
}
func (l *lazyEmbedder) Embed(ctx context.Context, text string) ([]float32, error) {
cfg := l.app.adapterConfig(l.modelName)
if cfg == nil {
return nil, fmt.Errorf("embedder: model %q no longer available", l.modelName)
}
return backend.NewEmbedder(l.app.modelLoader, *cfg, l.app.applicationConfig).Embed(ctx, text)
}
// VectorStore takes a store name, not a model name — no adapterConfig, no
// staleness to avoid.
func (a *Application) VectorStore(storeName string) backend.VectorStore {
return backend.NewVectorStore(a.modelLoader, a.applicationConfig, storeName)
}
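
The name-binding pattern these factories adopt can be sketched independently of LocalAI. The example is in Python rather than Go, and everything in it is hypothetical: `ConfigRegistry` stands in for the model config loader, and `LazyEmbedder.backend_for_call` stands in for the work done per call. What it shows is the split between a one-time existence check at construction and a fresh lookup on every call.

```python
class ConfigRegistry:
    """Hypothetical in-memory stand-in for the model config loader."""

    def __init__(self):
        self._configs = {}

    def set(self, name, backend):
        self._configs[name] = {"name": name, "backend": backend}

    def remove(self, name):
        self._configs.pop(name, None)

    def lookup(self, name):
        # Returns None for an unknown name.
        return self._configs.get(name)


class LazyEmbedder:
    """Holds only the model NAME; the config is resolved on each call."""

    def __init__(self, registry, model_name):
        self._registry = registry
        self.model_name = model_name

    def backend_for_call(self):
        cfg = self._registry.lookup(self.model_name)
        if cfg is None:
            raise LookupError(f'embedder: model "{self.model_name}" no longer available')
        return cfg["backend"]


def make_embedder(registry, model_name):
    """Return None for an unknown model, otherwise a lazy wrapper."""
    if registry.lookup(model_name) is None:
        return None
    return LazyEmbedder(registry, model_name)
```

Because the wrapper stores no config, a later update or removal in the registry is visible on the very next call, even if the wrapper itself was cached elsewhere.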

View File

@@ -0,0 +1,155 @@
package application
import (
"context"
"os"
"path/filepath"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/LocalAI/pkg/system"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// Regression: the router-facing factories used to capture
// *config.ModelConfig at construction. A gallery install that raced
// startup left a stub (Backend="") bound for the lifetime of the
// classifier registry's cache entry, bypassing the user's `backend:`
// config. These specs pin the lazy re-resolve.
var _ = Describe("router_factories lazy config resolution", func() {
var (
tmpDir string
app *Application
)
BeforeEach(func() {
var err error
tmpDir, err = os.MkdirTemp("", "router-factories-*")
Expect(err).NotTo(HaveOccurred())
appCfg := &config.ApplicationConfig{
Context: context.Background(),
SystemState: &system.SystemState{Model: system.Model{ModelsPath: tmpDir}},
}
app = &Application{
backendLoader: config.NewModelConfigLoader(tmpDir),
modelLoader: model.NewModelLoader(appCfg.SystemState),
applicationConfig: appCfg,
}
})
AfterEach(func() {
_ = os.RemoveAll(tmpDir)
})
// writeCfg seeds both the on-disk YAML and the in-memory cache —
// removing only the cache would fall through to file-read.
writeCfg := func(name, backend string) {
yaml := "name: " + name + "\nbackend: " + backend + "\nparameters:\n model: " + name + ".bin\n"
Expect(os.WriteFile(filepath.Join(tmpDir, name+".yaml"), []byte(yaml), 0644)).To(Succeed())
Expect(app.backendLoader.LoadModelConfigsFromPath(tmpDir)).To(Succeed())
cfg, ok := app.backendLoader.GetModelConfig(name)
Expect(ok).To(BeTrue(), "config must be loaded before the spec runs")
Expect(cfg.Backend).To(Equal(backend))
}
// removeCfg purges both the cache and the YAML so LoadModelConfigFileByName
// returns the empty-stub case and adapterConfig returns nil.
removeCfg := func(name string) {
app.backendLoader.RemoveModelConfig(name)
Expect(os.Remove(filepath.Join(tmpDir, name+".yaml"))).To(Succeed())
}
Context("Embedder", func() {
It("returns nil at construction for an unknown model", func() {
Expect(app.Embedder("missing")).To(BeNil())
})
It("re-resolves the model config on each Embed call", func() {
writeCfg("emb-test", "llama-cpp")
emb := app.Embedder("emb-test")
Expect(emb).NotTo(BeNil())
// The factory must hold the NAME, not a captured config —
// otherwise stale captures survive cache invalidation.
lazy, ok := emb.(*lazyEmbedder)
Expect(ok).To(BeTrue(), "Embedder must return *lazyEmbedder")
Expect(lazy.modelName).To(Equal("emb-test"))
// Mutate the cached config. A lazy implementation sees the
// update on the next adapterConfig call; a captured-at-
// construction implementation would still see "llama-cpp".
app.backendLoader.UpdateModelConfig("emb-test", func(c *config.ModelConfig) {
c.Backend = "rerankers"
})
Expect(lazy.app.adapterConfig("emb-test").Backend).To(Equal("rerankers"))
// Remove the config entirely → Embed must surface the disappearance.
removeCfg("emb-test")
_, err := emb.Embed(context.Background(), "anything")
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("no longer available"))
})
})
Context("Scorer", func() {
It("returns nil at construction for an unknown model", func() {
Expect(app.Scorer("missing")).To(BeNil())
})
It("re-resolves the model config on each Score call", func() {
writeCfg("score-test", "llama-cpp")
sc := app.Scorer("score-test")
Expect(sc).NotTo(BeNil())
lazy, ok := sc.(*lazyScorer)
Expect(ok).To(BeTrue(), "Scorer must return *lazyScorer")
Expect(lazy.modelName).To(Equal("score-test"))
removeCfg("score-test")
_, err := sc.Score(context.Background(), "prompt", []string{"a"})
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("no longer available"))
})
})
Context("Reranker", func() {
It("returns nil at construction for an unknown model", func() {
Expect(app.Reranker("missing")).To(BeNil())
})
It("re-resolves the model config on each Rerank call", func() {
writeCfg("rerank-test", "rerankers")
rr := app.Reranker("rerank-test")
Expect(rr).NotTo(BeNil())
lazy, ok := rr.(*lazyReranker)
Expect(ok).To(BeTrue(), "Reranker must return *lazyReranker")
Expect(lazy.modelName).To(Equal("rerank-test"))
removeCfg("rerank-test")
_, err := rr.Rerank(context.Background(), "q", []string{"d"})
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("no longer available"))
})
})
Context("TokenCounter", func() {
It("returns nil at construction for an unknown model", func() {
Expect(app.TokenCounter("missing")).To(BeNil())
})
It("re-resolves the model config on each call", func() {
writeCfg("tok-test", "llama-cpp")
tc := app.TokenCounter("tok-test")
Expect(tc).NotTo(BeNil())
removeCfg("tok-test")
_, err := tc("anything")
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("no longer available"))
})
})
})

View File

@@ -23,9 +23,9 @@ import (
"github.com/mudler/LocalAI/core/services/routing/pii"
"github.com/mudler/LocalAI/core/services/routing/router"
"github.com/mudler/LocalAI/core/services/storage"
"github.com/mudler/LocalAI/pkg/signals"
coreStartup "github.com/mudler/LocalAI/core/startup"
"github.com/mudler/LocalAI/internal"
"github.com/mudler/LocalAI/pkg/signals"
"github.com/mudler/LocalAI/pkg/vram"
"github.com/mudler/LocalAI/pkg/model"
@@ -308,10 +308,31 @@ func New(opts ...config.AppOption) (*Application, error) {
application.galleryService.SetNATSClient(distSvc.Nats)
if distSvc.DistStores != nil && distSvc.DistStores.Gallery != nil {
// Clean up stale in-progress operations from previous crashed instances
if err := distSvc.DistStores.Gallery.CleanStale(30 * time.Minute); err != nil {
if _, err := distSvc.DistStores.Gallery.CleanStale(30 * time.Minute); err != nil {
xlog.Warn("Failed to clean stale gallery operations", "error", err)
}
application.galleryService.SetGalleryStore(distSvc.DistStores.Gallery)
// Reap stale ops periodically, not just at boot: an op orphaned by
// a replica that died mid-install (its foreground handler goroutine
// gone) would otherwise linger "processing" in the UI until the next
// restart. 30m matches the install/upgrade ceiling so a genuinely
// slow op is never reaped out from under itself.
gsvc := application.galleryService
go func() {
ticker := time.NewTicker(15 * time.Minute)
defer ticker.Stop()
for {
select {
case <-options.Context.Done():
return
case <-ticker.C:
if _, err := gsvc.ReapStaleOperations(30 * time.Minute); err != nil {
xlog.Warn("Failed to reap stale gallery operations", "error", err)
}
}
}
}()
}
// Hydrate from the store first so the wildcard subscriber finds an
// already-populated statuses map for any operations still in flight
@@ -441,11 +462,7 @@ func New(opts ...config.AppOption) (*Application, error) {
// traffic doesn't need a parallel config for MITM traffic.
// Runs after loadRuntimeSettingsFromFile so a listener configured
// via /api/settings is brought back up across restarts.
if options.MITMListen != "" {
if err := startMITMProxy(application, options); err != nil {
return nil, fmt.Errorf("mitm: startup: %w", err)
}
}
startMITMIfConfigured(application, options)
application.ModelLoader().SetBackendLoggingEnabled(options.EnableBackendLogging)
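
Reviewer note on the periodic reaper added in the `-308,10 +308,31` hunk: the goroutine follows the usual ticker-plus-context loop. A minimal standalone sketch of that shape follows. `runPeriodically` and `countTicks` are hypothetical names used only for illustration and are not part of LocalAI; the real code calls `gsvc.ReapStaleOperations` inside the tick branch.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runPeriodically calls fn on every tick until ctx is cancelled.
// It blocks, so callers normally start it with `go`.
// interval must be positive, otherwise time.NewTicker panics.
func runPeriodically(ctx context.Context, interval time.Duration, fn func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop() // release the ticker when the loop exits
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			fn()
		}
	}
}

// countTicks is a hypothetical demo helper: it runs the loop for runForMs
// milliseconds and reports how many times the callback fired.
func countTicks(intervalMs, runForMs int) int {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(runForMs)*time.Millisecond)
	defer cancel()
	n := 0
	// Runs in the calling goroutine, so n needs no synchronization.
	runPeriodically(ctx, time.Duration(intervalMs)*time.Millisecond, func() { n++ })
	return n
}

func main() {
	// The exact count depends on scheduling, so no fixed output is claimed.
	fmt.Println("ticks:", countTicks(10, 55))
}
```

Because the loop exits on `ctx.Done()`, the reaper stops with the application context and needs no separate shutdown hook.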

View File

@@ -214,7 +214,9 @@ func (uc *UpgradeChecker) runCheck(ctx context.Context) {
"from", info.InstalledVersion, "to", info.AvailableVersion)
var err error
if bm != nil {
err = bm.UpgradeBackend(ctx, name, nil)
// Background auto-upgrade: no live admin watching a progress bar,
// so opID is empty and the distributed path skips progress streaming.
err = bm.UpgradeBackend(ctx, "", name, nil)
} else {
err = gallery.UpgradeBackend(ctx, uc.systemState, uc.modelLoader,
uc.galleries, name, nil, uc.appConfig.RequireBackendIntegrity)

View File

@@ -100,8 +100,13 @@ func ModelEmbedding(ctx context.Context, s string, tokens []int, loader *model.M
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
traceData := map[string]any{
"input_text": trace.TruncateString(s, 1000),
"input_tokens_count": len(tokens),
"input_text": trace.TruncateString(s, 1000),
}
// Only present for token-mode callers (pre-tokenized override);
// emitting "0" alongside input_text would read as "consumed zero
// tokens", which is wrong.
if len(tokens) > 0 {
traceData["input_tokens_count"] = len(tokens)
}
startTime := time.Now()

View File

@@ -87,11 +87,47 @@ func getSeed(c config.ModelConfig) int32 {
return seed
}
func grpcModelOpts(c config.ModelConfig, modelPath string) *pb.ModelOptions {
b := 512
if c.Batch != 0 {
b = c.Batch
// DefaultContextSize and DefaultBatchSize are the backend's fallbacks when a
// model config leaves them unset. Exported so callers that must respect the
// effective decode window — notably the router's prompt trimmer — resolve the
// same numbers grpcModelOpts does instead of guessing.
const (
DefaultContextSize = 4096
DefaultBatchSize = 512
)
// EffectiveContextSize is the context window the backend will run with: the
// configured value, or DefaultContextSize when unset.
func EffectiveContextSize(c config.ModelConfig) int {
if c.ContextSize != nil {
return *c.ContextSize
}
return DefaultContextSize
}
// EffectiveBatchSize is the single-decode batch the backend will run with.
// Score, embedding and rerank all process the whole input in one pass: score
// decodes prompt+candidate (asserts n_tokens <= n_batch), and embedding/rerank
// pool over the full sequence in one physical batch (n_ubatch). So the batch
// is sized to the context — anything that fits the context fits one pass,
// avoiding both the GGML_ASSERT crash and the "input is too large to process"
// error. Explicit `batch:` always wins.
func EffectiveBatchSize(c config.ModelConfig) int {
if c.Batch != 0 {
return c.Batch
}
singlePass := c.HasUsecases(config.FLAG_SCORE) ||
c.HasUsecases(config.FLAG_EMBEDDINGS) ||
c.HasUsecases(config.FLAG_RERANK)
if ctx := EffectiveContextSize(c); singlePass && ctx > DefaultBatchSize {
return ctx
}
return DefaultBatchSize
}
func grpcModelOpts(c config.ModelConfig, modelPath string) *pb.ModelOptions {
ctxSize := EffectiveContextSize(c)
b := EffectiveBatchSize(c)
flashAttention := "auto"
@@ -134,11 +170,6 @@ func grpcModelOpts(c config.ModelConfig, modelPath string) *pb.ModelOptions {
}
}
ctxSize := 4096
if c.ContextSize != nil {
ctxSize = *c.ContextSize
}
mmlock := false
if c.MMlock != nil {
mmlock = *c.MMlock
@@ -276,11 +307,19 @@ func gRPCPredictOpts(c config.ModelConfig, modelPath string) *pb.PredictOptions
}
}
// TopK may be nil after SetDefaults for backends that don't use llama.cpp's
// top_k=40 default (issue #6632, e.g. mlx). proto3 int32 can't be unset, so
// send 0 — the value mlx actually wants (top-k disabled).
var topK int32
if c.TopK != nil {
topK = int32(*c.TopK)
}
pbOpts := &pb.PredictOptions{
Temperature: float32(*c.Temperature),
TopP: float32(*c.TopP),
NDraft: c.NDraft,
TopK: int32(*c.TopK),
TopK: topK,
MinP: float32(*c.MinP),
Tokens: int32(*c.Maxtokens),
Threads: int32(*c.Threads),
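
Reviewer note on the `TopK` change above: the fix replaces an unconditional pointer dereference with a nil check that falls back to the proto3 zero value. A minimal sketch of the same guard, assuming a plain `*int` option; `topKOrZero` is a hypothetical helper name, not a function in the diff.

```go
package main

import "fmt"

// topKOrZero converts an optional top-k setting to the int32 a proto3 field
// expects. A nil pointer means "unset" and maps to 0, which the diff's
// comment describes as "top-k disabled" for backends such as mlx.
func topKOrZero(topK *int) int32 {
	if topK == nil {
		return 0
	}
	return int32(*topK)
}

func main() {
	k := 40
	fmt.Println(topKOrZero(nil)) // prints 0
	fmt.Println(topKOrZero(&k))  // prints 40
}
```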

View File

@@ -97,3 +97,67 @@ var _ = Describe("gRPCPredictOpts reasoning_effort metadata", func() {
Expect(opts.Metadata).ToNot(HaveKey("reasoning_effort"))
})
})
var _ = Describe("grpcModelOpts NBatch", func() {
scoreUsecase := config.FLAG_SCORE
threads := 1
ctx := 4096
It("defaults to 512 for an ordinary model", func() {
cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}}
opts := grpcModelOpts(cfg, "/tmp/models")
Expect(opts.NBatch).To(BeEquivalentTo(512))
})
It("sizes the batch to the context window for score models", func() {
// Score models decode the whole prompt+candidate in one
// llama_decode; n_batch must cover it or the backend aborts.
cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}, KnownUsecases: &scoreUsecase}
opts := grpcModelOpts(cfg, "/tmp/models")
Expect(opts.NBatch).To(BeEquivalentTo(4096))
})
It("keeps an explicit batch over the score default", func() {
cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}, KnownUsecases: &scoreUsecase}
cfg.Batch = 1024
opts := grpcModelOpts(cfg, "/tmp/models")
Expect(opts.NBatch).To(BeEquivalentTo(1024))
})
It("sizes the batch to the context window for embedding models", func() {
// Embedding/rerank pool over the whole sequence in one physical batch
// (n_ubatch); without this the input is capped at the 512 default and
// the backend returns "input is too large to process".
embeddings := true
cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}}
cfg.Embeddings = &embeddings
opts := grpcModelOpts(cfg, "/tmp/models")
Expect(opts.NBatch).To(BeEquivalentTo(4096))
})
It("sizes the batch to the context window for rerank models", func() {
reranking := true
cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}}
cfg.Reranking = &reranking
opts := grpcModelOpts(cfg, "/tmp/models")
Expect(opts.NBatch).To(BeEquivalentTo(4096))
})
It("does not raise the batch when a score model's context is below the default", func() {
small := 256
cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &small}, KnownUsecases: &scoreUsecase}
opts := grpcModelOpts(cfg, "/tmp/models")
Expect(opts.NBatch).To(BeEquivalentTo(512))
})
It("sizes the batch to the effective 4096 default for a score model with no explicit context_size", func() {
// The crash case: the backend defaults n_ctx to 4096, so n_batch must
// follow even when context_size is unset — otherwise n_batch stays 512
// against a 4096 window and the score decode hits the GGML_ASSERT.
cfg := config.ModelConfig{Threads: &threads, KnownUsecases: &scoreUsecase}
Expect(cfg.ContextSize).To(BeNil())
opts := grpcModelOpts(cfg, "/tmp/models")
Expect(opts.NBatch).To(BeEquivalentTo(4096))
Expect(opts.ContextSize).To(BeEquivalentTo(4096), "n_batch must match the effective n_ctx the backend receives")
})
})
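
Reviewer note: the resolution order these specs pin down (explicit `batch:` first, then context-sized for single-pass usecases, then the 512 default) can be sketched without the LocalAI types. `modelCfg` and its `SinglePass` field are hypothetical stand-ins for `config.ModelConfig` and the `HasUsecases` checks; only the two default constants are taken from the diff.

```go
package main

import "fmt"

const (
	defaultContextSize = 4096
	defaultBatchSize   = 512
)

// modelCfg is a hypothetical stand-in for the fields the diff reads.
type modelCfg struct {
	ContextSize *int // nil means unset
	Batch       int  // 0 means unset
	SinglePass  bool // stands in for the score/embeddings/rerank usecase checks
}

// effectiveContextSize returns the configured window or the default.
func effectiveContextSize(c modelCfg) int {
	if c.ContextSize != nil {
		return *c.ContextSize
	}
	return defaultContextSize
}

// effectiveBatchSize mirrors the order in the diff: an explicit batch wins,
// single-pass models grow the batch to the context window, and everything
// else keeps the default.
func effectiveBatchSize(c modelCfg) int {
	if c.Batch != 0 {
		return c.Batch
	}
	if ctx := effectiveContextSize(c); c.SinglePass && ctx > defaultBatchSize {
		return ctx
	}
	return defaultBatchSize
}

func main() {
	small := 256
	fmt.Println(effectiveBatchSize(modelCfg{}))                                      // prints 512
	fmt.Println(effectiveBatchSize(modelCfg{SinglePass: true}))                      // prints 4096
	fmt.Println(effectiveBatchSize(modelCfg{SinglePass: true, ContextSize: &small})) // prints 512
	fmt.Println(effectiveBatchSize(modelCfg{SinglePass: true, Batch: 1024}))         // prints 1024
}
```

The batch is never lowered below the default: a single-pass model with a 256-token context still gets 512, which matches the "does not raise the batch" spec.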

View File

@@ -3,9 +3,10 @@ package backend
import (
"context"
"fmt"
"strings"
"time"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/trace"
"github.com/mudler/LocalAI/pkg/grpc"
"github.com/mudler/LocalAI/pkg/model"
@@ -39,34 +40,85 @@ func (s *localVectorStore) backend(_ context.Context) (grpc.Backend, error) {
return StoreBackend(s.loader, s.appConfig, s.storeName, "")
}
func (s *localVectorStore) Search(ctx context.Context, vec []float32) (float64, []byte, bool, error) {
be, err := s.backend(ctx)
if err != nil {
return 0, nil, false, fmt.Errorf("vector store load: %w", err)
func (s *localVectorStore) Search(ctx context.Context, vec []float32) (sim float64, payload []byte, ok bool, err error) {
start := time.Now()
outcome := "hit"
defer func() {
s.recordTrace(start, "search", len(vec), sim, outcome, err)
}()
be, berr := s.backend(ctx)
if berr != nil {
outcome = "backend_load_error"
return 0, nil, false, fmt.Errorf("vector store load: %w", berr)
}
_, values, similarities, err := store.Find(ctx, be, vec, 1)
if err != nil {
// local-store's Find returns "existing length is -1" before
// any keys are inserted. Surface that as a clean miss so the
// cache layer treats it as an empty store and proceeds to
// Insert rather than skipping.
if strings.Contains(err.Error(), "existing length is -1") {
return 0, nil, false, nil
}
return 0, nil, false, fmt.Errorf("vector store find: %w", err)
_, values, similarities, ferr := store.Find(ctx, be, vec, 1)
if ferr != nil {
outcome = "find_error"
return 0, nil, false, fmt.Errorf("vector store find: %w", ferr)
}
if len(values) == 0 || len(similarities) == 0 {
outcome = "miss"
return 0, nil, false, nil
}
return float64(similarities[0]), values[0], true, nil
}
func (s *localVectorStore) Insert(ctx context.Context, vec []float32, payload []byte) error {
be, err := s.backend(ctx)
if err != nil {
return fmt.Errorf("vector store load: %w", err)
func (s *localVectorStore) Insert(ctx context.Context, vec []float32, payload []byte) (err error) {
start := time.Now()
outcome := "ok"
defer func() {
s.recordTrace(start, "insert", len(vec), 0, outcome, err)
}()
be, berr := s.backend(ctx)
if berr != nil {
outcome = "backend_load_error"
return fmt.Errorf("vector store load: %w", berr)
}
return store.SetSingle(ctx, be, vec, payload)
if serr := store.SetSingle(ctx, be, vec, payload); serr != nil {
outcome = "insert_error"
return serr
}
return nil
}
// recordTrace surfaces vector-store calls in /api/backend-traces, including
// the backend-load-failure path that otherwise vanishes into an xlog.Warn.
// modelName uses the store namespace (e.g. "router-cache-smart-router") so
// admins can tell which router's cache misbehaved; the backend is always
// "local-store" and can't disambiguate.
func (s *localVectorStore) recordTrace(start time.Time, op string, vecDim int, sim float64, outcome string, err error) {
if s.appConfig == nil || !s.appConfig.EnableTracing {
return
}
trace.InitBackendTracingIfEnabled(s.appConfig.TracingMaxItems, s.appConfig.TracingMaxBodyBytes)
errStr := ""
if err != nil {
errStr = err.Error()
}
summary := op + " " + outcome
if op == "search" && outcome == "hit" {
summary = fmt.Sprintf("search hit (sim=%.3f)", sim)
}
data := map[string]any{
"op": op,
"outcome": outcome,
"vector_dim": vecDim,
}
// Only include similarity for a real neighbor — miss/empty_store would
// otherwise render "similarity: 0" and read as a measured value.
if op == "search" && outcome == "hit" {
data["similarity"] = sim
}
trace.RecordBackendTrace(trace.BackendTrace{
Timestamp: start,
Duration: time.Since(start),
Type: trace.BackendTraceVectorStore,
ModelName: s.storeName,
Backend: model.LocalStoreBackend,
Summary: summary,
Error: errStr,
Data: data,
})
}
func StoreBackend(sl *model.ModelLoader, appConfig *config.ApplicationConfig, storeName string, backend string) (grpc.Backend, error) {

View File

@@ -0,0 +1,88 @@
package backend
import (
"context"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/trace"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/LocalAI/pkg/system"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// findVectorStoreTrace returns the most recent vector_store trace whose
// model_name matches storeName, or nil if none was recorded. Used by
// the specs below to assert the trace landed without relying on
// ring-buffer ordering across other tests in the suite.
func findVectorStoreTrace(storeName string) *trace.BackendTrace {
traces := trace.GetBackendTraces()
for i := range traces {
bt := &traces[i]
if bt.Type == trace.BackendTraceVectorStore && bt.ModelName == storeName {
return bt
}
}
return nil
}
var _ = Describe("localVectorStore tracing", func() {
// Pin the trace surface admins read from /api/backend-traces.
// The original failure mode that motivated these specs — the
// local-store backend not installed — was silent on every surface
// except a per-call xlog.Warn. With tracing wired in, the row
// appears next to the embedder/score traces for the same request.
BeforeEach(func() {
trace.ClearBackendTraces()
})
It("records a vector_store trace with outcome=backend_load_error when the backend can't be loaded", func() {
// nil ModelLoader → s.backend → StoreBackend → panics on load.
// Use a real-but-empty loader so the failure surfaces as an
// error instead, exercising the load-failure trace path the
// admin would hit when local-store isn't installed.
appCfg := &config.ApplicationConfig{
EnableTracing: true,
TracingMaxItems: 16,
TracingMaxBodyBytes: 1024,
}
s := &localVectorStore{
loader: model.NewModelLoader(&system.SystemState{}),
appConfig: appCfg,
storeName: "router-cache-test",
}
// Search must surface the error AND record a trace describing it.
_, _, _, err := s.Search(context.Background(), []float32{0.1, 0.2, 0.3})
Expect(err).To(HaveOccurred())
Eventually(func() *trace.BackendTrace {
return findVectorStoreTrace("router-cache-test")
}).ShouldNot(BeNil())
bt := findVectorStoreTrace("router-cache-test")
Expect(bt.Backend).To(Equal(model.LocalStoreBackend))
Expect(bt.Data["op"]).To(Equal("search"))
Expect(bt.Data["outcome"]).To(Equal("backend_load_error"))
Expect(bt.Data["vector_dim"]).To(Equal(3))
// Error is the wrapped "vector store load: …" surfaced to the caller.
Expect(bt.Error).To(ContainSubstring("vector store load"))
})
It("does not record a trace when tracing is disabled", func() {
// Opt-out path: appConfig.EnableTracing=false must short-circuit
// before InitBackendTracingIfEnabled, so a workload with tracing
// turned off doesn't pay the channel-send cost per cache call.
appCfg := &config.ApplicationConfig{EnableTracing: false}
s := &localVectorStore{
loader: model.NewModelLoader(&system.SystemState{}),
appConfig: appCfg,
storeName: "router-cache-disabled",
}
_, _, _, _ = s.Search(context.Background(), []float32{1})
Consistently(func() *trace.BackendTrace {
return findVectorStoreTrace("router-cache-disabled")
}).Should(BeNil())
})
})
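
Reviewer note: the traced `Search` and `Insert` rely on named results so that a single deferred closure sees the final error and outcome on every return path. A minimal sketch of that pattern follows; `record`, `recorded` and `lookup` are hypothetical names, and the in-memory slice stands in for `trace.RecordBackendTrace`.

```go
package main

import (
	"errors"
	"fmt"
)

// record is a hypothetical trace row.
type record struct {
	Op      string
	Outcome string
	Err     string
}

// recorded collects rows in memory for the demo.
var recorded []record

// lookup mimics the shape of the traced Search: err is a named result, so the
// deferred closure reads whatever value the function finally returns.
func lookup(fail bool) (val string, err error) {
	outcome := "hit"
	defer func() {
		r := record{Op: "search", Outcome: outcome}
		if err != nil {
			r.Err = err.Error()
		}
		recorded = append(recorded, r)
	}()

	if fail {
		outcome = "backend_load_error"
		return "", fmt.Errorf("vector store load: %w", errors.New("backend missing"))
	}
	return "payload", nil
}

func main() {
	_, _ = lookup(true)
	_, _ = lookup(false)
	for _, r := range recorded {
		fmt.Printf("%s %s %q\n", r.Op, r.Outcome, r.Err)
	}
}
```

A `return` statement with explicit values assigns them to the named results before deferred functions run, which is why the closure observes the wrapped error rather than nil.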

View File

@@ -7,9 +7,23 @@ import (
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/core/trace"
"github.com/mudler/LocalAI/pkg/grpc"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/model"
)
// tokenizeTokenCount returns the number of tokens in a backend response,
// treating a nil response as zero. The gRPC client returns (nil, err) on
// failure, and the tracing block below runs before that error is returned —
// so the count must be read nil-safely here. Reading resp.Tokens on a nil
// resp previously panicked the whole HTTP handler when tracing was enabled
// (e.g. a transient tokenize failure during router probe-budget sizing).
func tokenizeTokenCount(resp *pb.TokenizationResponse) int {
if resp == nil {
return 0
}
return len(resp.Tokens)
}
func ModelTokenize(s string, loader *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) (schema.TokenizeResponse, error) {
var inferenceModel grpc.Backend
@@ -40,10 +54,7 @@ func ModelTokenize(s string, loader *model.ModelLoader, modelConfig config.Model
errStr = err.Error()
}
tokenCount := 0
if resp.Tokens != nil {
tokenCount = len(resp.Tokens)
}
tokenCount := tokenizeTokenCount(resp)
trace.RecordBackendTrace(trace.BackendTrace{
Timestamp: startTime,
@@ -64,8 +75,8 @@ func ModelTokenize(s string, loader *model.ModelLoader, modelConfig config.Model
return schema.TokenizeResponse{}, err
}
if resp.Tokens == nil {
resp.Tokens = make([]int32, 0)
if resp == nil || resp.Tokens == nil {
return schema.TokenizeResponse{Tokens: make([]int32, 0)}, nil
}
return schema.TokenizeResponse{

View File

@@ -0,0 +1,27 @@
package backend
import (
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("tokenizeTokenCount", func() {
// Regression: the gRPC client returns (nil, err) when a tokenize call
// fails, and ModelTokenize's tracing block reads the token count before
// the error is returned. Dereferencing a nil response there panicked the
// HTTP handler (nil pointer dereference) — e.g. a transient tokenize
// failure while the router sized its probe-token budget.
It("returns zero for a nil response instead of panicking", func() {
Expect(tokenizeTokenCount(nil)).To(Equal(0))
})
It("returns zero when the response carries no tokens", func() {
Expect(tokenizeTokenCount(&pb.TokenizationResponse{})).To(Equal(0))
})
It("counts the tokens present on the response", func() {
Expect(tokenizeTokenCount(&pb.TokenizationResponse{Tokens: []int32{1, 2, 3}})).To(Equal(3))
})
})
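
Reviewer note: the regression these specs guard against comes from reading a field on a response that is nil because the call failed. A minimal sketch of the failure-then-trace order, assuming a stand-in response type; `tokenizationResponse`, `fakeTokenize` and `tracedCount` are hypothetical and do not use the generated `pb` package.

```go
package main

import (
	"errors"
	"fmt"
)

// tokenizationResponse is a hypothetical stand-in for the gRPC response type.
type tokenizationResponse struct {
	Tokens []int32
}

// tokenCount reads the token count nil-safely, like tokenizeTokenCount in the diff.
func tokenCount(resp *tokenizationResponse) int {
	if resp == nil {
		return 0
	}
	return len(resp.Tokens)
}

// fakeTokenize imitates a client that returns (nil, err) on failure.
func fakeTokenize(fail bool) (*tokenizationResponse, error) {
	if fail {
		return nil, errors.New("transient tokenize failure")
	}
	return &tokenizationResponse{Tokens: []int32{1, 2, 3}}, nil
}

// tracedCount computes the count before checking the error, which is the
// order the tracing block in ModelTokenize uses.
func tracedCount(fail bool) (int, error) {
	resp, err := fakeTokenize(fail)
	n := tokenCount(resp) // safe even when resp is nil
	return n, err
}

func main() {
	n, err := tracedCount(true)
	fmt.Println(n, err) // prints: 0 transient tokenize failure
	n, err = tracedCount(false)
	fmt.Println(n, err) // prints: 3 <nil>
}
```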

Some files were not shown because too many files have changed in this diff.