Compare commits

...

44 Commits

Author SHA1 Message Date
Ettore Di Giacinto
a51580fd79 fix(agents): make React agent chat timestamps format-agnostic
The agent SSE bridge emits the json_message timestamp in three different
encodings depending on deploy mode: an RFC3339 string (standalone agent
pool), Unix milliseconds (local dispatcher), and Unix nanoseconds (the
older NATS path). The React AgentChat handler passed data.timestamp
straight through, so the standalone string and any numeric value outside
the millisecond range rendered as "Invalid Timestamp" or a constant
epoch-ish time.

Add a small pure helper, normalizeTimestampMs, that accepts an RFC3339
string or a numeric epoch in s/ms/us/ns and returns JS milliseconds,
falling back to Date.now() on null/empty/unparseable input. Use it in
the json_message handler so the rendered time is correct regardless of
which backend path produced it.

Fixes #9867

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-12 22:07:22 +00:00
Aniruddh Jha
51f4f67c47 fix(agents): emit chat event timestamps in milliseconds (#9867) (#10243)
Agent chat replies rendered a broken timestamp in the web UI
("Invalid Timestamp" / "12:00 AM", identical for every reply) because
the SSE timestamp unit was inconsistent across producers.

EventBridge.PublishEvent emitted Unix nanoseconds while the local
dispatcher (dispatcher.go) already emitted Unix milliseconds, and the
React UI fed the value straight into `new Date(ts)` after dividing by
1e6. Nanoseconds also overflow JS's safe-integer range (~1.7e18).

Standardize on Unix milliseconds: switch PublishEvent to UnixMilli and
drop the /1e6 conversion in AgentChat.jsx so both SSE paths agree and
match the React UI's expectation. Add a regression test asserting the
published timestamp is in milliseconds.
2026-06-12 23:18:44 +02:00
LocalAI [bot]
cf71e291b4 fix(darwin): fix vibevoice-cpp build linkage + fail-safe go backend packaging (#10276)
* fix(darwin): never package a go backend build tree as a working image

The darwin/arm64 vibevoice-cpp image shipped the source tree with a
half-built CMake directory (build-libgovibevoicecpp-fallback.so/) and no
backend binary, so the backend could never start: run.sh exec'd a
vibevoice-cpp binary that was not in the package and LocalAI timed out
waiting for the gRPC service.

Two durable, backend-agnostic defenses:

- backend/go/vibevoice-cpp/Makefile: mirror whisper's cleanup discipline so a
  partial CMake tree cannot survive into packaging. Run `make purge` before
  each variant build and `rm -rfv build*` after. The old recipe only removed
  its build dir after a successful `mv`, so a failed build left the half-built
  tree behind.

- scripts/build/golang-darwin.sh: before creating the OCI image, remove any
  stray build-* directory and assert that the binary run.sh launches actually
  exists. A build that produced no binary now fails the job loudly instead of
  publishing a source tree as a working backend. The binary name is derived
  from run.sh's `exec $CURDIR/<binary>` line (parakeet-cpp launches
  parakeet-cpp-grpc, so it is not always ${BACKEND}) with a ${BACKEND}
  fallback.

The underlying native build failure that left vibevoice-cpp half-built still
needs to be reproduced and fixed on Apple Silicon; this change ensures such a
failure can never again be published as a working image.

Refs #10267

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(vibevoice-cpp): build libvibevoice.a on darwin (link target, not path)

The darwin build failed with:

    No rule to make target 'vibevoice/libvibevoice.a', needed by
    'libgovibevoicecpp.so'.  Stop.

The upstream vibevoice project is added with add_subdirectory(... EXCLUDE_FROM_ALL),
so its `vibevoice` static-library target is only built when something links it
as a target. The Apple branch linked only `$<TARGET_FILE:vibevoice>` - a bare
archive path with no target reference - so CMake never emitted a rule to build
libvibevoice.a, while the Linux branch worked because it passes the `vibevoice`
target name inside the --whole-archive flags.

Link the `vibevoice` target on Apple (establishing the build dependency) and
apply -force_load as a separate link option to keep whole-archive semantics so
purego can dlsym the vv_capi_* symbols.

Refs #10267

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-12 23:13:50 +02:00
LocalAI [bot]
a7a7bd646b fix(mlx): route vision-language models to the mlx-vlm backend (#10274)
Vision-language checkpoints such as mlx-community/gemma-4-E4B-it-qat-4bit
declare the "image-text-to-text" pipeline tag on HuggingFace. The mlx
importer hardcoded backend "mlx" for every mlx-community model, so these
VLMs were served by the text-only mlx-lm backend whose tokenizer does not
carry the processor chat template. The template was never applied and the
model produced degenerate, looping output that echoed the prompt.

Detect the "image-text-to-text" pipeline tag in the importer and route those
models to mlx-vlm, which applies the processor-aware chat template. An
explicit backend preference still wins.

As a defensive backstop, the mlx backend now warns loudly when the loaded
model has no chat template, so a misrouted VLM surfaces the problem instead
of silently looping.

Fixes #10269


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-12 23:12:42 +02:00
LocalAI [bot]
cec93d2e00 docs: ⬆️ update docs version mudler/LocalAI (#10279)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-12 23:12:30 +02:00
LocalAI [bot]
722bdb87e9 chore: ⬆️ Update mudler/parakeet.cpp to b8012f11e5269126eddb7f4fd02f891a2ccc29b0 (#10281)
* ⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(parakeet-cpp): close streaming segments on <EOB> after ABI v5 eou/eob split

parakeet.cpp ABI v5 (the pin this PR bumps to) splits the streaming JSON
"eou" flag: in v4 "eou":1 fired for either <EOU> (end of utterance) or
<EOB> (backchannel); in v5 "eou" means <EOU> only, with a new separate
"eob" field for the backchannel token.

The streamSegmenter closed a segment on "eou" alone, so after the bump a
backchannel token would silently stop ending a segment and merge into the
next utterance. Read the new "eob" field and flush on either signal to
preserve the v4 segmentation boundaries. The flat stream_feed eou_out path
is unaffected: its mask is still non-zero for either event.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-12 23:12:04 +02:00
LocalAI [bot]
50dea8c983 feat(crispasr): bundle espeak-ng and add piper TTS voices to the gallery (#10283)
CrispASR's piper backend phonemizes non-English text via espeak-ng (dlopen,
the MIT-clean path; English uses a built-in G2P). The FROM scratch crispasr
image shipped none of it, so non-English piper voices loaded but failed
synthesis with "phonemization failed". Bundle the espeak-ng runtime so they
work:

- Dockerfile.golang: install espeak-ng-data + libespeak-ng1 and its libpcaudio0
  / libsonic0 deps in the crispasr builder (espeak's dlopen fails without the
  latter two).
- package.sh: copy libespeak-ng.so.1, libpcaudio.so.0, libsonic.so.0 into
  package/lib/ and the espeak-ng-data dir into the package root.
- run.sh: export CRISPASR_ESPEAK_DATA_PATH so the bundled data is found.

Add 9 single-speaker piper voices (de/en/it, incl. Italian paola + riccardo) to
the gallery, run through backend:piper, hosted at
LocalAI-Community/piper-voices-GGUF (converted from rhasspy/piper-voices with
CrispASR's convert-piper-to-gguf.py). Only single-speaker low/medium voices are
included; the engine does not yet support multi-speaker or high-quality piper
decoders.

All 9 verified end-to-end: each synthesizes a WAV at the model's native sample
rate using only the image-bundled espeak payload.


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-12 23:10:30 +02:00
LocalAI [bot]
46ba70632b fix(crispasr): write piper TTS WAV at the model's native sample rate (#10277)
CrispASR's piper backend returns PCM at the voice's native rate (from the GGUF
piper.sample_rate key: 16 kHz for x_low/low, 22.05 kHz for medium/high) and does
not resample, but the Go WAV encoder hardcoded 24000 Hz. Every piper voice was
therefore written with a wrong header and played back at the wrong pitch/speed.

Read piper.sample_rate from the model's GGUF metadata at Load via the vendored
gguf-parser-go and use it for the WAV header, falling back to the 24 kHz default
for the other CrispASR TTS engines (vibevoice/orpheus/chatterbox/qwen3-tts) that
emit 24 kHz and carry no such key.

Adds unit specs (minimal crafted GGUFs + WAV-header decode) and an env-gated
end-to-end spec (CRISPASR_PIPER_MODEL_PATH). Verified e2e: en_GB-cori-medium
synthesizes a 22050 Hz WAV through backend:piper.


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-12 23:10:17 +02:00
LocalAI [bot]
60facc7252 fix(darwin): publish sherpa-onnx and speaker-recognition images for darwin/arm64 (#10275)
Neither the sherpa-onnx nor the speaker-recognition backend had a
darwin/arm64 image, so `local-ai backends install` failed with "no child
with platform darwin/arm64" on macOS. This left /v1/audio/diarization (the
sherpa-onnx path) and /v1/voice/embed without any usable backend on Apple
Silicon.

Both backends build on darwin/arm64:
- sherpa-onnx (Go) already fetches the onnxruntime osx-arm64 runtime in its
  Makefile; it only needed a darwin matrix entry (build-type metal, lang go,
  like whisper and silero-vad).
- speaker-recognition (Python) needed a requirements-mps.txt so the mps build
  installs plain onnxruntime (which ships a macOS arm64 wheel) instead of the
  onnxruntime-gpu pulled by its base requirements (which does not).

Add both to the includeDarwin build matrix, wire the metal capability and
metal image aliases into the gallery, and add the speaker-recognition
requirements-mps.txt.

Fixes #10268


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-12 22:32:42 +02:00
LocalAI [bot]
8c8204d3c4 feat(parakeet-cpp): enable GGML_CUDA_GRAPHS in the cublas build (#10273)
ggml leaves GGML_CUDA_GRAPHS off by default. Passing -DGGML_CUDA_GRAPHS=ON
for cublas builds lets the CUDA backend capture and replay the compute
graph for a small free speedup (about 1% measured on a GB10, never
negative). It is not gated by parakeet.cpp's CMake options, so it passes
straight through to ggml.

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-12 18:47:36 +02:00
LocalAI [bot]
4ce0f6102a chore(model gallery): 🤖 add 1 new models via gallery agent (#10270)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-12 16:21:35 +02:00
Richard Palethorpe
085fc53bbc fix(router): production-ready request router + auto-size batch for embedding/rerank (#10104)
* fix(router): score classifier production-readiness

Conversation trimming runs through the classifier model's chat template
and trims by exact token count, sized to the model's n_batch which is
now scaled to context so long probes can't crash the backend. Missing
chat_message templates are a hard error at router build time. Router-
facing factories (Embedder/Scorer/Reranker/TokenCounter) re-resolve
ModelConfig per call so a model installed post-startup doesn't bind a
stub Backend="" config and silently fall into the loader's auto-
iterate path.

New 'vector_store' backend trace recorded inside localVectorStore on
every Search/Insert — including the backend-load-failure path that
previously vanished into an xlog.Warn — with outcome tagging
(hit/miss/empty_store/backend_load_error/find_error/insert_error/ok).
Companion cleanup drops misleading similarity:0 and input_tokens_count:0
from non-hit and text-mode traces.

Gallery local-store-development aliases to 'local-store' so the master
image satisfies pkg/model.LocalStoreBackend lookups from the embedding
cache.

Misc: llama-cpp TokenizeString reads the correct 'prompt' JSON key
(the original bug); ModelTokenize nil-guard; non-fatal mitm proxy
startup; PII 'route_local' renamed to 'allow' with docs/UI in sync;
model-editor footer no longer eats the edit area on small screens;
several config-editor template/dropdown/section fixes.

Tests: e2e router specs (casual/code-hint + long-conversation trim),
vector_store trace specs, lazy-factory specs, gallery dev-alias
resolution, Playwright trace badge + scroll regression.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(backend): auto-size batch to context for embedding and rerank models

Embedding and rerank models pool over the whole input in a single physical batch (n_ubatch). With batch left at the 512 default, the backend rejects longer inputs with "input is too large to process", silently capping a large-context embedder (e.g. 8k/32k) at 512 tokens. Size n_batch to the context for these single-pass usecases, mirroring the existing FLAG_SCORE behaviour; an explicit batch: still wins.

Extracts EffectiveContextSize/EffectiveBatchSize from grpcModelOpts so the effective decode window has one home for other callers to reuse.

Adds an e2e-aio regression test that embeds a >512-token input. The AIO embedding model is switched to nomic-embed-text-v1.5 (2048 context) because the previous granite model was capped at 512 tokens and could not exercise the larger batch.

Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(gallery): raise arch-router scoring output cap via parallel:64

Scoring decodes the whole prompt+candidate in a single llama_decode and
reads one logit row per candidate token. The vendored llama.cpp server
caps causal output rows at n_parallel, so the default of 1 aborts with
GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) on multi-token route
labels. Set options: [parallel:64] on both arch-router quant entries to
lift the cap; kv_unified (the grpc-server default) keeps the full context
per sequence, so this does not split the KV cache.

Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-12 16:21:15 +02:00
LocalAI [bot]
56cc4f63fc feat(backend): locate-anything-cpp (open-vocabulary object detection via ggml) (#10264)
* feat(backend): add locate-anything-cpp backend (open-vocab detection via la_capi)

A Go/purego backend wrapping locate-anything.cpp's la_capi C ABI, implementing
the gRPC Detect RPC: image + open-vocabulary text prompt -> labeled boxes.
Mirrors backend/go/rfdetr-cpp; static-links ggml into a per-CPU-variant .so.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(backend): register locate-anything-cpp in build matrix

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): locate-anything gallery entry + model importer

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(backend): locate-anything-cpp Load+Detect wire test

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add locate-anything-3b model to the gallery index

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(backend): register locate-anything.cpp in bump_deps auto-bump

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: mudler <mudler@localai.io>

* ci(test): e2e smoke for locate-anything-cpp in test-extra (loads the 3B + image, runs Detect)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: mudler <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: mudler <mudler@localai.io>
Co-authored-by: mudler <mudler@localai.io>
2026-06-12 14:59:07 +02:00
LocalAI [bot]
a53f34e78f chore: ⬆️ Update ggml-org/llama.cpp to 4c6595503fe45d5a39f88d194e270f64c7424677 (#10261)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-12 14:57:52 +02:00
Dedy F. Setyawan
1cea96f09f feat(react-ui): add Indonesian language support (#10266)
Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com>
2026-06-12 10:08:58 +02:00
LocalAI [bot]
006a9d38c7 chore: ⬆️ Update mudler/parakeet.cpp to 9db92be63179a27201d3b88d5d40c545b2ac48ae (#10263)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-12 09:18:21 +02:00
LocalAI [bot]
892ce951ce chore: ⬆️ Update antirez/ds4 to d881f2a05e8ff6bec001315a36b794b4aa310173 (#10262)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-12 09:18:07 +02:00
LocalAI [bot]
7cda221d36 docs: ⬆️ update docs version mudler/LocalAI (#10259)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-12 09:17:49 +02:00
LocalAI [bot]
9a88eb81e7 chore: ⬆️ Update CrispStrobe/CrispASR to d745bda4386ae0f9d1d2f23fff8ec95d76428221 (#10260)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-12 09:17:34 +02:00
pos-ei-don
58cdc050e9 fix(cuda): install cuda-nvrtc-dev alongside the other CUDA dev packages (#10257)
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
2026-06-11 23:57:00 +02:00
pos-ei-don
b962f4a192 fix(vllm): parse tool_call function arguments before applying the chat template (#10256)
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
2026-06-11 23:55:38 +02:00
LocalAI [bot]
b6fcb3e1db chore: ⬆️ Update CrispStrobe/CrispASR to 4b27392ffd0991a857594652cbb8b57e585bcd7b (#10241)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-11 18:33:58 +02:00
LocalAI [bot]
ff09683d84 chore: ⬆️ Update ggml-org/llama.cpp to ac4cddeb0dbd778f650bf568f6f08344a06abe3a (#10239)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-11 18:33:38 +02:00
LocalAI [bot]
f618636c71 docs: fix broken relref to realtime page (#10255)
Hugo fails the gh-pages build with REF_NOT_FOUND because the relref
in model-configuration.md uses the 'docs/' prefix; refs are resolved
relative to content/, so the page lives at 'features/openai-realtime'
(as the other ref in the same file already uses).


Assisted-by: Claude Code:claude-fable-5

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-11 18:32:50 +02:00
LocalAI [bot]
892fc49949 feat(realtime): stream the LLM / TTS / transcription pipeline stages (#10176)
* feat(realtime): pipeline streaming + disable_thinking config

Add a nested pipeline.streaming.{llm,tts,transcription} block plus
pipeline.disable_thinking, with StreamLLM/StreamTTS/StreamTranscription/
ThinkingDisabled helpers. Pointer-bools so unset keeps the unary path;
existing configs are unaffected. Wiring into the realtime handler follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): sentence segmenter for streamed LLM->TTS pipelining

streamSegmenter accumulates streamed LLM tokens and emits complete
sentence/clause segments (terminator+whitespace, or newline) so TTS can
synthesize each segment as it completes instead of waiting for the whole
reply. Pure helper; the streaming handler wiring consumes it next.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): streaming TTS/transcription methods on Model interface

Add TTSStream and TranscribeStream to the realtime Model interface and
implement them on wrappedModel (delegating to backend.ModelTTSStream /
ModelTranscriptionStream) and transcriptOnlyModel. ttsStream adapts the
backend's WAV-framed stream (44-byte header carrying the sample rate, then
PCM) into raw PCM + sample rate for the realtime transports. Handler wiring
that consumes these (flag-gated) follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): emitSpeech with flag-gated streaming TTS

emitSpeech synthesizes a piece of text and forwards audio to the client,
streaming one output_audio.delta per backend PCM chunk when the pipeline
sets streaming.tts, or one delta for the whole utterance otherwise. WebRTC
gets raw PCM (it resamples internally); WebSocket gets base64 PCM at the
session rate. It emits no transcript/audio-done events so a streamed reply
can be split into multiple spoken segments sharing one response.

Adds fakeModel/fakeTransport test doubles for the realtime Model/Transport
interfaces, driving streaming assertions deterministically.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): route response audio through emitSpeech (streaming TTS)

Replace the inline unary TTS block in the response handler with emitSpeech,
which streams a response.output_audio.delta per backend PCM chunk when
pipeline.streaming.tts is set and otherwise preserves the single-delta unary
behaviour. emitSpeech returns the accumulated base64 audio, stored on the
conversation item as before. Transcript and audio-done events stay in the
handler so later per-segment streaming can reuse emitSpeech.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): streaming transcription text deltas

Add emitTranscription and route commitUtterance through it. With
pipeline.streaming.transcription set it streams each transcript fragment as
a conversation.item.input_audio_transcription.delta via TranscribeStream
then a completed event; otherwise it preserves the single completed-event
unary behaviour. Returns the final transcript for response generation.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): pipeline disable_thinking maps to enable_thinking off

applyPipelineThinking forces the LLM's ReasoningConfig.DisableReasoning when
pipeline.disable_thinking is set, which gRPCPredictOpts turns into the
enable_thinking=false backend metadata. Applied at newModel construction on
the per-session LLM config copy, so it doesn't leak to other model users and
needs no realtime-specific request plumbing.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): speechStreamer for token-streamed LLM->TTS

emitSpeech now returns raw PCM (caller base64-encodes) so streamed segments
accumulate correctly. speechStreamer consumes streamed LLM tokens: it strips
reasoning via the streaming ReasoningExtractor, emits a transcript delta per
content fragment, and sentence-pipes content into emitSpeech so each sentence
is synthesized as soon as it's ready. Handler wiring (plain-content turns)
follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): wire streamLLMResponse for token-streamed replies

triggerResponseAtTurn takes a streamed path when pipeline.streaming.llm is
set, the turn has no tools, and audio is requested: streamLLMResponse
announces the assistant item, drives the LLM token callback through a
speechStreamer (reasoning-stripped transcript deltas + sentence-piped TTS),
and emits the terminal events. Tool turns and non-streaming pipelines keep
the existing buffered path unchanged, so this is strictly opt-in.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(realtime): document pipeline streaming + disable_thinking

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(realtime): register pipeline streaming/thinking config fields

TestAllFieldsHaveRegistryEntries (core/config/meta) requires every config
field to have a meta registry entry. The four new pipeline fields
(disable_thinking, streaming.{llm,tts,transcription}) had none, failing
tests-linux/tests-apple. Add toggle entries for them.

Also handle the os.Remove return in realtime_speech_test.go to satisfy
errcheck (golangci-lint).

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(realtime): always strip reasoning from spoken output

disable_thinking maps to ReasoningConfig.DisableReasoning=true on the LLM
config, which the backend reads as enable_thinking=false. But the realtime
handler reads that SAME config to drive reasoning extraction, and there
DisableReasoning=true means "skip stripping". PredictConfig() returns this
LLM config, so both the streamed (speechStreamer) and buffered realtime
paths stopped stripping <think>…</think> exactly when disable_thinking was
on — leaking raw reasoning to the client whenever the model ignored the
enable_thinking hint (e.g. lfm2.5).

Add spokenReasoningConfig() which clears DisableReasoning for extraction
(keeping custom tokens/tag pairs) and route both realtime paths through it.
Spoken output now always strips reasoning, independent of the backend
suppression hint.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(realtime): clean TTS temp path before read (gosec G304)

emitSpeech reads the WAV file the TTS backend wrote. The read moved here
from realtime.go, so code-scanning flagged it as a new G304 alert even
though the path is backend-controlled (a temp file), not user input.
Wrap it in filepath.Clean — a real path normalization that also clears
the alert, keeping with the repo's no-#nosec convention.

Assisted-by: Claude:claude-opus-4-8 gosec, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(realtime): buffer whole message for TTS, drop sentence segmenter

Per review (richiejp): the sentence segmenter pipelined unary TTS by
splitting on ASCII .!?/newline, which does nothing for languages without
those boundaries (CJK/Thai) — there it already degraded to buffering the
whole message anyway.

Replace it with a uniform model: stream the LLM transcript live, buffer the
full message, then synthesize it once. emitSpeech already streams the audio
chunks when the backend implements TTSStream and falls back to a single
unary delta otherwise, so this is real streaming TTS where supported and a
clean whole-message synthesis elsewhere — no per-sentence emulation, no
language assumptions. speechStreamer becomes transcriptStreamer (transcript
deltas only); the whole-message synthesis moves into streamLLMResponse.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): stream tool-call turns via tokenizer-template autoparser

Per review (richiejp): tool-call deltas exist, so streaming should work with
tools too. It does — for models that use their tokenizer template. The C++
autoparser then clears reply.Message and delivers content + tool calls via
ChatDeltas, so the streamed transcript carries only spoken content (no
tool-call JSON leak) and the tool calls are parsed from the final response.

- Drop the len(tools)==0 gate; stream when no tools OR use_tokenizer_template
  (grammar-based function calling still buffers, since its call is emitted as
  JSON in the token stream and would leak into the transcript).
- streamLLMResponse takes tools/toolChoice/toolTurn, reads ChatDelta content
  in the token callback, parses tool calls from the final ChatDeltas, and
  creates the assistant content item lazily so a content-less tool turn emits
  only the tool calls.
- Extract emitToolCallItems from the buffered path so both paths finalize tool
  calls, response.done, and server-side assistant-tool follow-ups identically.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): script-aware clause chunking + streamed-reply fixes

Opt-in pipeline.streaming.clause_chunking splits the streamed LLM reply
into speakable clauses and synthesizes each as soon as it completes,
lowering time-to-first-audio instead of buffering the whole message. The
splitter is script-aware (rivo/uniseg, pure Go): UAX#29 sentence
segmentation handles CJK 。!? with no whitespace, CJK clause
punctuation (,、;:) and Thai/Lao spaces give finer cuts, and a UAX#14
line-break cap bounds an over-long punctuation-less run. Unlike the old
ASCII .!?/newline segmenter (dropped in 076dcdbe) it does not degrade to
whole-message buffering for CJK/Thai; scripts needing a dictionary
(Khmer/Burmese) stay buffered until a space or end-of-message. Clauses
are synthesized synchronously in the token callback (the LLM keeps
generating into the gRPC stream meanwhile), so audio still starts
mid-generation. Off by default — the whole-message path is unchanged.

Also fix the streamed-reply path and the Talk page:

- Don't swallow streamed autoparser content as reasoning: the
  tokenizer-template path already delivers reasoning-free content via
  ChatDeltas, so prefilling the thinking start token re-tagged it as an
  unclosed reasoning block, leaving no spoken reply. Disable the prefill
  on that path; closed tag pairs are still stripped (#9985).

- Generate collision-free realtime IDs (16 random bytes) instead of a
  constant, so per-item bookkeeping (cancel, conversation.item.retrieve)
  works.

- Key the Talk transcript by the server item_id and upsert entries.
  Realtime events arrive over a WebRTC data channel — outside React's
  event system — so React defers the setTranscript updaters while
  synchronous ref writes in handler bodies run first; the old
  index-tracking ref rendered a duplicate assistant bubble on
  completion. Upserts by item_id are idempotent and order-independent.

- Drop the partial assistant bubble on a cancelled response (barge-in):
  the server discards the interrupted item and sends response.done with
  status "cancelled"; mirror that in the UI so the regenerated reply
  isn't rendered as a second assistant message.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Assisted-by: Claude:claude-fable-5 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Richard Palethorpe <io@richiejp.com>
2026-06-11 08:43:12 +01:00
pos-ei-don
228a6dfe79 fix(vllm): restore compatibility with vLLM >= 0.22 (get_tokenizer moved to vllm.tokenizers) (#10252)
fix(vllm): restore compatibility with vLLM >= 0.22 (get_tokenizer moved)

vLLM 0.22 moved get_tokenizer from vllm.transformers_utils.tokenizer
to vllm.tokenizers. Since the backend requirements install vllm
unpinned, freshly built/installed vllm backends currently fail to
start with ModuleNotFoundError: No module named
'vllm.transformers_utils.tokenizer' (surfacing as 'grpc service not
ready' when loading a model).

Use the same try/except version-compat import pattern already used
elsewhere in this file: try the new vllm.tokenizers location first and
fall back to the pre-0.22 path.

Tested on a DGX Spark (GB10, ARM64) with the
cuda13-nvidia-l4t-arm64-vllm backend and vllm 0.22.0: model load, chat
completions and tool calls all work with this patch applied.

Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 09:05:23 +02:00
LocalAI [bot]
51a92b6093 chore: ⬆️ Update antirez/ds4 to 8384adf0f9fa0f3bb342dd925372de778b95b263 (#10242)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-11 00:10:34 +02:00
LocalAI [bot]
b5964d385d docs: ⬆️ update docs version mudler/LocalAI (#10245)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-11 00:10:10 +02:00
LocalAI [bot]
fba8c9c498 fix(distributed): track in-flight for non-LLM inference methods (VAD, diarize, voice, ...) (#10238)
fix(distributed): track in-flight for non-LLM inference methods

InFlightTrackingClient only wrapped a subset of the grpc.Backend
inference methods (Predict, Embeddings, TTS, AudioTranscription, Detect,
Rerank, ...). Methods like VAD were left as embedded passthrough, so
track() never ran for them.

In distributed mode every model is loaded with in_flight=1 as a
reservation; that reservation is only released by the OnFirstComplete
callback, which fires after the first *tracked* inference call completes.
A VAD-only model (e.g. silero-vad) never calls a tracked method, so the
reservation is never released and in-flight stays pinned at 1 forever -
which also blocks the router's idle-eviction logic.

Wrap the remaining unary inference methods (VAD, Diarize, Face*, Voice*,
TokenClassify, Score, AudioEncode, AudioDecode, AudioTransform) with the
same track()/reconcile() pattern. The three bidi-stream constructors
(AudioTransformStream, AudioToAudioStream, Forward) are deliberately left
as passthrough - their inference spans the stream lifetime, not the
constructor call, so track() there would fire onFirstComplete before any
data flows.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:29:50 +02:00
LocalAI [bot]
6b2badb837 chore: ⬆️ Update CrispStrobe/CrispASR to c29f6653a516a3001d923944dad8892072cc7334 (#10236)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 16:16:24 +02:00
LocalAI [bot]
8b8506d01a chore: ⬆️ Update ggml-org/llama.cpp to 039e20a2db9e87b2477c76cc04905f3e1acad77f (#10223)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 12:22:03 +02:00
LocalAI [bot]
6910a0bb48 chore: ⬆️ Update antirez/ds4 to 91bafb5acd5a6cf00b1e55ef68bf40ddd207bee7 (#10234)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 12:08:19 +02:00
LocalAI [bot]
cffd03b522 chore: ⬆️ Update ikawrakow/ik_llama.cpp to e6f8112f3ba126eed3ff5b30cdd08085414a7516 (#10233)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 12:07:49 +02:00
LocalAI [bot]
bf448d3794 chore: ⬆️ Update ggml-org/whisper.cpp to df7638d8229a243af8a4b5a8ae557e0d74e0a0ae (#10220)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:16:29 +02:00
LocalAI [bot]
1d4a12f7c0 chore: ⬆️ Update CrispStrobe/CrispASR to 97cad527d247edefc904e6c40c4cf5ee78bed055 (#10221)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:16:17 +02:00
LocalAI [bot]
186d62801d chore: ⬆️ Update leejet/stable-diffusion.cpp to 19bdfe22d255d5b4dff39d449318b9bc5ea2317f (#10222)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:16:06 +02:00
LocalAI [bot]
da4ed05429 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 2768b6251548b78b6610e95edad13f888ad95982 (#10219)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:15:54 +02:00
LocalAI [bot]
ec1eea4f45 chore: ⬆️ Update antirez/ds4 to 512d07cb08f234b704b5a5959aa9e2d4c466eeb0 (#10224)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-10 01:15:42 +02:00
LocalAI [bot]
b203b32e57 feat(realtime): make WebRTC ICE candidates configurable (#10231)
The /v1/realtime WebRTC handler created the peer connection with a bare
webrtc.Configuration and no SettingEngine, so pion gathered a host ICE
candidate for every local interface. Under Docker host networking that
includes bridge addresses (docker0/veth, 172.x) a remote browser cannot
route to; the call establishes on a good pair and then drops once ICE
consent freshness checks fail on the unreachable candidates.

Add two opt-in knobs, applied via a pion SettingEngine:
- LOCALAI_WEBRTC_NAT_1TO1_IPS: advertise these IPs as the host candidates
  (e.g. the host LAN IP)
- LOCALAI_WEBRTC_ICE_INTERFACES: restrict ICE gathering to these interfaces

Defaults are unchanged (empty => current all-interface behavior).

Assisted-by: Claude:claude-opus-4-8

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-09 22:28:03 +02:00
Ching
48a8ce98aa fix(cli): handle chat output errors (#10229)
Propagate terminal write errors from the chat prompt and explicitly ignore stream close errors during cleanup.

Update chat tests to assert response writer errors so errcheck passes without hiding failed writes.

Tests:
- go test -count=1 ./core/cli/chat
- go test -count=1 ./core/cli

Assisted-by: Codex:GPT-5

Signed-off-by: Ching Kao <0980124jim@gmail.com>
2026-06-09 19:10:24 +02:00
Ching
8344d1c865 feat(cli): add interactive chat mode (#10226)
Add an opt-in `local-ai chat` command for testing chat models directly from the terminal without manually sending curl requests.

The command connects to a running LocalAI server, lists available models through the existing OpenAI-compatible API, streams chat completions, and supports interactive commands such as `/models`, `/model`, `/clear`, and `/exit`.

Keep `local-ai run` focused on the server lifecycle so the web UI, API clients, and multiple chat terminals can coexist against the same server.

Document the new command and terminal workflow in the README and CLI docs.

Tests:
- go test -count=1 ./core/cli/chat
- go test -count=1 ./core/cli

Assisted-by: Codex:GPT-5

Signed-off-by: Ching Kao <0980124jim@gmail.com>
2026-06-09 14:58:44 +00:00
Pete
d2e6b93369 feat(agents): surface KB source citations in RAG responses (#10228)
* dev knowledge.go structure

Signed-off-by: Pete Chen <petechentw@gmail.com>

* feat(agents): append KB source citations to responses

Render structured KB citations as a Sources block after agent responses, linking each source to the existing raw collection entry endpoint.

Keep long-term memory writes on the original model response so citation blocks do not get stored back into the knowledge base.

Tested with: go test ./core/services/agents

Assisted-by: Codex:gpt-5
Signed-off-by: Pete Chen <petechentw@gmail.com>

* Collect KB citations from tool searches

Signed-off-by: Pete Chen <petechentw@gmail.com>

* fix(agents): append KB sources in local chats

Apply the shared KB citation post-processing to standalone LocalAGI chat responses so the React agent chat receives the same clickable Sources block as the native executor path. Also fix the run target to use the current cmd/local-ai entrypoint.

Assisted-by: Codex:gpt-5
Signed-off-by: Pete Chen <petechentw@gmail.com>

---------

Signed-off-by: Pete Chen <petechentw@gmail.com>
Co-authored-by: shihyunhuang <shihyunhuang88@gmail.com>
Co-authored-by: TLoE419 <tloemizuchizu@gmail.com>
Co-authored-by: Ching Kao <0980124jim@gmail.com>
2026-06-09 16:32:56 +02:00
LocalAI [bot]
e1ec03d33f fix(reasoning): stop prefilled <think> from swallowing tag-less answers (#10225)
* fix(reasoning): stop prefilled <think> from swallowing tag-less answers

When a chat template injects the thinking start token into the prompt (so
DetectThinkingStartToken returns e.g. "<think>"), the model's output begins
inside a reasoning block and carries only the closing tag. The non-jinja
autoparser fallback (peg-native "pure content" mode, issue #9985) prepends the
start token so the extractor can pair it with the model's </think>.

But on a COMPLETE response that contains no closing tag, the model answered
directly with no reasoning at all. Prepending the start token there manufactures
an unclosed block that swallows the entire answer into reasoning, leaving the
OpenAI `content` field empty. This breaks short/direct answers — session names,
JSON summaries, any terse completion where the model skips the think block —
which come back with empty content. Regression surfaced by #9991, which added
the defensive prefill extraction to the complete-response paths.

Add reasoning.ExtractReasoningComplete: it only honors a prefilled start token
when the response actually contains the matching closing tag (proof a reasoning
block exists). Genuine reasoning tags already in the content still extract;
tag-less content stays content. Apply it at every complete-response site
(applyAutoparserOverride, realtime, openresponses). The streaming per-token
extractor is intentionally left on ExtractReasoningWithConfig — mid-stream an
as-yet-unclosed block is legitimate and must surface as reasoning deltas.

Also adds reasoning.ClosingTokenForStart and hoists the default reasoning tag
pairs to package scope so both helpers share one source of truth.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(reasoning): cover the enable_thinking=false non-thinking-mode regression

Adds the end-to-end case that actually broke session summaries / auto-titles
and was not covered before: a request with enable_thinking=false against a
<think>-capable model. In non-thinking mode the model emits no reasoning block,
so llama.cpp's autoparser returns ChatDeltas with content set and
reasoning_content empty (verified against stock llama-server: same model with
chat_template_kwargs.enable_thinking=false returns reasoning_content=null,
content="hello"). thinkingStartToken is still "<think>" because it is detected
per-model from the enable_thinking=true render, so the old code prepended it and
swallowed the answer. The test fails without the ExtractReasoningComplete gate.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 09:02:04 +02:00
LocalAI [bot]
9323f4b5ca feat(llama-cpp): video input support (mtmd #24269) (#10216)
* chore(llama-cpp): bump to 8f83d6c for mtmd video input support

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): forward video input to mtmd (template + non-template paths)

Wire request->videos() into grpc-server.cpp mirroring the existing image
and audio handling: a video_data build + non-template files extraction, and
input_video chat chunks on the tokenizer-template path. allow_video is
auto-set at model load by the vendored upstream chat_params.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): add video attachment support to the chat UI

Mirror the image/audio attachment path for video: emit video_url content
parts, accept video/* in the picker, keep video files as base64, show a
film icon badge, and render attached video inline with a <video> player.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(llama-cpp): patch mtmd video stdin double-close (heap crash)

Upstream mtmd video input (ggml-org/llama.cpp#24269) double-fcloses the
ffmpeg/ffprobe stdin FILE: feed_stdin() fclose()s the FILE returned by
subprocess_stdin() (which is sp->stdin_file), then subprocess_destroy()
fclose()s the same pointer again -> heap corruption that aborts the
backend on any base64 input_video request (the CLI --video file path is
unaffected). Vendor a one-line fix (null sp->stdin_file after fclose)
via prepare.sh's patches/ until upstream merges it.

Verified e2e with gemma-4-e2b-it-qat-q4_0: video frames decode via
ffmpeg and the model answers correctly (red clip -> 'Red', blue -> 'Blue').

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(llama-cpp): re-pin to upstream #24316, drop vendored stdin patch

Upstream replaced the ad-hoc video stdin handling with a proper RAII
refactor (ggml-org/llama.cpp#24316, "mtmd: refactor video subproc
handling"), which includes the same `sp->stdin_file = nullptr` guard our
patch added (plus join-before-destroy ordering). Re-pin LLAMA_VERSION to
that branch head and drop patches/0001 - it's now redundant.

Verified e2e with gemma-4-e2b-it-qat-q4_0: no crash, video frames decode
and the model answers correctly (red clip -> "Red", blue -> "Blue").

NOTE: #24316 is not yet merged, so this pins to its branch-head commit
(28ca1e60). Re-pin to the squash-merge commit on master once it lands,
otherwise `git fetch` may lose the commit after the branch is deleted.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-08 23:17:50 +02:00
194 changed files with 9360 additions and 642 deletions

View File

@@ -703,6 +703,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-locate-anything-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "locate-anything-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -1543,6 +1556,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-locate-anything-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "locate-anything-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1569,6 +1595,19 @@ include:
backend: "rfdetr-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-locate-anything-cpp'
base-image: "ubuntu:24.04"
ubuntu-version: '2404'
runs-on: 'ubuntu-24.04-arm'
backend: "locate-anything-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -2806,6 +2845,74 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# locate-anything-cpp
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-locate-anything-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "locate-anything-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f32-locate-anything-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "locate-anything-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f16-locate-anything-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "locate-anything-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-locate-anything-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "locate-anything-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-locate-anything-cpp'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "locate-anything-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2899,6 +3006,19 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64-locate-anything-cpp'
base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
runs-on: 'ubuntu-24.04-arm'
backend: "locate-anything-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
# whisper
- build-type: ''
cuda-major-version: ""
@@ -4341,6 +4461,10 @@ includeDarwin:
tag-suffix: "-metal-darwin-arm64-silero-vad"
build-type: "metal"
lang: "go"
- backend: "sherpa-onnx"
tag-suffix: "-metal-darwin-arm64-sherpa-onnx"
build-type: "metal"
lang: "go"
- backend: "local-store"
tag-suffix: "-metal-darwin-arm64-local-store"
build-type: "metal"
@@ -4348,3 +4472,6 @@ includeDarwin:
- backend: "llama-cpp-quantization"
tag-suffix: "-metal-darwin-arm64-llama-cpp-quantization"
build-type: "mps"
- backend: "speaker-recognition"
tag-suffix: "-metal-darwin-arm64-speaker-recognition"
build-type: "mps"

View File

@@ -62,6 +62,10 @@ jobs:
variable: "RFDETR_VERSION"
branch: "main"
file: "backend/go/rfdetr-cpp/Makefile"
- repository: "mudler/locate-anything.cpp"
variable: "LOCATEANYTHING_VERSION"
branch: "master"
file: "backend/go/locate-anything-cpp/Makefile"
- repository: "predict-woo/qwen3-tts.cpp"
variable: "QWEN3TTS_CPP_VERSION"
branch: "main"

View File

@@ -38,6 +38,7 @@ jobs:
acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
rfdetr-cpp: ${{ steps.detect.outputs.rfdetr-cpp }}
locate-anything-cpp: ${{ steps.detect.outputs.locate-anything-cpp }}
vibevoice-cpp: ${{ steps.detect.outputs.vibevoice-cpp }}
localvqe: ${{ steps.detect.outputs.localvqe }}
voxtral: ${{ steps.detect.outputs.voxtral }}
@@ -563,7 +564,7 @@ jobs:
- name: Run e2e-backends smoke
env:
BACKEND_IMAGE: quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp
BACKEND_TEST_CAPS: health,load,predict,stream,logprobs,logit_bias
BACKEND_TEST_CAPS: health,load,predict,stream,logprobs,logit_bias,tokenize
run: |
make test-extra-backend
# Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked LLM.
@@ -901,6 +902,45 @@ jobs:
- name: Test rfdetr-cpp
run: |
make --jobs=5 --output-sync=target -C backend/go/rfdetr-cpp test
# Per-backend e2e for locate-anything-cpp: builds the .so + Go binary and
# runs `make -C backend/go/locate-anything-cpp test`. test.sh fetches the
# locate-anything-q8_0 GGUF (~6.3 GB, NVIDIA LocateAnything-3B) from the
# published mudler/locate-anything.cpp-gguf HF repo + a COCO image, then the
# Go wire test loads the model and runs an open-vocabulary Detect, asserting
# at least one labeled box. Heavier than the other Go backends (it is a 3B),
# so it is gated to changes under backend/go/locate-anything-cpp/.
tests-locate-anything-cpp:
needs: detect-changes
if: needs.detect-changes.outputs.locate-anything-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y build-essential cmake curl libopenblas-dev
- name: Setup Go
uses: actions/setup-go@v5
- name: Display Go version
run: go version
- name: Proto Dependencies
run: |
# Install protoc
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
PATH="$PATH:$HOME/go/bin" make protogen-go
- name: Build locate-anything-cpp
run: |
make --jobs=5 --output-sync=target -C backend/go/locate-anything-cpp
- name: Test locate-anything-cpp
run: |
make --jobs=5 --output-sync=target -C backend/go/locate-anything-cpp test
# Per-backend smoke for vibevoice-cpp: builds the .so + Go binary and
# runs `make -C backend/go/vibevoice-cpp test`. test.sh auto-downloads
# the published mudler/vibevoice.cpp-models bundle (TTS Q8_0 + ASR Q4_K

View File

@@ -108,6 +108,7 @@ RUN <<EOT bash
apt-get update && \
apt-get install -y --no-install-recommends \
cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
cuda-nvrtc-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \

View File

@@ -180,7 +180,7 @@ osx-signed: build
## Run
run: ## run local-ai
CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./
CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./cmd/local-ai
prepare-test: protogen-go build-mock-backend
@@ -566,6 +566,7 @@ prepare-test-extra: protogen-python
$(MAKE) -C backend/python/speaker-recognition
$(MAKE) -C backend/rust/kokoros kokoros-grpc
$(MAKE) -C backend/go/rfdetr-cpp
$(MAKE) -C backend/go/locate-anything-cpp
test-extra: prepare-test-extra
$(MAKE) -C backend/python/transformers test
@@ -593,6 +594,7 @@ test-extra: prepare-test-extra
$(MAKE) -C backend/python/speaker-recognition test
$(MAKE) -C backend/rust/kokoros test
$(MAKE) -C backend/go/rfdetr-cpp test
$(MAKE) -C backend/go/locate-anything-cpp test
##
## End-to-end gRPC tests that exercise a built backend container image.

View File

@@ -149,6 +149,16 @@ local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
local-ai run oci://localai/phi-2:latest
```
To test a running LocalAI server from the terminal, open an interactive chat session from another shell. Inside the prompt, `/models` lists installed models and `/model <name>` switches between them.
```bash
# Terminal 1
local-ai run llama-3.2-1b-instruct:q4_k_m
# Terminal 2
local-ai chat --model llama-3.2-1b-instruct:q4_k_m
```
> **Automatic Backend Detection**: LocalAI automatically detects your GPU capabilities and downloads the appropriate backend. For advanced options, see [GPU Acceleration](https://localai.io/features/gpu-acceleration/).
For more details, see the [Getting Started guide](https://localai.io/basics/getting_started/).

View File

@@ -206,6 +206,16 @@ RUN if [ "${BACKEND}" = "opus" ]; then \
apt-get clean && rm -rf /var/lib/apt/lists/*; \
fi
# CrispASR's piper TTS backend dlopens libespeak-ng at runtime to phonemize
# non-English text (the MIT-clean path; English uses a built-in G2P). Install
# the espeak-ng runtime + its libpcaudio/libsonic deps + voice data so
# package.sh can bundle them into the FROM scratch image.
RUN if [ "${BACKEND}" = "crispasr" ]; then \
apt-get update && apt-get install -y --no-install-recommends \
espeak-ng-data libespeak-ng1 libpcaudio0 libsonic0 && \
apt-get clean && rm -rf /var/lib/apt/lists/*; \
fi
COPY . /LocalAI
RUN git config --global --add safe.directory /LocalAI

View File

@@ -126,6 +126,7 @@ RUN <<EOT bash
apt-get update && \
apt-get install -y --no-install-recommends \
cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
cuda-nvrtc-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \

View File

@@ -1,10 +1,10 @@
# ds4 backend Makefile.
#
# Upstream pin lives below as DS4_VERSION?=c463029c205c2ec8d7ab6c0df4a3f52979091286
# Upstream pin lives below as DS4_VERSION?=d881f2a05e8ff6bec001315a36b794b4aa310173
# (.github/bump_deps.sh) can find and update it - matches the
# llama-cpp / ik-llama-cpp / turboquant convention.
DS4_VERSION?=c463029c205c2ec8d7ab6c0df4a3f52979091286
DS4_VERSION?=d881f2a05e8ff6bec001315a36b794b4aa310173
DS4_REPO?=https://github.com/antirez/ds4
CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))

View File

@@ -1,5 +1,5 @@
IK_LLAMA_VERSION?=6b9de3dbaa21ae95ea80638e5ee836795cc48c93
IK_LLAMA_VERSION?=e6f8112f3ba126eed3ff5b30cdd08085414a7516
LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
CMAKE_ARGS?=

View File

@@ -1,5 +1,5 @@
LLAMA_VERSION?=9e3b928fd8c9d14dbf15a8768b9fdd7e5c721d66
LLAMA_VERSION?=4c6595503fe45d5a39f88d194e270f64c7424677
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
CMAKE_ARGS?=

View File

@@ -381,6 +381,15 @@ json parse_options(bool streaming, const backend::PredictOptions* predict, const
});
}
// for each video in the request, add the video data
for (int i = 0; i < predict->videos_size(); i++) {
data["video_data"].push_back(json
{
{"id", i},
{"data", predict->videos(i)},
});
}
data["stop"] = predict->stopprompts();
// data["n_probs"] = predict->nprobs();
//TODO: images,
@@ -1503,7 +1512,7 @@ public:
msg_json["role"] = msg.role();
bool is_last_user_msg = (i == last_user_msg_idx);
bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0);
bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0 || request->videos_size() > 0);
// Handle content - can be string, null, or array
// For multimodal content, we'll embed images/audio from separate fields
@@ -1554,6 +1563,16 @@ public:
content_array.push_back(audio_chunk);
}
}
if (request->videos_size() > 0) {
for (int j = 0; j < request->videos_size(); j++) {
json video_chunk;
video_chunk["type"] = "input_video";
json input_video;
input_video["data"] = request->videos(j);
video_chunk["input_video"] = input_video;
content_array.push_back(video_chunk);
}
}
msg_json["content"] = content_array;
} else {
// Use content as-is (already array or not last user message)
@@ -1588,6 +1607,16 @@ public:
content_array.push_back(audio_chunk);
}
}
if (request->videos_size() > 0) {
for (int j = 0; j < request->videos_size(); j++) {
json video_chunk;
video_chunk["type"] = "input_video";
json input_video;
input_video["data"] = request->videos(j);
video_chunk["input_video"] = input_video;
content_array.push_back(video_chunk);
}
}
msg_json["content"] = content_array;
} else if (msg.role() == "tool") {
// Tool role messages must have content field set, even if empty
@@ -2039,6 +2068,16 @@ public:
files.push_back(decoded_data);
}
}
const auto &video_data = data.find("video_data");
if (video_data != data.end() && video_data->is_array())
{
for (const auto &video : *video_data)
{
auto decoded_data = base64_decode(video["data"].get<std::string>());
files.push_back(decoded_data);
}
}
}
const bool has_mtmd = ctx_server.impl->mctx != nullptr;
@@ -2291,7 +2330,7 @@ public:
}
bool is_last_user_msg = (i == last_user_msg_idx);
bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0);
bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0 || request->videos_size() > 0);
// Handle content - can be string, null, or array
// For multimodal content, we'll embed images/audio from separate fields
@@ -2344,6 +2383,16 @@ public:
content_array.push_back(audio_chunk);
}
}
if (request->videos_size() > 0) {
for (int j = 0; j < request->videos_size(); j++) {
json video_chunk;
video_chunk["type"] = "input_video";
json input_video;
input_video["data"] = request->videos(j);
video_chunk["input_video"] = input_video;
content_array.push_back(video_chunk);
}
}
msg_json["content"] = content_array;
} else {
// Use content as-is (already array or not last user message)
@@ -2383,6 +2432,16 @@ public:
content_array.push_back(audio_chunk);
}
}
if (request->videos_size() > 0) {
for (int j = 0; j < request->videos_size(); j++) {
json video_chunk;
video_chunk["type"] = "input_video";
json input_video;
input_video["data"] = request->videos(j);
video_chunk["input_video"] = input_video;
content_array.push_back(video_chunk);
}
}
msg_json["content"] = content_array;
SRV_INF("[CONTENT DEBUG] Predict: Message %d created content array with media\n", i);
} else if (!msg.tool_calls().empty()) {
@@ -2845,6 +2904,16 @@ public:
files.push_back(decoded_data);
}
}
const auto &video_data = data.find("video_data");
if (video_data != data.end() && video_data->is_array())
{
for (const auto &video : *video_data)
{
auto decoded_data = base64_decode(video["data"].get<std::string>());
files.push_back(decoded_data);
}
}
}
// process files
@@ -3417,7 +3486,7 @@ public:
if (body.count("prompt") != 0) {
const bool add_special = json_value(body, "add_special", false);
llama_tokens tokens = tokenize_mixed(ctx_server.impl->vocab, body.at("content"), add_special, true);
llama_tokens tokens = tokenize_mixed(ctx_server.impl->vocab, body.at("prompt"), add_special, true);
for (const auto& token : tokens) {

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# CrispASR version (release tag)
CRISPASR_REPO?=https://github.com/CrispStrobe/CrispASR
CRISPASR_VERSION?=f7838a306687f22c281d29c250f879a4ab3df2d7
CRISPASR_VERSION?=d745bda4386ae0f9d1d2f23fff8ec95d76428221
SO_TARGET?=libgocrispasr.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF

View File

@@ -11,6 +11,7 @@ import (
"github.com/go-audio/audio"
"github.com/go-audio/wav"
gguf "github.com/gpustack/gguf-parser-go"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/utils"
@@ -37,6 +38,39 @@ var (
type CrispASR struct {
base.SingleThread
// sampleRate is the output rate (Hz) of the loaded TTS engine's PCM, used to
// write a correct WAV header. Most CrispASR TTS backends emit 24 kHz, but
// piper returns its model's native rate (16 kHz for x_low/low voices,
// 22.05 kHz for medium/high), so it is read from the GGUF metadata at Load.
sampleRate int
}
// defaultTTSSampleRate is the output rate assumed for CrispASR TTS engines that
// don't advertise one in GGUF metadata (vibevoice/orpheus/chatterbox/qwen3-tts
// all emit 24 kHz). piper is the exception and carries piper.sample_rate.
const defaultTTSSampleRate = 24000
// piperSampleRate reads the piper.sample_rate metadata key from a GGUF model.
// CrispASR's piper backend returns PCM at the model's native rate without
// resampling, so the WAV header must match it. Returns ok=false for non-piper
// models (key absent) or an unreadable file, letting the caller fall back to
// defaultTTSSampleRate.
func piperSampleRate(modelPath string) (int, bool) {
// Only scalar architecture keys are read, so skip the large array metadata
// (phoneme map) and mmap the header - same rationale as pkg/vram's reader.
f, err := gguf.ParseGGUFFile(modelPath, gguf.UseMMap(), gguf.SkipLargeMetadata())
if err != nil {
return 0, false
}
kv, ok := f.Header.MetadataKV.Get("piper.sample_rate")
if !ok || kv.ValueType != gguf.GGUFMetadataValueTypeUint32 {
return 0, false
}
rate := int(kv.ValueUint32())
if rate <= 0 {
return 0, false
}
return rate, true
}
// splitOption splits a "prefix:value" model option into its key and value,
@@ -103,6 +137,14 @@ func (w *CrispASR) Load(opts *pb.ModelOptions) error {
return fmt.Errorf("Failed to load CrispASR transcription model")
}
// Determine the TTS output sample rate for the WAV header. piper voices
// carry their native rate in GGUF metadata and CrispASR does not resample;
// every other engine emits the 24 kHz default.
w.sampleRate = defaultTTSSampleRate
if rate, ok := piperSampleRate(opts.ModelFile); ok {
w.sampleRate = rate
}
// Load the companion file (codec/tokenizer/s3gen) after the session is open.
// rc==0 means success or "not applicable" for the active backend; only a
// negative code is fatal.
@@ -390,7 +432,7 @@ func (w *CrispASR) synthesize(text string) ([]float32, error) {
}
defer CppTTSFree(ptr)
src := unsafe.Slice((*float32)(unsafe.Pointer(ptr)), int(n)) //nolint:govet // ptr addresses C-allocated PCM returned across the purego boundary; copied out immediately below, before tts_free.
out := make([]float32, int(n)) // copy out of C memory before free
out := make([]float32, int(n)) // copy out of C memory before free
copy(out, src)
return out, nil
}
@@ -417,7 +459,7 @@ func (w *CrispASR) TTS(req *pb.TTSRequest) error {
if err != nil {
return err
}
return writeWAV24k(req.Dst, pcm)
return writeWAV(req.Dst, pcm, w.sampleRate)
}
// TTSStream is the streaming counterpart to TTS. CrispASR has no progressive
@@ -447,7 +489,7 @@ func (w *CrispASR) TTSStream(req *pb.TTSRequest, results chan []byte) error {
}
defer func() { _ = os.Remove(dst) }()
if err := writeWAV24k(dst, pcm); err != nil {
if err := writeWAV(dst, pcm, w.sampleRate); err != nil {
return err
}
@@ -459,14 +501,14 @@ func (w *CrispASR) TTSStream(req *pb.TTSRequest, results chan []byte) error {
return nil
}
// writeWAV24k writes pcm as a 24000 Hz, mono, 16-bit PCM WAV at dst.
func writeWAV24k(dst string, pcm []float32) error {
// writeWAV writes pcm as a sampleRate Hz, mono, 16-bit PCM WAV at dst.
func writeWAV(dst string, pcm []float32, sampleRate int) error {
f, err := os.Create(dst)
if err != nil {
return fmt.Errorf("crispasr: create %q: %w", dst, err)
}
enc := wav.NewEncoder(f, 24000, 16, 1, 1)
enc := wav.NewEncoder(f, sampleRate, 16, 1, 1)
ints := make([]int, len(pcm))
for i, s := range pcm {
if s > 1 {
@@ -477,7 +519,7 @@ func writeWAV24k(dst string, pcm []float32) error {
ints[i] = int(s * 32767)
}
buf := &audio.IntBuffer{
Format: &audio.Format{NumChannels: 1, SampleRate: 24000},
Format: &audio.Format{NumChannels: 1, SampleRate: sampleRate},
Data: ints,
SourceBitDepth: 16,
}

View File

@@ -0,0 +1,164 @@
package main
import (
"bytes"
"encoding/binary"
"os"
"path/filepath"
"github.com/go-audio/wav"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// GGUF metadata value type tags (subset) from the GGUF spec.
const (
ggufTypeUint32 uint32 = 4
ggufTypeString uint32 = 8
)
type ggufKV struct {
key string
vtype uint32
val any
}
// writeMinimalGGUF emits a valid, tensor-less GGUF file carrying only the given
// metadata key-values. Enough for the header-only parse path piperSampleRate
// uses; avoids pulling a real multi-MB voice into the test.
func writeMinimalGGUF(path string, kvs []ggufKV) error {
var b bytes.Buffer
b.WriteString("GGUF") // magic
_ = binary.Write(&b, binary.LittleEndian, uint32(3)) // version
_ = binary.Write(&b, binary.LittleEndian, uint64(0)) // tensor count
_ = binary.Write(&b, binary.LittleEndian, uint64(len(kvs)))
for _, kv := range kvs {
_ = binary.Write(&b, binary.LittleEndian, uint64(len(kv.key)))
b.WriteString(kv.key)
_ = binary.Write(&b, binary.LittleEndian, kv.vtype)
switch v := kv.val.(type) {
case uint32:
_ = binary.Write(&b, binary.LittleEndian, v)
case string:
_ = binary.Write(&b, binary.LittleEndian, uint64(len(v)))
b.WriteString(v)
}
}
return os.WriteFile(path, b.Bytes(), 0o644)
}
// wavSampleRate decodes the WAV header at path and returns its sample rate.
func wavSampleRate(path string) (int, error) {
f, err := os.Open(path)
if err != nil {
return 0, err
}
defer func() { _ = f.Close() }()
dec := wav.NewDecoder(f)
dec.ReadInfo()
return int(dec.SampleRate), nil
}
var _ = Describe("piper sample rate", func() {
Context("piperSampleRate", func() {
It("reads piper.sample_rate from a piper GGUF (medium = 22050)", func() {
p := filepath.Join(GinkgoT().TempDir(), "voice.gguf")
Expect(writeMinimalGGUF(p, []ggufKV{
{key: "general.architecture", vtype: ggufTypeString, val: "piper"},
{key: "piper.sample_rate", vtype: ggufTypeUint32, val: uint32(22050)},
})).To(Succeed())
rate, ok := piperSampleRate(p)
Expect(ok).To(BeTrue(), "piper.sample_rate should be found")
Expect(rate).To(Equal(22050))
})
It("reads the low-quality rate (16000)", func() {
p := filepath.Join(GinkgoT().TempDir(), "voice.gguf")
Expect(writeMinimalGGUF(p, []ggufKV{
{key: "piper.sample_rate", vtype: ggufTypeUint32, val: uint32(16000)},
})).To(Succeed())
rate, ok := piperSampleRate(p)
Expect(ok).To(BeTrue())
Expect(rate).To(Equal(16000))
})
It("returns ok=false for a non-piper GGUF (no piper.sample_rate key)", func() {
p := filepath.Join(GinkgoT().TempDir(), "other.gguf")
Expect(writeMinimalGGUF(p, []ggufKV{
{key: "general.architecture", vtype: ggufTypeString, val: "vibevoice"},
})).To(Succeed())
_, ok := piperSampleRate(p)
Expect(ok).To(BeFalse())
})
It("returns ok=false for an unreadable/non-GGUF file", func() {
p := filepath.Join(GinkgoT().TempDir(), "garbage.gguf")
Expect(os.WriteFile(p, []byte("not a gguf"), 0o644)).To(Succeed())
_, ok := piperSampleRate(p)
Expect(ok).To(BeFalse())
})
})
// End-to-end through the built .so. Gated on CRISPASR_PIPER_MODEL_PATH (a
// real piper voice GGUF) like the other model-backed specs; never runs in
// default CI. Proves CrispASR's piper backend output rate flows into the
// WAV header instead of the hardcoded 24 kHz default.
Context("piper TTS end-to-end", func() {
It("writes the WAV at the model's native piper.sample_rate", func() {
model := os.Getenv("CRISPASR_PIPER_MODEL_PATH")
if model == "" {
Skip("set CRISPASR_PIPER_MODEL_PATH to run the piper e2e spec")
}
ensureLibLoaded()
expected, ok := piperSampleRate(model)
Expect(ok).To(BeTrue(), "model should carry piper.sample_rate metadata")
w := &CrispASR{}
Expect(w.Load(&pb.ModelOptions{
ModelFile: model,
Options: []string{"backend:piper"},
Threads: 4,
})).To(Succeed())
dst := filepath.Join(GinkgoT().TempDir(), "piper.wav")
Expect(w.TTS(&pb.TTSRequest{Text: "Hello from CrispASR piper.", Dst: dst})).To(Succeed())
info, err := os.Stat(dst)
Expect(err).ToNot(HaveOccurred())
Expect(info.Size()).To(BeNumerically(">", 1024), "expected a non-trivial WAV")
rate, err := wavSampleRate(dst)
Expect(err).ToNot(HaveOccurred())
Expect(rate).To(Equal(expected),
"WAV header rate must equal the model's native piper.sample_rate, not the 24k default")
})
})
Context("writeWAV", func() {
It("writes the WAV header at the given sample rate (22050 for piper, not the 24k default)", func() {
dst := filepath.Join(GinkgoT().TempDir(), "out.wav")
pcm := make([]float32, 220) // 10 ms of silence is enough for a header
Expect(writeWAV(dst, pcm, 22050)).To(Succeed())
rate, err := wavSampleRate(dst)
Expect(err).ToNot(HaveOccurred())
Expect(rate).To(Equal(22050))
})
It("writes a 16000 Hz header for low-quality piper voices", func() {
dst := filepath.Join(GinkgoT().TempDir(), "out.wav")
pcm := make([]float32, 160)
Expect(writeWAV(dst, pcm, 16000)).To(Succeed())
rate, err := wavSampleRate(dst)
Expect(err).ToNot(HaveOccurred())
Expect(rate).To(Equal(16000))
})
})
})

View File

@@ -51,6 +51,32 @@ else
exit 1
fi
# Bundle espeak-ng (+ its libpcaudio/libsonic runtime deps) and its voice data so
# the piper TTS backend can phonemize non-English text. CrispASR dlopens
# libespeak-ng.so.1 at runtime (the MIT-clean path); the dlopen succeeds loading
# libespeak-ng but FAILS if libpcaudio/libsonic are absent, so all three .so are
# required. run.sh points CRISPASR_ESPEAK_DATA_PATH at the bundled data dir.
# Best-effort: only copied when present, so a local dev build without espeak-ng
# installed still packages the rest (English voices keep working).
ESPEAK_LIBDIR=""
for d in /usr/lib/x86_64-linux-gnu /usr/lib/aarch64-linux-gnu; do
if [ -f "$d/libespeak-ng.so.1" ]; then
ESPEAK_LIBDIR="$d"
break
fi
done
if [ -n "$ESPEAK_LIBDIR" ]; then
echo "Bundling espeak-ng from $ESPEAK_LIBDIR ..."
cp -arfLv "$ESPEAK_LIBDIR/libespeak-ng.so.1" $CURDIR/package/lib/
cp -arfLv "$ESPEAK_LIBDIR/libpcaudio.so.0" $CURDIR/package/lib/
cp -arfLv "$ESPEAK_LIBDIR/libsonic.so.0" $CURDIR/package/lib/
if [ -d "$ESPEAK_LIBDIR/espeak-ng-data" ]; then
cp -arfLv "$ESPEAK_LIBDIR/espeak-ng-data" $CURDIR/package/
fi
else
echo "espeak-ng not found; non-English piper voices will not phonemize"
fi
# Package GPU libraries based on BUILD_TYPE
# The GPU library packaging script will detect BUILD_TYPE and copy appropriate GPU libraries
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"

View File

@@ -41,6 +41,11 @@ fi
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
export CRISPASR_LIBRARY=$LIBRARY
# Point piper's espeak-ng phonemizer at the bundled voice data. The variable
# names the directory CONTAINING espeak-ng-data (package.sh drops it next to
# this script). Harmless when espeak-ng wasn't bundled.
export CRISPASR_ESPEAK_DATA_PATH=$CURDIR
# If there is a lib/ld.so, use it
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"

View File

@@ -0,0 +1,7 @@
sources/
build*/
package/
liblocateanythingcpp*.so
locate-anything-cpp
test-models/
test-data/

View File

@@ -0,0 +1,57 @@
cmake_minimum_required(VERSION 3.18)
project(liblocateanythingcpp LANGUAGES C CXX)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
# Static-link ggml + locate_anything so the resulting .so has no runtime
# dependency on extra ggml/locate_anything shared libraries — only on
# libc/libstdc++/libgomp, which the LocalAI package step bundles into the
# docker image.
set(BUILD_SHARED_LIBS OFF CACHE BOOL "Build static libraries" FORCE)
# locate-anything.cpp build switches: skip CLI/tests, keep static lib.
set(LA_BUILD_CLI OFF CACHE BOOL "Disable locate-anything CLI" FORCE)
set(LA_BUILD_TESTS OFF CACHE BOOL "Disable locate-anything tests" FORCE)
set(LA_SHARED OFF CACHE BOOL "Build locate_anything as static lib" FORCE)
# Unlike rt-detr.cpp, locate-anything.cpp ships no in-tree ggml patches, so
# there is no apply_ggml_patches.sh hook to shim here.
add_subdirectory(./sources/locate-anything.cpp)
# locate-anything.cpp's top-level CMakeLists points its own target's include
# dirs at ${CMAKE_SOURCE_DIR}/{include,src,third_party,...}. CMAKE_SOURCE_DIR
# is the *top-level* source dir of the whole CMake tree, so when we pull it in
# via add_subdirectory it resolves to OUR directory, not theirs, and the
# locate_anything target fails to find its own headers (la_capi.h, stb_image.h,
# la_gguf_keys.h). Re-add the correct, subdir-relative include paths to the
# already-defined target so it compiles regardless of where it's nested.
set(LA_SRC ${CMAKE_CURRENT_SOURCE_DIR}/sources/locate-anything.cpp)
target_include_directories(locate_anything PRIVATE
${LA_SRC}/include
${LA_SRC}/src
${LA_SRC}/third_party
${LA_SRC}/third_party/stb)
# locate-anything.cpp's C-API symbols already live inside liblocate_anything
# (src/la_capi.cpp is compiled into the lib). We re-export them via a MODULE
# library that links locate_anything so the symbols are visible at dlopen time.
add_library(locateanythingcpp MODULE
sources/locate-anything.cpp/src/la_capi.cpp)
target_include_directories(locateanythingcpp PRIVATE
sources/locate-anything.cpp/include
sources/locate-anything.cpp/src
sources/locate-anything.cpp/third_party
sources/locate-anything.cpp/third_party/stb
)
target_link_libraries(locateanythingcpp PRIVATE locate_anything ggml)
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)
target_link_libraries(locateanythingcpp PRIVATE stdc++fs)
endif()
set_property(TARGET locateanythingcpp PROPERTY CXX_STANDARD 17)
set_target_properties(locateanythingcpp PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})

View File

@@ -0,0 +1,134 @@
CMAKE_ARGS?=
BUILD_TYPE?=
NATIVE?=false
GOCMD?=go
GO_TAGS?=
JOBS?=$(shell nproc --ignore=1)
# locate-anything.cpp. Pin to a specific commit for a stable build; leaving
# this on `master` always picks up the latest C-API surface (incl. the
# per-detection accessor functions used by golocateanythingcpp.go).
LOCATEANYTHING_REPO?=https://github.com/mudler/locate-anything.cpp.git
LOCATEANYTHING_VERSION?=60e450945476d5e97e0754a8c0e71a9ea81690e0
ifeq ($(NATIVE),false)
CMAKE_ARGS+=-DGGML_NATIVE=OFF
endif
# Forward LocalAI's BUILD_TYPE to the matching ggml backend switch.
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS+=-DGGML_CUDA=ON -DLA_GGML_CUDA=ON
else ifeq ($(BUILD_TYPE),openblas)
CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
else ifeq ($(BUILD_TYPE),clblas)
CMAKE_ARGS+=-DGGML_CLBLAST=ON
else ifeq ($(BUILD_TYPE),hipblas)
ROCM_HOME ?= /opt/rocm
ROCM_PATH ?= /opt/rocm
export CXX=$(ROCM_HOME)/llvm/bin/clang++
export CC=$(ROCM_HOME)/llvm/bin/clang
AMDGPU_TARGETS?=gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1200,gfx1201
CMAKE_ARGS+=-DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=ON -DLA_GGML_VULKAN=ON
else ifeq ($(OS),Darwin)
ifneq ($(BUILD_TYPE),metal)
CMAKE_ARGS+=-DGGML_METAL=OFF
else
CMAKE_ARGS+=-DGGML_METAL=ON
CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
CMAKE_ARGS+=-DLA_GGML_METAL=ON
endif
endif
ifeq ($(BUILD_TYPE),sycl_f16)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx \
-DGGML_SYCL_F16=ON
endif
ifeq ($(BUILD_TYPE),sycl_f32)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx
endif
sources/locate-anything.cpp:
mkdir -p sources && \
git clone --recursive $(LOCATEANYTHING_REPO) sources/locate-anything.cpp && \
cd sources/locate-anything.cpp && \
git checkout $(LOCATEANYTHING_VERSION) && \
git submodule update --init --recursive --depth 1 --single-branch
# Detect OS
UNAME_S := $(shell uname -s)
# Only build CPU variants on Linux
ifeq ($(UNAME_S),Linux)
VARIANT_TARGETS = liblocateanythingcpp-avx.so liblocateanythingcpp-avx2.so liblocateanythingcpp-avx512.so liblocateanythingcpp-fallback.so
else
# On non-Linux (e.g., Darwin), build only fallback variant
VARIANT_TARGETS = liblocateanythingcpp-fallback.so
endif
locate-anything-cpp: main.go golocateanythingcpp.go $(VARIANT_TARGETS)
CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o locate-anything-cpp ./
package: locate-anything-cpp
bash package.sh
build: package
clean: purge
rm -rf liblocateanythingcpp*.so locate-anything-cpp package sources
purge:
rm -rf build*
# Build all variants (Linux only)
ifeq ($(UNAME_S),Linux)
liblocateanythingcpp-avx.so: sources/locate-anything.cpp
rm -rfv build-$@
$(info ${GREEN}I locate-anything-cpp build info:avx${RESET})
SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) liblocateanythingcpp-custom
rm -rfv build-$@
liblocateanythingcpp-avx2.so: sources/locate-anything.cpp
rm -rfv build-$@
$(info ${GREEN}I locate-anything-cpp build info:avx2${RESET})
SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) liblocateanythingcpp-custom
rm -rfv build-$@
liblocateanythingcpp-avx512.so: sources/locate-anything.cpp
rm -rfv build-$@
$(info ${GREEN}I locate-anything-cpp build info:avx512${RESET})
SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) liblocateanythingcpp-custom
rm -rfv build-$@
endif
# Build fallback variant (all platforms)
liblocateanythingcpp-fallback.so: sources/locate-anything.cpp
rm -rfv build-$@
$(info ${GREEN}I locate-anything-cpp build info:fallback${RESET})
SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) liblocateanythingcpp-custom
rm -rfv build-$@
liblocateanythingcpp-custom: CMakeLists.txt
mkdir -p build-$(SO_TARGET) && \
cd build-$(SO_TARGET) && \
cmake .. $(CMAKE_ARGS) && \
cmake --build . --config Release -j$(JOBS) && \
cd .. && \
mv build-$(SO_TARGET)/liblocateanythingcpp.so ./$(SO_TARGET)
all: locate-anything-cpp package
# `test` is invoked by the top-level Makefile's `test-extra` target. It builds
# the backend binary + the fallback shared library (needed for dlopen at
# runtime), then runs test.sh which downloads the q8_0 GGUF + COCO image and
# exercises the gRPC Load/Detect wire path via the Go smoke test in
# main_test.go.
test: locate-anything-cpp liblocateanythingcpp-fallback.so
bash test.sh

View File

@@ -0,0 +1,174 @@
package main
// golocateanythingcpp.go - gRPC handlers (Load, Detect) for the
// locate-anything-cpp backend.
//
// Embeds base.SingleThread to default unimplemented RPCs to "not supported"
// while we only implement open-vocabulary object detection (Detect).
import (
"encoding/base64"
"fmt"
"os"
"path/filepath"
"unsafe"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
)
// la_ctx* is an opaque handle. la_capi_load returns it directly (0 == failure),
// unlike rfdetr's out-parameter convention.
var (
// la_capi_load(const char* gguf_path, int n_threads) -> la_ctx* (0 = fail)
CapiLoad func(gguf string, nThreads int32) uintptr
// la_capi_free(la_ctx* ctx)
CapiFree func(handle uintptr)
// la_capi_locate_path(ctx, image_path, prompt, mode) -> char* json (0 = err)
CapiLocatePath func(handle uintptr, imagePath string, prompt string, mode int32) uintptr
// la_capi_locate_buffer(ctx, bytes, len, prompt, mode) -> char* json (0 = err)
CapiLocateBuffer func(handle uintptr, bytes uintptr, length uintptr, prompt string, mode int32) uintptr
// la_capi_get_n_detections(ctx) -> int
CapiGetNDetections func(handle uintptr) int32
// la_capi_get_detection_box(ctx, i, out_xyxy[4]) -> int (0 on success)
CapiGetDetectionBox func(handle uintptr, i int32, outXYXY uintptr) int32
// la_capi_get_detection_label(ctx, i, buf, buf_size) -> int (required size incl NUL; two-call sizing)
CapiGetDetectionLabel func(handle uintptr, i int32, buf uintptr, bufSize int32) int32
// la_capi_free_string(char* s)
CapiFreeString func(s uintptr)
// la_capi_last_error(ctx) -> const char* (owned by ctx, "" if none / null ctx).
// purego marshals the returned C string into a Go string (a copy), so we
// never free it and avoid raw pointer arithmetic.
CapiLastError func(handle uintptr) string
)
type LocateAnythingCpp struct {
base.SingleThread
handle uintptr
}
// Load loads the GGUF model at opts.ModelFile (joined with opts.ModelPath if
// relative) and stores the la_ctx handle for later Detect calls.
func (r *LocateAnythingCpp) Load(opts *pb.ModelOptions) error {
modelFile := opts.ModelFile
if modelFile == "" {
modelFile = opts.Model
}
if modelFile == "" {
return fmt.Errorf("locate-anything-cpp: ModelFile is empty")
}
var modelPath string
if filepath.IsAbs(modelFile) {
modelPath = modelFile
} else {
modelPath = filepath.Join(opts.ModelPath, modelFile)
}
if _, err := os.Stat(modelPath); err != nil {
return fmt.Errorf("locate-anything-cpp: model file not found: %s: %w", modelPath, err)
}
threads := opts.Threads
if threads <= 0 {
threads = 4
}
// Release previous model if any (re-Load).
if r.handle != 0 {
CapiFree(r.handle)
r.handle = 0
}
h := CapiLoad(modelPath, threads)
if h == 0 {
// la_capi_last_error needs a ctx; on a failed load we have none (it
// returns "" for a null ctx), so the text is best-effort. Surface it
// when present.
if msg := CapiLastError(0); msg != "" {
return fmt.Errorf("locate-anything-cpp: la_capi_load failed for %s: %s", modelPath, msg)
}
return fmt.Errorf("locate-anything-cpp: la_capi_load failed for %s", modelPath)
}
r.handle = h
return nil
}
// Detect runs open-vocabulary detection on the base64-encoded image in opts.Src
// using the required text prompt in opts.Prompt, returning one pb.Detection per
// located object with its predicted label as ClassName.
func (r *LocateAnythingCpp) Detect(opts *pb.DetectOptions) (pb.DetectResponse, error) {
if r.handle == 0 {
return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: model not loaded")
}
// Open-vocabulary detection is prompt-driven; without a prompt there is
// nothing to locate.
prompt := opts.Prompt
if prompt == "" {
return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: a text prompt is required (open-vocabulary detection)")
}
// Decode base64 image and write to temp file.
imgData, err := base64.StdEncoding.DecodeString(opts.Src)
if err != nil {
return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: failed to decode base64 image: %w", err)
}
tmpFile, err := os.CreateTemp("", "locate-anything-*.img")
if err != nil {
return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: failed to create temp file: %w", err)
}
defer func() { _ = os.Remove(tmpFile.Name()) }()
if _, err := tmpFile.Write(imgData); err != nil {
_ = tmpFile.Close()
return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: failed to write temp file: %w", err)
}
if err := tmpFile.Close(); err != nil {
return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: failed to close temp file: %w", err)
}
// mode 0 = hybrid (Parallel Box Decoding). The JSON return value is unused:
// structured detections are read via the accessor functions. Still must
// free the returned string.
jsonPtr := CapiLocatePath(r.handle, tmpFile.Name(), prompt, 0)
if jsonPtr != 0 {
CapiFreeString(jsonPtr)
}
n := CapiGetNDetections(r.handle)
if n < 0 {
return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: invalid n_detections=%d", n)
}
detections := make([]*pb.Detection, 0, n)
for i := int32(0); i < n; i++ {
var xyxy [4]float32 // x1, y1, x2, y2
if CapiGetDetectionBox(r.handle, i, uintptr(unsafe.Pointer(&xyxy[0]))) != 0 {
continue
}
// Two-call sizing for the label string.
label := ""
need := CapiGetDetectionLabel(r.handle, i, 0, 0)
if need > 0 {
buf := make([]byte, need)
CapiGetDetectionLabel(r.handle, i, uintptr(unsafe.Pointer(&buf[0])), need)
label = string(buf[:need-1])
}
detections = append(detections, &pb.Detection{
X: xyxy[0],
Y: xyxy[1],
Width: xyxy[2] - xyxy[0],
Height: xyxy[3] - xyxy[1],
Confidence: 1.0,
ClassName: label,
})
}
return pb.DetectResponse{
Detections: detections,
}, nil
}

View File

@@ -0,0 +1,59 @@
package main
// main.go - entry point for the locate-anything-cpp gRPC backend.
//
// Dlopens liblocateanythingcpp-<variant>.so via purego at the path in
// LOCATEANYTHING_LIBRARY (set by run.sh based on /proc/cpuinfo), registers
// the la_capi_* C ABI symbols, then starts the gRPC server.
import (
"flag"
"os"
"github.com/ebitengine/purego"
grpc "github.com/mudler/LocalAI/pkg/grpc"
)
var (
addr = flag.String("addr", "localhost:50051", "the address to connect to")
)
type LibFuncs struct {
FuncPtr any
Name string
}
func main() {
// Get library name from environment variable, default to fallback
libName := os.Getenv("LOCATEANYTHING_LIBRARY")
if libName == "" {
libName = "./liblocateanythingcpp-fallback.so"
}
lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
panic(err)
}
libFuncs := []LibFuncs{
{&CapiLoad, "la_capi_load"},
{&CapiFree, "la_capi_free"},
{&CapiLocatePath, "la_capi_locate_path"},
{&CapiLocateBuffer, "la_capi_locate_buffer"},
{&CapiGetNDetections, "la_capi_get_n_detections"},
{&CapiGetDetectionBox, "la_capi_get_detection_box"},
{&CapiGetDetectionLabel, "la_capi_get_detection_label"},
{&CapiFreeString, "la_capi_free_string"},
{&CapiLastError, "la_capi_last_error"},
}
for _, lf := range libFuncs {
purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
}
flag.Parse()
if err := grpc.StartServer(*addr, &LocateAnythingCpp{}); err != nil {
panic(err)
}
}

View File

@@ -0,0 +1,176 @@
package main
// main_test.go - end-to-end smoke test for the locate-anything-cpp gRPC backend.
//
// Spawns the compiled locate-anything-cpp binary on a free local port, dials it
// via gRPC, and exercises LoadModel + Detect against the test fixtures
// downloaded by test.sh: the q8_0 GGUF of nvidia/LocateAnything-3B and a real
// COCO image with people + cars. Asserts that open-vocabulary detection driven
// by a text prompt returns at least one detection, each carrying a non-empty
// class name and a bounding box of non-zero size.
//
// The spec Skip()s cleanly if its fixtures (the ~6.3 GB model, the test image,
// the built binary, or the fallback .so) are missing, so the test target stays
// usable on a fresh checkout / on CI runners where the large model hasn't been
// downloaded.
import (
"context"
"encoding/base64"
"fmt"
"net"
"os"
"os/exec"
"path/filepath"
"testing"
"time"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
)
func TestDetect(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "locate-anything-cpp backend smoke suite")
}
// freePort grabs an ephemeral TCP port and immediately releases it so the
// spawned backend can bind to it. There is a tiny TOCTOU window here but in
// practice it's adequate for a smoke test on a quiet runner.
func freePort() int {
l, err := net.Listen("tcp", "127.0.0.1:0")
Expect(err).ToNot(HaveOccurred(), "freePort listen")
port := l.Addr().(*net.TCPAddr).Port
Expect(l.Close()).To(Succeed())
return port
}
// startBackend spawns the locate-anything-cpp binary on the given port and
// waits until it accepts TCP connections (up to 10s). It mirrors how main.go
// resolves the purego library: the LOCATEANYTHING_LIBRARY env var points the
// dlopen at the freshly built fallback .so, and the la_capi_* symbols are
// registered there. The returned cleanup func kills the process and reaps it.
func startBackend(port int) func() {
binary, err := filepath.Abs("./locate-anything-cpp")
Expect(err).ToNot(HaveOccurred())
if _, err := os.Stat(binary); err != nil {
Skip(fmt.Sprintf("backend binary not built: %s (run `make locate-anything-cpp` first)", binary))
}
libPath, err := filepath.Abs("./liblocateanythingcpp-fallback.so")
Expect(err).ToNot(HaveOccurred())
if _, err := os.Stat(libPath); err != nil {
Skip(fmt.Sprintf("fallback library not built: %s (run `make liblocateanythingcpp-fallback.so` first)", libPath))
}
addr := fmt.Sprintf("127.0.0.1:%d", port)
cmd := exec.Command(binary, "--addr", addr)
cmd.Env = append(os.Environ(), "LOCATEANYTHING_LIBRARY="+libPath)
cmd.Stdout = os.Stderr
cmd.Stderr = os.Stderr
Expect(cmd.Start()).To(Succeed())
cleanup := func() {
if cmd.Process != nil {
_ = cmd.Process.Kill()
_, _ = cmd.Process.Wait()
}
}
deadline := time.Now().Add(10 * time.Second)
for time.Now().Before(deadline) {
c, err := net.DialTimeout("tcp", addr, 200*time.Millisecond)
if err == nil {
_ = c.Close()
return cleanup
}
time.Sleep(200 * time.Millisecond)
}
cleanup()
Fail(fmt.Sprintf("backend did not become ready on %s within 10s", addr))
return func() {}
}
// loadTestImage reads the COCO test image downloaded by test.sh and returns its
// base64-encoded content (the wire format accepted by the Detect RPC).
func loadTestImage() string {
imgPath, err := filepath.Abs("test-data/test.jpg")
Expect(err).ToNot(HaveOccurred())
imgBytes, err := os.ReadFile(imgPath)
if err != nil {
Skip(fmt.Sprintf("test image not present: %s (run test.sh first)", imgPath))
}
return base64.StdEncoding.EncodeToString(imgBytes)
}
// dialBackend opens a gRPC client connection to the spawned backend.
func dialBackend(port int) (pb.BackendClient, func()) {
addr := fmt.Sprintf("127.0.0.1:%d", port)
conn, err := grpc.NewClient(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
Expect(err).ToNot(HaveOccurred())
return pb.NewBackendClient(conn), func() { _ = conn.Close() }
}
// modelPathOrSkip resolves the model file under ./test-models/ and Skip()s the
// current spec if it's missing (the ~6.3 GB GGUF is not present on a fresh
// checkout / on CI runners without the download).
func modelPathOrSkip(name string) string {
modelDir, err := filepath.Abs("test-models")
Expect(err).ToNot(HaveOccurred())
modelPath := filepath.Join(modelDir, name)
if _, err := os.Stat(modelPath); err != nil {
Skip(fmt.Sprintf("model not present: %s (run test.sh first)", modelPath))
}
return modelPath
}
var _ = Describe("locate-anything-cpp backend", func() {
It("runs open-vocabulary detection against a known-good COCO image", func() {
modelPath := modelPathOrSkip("locate-anything-q8_0.gguf")
imgB64 := loadTestImage()
port := freePort()
cleanup := startBackend(port)
defer cleanup()
client, closeConn := dialBackend(port)
defer closeConn()
// The q8_0 model is ~6.3 GB and hybrid Parallel Box Decoding on CPU is
// not cheap, so give LoadModel + Detect a generous deadline.
ctx, cancel := context.WithTimeout(context.Background(), 20*time.Minute)
defer cancel()
loadResp, err := client.LoadModel(ctx, &pb.ModelOptions{
Model: "locate-anything-q8_0.gguf",
ModelFile: modelPath,
Threads: 4,
})
Expect(err).ToNot(HaveOccurred(), "LoadModel")
Expect(loadResp.GetSuccess()).To(BeTrue(), "LoadModel reported failure: %s", loadResp.GetMessage())
// Open-vocabulary detection is prompt-driven; the prompt names the
// classes to locate (people + cars), separated by the </c> control token.
detResp, err := client.Detect(ctx, &pb.DetectOptions{
Src: imgB64,
Prompt: "Locate all the instances that matches the following description: person</c>car.",
})
Expect(err).ToNot(HaveOccurred(), "Detect")
Expect(detResp.GetDetections()).ToNot(BeEmpty(), "no detections returned on a known-good COCO image")
_, _ = fmt.Fprintf(GinkgoWriter, "detection OK: %d detections\n", len(detResp.GetDetections()))
for i, d := range detResp.GetDetections() {
Expect(d.GetClassName()).ToNot(BeEmpty(), "detection %d has empty class_name", i)
Expect(d.GetWidth()).To(BeNumerically(">", float32(0)),
"detection %d has non-positive width", i)
Expect(d.GetHeight()).To(BeNumerically(">", float32(0)),
"detection %d has non-positive height", i)
_, _ = fmt.Fprintf(GinkgoWriter, " [%d] %s box=(%.1f,%.1f,%.1fx%.1f)\n",
i, d.GetClassName(), d.GetX(), d.GetY(), d.GetWidth(), d.GetHeight())
}
})
})

View File

@@ -0,0 +1,59 @@
#!/bin/bash
# Script to copy the appropriate libraries based on architecture
set -e
CURDIR=$(dirname "$(realpath $0)")
REPO_ROOT="${CURDIR}/../../.."
# Create lib directory
mkdir -p $CURDIR/package/lib
cp -avf $CURDIR/liblocateanythingcpp-*.so $CURDIR/package/
cp -avf $CURDIR/locate-anything-cpp $CURDIR/package/
cp -fv $CURDIR/run.sh $CURDIR/package/
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
# ARM64 architecture
echo "Detected ARM64 architecture, copying ARM64 libraries..."
cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ $(uname -s) = "Darwin" ]; then
echo "Detected Darwin"
else
echo "Error: Could not detect architecture"
exit 1
fi
# Package GPU libraries based on BUILD_TYPE
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
package_gpu_libs
fi
echo "Packaging completed successfully"
ls -liah $CURDIR/package/
ls -liah $CURDIR/package/lib/

View File

@@ -0,0 +1,52 @@
#!/bin/bash
set -ex
# Get the absolute current dir where the script is located
CURDIR=$(dirname "$(realpath $0)")
cd /
echo "CPU info:"
if [ "$(uname)" != "Darwin" ]; then
grep -e "model\sname" /proc/cpuinfo | head -1
grep -e "flags" /proc/cpuinfo | head -1
fi
LIBRARY="$CURDIR/liblocateanythingcpp-fallback.so"
if [ "$(uname)" != "Darwin" ]; then
if grep -q -e "\savx\s" /proc/cpuinfo ; then
echo "CPU: AVX found OK"
if [ -e $CURDIR/liblocateanythingcpp-avx.so ]; then
LIBRARY="$CURDIR/liblocateanythingcpp-avx.so"
fi
fi
if grep -q -e "\savx2\s" /proc/cpuinfo ; then
echo "CPU: AVX2 found OK"
if [ -e $CURDIR/liblocateanythingcpp-avx2.so ]; then
LIBRARY="$CURDIR/liblocateanythingcpp-avx2.so"
fi
fi
# Check avx 512
if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
echo "CPU: AVX512F found OK"
if [ -e $CURDIR/liblocateanythingcpp-avx512.so ]; then
LIBRARY="$CURDIR/liblocateanythingcpp-avx512.so"
fi
fi
fi
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
export LOCATEANYTHING_LIBRARY=$LIBRARY
# If there is a lib/ld.so, use it
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"
echo "Using library: $LIBRARY"
exec $CURDIR/lib/ld.so $CURDIR/locate-anything-cpp "$@"
fi
echo "Using library: $LIBRARY"
exec $CURDIR/locate-anything-cpp "$@"

View File

@@ -0,0 +1,47 @@
#!/bin/bash
set -e
CURDIR=$(dirname "$(realpath $0)")
echo "Running locate-anything-cpp backend tests..."
# Test model from the mudler/locate-anything.cpp-gguf HuggingFace repo. This is
# the q8_0 quantization of nvidia/LocateAnything-3B (~6.3 GB), so the download
# is the slow step. It is resumed with `curl -C -` and skipped entirely if the
# file is already present.
LOCATEANYTHING_MODEL_DIR="${LOCATEANYTHING_MODEL_DIR:-$CURDIR/test-models}"
LOCATEANYTHING_MODEL_FILE="${LOCATEANYTHING_MODEL_FILE:-locate-anything-q8_0.gguf}"
LOCATEANYTHING_MODEL_URL="${LOCATEANYTHING_MODEL_URL:-https://huggingface.co/mudler/locate-anything.cpp-gguf/resolve/main/locate-anything-q8_0.gguf}"
mkdir -p "$LOCATEANYTHING_MODEL_DIR"
if [ ! -f "$LOCATEANYTHING_MODEL_DIR/$LOCATEANYTHING_MODEL_FILE" ]; then
echo "Downloading locate-anything q8_0 model (~6.3 GB, this is slow)..."
# -C - resumes a partial download so an interrupted run doesn't restart from 0.
curl -L -C - -o "$LOCATEANYTHING_MODEL_DIR/$LOCATEANYTHING_MODEL_FILE" "$LOCATEANYTHING_MODEL_URL" --progress-bar
fi
# Use a real COCO test image (people + cars) from the upstream rf-detr.cpp repo
# (~46 KB). Open-vocabulary detection needs real content to locate, so a
# synthetic image would trivially yield zero detections.
TEST_IMAGE_DIR="$CURDIR/test-data"
TEST_IMAGE_FILE="$TEST_IMAGE_DIR/test.jpg"
TEST_IMAGE_URL="${TEST_IMAGE_URL:-https://raw.githubusercontent.com/mudler/rf-detr.cpp/main/tests/fixtures/ci/test_image.jpg}"
mkdir -p "$TEST_IMAGE_DIR"
if [ ! -f "$TEST_IMAGE_FILE" ]; then
echo "Downloading COCO test image..."
curl -L -o "$TEST_IMAGE_FILE" "$TEST_IMAGE_URL" --progress-bar
fi
echo "locate-anything-cpp test setup complete."
echo " model: $LOCATEANYTHING_MODEL_DIR/$LOCATEANYTHING_MODEL_FILE"
echo " test image: $TEST_IMAGE_FILE"
# Run the Go smoke test: spawns the backend binary on a free port, calls
# LoadModel + Detect via gRPC against the downloaded GGUF + COCO image.
echo ""
echo "Running Go smoke test..."
cd "$CURDIR"
go test -v -timeout 30m ./...

View File

@@ -1,6 +1,6 @@
# parakeet-cpp backend Makefile.
#
# Upstream pin lives below as PARAKEET_VERSION?=e270af73b94c9a5c37ec516230219ed4580e1db6
# Upstream pin lives below as PARAKEET_VERSION?=b8012f11e5269126eddb7f4fd02f891a2ccc29b0
# (.github/bump_deps.sh) can find and update it - matches the
# whisper.cpp / ds4 / vibevoice-cpp convention.
#
@@ -15,7 +15,7 @@
# That's what the L0 smoke test uses. The default target below does the
# proper clone-at-pin + cmake build so CI doesn't need a side-checkout.
PARAKEET_VERSION?=e270af73b94c9a5c37ec516230219ed4580e1db6
PARAKEET_VERSION?=b8012f11e5269126eddb7f4fd02f891a2ccc29b0
PARAKEET_REPO?=https://github.com/mudler/parakeet.cpp
GOCMD?=go
@@ -39,7 +39,10 @@ endif
# is overwritten back to OFF and the build silently falls back to CPU. Forward the
# PARAKEET_GGML_* options instead. (openblas is not gated, so -DGGML_BLAS passes through.)
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS+=-DPARAKEET_GGML_CUDA=ON
# GGML_CUDA_GRAPHS is OFF by ggml default; enabling it gives a small free
# speedup (~1% measured on GB10, never negative) by capturing/replaying the
# CUDA graph. Not gated by parakeet.cpp, so it passes straight through to ggml.
CMAKE_ARGS+=-DPARAKEET_GGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON
else ifeq ($(BUILD_TYPE),openblas)
CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
else ifeq ($(BUILD_TYPE),hipblas)

View File

@@ -98,17 +98,21 @@ type transcriptJSON struct {
}
// streamFeedJSON mirrors the document returned by
// parakeet_capi_stream_feed_json / parakeet_capi_stream_finalize_json (ABI v4):
// parakeet_capi_stream_feed_json / parakeet_capi_stream_finalize_json (ABI v5):
//
// {"text":"...","eou":0,"frame_sec":0.080000,
// {"text":"...","eou":0,"eob":0,"frame_sec":0.080000,
// "words":[{"w":"...","start":0.480,"end":0.640,"conf":0.9100}, ...]}
//
// "text" is the newly-finalized text since the last call; "eou" is 1 when an
// <EOU>/<EOB> fired this feed; "words" are the words finalized this call with
// absolute (stream-relative) start/end seconds.
// <EOU> (end of utterance) fired this feed and "eob" is 1 when an <EOB>
// (backchannel) fired. ABI v4 conflated the two into "eou"; v5 split them, so
// we read both and treat either as an utterance boundary for segmentation.
// "words" are the words finalized this call with absolute (stream-relative)
// start/end seconds.
type streamFeedJSON struct {
Text string `json:"text"`
Eou int `json:"eou"`
Eob int `json:"eob"`
FrameSec float64 `json:"frame_sec"`
Words []transcriptWord `json:"words"`
}
@@ -483,7 +487,10 @@ type streamSegmenter struct {
func (s *streamSegmenter) add(doc streamFeedJSON) {
s.cur = append(s.cur, doc.Words...)
if doc.Eou != 0 {
// Close the segment on either turn signal: <EOU> (end of utterance) or
// <EOB> (backchannel). ABI v4 reported both via "eou"; v5 split them, so we
// OR them here to keep the v4 segmentation boundaries.
if doc.Eou != 0 || doc.Eob != 0 {
s.flush()
}
}
@@ -671,11 +678,12 @@ func (p *ParakeetCpp) AudioTranscriptionStream(ctx context.Context, opts *pb.Tra
return nil
}
// streamJSON drives the ABI v4 streaming JSON entry points: each feed/finalize
// returns a {text,eou,frame_sec,words} document. The newly-finalized text is
// emitted as a delta (unchanged streaming contract) while words are accumulated
// into per-utterance segments (closed on EOU) so the closing FinalResult carries
// timestamped segments. Runs under engineMu (already held by the caller).
// streamJSON drives the streaming JSON entry points (present since ABI v4): each
// feed/finalize returns a {text,eou,eob,frame_sec,words} document. The
// newly-finalized text is emitted as a delta (unchanged streaming contract)
// while words are accumulated into per-utterance segments (closed on <EOU> or
// <EOB>) so the closing FinalResult carries timestamped segments. Runs under
// engineMu (already held by the caller).
func (p *ParakeetCpp) streamJSON(ctx context.Context, stream uintptr, data []float32,
duration float32, results chan *pb.TranscriptStreamResponse) error {
var (

View File

@@ -124,4 +124,17 @@ var _ = Describe("streaming segment assembly", func() {
Expect(acc.segments()).To(HaveLen(1))
Expect(acc.segments()[0].Text).To(Equal("hi there"))
})
// ABI v5 split <EOB> (backchannel) out of the "eou" flag into its own "eob"
// field; a backchannel must still close the segment as it did in v4.
It("closes a segment on EOB (backchannel) too", func() {
acc := &streamSegmenter{}
acc.add(streamFeedJSON{Text: "uh huh", Eou: 0, Eob: 1, Words: []transcriptWord{
{W: "uh", Start: 0.0, End: 0.2}, {W: "huh", Start: 0.2, End: 0.5},
}})
segs := acc.segments()
Expect(segs).To(HaveLen(1))
Expect(segs[0].Text).To(Equal("uh huh"))
Expect(segs[0].End).To(Equal(secondsToNanos(0.5)))
})
})

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# stablediffusion.cpp (ggml)
STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
STABLEDIFFUSION_GGML_VERSION?=b3d56d0ba1bd437886079e339118e8e75bb79ee7
STABLEDIFFUSION_GGML_VERSION?=19bdfe22d255d5b4dff39d449318b9bc5ea2317f
CMAKE_ARGS+=-DGGML_MAX_NAME=128

View File

@@ -26,8 +26,16 @@ add_library(govibevoicecpp MODULE cpp/govibevoicecpp.cpp)
# vv_capi_* symbols (purego dlopens them by name, nothing in our
# translation unit references them). Force the static archive's
# entire contents into the MODULE so dlsym finds vv_capi_load etc.
#
# Link the `vibevoice` TARGET (not a bare archive path) so CMake builds
# libvibevoice.a first and tracks the dependency: the upstream project is added
# with EXCLUDE_FROM_ALL, so without a target-level link there is no rule to
# build it. Passing only $<TARGET_FILE:vibevoice> as a path on Apple left the
# build with "No rule to make target 'vibevoice/libvibevoice.a'" (issue #10267).
# force_load is then applied as a separate link option.
if(APPLE)
target_link_libraries(govibevoicecpp PRIVATE -Wl,-force_load $<TARGET_FILE:vibevoice>)
target_link_libraries(govibevoicecpp PRIVATE vibevoice)
target_link_options(govibevoicecpp PRIVATE "-Wl,-force_load,$<TARGET_FILE:vibevoice>")
elseif(MSVC)
target_link_libraries(govibevoicecpp PRIVATE vibevoice)
set_property(TARGET govibevoicecpp APPEND PROPERTY LINK_FLAGS "/WHOLEARCHIVE:vibevoice")

View File

@@ -94,26 +94,30 @@ purge:
# Build all variants (Linux only)
ifeq ($(UNAME_S),Linux)
libgovibevoicecpp-avx.so: sources/vibevoice.cpp
$(MAKE) purge
$(info ${GREEN}I vibevoice-cpp build info:avx${RESET})
SO_TARGET=libgovibevoicecpp-avx.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgovibevoicecpp-custom
rm -rf build-libgovibevoicecpp-avx.so
rm -rfv build*
libgovibevoicecpp-avx2.so: sources/vibevoice.cpp
$(MAKE) purge
$(info ${GREEN}I vibevoice-cpp build info:avx2${RESET})
SO_TARGET=libgovibevoicecpp-avx2.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgovibevoicecpp-custom
rm -rf build-libgovibevoicecpp-avx2.so
rm -rfv build*
libgovibevoicecpp-avx512.so: sources/vibevoice.cpp
$(MAKE) purge
$(info ${GREEN}I vibevoice-cpp build info:avx512${RESET})
SO_TARGET=libgovibevoicecpp-avx512.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgovibevoicecpp-custom
rm -rf build-libgovibevoicecpp-avx512.so
rm -rfv build*
endif
# Build fallback variant (all platforms)
libgovibevoicecpp-fallback.so: sources/vibevoice.cpp
$(MAKE) purge
$(info ${GREEN}I vibevoice-cpp build info:fallback${RESET})
SO_TARGET=libgovibevoicecpp-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgovibevoicecpp-custom
rm -rf build-libgovibevoicecpp-fallback.so
rm -rfv build*
libgovibevoicecpp-custom: CMakeLists.txt cpp/govibevoicecpp.cpp cpp/govibevoicecpp.h
mkdir -p build-$(SO_TARGET) && \

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# whisper.cpp version
WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
WHISPER_CPP_VERSION?=a8ec021f2750a473ff4a8f3883bc9fdf5feafa84
WHISPER_CPP_VERSION?=df7638d8229a243af8a4b5a8ae557e0d74e0a0ae
SO_TARGET?=libgowhisper.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF

View File

@@ -337,6 +337,35 @@
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-rfdetr-cpp"
intel: "intel-sycl-f32-rfdetr-cpp"
vulkan: "vulkan-rfdetr-cpp"
- &locateanything
name: "locate-anything"
alias: "locate-anything"
license: apache-2.0
description: |
Open-vocabulary object detection and visual grounding (NVIDIA
LocateAnything-3B) in C/C++ using GGML. Loads pre-built GGUF weights
and, given an image and a free-form text prompt, returns bounding
boxes, class labels, and confidence scores for the referred objects.
urls:
- https://github.com/mudler/locate-anything.cpp
- https://huggingface.co/nvidia/LocateAnything-3B
tags:
- object-detection
- visual-grounding
- open-vocabulary
- locate-anything
- gpu
- cpu
capabilities:
default: "cpu-locate-anything-cpp"
nvidia: "cuda12-locate-anything-cpp"
nvidia-cuda-12: "cuda12-locate-anything-cpp"
nvidia-cuda-13: "cuda13-locate-anything-cpp"
nvidia-l4t: "nvidia-l4t-arm64-locate-anything-cpp"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-locate-anything-cpp"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-locate-anything-cpp"
intel: "intel-sycl-f32-locate-anything-cpp"
vulkan: "vulkan-locate-anything-cpp"
- &vllm
name: "vllm"
license: apache-2.0
@@ -1225,6 +1254,7 @@
default: "cpu-sherpa-onnx"
nvidia: "cuda12-sherpa-onnx"
nvidia-cuda-12: "cuda12-sherpa-onnx"
metal: "metal-sherpa-onnx"
- !!merge <<: *neutts
name: "neutts-development"
capabilities:
@@ -1557,6 +1587,7 @@
- localai/localai-backends:master-metal-darwin-arm64-kitten-tts
- !!merge <<: *local-store
name: "local-store-development"
alias: "local-store"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-local-store"
mirrors:
- localai/localai-backends:master-cpu-local-store
@@ -1567,6 +1598,7 @@
- localai/localai-backends:latest-metal-darwin-arm64-local-store
- !!merge <<: *local-store
name: "metal-local-store-development"
alias: "local-store"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-local-store"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-local-store
@@ -4685,12 +4717,14 @@
default: "cpu-speaker-recognition"
nvidia: "cuda12-speaker-recognition"
nvidia-cuda-12: "cuda12-speaker-recognition"
metal: "metal-speaker-recognition"
- !!merge <<: *speakerrecognition
name: "speaker-recognition-development"
capabilities:
default: "cpu-speaker-recognition-development"
nvidia: "cuda12-speaker-recognition-development"
nvidia-cuda-12: "cuda12-speaker-recognition-development"
metal: "metal-speaker-recognition-development"
- !!merge <<: *speakerrecognition
name: "cpu-speaker-recognition"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-speaker-recognition"
@@ -4711,6 +4745,16 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-speaker-recognition"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-speaker-recognition
- !!merge <<: *speakerrecognition
name: "metal-speaker-recognition"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-speaker-recognition"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-speaker-recognition
- !!merge <<: *speakerrecognition
name: "metal-speaker-recognition-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-speaker-recognition"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-speaker-recognition
## sherpa-onnx
- !!merge <<: *sherpa-onnx
name: "sherpa-onnx-development"
@@ -4718,6 +4762,7 @@
default: "cpu-sherpa-onnx-development"
nvidia: "cuda12-sherpa-onnx-development"
nvidia-cuda-12: "cuda12-sherpa-onnx-development"
metal: "metal-sherpa-onnx-development"
- !!merge <<: *sherpa-onnx
name: "cpu-sherpa-onnx"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-sherpa-onnx"
@@ -4738,3 +4783,13 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sherpa-onnx"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-sherpa-onnx
- !!merge <<: *sherpa-onnx
name: "metal-sherpa-onnx"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-sherpa-onnx"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-sherpa-onnx
- !!merge <<: *sherpa-onnx
name: "metal-sherpa-onnx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-sherpa-onnx"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-sherpa-onnx

View File

@@ -407,6 +407,24 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
if not request.Prompt and request.UseTokenizerTemplate and request.Messages:
messages = messages_to_dicts(request.Messages)
# The mlx-lm tokenizer only carries a text-LM chat template. A
# vision-language checkpoint (e.g. gemma-4 E4B) loaded here has no
# usable template, so apply_chat_template silently passes the raw
# text through and the model just echoes/loops (issue #10269).
# Warn loudly so the misroute is visible; such models belong on the
# mlx-vlm backend.
chat_template = getattr(self.tokenizer, "chat_template", None)
if not chat_template:
underlying = getattr(self.tokenizer, "_tokenizer", None)
chat_template = getattr(underlying, "chat_template", None)
if not chat_template:
print(
"WARNING: this model has no chat template; output may be "
"degenerate. Vision-language models (e.g. gemma-4 E4B) must "
"use the 'mlx-vlm' backend instead of 'mlx'.",
file=sys.stderr,
)
kwargs = {"tokenize": False, "add_generation_prompt": True}
if request.Tools:
try:

View File

@@ -0,0 +1,5 @@
torch
torchaudio
speechbrain
transformers
onnxruntime

View File

@@ -26,7 +26,10 @@ from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid
from vllm.transformers_utils.tokenizer import get_tokenizer
try:
from vllm.tokenizers import get_tokenizer # vLLM >= 0.22
except ImportError:
from vllm.transformers_utils.tokenizer import get_tokenizer # vLLM < 0.22
from vllm.multimodal.utils import fetch_image
from vllm.assets.video import VideoAsset
import base64
@@ -147,9 +150,24 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
d["reasoning_content"] = msg.reasoning_content
if msg.tool_calls:
try:
d["tool_calls"] = json.loads(msg.tool_calls)
tool_calls = json.loads(msg.tool_calls)
except json.JSONDecodeError:
pass
else:
# OpenAI wire format carries function.arguments as a
# JSON-encoded string, but chat templates (e.g. Qwen3)
# iterate over it as a mapping. vLLM's own OpenAI server
# parses arguments before applying the template, so do
# the same here.
if isinstance(tool_calls, list):
for tc in tool_calls:
func = tc.get("function") if isinstance(tc, dict) else None
if isinstance(func, dict) and isinstance(func.get("arguments"), str):
try:
func["arguments"] = json.loads(func["arguments"])
except json.JSONDecodeError:
pass
d["tool_calls"] = tool_calls
result.append(d)
return result

View File

@@ -11,6 +11,29 @@ import (
"github.com/mudler/xlog"
)
// startMITMIfConfigured brings up the cloudproxy MITM listener when an
// address is configured, treating any startup failure as non-fatal.
//
// The listener is opt-in middleware whose address is persisted in runtime
// settings (/api/settings → runtime_settings.json) and replayed on every
// boot. A bad value — e.g. a host the process can't bind, like a LAN IP
// inside a container — must NOT abort the whole server: doing so crash-loops
// with no way out, because the Settings UI used to correct the address can't
// load if startup never completes. So on failure we log loudly and carry on;
// the admin fixes the address via /api/settings, which calls RestartMITM.
func startMITMIfConfigured(app *Application, options *config.ApplicationConfig) {
if options.MITMListen == "" {
return
}
if err := startMITMProxy(app, options); err != nil {
xlog.Error("mitm: cloudproxy listener failed to start — continuing without it",
"listen", options.MITMListen,
"error", err,
"hint", "fix the address via Settings (e.g. \":8082\" to bind all interfaces) and the listener will restart",
)
}
}
func startMITMProxy(app *Application, options *config.ApplicationConfig) error {
app.mitmMutex.Lock()
defer app.mitmMutex.Unlock()

View File

@@ -0,0 +1,58 @@
package application
import (
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/system"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// minimal Application wired enough for startMITMProxy: an empty model
// config loader (no host claims), CA written under a temp DataPath.
func newMITMTestApp(dataPath string) (*Application, *config.ApplicationConfig) {
state, err := system.GetSystemState()
Expect(err).NotTo(HaveOccurred())
state.Model.ModelsPath = dataPath
opts := config.NewApplicationConfig(
config.WithSystemState(state),
config.WithDataPath(dataPath),
)
return newApplication(opts), opts
}
var _ = Describe("startMITMIfConfigured", func() {
It("does nothing when no listen address is configured", func() {
app, opts := newMITMTestApp(GinkgoT().TempDir())
opts.MITMListen = ""
Expect(func() { startMITMIfConfigured(app, opts) }).NotTo(Panic())
Expect(app.mitmServer.Load()).To(BeNil(), "no listener should be stored when disabled")
})
// Regression: a persisted-but-unbindable MITM address (e.g. a LAN host
// inside a container) must not abort startup. startMITMIfConfigured
// swallows the bind error so the rest of LocalAI still comes up and the
// admin can fix the address via the Settings UI.
It("logs and continues when the listen address cannot be bound", func() {
app, opts := newMITMTestApp(GinkgoT().TempDir())
// 192.0.2.1 is TEST-NET-1 (RFC 5737): guaranteed not assigned to any
// local interface, so bind fails deterministically without DNS.
opts.MITMListen = "192.0.2.1:8082"
Expect(func() { startMITMIfConfigured(app, opts) }).NotTo(Panic())
Expect(app.mitmServer.Load()).To(BeNil(), "failed listener must not be stored")
})
It("starts and stores the listener on a bindable address", func() {
app, opts := newMITMTestApp(GinkgoT().TempDir())
opts.MITMListen = "127.0.0.1:0" // OS-assigned free port
startMITMIfConfigured(app, opts)
srv := app.mitmServer.Load()
Expect(srv).NotTo(BeNil(), "listener should be stored on success")
DeferCleanup(srv.Stop)
Expect(srv.Addr()).NotTo(BeEmpty())
})
})

View File

@@ -1,63 +1,120 @@
package application
import (
"context"
"fmt"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
)
// adapterConfig resolves a model name to its runtime ModelConfig, or
// nil when the name is unknown. Shared by the router-facing factories
// below and by ModelConfigLookup.
// adapterConfig resolves a model name to its runtime ModelConfig, or nil when
// unknown. LoadModelConfigFileByNameDefaultOptions never returns nil — for an
// unknown name it returns a defaults-filled stub with an empty Name (the YAML
// `name:` field is required by Validate), which is how we tell the two apart.
func (a *Application) adapterConfig(modelName string) *config.ModelConfig {
cfg, err := a.backendLoader.LoadModelConfigFileByNameDefaultOptions(modelName, a.applicationConfig)
if err != nil || cfg == nil {
if err != nil || cfg == nil || cfg.Name == "" {
return nil
}
return cfg
}
// ModelConfigLookup is the lookup function the router middleware's
// classifier validator uses to confirm classifier_model declares
// FLAG_SCORE before binding it.
// ModelConfigLookup is the lookup the router middleware's classifier validator
// uses to confirm classifier_model declares FLAG_SCORE before binding it.
func (a *Application) ModelConfigLookup() func(modelName string) *config.ModelConfig {
return a.adapterConfig
}
// Scorer returns a backend.Scorer bound to the named model, or nil
// when the model is unknown. Used as a method value (app.Scorer) by
// router.ClassifierDeps — no factory-of-factory wrapper needed.
// The router-facing factories below (Scorer, Embedder, Reranker, TokenCounter)
// bind a model NAME at construction and re-resolve the CONFIG on every call.
// Capturing the config at construction would bake in whatever state
// adapterConfig saw first — including a stub returned before the YAML reached
// bcl.configs (e.g. /import-model or gallery install racing startup). The
// classifier registry caches factories by router-config fingerprint, so a
// once-stale capture stays stale until the router config is edited.
func (a *Application) Scorer(modelName string) backend.Scorer {
cfg := a.adapterConfig(modelName)
if cfg == nil {
if a.adapterConfig(modelName) == nil {
return nil
}
return backend.NewScorer(a.modelLoader, *cfg, a.applicationConfig)
return &lazyScorer{app: a, modelName: modelName}
}
type lazyScorer struct {
app *Application
modelName string
}
func (l *lazyScorer) Score(ctx context.Context, prompt string, candidates []string) ([]backend.CandidateScore, error) {
cfg := l.app.adapterConfig(l.modelName)
if cfg == nil {
return nil, fmt.Errorf("scorer: model %q no longer available", l.modelName)
}
return backend.NewScorer(l.app.modelLoader, *cfg, l.app.applicationConfig).Score(ctx, prompt, candidates)
}
// TokenCounter returns a func so the middleware's literal field type accepts
// it as a method value without importing core/http/middleware from here.
func (a *Application) TokenCounter(modelName string) func(string) (int, error) {
if a.adapterConfig(modelName) == nil {
return nil
}
return func(text string) (int, error) {
cfg := a.adapterConfig(modelName)
if cfg == nil {
return 0, fmt.Errorf("token counter: model %q no longer available", modelName)
}
resp, err := backend.ModelTokenize(text, a.modelLoader, *cfg, a.applicationConfig)
if err != nil {
return 0, err
}
return len(resp.Tokens), nil
}
}
// Reranker returns a backend.Reranker bound to the named model, or
// nil when unknown. The reranker model's `type:` (e.g. "colbert")
// selects the scoring head inside the rerankers backend.
func (a *Application) Reranker(modelName string) backend.Reranker {
cfg := a.adapterConfig(modelName)
if cfg == nil {
if a.adapterConfig(modelName) == nil {
return nil
}
return backend.NewReranker(a.modelLoader, *cfg, a.applicationConfig)
return &lazyReranker{app: a, modelName: modelName}
}
type lazyReranker struct {
app *Application
modelName string
}
func (l *lazyReranker) Rerank(ctx context.Context, query string, documents []string) ([]backend.RerankResult, error) {
cfg := l.app.adapterConfig(l.modelName)
if cfg == nil {
return nil, fmt.Errorf("reranker: model %q no longer available", l.modelName)
}
return backend.NewReranker(l.app.modelLoader, *cfg, l.app.applicationConfig).Rerank(ctx, query, documents)
}
// Embedder returns a backend.Embedder bound to the named model, or
// nil when unknown. Used by the router's L2 embedding cache.
func (a *Application) Embedder(modelName string) backend.Embedder {
cfg := a.adapterConfig(modelName)
if cfg == nil {
if a.adapterConfig(modelName) == nil {
return nil
}
return backend.NewEmbedder(a.modelLoader, *cfg, a.applicationConfig)
return &lazyEmbedder{app: a, modelName: modelName}
}
// VectorStore returns a backend.VectorStore for the named collection,
// or nil when the name is empty. Each router model gets its own
// backend process via the model loader's cache keyed by storeName.
type lazyEmbedder struct {
app *Application
modelName string
}
func (l *lazyEmbedder) Embed(ctx context.Context, text string) ([]float32, error) {
cfg := l.app.adapterConfig(l.modelName)
if cfg == nil {
return nil, fmt.Errorf("embedder: model %q no longer available", l.modelName)
}
return backend.NewEmbedder(l.app.modelLoader, *cfg, l.app.applicationConfig).Embed(ctx, text)
}
// VectorStore takes a store name, not a model name — no adapterConfig, no
// staleness to avoid.
func (a *Application) VectorStore(storeName string) backend.VectorStore {
return backend.NewVectorStore(a.modelLoader, a.applicationConfig, storeName)
}

View File

@@ -0,0 +1,155 @@
package application
import (
"context"
"os"
"path/filepath"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/LocalAI/pkg/system"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// Regression: the router-facing factories used to capture
// *config.ModelConfig at construction. A gallery install that raced
// startup left a stub (Backend="") bound for the lifetime of the
// classifier registry's cache entry, bypassing the user's `backend:`
// config. These specs pin the lazy re-resolve.
var _ = Describe("router_factories lazy config resolution", func() {
var (
tmpDir string
app *Application
)
BeforeEach(func() {
var err error
tmpDir, err = os.MkdirTemp("", "router-factories-*")
Expect(err).NotTo(HaveOccurred())
appCfg := &config.ApplicationConfig{
Context: context.Background(),
SystemState: &system.SystemState{Model: system.Model{ModelsPath: tmpDir}},
}
app = &Application{
backendLoader: config.NewModelConfigLoader(tmpDir),
modelLoader: model.NewModelLoader(appCfg.SystemState),
applicationConfig: appCfg,
}
})
AfterEach(func() {
_ = os.RemoveAll(tmpDir)
})
// writeCfg seeds both the on-disk YAML and the in-memory cache —
// removing only the cache would fall through to file-read.
writeCfg := func(name, backend string) {
yaml := "name: " + name + "\nbackend: " + backend + "\nparameters:\n model: " + name + ".bin\n"
Expect(os.WriteFile(filepath.Join(tmpDir, name+".yaml"), []byte(yaml), 0644)).To(Succeed())
Expect(app.backendLoader.LoadModelConfigsFromPath(tmpDir)).To(Succeed())
cfg, ok := app.backendLoader.GetModelConfig(name)
Expect(ok).To(BeTrue(), "config must be loaded before the spec runs")
Expect(cfg.Backend).To(Equal(backend))
}
// removeCfg purges both the cache and the YAML so LoadModelConfigFileByName
// returns the empty-stub case and adapterConfig returns nil.
removeCfg := func(name string) {
app.backendLoader.RemoveModelConfig(name)
Expect(os.Remove(filepath.Join(tmpDir, name+".yaml"))).To(Succeed())
}
Context("Embedder", func() {
It("returns nil at construction for an unknown model", func() {
Expect(app.Embedder("missing")).To(BeNil())
})
It("re-resolves the model config on each Embed call", func() {
writeCfg("emb-test", "llama-cpp")
emb := app.Embedder("emb-test")
Expect(emb).NotTo(BeNil())
// The factory must hold the NAME, not a captured config —
// otherwise stale captures survive cache invalidation.
lazy, ok := emb.(*lazyEmbedder)
Expect(ok).To(BeTrue(), "Embedder must return *lazyEmbedder")
Expect(lazy.modelName).To(Equal("emb-test"))
// Mutate the cached config. A lazy implementation sees the
// update on the next adapterConfig call; a captured-at-
// construction implementation would still see "llama-cpp".
app.backendLoader.UpdateModelConfig("emb-test", func(c *config.ModelConfig) {
c.Backend = "rerankers"
})
Expect(lazy.app.adapterConfig("emb-test").Backend).To(Equal("rerankers"))
// Remove the config entirely → Embed must surface the disappearance.
removeCfg("emb-test")
_, err := emb.Embed(context.Background(), "anything")
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("no longer available"))
})
})
Context("Scorer", func() {
It("returns nil at construction for an unknown model", func() {
Expect(app.Scorer("missing")).To(BeNil())
})
It("re-resolves the model config on each Score call", func() {
writeCfg("score-test", "llama-cpp")
sc := app.Scorer("score-test")
Expect(sc).NotTo(BeNil())
lazy, ok := sc.(*lazyScorer)
Expect(ok).To(BeTrue(), "Scorer must return *lazyScorer")
Expect(lazy.modelName).To(Equal("score-test"))
removeCfg("score-test")
_, err := sc.Score(context.Background(), "prompt", []string{"a"})
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("no longer available"))
})
})
Context("Reranker", func() {
It("returns nil at construction for an unknown model", func() {
Expect(app.Reranker("missing")).To(BeNil())
})
It("re-resolves the model config on each Rerank call", func() {
writeCfg("rerank-test", "rerankers")
rr := app.Reranker("rerank-test")
Expect(rr).NotTo(BeNil())
lazy, ok := rr.(*lazyReranker)
Expect(ok).To(BeTrue(), "Reranker must return *lazyReranker")
Expect(lazy.modelName).To(Equal("rerank-test"))
removeCfg("rerank-test")
_, err := rr.Rerank(context.Background(), "q", []string{"d"})
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("no longer available"))
})
})
Context("TokenCounter", func() {
It("returns nil at construction for an unknown model", func() {
Expect(app.TokenCounter("missing")).To(BeNil())
})
It("re-resolves the model config on each call", func() {
writeCfg("tok-test", "llama-cpp")
tc := app.TokenCounter("tok-test")
Expect(tc).NotTo(BeNil())
removeCfg("tok-test")
_, err := tc("anything")
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("no longer available"))
})
})
})

View File

@@ -462,11 +462,7 @@ func New(opts ...config.AppOption) (*Application, error) {
// traffic doesn't need a parallel config for MITM traffic.
// Runs after loadRuntimeSettingsFromFile so a listener configured
// via /api/settings is brought back up across restarts.
if options.MITMListen != "" {
if err := startMITMProxy(application, options); err != nil {
return nil, fmt.Errorf("mitm: startup: %w", err)
}
}
startMITMIfConfigured(application, options)
application.ModelLoader().SetBackendLoggingEnabled(options.EnableBackendLogging)

View File

@@ -100,8 +100,13 @@ func ModelEmbedding(ctx context.Context, s string, tokens []int, loader *model.M
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
traceData := map[string]any{
"input_text": trace.TruncateString(s, 1000),
"input_tokens_count": len(tokens),
"input_text": trace.TruncateString(s, 1000),
}
// Only present for token-mode callers (pre-tokenized override);
// emitting "0" alongside input_text would read as "consumed zero
// tokens", which is wrong.
if len(tokens) > 0 {
traceData["input_tokens_count"] = len(tokens)
}
startTime := time.Now()

View File

@@ -87,11 +87,47 @@ func getSeed(c config.ModelConfig) int32 {
return seed
}
func grpcModelOpts(c config.ModelConfig, modelPath string) *pb.ModelOptions {
b := 512
if c.Batch != 0 {
b = c.Batch
// DefaultContextSize and DefaultBatchSize are the backend's fallbacks when a
// model config leaves them unset. Exported so callers that must respect the
// effective decode window — notably the router's prompt trimmer — resolve the
// same numbers grpcModelOpts does instead of guessing.
const (
DefaultContextSize = 4096
DefaultBatchSize = 512
)
// EffectiveContextSize is the context window the backend will run with: the
// configured value, or DefaultContextSize when unset.
func EffectiveContextSize(c config.ModelConfig) int {
if c.ContextSize != nil {
return *c.ContextSize
}
return DefaultContextSize
}
// EffectiveBatchSize is the single-decode batch the backend will run with.
// Score, embedding and rerank all process the whole input in one pass: score
// decodes prompt+candidate (asserts n_tokens <= n_batch), and embedding/rerank
// pool over the full sequence in one physical batch (n_ubatch). So the batch
// is sized to the context — anything that fits the context fits one pass,
// avoiding both the GGML_ASSERT crash and the "input is too large to process"
// error. Explicit `batch:` always wins.
func EffectiveBatchSize(c config.ModelConfig) int {
if c.Batch != 0 {
return c.Batch
}
singlePass := c.HasUsecases(config.FLAG_SCORE) ||
c.HasUsecases(config.FLAG_EMBEDDINGS) ||
c.HasUsecases(config.FLAG_RERANK)
if ctx := EffectiveContextSize(c); singlePass && ctx > DefaultBatchSize {
return ctx
}
return DefaultBatchSize
}
func grpcModelOpts(c config.ModelConfig, modelPath string) *pb.ModelOptions {
ctxSize := EffectiveContextSize(c)
b := EffectiveBatchSize(c)
flashAttention := "auto"
@@ -134,11 +170,6 @@ func grpcModelOpts(c config.ModelConfig, modelPath string) *pb.ModelOptions {
}
}
ctxSize := 4096
if c.ContextSize != nil {
ctxSize = *c.ContextSize
}
mmlock := false
if c.MMlock != nil {
mmlock = *c.MMlock

View File

@@ -97,3 +97,67 @@ var _ = Describe("gRPCPredictOpts reasoning_effort metadata", func() {
Expect(opts.Metadata).ToNot(HaveKey("reasoning_effort"))
})
})
var _ = Describe("grpcModelOpts NBatch", func() {
scoreUsecase := config.FLAG_SCORE
threads := 1
ctx := 4096
It("defaults to 512 for an ordinary model", func() {
cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}}
opts := grpcModelOpts(cfg, "/tmp/models")
Expect(opts.NBatch).To(BeEquivalentTo(512))
})
It("sizes the batch to the context window for score models", func() {
// Score models decode the whole prompt+candidate in one
// llama_decode; n_batch must cover it or the backend aborts.
cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}, KnownUsecases: &scoreUsecase}
opts := grpcModelOpts(cfg, "/tmp/models")
Expect(opts.NBatch).To(BeEquivalentTo(4096))
})
It("keeps an explicit batch over the score default", func() {
cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}, KnownUsecases: &scoreUsecase}
cfg.Batch = 1024
opts := grpcModelOpts(cfg, "/tmp/models")
Expect(opts.NBatch).To(BeEquivalentTo(1024))
})
It("sizes the batch to the context window for embedding models", func() {
// Embedding/rerank pool over the whole sequence in one physical batch
// (n_ubatch); without this the input is capped at the 512 default and
// the backend returns "input is too large to process".
embeddings := true
cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}}
cfg.Embeddings = &embeddings
opts := grpcModelOpts(cfg, "/tmp/models")
Expect(opts.NBatch).To(BeEquivalentTo(4096))
})
It("sizes the batch to the context window for rerank models", func() {
reranking := true
cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}}
cfg.Reranking = &reranking
opts := grpcModelOpts(cfg, "/tmp/models")
Expect(opts.NBatch).To(BeEquivalentTo(4096))
})
It("does not raise the batch when a score model's context is below the default", func() {
small := 256
cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &small}, KnownUsecases: &scoreUsecase}
opts := grpcModelOpts(cfg, "/tmp/models")
Expect(opts.NBatch).To(BeEquivalentTo(512))
})
It("sizes the batch to the effective 4096 default for a score model with no explicit context_size", func() {
// The crash case: the backend defaults n_ctx to 4096, so n_batch must
// follow even when context_size is unset — otherwise n_batch stays 512
// against a 4096 window and the score decode hits the GGML_ASSERT.
cfg := config.ModelConfig{Threads: &threads, KnownUsecases: &scoreUsecase}
Expect(cfg.ContextSize).To(BeNil())
opts := grpcModelOpts(cfg, "/tmp/models")
Expect(opts.NBatch).To(BeEquivalentTo(4096))
Expect(opts.ContextSize).To(BeEquivalentTo(4096), "n_batch must match the effective n_ctx the backend receives")
})
})

View File

@@ -3,9 +3,10 @@ package backend
import (
"context"
"fmt"
"strings"
"time"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/trace"
"github.com/mudler/LocalAI/pkg/grpc"
"github.com/mudler/LocalAI/pkg/model"
@@ -39,34 +40,85 @@ func (s *localVectorStore) backend(_ context.Context) (grpc.Backend, error) {
return StoreBackend(s.loader, s.appConfig, s.storeName, "")
}
func (s *localVectorStore) Search(ctx context.Context, vec []float32) (float64, []byte, bool, error) {
be, err := s.backend(ctx)
if err != nil {
return 0, nil, false, fmt.Errorf("vector store load: %w", err)
func (s *localVectorStore) Search(ctx context.Context, vec []float32) (sim float64, payload []byte, ok bool, err error) {
start := time.Now()
outcome := "hit"
defer func() {
s.recordTrace(start, "search", len(vec), sim, outcome, err)
}()
be, berr := s.backend(ctx)
if berr != nil {
outcome = "backend_load_error"
return 0, nil, false, fmt.Errorf("vector store load: %w", berr)
}
_, values, similarities, err := store.Find(ctx, be, vec, 1)
if err != nil {
// local-store's Find returns "existing length is -1" before
// any keys are inserted. Surface that as a clean miss so the
// cache layer treats it as an empty store and proceeds to
// Insert rather than skipping.
if strings.Contains(err.Error(), "existing length is -1") {
return 0, nil, false, nil
}
return 0, nil, false, fmt.Errorf("vector store find: %w", err)
_, values, similarities, ferr := store.Find(ctx, be, vec, 1)
if ferr != nil {
outcome = "find_error"
return 0, nil, false, fmt.Errorf("vector store find: %w", ferr)
}
if len(values) == 0 || len(similarities) == 0 {
outcome = "miss"
return 0, nil, false, nil
}
return float64(similarities[0]), values[0], true, nil
}
func (s *localVectorStore) Insert(ctx context.Context, vec []float32, payload []byte) error {
be, err := s.backend(ctx)
if err != nil {
return fmt.Errorf("vector store load: %w", err)
func (s *localVectorStore) Insert(ctx context.Context, vec []float32, payload []byte) (err error) {
start := time.Now()
outcome := "ok"
defer func() {
s.recordTrace(start, "insert", len(vec), 0, outcome, err)
}()
be, berr := s.backend(ctx)
if berr != nil {
outcome = "backend_load_error"
return fmt.Errorf("vector store load: %w", berr)
}
return store.SetSingle(ctx, be, vec, payload)
if serr := store.SetSingle(ctx, be, vec, payload); serr != nil {
outcome = "insert_error"
return serr
}
return nil
}
// recordTrace surfaces vector-store calls in /api/backend-traces, including
// the backend-load-failure path that otherwise vanishes into an xlog.Warn.
// modelName uses the store namespace (e.g. "router-cache-smart-router") so
// admins can tell which router's cache misbehaved; the backend is always
// "local-store" and can't disambiguate.
func (s *localVectorStore) recordTrace(start time.Time, op string, vecDim int, sim float64, outcome string, err error) {
if s.appConfig == nil || !s.appConfig.EnableTracing {
return
}
trace.InitBackendTracingIfEnabled(s.appConfig.TracingMaxItems, s.appConfig.TracingMaxBodyBytes)
errStr := ""
if err != nil {
errStr = err.Error()
}
summary := op + " " + outcome
if op == "search" && outcome == "hit" {
summary = fmt.Sprintf("search hit (sim=%.3f)", sim)
}
data := map[string]any{
"op": op,
"outcome": outcome,
"vector_dim": vecDim,
}
// Only include similarity for a real neighbor — miss/empty_store would
// otherwise render "similarity: 0" and read as a measured value.
if op == "search" && outcome == "hit" {
data["similarity"] = sim
}
trace.RecordBackendTrace(trace.BackendTrace{
Timestamp: start,
Duration: time.Since(start),
Type: trace.BackendTraceVectorStore,
ModelName: s.storeName,
Backend: model.LocalStoreBackend,
Summary: summary,
Error: errStr,
Data: data,
})
}
func StoreBackend(sl *model.ModelLoader, appConfig *config.ApplicationConfig, storeName string, backend string) (grpc.Backend, error) {

View File

@@ -0,0 +1,88 @@
package backend
import (
"context"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/trace"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/LocalAI/pkg/system"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// findVectorStoreTrace returns the most recent vector_store trace whose
// model_name matches storeName, or nil if none was recorded. Used by
// the specs below to assert the trace landed without relying on
// ring-buffer ordering across other tests in the suite.
func findVectorStoreTrace(storeName string) *trace.BackendTrace {
traces := trace.GetBackendTraces()
for i := range traces {
bt := &traces[i]
if bt.Type == trace.BackendTraceVectorStore && bt.ModelName == storeName {
return bt
}
}
return nil
}
var _ = Describe("localVectorStore tracing", func() {
// Pin the trace surface admins read from /api/backend-traces.
// The original failure mode that motivated these specs — the
// local-store backend not installed — was silent on every surface
// except a per-call xlog.Warn. With tracing wired in, the row
// appears next to the embedder/score traces for the same request.
BeforeEach(func() {
trace.ClearBackendTraces()
})
It("records a vector_store trace with outcome=backend_load_error when the backend can't be loaded", func() {
// nil ModelLoader → s.backend → StoreBackend → panics on load.
// Use a real-but-empty loader so the failure surfaces as an
// error instead, exercising the load-failure trace path the
// admin would hit when local-store isn't installed.
appCfg := &config.ApplicationConfig{
EnableTracing: true,
TracingMaxItems: 16,
TracingMaxBodyBytes: 1024,
}
s := &localVectorStore{
loader: model.NewModelLoader(&system.SystemState{}),
appConfig: appCfg,
storeName: "router-cache-test",
}
// Search must surface the error AND record a trace describing it.
_, _, _, err := s.Search(context.Background(), []float32{0.1, 0.2, 0.3})
Expect(err).To(HaveOccurred())
Eventually(func() *trace.BackendTrace {
return findVectorStoreTrace("router-cache-test")
}).ShouldNot(BeNil())
bt := findVectorStoreTrace("router-cache-test")
Expect(bt.Backend).To(Equal(model.LocalStoreBackend))
Expect(bt.Data["op"]).To(Equal("search"))
Expect(bt.Data["outcome"]).To(Equal("backend_load_error"))
Expect(bt.Data["vector_dim"]).To(Equal(3))
// Error is the wrapped "vector store load: …" surfaced to the caller.
Expect(bt.Error).To(ContainSubstring("vector store load"))
})
It("does not record a trace when tracing is disabled", func() {
// Opt-out path: appConfig.EnableTracing=false must short-circuit
// before InitBackendTracingIfEnabled, so a workload with tracing
// turned off doesn't pay the channel-send cost per cache call.
appCfg := &config.ApplicationConfig{EnableTracing: false}
s := &localVectorStore{
loader: model.NewModelLoader(&system.SystemState{}),
appConfig: appCfg,
storeName: "router-cache-disabled",
}
_, _, _, _ = s.Search(context.Background(), []float32{1})
Consistently(func() *trace.BackendTrace {
return findVectorStoreTrace("router-cache-disabled")
}).Should(BeNil())
})
})

View File

@@ -7,9 +7,23 @@ import (
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/core/trace"
"github.com/mudler/LocalAI/pkg/grpc"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/model"
)
// tokenizeTokenCount returns the number of tokens in a backend response,
// treating a nil response as zero. The gRPC client returns (nil, err) on
// failure, and the tracing block below runs before that error is returned —
// so the count must be read nil-safely here. Reading resp.Tokens on a nil
// resp previously panicked the whole HTTP handler when tracing was enabled
// (e.g. a transient tokenize failure during router probe-budget sizing).
func tokenizeTokenCount(resp *pb.TokenizationResponse) int {
if resp == nil {
return 0
}
return len(resp.Tokens)
}
func ModelTokenize(s string, loader *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) (schema.TokenizeResponse, error) {
var inferenceModel grpc.Backend
@@ -40,10 +54,7 @@ func ModelTokenize(s string, loader *model.ModelLoader, modelConfig config.Model
errStr = err.Error()
}
tokenCount := 0
if resp.Tokens != nil {
tokenCount = len(resp.Tokens)
}
tokenCount := tokenizeTokenCount(resp)
trace.RecordBackendTrace(trace.BackendTrace{
Timestamp: startTime,
@@ -64,8 +75,8 @@ func ModelTokenize(s string, loader *model.ModelLoader, modelConfig config.Model
return schema.TokenizeResponse{}, err
}
if resp.Tokens == nil {
resp.Tokens = make([]int32, 0)
if resp == nil || resp.Tokens == nil {
return schema.TokenizeResponse{Tokens: make([]int32, 0)}, nil
}
return schema.TokenizeResponse{

View File

@@ -0,0 +1,27 @@
package backend
import (
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("tokenizeTokenCount", func() {
// Regression: the gRPC client returns (nil, err) when a tokenize call
// fails, and ModelTokenize's tracing block reads the token count before
// the error is returned. Dereferencing a nil response there panicked the
// HTTP handler (nil pointer dereference) — e.g. a transient tokenize
// failure while the router sized its probe-token budget.
It("returns zero for a nil response instead of panicking", func() {
Expect(tokenizeTokenCount(nil)).To(Equal(0))
})
It("returns zero when the response carries no tokens", func() {
Expect(tokenizeTokenCount(&pb.TokenizationResponse{})).To(Equal(0))
})
It("counts the tokens present on the response", func() {
Expect(tokenizeTokenCount(&pb.TokenizationResponse{Tokens: []int32{1, 2, 3}})).To(Equal(3))
})
})

30
core/cli/chat/chat.go Normal file
View File

@@ -0,0 +1,30 @@
package chat
import (
"context"
"io"
"strings"
)
type Options struct {
Model string
BaseURL string
APIKey string
In io.Reader
Out io.Writer
}
func Run(ctx context.Context, opts Options) error {
if opts.In == nil {
opts.In = strings.NewReader("")
}
if opts.Out == nil {
opts.Out = io.Discard
}
session, err := newChatSession(ctx, newLocalAIChatClient(opts.BaseURL, opts.APIKey), opts.Model)
if err != nil {
return err
}
return runTerminalChat(ctx, session, opts.In, opts.Out)
}

View File

@@ -0,0 +1,13 @@
package chat
import (
"testing"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func TestChat(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "Chat Suite")
}

172
core/cli/chat/chat_test.go Normal file
View File

@@ -0,0 +1,172 @@
package chat
import (
"bytes"
"encoding/json"
"fmt"
"io"
"net/http"
"net/http/httptest"
"strings"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("Run chat", func() {
It("streams a single chat response", func() {
var capturedModel string
var capturedAuth string
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if r.URL.Path == "/v1/models" {
w.Header().Set("Content-Type", "application/json")
writeResponse(w, `{"object":"list","data":[{"id":"test-model","object":"model"}]}`)
return
}
Expect(r.URL.Path).To(Equal("/v1/chat/completions"))
capturedAuth = r.Header.Get("Authorization")
var body struct {
Model string `json:"model"`
Messages []struct {
Role string `json:"role"`
Content string `json:"content"`
} `json:"messages"`
}
Expect(json.NewDecoder(r.Body).Decode(&body)).To(Succeed())
capturedModel = body.Model
Expect(body.Messages).To(HaveLen(1))
Expect(body.Messages[0].Role).To(Equal("user"))
Expect(body.Messages[0].Content).To(Equal("hello"))
w.Header().Set("Content-Type", "text/event-stream")
writeResponse(w, "data: {\"choices\":[{\"index\":0,\"delta\":{\"content\":\"hi\"}}]}\n\n")
writeResponse(w, "data: {\"choices\":[{\"index\":0,\"delta\":{\"content\":\"!\"}}]}\n\n")
writeResponse(w, "data: [DONE]\n\n")
}))
defer server.Close()
var out bytes.Buffer
err := Run(GinkgoT().Context(), Options{
Model: "test-model",
BaseURL: server.URL + "/v1",
APIKey: "secret",
In: strings.NewReader("hello\n/exit\n"),
Out: &out,
})
Expect(err).ToNot(HaveOccurred())
Expect(capturedModel).To(Equal("test-model"))
Expect(capturedAuth).To(Equal("Bearer secret"))
Expect(out.String()).To(ContainSubstring("assistant: hi!"))
Expect(out.String()).To(ContainSubstring("bye"))
})
It("auto-selects the only available model", func() {
server := chatTestServer([]string{"solo"}, nil)
defer server.Close()
var out bytes.Buffer
err := Run(GinkgoT().Context(), Options{
BaseURL: server.URL + "/v1",
In: strings.NewReader("/exit\n"),
Out: &out,
})
Expect(err).ToNot(HaveOccurred())
Expect(out.String()).To(ContainSubstring("LocalAI chat (solo)"))
})
It("returns an actionable error when no models are installed", func() {
server := chatTestServer(nil, nil)
defer server.Close()
err := Run(GinkgoT().Context(), Options{
BaseURL: server.URL + "/v1",
In: strings.NewReader(""),
})
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("no chat models are installed"))
Expect(err.Error()).To(ContainSubstring("local-ai models install <model>"))
})
It("returns an actionable error when multiple models are available without a selection", func() {
server := chatTestServer([]string{"alpha", "beta"}, nil)
defer server.Close()
err := Run(GinkgoT().Context(), Options{
BaseURL: server.URL + "/v1",
In: strings.NewReader(""),
})
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("multiple models are available"))
Expect(err.Error()).To(ContainSubstring("--model"))
Expect(err.Error()).To(ContainSubstring("alpha"))
Expect(err.Error()).To(ContainSubstring("beta"))
})
It("lists and switches models inside the chat", func() {
requestedModels := []string{}
server := chatTestServer([]string{"alpha", "beta"}, func(model string) {
requestedModels = append(requestedModels, model)
})
defer server.Close()
var out bytes.Buffer
err := Run(GinkgoT().Context(), Options{
Model: "alpha",
BaseURL: server.URL + "/v1",
In: strings.NewReader("/models\n/model beta\nhello\n/exit\n"),
Out: &out,
})
Expect(err).ToNot(HaveOccurred())
Expect(out.String()).To(ContainSubstring("* alpha"))
Expect(out.String()).To(ContainSubstring(" beta"))
Expect(out.String()).To(ContainSubstring("switched to beta; conversation cleared"))
Expect(requestedModels).To(Equal([]string{"beta"}))
})
})
func chatTestServer(models []string, onChat func(model string)) *httptest.Server {
return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
switch r.URL.Path {
case "/v1/models":
w.Header().Set("Content-Type", "application/json")
writeResponse(w, `{"object":"list","data":[`)
for i, model := range models {
if i > 0 {
writeResponse(w, ",")
}
writeResponsef(w, `{"id":%q,"object":"model"}`, model)
}
writeResponse(w, `]}`)
case "/v1/chat/completions":
var body struct {
Model string `json:"model"`
}
Expect(json.NewDecoder(r.Body).Decode(&body)).To(Succeed())
if onChat != nil {
onChat(body.Model)
}
w.Header().Set("Content-Type", "text/event-stream")
writeResponse(w, "data: {\"choices\":[{\"index\":0,\"delta\":{\"content\":\"ok\"}}]}\n\n")
writeResponse(w, "data: [DONE]\n\n")
default:
w.WriteHeader(http.StatusNotFound)
}
}))
}
func writeResponse(w io.Writer, text string) {
_, err := fmt.Fprint(w, text)
Expect(err).ToNot(HaveOccurred())
}
func writeResponsef(w io.Writer, format string, args ...any) {
_, err := fmt.Fprintf(w, format, args...)
Expect(err).ToNot(HaveOccurred())
}

114
core/cli/chat/client.go Normal file
View File

@@ -0,0 +1,114 @@
package chat
import (
"context"
"errors"
"fmt"
"io"
"sort"
"strings"
openai "github.com/sashabaranov/go-openai"
)
type chatClient interface {
ListModels(ctx context.Context) ([]string, error)
StreamChat(ctx context.Context, model string, messages []chatMessage, out io.Writer) (string, error)
}
type localAIChatClient struct {
client *openai.Client
}
func newLocalAIChatClient(baseURL string, apiKey string) *localAIChatClient {
cfg := openai.DefaultConfig(apiKey)
cfg.BaseURL = baseURL
return &localAIChatClient{client: openai.NewClientWithConfig(cfg)}
}
func (c *localAIChatClient) ListModels(ctx context.Context) ([]string, error) {
resp, err := c.client.ListModels(ctx)
if err != nil {
return nil, err
}
models := make([]string, 0, len(resp.Models))
for _, model := range resp.Models {
if model.ID != "" {
models = append(models, model.ID)
}
}
sort.Strings(models)
return models, nil
}
func (c *localAIChatClient) StreamChat(ctx context.Context, model string, messages []chatMessage, out io.Writer) (string, error) {
stream, err := c.client.CreateChatCompletionStream(ctx, openai.ChatCompletionRequest{
Model: model,
Messages: openAIChatMessages(messages),
})
if err != nil {
return "", friendlyChatError(err, model)
}
defer func() {
_ = stream.Close()
}()
var answer strings.Builder
for {
resp, err := stream.Recv()
if errors.Is(err, io.EOF) {
break
}
if err != nil {
return answer.String(), friendlyChatError(err, model)
}
if len(resp.Choices) == 0 {
continue
}
token := resp.Choices[0].Delta.Content
if token == "" {
continue
}
answer.WriteString(token)
if _, err := fmt.Fprint(out, token); err != nil {
return answer.String(), err
}
}
return answer.String(), nil
}
func openAIChatMessages(messages []chatMessage) []openai.ChatCompletionMessage {
converted := make([]openai.ChatCompletionMessage, len(messages))
for i, message := range messages {
converted[i] = openai.ChatCompletionMessage{
Role: message.Role,
Content: message.Content,
}
}
return converted
}
func friendlyChatError(err error, model string) error {
var apiErr *openai.APIError
if errors.As(err, &apiErr) {
switch apiErr.HTTPStatusCode {
case 404:
return fmt.Errorf("model %q is not available. Run `local-ai models list`, install a model with `local-ai models install <model>`, or switch with `/model <name>`", model)
case 403:
return fmt.Errorf("model %q is disabled. Enable it from LocalAI settings or choose another model with `/model <name>`", model)
}
if apiErr.Message != "" {
return errors.New(apiErr.Message)
}
}
msg := err.Error()
if strings.Contains(msg, "model") && strings.Contains(msg, "not found") {
return fmt.Errorf("model %q is not available. Run `local-ai models list`, install a model with `local-ai models install <model>`, or switch with `/model <name>`", model)
}
return err
}

17
core/cli/chat/models.go Normal file
View File

@@ -0,0 +1,17 @@
package chat
import "strings"
func formatChatModelList(models []string, current string) string {
var b strings.Builder
for _, model := range models {
prefix := " "
if model == current {
prefix = "* "
}
b.WriteString(prefix)
b.WriteString(model)
b.WriteByte('\n')
}
return b.String()
}

120
core/cli/chat/session.go Normal file
View File

@@ -0,0 +1,120 @@
package chat
import (
"context"
"errors"
"fmt"
"io"
"strings"
)
const (
chatRoleUser = "user"
chatRoleAssistant = "assistant"
)
type chatMessage struct {
Role string
Content string
}
type chatSession struct {
client chatClient
model string
models []string
messages []chatMessage
}
func newChatSession(ctx context.Context, client chatClient, requestedModel string) (*chatSession, error) {
models, err := client.ListModels(ctx)
if err != nil {
return nil, fmt.Errorf("list models: %w", err)
}
model, err := resolveChatModel(requestedModel, models)
if err != nil {
return nil, err
}
return &chatSession{
client: client,
model: model,
models: models,
}, nil
}
func (s *chatSession) CurrentModel() string {
return s.model
}
func (s *chatSession) Models() []string {
models := make([]string, len(s.models))
copy(models, s.models)
return models
}
func (s *chatSession) Clear() {
s.messages = nil
}
func (s *chatSession) SwitchModel(model string) error {
if !modelExists(s.models, model) {
return fmt.Errorf("model %q is not available. Use /models to see installed models", model)
}
s.model = model
s.Clear()
return nil
}
func (s *chatSession) Send(ctx context.Context, prompt string, out io.Writer) error {
s.messages = append(s.messages, chatMessage{
Role: chatRoleUser,
Content: prompt,
})
answer, err := s.client.StreamChat(ctx, s.model, s.messages, out)
if err != nil {
return err
}
s.messages = append(s.messages, chatMessage{
Role: chatRoleAssistant,
Content: answer,
})
return nil
}
func resolveChatModel(requested string, models []string) (string, error) {
switch {
case requested == "" && len(models) == 0:
return "", errors.New(`no chat models are installed.
Install a model first, for example:
local-ai models list
local-ai models install <model>
local-ai run
Then start a chat session:
local-ai chat --model <model>`)
case requested == "" && len(models) == 1:
return models[0], nil
case requested == "" && len(models) > 1:
var b strings.Builder
b.WriteString("multiple models are available; choose one with --model:\n")
b.WriteString(formatChatModelList(models, ""))
return "", errors.New(b.String())
case !modelExists(models, requested):
return "", fmt.Errorf("model %q is not available. Use `local-ai models list` and `local-ai models install <model>`, or pass an installed model with --model", requested)
default:
return requested, nil
}
}
func modelExists(models []string, name string) bool {
for _, model := range models {
if model == name {
return true
}
}
return false
}

View File

@@ -0,0 +1,56 @@
package chat
import (
"context"
"io"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("Chat session", func() {
It("keeps model switching and message history out of the terminal adapter", func() {
client := &fakeChatClient{
models: []string{"alpha", "beta"},
answer: "pong",
}
session, err := newChatSession(context.Background(), client, "alpha")
Expect(err).ToNot(HaveOccurred())
Expect(session.CurrentModel()).To(Equal("alpha"))
Expect(session.SwitchModel("beta")).To(Succeed())
Expect(session.CurrentModel()).To(Equal("beta"))
Expect(session.Send(context.Background(), "ping", io.Discard)).To(Succeed())
Expect(client.requests).To(HaveLen(1))
Expect(client.requests[0].model).To(Equal("beta"))
Expect(client.requests[0].messages).To(HaveLen(1))
Expect(client.requests[0].messages[0].Content).To(Equal("ping"))
})
})
type fakeChatClient struct {
models []string
answer string
requests []fakeChatRequest
}
type fakeChatRequest struct {
model string
messages []chatMessage
}
func (c *fakeChatClient) ListModels(context.Context) ([]string, error) {
return c.models, nil
}
func (c *fakeChatClient) StreamChat(_ context.Context, model string, messages []chatMessage, out io.Writer) (string, error) {
copied := make([]chatMessage, len(messages))
copy(copied, messages)
c.requests = append(c.requests, fakeChatRequest{model: model, messages: copied})
if _, err := io.WriteString(out, c.answer); err != nil {
return "", err
}
return c.answer, nil
}

93
core/cli/chat/terminal.go Normal file
View File

@@ -0,0 +1,93 @@
package chat
import (
"bufio"
"context"
"fmt"
"io"
"strings"
)
func runTerminalChat(ctx context.Context, session *chatSession, in io.Reader, out io.Writer) error {
scanner := bufio.NewScanner(in)
scanner.Buffer(make([]byte, 0, 64*1024), 4*1024*1024)
if err := writeChat(out, "LocalAI chat (%s)\n", session.CurrentModel()); err != nil {
return err
}
if err := writeChat(out, "Type /exit to quit, /clear to reset the conversation, /models to list models.\n"); err != nil {
return err
}
for {
if err := writeChat(out, "\n> "); err != nil {
return err
}
if !scanner.Scan() {
break
}
prompt := strings.TrimSpace(scanner.Text())
switch prompt {
case "":
continue
case "/bye", "/exit", "/quit":
return writeChat(out, "bye\n")
case "/clear":
session.Clear()
if err := writeChat(out, "conversation cleared\n"); err != nil {
return err
}
continue
case "/models":
if err := printChatModels(out, session.Models(), session.CurrentModel()); err != nil {
return err
}
continue
}
if nextModel, ok := strings.CutPrefix(prompt, "/model "); ok {
nextModel = strings.TrimSpace(nextModel)
if nextModel == "" {
if err := writeChat(out, "usage: /model <name>\n"); err != nil {
return err
}
continue
}
if err := session.SwitchModel(nextModel); err != nil {
if writeErr := writeChat(out, "%s\n", err); writeErr != nil {
return writeErr
}
continue
}
if err := writeChat(out, "switched to %s; conversation cleared\n", session.CurrentModel()); err != nil {
return err
}
continue
}
if err := writeChat(out, "assistant: "); err != nil {
return err
}
if err := session.Send(ctx, prompt, out); err != nil {
return err
}
if err := writeChat(out, "\n"); err != nil {
return err
}
}
return scanner.Err()
}
func printChatModels(out io.Writer, models []string, current string) error {
if len(models) == 0 {
return writeChat(out, "no models installed\n")
}
return writeChat(out, "%s", formatChatModelList(models, current))
}
func writeChat(out io.Writer, format string, args ...any) error {
_, err := fmt.Fprintf(out, format, args...)
return err
}

25
core/cli/chat_cmd.go Normal file
View File

@@ -0,0 +1,25 @@
package cli
import (
"context"
"os"
chatcli "github.com/mudler/LocalAI/core/cli/chat"
cliContext "github.com/mudler/LocalAI/core/cli/context"
)
type ChatCMD struct {
Model string `short:"m" help:"Model name to use. Defaults to the only model returned by the server when exactly one is available"`
Endpoint string `env:"LOCALAI_CHAT_ENDPOINT" default:"http://127.0.0.1:8080" help:"LocalAI server endpoint. The /v1 path is added automatically when omitted"`
APIKey string `env:"LOCALAI_API_KEY,API_KEY" help:"API key to use when the LocalAI server requires authentication"`
}
func (c *ChatCMD) Run(ctx *cliContext.Context) error {
return chatcli.Run(context.Background(), chatcli.Options{
Model: c.Model,
BaseURL: chatAPIBaseURL(c.Endpoint),
APIKey: c.APIKey,
In: os.Stdin,
Out: os.Stdout,
})
}

27
core/cli/chat_cmd_test.go Normal file
View File

@@ -0,0 +1,27 @@
package cli
import (
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("Chat command wiring", func() {
Describe("chatAPIBaseURL", func() {
It("adds /v1 to a root endpoint", func() {
Expect(chatAPIBaseURL("http://127.0.0.1:8080")).To(Equal("http://127.0.0.1:8080/v1"))
})
It("keeps endpoints that already include /v1", func() {
Expect(chatAPIBaseURL("http://127.0.0.1:8080/v1")).To(Equal("http://127.0.0.1:8080/v1"))
Expect(chatAPIBaseURL("http://127.0.0.1:8080/v1/")).To(Equal("http://127.0.0.1:8080/v1"))
})
It("adds a default http scheme", func() {
Expect(chatAPIBaseURL("127.0.0.1:8080")).To(Equal("http://127.0.0.1:8080/v1"))
})
It("preserves non-root paths before /v1", func() {
Expect(chatAPIBaseURL("http://127.0.0.1:8080/localai")).To(Equal("http://127.0.0.1:8080/localai/v1"))
})
})
})

29
core/cli/chat_endpoint.go Normal file
View File

@@ -0,0 +1,29 @@
package cli
import (
"net/url"
"strings"
)
func chatAPIBaseURL(endpoint string) string {
if !strings.Contains(endpoint, "://") {
endpoint = "http://" + endpoint
}
u, err := url.Parse(endpoint)
if err != nil {
return strings.TrimRight(endpoint, "/") + "/v1"
}
path := strings.TrimRight(u.Path, "/")
if path == "" {
u.Path = "/v1"
} else if path != "/v1" && !strings.HasSuffix(path, "/v1") {
u.Path = path + "/v1"
} else {
u.Path = path
}
u.RawQuery = ""
u.Fragment = ""
return u.String()
}

View File

@@ -9,6 +9,7 @@ var CLI struct {
cliContext.Context `embed:""`
Run RunCMD `cmd:"" help:"Run LocalAI, this the default command if no other command is specified. Run 'local-ai run --help' for more information" default:"withargs"`
Chat ChatCMD `cmd:"" help:"Open an interactive chat session against a running LocalAI server"`
Federated FederatedCLI `cmd:"" help:"Run LocalAI in federated mode"`
Models ModelsCMD `cmd:"" help:"Manage LocalAI models and definitions"`
Backends BackendsCMD `cmd:"" help:"Manage LocalAI backends and definitions"`

View File

@@ -30,6 +30,8 @@ type RunCMD struct {
ModelArgs []string `arg:"" optional:"" name:"models" help:"Model configuration URLs to load"`
ExternalBackends []string `env:"LOCALAI_EXTERNAL_BACKENDS,EXTERNAL_BACKENDS" help:"A list of external backends to load from gallery on boot" group:"backends"`
WebRTCNAT1To1IPs []string `env:"LOCALAI_WEBRTC_NAT_1TO1_IPS,WEBRTC_NAT_1TO1_IPS" help:"IPs advertised as the host ICE candidates for /v1/realtime WebRTC instead of every local interface. Set to the reachable host/LAN IP when running under Docker host networking or NAT, where pion otherwise offers unreachable bridge addresses and the connection drops after ICE consent checks fail." group:"api"`
WebRTCICEInterfaces []string `env:"LOCALAI_WEBRTC_ICE_INTERFACES,WEBRTC_ICE_INTERFACES" help:"Restrict /v1/realtime WebRTC ICE candidate gathering to these network interfaces (e.g. eth0), filtering out docker0/veth noise." group:"api"`
BackendsPath string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"backends"`
BackendsSystemPath string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
ModelsPath string `env:"LOCALAI_MODELS_PATH,MODELS_PATH" type:"path" default:"${basepath}/models" help:"Path containing models used for inferencing" group:"storage"`
@@ -225,6 +227,8 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
config.WithApiKeys(r.APIKeys),
config.WithModelsURL(append(r.Models, r.ModelArgs...)...),
config.WithExternalBackends(r.ExternalBackends...),
config.WithWebRTCNAT1To1IPs(r.WebRTCNAT1To1IPs...),
config.WithWebRTCICEInterfaces(r.WebRTCICEInterfaces...),
config.WithOpaqueErrors(r.OpaqueErrors),
config.WithEnforcedPredownloadScans(!r.DisablePredownloadScan),
config.WithSubtleKeyComparison(r.UseSubtleKeyComparison),
@@ -652,12 +656,12 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
// waitForServerReady polls the given address until the HTTP server is
// accepting connections or the context is cancelled.
func waitForServerReady(address string, ctx context.Context) {
// Ensure the address has a host component for dialing.
// Echo accepts ":8080" but net.Dial needs a resolvable host.
host, port, err := net.SplitHostPort(address)
if err == nil && host == "" {
address = "127.0.0.1:" + port
}
ticker := time.NewTicker(250 * time.Millisecond)
defer ticker.Stop()
for {
select {
@@ -665,11 +669,17 @@ func waitForServerReady(address string, ctx context.Context) {
return
default:
}
conn, err := net.DialTimeout("tcp", address, 500*time.Millisecond)
if err == nil {
conn.Close()
return
}
time.Sleep(250 * time.Millisecond)
select {
case <-ctx.Done():
return
case <-ticker.C:
}
}
}

View File

@@ -12,10 +12,19 @@ import (
)
type ApplicationConfig struct {
Context context.Context
ConfigFile string
SystemState *system.SystemState
ExternalBackends []string
Context context.Context
ConfigFile string
SystemState *system.SystemState
ExternalBackends []string
// WebRTCNAT1To1IPs, when set, are advertised as the host ICE candidates for
// /v1/realtime WebRTC instead of every local interface address. Needed when
// the routable address differs from what pion gathers — e.g. Docker host
// networking (where pion also offers unreachable bridge IPs) or NAT.
WebRTCNAT1To1IPs []string
// WebRTCICEInterfaces, when set, restricts ICE candidate gathering to these
// network interfaces (e.g. eth0), filtering out docker0/veth noise.
WebRTCICEInterfaces []string
UploadLimitMB, Threads, ContextSize int
F16 bool
Debug bool
@@ -56,7 +65,7 @@ type ApplicationConfig struct {
//
// patterns:
// - id: email
// action: route_local # downgrade default mask -> route_local
// action: allow # downgrade default mask -> allow (log only)
// - id: ssn
// action: block # upgrade default mask -> block
//
@@ -81,7 +90,6 @@ type ApplicationConfig struct {
// file is mode 0600.
MITMCADir string
// PIIPatternOverrides applies persisted per-id deltas (action,
// disabled) to the live redactor at startup. Loaded from
// runtime_settings.json and applied right after pii.NewRedactor.
@@ -116,11 +124,11 @@ type ApplicationConfig struct {
// --require-backend-integrity / LOCALAI_REQUIRE_BACKEND_INTEGRITY.
RequireBackendIntegrity bool
SingleBackend bool // Deprecated: use MaxActiveBackends = 1 instead
MaxActiveBackends int // Maximum number of active backends (0 = unlimited, 1 = single backend mode)
WatchDogIdle bool
WatchDogBusy bool
WatchDog bool
SingleBackend bool // Deprecated: use MaxActiveBackends = 1 instead
MaxActiveBackends int // Maximum number of active backends (0 = unlimited, 1 = single backend mode)
WatchDogIdle bool
WatchDogBusy bool
WatchDog bool
// Memory Reclaimer settings (works with GPU if available, otherwise RAM)
MemoryReclaimerEnabled bool // Enable memory threshold monitoring
@@ -311,6 +319,18 @@ func WithExternalBackends(backends ...string) AppOption {
}
}
func WithWebRTCNAT1To1IPs(ips ...string) AppOption {
return func(o *ApplicationConfig) {
o.WebRTCNAT1To1IPs = ips
}
}
func WithWebRTCICEInterfaces(interfaces ...string) AppOption {
return func(o *ApplicationConfig) {
o.WebRTCICEInterfaces = interfaces
}
}
func WithMachineTag(tag string) AppOption {
return func(o *ApplicationConfig) {
o.MachineTag = tag
@@ -702,7 +722,6 @@ func WithMITMCADir(dir string) AppOption {
}
}
func WithDynamicConfigDir(dynamicConfigsDir string) AppOption {
return func(o *ApplicationConfig) {
o.DynamicConfigsDir = dynamicConfigsDir

View File

@@ -93,6 +93,9 @@ func applyOverride(f *FieldMeta, o FieldMetaOverride) {
if o.Component != "" {
f.Component = o.Component
}
if o.Language != "" {
f.Language = o.Language
}
if o.Placeholder != "" {
f.Placeholder = o.Placeholder
}

View File

@@ -8,6 +8,7 @@ const (
ProviderModelsTTS = "models:tts"
ProviderModelsTranscript = "models:transcript"
ProviderModelsVAD = "models:vad"
ProviderModelsScore = "models:score"
)
// Static option lists embedded directly in field metadata.

View File

@@ -226,6 +226,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
Label: "Chat Template",
Description: "Go template for chat completion requests",
Component: "code-editor",
Language: "gotemplate",
Order: 40,
},
"template.chat_message": {
@@ -233,6 +234,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
Label: "Chat Message Template",
Description: "Go template for individual chat messages",
Component: "code-editor",
Language: "gotemplate",
Order: 41,
},
"template.completion": {
@@ -240,13 +242,22 @@ func DefaultRegistry() map[string]FieldMetaOverride {
Label: "Completion Template",
Description: "Go template for completion requests",
Component: "code-editor",
Language: "gotemplate",
Order: 42,
},
"template.function": {
Section: "templates",
Label: "Functions Template",
Description: "Go template applied when tools/functions are present in the request",
Component: "code-editor",
Language: "gotemplate",
Order: 43,
},
"template.use_tokenizer_template": {
Section: "templates",
Label: "Use Tokenizer Template",
Description: "Use the chat template from the model's tokenizer config",
Order: 43,
Order: 44,
},
// Router section template — kept in the templates UI section
// (rather than the router section under "other") so operators
@@ -257,7 +268,8 @@ func DefaultRegistry() map[string]FieldMetaOverride {
Label: "Router Classifier System Prompt",
Description: "Go text/template (with sprig functions) for the routing system prompt the score classifier feeds to its classifier_model. Executed with `.Policies` ([]{Label, Description}). Empty falls back to the built-in Arch-Router-shaped prompt (route-listing block + JSON output schema). Override when the classifier model was trained on a different schema or you need the routing instructions in a different language. The candidate format scored against the model is fixed at `{\"route\": \"<label>\"}` — keep your override's output schema instruction matching that.",
Component: "code-editor",
Order: 44,
Language: "gotemplate",
Order: 45,
},
// --- Pipeline ---
@@ -308,6 +320,41 @@ func DefaultRegistry() map[string]FieldMetaOverride {
},
Order: 64,
},
"pipeline.disable_thinking": {
Section: "pipeline",
Label: "Disable Thinking",
Description: "Suppress reasoning/thinking output from the pipeline LLM (sets enable_thinking=false on the underlying model). Use for models that emit <think> blocks you don't want spoken or streamed back to the realtime client.",
Component: "toggle",
Order: 65,
},
"pipeline.streaming.llm": {
Section: "pipeline",
Label: "Stream LLM",
Description: "Stream LLM tokens to the realtime client as they are generated instead of waiting for the full response. Emits incremental response.output_audio_transcript.delta / text deltas.",
Component: "toggle",
Order: 66,
},
"pipeline.streaming.tts": {
Section: "pipeline",
Label: "Stream TTS",
Description: "Stream synthesized audio chunks to the realtime client as they are produced (requires a TTS backend that implements TTSStream). Falls back to unary synthesis otherwise.",
Component: "toggle",
Order: 67,
},
"pipeline.streaming.transcription": {
Section: "pipeline",
Label: "Stream Transcription",
Description: "Stream partial transcription text to the realtime client as the STT backend produces it (requires a transcription backend that implements AudioTranscriptionStream). Falls back to unary transcription otherwise.",
Component: "toggle",
Order: 68,
},
"pipeline.streaming.clause_chunking": {
Section: "pipeline",
Label: "Clause Chunking",
Description: "Split the streamed reply into speakable clauses and synthesize each as soon as it completes, instead of buffering the whole message before TTS — lower time-to-first-audio. Script-aware (handles CJK 。!? and Thai/Lao spaces), so it does not whitespace-split. Requires Stream LLM; off buffers the whole message.",
Component: "toggle",
Order: 69,
},
// --- Functions ---
"function.grammar.parallel_calls": {
@@ -365,14 +412,14 @@ func DefaultRegistry() map[string]FieldMetaOverride {
// --- PII filtering (per-model) ---
"pii.enabled": {
Section: "other",
Section: "pii",
Label: "PII Filtering Enabled",
Description: "Enable PII redaction middleware for this model. Unset means use the default (off for local backends, on for proxy-* / cloud-hosted backends).",
Component: "toggle",
Order: 200,
},
"pii.patterns": {
Section: "other",
Section: "pii",
Label: "PII Pattern Overrides",
Description: "Override the global default action for specific patterns on this model. Patterns not listed here inherit the global action (Settings → Middleware → Filtering).",
Component: "pii-pattern-list",
@@ -385,7 +432,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
// fails closed — the chat handler does NOT silently fall back
// to the local gRPC pipeline.
"proxy.mode": {
Section: "other",
Section: "proxy",
Label: "Proxy Mode",
Description: "passthrough forwards the client's OpenAI body verbatim — point upstream_url at an OpenAI-compatible endpoint (incl. Anthropic's /v1/chat/completions compat layer). translate converts OpenAI ↔ Anthropic Messages so you can target a native API (/v1/messages); tool_calls and usage tokens survive the round-trip.",
Component: "select",
@@ -397,7 +444,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
Order: 208,
},
"proxy.provider": {
Section: "other",
Section: "proxy",
Label: "Proxy Provider",
Description: "Upstream API family. Drives auth header shape (Bearer vs x-api-key + anthropic-version) and, in translate mode, which request/response codec is used.",
Component: "select",
@@ -409,28 +456,28 @@ func DefaultRegistry() map[string]FieldMetaOverride {
Order: 209,
},
"proxy.upstream_url": {
Section: "other",
Section: "proxy",
Label: "Proxy Upstream URL",
Description: "Full POST endpoint of the upstream provider (e.g. https://api.openai.com/v1/chat/completions). Only used when Backend is cloud-proxy.",
Component: "input",
Order: 210,
},
"proxy.api_key_env": {
Section: "other",
Section: "proxy",
Label: "Proxy API Key Env Var",
Description: "Name of the environment variable holding the upstream API key. Reading from env keeps the secret out of the YAML and the admin UI.",
Component: "input",
Order: 211,
},
"proxy.upstream_model": {
Section: "other",
Section: "proxy",
Label: "Proxy Upstream Model",
Description: "Model name sent to the upstream. Leave empty to forward the client's model field unchanged. Useful when the LocalAI alias differs from the upstream's canonical name.",
Component: "input",
Order: 212,
},
"proxy.request_timeout_seconds": {
Section: "other",
Section: "proxy",
Label: "Proxy Request Timeout (seconds)",
Description: "Caps the upstream HTTP request duration. 0 disables the deadline; the request still ends when the client disconnects.",
Component: "number",
@@ -445,7 +492,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
// A host claimed by two configs is a critical error — the
// listener refuses to start until resolved.
"mitm.hosts": {
Section: "other",
Section: "mitm",
Label: "MITM Intercept Hosts",
Description: "Hostnames the cloudproxy MITM proxy terminates TLS for on behalf of this model config. PII filtering and pattern overrides flow from this model when the host is intercepted. Each host must be unique across all configs.",
Component: "string-list",
@@ -460,7 +507,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
// the middleware admin page surfaces every model with a router
// block.
"router.classifier": {
Section: "other",
Section: "router",
Label: "Classifier",
Description: "Picks a candidate by scoring every policy label against the prompt. Only \"score\" is shipped today; it asks the classifier_model to rank each label and reads off the softmax. Empty defaults to \"score\".",
Component: "select",
@@ -470,15 +517,15 @@ func DefaultRegistry() map[string]FieldMetaOverride {
Order: 230,
},
"router.classifier_model": {
Section: "other",
Section: "router",
Label: "Classifier Model",
Description: "Loaded LocalAI model the score classifier asks to rank each policy label as a continuation. Must support the Score gRPC primitive (today: llama-cpp, vLLM) and use the ChatML template. Arch-Router-1.5B Q4_K_M is the canonical choice; any small ChatML instruct model also works at a higher activation_threshold.",
Component: "model-select",
AutocompleteProvider: ProviderModelsChat,
AutocompleteProvider: ProviderModelsScore,
Order: 231,
},
"router.fallback": {
Section: "other",
Section: "router",
Label: "Fallback Model",
Description: "Model used when no candidate's labels cover the classifier's active label set, or when the classifier errors. Empty means router failures bubble up as HTTP 500 — fail-fast, not silent-bypass.",
Component: "model-select",
@@ -486,7 +533,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
Order: 232,
},
"router.activation_threshold": {
Section: "other",
Section: "router",
Label: "Activation Threshold",
Description: "Softmax-probability floor a policy must clear to join the active label set for a request. Higher → single-label dominant routes; lower → more multi-label activations. 0 picks the package default (0.15). On Arch-Router-1.5B a value around 0.40 keeps the dominant label clean without losing genuine compound activations.",
Component: "slider",
@@ -496,7 +543,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
Order: 233,
},
"router.classifier_cache_size": {
Section: "other",
Section: "router",
Label: "Classifier L1 Cache Size",
Description: "Bounded LRU keyed on (case-folded, whitespace-trimmed) prompt — amortises the classifier round-trip across verbatim repeats common in agent loops. 0 here means \"use the default\" (1024); the cache cannot be disabled from YAML.",
Component: "number",
@@ -504,21 +551,21 @@ func DefaultRegistry() map[string]FieldMetaOverride {
Order: 234,
},
"router.policies": {
Section: "other",
Section: "router",
Label: "Policies",
Description: "Label vocabulary the classifier scores over. Each policy has a label and a short natural-language description fed verbatim to the classifier model. Short action-oriented sentences work best (\"writing or debugging code\"; \"small talk\").",
Component: "router-policies",
Order: 235,
},
"router.candidates": {
Section: "other",
Section: "router",
Label: "Candidates",
Description: "Routing table: each entry binds a downstream model to a set of policy labels it can serve. Order matters — the middleware picks the FIRST candidate whose labels are a superset of the active set, so list candidates smallest → largest.",
Component: "router-candidates",
Order: 236,
},
"router.score_normalization": {
Section: "other",
Section: "router",
Label: "Score Normalization",
Description: "How the score classifier collapses per-candidate joint log-probs into the softmax input. \"raw\" (default) feeds joint log-prob as-is — on-distribution for Arch-Router (the route the model would actually emit if decoded freely). \"mean\" divides by candidate token count — fairer to long labels but off-distribution for models trained to emit fixed-format outputs.",
Component: "select",
@@ -530,7 +577,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
Order: 240,
},
"router.embedding_cache.embedding_model": {
Section: "other",
Section: "router",
Label: "L2 Cache: Embedding Model",
Description: "Embedding model used by the L2 decision cache. Embeds incoming probes and looks them up in the per-router local-store collection. Empty disables the cache entirely. nomic-embed-text-v1.5 is the recommended default.",
Component: "model-select",
@@ -538,7 +585,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
Order: 237,
},
"router.embedding_cache.similarity_threshold": {
Section: "other",
Section: "router",
Label: "L2 Cache: Similarity Threshold",
Description: "Cosine-similarity floor a cache candidate must clear to count as a hit. 0 picks the package default (0.80). Re-tune per embedding model — the histogram on the Routing tab shows where the cosine distribution actually sits.",
Component: "slider",
@@ -548,7 +595,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
Order: 238,
},
"router.embedding_cache.confidence_threshold": {
Section: "other",
Section: "router",
Label: "L2 Cache: Confidence Threshold",
Description: "Minimum top-label probability a classifier decision must have to be inserted into the cache. 0 picks the package default (0.60). Uncertain decisions are skipped so they can't poison future paraphrases.",
Component: "slider",
@@ -558,7 +605,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
Order: 239,
},
"router.embedding_cache.store_name": {
Section: "other",
Section: "router",
Label: "L2 Cache: Store Name",
Description: "Optional override for the local-store collection used by this router's cache. Empty defaults to \"router-cache-<router-model-name>\". Two routers sharing a store_name share their cache (rare).",
Component: "input",

View File

@@ -240,7 +240,6 @@ var grandfatheredUnregistered = []string{
"swap_space",
"system_prompt",
"template.edit",
"template.function",
"template.join_chat_messages_by_character",
"template.multimodal",
"template.reply_prefix",

View File

@@ -11,6 +11,7 @@ type FieldMeta struct {
Label string `json:"label"` // human-readable label
Description string `json:"description,omitempty"` // help text
Component string `json:"component"` // "input", "number", "toggle", "select", "slider", etc.
Language string `json:"language,omitempty"` // syntax mode for code-editor fields: "yaml" (default), "gotemplate"
Placeholder string `json:"placeholder,omitempty"`
Default any `json:"default,omitempty"`
Min *float64 `json:"min,omitempty"`
@@ -51,6 +52,7 @@ type FieldMetaOverride struct {
Label string
Description string
Component string
Language string
Placeholder string
Default any
Min *float64
@@ -78,6 +80,10 @@ func DefaultSections() []Section {
{ID: "grpc", Label: "gRPC", Icon: "server", Order: 65},
{ID: "agent", Label: "Agent", Icon: "bot", Order: 70},
{ID: "mcp", Label: "MCP", Icon: "plug", Order: 75},
{ID: "router", Label: "Router", Icon: "git-merge", Order: 78},
{ID: "proxy", Label: "Proxy", Icon: "cloud", Order: 80},
{ID: "mitm", Label: "MITM Proxy", Icon: "shield", Order: 82},
{ID: "pii", Label: "PII", Icon: "shield", Order: 84},
{ID: "other", Label: "Other", Icon: "more-horizontal", Order: 100},
}
}

View File

@@ -385,7 +385,7 @@ type PIIConfig struct {
Enabled *bool `yaml:"enabled,omitempty" json:"enabled,omitempty"`
// Patterns lets a model upgrade or downgrade individual pattern
// actions (mask | block | route_local) relative to the global
// actions (mask | block | allow) relative to the global
// defaults loaded from --pii-config / DefaultPatterns. Pattern IDs
// not listed inherit the global action. The regex itself stays
// global — only the action is settable per-model.
@@ -499,6 +499,16 @@ type Pipeline struct {
// the pipeline's LLM without editing the LLM model config. Overrides the LLM's
// own reasoning_effort. Unset leaves the LLM model config in charge.
ReasoningEffort string `yaml:"reasoning_effort,omitempty" json:"reasoning_effort,omitempty"`
// Streaming opts each pipeline stage into incremental delivery (LLM tokens,
// TTS audio chunks, transcription text). Unset stages keep the blocking
// unary path, so existing configs are unaffected.
Streaming PipelineStreaming `yaml:"streaming,omitempty" json:"streaming,omitempty"`
// DisableThinking suppresses reasoning/thinking for the pipeline LLM (maps
// to enable_thinking=false backend metadata) without editing the underlying
// LLM model config. Unset leaves the LLM model config in charge.
DisableThinking *bool `yaml:"disable_thinking,omitempty" json:"disable_thinking,omitempty"`
}
// ApplyReasoningEffort resolves the effective reasoning effort — a per-request
@@ -530,6 +540,41 @@ func (c *ModelConfig) ApplyReasoningEffort(requestEffort string) {
}
}
// @Description PipelineStreaming toggles incremental delivery per realtime stage.
type PipelineStreaming struct {
LLM *bool `yaml:"llm,omitempty" json:"llm,omitempty"`
TTS *bool `yaml:"tts,omitempty" json:"tts,omitempty"`
Transcription *bool `yaml:"transcription,omitempty" json:"transcription,omitempty"`
// ClauseChunking splits the streamed LLM reply into speakable clauses and
// synthesizes each as soon as it completes, instead of buffering the whole
// message before TTS. Script-aware (CJK/Thai), so it does not rely on
// whitespace sentence boundaries. Requires LLM streaming; unset buffers the
// whole message (today's default).
ClauseChunking *bool `yaml:"clause_chunking,omitempty" json:"clause_chunking,omitempty"`
}
// StreamLLM reports whether LLM tokens should be streamed for this pipeline.
func (p Pipeline) StreamLLM() bool { return p.Streaming.LLM != nil && *p.Streaming.LLM }
// StreamTTS reports whether TTS audio should be streamed for this pipeline.
func (p Pipeline) StreamTTS() bool { return p.Streaming.TTS != nil && *p.Streaming.TTS }
// StreamTranscription reports whether transcription text should be streamed.
func (p Pipeline) StreamTranscription() bool {
return p.Streaming.Transcription != nil && *p.Streaming.Transcription
}
// ChunkClauses reports whether the streamed reply should be split into
// script-aware clauses and synthesized incrementally rather than buffered whole.
func (p Pipeline) ChunkClauses() bool {
return p.Streaming.ClauseChunking != nil && *p.Streaming.ClauseChunking
}
// ThinkingDisabled reports whether the pipeline forces the LLM's thinking off.
func (p Pipeline) ThinkingDisabled() bool {
return p.DisableThinking != nil && *p.DisableThinking
}
// @Description File configuration for model downloads
type File struct {
Filename string `yaml:"filename,omitempty" json:"filename,omitempty"`
@@ -1229,14 +1274,20 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
}
if (u & FLAG_CHAT) == FLAG_CHAT {
if c.TemplateConfig.Chat == "" && c.TemplateConfig.ChatMessage == "" && !c.TemplateConfig.UseTokenizerTemplate {
return false
}
if slices.Contains(nonTextGenBackends, c.Backend) {
return false
}
if c.Embeddings != nil && *c.Embeddings {
return false
// A router model is a chat dispatcher: it carries no chat
// template of its own (those live on the candidates it routes
// to) and is invoked through the chat endpoint, so the router
// block stands in for chat capability.
if !c.HasRouter() {
if c.TemplateConfig.Chat == "" && c.TemplateConfig.ChatMessage == "" && !c.TemplateConfig.UseTokenizerTemplate {
return false
}
if slices.Contains(nonTextGenBackends, c.Backend) {
return false
}
if c.Embeddings != nil && *c.Embeddings {
return false
}
}
}
if (u & FLAG_COMPLETION) == FLAG_COMPLETION {

View File

@@ -283,6 +283,18 @@ parameters:
Expect(e.HasUsecases(FLAG_CHAT)).To(BeFalse())
Expect(e.HasUsecases(FLAG_EMBEDDINGS)).To(BeTrue())
// Router models are chat dispatchers: no chat template of their
// own, but invoked through the chat endpoint, so they default to
// chat-capable.
r := ModelConfig{
Name: "r",
Router: RouterConfig{
Candidates: []RouterCandidate{{Model: "downstream", Labels: []string{"general"}}},
},
}
Expect(r.HasUsecases(FLAG_ANY)).To(BeTrue())
Expect(r.HasUsecases(FLAG_CHAT)).To(BeTrue())
f := ModelConfig{
Name: "f",
Backend: "piper",

View File

@@ -0,0 +1,57 @@
package config
import (
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"gopkg.in/yaml.v3"
)
// The realtime pipeline can stream each stage (LLM tokens, TTS audio,
// transcription text) and can disable model "thinking" for the LLM. These are
// opt-in per pipeline; everything defaults to off so existing configs keep the
// unary behaviour.
var _ = Describe("Pipeline streaming config", func() {
It("defaults every streaming + thinking helper to false when unset", func() {
var p Pipeline
Expect(p.StreamLLM()).To(BeFalse())
Expect(p.StreamTTS()).To(BeFalse())
Expect(p.StreamTranscription()).To(BeFalse())
Expect(p.ChunkClauses()).To(BeFalse())
Expect(p.ThinkingDisabled()).To(BeFalse())
})
It("parses the nested streaming block and disable_thinking from YAML", func() {
var c ModelConfig
err := yaml.Unmarshal([]byte(`
name: gpt-realtime
pipeline:
llm: my-llm
tts: my-tts
transcription: my-stt
streaming:
llm: true
tts: true
transcription: true
clause_chunking: true
disable_thinking: true
`), &c)
Expect(err).ToNot(HaveOccurred())
Expect(c.Pipeline.StreamLLM()).To(BeTrue())
Expect(c.Pipeline.StreamTTS()).To(BeTrue())
Expect(c.Pipeline.StreamTranscription()).To(BeTrue())
Expect(c.Pipeline.ChunkClauses()).To(BeTrue())
Expect(c.Pipeline.ThinkingDisabled()).To(BeTrue())
})
It("treats an explicit false in the streaming block as disabled", func() {
var c ModelConfig
err := yaml.Unmarshal([]byte(`
name: gpt-realtime
pipeline:
streaming:
tts: false
`), &c)
Expect(err).ToNot(HaveOccurred())
Expect(c.Pipeline.StreamTTS()).To(BeFalse())
})
})

View File

@@ -50,7 +50,14 @@ var _ = Describe("Runtime capability-based backend selection", func() {
must(os.WriteFile(filepath.Join(cudaDir, "metadata.json"), b, 0o644))
must(os.WriteFile(filepath.Join(cudaDir, "run.sh"), []byte(""), 0o755))
// Default system: alias should point to CPU
// Default system: alias should point to CPU. Force the capability to
// "cpu" so this is hermetic on hosts that actually have a GPU: backend
// preference keys off getSystemCapabilities() (env → real nvidia-smi
// detection), not GPUVendor, so without this a GPU dev box reports
// "nvidia" and the cuda alias wins. The NVIDIA case below overrides it.
must(os.Setenv("LOCALAI_FORCE_META_BACKEND_CAPABILITY", "cpu"))
defer func() { _ = os.Unsetenv("LOCALAI_FORCE_META_BACKEND_CAPABILITY") }()
sysDefault, err := system.GetSystemState(
system.WithBackendPath(tempDir),
)

View File

@@ -158,6 +158,11 @@ var defaultImporters = []Importer{
// RFDetrImporter must run before TransformersImporter — RF-DETR
// checkpoints may carry tokenizer-adjacent artefacts.
&RFDetrImporter{},
// LocateAnythingImporter (NVIDIA LocateAnything open-vocab detection,
// native C++/ggml port) must run before LlamaCPPImporter so its GGUF
// bundles aren't claimed by the generic .gguf importer; kept next to
// RFDetrImporter as both are detection models.
&LocateAnythingImporter{},
// Existing
// DS4Importer must precede LlamaCPPImporter - ds4 weights are GGUFs and
// would otherwise be claimed by the generic .gguf-handling llama-cpp

View File

@@ -0,0 +1,137 @@
package importers
import (
"encoding/json"
"path/filepath"
"strings"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/gallery"
"github.com/mudler/LocalAI/core/schema"
"go.yaml.in/yaml/v2"
)
var _ Importer = &LocateAnythingImporter{}
// LocateAnythingImporter routes NVIDIA LocateAnything open-vocabulary
// object-detection / visual-grounding repositories to the
// "locate-anything-cpp" backend (a native C++/ggml port). It must be
// registered BEFORE the generic GGUF matchers (LlamaCPPImporter) so its
// GGUF bundles aren't swallowed by the generic .gguf-handling importer,
// and alongside RFDetrImporter since both are detection models that may
// carry tokenizer-adjacent artefacts.
//
// Detection signals:
// - preferences.backend="locate-anything-cpp" (explicit override);
// - repo name contains "locate-anything" or "locateanything"
// (case-insensitive).
type LocateAnythingImporter struct{}
func (i *LocateAnythingImporter) Name() string { return "locate-anything-cpp" }
func (i *LocateAnythingImporter) Modality() string { return "detection" }
func (i *LocateAnythingImporter) AutoDetects() bool { return true }
func repoLooksLikeLocateAnything(repo string) bool {
lower := strings.ToLower(repo)
return strings.Contains(lower, "locate-anything") ||
strings.Contains(lower, "locateanything") ||
strings.Contains(lower, "locate-anything.cpp") ||
strings.Contains(lower, "locate-anything-cpp")
}
func (i *LocateAnythingImporter) Match(details Details) bool {
preferences, err := details.Preferences.MarshalJSON()
if err != nil {
return false
}
preferencesMap := make(map[string]any)
if len(preferences) > 0 {
if err := json.Unmarshal(preferences, &preferencesMap); err != nil {
return false
}
}
if b, ok := preferencesMap["backend"].(string); ok && b == "locate-anything-cpp" {
return true
}
if details.HuggingFace != nil {
repoName := details.HuggingFace.ModelID
if idx := strings.Index(repoName, "/"); idx >= 0 {
repoName = repoName[idx+1:]
}
if repoLooksLikeLocateAnything(repoName) {
return true
}
}
// Fallback: hfapi recursion bug may leave HuggingFace nil — decide
// from the URI owner/repo.
if _, repo, ok := HFOwnerRepoFromURI(details.URI); ok {
if repoLooksLikeLocateAnything(repo) {
return true
}
}
return false
}
func (i *LocateAnythingImporter) Import(details Details) (gallery.ModelConfig, error) {
preferences, err := details.Preferences.MarshalJSON()
if err != nil {
return gallery.ModelConfig{}, err
}
preferencesMap := make(map[string]any)
if len(preferences) > 0 {
if err := json.Unmarshal(preferences, &preferencesMap); err != nil {
return gallery.ModelConfig{}, err
}
}
name, ok := preferencesMap["name"].(string)
if !ok {
name = filepath.Base(details.URI)
}
description, ok := preferencesMap["description"].(string)
if !ok {
description = "Imported from " + details.URI
}
// Prefer the canonical HF "owner/repo" identifier so the emitted
// YAML mirrors gallery locate-anything entries.
model := details.URI
if details.HuggingFace != nil && details.HuggingFace.ModelID != "" {
model = details.HuggingFace.ModelID
} else if owner, repo, ok := HFOwnerRepoFromURI(details.URI); ok {
model = owner + "/" + repo
}
// Always the native C++/ggml backend; explicit preferences.backend
// overrides the default.
backend := "locate-anything-cpp"
if b, ok := preferencesMap["backend"].(string); ok && b != "" {
backend = b
}
modelConfig := config.ModelConfig{
Name: name,
Description: description,
Backend: backend,
KnownUsecaseStrings: []string{"detection"},
PredictionOptions: schema.PredictionOptions{
BasicModelRequest: schema.BasicModelRequest{Model: model},
},
}
data, err := yaml.Marshal(modelConfig)
if err != nil {
return gallery.ModelConfig{}, err
}
return gallery.ModelConfig{
Name: name,
Description: description,
ConfigFile: string(data),
}, nil
}

View File

@@ -0,0 +1,218 @@
package importers_test
import (
"encoding/json"
"fmt"
"github.com/mudler/LocalAI/core/gallery/importers"
hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("LocateAnythingImporter", func() {
Context("Importer interface metadata", func() {
It("exposes name/modality/autodetect", func() {
imp := &importers.LocateAnythingImporter{}
Expect(imp.Name()).To(Equal("locate-anything-cpp"))
Expect(imp.Modality()).To(Equal("detection"))
Expect(imp.AutoDetects()).To(BeTrue())
})
})
Context("Match", func() {
It("matches when backend preference is locate-anything-cpp", func() {
imp := &importers.LocateAnythingImporter{}
preferences := json.RawMessage(`{"backend": "locate-anything-cpp"}`)
details := importers.Details{
URI: "https://example.com/some-model",
Preferences: preferences,
}
Expect(imp.Match(details)).To(BeTrue())
})
It("matches when the repo name contains 'locate-anything' (case-insensitive)", func() {
imp := &importers.LocateAnythingImporter{}
details := importers.Details{
URI: "https://huggingface.co/mudler/locate-anything-cpp-3b",
HuggingFace: &hfapi.ModelDetails{
ModelID: "mudler/Locate-Anything-CPP-3B",
Author: "mudler",
},
}
Expect(imp.Match(details)).To(BeTrue())
})
It("matches when the repo name contains 'locateanything' (case-insensitive)", func() {
imp := &importers.LocateAnythingImporter{}
details := importers.Details{
URI: "https://huggingface.co/nvidia/LocateAnything-3B",
HuggingFace: &hfapi.ModelDetails{
ModelID: "nvidia/LocateAnything-3B",
Author: "nvidia",
},
}
Expect(imp.Match(details)).To(BeTrue())
})
It("matches via URI fallback when HuggingFace details are missing", func() {
imp := &importers.LocateAnythingImporter{}
details := importers.Details{
URI: "https://huggingface.co/nvidia/LocateAnything-3B",
}
Expect(imp.Match(details)).To(BeTrue())
})
It("does not match unrelated repos without locate-anything signals", func() {
imp := &importers.LocateAnythingImporter{}
details := importers.Details{
URI: "https://huggingface.co/meta-llama/Llama-3-8B",
HuggingFace: &hfapi.ModelDetails{
ModelID: "meta-llama/Llama-3-8B",
Author: "meta-llama",
},
}
Expect(imp.Match(details)).To(BeFalse())
})
It("does not match an rfdetr repo", func() {
imp := &importers.LocateAnythingImporter{}
details := importers.Details{
URI: "https://huggingface.co/mudler/rfdetr-cpp-nano",
HuggingFace: &hfapi.ModelDetails{
ModelID: "mudler/rfdetr-cpp-nano",
Author: "mudler",
},
}
Expect(imp.Match(details)).To(BeFalse())
})
It("returns false for invalid preferences JSON", func() {
imp := &importers.LocateAnythingImporter{}
preferences := json.RawMessage(`not valid json`)
details := importers.Details{
URI: "https://example.com/model",
Preferences: preferences,
}
Expect(imp.Match(details)).To(BeFalse())
})
})
Context("Import", func() {
It("produces a YAML with backend locate-anything-cpp and the repo as the model", func() {
imp := &importers.LocateAnythingImporter{}
details := importers.Details{
URI: "https://huggingface.co/nvidia/LocateAnything-3B",
HuggingFace: &hfapi.ModelDetails{
ModelID: "nvidia/LocateAnything-3B",
Author: "nvidia",
},
}
modelConfig, err := imp.Import(details)
Expect(err).ToNot(HaveOccurred())
Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: locate-anything-cpp"), fmt.Sprintf("Model config: %+v", modelConfig))
Expect(modelConfig.ConfigFile).To(ContainSubstring("nvidia/LocateAnything-3B"), fmt.Sprintf("Model config: %+v", modelConfig))
Expect(modelConfig.ConfigFile).To(ContainSubstring("detection"), fmt.Sprintf("Model config: %+v", modelConfig))
})
It("respects custom name and description from preferences", func() {
imp := &importers.LocateAnythingImporter{}
preferences := json.RawMessage(`{"name": "my-locate", "description": "Custom"}`)
details := importers.Details{
URI: "https://huggingface.co/nvidia/LocateAnything-3B",
Preferences: preferences,
HuggingFace: &hfapi.ModelDetails{
ModelID: "nvidia/LocateAnything-3B",
Author: "nvidia",
},
}
modelConfig, err := imp.Import(details)
Expect(err).ToNot(HaveOccurred())
Expect(modelConfig.Name).To(Equal("my-locate"))
Expect(modelConfig.Description).To(Equal("Custom"))
})
})
// Table-driven coverage of the backend routing: locate-anything repos
// always route to the native locate-anything-cpp backend, with an
// explicit preferences.backend override honoured.
//
// Cases are kept offline-deterministic by injecting Details directly
// rather than going through DiscoverModelConfig (which would hit live HF).
Context("backend routing (offline)", func() {
hfFile := func(path string) hfapi.ModelFile {
return hfapi.ModelFile{Path: path}
}
type tc struct {
name string
uri string
modelID string
files []hfapi.ModelFile
prefs string
expectBackend string // expected `backend:` line content
}
entries := []tc{
{
name: "canonical NVIDIA repo routes to locate-anything-cpp",
uri: "https://huggingface.co/nvidia/LocateAnything-3B",
modelID: "nvidia/LocateAnything-3B",
files: []hfapi.ModelFile{hfFile("locate-anything-3b-q8_0.gguf"), hfFile("README.md")},
prefs: "",
expectBackend: "backend: locate-anything-cpp",
},
{
name: "GGUF bundle with locate-anything name routes to locate-anything-cpp",
uri: "https://huggingface.co/mudler/locate-anything.cpp-3b",
modelID: "mudler/locate-anything.cpp-3b",
files: []hfapi.ModelFile{hfFile("model-f16.gguf")},
prefs: "",
expectBackend: "backend: locate-anything-cpp",
},
{
name: "explicit preferences.backend override is honoured",
uri: "https://huggingface.co/nvidia/LocateAnything-3B",
modelID: "nvidia/LocateAnything-3B",
files: nil,
prefs: `{"backend": "locate-anything-cpp"}`,
expectBackend: "backend: locate-anything-cpp",
},
}
for _, e := range entries {
e := e // capture for closure
It(e.name, func() {
imp := &importers.LocateAnythingImporter{}
details := importers.Details{
URI: e.uri,
HuggingFace: &hfapi.ModelDetails{
ModelID: e.modelID,
Files: e.files,
},
}
if e.prefs != "" {
details.Preferences = json.RawMessage(e.prefs)
}
Expect(imp.Match(details)).To(BeTrue(), fmt.Sprintf("Match should fire for %+v", details))
modelConfig, err := imp.Import(details)
Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Import error: %v", err))
Expect(modelConfig.ConfigFile).To(ContainSubstring(e.expectBackend),
fmt.Sprintf("Model config: %+v", modelConfig))
})
}
})
})

View File

@@ -64,7 +64,17 @@ func (i *MLXImporter) Import(details Details) (gallery.ModelConfig, error) {
description = "Imported from " + details.URI
}
// Vision-language checkpoints (e.g. gemma-4 E4B) declare the
// "image-text-to-text" pipeline tag on HuggingFace. The text-only mlx-lm
// tokenizer does not carry their processor chat template, so routing them
// through the plain mlx backend yields degenerate looping output
// (issue #10269). Send them to the mlx-vlm backend, which applies the
// processor-aware chat template.
backend := "mlx"
if details.HuggingFace != nil && details.HuggingFace.PipelineTag == "image-text-to-text" {
backend = "mlx-vlm"
}
// An explicit backend preference always wins.
b, ok := preferencesMap["backend"].(string)
if ok {
backend = b

View File

@@ -4,6 +4,7 @@ import (
"encoding/json"
"github.com/mudler/LocalAI/core/gallery/importers"
hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
@@ -122,6 +123,60 @@ var _ = Describe("MLXImporter", func() {
Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: mlx-vlm"))
})
It("should auto-route vision-language models to the mlx-vlm backend", func() {
// gemma-4 E4B and similar VLMs declare pipeline_tag
// "image-text-to-text" on HuggingFace. The text-only mlx-lm
// tokenizer does not carry their processor chat template, so
// routing them through the plain mlx backend produces degenerate
// looping output (issue #10269). They must go to mlx-vlm.
details := importers.Details{
URI: "https://huggingface.co/mlx-community/gemma-4-E4B-it-qat-4bit",
HuggingFace: &hfapi.ModelDetails{
ModelID: "mlx-community/gemma-4-E4B-it-qat-4bit",
PipelineTag: "image-text-to-text",
},
}
modelConfig, err := importer.Import(details)
Expect(err).ToNot(HaveOccurred())
Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: mlx-vlm"))
})
It("should keep text-only models on the plain mlx backend", func() {
details := importers.Details{
URI: "https://huggingface.co/mlx-community/Llama-3.2-1B-Instruct-4bit",
HuggingFace: &hfapi.ModelDetails{
ModelID: "mlx-community/Llama-3.2-1B-Instruct-4bit",
PipelineTag: "text-generation",
},
}
modelConfig, err := importer.Import(details)
Expect(err).ToNot(HaveOccurred())
Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: mlx"))
Expect(modelConfig.ConfigFile).ToNot(ContainSubstring("backend: mlx-vlm"))
})
It("should honor an explicit backend preference even for a VLM", func() {
preferences := json.RawMessage(`{"backend": "mlx"}`)
details := importers.Details{
URI: "https://huggingface.co/mlx-community/gemma-4-E4B-it-qat-4bit",
Preferences: preferences,
HuggingFace: &hfapi.ModelDetails{
ModelID: "mlx-community/gemma-4-E4B-it-qat-4bit",
PipelineTag: "image-text-to-text",
},
}
modelConfig, err := importer.Import(details)
Expect(err).ToNot(HaveOccurred())
Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: mlx"))
Expect(modelConfig.ConfigFile).ToNot(ContainSubstring("backend: mlx-vlm"))
})
It("should handle invalid JSON preferences", func() {
preferences := json.RawMessage(`invalid json`)
details := importers.Details{

View File

@@ -353,7 +353,7 @@ func handleAnthropicStream(c echo.Context, id string, input *schema.AnthropicReq
overrides = make(map[string]pii.Action, len(raw))
for ovid, action := range raw {
switch pii.Action(action) {
case pii.ActionMask, pii.ActionBlock, pii.ActionRouteLocal:
case pii.ActionMask, pii.ActionBlock, pii.ActionAllow:
overrides[ovid] = pii.Action(action)
}
}

View File

@@ -102,7 +102,7 @@ var instructionDefs = []instructionDef{
Name: "pii-filtering",
Description: "Inspect and tune the regex PII filter applied to chat requests",
Tags: []string{"pii"},
Intro: "GET /api/pii/patterns lists the active pattern set with each one's action (mask, block, route_local). GET /api/pii/events returns recent redaction events filtered by correlation_id / user_id / pattern_id (admin or local-user only). POST /api/pii/test dry-runs the redactor against an admin-supplied string. POST /api/pii/decide is the programmatic decision oracle for external routers: send `{text}`, receive `{findings, suggested_action, redacted_preview}` without LocalAI mutating, recording, or acting on the call — caller composes the action with its own policy. Default patterns: email, phone, SSN, credit card (Luhn), IPv4, common API key prefixes (sk-, pk-, ghp_, github_pat_). PII is per-model: by default it is OFF for non-proxy backends and ON for backends starting with proxy-* (cloud passthroughs). Opt in with `pii: { enabled: true }` in a model's YAML; use `pii: { patterns: [{id, action}] }` to upgrade or downgrade individual actions for that model. Override global default actions via --pii-config pii.yaml; --disable-pii turns the filter off entirely.",
Intro: "GET /api/pii/patterns lists the active pattern set with each one's action (mask, block, allow). GET /api/pii/events returns recent redaction events filtered by correlation_id / user_id / pattern_id (admin or local-user only). POST /api/pii/test dry-runs the redactor against an admin-supplied string. POST /api/pii/decide is the programmatic decision oracle for external routers: send `{text}`, receive `{findings, suggested_action, redacted_preview}` without LocalAI mutating, recording, or acting on the call — caller composes the action with its own policy. Default patterns: email, phone, SSN, credit card (Luhn), IPv4, common API key prefixes (sk-, pk-, ghp_, github_pat_). PII is per-model: by default it is OFF for non-proxy backends and ON for backends starting with proxy-* (cloud passthroughs). Opt in with `pii: { enabled: true }` in a model's YAML; use `pii: { patterns: [{id, action}] }` to upgrade or downgrade individual actions for that model. Override global default actions via --pii-config pii.yaml; --disable-pii turns the filter off entirely.",
},
{
Name: "middleware-admin",

View File

@@ -124,6 +124,8 @@ func AutocompleteEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, a
filterFn = config.BuildUsecaseFilterFn(config.FLAG_VAD)
case config.UsecaseTranscript:
filterFn = config.BuildUsecaseFilterFn(config.FLAG_TRANSCRIPT)
case "score": // router classifier usecase (FLAG_SCORE); not in UsecaseInfoMap
filterFn = config.BuildUsecaseFilterFn(config.FLAG_SCORE)
default:
filterFn = config.NoFilterFn
}

View File

@@ -15,9 +15,9 @@ import (
//
// External routers (e.g. the localai-org/platform router) call this
// before dispatching to learn whether to mask the prompt in place,
// route to a local-only backend, block the request, or pass it
// through. LocalAI's in-band PII middleware is the alternative path
// for direct-to-LocalAI clients — same Redactor, different framing.
// block the request, or pass it through. LocalAI's in-band PII
// middleware is the alternative path for direct-to-LocalAI clients —
// same Redactor, different framing.
//
// Takes the *pii.Redactor directly rather than the whole
// *application.Application so the handler stays unit-testable with a
@@ -62,24 +62,18 @@ func PIIDecideEndpoint(redactor *pii.Redactor) echo.HandlerFunc {
}
}
// actionAllow is the wire-only value for "no findings". The other
// three map to existing pii.Action* constants; allow has no in-band
// counterpart because the in-band middleware simply passes through.
const actionAllow = "allow"
// suggestedAction collapses the Redactor's Result flags onto a single
// wire-format action using the in-band ordering (block > route_local
// > mask > allow). Spans-without-Blocked-or-LocalOnly means every
// match resolved to ActionMask.
// wire-format action using the in-band ordering (block > mask >
// allow). "allow" covers both "nothing matched" and "matched but every
// span resolved to the allow action" — in both cases the caller may
// dispatch unchanged, with the Findings list reporting what was seen.
func suggestedAction(res pii.Result) string {
switch {
case res.Blocked:
return string(pii.ActionBlock)
case res.LocalOnly:
return string(pii.ActionRouteLocal)
case len(res.Spans) > 0:
case res.Masked:
return string(pii.ActionMask)
default:
return actionAllow
return string(pii.ActionAllow)
}
}

View File

@@ -16,8 +16,8 @@ import (
// PIIDecideEndpoint exposes the redactor as a decision oracle. These
// specs pin the validation surface and the suggested_action mapping
// across all four actions (allow/mask/route_local/block). The redactor
// itself is covered in core/services/routing/pii/redactor_test.go.
// across the three actions (allow/mask/block). The redactor itself is
// covered in core/services/routing/pii/redactor_test.go.
var _ = Describe("PIIDecideEndpoint", func() {
var redactor *pii.Redactor
@@ -68,16 +68,17 @@ var _ = Describe("PIIDecideEndpoint", func() {
Expect(len(body.Findings)).To(BeNumerically(">=", 1))
})
It("returns route_local when an override sets that action", func() {
// Promote the email pattern to route_local for this test —
// exercises the route_local branch of suggestedAction without
// needing a custom pattern set.
Expect(redactor.SetAction("email", pii.ActionRouteLocal)).To(Succeed())
It("returns allow when a matched pattern's action is allow", func() {
// Downgrade the email pattern to allow for this test —
// exercises the allow branch of suggestedAction: a match is
// found, but the strongest action is allow so the suggestion
// is "allow" and the text is left intact.
Expect(redactor.SetAction("email", pii.ActionAllow)).To(Succeed())
rec, body := invokePIIDecide(redactor, `{"text":"contact alice@example.com"}`)
Expect(rec.Code).To(Equal(http.StatusOK))
Expect(body.SuggestedAction).To(Equal("route_local"))
// route_local leaves the original text intact — caller decides
// whether to forward it to a local-only backend.
Expect(body.SuggestedAction).To(Equal("allow"))
Expect(body.Findings).To(HaveLen(1), "allow still reports the finding")
// allow leaves the original text intact.
Expect(body.RedactedPreview).To(ContainSubstring("alice@example.com"))
})

View File

@@ -103,7 +103,12 @@ func applyAutoparserOverride(
// blocks like "<think></think>" that some models emit when reasoning
// is disabled.
if deltaReasoning == "" && deltaContent != "" {
deltaReasoning, deltaContent = reason.ExtractReasoningWithConfig(deltaContent, thinkingStartToken, reasoningConfig)
// Complete-response extraction: only honor a prefilled <think> start
// token when deltaContent actually closes the reasoning block. Without
// it the model answered directly and the whole answer must stay in
// content rather than be swallowed as unclosed reasoning. See
// reason.ExtractReasoningComplete.
deltaReasoning, deltaContent = reason.ExtractReasoningComplete(deltaContent, thinkingStartToken, reasoningConfig)
}
xlog.Debug("[ChatDeltas] non-SSE no-tools: overriding result with C++ autoparser deltas",
"content_len", len(deltaContent), "reasoning_len", len(deltaReasoning))

View File

@@ -186,6 +186,114 @@ var _ = Describe("applyAutoparserOverride", func() {
Expect(result).To(Equal(existing))
})
})
// Regression tests for the prefilled-thinking-token path (thinkingStartToken
// != ""). This is the configuration the gallery qwen3 family runs in: the
// chat template injects <think> into the prompt, so DetectThinkingStartToken
// returns "<think>" and the model's output begins *inside* a reasoning block
// — it emits a closing </think> but no opening tag.
//
// The defensive Go-side fallback prepends the start token so the standard
// extractor can pair it with the model's </think>. But on a *complete*
// response that contains NO closing tag (the model answered directly with no
// reasoning at all), prepending <think> manufactures an unclosed block that
// swallows the entire answer into reasoning, leaving content empty. That is
// the bug: short/direct answers (session names, JSON summaries) come back
// with an empty content field.
Context("autoparser delivered content with empty reasoning and a prefilled thinking token", func() {
const startToken = "<think>"
It("keeps a tag-less direct answer as content instead of swallowing it as reasoning", func() {
// Model answered directly: no <think>, no </think> anywhere.
chatDeltas := []*pb.ChatDelta{
{Content: "hello", ReasoningContent: ""},
}
result := applyAutoparserOverride(chatDeltas, startToken, reason.Config{}, nil)
Expect(result).To(HaveLen(1))
Expect(result[0].Message.Content).ToNot(BeNil())
Expect(*(result[0].Message.Content.(*string))).To(Equal("hello"),
"a complete answer with no closing reasoning tag must stay in content")
Expect(result[0].Message.Reasoning).To(BeNil(),
"no reasoning block was emitted, so Reasoning must not be set")
})
It("keeps a tag-less JSON answer as content (the summary case)", func() {
raw := `{"short":"Tests pass","long":"go test ./... succeeded."}`
chatDeltas := []*pb.ChatDelta{
{Content: raw, ReasoningContent: ""},
}
result := applyAutoparserOverride(chatDeltas, startToken, reason.Config{}, nil)
Expect(result).To(HaveLen(1))
Expect(*(result[0].Message.Content.(*string))).To(Equal(raw))
Expect(result[0].Message.Reasoning).To(BeNil())
})
It("still splits reasoning when the model emits the closing tag (prefill paired with </think>)", func() {
// The legitimate prefill case: <think> was in the prompt, so the
// output carries only the closing tag. The closing tag is the proof
// that a reasoning block exists, so extraction must run.
raw := "The user wants a greeting.\n</think>\n\nHello there!"
chatDeltas := []*pb.ChatDelta{
{Content: raw, ReasoningContent: ""},
}
result := applyAutoparserOverride(chatDeltas, startToken, reason.Config{}, nil)
Expect(result).To(HaveLen(1))
content := *(result[0].Message.Content.(*string))
Expect(content).To(ContainSubstring("Hello there!"))
Expect(content).ToNot(ContainSubstring("</think>"))
Expect(content).ToNot(ContainSubstring("The user wants a greeting"))
Expect(result[0].Message.Reasoning).ToNot(BeNil())
Expect(*result[0].Message.Reasoning).To(ContainSubstring("The user wants a greeting"))
})
It("still splits a fully-tagged <think>…</think> block with a prefill token set", func() {
raw := "<think>Reasoning here.</think>Final answer."
chatDeltas := []*pb.ChatDelta{
{Content: raw, ReasoningContent: ""},
}
result := applyAutoparserOverride(chatDeltas, startToken, reason.Config{}, nil)
Expect(result).To(HaveLen(1))
Expect(*(result[0].Message.Content.(*string))).To(Equal("Final answer."))
Expect(result[0].Message.Reasoning).ToNot(BeNil())
Expect(*result[0].Message.Reasoning).To(ContainSubstring("Reasoning here"))
})
// End-to-end regression for the real production failure: a request with
// enable_thinking=false against a <think>-capable model (qwen3 family).
//
// In non-thinking mode the model emits no reasoning block, so llama.cpp's
// autoparser correctly returns ChatDeltas with Content set and
// ReasoningContent EMPTY (verified against stock llama-server: the same
// model with chat_template_kwargs.enable_thinking=false returns
// reasoning_content=null and content="hello"). But thinkingStartToken is
// detected per-model from the enable_thinking=TRUE render
// (grpc-server renders with enable_thinking=true; DetectThinkingStartToken
// does not evaluate the jinja {% if enable_thinking %} conditional), so it
// is "<think>" even for this non-thinking request. The old code prepended
// it and swallowed the answer. This is the case that broke session
// summaries and auto-titles and was NOT covered before.
It("preserves content for a non-thinking-mode request (enable_thinking=false, empty reasoning_content)", func() {
// What llama.cpp's autoparser actually returns in non-thinking mode.
chatDeltas := []*pb.ChatDelta{
{Content: `{"short":"Go tests passed for internal/session"}`, ReasoningContent: ""},
}
result := applyAutoparserOverride(chatDeltas, startToken, reason.Config{}, nil)
Expect(result).To(HaveLen(1))
Expect(*(result[0].Message.Content.(*string))).To(Equal(`{"short":"Go tests passed for internal/session"}`),
"non-thinking-mode answers must reach the client intact, not be swallowed as reasoning")
Expect(result[0].Message.Reasoning).To(BeNil())
})
})
})
var _ = Describe("mergeToolCallDeltas", func() {

View File

@@ -130,7 +130,7 @@ func CompletionEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, eva
overrides = make(map[string]pii.Action, len(raw))
for ovid, action := range raw {
switch pii.Action(action) {
case pii.ActionMask, pii.ActionBlock, pii.ActionRouteLocal:
case pii.ActionMask, pii.ActionBlock, pii.ActionAllow:
overrides[ovid] = pii.Action(action)
}
}

View File

@@ -2,8 +2,10 @@ package openai
import (
"context"
"crypto/rand"
"encoding/base64"
"encoding/binary"
"encoding/hex"
"encoding/json"
"fmt"
"math"
@@ -235,6 +237,12 @@ type Model interface {
Transcribe(ctx context.Context, audio, language string, translate bool, diarize bool, prompt string) (*schema.TranscriptionResult, error)
Predict(ctx context.Context, messages schema.Messages, images, videos, audios []string, tokenCallback func(string, backend.TokenUsage) bool, tools []types.ToolUnion, toolChoice *types.ToolChoiceUnion, logprobs *int, topLogprobs *int, logitBias map[string]float64) (func() (backend.LLMResponse, error), error)
TTS(ctx context.Context, text, voice, language string) (string, *proto.Result, error)
// TTSStream synthesizes speech incrementally, invoking onAudio with raw PCM
// chunks (and the backend sample rate) as they are produced.
TTSStream(ctx context.Context, text, voice, language string, onAudio func(pcm []byte, sampleRate int) error) error
// TranscribeStream transcribes audio incrementally, invoking onDelta for each
// transcript text fragment and returning the final aggregated result.
TranscribeStream(ctx context.Context, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error)
PredictConfig() *config.ModelConfig
}
@@ -1254,27 +1262,15 @@ func commitUtterance(ctx context.Context, utt []byte, session *Session, conv *Co
// TODO: If we have a real any-to-any model then transcription is optional
var transcript string
if session.InputAudioTranscription != nil {
tr, err := session.ModelInterface.Transcribe(ctx, f.Name(), session.InputAudioTranscription.Language, false, false, session.InputAudioTranscription.Prompt)
// emitTranscription streams transcript deltas when
// pipeline.streaming.transcription is set, otherwise emits a single
// completed event; either way it returns the final transcript text.
var err error
transcript, err = emitTranscription(ctx, t, session, generateItemID(), f.Name())
if err != nil {
sendError(t, "transcription_failed", err.Error(), "", "event_TODO")
return
} else if tr == nil {
sendError(t, "transcription_failed", "trancribe result is nil", "", "event_TODO")
return
}
transcript = tr.Text
sendEvent(t, types.ConversationItemInputAudioTranscriptionCompletedEvent{
ServerEventBase: types.ServerEventBase{
EventID: "event_TODO",
},
ItemID: generateItemID(),
// ResponseID: "resp_TODO", // Not needed for transcription completed event
// OutputIndex: 0,
ContentIndex: 0,
Transcript: transcript,
})
} else {
sendNotImplemented(t, "any-to-any models")
return
@@ -1502,6 +1498,26 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
},
})
// Streamed LLM path: when the pipeline opts into LLM streaming, stream the
// transcript to the client as it is generated and synthesize the buffered
// message once. Tool turns are supported only when the model uses its
// tokenizer template: the C++ autoparser then delivers content and tool
// calls via ChatDeltas (clearing the text stream), so the spoken transcript
// never leaks tool-call tokens. Grammar-based function calling emits the
// call as JSON in the token stream, so those turns keep the buffered path.
if config != nil && session.ModelConfig != nil && session.ModelConfig.Pipeline.StreamLLM() {
canStream := len(tools) == 0 || config.TemplateConfig.UseTokenizerTemplate
var respMods []types.Modality
if overrides != nil {
respMods = overrides.OutputModalities
}
if canStream && modalitiesContainAudio(resolveOutputModalities(session.OutputModalities, respMods)) {
if streamLLMResponse(ctx, session, conv, t, responseID, conversationHistory, images, config, tools, toolChoice, toolTurn) {
return
}
}
}
predFunc, err := session.ModelInterface.Predict(ctx, conversationHistory, images, nil, nil, nil, tools, toolChoice, nil, nil, nil)
if err != nil {
sendError(t, "inference_failed", fmt.Sprintf("backend error: %v", err), "", "") // item.Assistant.ID is unknown here
@@ -1579,7 +1595,7 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
// ExtractReasoningWithConfig is a no-op when no tag pair matches,
// so it's safe to apply unconditionally in the no-reasoning branch.
if deltaReasoning == "" && deltaContent != "" {
deltaReasoning, deltaContent = reasoning.ExtractReasoningWithConfig(deltaContent, thinkingStartToken, config.ReasoningConfig)
deltaReasoning, deltaContent = reasoning.ExtractReasoningComplete(deltaContent, thinkingStartToken, spokenReasoningConfig(config.ReasoningConfig))
}
reasoningText = deltaReasoning
responseWithoutReasoning = deltaContent
@@ -1587,7 +1603,7 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
cleanedResponse = deltaContent
toolCalls = deltaToolCalls
} else {
reasoningText, responseWithoutReasoning = reasoning.ExtractReasoningWithConfig(rawResponse, thinkingStartToken, config.ReasoningConfig)
reasoningText, responseWithoutReasoning = reasoning.ExtractReasoningComplete(rawResponse, thinkingStartToken, spokenReasoningConfig(config.ReasoningConfig))
textContent = functions.ParseTextContent(responseWithoutReasoning, config.FunctionsConfig)
cleanedResponse = functions.CleanupLLMResult(responseWithoutReasoning, config.FunctionsConfig)
toolCalls = functions.ParseFunctionCall(cleanedResponse, config.FunctionsConfig)
@@ -1713,64 +1729,7 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
return
}
audioFilePath, res, err := session.ModelInterface.TTS(ctx, finalSpeech, session.Voice, session.InputAudioTranscription.Language)
if err != nil {
if ctx.Err() != nil {
xlog.Debug("TTS cancelled (barge-in)")
sendCancelledResponse()
return
}
xlog.Error("TTS failed", "error", err)
sendError(t, "tts_error", fmt.Sprintf("TTS generation failed: %v", err), "", item.Assistant.ID)
return
}
if !res.Success {
xlog.Error("TTS failed", "message", res.Message)
sendError(t, "tts_error", fmt.Sprintf("TTS generation failed: %s", res.Message), "", item.Assistant.ID)
return
}
defer func() { _ = os.Remove(audioFilePath) }()
audioBytes, err := os.ReadFile(audioFilePath)
if err != nil {
xlog.Error("failed to read TTS file", "error", err)
sendError(t, "tts_error", fmt.Sprintf("Failed to read TTS audio: %v", err), "", item.Assistant.ID)
return
}
// Parse WAV header to get raw PCM and the actual sample rate from the TTS backend.
pcmData, ttsSampleRate := laudio.ParseWAV(audioBytes)
if ttsSampleRate == 0 {
ttsSampleRate = localSampleRate
}
xlog.Debug("TTS audio parsed", "raw_bytes", len(audioBytes), "pcm_bytes", len(pcmData), "sample_rate", ttsSampleRate)
// SendAudio (WebRTC) passes PCM at the TTS sample rate directly to the
// Opus encoder, which resamples to 48kHz internally. This avoids a
// lossy intermediate resample through 16kHz.
// XXX: This is a noop in websocket mode; it's included in the JSON instead
if err := t.SendAudio(ctx, pcmData, ttsSampleRate); err != nil {
if ctx.Err() != nil {
xlog.Debug("Audio playback cancelled (barge-in)")
sendCancelledResponse()
return
}
xlog.Error("failed to send audio via transport", "error", err)
}
// For WebSocket clients, resample to the session's output rate and
// deliver audio as base64 in JSON events. WebRTC clients already
// received audio over the RTP track, so skip the base64 payload.
if !isWebRTC {
wsPCM := pcmData
if ttsSampleRate != session.OutputSampleRate {
samples := sound.BytesToInt16sLE(pcmData)
resampled := sound.ResampleInt16(samples, ttsSampleRate, session.OutputSampleRate)
wsPCM = sound.Int16toBytesLE(resampled)
}
audioString = base64.StdEncoding.EncodeToString(wsPCM)
}
// Transcript of the spoken reply (the audio's text).
sendEvent(t, types.ResponseOutputAudioTranscriptDeltaEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
@@ -1788,15 +1747,26 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
Transcript: finalSpeech,
})
// Synthesize and send the audio. With pipeline.streaming.tts enabled
// emitSpeech forwards a response.output_audio.delta per backend PCM
// chunk as it's produced; otherwise it sends the whole utterance as a
// single delta. The returned PCM is stored (base64) on the item below.
pcmAudio, err := emitSpeech(ctx, t, session, responseID, item.Assistant.ID, finalSpeech)
if err != nil {
if ctx.Err() != nil {
xlog.Debug("TTS cancelled (barge-in)")
sendCancelledResponse()
return
}
xlog.Error("TTS failed", "error", err)
sendError(t, "tts_error", fmt.Sprintf("TTS generation failed: %v", err), "", item.Assistant.ID)
return
}
if !isWebRTC {
audioString = base64.StdEncoding.EncodeToString(pcmAudio)
}
if !isWebRTC {
sendEvent(t, types.ResponseOutputAudioDeltaEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
ItemID: item.Assistant.ID,
OutputIndex: 0,
ContentIndex: 0,
Delta: audioString,
})
sendEvent(t, types.ResponseOutputAudioDoneEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
@@ -1849,17 +1819,27 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
})
}
// Handle Tool Calls. Two paths:
// - LocalAI Assistant tools (session.AssistantExecutor.IsTool) run
// server-side; we append both the call and its output to conv.Items
// and re-trigger a follow-up response so the model can speak the
// result. The client only sees observability events.
// - All other tools follow the standard OpenAI flow: emit
// function_call_arguments.done and wait for the client to send
// conversation.item.create back.
xlog.Debug("About to handle tool calls", "finalToolCallsCount", len(finalToolCalls))
// Emit the parsed tool calls, the terminal response.done, and (for
// server-side assistant tools) the follow-up response. Shared with the
// streamed path so both finalize tool calls identically.
emitToolCallItems(ctx, session, conv, t, responseID, finalToolCalls, finalSpeech != "", toolTurn)
}
// emitToolCallItems emits the realtime function_call items for the parsed tool
// calls, the terminal response.done, and — for server-side LocalAI Assistant
// tools — re-triggers a follow-up response so the model can speak the result.
// hasContent shifts the tool-call output index past the assistant content item
// when the same turn also produced spoken/text content. Two tool paths:
// - LocalAI Assistant tools (session.AssistantExecutor.IsTool) run server-side;
// we append both the call and its output to conv.Items and re-trigger. The
// client only sees observability events.
// - All other tools follow the standard OpenAI flow: emit
// function_call_arguments.done and wait for the client to send
// conversation.item.create back.
func emitToolCallItems(ctx context.Context, session *Session, conv *Conversation, t Transport, responseID string, toolCalls []functions.FuncCallResults, hasContent bool, toolTurn int) {
xlog.Debug("About to handle tool calls", "finalToolCallsCount", len(toolCalls))
executedAssistantTool := false
for i, tc := range finalToolCalls {
for i, tc := range toolCalls {
toolCallID := generateItemID()
callID := "call_" + generateUniqueID() // OpenAI uses call_xyz
@@ -1879,7 +1859,7 @@ func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversa
conv.Lock.Unlock()
outputIndex := i
if finalSpeech != "" {
if hasContent {
outputIndex++
}
@@ -2005,8 +1985,11 @@ func generateItemID() string {
}
func generateUniqueID() string {
// Generate a unique ID string
// For simplicity, use a counter or UUID
// Implement as needed
return "unique_id"
// 16 random bytes, hex-encoded. Must be collision-free: session, item,
// response and call IDs build on this, and the conversation tracks/removes
// items by ID (e.g. cancel() in realtime_stream.go, conversation.item.retrieve).
// A constant would make every ID alias and corrupt that bookkeeping.
var b [16]byte
_, _ = rand.Read(b[:])
return hex.EncodeToString(b[:])
}

View File

@@ -0,0 +1,200 @@
package openai
import (
"strings"
"unicode"
"unicode/utf8"
"github.com/rivo/uniseg"
)
// Default clause-chunker bounds (in runes). minRunes gates only sub-sentence
// (clause-mark / Thai-space) cuts so we don't synthesize tiny choppy fragments;
// full sentences always flush regardless of length. maxRunes caps an
// unterminated run so a long punctuation-less span doesn't buffer unbounded.
const (
defaultClauseMinRunes = 12
defaultClauseMaxRunes = 200
)
// clauseChunker splits streamed LLM content into speakable clauses for
// incremental TTS, in a SCRIPT-AWARE way so it works for languages without
// whitespace word boundaries. It leans on UAX #29 sentence segmentation (which
// natively terminates on CJK 。!? as well as Latin .!?), adds CJK clause
// punctuation (,、;:) and Thai/Lao spaces as finer boundaries, and caps an
// over-long unterminated run via UAX #14 line-break opportunities.
//
// Unlike the old ASCII .!?/newline segmenter (dropped in 076dcdbe), it does not
// degrade to whole-message buffering for CJK (handled natively) or Thai/Lao
// (handled via spaces, which Thai uses at clause/sentence boundaries). Scripts
// that genuinely need a dictionary (Khmer/Myanmar) simply stay buffered until a
// space or end-of-message — no worse than the buffered default.
//
// It is not safe for concurrent use; callers feed it from a single goroutine
// (the LLM token callback).
type clauseChunker struct {
buf strings.Builder
minRunes int
maxRunes int
}
func newClauseChunker(minRunes, maxRunes int) *clauseChunker {
return &clauseChunker{minRunes: minRunes, maxRunes: maxRunes}
}
// push appends streamed content and returns any clauses that are now complete —
// "complete" meaning confirmed by following content, so we never speak a clause
// that the next token might extend. Incomplete trailing text stays buffered.
func (c *clauseChunker) push(text string) []string {
c.buf.WriteString(text)
return c.drain(false)
}
// flush returns the remaining buffered clauses, treating end-of-input as a hard
// boundary, and clears the buffer.
func (c *clauseChunker) flush() []string {
return c.drain(true)
}
func (c *clauseChunker) drain(final bool) []string {
s := c.buf.String()
rest := s
var out []string
for rest != "" {
end, ok := c.nextBoundary(rest, final)
if !ok {
break
}
if seg := strings.TrimSpace(rest[:end]); seg != "" {
out = append(out, seg)
}
rest = rest[end:]
}
// Rewriting the builder reallocates and copies the whole buffer; skip it on
// the common per-token call where no boundary was confirmed.
if len(rest) != len(s) {
c.buf.Reset()
c.buf.WriteString(rest)
}
return out
}
// nextBoundary returns the byte offset just past the first emittable clause in
// s, or ok=false when more input is needed (final=false) and no boundary is
// confirmed yet.
func (c *clauseChunker) nextBoundary(s string, final bool) (int, bool) {
if s == "" {
return 0, false
}
// 1) UAX #29 sentence boundary. When the first sentence is followed by more
// text it is a confirmed complete sentence (handles Latin .!? with
// abbreviation/decimal guards, and CJK 。!? with no whitespace).
sentence, rest, _ := uniseg.FirstSentenceInString(s, -1)
if rest != "" {
// Optionally cut finer inside the sentence at a clause boundary.
if cut, ok := c.firstClauseCut(sentence); ok {
return cut, true
}
return len(sentence), true
}
// 2) Unterminated tail: look for a sub-sentence clause boundary (CJK
// punctuation or a Thai/Lao space) confirmed by following content.
if cut, ok := c.firstClauseCut(s); ok {
return cut, true
}
// 3) Over-long punctuation-less run: force a typographically legal break so
// we don't buffer unbounded (e.g. a long CJK run with no punctuation).
if !final && c.maxRunes > 0 && utf8.RuneCountInString(s) > c.maxRunes {
if cut, ok := lineBreakCut(s, c.maxRunes); ok {
return cut, true
}
}
// 4) End of input: emit whatever remains as the final clause.
if final {
return len(s), true
}
return 0, false
}
// firstClauseCut returns the byte offset just past the first sub-sentence clause
// boundary in s — a CJK clause punctuation mark, or a space following a Thai/Lao
// letter — provided the prefix is at least minRunes long and non-space content
// follows. The boundary mark (and any trailing spaces) stay with the left clause.
func (c *clauseChunker) firstClauseCut(s string) (int, bool) {
var prev rune
runes := 0
for i, r := range s {
boundary := isCJKClausePunct(r) || (unicode.IsSpace(r) && isThaiLao(prev))
if boundary && runes+1 >= c.minRunes {
end := i + utf8.RuneLen(r)
for end < len(s) {
nr, sz := utf8.DecodeRuneInString(s[end:])
if !unicode.IsSpace(nr) {
break
}
end += sz
}
if end < len(s) { // confirmed: real content follows the boundary
return end, true
}
// Boundary sits at the end of the buffer with nothing after it yet —
// wait for the next token to confirm it rather than emit early.
return 0, false
}
prev = r
runes++
}
return 0, false
}
// lineBreakCut walks UAX #14 line segments and returns the byte offset of the
// last legal break opportunity at or before maxRunes. Returns ok=false when the
// run has no internal break opportunity (e.g. a space-less Thai run), leaving it
// buffered.
func lineBreakCut(s string, maxRunes int) (int, bool) {
state := -1
rest := s
consumed := 0
runes := 0
for rest != "" {
seg, rem, _, st := uniseg.FirstLineSegmentInString(rest, state)
state = st
runes += utf8.RuneCountInString(seg)
consumed += len(seg)
rest = rem
if runes >= maxRunes {
if consumed < len(s) {
return consumed, true
}
return 0, false
}
}
return 0, false
}
// isCJKClausePunct reports whether r is a CJK clause-level separator worth a
// soft TTS break. Sentence terminators (。!?) are intentionally excluded — UAX
// #29 sentence segmentation already handles those.
func isCJKClausePunct(r rune) bool {
switch r {
case '', // fullwidth comma
'、', // 、 ideographic comma
'', // fullwidth semicolon
'', // fullwidth colon
'・', // ・ katakana middle dot
'・': // ・ halfwidth katakana middle dot
return true
}
return false
}
// isThaiLao reports whether r is a Thai or Lao letter. Those scripts have no
// inter-word spaces; an ASCII space inside such a run marks a clause/sentence
// boundary, which is the only no-dictionary segmentation signal available.
func isThaiLao(r rune) bool {
return unicode.Is(unicode.Thai, r) || unicode.Is(unicode.Lao, r)
}

View File

@@ -0,0 +1,103 @@
package openai
import (
"strings"
"unicode/utf8"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
// clauseChunker splits streamed LLM content into speakable clauses in a
// script-aware way: UAX#29 sentences (Latin .!? and CJK 。!?), CJK clause
// punctuation, and Thai/Lao spaces — never whitespace-splitting CJK.
var _ = Describe("clauseChunker", func() {
Context("Latin sentences", func() {
It("emits a sentence only once following content confirms it is complete", func() {
c := newClauseChunker(12, 200)
Expect(c.push("Hello world. How are you?")).To(Equal([]string{"Hello world."}))
// The trailing sentence is held until flush (the next token might extend it).
Expect(c.flush()).To(Equal([]string{"How are you?"}))
})
It("assembles a sentence across many small tokens", func() {
c := newClauseChunker(12, 200)
var got []string
for _, tok := range []string{"Hello", " world.", " How", " are", " you?"} {
got = append(got, c.push(tok)...)
}
got = append(got, c.flush()...)
Expect(got).To(Equal([]string{"Hello world.", "How are you?"}))
})
It("does not split decimals or abbreviations (UAX#29 SB6)", func() {
c := newClauseChunker(12, 200)
got := c.push("Pi is 3.14 and e is 2.72. Done")
Expect(got).To(Equal([]string{"Pi is 3.14 and e is 2.72."}))
Expect(c.flush()).To(Equal([]string{"Done"}))
})
})
Context("CJK (no whitespace)", func() {
It("splits Chinese on the ideographic full stop", func() {
c := newClauseChunker(12, 200)
Expect(c.push("你好世界。今天天气很好。")).To(Equal([]string{"你好世界。"}))
Expect(c.flush()).To(Equal([]string{"今天天气很好。"}))
})
It("splits Japanese on the ideographic full stop", func() {
c := newClauseChunker(12, 200)
Expect(c.push("こんにちは。元気ですか。")).To(Equal([]string{"こんにちは。"}))
Expect(c.flush()).To(Equal([]string{"元気ですか。"}))
})
It("splits on CJK clause punctuation for lower latency", func() {
c := newClauseChunker(2, 200) // small min so short test clauses cut
Expect(c.push("你好,世界。再见")).To(Equal([]string{"你好,", "世界。"}))
Expect(c.flush()).To(Equal([]string{"再见"}))
})
})
Context("Thai (spaces mark clauses, not words)", func() {
It("splits a Thai run on the inter-clause space", func() {
c := newClauseChunker(2, 200)
Expect(c.push("สวัสดีครับ กินข้าวไหม")).To(Equal([]string{"สวัสดีครับ"}))
Expect(c.flush()).To(Equal([]string{"กินข้าวไหม"}))
})
It("never shatters a space-less Thai run into characters", func() {
c := newClauseChunker(2, 200)
Expect(c.push("สวัสดีครับ")).To(BeEmpty()) // held, no boundary
Expect(c.flush()).To(Equal([]string{"สวัสดีครับ"}))
})
})
Context("length cap (UAX#14 fallback)", func() {
It("force-breaks an over-long punctuation-less CJK run at legal points", func() {
c := newClauseChunker(4, 10) // maxRunes = 10
run := strings.Repeat("字", 25)
got := c.push(run)
got = append(got, c.flush()...)
total := 0
for _, seg := range got {
n := utf8.RuneCountInString(seg)
Expect(n).To(BeNumerically("<=", 10)) // never exceeds the cap
total += n
}
Expect(total).To(Equal(25)) // nothing dropped
Expect(len(got)).To(BeNumerically(">=", 3)) // 10 + 10 + 5
})
})
Context("buffer lifecycle", func() {
It("flush clears the buffer so the chunker is reusable", func() {
c := newClauseChunker(12, 200)
// "First one." is confirmed by the following "Second", so push drains it;
// only the unterminated tail remains for flush.
Expect(c.push("First one. Second")).To(Equal([]string{"First one."}))
Expect(c.flush()).To(Equal([]string{"Second"}))
Expect(c.flush()).To(BeEmpty())
Expect(c.push("Again. More")).To(Equal([]string{"Again."}))
})
})
})

View File

@@ -0,0 +1,138 @@
package openai
import (
"context"
"strings"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/grpc/proto"
)
// fakeTransport records the server events and audio sent to a realtime client
// so streaming behaviour can be asserted without a real WebSocket/WebRTC peer.
// It is not a *WebRTCTransport, so handler code takes the WebSocket path.
type fakeTransport struct {
events []types.ServerEvent
audio []fakeAudioChunk
}
type fakeAudioChunk struct {
pcm []byte
sampleRate int
}
func (f *fakeTransport) SendEvent(e types.ServerEvent) error {
f.events = append(f.events, e)
return nil
}
func (f *fakeTransport) ReadEvent() ([]byte, error) { return nil, nil }
func (f *fakeTransport) SendAudio(_ context.Context, pcm []byte, sampleRate int) error {
f.audio = append(f.audio, fakeAudioChunk{pcm: pcm, sampleRate: sampleRate})
return nil
}
func (f *fakeTransport) Close() error { return nil }
// countEvents returns how many recorded events have the given type.
func (f *fakeTransport) countEvents(et types.ServerEventType) int {
n := 0
for _, e := range f.events {
if e.ServerEventType() == et {
n++
}
}
return n
}
// transcriptDeltaText concatenates the Delta of every recorded transcript
// delta event — i.e. the text streamed to the client as it is generated.
func (f *fakeTransport) transcriptDeltaText() string {
var b strings.Builder
for _, e := range f.events {
if d, ok := e.(types.ResponseOutputAudioTranscriptDeltaEvent); ok {
b.WriteString(d.Delta)
}
}
return b.String()
}
// fakeModel is a configurable Model double. TTSStream replays ttsStreamChunks
// and TranscribeStream replays transcribeDeltas, so the handler's streaming
// paths can be driven deterministically.
type fakeModel struct {
cfg *config.ModelConfig
ttsFile string
ttsStreamChunks [][]byte
ttsStreamRate int
ttsStreamErr error
transcribeDeltas []string
transcribeFinal *schema.TranscriptionResult
// Predict streaming: predictTokens are replayed through the token callback
// (simulating streamed LLM output); predictResp/predictErr are returned by
// the deferred predict function. predictChunkDeltas, when set, are delivered
// per-token via TokenUsage.ChatDeltas to exercise the autoparser path.
predictTokens []string
predictChunkDeltas [][]*proto.ChatDelta
predictResp backend.LLMResponse
predictErr error
}
func (m *fakeModel) VAD(context.Context, *schema.VADRequest) (*schema.VADResponse, error) {
return nil, nil
}
func (m *fakeModel) Transcribe(context.Context, string, string, bool, bool, string) (*schema.TranscriptionResult, error) {
return m.transcribeFinal, nil
}
func (m *fakeModel) Predict(_ context.Context, _ schema.Messages, _, _, _ []string, cb func(string, backend.TokenUsage) bool, _ []types.ToolUnion, _ *types.ToolChoiceUnion, _, _ *int, _ map[string]float64) (func() (backend.LLMResponse, error), error) {
if m.predictErr != nil {
return nil, m.predictErr
}
return func() (backend.LLMResponse, error) {
for i, tok := range m.predictTokens {
if cb == nil {
continue
}
usage := backend.TokenUsage{}
if i < len(m.predictChunkDeltas) {
usage.ChatDeltas = m.predictChunkDeltas[i]
}
cb(tok, usage)
}
return m.predictResp, nil
}, nil
}
func (m *fakeModel) TTS(context.Context, string, string, string) (string, *proto.Result, error) {
return m.ttsFile, &proto.Result{Success: true}, nil
}
func (m *fakeModel) TTSStream(_ context.Context, _, _, _ string, onAudio func(pcm []byte, sampleRate int) error) error {
if m.ttsStreamErr != nil {
return m.ttsStreamErr
}
for _, c := range m.ttsStreamChunks {
if err := onAudio(c, m.ttsStreamRate); err != nil {
return err
}
}
return nil
}
func (m *fakeModel) TranscribeStream(_ context.Context, _, _ string, _, _ bool, _ string, onDelta func(text string)) (*schema.TranscriptionResult, error) {
for _, d := range m.transcribeDeltas {
onDelta(d)
}
return m.transcribeFinal, nil
}
func (m *fakeModel) PredictConfig() *config.ModelConfig { return m.cfg }

View File

@@ -3,6 +3,7 @@ package openai
import (
"context"
"crypto/rand"
"encoding/binary"
"encoding/hex"
"encoding/json"
"fmt"
@@ -87,6 +88,14 @@ func (m *transcriptOnlyModel) TTS(ctx context.Context, text, voice, language str
return "", nil, fmt.Errorf("TTS not supported in transcript-only mode")
}
func (m *transcriptOnlyModel) TTSStream(ctx context.Context, text, voice, language string, onAudio func(pcm []byte, sampleRate int) error) error {
return fmt.Errorf("TTS not supported in transcript-only mode")
}
func (m *transcriptOnlyModel) TranscribeStream(ctx context.Context, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error) {
return transcribeStream(ctx, m.modelLoader, *m.TranscriptionConfig, m.appConfig, audio, language, translate, diarize, prompt, onDelta)
}
func (m *transcriptOnlyModel) PredictConfig() *config.ModelConfig {
return nil
}
@@ -321,10 +330,75 @@ func (m *wrappedModel) TTS(ctx context.Context, text, voice, language string) (s
return backend.ModelTTS(ctx, text, voice, language, "", nil, m.modelLoader, m.appConfig, *m.TTSConfig)
}
func (m *wrappedModel) TTSStream(ctx context.Context, text, voice, language string, onAudio func(pcm []byte, sampleRate int) error) error {
return ttsStream(ctx, m.modelLoader, m.appConfig, *m.TTSConfig, text, voice, language, onAudio)
}
func (m *wrappedModel) TranscribeStream(ctx context.Context, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error) {
return transcribeStream(ctx, m.modelLoader, *m.TranscriptionConfig, m.appConfig, audio, language, translate, diarize, prompt, onDelta)
}
func (m *wrappedModel) PredictConfig() *config.ModelConfig {
return m.LLMConfig
}
// wavStreamHeaderBytes is the size of the WAV header that backend.ModelTTSStream
// emits as its first audio callback; the sample rate lives at byte offset 24.
const wavStreamHeaderBytes = 44
// ttsStream adapts backend.ModelTTSStream (which emits a WAV stream: a 44-byte
// header carrying the sample rate, then raw PCM) to the realtime onAudio
// callback, which wants raw PCM plus the sample rate. The header is buffered
// until complete, the sample rate is read from it, and subsequent bytes are
// forwarded as PCM.
func ttsStream(ctx context.Context, ml *model.ModelLoader, appConfig *config.ApplicationConfig, ttsConfig config.ModelConfig, text, voice, language string, onAudio func(pcm []byte, sampleRate int) error) error {
var header []byte
headerDone := false
sampleRate := 0
return backend.ModelTTSStream(ctx, text, voice, language, "", nil, ml, appConfig, ttsConfig, func(b []byte) error {
if headerDone {
if len(b) == 0 {
return nil
}
return onAudio(b, sampleRate)
}
header = append(header, b...)
if len(header) < wavStreamHeaderBytes {
return nil
}
sampleRate = int(binary.LittleEndian.Uint32(header[24:28]))
headerDone = true
if len(header) > wavStreamHeaderBytes {
return onAudio(header[wavStreamHeaderBytes:], sampleRate)
}
return nil
})
}
// transcribeStream adapts backend.ModelTranscriptionStream to the realtime
// onDelta callback, returning the final aggregated transcription result.
func transcribeStream(ctx context.Context, ml *model.ModelLoader, transcriptionConfig config.ModelConfig, appConfig *config.ApplicationConfig, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error) {
var final *schema.TranscriptionResult
err := backend.ModelTranscriptionStream(ctx, backend.TranscriptionRequest{
Audio: audio,
Language: language,
Translate: translate,
Diarize: diarize,
Prompt: prompt,
}, ml, transcriptionConfig, appConfig, func(chunk backend.TranscriptionStreamChunk) {
if chunk.Delta != "" {
onDelta(chunk.Delta)
}
if chunk.Final != nil {
final = chunk.Final
}
})
if err != nil {
return nil, err
}
return final, nil
}
func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) (Model, *config.ModelConfig, error) {
cfgVAD, err := cl.LoadModelConfigFileByName(pipeline.VAD, ml.ModelPath)
if err != nil {
@@ -377,13 +451,14 @@ func buildRealtimeRoutingContext(a *application.Application, sessionID string) *
return nil
}
deps := &middleware.ClassifierDeps{
Scorer: a.Scorer,
Embedder: a.Embedder,
VectorStore: a.VectorStore,
Reranker: a.Reranker,
ModelLookup: a.ModelConfigLookup(),
Registry: a.RouterClassifierRegistry(),
Evaluator: a.TemplatesEvaluator(),
Scorer: a.Scorer,
TokenCounter: a.TokenCounter,
Embedder: a.Embedder,
VectorStore: a.VectorStore,
Reranker: a.Reranker,
ModelLookup: a.ModelConfigLookup(),
Registry: a.RouterClassifierRegistry(),
Evaluator: a.TemplatesEvaluator(),
}
userID := ""
if u := a.FallbackUser(); u != nil {
@@ -454,8 +529,10 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
return nil, fmt.Errorf("failed to validate config: %w", err)
}
// Let the pipeline set the LLM's reasoning effort (cfgLLM is a per-session copy).
// Let the pipeline set the LLM's reasoning effort and force thinking off
// (cfgLLM is a per-session copy). disable_thinking applies after the effort.
applyPipelineReasoning(cfgLLM, *pipeline)
applyPipelineThinking(cfgLLM, *pipeline)
cfgTTS, err := cl.LoadModelConfigFileByName(pipeline.TTS, ml.ModelPath)
if err != nil {

View File

@@ -0,0 +1,102 @@
package openai
import (
"context"
"encoding/base64"
"fmt"
"os"
"path/filepath"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
laudio "github.com/mudler/LocalAI/pkg/audio"
"github.com/mudler/LocalAI/pkg/sound"
)
// emitSpeech synthesizes text and sends the audio to the client. When the
// pipeline opts into TTS streaming it forwards each PCM chunk as its own
// response.output_audio.delta as soon as the backend produces it; otherwise it
// synthesizes the whole utterance and sends it as a single delta.
//
// It deliberately does NOT emit transcript or audio-done events: the caller owns
// those so a streamed reply can be split into several spoken segments that share
// one response/item.
//
// It returns the PCM audio (at the session output rate) accumulated across all
// chunks, which the caller base64-encodes onto the conversation item. For WebRTC
// the audio goes over the RTP track instead, so the returned slice is empty.
func emitSpeech(ctx context.Context, t Transport, session *Session, responseID, itemID, text string) ([]byte, error) {
if text == "" {
return nil, nil
}
_, isWebRTC := t.(*WebRTCTransport)
var wsAudio []byte // PCM at the session output rate, accumulated for the item record
// sendChunk hands one PCM buffer to the transport: WebRTC consumes the raw
// PCM directly (it resamples internally); WebSocket gets base64 PCM at the
// session output rate via a JSON delta event.
sendChunk := func(pcm []byte, sampleRate int) error {
if len(pcm) == 0 {
return nil
}
if err := t.SendAudio(ctx, pcm, sampleRate); err != nil {
return err
}
if isWebRTC {
return nil
}
wsPCM := pcm
if sampleRate != 0 && sampleRate != session.OutputSampleRate {
samples := sound.BytesToInt16sLE(pcm)
resampled := sound.ResampleInt16(samples, sampleRate, session.OutputSampleRate)
wsPCM = sound.Int16toBytesLE(resampled)
}
wsAudio = append(wsAudio, wsPCM...)
return t.SendEvent(types.ResponseOutputAudioDeltaEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
ItemID: itemID,
OutputIndex: 0,
ContentIndex: 0,
Delta: base64.StdEncoding.EncodeToString(wsPCM),
})
}
language := ""
if session.InputAudioTranscription != nil {
language = session.InputAudioTranscription.Language
}
if session.ModelConfig != nil && session.ModelConfig.Pipeline.StreamTTS() {
if err := session.ModelInterface.TTSStream(ctx, text, session.Voice, language, sendChunk); err != nil {
return nil, err
}
return wsAudio, nil
}
// Unary fallback: synthesize the whole utterance to a file, then emit once.
audioFilePath, res, err := session.ModelInterface.TTS(ctx, text, session.Voice, language)
if err != nil {
return nil, err
}
if res != nil && !res.Success {
return nil, fmt.Errorf("tts generation failed: %s", res.Message)
}
defer func() { _ = os.Remove(audioFilePath) }()
// filepath.Clean normalizes the backend-produced temp path before reading
// (also keeps gosec G304 quiet — the path is backend-controlled, not user input).
audioBytes, err := os.ReadFile(filepath.Clean(audioFilePath))
if err != nil {
return nil, fmt.Errorf("read tts audio: %w", err)
}
pcm, sampleRate := laudio.ParseWAV(audioBytes)
if sampleRate == 0 {
sampleRate = session.OutputSampleRate
}
if err := sendChunk(pcm, sampleRate); err != nil {
return nil, err
}
return wsAudio, nil
}

View File

@@ -0,0 +1,70 @@
package openai
import (
"context"
"os"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
laudio "github.com/mudler/LocalAI/pkg/audio"
)
// emitSpeech synthesizes a piece of text and forwards the audio to the client,
// streaming a delta per TTS chunk when the pipeline opts in, or sending the
// whole utterance as one delta otherwise.
var _ = Describe("emitSpeech", func() {
ttsOn := true
streamingSession := func(m Model) *Session {
return &Session{
OutputSampleRate: 24000,
ModelInterface: m,
ModelConfig: &config.ModelConfig{
Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{TTS: &ttsOn}},
},
}
}
It("streams one output_audio.delta per TTS chunk when streaming is enabled", func() {
m := &fakeModel{
ttsStreamChunks: [][]byte{{1, 2}, {3, 4}, {5, 6}},
ttsStreamRate: 24000,
}
t := &fakeTransport{}
audio, err := emitSpeech(context.Background(), t, streamingSession(m), "resp1", "item1", "Hello there.")
Expect(err).ToNot(HaveOccurred())
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioDelta)).To(Equal(3))
// The returned audio is all chunks concatenated (session output rate).
Expect(audio).To(Equal([]byte{1, 2, 3, 4, 5, 6}))
})
It("sends a single output_audio.delta in unary mode", func() {
// A minimal real WAV file for the unary TTS path to read + parse.
f, err := os.CreateTemp("", "emit-*.wav")
Expect(err).ToNot(HaveOccurred())
defer func() { _ = os.Remove(f.Name()) }()
pcm := make([]byte, 320) // 160 samples of silence
hdr := laudio.NewWAVHeader(uint32(len(pcm)))
Expect(hdr.Write(f)).To(Succeed())
_, err = f.Write(pcm)
Expect(err).ToNot(HaveOccurred())
Expect(f.Close()).To(Succeed())
session := &Session{
OutputSampleRate: 24000,
ModelInterface: &fakeModel{ttsFile: f.Name()},
ModelConfig: &config.ModelConfig{}, // streaming off
}
t := &fakeTransport{}
_, err = emitSpeech(context.Background(), t, session, "resp1", "item1", "Hello there.")
Expect(err).ToNot(HaveOccurred())
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioDelta)).To(Equal(1))
})
})

View File

@@ -0,0 +1,315 @@
package openai
import (
"context"
"encoding/base64"
"fmt"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/functions"
"github.com/mudler/LocalAI/pkg/reasoning"
)
// transcriptStreamer turns streamed LLM tokens into the assistant's spoken
// transcript: it strips reasoning incrementally and sends one
// response.output_audio_transcript.delta per content fragment. It does NOT
// synthesize audio — the caller buffers the full message and synthesizes it
// once (streaming the audio chunks when the TTS backend supports TTSStream),
// which works uniformly for streaming and non-streaming TTS and for languages
// without sentence or word boundaries.
type transcriptStreamer struct {
ctx context.Context
t Transport
responseID string
itemID string
extractor *reasoning.ReasoningExtractor
// announce, if set, is invoked once just before the first transcript delta.
// It lets the caller create the assistant item lazily, so a content-less
// tool-call turn never emits a spurious empty assistant item.
announce func()
announced bool
}
func newTranscriptStreamer(ctx context.Context, t Transport, responseID, itemID, thinkingStartToken string, reasoningCfg reasoning.Config) *transcriptStreamer {
return &transcriptStreamer{
ctx: ctx,
t: t,
responseID: responseID,
itemID: itemID,
extractor: reasoning.NewReasoningExtractor(thinkingStartToken, spokenReasoningConfig(reasoningCfg)),
}
}
// onToken handles one streamed unit of model output, sending a transcript delta
// for the new content (reasoning stripped) and returning that content delta so
// the caller can also feed it to the clause chunker. For plain-content models
// the unit is the raw text token; for autoparser tool turns the backend clears
// the text and delivers content via ChatDeltas, so the caller passes that
// content here. Returns "" when the token produced no new spoken content.
func (s *transcriptStreamer) onToken(token string) string {
_, content := s.extractor.ProcessToken(token)
if content == "" {
return ""
}
if !s.announced {
s.announced = true
if s.announce != nil {
s.announce()
}
}
_ = s.t.SendEvent(types.ResponseOutputAudioTranscriptDeltaEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: s.responseID,
ItemID: s.itemID,
OutputIndex: 0,
ContentIndex: 0,
Delta: content,
})
return content
}
// content returns the full transcript so far with reasoning stripped.
func (s *transcriptStreamer) content() string {
return s.extractor.CleanedContent()
}
// streamLLMResponse drives a streamed realtime reply. It streams the assistant
// transcript as the LLM generates, then synthesizes the whole buffered message
// once (streaming the audio chunks when the TTS backend supports it, otherwise a
// single unary delta). Tool calls parsed from the autoparser ChatDeltas are
// emitted after the spoken content. The assistant content item is created lazily
// on the first content delta, so a content-less tool-call turn emits only the
// tool calls. It returns true when it has fully handled the response so the
// caller can return; callers must only invoke it for an audio modality, and with
// tools only when the model uses its tokenizer template (see triggerResponseAtTurn).
func streamLLMResponse(ctx context.Context, session *Session, conv *Conversation, t Transport, responseID string, history schema.Messages, images []string, llmCfg *config.ModelConfig, tools []types.ToolUnion, toolChoice *types.ToolChoiceUnion, toolTurn int) bool {
itemID := generateItemID()
item := types.MessageItemUnion{
Assistant: &types.MessageItemAssistant{
ID: itemID,
Status: types.ItemStatusInProgress,
Content: []types.MessageContentOutput{{Type: types.MessageContentTypeOutputAudio}},
},
}
// announce creates the assistant content item lazily, just before the first
// transcript delta — a tool-only turn never produces content, so it stays out
// of the conversation and the client sees only the tool calls.
announced := false
announce := func() {
announced = true
conv.Lock.Lock()
conv.Items = append(conv.Items, &item)
conv.Lock.Unlock()
sendEvent(t, types.ResponseOutputItemAddedEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
OutputIndex: 0,
Item: item,
})
sendEvent(t, types.ResponseContentPartAddedEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
ItemID: itemID,
OutputIndex: 0,
ContentIndex: 0,
Part: item.Assistant.Content[0],
})
}
cancel := func() {
if announced {
conv.Lock.Lock()
for i := len(conv.Items) - 1; i >= 0; i-- {
if conv.Items[i].Assistant != nil && conv.Items[i].Assistant.ID == itemID {
conv.Items = append(conv.Items[:i], conv.Items[i+1:]...)
break
}
}
conv.Lock.Unlock()
}
sendEvent(t, types.ResponseDoneEvent{
ServerEventBase: types.ServerEventBase{},
Response: types.Response{ID: responseID, Object: "realtime.response", Status: types.ResponseStatusCancelled},
})
}
var template string
if llmCfg.TemplateConfig.UseTokenizerTemplate {
template = llmCfg.GetModelTemplate()
} else {
template = llmCfg.TemplateConfig.Chat
}
thinkingStartToken := reasoning.DetectThinkingStartToken(template, &llmCfg.ReasoningConfig)
// The autoparser (tokenizer-template path) already delivers reasoning-free
// content. Prefilling the thinking start token here would re-tag that clean
// content as an unclosed reasoning block, leaving CleanedContent() empty —
// no spoken reply, no TTS. Disable the prefill; closed tag pairs are still
// stripped (PEG-fallback case, #9985).
reasoningCfg := llmCfg.ReasoningConfig
if llmCfg.TemplateConfig.UseTokenizerTemplate {
disablePrefill := true
reasoningCfg.DisableReasoningTagPrefill = &disablePrefill
}
streamer := newTranscriptStreamer(ctx, t, responseID, itemID, thinkingStartToken, reasoningCfg)
streamer.announce = announce
// Clause chunking (opt-in): synthesize each clause as soon as it completes
// instead of buffering the whole reply. streamedAudio accumulates the PCM
// across clauses for the conversation item record; ttsErr captures the first
// synthesis failure so the token callback can stop the prediction. emitSpeech
// runs synchronously here — the LLM keeps generating into the gRPC stream
// while a clause is synthesized, so audio still starts mid-generation.
var chunker *clauseChunker
if session.ModelConfig != nil && session.ModelConfig.Pipeline.ChunkClauses() {
chunker = newClauseChunker(defaultClauseMinRunes, defaultClauseMaxRunes)
}
var streamedAudio []byte
var ttsErr error
speakClause := func(clause string) error {
a, err := emitSpeech(ctx, t, session, responseID, itemID, clause)
if err != nil {
return err
}
streamedAudio = append(streamedAudio, a...)
return nil
}
// fail reports a mid-stream failure. A cancelled context means the client
// interrupted (barge-in), so roll the turn back instead of erroring.
fail := func(code, msg string, err error) bool {
if ctx.Err() != nil {
cancel()
} else {
sendError(t, code, fmt.Sprintf("%s: %v", msg, err), "", itemID)
}
return true
}
cb := func(token string, usage backend.TokenUsage) bool {
if ctx.Err() != nil {
return false
}
// Plain-content models stream text via the token; autoparser tool turns
// clear the text and deliver content via ChatDeltas, so prefer the latter
// when present. Either way only content reaches the transcript — tool-call
// deltas are parsed from the final response below.
text := token
if len(usage.ChatDeltas) > 0 {
text = functions.ContentFromChatDeltas(usage.ChatDeltas)
}
delta := streamer.onToken(text)
if chunker != nil && delta != "" {
for _, clause := range chunker.push(delta) {
if ttsErr = speakClause(clause); ttsErr != nil {
return false // stop the prediction; reported after predFunc returns
}
}
}
return true
}
predFunc, err := session.ModelInterface.Predict(ctx, history, images, nil, nil, cb, tools, toolChoice, nil, nil, nil)
if err != nil {
sendError(t, "inference_failed", fmt.Sprintf("backend error: %v", err), "", itemID)
return true
}
pred, err := predFunc()
// A clause synthesis failed mid-stream (the callback stopped the prediction);
// report it as a TTS error rather than a prediction error.
if ttsErr != nil {
return fail("tts_error", "TTS generation failed", ttsErr)
}
if err != nil {
return fail("prediction_failed", "backend error", err)
}
if ctx.Err() != nil {
cancel()
return true
}
content := streamer.content()
toolCalls := functions.ToolCallsFromChatDeltas(pred.ChatDeltas)
// Finalize the spoken content item only when the turn produced content. A
// tool-only turn skips this entirely (no empty assistant item).
if content != "" {
if !announced {
announce()
}
// Synthesize the audio. With clause chunking the completed clauses were
// already spoken inside the token callback; flush the trailing clause(s)
// the segmenter was still holding. Otherwise buffer the whole message and
// synthesize it once. emitSpeech streams the audio chunks when the TTS
// backend supports TTSStream, otherwise it sends a single unary delta.
var audio []byte
if chunker != nil {
for _, clause := range chunker.flush() {
if ttsErr = speakClause(clause); ttsErr != nil {
break
}
}
audio = streamedAudio
} else {
audio, ttsErr = emitSpeech(ctx, t, session, responseID, itemID, content)
}
if ttsErr != nil {
return fail("tts_error", "TTS generation failed", ttsErr)
}
_, isWebRTC := t.(*WebRTCTransport)
sendEvent(t, types.ResponseOutputAudioTranscriptDoneEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
ItemID: itemID,
OutputIndex: 0,
ContentIndex: 0,
Transcript: content,
})
if !isWebRTC {
sendEvent(t, types.ResponseOutputAudioDoneEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
ItemID: itemID,
OutputIndex: 0,
ContentIndex: 0,
})
}
conv.Lock.Lock()
item.Assistant.Status = types.ItemStatusCompleted
item.Assistant.Content[0].Transcript = content
if !isWebRTC {
item.Assistant.Content[0].Audio = base64.StdEncoding.EncodeToString(audio)
}
conv.Lock.Unlock()
sendEvent(t, types.ResponseContentPartDoneEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
ItemID: itemID,
OutputIndex: 0,
ContentIndex: 0,
Part: item.Assistant.Content[0],
})
sendEvent(t, types.ResponseOutputItemDoneEvent{
ServerEventBase: types.ServerEventBase{},
ResponseID: responseID,
OutputIndex: 0,
Item: item,
})
}
// Emit any tool calls, the terminal response.done, and (for server-side
// assistant tools) the follow-up turn — shared with the buffered path.
emitToolCallItems(ctx, session, conv, t, responseID, toolCalls, content != "", toolTurn)
return true
}

View File

@@ -0,0 +1,213 @@
package openai
import (
"context"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
"github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/reasoning"
)
// transcriptStreamer turns streamed LLM tokens into incremental transcript
// deltas, stripping reasoning. Audio is synthesized once from the full message
// by the caller, so there is no per-sentence segmentation.
var _ = Describe("transcriptStreamer", func() {
It("emits one transcript delta per content token", func() {
t := &fakeTransport{}
s := newTranscriptStreamer(context.Background(), t, "resp1", "item1", "", reasoning.Config{})
for _, tok := range []string{"Hello", " world.", " Bye"} {
s.onToken(tok)
}
Expect(s.content()).To(Equal("Hello world. Bye"))
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioTranscriptDelta)).To(Equal(3))
Expect(t.transcriptDeltaText()).To(Equal("Hello world. Bye"))
})
It("strips leaked reasoning even when reasoning is disabled (disable_thinking safety net)", func() {
// disable_thinking maps to DisableReasoning=true (enable_thinking=false to
// the backend). If the model emits thinking anyway, the transcript must
// still not leak it: stripping always runs for spoken output.
disable := true
t := &fakeTransport{}
s := newTranscriptStreamer(context.Background(), t, "resp1", "item1", "",
reasoning.Config{DisableReasoning: &disable})
s.onToken("<think>secret plan</think>")
s.onToken("The answer is 42.")
Expect(s.content()).To(Equal("The answer is 42."))
Expect(s.content()).ToNot(ContainSubstring("secret plan"))
Expect(t.transcriptDeltaText()).ToNot(ContainSubstring("secret plan"))
})
It("does not swallow autoparser content when the template has a thinking start token (tokenizer-template path)", func() {
// Regression: with tag prefill on, the detected <think> token is
// prepended to the autoparser's already-clean content, swallowing the
// whole reply (empty transcript → no TTS). streamLLMResponse disables
// the prefill for the tokenizer-template path.
disablePrefill := true
t := &fakeTransport{}
s := newTranscriptStreamer(context.Background(), t, "resp1", "item1", "<think>",
reasoning.Config{DisableReasoningTagPrefill: &disablePrefill})
s.onToken("Hello")
s.onToken(" there.")
Expect(s.content()).To(Equal("Hello there."))
Expect(t.transcriptDeltaText()).To(Equal("Hello there."))
})
It("still strips embedded closed reasoning tags with prefill disabled (PEG-fallback safety, #9985)", func() {
// Disabling prefill must not stop stripping closed <think>…</think>
// pairs the PEG fallback can leave in autoparser content.
disablePrefill := true
t := &fakeTransport{}
s := newTranscriptStreamer(context.Background(), t, "resp1", "item1", "<think>",
reasoning.Config{DisableReasoningTagPrefill: &disablePrefill})
s.onToken("<think>secret</think>")
s.onToken("The answer is 42.")
Expect(s.content()).To(Equal("The answer is 42."))
Expect(t.transcriptDeltaText()).ToNot(ContainSubstring("secret"))
})
})
// streamLLMResponse drives a full streamed realtime turn: live transcript
// deltas while the LLM generates, then the whole message is synthesized once.
var _ = Describe("streamLLMResponse", func() {
It("streams transcript deltas then synthesizes the whole message once", func() {
on := true
m := &fakeModel{
predictTokens: []string{"Hello", " world.", " How are you?"},
predictResp: backend.LLMResponse{Response: "Hello world. How are you?"},
ttsStreamChunks: [][]byte{{9}},
ttsStreamRate: 24000,
}
session := &Session{
OutputSampleRate: 24000,
ModelInterface: m,
ModelConfig: &config.ModelConfig{
Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{LLM: &on, TTS: &on}},
},
}
conv := &Conversation{}
t := &fakeTransport{}
llmCfg := &config.ModelConfig{}
handled := streamLLMResponse(context.Background(), session, conv, t, "resp1", nil, nil, llmCfg, nil, nil, 0)
Expect(handled).To(BeTrue())
// One live transcript delta per streamed token.
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioTranscriptDelta)).To(Equal(3))
// The whole message is synthesized ONCE (not per sentence): a single
// emitSpeech replays the one TTS stream chunk.
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioDelta)).To(Equal(1))
Expect(t.transcriptDeltaText()).To(Equal("Hello world. How are you?"))
})
It("synthesizes each clause as it completes when clause chunking is enabled", func() {
on := true
m := &fakeModel{
predictTokens: []string{"Hello world.", " How are you?"},
predictResp: backend.LLMResponse{Response: "Hello world. How are you?"},
ttsStreamChunks: [][]byte{{9}},
ttsStreamRate: 24000,
}
session := &Session{
OutputSampleRate: 24000,
ModelInterface: m,
ModelConfig: &config.ModelConfig{
Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{LLM: &on, TTS: &on, ClauseChunking: &on}},
},
}
conv := &Conversation{}
t := &fakeTransport{}
llmCfg := &config.ModelConfig{}
handled := streamLLMResponse(context.Background(), session, conv, t, "resp1", nil, nil, llmCfg, nil, nil, 0)
Expect(handled).To(BeTrue())
// Two clauses ("Hello world." mid-stream, "How are you?" on flush) → two
// emitSpeech calls → two audio deltas, vs one for whole-message buffering.
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioDelta)).To(Equal(2))
// The full transcript still streams verbatim.
Expect(t.transcriptDeltaText()).To(Equal("Hello world. How are you?"))
// Exactly one terminal response.done.
Expect(t.countEvents(types.ServerEventTypeResponseDone)).To(Equal(1))
})
It("streams content deltas and emits tool-call items (autoparser tool turn)", func() {
on := true
// Autoparser path: reply.Message is empty; content + tool calls arrive via
// ChatDeltas. Chunk 1 carries content, chunk 2 carries the tool call.
contentDelta := []*proto.ChatDelta{{Content: "Let me check."}}
toolDelta := []*proto.ChatDelta{{ToolCalls: []*proto.ToolCallDelta{{Index: 0, Name: "get_weather", Arguments: `{"city":"Paris"}`}}}}
m := &fakeModel{
predictTokens: []string{"", ""},
predictChunkDeltas: [][]*proto.ChatDelta{contentDelta, toolDelta},
predictResp: backend.LLMResponse{ChatDeltas: append(append([]*proto.ChatDelta{}, contentDelta...), toolDelta...)},
ttsStreamChunks: [][]byte{{9}},
ttsStreamRate: 24000,
}
session := &Session{
OutputSampleRate: 24000,
ModelInterface: m,
ModelConfig: &config.ModelConfig{
Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{LLM: &on, TTS: &on}},
},
}
conv := &Conversation{}
t := &fakeTransport{}
llmCfg := &config.ModelConfig{}
llmCfg.TemplateConfig.UseTokenizerTemplate = true
handled := streamLLMResponse(context.Background(), session, conv, t, "resp1", nil, nil, llmCfg, nil, nil, 0)
Expect(handled).To(BeTrue())
// The spoken content was streamed live.
Expect(t.transcriptDeltaText()).To(Equal("Let me check."))
// The tool call is emitted as a function_call item.
Expect(t.countEvents(types.ServerEventTypeResponseFunctionCallArgumentsDone)).To(Equal(1))
// Exactly one terminal response.done.
Expect(t.countEvents(types.ServerEventTypeResponseDone)).To(Equal(1))
})
It("emits only tool-call items for a content-less tool turn (no empty assistant item)", func() {
on := true
toolDelta := []*proto.ChatDelta{{ToolCalls: []*proto.ToolCallDelta{{Index: 0, Name: "get_weather", Arguments: `{"city":"Rome"}`}}}}
m := &fakeModel{
predictTokens: []string{""},
predictChunkDeltas: [][]*proto.ChatDelta{toolDelta},
predictResp: backend.LLMResponse{ChatDeltas: toolDelta},
}
session := &Session{
OutputSampleRate: 24000,
ModelInterface: m,
ModelConfig: &config.ModelConfig{
Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{LLM: &on, TTS: &on}},
},
}
conv := &Conversation{}
t := &fakeTransport{}
llmCfg := &config.ModelConfig{}
llmCfg.TemplateConfig.UseTokenizerTemplate = true
handled := streamLLMResponse(context.Background(), session, conv, t, "resp1", nil, nil, llmCfg, nil, nil, 0)
Expect(handled).To(BeTrue())
// No content → no transcript deltas and no spurious assistant content item.
Expect(t.transcriptDeltaText()).To(Equal(""))
Expect(t.countEvents(types.ServerEventTypeResponseOutputAudioTranscriptDelta)).To(Equal(0))
// The tool call is still emitted.
Expect(t.countEvents(types.ServerEventTypeResponseFunctionCallArgumentsDone)).To(Equal(1))
Expect(t.countEvents(types.ServerEventTypeResponseDone)).To(Equal(1))
})
})

View File

@@ -0,0 +1,33 @@
package openai
import (
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/reasoning"
)
// applyPipelineThinking forces the LLM's reasoning/thinking off when the realtime
// pipeline sets disable_thinking, mapping to the enable_thinking=false backend
// metadata via ReasoningConfig.DisableReasoning. The LLM config passed in is the
// per-session copy returned by the config loader, so this does not affect other
// users of the same model. When the pipeline does not set disable_thinking the
// LLM config is left untouched.
func applyPipelineThinking(llm *config.ModelConfig, pipeline config.Pipeline) {
if llm == nil || !pipeline.ThinkingDisabled() {
return
}
disable := true
llm.ReasoningConfig.DisableReasoning = &disable
}
// spokenReasoningConfig adapts a model's reasoning config for stripping reasoning
// OUT of realtime spoken output. ReasoningConfig.DisableReasoning is overloaded:
// the backend reads it as the "enable_thinking=false" hint (which pipeline
// disable_thinking sets via applyPipelineThinking), but the reasoning extractor
// reads it as "skip stripping, assume there is no reasoning". Honouring the latter
// when extracting for speech would leak raw <think>…</think> whenever the model
// ignores the suppression hint. Spoken output must never contain reasoning, so we
// always strip: clear DisableReasoning while keeping custom tokens/tag pairs.
func spokenReasoningConfig(cfg reasoning.Config) reasoning.Config {
cfg.DisableReasoning = nil
return cfg
}

View File

@@ -0,0 +1,50 @@
package openai
import (
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/reasoning"
)
// applyPipelineThinking lets a realtime pipeline force the LLM's thinking off
// (enable_thinking=false metadata) without editing the LLM model config.
var _ = Describe("applyPipelineThinking", func() {
It("disables reasoning on the LLM config when the pipeline disables thinking", func() {
disable := true
llm := &config.ModelConfig{}
applyPipelineThinking(llm, config.Pipeline{DisableThinking: &disable})
Expect(llm.ReasoningConfig.DisableReasoning).ToNot(BeNil())
Expect(*llm.ReasoningConfig.DisableReasoning).To(BeTrue())
})
It("leaves the LLM config untouched when the pipeline does not set disable_thinking", func() {
llm := &config.ModelConfig{}
applyPipelineThinking(llm, config.Pipeline{})
Expect(llm.ReasoningConfig.DisableReasoning).To(BeNil())
})
})
// spokenReasoningConfig clears DisableReasoning so realtime spoken output always
// strips reasoning, even though disable_thinking sets DisableReasoning=true on the
// LLM config (which the backend reads as enable_thinking=false).
var _ = Describe("spokenReasoningConfig", func() {
It("clears DisableReasoning so the extractor still strips leaked reasoning", func() {
disable := true
out := spokenReasoningConfig(reasoning.Config{DisableReasoning: &disable})
Expect(out.DisableReasoning).To(BeNil())
})
It("preserves the other reasoning settings", func() {
disable := true
out := spokenReasoningConfig(reasoning.Config{
DisableReasoning: &disable,
ThinkingStartTokens: []string{"<reason>"},
TagPairs: []reasoning.TagPair{{Start: "<reason>", End: "</reason>"}},
})
Expect(out.ThinkingStartTokens).To(Equal([]string{"<reason>"}))
Expect(out.TagPairs).To(HaveLen(1))
Expect(out.TagPairs[0].Start).To(Equal("<reason>"))
})
})

View File

@@ -0,0 +1,63 @@
package openai
import (
"context"
"fmt"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
)
// emitTranscription transcribes a committed utterance and emits the transcription
// events for it, returning the final transcript text. With
// pipeline.streaming.transcription enabled it streams each transcript fragment as
// a conversation.item.input_audio_transcription.delta as the backend produces it,
// then a completed event; otherwise it transcribes the whole utterance and emits
// a single completed event. delta and completed events share itemID.
func emitTranscription(ctx context.Context, t Transport, session *Session, itemID, audioPath string) (string, error) {
cfg := session.InputAudioTranscription
if session.ModelConfig != nil && session.ModelConfig.Pipeline.StreamTranscription() {
final, err := session.ModelInterface.TranscribeStream(ctx, audioPath, cfg.Language, false, false, cfg.Prompt, func(delta string) {
_ = t.SendEvent(types.ConversationItemInputAudioTranscriptionDeltaEvent{
ServerEventBase: types.ServerEventBase{EventID: "event_TODO"},
ItemID: itemID,
ContentIndex: 0,
Delta: delta,
})
})
if err != nil {
return "", err
}
transcript := ""
if final != nil {
transcript = final.Text
}
if err := t.SendEvent(types.ConversationItemInputAudioTranscriptionCompletedEvent{
ServerEventBase: types.ServerEventBase{EventID: "event_TODO"},
ItemID: itemID,
ContentIndex: 0,
Transcript: transcript,
}); err != nil {
return "", err
}
return transcript, nil
}
// Unary fallback: transcribe the whole utterance, emit one completed event.
tr, err := session.ModelInterface.Transcribe(ctx, audioPath, cfg.Language, false, false, cfg.Prompt)
if err != nil {
return "", err
}
if tr == nil {
return "", fmt.Errorf("transcribe result is nil")
}
if err := t.SendEvent(types.ConversationItemInputAudioTranscriptionCompletedEvent{
ServerEventBase: types.ServerEventBase{EventID: "event_TODO"},
ItemID: itemID,
ContentIndex: 0,
Transcript: tr.Text,
}); err != nil {
return "", err
}
return tr.Text, nil
}

View File

@@ -0,0 +1,54 @@
package openai
import (
"context"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
"github.com/mudler/LocalAI/core/schema"
)
// emitTranscription transcribes a committed utterance, streaming transcript text
// deltas when the pipeline opts in, and returns the final transcript text.
var _ = Describe("emitTranscription", func() {
It("streams transcription deltas then a completed event when streaming is enabled", func() {
on := true
session := &Session{
InputAudioTranscription: &types.AudioTranscription{},
ModelConfig: &config.ModelConfig{
Pipeline: config.Pipeline{Streaming: config.PipelineStreaming{Transcription: &on}},
},
ModelInterface: &fakeModel{
transcribeDeltas: []string{"Hel", "lo", " world"},
transcribeFinal: &schema.TranscriptionResult{Text: "Hello world"},
},
}
t := &fakeTransport{}
transcript, err := emitTranscription(context.Background(), t, session, "item1", "/tmp/x.wav")
Expect(err).ToNot(HaveOccurred())
Expect(transcript).To(Equal("Hello world"))
Expect(t.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionDelta)).To(Equal(3))
Expect(t.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionCompleted)).To(Equal(1))
})
It("emits a single completed event with no deltas in unary mode", func() {
session := &Session{
InputAudioTranscription: &types.AudioTranscription{},
ModelConfig: &config.ModelConfig{}, // streaming off
ModelInterface: &fakeModel{transcribeFinal: &schema.TranscriptionResult{Text: "Hi"}},
}
t := &fakeTransport{}
transcript, err := emitTranscription(context.Background(), t, session, "item1", "/tmp/x.wav")
Expect(err).ToNot(HaveOccurred())
Expect(transcript).To(Equal("Hi"))
Expect(t.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionDelta)).To(Equal(0))
Expect(t.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionCompleted)).To(Equal(1))
})
})

View File

@@ -48,7 +48,8 @@ func RealtimeCalls(application *application.Application) echo.HandlerFunc {
return c.JSON(http.StatusInternalServerError, map[string]string{"error": "codec registration failed"})
}
api := webrtc.NewAPI(webrtc.WithMediaEngine(m))
se := webRTCSettingEngine(application.ApplicationConfig())
api := webrtc.NewAPI(webrtc.WithMediaEngine(m), webrtc.WithSettingEngine(se))
pc, err := api.NewPeerConnection(webrtc.Configuration{})
if err != nil {

View File

@@ -0,0 +1,47 @@
package openai
import (
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/xlog"
"github.com/pion/webrtc/v4"
)
// webRTCSettingEngine builds the pion SettingEngine for /v1/realtime WebRTC.
//
// With a default (empty) SettingEngine, pion gathers a host ICE candidate for
// every local interface. Under Docker host networking that includes bridge
// addresses (docker0/veth, 172.x) that a remote browser cannot route to; the
// connection often establishes on a good pair and then drops once ICE consent
// checks fail on the unreachable ones. The two opt-in knobs below let an
// operator advertise only the reachable address.
func webRTCSettingEngine(cfg *config.ApplicationConfig) webrtc.SettingEngine {
s := webrtc.SettingEngine{}
if cfg == nil {
return s
}
if len(cfg.WebRTCNAT1To1IPs) > 0 {
s.SetNAT1To1IPs(cfg.WebRTCNAT1To1IPs, webrtc.ICECandidateTypeHost)
xlog.Debug("realtime webrtc: advertising NAT 1:1 host IPs", "ips", cfg.WebRTCNAT1To1IPs)
}
if filter := iceInterfaceFilter(cfg.WebRTCICEInterfaces); filter != nil {
s.SetInterfaceFilter(filter)
xlog.Debug("realtime webrtc: restricting ICE interfaces", "interfaces", cfg.WebRTCICEInterfaces)
}
return s
}
// iceInterfaceFilter returns an interface allow-list predicate for pion, or nil
// when no interfaces are configured (pion's default: gather from all).
func iceInterfaceFilter(allowed []string) func(string) bool {
if len(allowed) == 0 {
return nil
}
set := make(map[string]struct{}, len(allowed))
for _, name := range allowed {
set[name] = struct{}{}
}
return func(iface string) bool {
_, ok := set[iface]
return ok
}
}

Some files were not shown because too many files have changed in this diff Show More