* fix(darwin): never package a go backend build tree as a working image
The darwin/arm64 vibevoice-cpp image shipped the source tree with a
half-built CMake directory (build-libgovibevoicecpp-fallback.so/) and no
backend binary, so the backend could never start: run.sh exec'd a
vibevoice-cpp binary that was not in the package and LocalAI timed out
waiting for the gRPC service.
Two durable, backend-agnostic defenses:
- backend/go/vibevoice-cpp/Makefile: mirror whisper's cleanup discipline so a
partial CMake tree cannot survive into packaging. Run `make purge` before
each variant build and `rm -rfv build*` after. The old recipe only removed
its build dir after a successful `mv`, so a failed build left the half-built
tree behind.
- scripts/build/golang-darwin.sh: before creating the OCI image, remove any
stray build-* directory and assert that the binary run.sh launches actually
exists. A build that produced no binary now fails the job loudly instead of
publishing a source tree as a working backend. The binary name is derived
from run.sh's `exec $CURDIR/<binary>` line (parakeet-cpp launches
parakeet-cpp-grpc, so it is not always ${BACKEND}) with a ${BACKEND}
fallback.
The underlying native build failure that left vibevoice-cpp half-built still
needs to be reproduced and fixed on Apple Silicon; this change ensures such a
failure can never again be published as a working image.
Refs #10267
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(vibevoice-cpp): build libvibevoice.a on darwin (link target, not path)
The darwin build failed with:
No rule to make target 'vibevoice/libvibevoice.a', needed by
'libgovibevoicecpp.so'. Stop.
The upstream vibevoice project is added with add_subdirectory(... EXCLUDE_FROM_ALL),
so its `vibevoice` static-library target is only built when something links it
as a target. The Apple branch linked only `$<TARGET_FILE:vibevoice>` - a bare
archive path with no target reference - so CMake never emitted a rule to build
libvibevoice.a, while the Linux branch worked because it passes the `vibevoice`
target name inside the --whole-archive flags.
Link the `vibevoice` target on Apple (establishing the build dependency) and
apply -force_load as a separate link option to keep whole-archive semantics so
purego can dlsym the vv_capi_* symbols.
Refs #10267
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Vision-language checkpoints such as mlx-community/gemma-4-E4B-it-qat-4bit
declare the "image-text-to-text" pipeline tag on HuggingFace. The mlx
importer hardcoded backend "mlx" for every mlx-community model, so these
VLMs were served by the text-only mlx-lm backend whose tokenizer does not
carry the processor chat template. The template was never applied and the
model produced degenerate, looping output that echoed the prompt.
Detect the "image-text-to-text" pipeline tag in the importer and route those
models to mlx-vlm, which applies the processor-aware chat template. An
explicit backend preference still wins.
As a defensive backstop, the mlx backend now warns loudly when the loaded
model has no chat template, so a misrouted VLM surfaces the problem instead
of silently looping.
Fixes#10269
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ⬆️ Update mudler/parakeet.cpp
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(parakeet-cpp): close streaming segments on <EOB> after ABI v5 eou/eob split
parakeet.cpp ABI v5 (the pin this PR bumps to) splits the streaming JSON
"eou" flag: in v4 "eou":1 fired for either <EOU> (end of utterance) or
<EOB> (backchannel); in v5 "eou" means <EOU> only, with a new separate
"eob" field for the backchannel token.
The streamSegmenter closed a segment on "eou" alone, so after the bump a
backchannel token would silently stop ending a segment and merge into the
next utterance. Read the new "eob" field and flush on either signal to
preserve the v4 segmentation boundaries. The flat stream_feed eou_out path
is unaffected: its mask is still non-zero for either event.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
CrispASR's piper backend phonemizes non-English text via espeak-ng (dlopen,
the MIT-clean path; English uses a built-in G2P). The FROM scratch crispasr
image shipped none of it, so non-English piper voices loaded but failed
synthesis with "phonemization failed". Bundle the espeak-ng runtime so they
work:
- Dockerfile.golang: install espeak-ng-data + libespeak-ng1 and its libpcaudio0
/ libsonic0 deps in the crispasr builder (espeak's dlopen fails without the
latter two).
- package.sh: copy libespeak-ng.so.1, libpcaudio.so.0, libsonic.so.0 into
package/lib/ and the espeak-ng-data dir into the package root.
- run.sh: export CRISPASR_ESPEAK_DATA_PATH so the bundled data is found.
Add 9 single-speaker piper voices (de/en/it, incl. Italian paola + riccardo) to
the gallery, run through backend:piper, hosted at
LocalAI-Community/piper-voices-GGUF (converted from rhasspy/piper-voices with
CrispASR's convert-piper-to-gguf.py). Only single-speaker low/medium voices are
included; the engine does not yet support multi-speaker or high-quality piper
decoders.
All 9 verified end-to-end: each synthesizes a WAV at the model's native sample
rate using only the image-bundled espeak payload.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
CrispASR's piper backend returns PCM at the voice's native rate (from the GGUF
piper.sample_rate key: 16 kHz for x_low/low, 22.05 kHz for medium/high) and does
not resample, but the Go WAV encoder hardcoded 24000 Hz. Every piper voice was
therefore written with a wrong header and played back at the wrong pitch/speed.
Read piper.sample_rate from the model's GGUF metadata at Load via the vendored
gguf-parser-go and use it for the WAV header, falling back to the 24 kHz default
for the other CrispASR TTS engines (vibevoice/orpheus/chatterbox/qwen3-tts) that
emit 24 kHz and carry no such key.
Adds unit specs (minimal crafted GGUFs + WAV-header decode) and an env-gated
end-to-end spec (CRISPASR_PIPER_MODEL_PATH). Verified e2e: en_GB-cori-medium
synthesizes a 22050 Hz WAV through backend:piper.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Neither the sherpa-onnx nor the speaker-recognition backend had a
darwin/arm64 image, so `local-ai backends install` failed with "no child
with platform darwin/arm64" on macOS. This left /v1/audio/diarization (the
sherpa-onnx path) and /v1/voice/embed without any usable backend on Apple
Silicon.
Both backends build on darwin/arm64:
- sherpa-onnx (Go) already fetches the onnxruntime osx-arm64 runtime in its
Makefile; it only needed a darwin matrix entry (build-type metal, lang go,
like whisper and silero-vad).
- speaker-recognition (Python) needed a requirements-mps.txt so the mps build
installs plain onnxruntime (which ships a macOS arm64 wheel) instead of the
onnxruntime-gpu pulled by its base requirements (which does not).
Add both to the includeDarwin build matrix, wire the metal capability and
metal image aliases into the gallery, and add the speaker-recognition
requirements-mps.txt.
Fixes#10268
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
ggml leaves GGML_CUDA_GRAPHS off by default. Passing -DGGML_CUDA_GRAPHS=ON
for cublas builds lets the CUDA backend capture and replay the compute
graph for a small free speedup (about 1% measured on a GB10, never
negative). It is not gated by parakeet.cpp's CMake options, so it passes
straight through to ggml.
Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(router): score classifier production-readiness
Conversation trimming runs through the classifier model's chat template
and trims by exact token count, sized to the model's n_batch which is
now scaled to context so long probes can't crash the backend. Missing
chat_message templates are a hard error at router build time. Router-
facing factories (Embedder/Scorer/Reranker/TokenCounter) re-resolve
ModelConfig per call so a model installed post-startup doesn't bind a
stub Backend="" config and silently fall into the loader's auto-
iterate path.
New 'vector_store' backend trace recorded inside localVectorStore on
every Search/Insert — including the backend-load-failure path that
previously vanished into an xlog.Warn — with outcome tagging
(hit/miss/empty_store/backend_load_error/find_error/insert_error/ok).
Companion cleanup drops misleading similarity:0 and input_tokens_count:0
from non-hit and text-mode traces.
Gallery local-store-development aliases to 'local-store' so the master
image satisfies pkg/model.LocalStoreBackend lookups from the embedding
cache.
Misc: llama-cpp TokenizeString reads the correct 'prompt' JSON key
(the original bug); ModelTokenize nil-guard; non-fatal mitm proxy
startup; PII 'route_local' renamed to 'allow' with docs/UI in sync;
model-editor footer no longer eats the edit area on small screens;
several config-editor template/dropdown/section fixes.
Tests: e2e router specs (casual/code-hint + long-conversation trim),
vector_store trace specs, lazy-factory specs, gallery dev-alias
resolution, Playwright trace badge + scroll regression.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(backend): auto-size batch to context for embedding and rerank models
Embedding and rerank models pool over the whole input in a single physical batch (n_ubatch). With batch left at the 512 default, the backend rejects longer inputs with "input is too large to process", silently capping a large-context embedder (e.g. 8k/32k) at 512 tokens. Size n_batch to the context for these single-pass usecases, mirroring the existing FLAG_SCORE behaviour; an explicit batch: still wins.
Extracts EffectiveContextSize/EffectiveBatchSize from grpcModelOpts so the effective decode window has one home for other callers to reuse.
Adds an e2e-aio regression test that embeds a >512-token input. The AIO embedding model is switched to nomic-embed-text-v1.5 (2048 context) because the previous granite model was capped at 512 tokens and could not exercise the larger batch.
Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(gallery): raise arch-router scoring output cap via parallel:64
Scoring decodes the whole prompt+candidate in a single llama_decode and
reads one logit row per candidate token. The vendored llama.cpp server
caps causal output rows at n_parallel, so the default of 1 aborts with
GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) on multi-token route
labels. Set options: [parallel:64] on both arch-router quant entries to
lift the cap; kv_unified (the grpc-server default) keeps the full context
per sequence, so this does not split the KV cache.
Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
fix(vllm): restore compatibility with vLLM >= 0.22 (get_tokenizer moved)
vLLM 0.22 moved get_tokenizer from vllm.transformers_utils.tokenizer
to vllm.tokenizers. Since the backend requirements install vllm
unpinned, freshly built/installed vllm backends currently fail to
start with ModuleNotFoundError: No module named
'vllm.transformers_utils.tokenizer' (surfacing as 'grpc service not
ready' when loading a model).
Use the same try/except version-compat import pattern already used
elsewhere in this file: try the new vllm.tokenizers location first and
fall back to the pre-0.22 path.
Tested on a DGX Spark (GB10, ARM64) with the
cuda13-nvidia-l4t-arm64-vllm backend and vllm 0.22.0: model load, chat
completions and tool calls all work with this patch applied.
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(llama-cpp): bump to 8f83d6c for mtmd video input support
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): forward video input to mtmd (template + non-template paths)
Wire request->videos() into grpc-server.cpp mirroring the existing image
and audio handling: a video_data build + non-template files extraction, and
input_video chat chunks on the tokenizer-template path. allow_video is
auto-set at model load by the vendored upstream chat_params.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): add video attachment support to the chat UI
Mirror the image/audio attachment path for video: emit video_url content
parts, accept video/* in the picker, keep video files as base64, show a
film icon badge, and render attached video inline with a <video> player.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(llama-cpp): patch mtmd video stdin double-close (heap crash)
Upstream mtmd video input (ggml-org/llama.cpp#24269) double-fcloses the
ffmpeg/ffprobe stdin FILE: feed_stdin() fclose()s the FILE returned by
subprocess_stdin() (which is sp->stdin_file), then subprocess_destroy()
fclose()s the same pointer again -> heap corruption that aborts the
backend on any base64 input_video request (the CLI --video file path is
unaffected). Vendor a one-line fix (null sp->stdin_file after fclose)
via prepare.sh's patches/ until upstream merges it.
Verified e2e with gemma-4-e2b-it-qat-q4_0: video frames decode via
ffmpeg and the model answers correctly (red clip -> 'Red', blue -> 'Blue').
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(llama-cpp): re-pin to upstream #24316, drop vendored stdin patch
Upstream replaced the ad-hoc video stdin handling with a proper RAII
refactor (ggml-org/llama.cpp#24316, "mtmd: refactor video subproc
handling"), which includes the same `sp->stdin_file = nullptr` guard our
patch added (plus join-before-destroy ordering). Re-pin LLAMA_VERSION to
that branch head and drop patches/0001 - it's now redundant.
Verified e2e with gemma-4-e2b-it-qat-q4_0: no crash, video frames decode
and the model answers correctly (red clip -> "Red", blue -> "Blue").
NOTE: #24316 is not yet merged, so this pins to its branch-head commit
(28ca1e60). Re-pin to the squash-merge commit on master once it lands,
otherwise `git fetch` may lose the commit after the branch is deleted.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ⬆️ Update CrispStrobe/CrispASR
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(crispasr): link crispasr-lib CMake target instead of crispasr
The dependency-bump regeneration of this branch reset CMakeLists.txt to
master and dropped the prior link-target fix, reintroducing the
`cannot find -lcrispasr` failure. Upstream CrispASR (f7838a3) defines the
library as the CMake target `crispasr-lib` (with OUTPUT_NAME crispasr);
there is no target named `crispasr`, so target_link_libraries falls back
to a bare `-lcrispasr` linker flag that cannot be resolved. Point the link
at the real target name.
Verified locally: CPU cmake-configure of the bumped source generates a
gocrispasr link line referencing sources/CrispASR/src/libcrispasr.a with no
dangling -lcrispasr.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ⬆️ Update antirez/ds4
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(ds4): link ds4_ssd.o into the backend build
Upstream antirez/ds4 splits the SSD expert-cache into its own ds4_ssd.c
translation unit, whose symbols (ds4_ssd_memory_lock_acquire/release,
ds4_ssd_cache_experts_for_byte_budget, ds4_ssd_auto_cache_plan) are
referenced by ds4.c/ds4_cpu.o. The dependency-bump automation regenerated
this branch from clean master and dropped the prior linkage fix, so the
cpu-ds4 / cublas-ds4 backend builds fail again with undefined references.
Re-apply the ds4_ssd.o linkage GPU-agnostically (mirroring ds4_distributed.o)
in both the backend Makefile (DS4_OBJ_TARGET + the engine-object build rule
for every GPU mode) and CMakeLists.txt (list(APPEND DS4_OBJS ds4_ssd.o)).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(parakeet-cpp): real segment timestamps (NeMo-faithful)
Offline: replace the single synthetic whole-clip segment with multiple
segments grouped exactly like NeMo's get_segment_offsets - a new segment
after sentence-ending punctuation ('. ? !'), each carrying start/end and
its time-window token ids. The optional model option segment_gap_threshold
(NeMo's unit: encoder FRAMES, default 0=off) adds NeMo's silence-gap split,
converted to seconds via the JSON frame_sec the engine now reports.
Per-segment words are still gated behind timestamp_granularities=["word"];
a zero-word document falls back to a single text segment.
Streaming: when libparakeet.so exposes the ABI v4 JSON entry points
(probed), drive parakeet_capi_stream_feed_json / _finalize_json and
accumulate the streamed per-word timestamps into per-utterance segments
(EOU stays the boundary), so streaming FinalResult segments now carry
start/end. Falls back to the text-only feed against an older library.
Pure-Go specs cover splitWordsIntoSegments (punctuation + gap rules, NeMo
elif order, fallback), transcriptResultFromDoc (multi-segment, token
windows, word-granularity gate), and the streaming segmenter.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(audio): document parakeet-cpp segment timestamps + segment_gap_threshold
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(parakeet-cpp): update model-gated specs for multi-segment output
The offline AudioTranscription specs asserted the old single synthetic
segment (Segments HaveLen(1), Segments[0].Text == res.Text). With
NeMo-faithful segmentation a multi-sentence clip now yields multiple
punctuation-delimited segments, so assert the new contract instead:
one-or-more time-ordered segments, each with text and (under word
granularity) per-segment words whose span tracks the segment start/end.
Caught by running the model-gated suite on the dgx (GB10) against the
real tdt_ctc-110m + realtime_eou models.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* chore(turboquant): bump TheTom/llama-cpp-turboquant to 7d9715f1
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(turboquant): drop obsolete legacy-spec shim after fork rebased
The TheTom/llama-cpp-turboquant fork (pin c9aa86a) rebased past the
upstream common_params_speculative refactor (ggml-org/llama.cpp
#22397/#22838/#22964), the model_tgt rename (#22838) and get_media_marker
(#21962). The old fork-compat shim forced now-wrong legacy code paths,
breaking the build with errors like 'struct common_params_speculative has
no member named mparams_dft / type' and 'server_context_impl has no member
named model'.
Remove the obsolete LOCALAI_LEGACY_LLAMA_CPP_SPEC branches from the shared
grpc-server.cpp (stock llama-cpp and the modern fork both take the modern
path now), and narrow the one remaining gap (the fork still lacks
common_params::checkpoint_min_step) to a dedicated
LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP guard injected by
patch-grpc-server.sh. The patch script now only adds the turbo2/3/4
KV-cache types and injects that one macro.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(turboquant): HIP-port the fork's CUDA additions (copy2d 3D-peer + cudaEventCreate)
The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs that
ggml's HIP/MUSA shim does not provide, breaking the -gpu-rocm-hipblas-turboquant
build. patches/0001-hip-guard-copy2d-peer-fastpath.patch (applied by
apply-patches.sh) ports them:
- Guard ggml_cuda_copy2d_across_devices's 3D-peer copy fast path with
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) so HIP/MUSA fall through
to the existing cudaMemcpyAsync staging fallback (HIP genuinely lacks
cudaMemcpy3DPeerAsync, per the fork's own comment).
- Create the device event in ggml_backend_cuda_device_event_new with the
HIP-aliased cudaEventCreateWithFlags(.., cudaEventDisableTiming) instead of the
un-aliased plain cudaEventCreate, matching this file's own usage elsewhere.
CUDA builds are unaffected.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* ci(turboquant): drop the ROCm/hipblas build flavor
The TheTom/llama-cpp-turboquant fork is not ROCm-clean at the current pin:
beyond the CUDA-API gaps already patched (3D-peer copy, cudaEventCreate),
its llama.cpp base fails to compile the flash-attention MMA f16 kernels for
head-dim 640 under HIP (cols_per_warp evaluates to 0 -> division-by-zero /
non-constant static asserts in fattn-mma-f16.cuh). That is a deep
ggml-on-ROCm kernel issue, not something a small fork patch can paper over.
Drop -gpu-rocm-hipblas-turboquant from the build matrix so turboquant still
ships for cpu / cublas / vulkan / sycl. Re-add it once the fork's HIP path
compiles (or upstream ggml fixes the large-head-dim MMA kernels for ROCm).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(stablediffusion-ggml): support Ideogram4 unconditional diffusion model
Bump stable-diffusion.cpp from 1f9ee88 to b9254dd, the upstream commit that
adds Ideogram4 support (leejet/stable-diffusion.cpp#1609). Ideogram4 derives
its classifier-free guidance from a separate unconditional diffusion model,
exposed upstream through the new sd_ctx_params_t.uncond_diffusion_model_path
field.
Wire that field into the gosd wrapper via a new uncond_diffusion_model_path
option. The _path suffix is deliberate: the Go loader only resolves options
whose name contains "path" to an absolute path under the model directory, so
this keeps the option consistent with diffusion_model_path and
high_noise_diffusion_model_path.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(gallery): add Ideogram4 stablediffusion-ggml models
Single-file GGUF weights for Ideogram4 are now published
(stduhpf/ideogram-4-gguf), so add the model to the gallery. Ideogram4 is a
text-to-image model with strong, accurate in-image text rendering, driven by
a Qwen3-VL-8B text encoder and real classifier-free guidance from a separate
unconditional diffusion model (the uncond_diffusion_model_path support added
in the preceding commit).
Two index entries, both built on gallery/virtual.yaml with the full config
inlined in overrides (same pattern as the other models, no dedicated template
file):
- ideogram-4-iq4nl-ggml (4-bit, ~11.6GB diffusion)
- ideogram-4-q8_0-ggml (8-bit, ~20GB diffusion)
Each bundles the diffusion + unconditional GGUF (stduhpf), the
Qwen3-VL-8B-Instruct text encoder (unsloth), and the FLUX.2 VAE (Comfy-Org
mirror, non-gated). cfg_scale is 7 to match the upstream Ideogram4 default,
since it performs real CFG unlike the guidance-distilled Flux/Z-Image models.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(parakeet-cpp): honor request language (multilingual nemotron) on batched + streaming paths
Reads opts.GetLanguage() and threads it through to the new
parakeet_capi_transcribe_pcm_batch_json_lang and parakeet_capi_stream_begin_lang
C-API entry points, both probed with Dlsym so the backend still loads against an
older libparakeet.so (falling back to the non-lang paths, i.e. model default).
parakeet.cpp's batched C-API takes a single target_lang for the whole batch, so
the dispatcher only coalesces same-language requests: a request whose language
differs from the batch leader is held as a single carry-over and becomes the
leader of the next batch, never dropped and never left waiting (including on
shutdown). A new batcher test asserts no dispatched batch is ever mixed-language
and that every submitted request still receives a reply.
Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(gallery): add parakeet-cpp-nemotron-3.5-asr-streaming-0.6b; bump parakeet.cpp pin
Adds the multilingual prompt-conditioned streaming model to the gallery (q8_0
default, OpenMDW-1.1) and bumps the parakeet-cpp backend pin to the parakeet.cpp
commit that ships nemotron support plus batched causal subsampling and the
batched target_lang C-API.
Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
chore(parakeet-cpp): bump pin to banded long-audio attention (843600590)
Update PARAKEET_VERSION to mudler/parakeet.cpp@843600590f
(merge of parakeet.cpp#9). Brings NeMo rel_pos_local_attn banded/Longformer
attention with the chunk-matmul construction: long audio now uses O(T*window)
attention instead of global O(T^2), fixing the encoder OOM on long clips
(~16.6-min clip: 54GB->9.4GB peak, ~4x faster) at NeMo's full [128,128] window.
Short clips are unchanged (global path). No C-ABI change.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat: forward reasoning_effort to the backend so jinja models honor it
reasoning_effort was only mapped to the binary enable_thinking toggle and
otherwise reached Go-side templates — it was never sent to the backend. So
jinja-templated models whose chat template keys on reasoning_effort (gpt-oss
Harmony, LFM2.5) could not be driven by it: LFM2.5 ignores enable_thinking and
kept emitting <think>.
Forward the effective reasoning_effort to the backend as a chat_template_kwarg
(mirroring enable_thinking) in grpc-server.cpp, and put it in PredictOptions
metadata (gRPCPredictOpts). Add a config-level default: ModelConfig.reasoning_effort
and Pipeline.reasoning_effort, resolved by ModelConfig.ApplyReasoningEffort
(request value overrides config default, none->disable / level->enable, an
operator's reasoning.disable wins). request.go now uses that helper.
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): set the pipeline LLM's reasoning_effort
Apply Pipeline.ReasoningEffort to the pipeline's LLM config when the realtime
model is built (per-session copy, overrides the LLM's own reasoning_effort),
and surface the resolved effort on the template input so Go-templated models
get it too. jinja models receive it via the backend metadata. This lets a
realtime pipeline disable thinking on models that only honor reasoning_effort
(e.g. LFM2.5), which enable_thinking can't.
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(distributed): self-heal stale 'model not loaded' routing
In distributed mode the registry can list a model as loaded on a node
while the worker has evicted it (autonomous LRU eviction, an out-of-band
unload, etc.) yet the backend process survives. The router's cached-node
check only verifies the process is alive (probeHealth), so it routes there
and inference fails with "<backend>: model not loaded" — and stays broken
until the controller restarts and rebuilds its registry.
InFlightTrackingClient now reconciles this: when a tracked inference call
returns a model-not-loaded error, it drops the stale replica row
(RemoveNodeModel) so the next request reloads the model on a healthy node
instead of routing back to the evicted one. The original error is returned
unchanged; only the registry is corrected.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(distributed): typed model-not-loaded error via gRPC status code
Replace the controller-side error-string match with a shared, code-aware
helper. Go error types don't survive the gRPC boundary, so the signal is
carried as a status code (FailedPrecondition):
- pkg/grpc/grpcerrors: ModelNotLoaded(backend) constructor +
IsModelNotLoaded(err) checker (status-code first, message fallback for
backends not yet migrated).
- InFlightTrackingClient.reconcile now uses grpcerrors.IsModelNotLoaded.
- Migrate the Go backends that emit this error (parakeet-cpp, cloud-proxy,
rfdetr-cpp) to the typed constructor.
Acting on a false positive is harmless (the model is just reloaded).
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The qwen3-tts.cpp backend honored the request `language` field only via exact lowercase two-letter codes in the C++ language_to_id table, silently defaulting to English for anything else (en-US, EN, english, ...).
Add normalizeLanguage() in the Go handler: lowercase + trim, strip the region/locale suffix (en-US, pt_BR, zh-Hans -> en/pt/zh), and resolve common English full names (english -> en). The canonical codes match the existing C++ table, so no C++ change is needed. Covered by a pure-Go Ginkgo spec. Also document the language field and accepted forms under the Qwen3-TTS docs.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>