The llama.cpp C++ backend aborts generation when its gRPC context is
cancelled (grpc-server.cpp polls context->IsCancelled() in the result
loops), but Go backends served by pkg/grpc never observed context
cancellation: a disconnected client left the generation running to
completion. Add an optional Cancellable capability; the server registers
context.AfterFunc on the request/stream context (after the Locking block
so queued requests cannot abort the current owner) covering both rich
and legacy paths. dllm implements it: measured cancel latency ~10ms vs
~10s of orphaned generation, and follow-up requests no longer queue
behind cancelled ones (~220ms vs ~9s in the e2e proof).
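
A minimal sketch of the mechanism described above, not the pkg/grpc code itself. The `Cancellable` interface shape, `toyBackend`, `serve` and `runWithCancelAfter` are assumptions made for illustration; only `context.AfterFunc` is the real standard-library call.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// Cancellable is the optional capability (interface shape assumed here).
type Cancellable interface {
	Cancel()
}

// toyBackend stands in for a Go backend; it is not the dllm implementation.
type toyBackend struct {
	cancelled atomic.Bool
}

func (b *toyBackend) Cancel() { b.cancelled.Store(true) }

// generate produces up to n tokens, checking the cancel flag between steps.
func (b *toyBackend) generate(n int) int {
	produced := 0
	for produced < n && !b.cancelled.Load() {
		time.Sleep(time.Millisecond)
		produced++
	}
	return produced
}

// serve hooks cancellation only for backends that opt in. stop() detaches
// the hook on return so a late cancel cannot reach a later request.
func serve(ctx context.Context, backend any, run func() int) int {
	if c, ok := backend.(Cancellable); ok {
		stop := context.AfterFunc(ctx, c.Cancel)
		defer stop()
	}
	return run()
}

// runWithCancelAfter cancels the request context after ms milliseconds and
// reports how many of the n requested tokens were produced.
func runWithCancelAfter(ms, n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	time.AfterFunc(time.Duration(ms)*time.Millisecond, cancel)
	b := &toyBackend{}
	return serve(ctx, b, func() int { return b.generate(n) })
}

func main() {
	fmt.Println("tokens before cancel:", runWithCancelAfter(20, 5000))
}
```

In the sketch the hook is registered immediately before the work starts, which mirrors the "after the Locking block" placement: a request still waiting for the lock has no hook installed yet and so cannot abort the current owner.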
Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Drop the stray executable bit from the Go sources and Makefile (the
sibling Go backends commit them 644; only run.sh/package.sh are
executable), and correct two documentation claims found in the final
branch review: cuda13-dllm is built for amd64 only (arm64 CUDA ships as
the l4t flavor), and package.sh is the parakeet-cpp-style stub layout
with no ldd walk.
Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Registers the dllm backend across every surface: backend gallery index
(cpu amd64+arm64 with manifest merge, cuda13, l4t-cuda13 for GB10-class
hardware; no darwin per engine scope), top-level Makefile targets,
bump_deps pin tracking for DLLM_VERSION, and the curated known-backends
list for /backends/known (pref-only: auto-detecting on .gguf would
shadow llama-cpp). Note: image builds and the nightly bump leg stay red
until github.com/mudler/dllm.cpp is published (planned at merge time).
Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Implements PredictRich/PredictStreamRich (legacy methods delegate),
TokenizeString, and Load over the purego binding. A single worker
goroutine serializes all C calls per the dllm.cpp one-generate-per-ctx
contract (cancel is the documented exception); an RWMutex guards Free
against in-flight requests. Under use_tokenizer_template the gemma4
renderer and streaming parser own templating and ChatDelta extraction;
raw-prompt mode passes through verbatim. enable_thinking is opt-in via
request metadata (the gemma4 template treats thinking as opt-in).
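
The serialization scheme can be sketched as follows, assuming a hypothetical `engine` type in place of the purego binding. A single goroutine drains a job channel, requests hold a read lock for their whole duration, and `Free` takes the write lock so it waits for in-flight requests. The cancel exception is left out of the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// engine is a hypothetical stand-in for the bound C context.
type engine struct {
	mu    sync.RWMutex // read-held by requests, write-held by Free
	freed bool
	jobs  chan func()
}

func newEngine() *engine {
	e := &engine{jobs: make(chan func())}
	go func() {
		// The only goroutine that ever runs engine calls.
		for job := range e.jobs {
			job()
		}
	}()
	return e
}

// Do runs call on the worker goroutine and waits for its result.
func (e *engine) Do(call func() int) (int, error) {
	e.mu.RLock()
	defer e.mu.RUnlock()
	if e.freed {
		return 0, errors.New("engine already freed")
	}
	done := make(chan int, 1)
	e.jobs <- func() { done <- call() }
	return <-done, nil
}

// Free blocks until every in-flight Do has returned, then stops the worker.
func (e *engine) Free() {
	e.mu.Lock()
	defer e.mu.Unlock()
	if !e.freed {
		e.freed = true
		close(e.jobs)
	}
}

// peakConcurrency fires n requests at once and returns the largest number
// of calls observed running at the same time.
func peakConcurrency(n int) int32 {
	e := newEngine()
	var active, peak int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			e.Do(func() int {
				cur := atomic.AddInt32(&active, 1)
				if cur > atomic.LoadInt32(&peak) {
					atomic.StoreInt32(&peak, cur)
				}
				atomic.AddInt32(&active, -1)
				return 0
			})
		}()
	}
	wg.Wait()
	e.Free()
	return atomic.LoadInt32(&peak)
}

func main() {
	fmt.Println("peak concurrent engine calls:", peakConcurrency(50))
}
```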
Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Fragment-safe state machine (content / channel header / thought /
tool-call / done) classifying model output into content,
reasoning_content and tool_calls deltas. Tool-call payload decoder is a
non-partial port of vLLM's gemma4 parser grammar; ~25 of its test cases
are ported with citations, plus a 2-split invariance property over
every byte position. Recursion depth-capped against model-generated
deep nesting; marker constants shared with the renderer.
Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Binds the 9-symbol flat C-ABI of dllm.cpp (DiffusionGemma engine) via
purego: typed wrappers with correct string ownership (malloc'd returns
freed via dllm_capi_free_string, borrowed last_error never freed),
once-allocated stream-callback trampolines, and a gated Ginkgo binding
smoke against the tiny fixture model.
Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
fix(vllm): restore compatibility with vLLM >= 0.22 (get_tokenizer moved)
vLLM 0.22 moved get_tokenizer from vllm.transformers_utils.tokenizer
to vllm.tokenizers. Since the backend requirements install vllm
unpinned, freshly built/installed vllm backends currently fail to
start with ModuleNotFoundError: No module named
'vllm.transformers_utils.tokenizer' (surfacing as 'grpc service not
ready' when loading a model).
Use the same try/except version-compat import pattern already used
elsewhere in this file: try the new vllm.tokenizers location first and
fall back to the pre-0.22 path.
Tested on a DGX Spark (GB10, ARM64) with the
cuda13-nvidia-l4t-arm64-vllm backend and vllm 0.22.0: model load, chat
completions and tool calls all work with this patch applied.
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(llama-cpp): bump to 8f83d6c for mtmd video input support
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): forward video input to mtmd (template + non-template paths)
Wire request->videos() into grpc-server.cpp, mirroring the existing image
and audio handling: build video_data and extract files on the non-template
path, and emit input_video chat chunks on the tokenizer-template path.
allow_video is auto-set at model load by the vendored upstream chat_params.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): add video attachment support to the chat UI
Mirror the image/audio attachment path for video: emit video_url content
parts, accept video/* in the picker, keep video files as base64, show a
film icon badge, and render attached video inline with a <video> player.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(llama-cpp): patch mtmd video stdin double-close (heap crash)
Upstream mtmd video input (ggml-org/llama.cpp#24269) double-fcloses the
ffmpeg/ffprobe stdin FILE: feed_stdin() fclose()s the FILE returned by
subprocess_stdin() (which is sp->stdin_file), then subprocess_destroy()
fclose()s the same pointer again -> heap corruption that aborts the
backend on any base64 input_video request (the CLI --video file path is
unaffected). Vendor a one-line fix (null sp->stdin_file after fclose)
via prepare.sh's patches/ until upstream merges it.
Verified e2e with gemma-4-e2b-it-qat-q4_0: video frames decode via
ffmpeg and the model answers correctly (red clip -> 'Red', blue -> 'Blue').
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(llama-cpp): re-pin to upstream #24316, drop vendored stdin patch
Upstream replaced the ad-hoc video stdin handling with a proper RAII
refactor (ggml-org/llama.cpp#24316, "mtmd: refactor video subproc
handling"), which includes the same `sp->stdin_file = nullptr` guard our
patch added (plus join-before-destroy ordering). Re-pin LLAMA_VERSION to
that branch head and drop patches/0001 - it's now redundant.
Verified e2e with gemma-4-e2b-it-qat-q4_0: no crash, video frames decode
and the model answers correctly (red clip -> "Red", blue -> "Blue").
NOTE: #24316 is not yet merged, so this pins to its branch-head commit
(28ca1e60). Re-pin to the squash-merge commit on master once it lands,
otherwise `git fetch` may lose the commit after the branch is deleted.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ⬆️ Update CrispStrobe/CrispASR
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(crispasr): link crispasr-lib CMake target instead of crispasr
The dependency-bump regeneration of this branch reset CMakeLists.txt to
master and dropped the prior link-target fix, reintroducing the
`cannot find -lcrispasr` failure. Upstream CrispASR (f7838a3) defines the
library as the CMake target `crispasr-lib` (with OUTPUT_NAME crispasr);
there is no target named `crispasr`, so target_link_libraries falls back
to a bare `-lcrispasr` linker flag that cannot be resolved. Point the link
at the real target name.
Verified locally: CPU cmake-configure of the bumped source generates a
gocrispasr link line referencing sources/CrispASR/src/libcrispasr.a with no
dangling -lcrispasr.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ⬆️ Update antirez/ds4
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(ds4): link ds4_ssd.o into the backend build
Upstream antirez/ds4 splits the SSD expert-cache into its own ds4_ssd.c
translation unit, whose symbols (ds4_ssd_memory_lock_acquire/release,
ds4_ssd_cache_experts_for_byte_budget, ds4_ssd_auto_cache_plan) are
referenced by ds4.c/ds4_cpu.o. The dependency-bump automation regenerated
this branch from clean master and dropped the prior linkage fix, so the
cpu-ds4 / cublas-ds4 backend builds fail again with undefined references.
Re-apply the ds4_ssd.o linkage GPU-agnostically (mirroring ds4_distributed.o)
in both the backend Makefile (DS4_OBJ_TARGET + the engine-object build rule
for every GPU mode) and CMakeLists.txt (list(APPEND DS4_OBJS ds4_ssd.o)).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(parakeet-cpp): real segment timestamps (NeMo-faithful)
Offline: replace the single synthetic whole-clip segment with multiple
segments grouped exactly like NeMo's get_segment_offsets - a new segment
after sentence-ending punctuation ('. ? !'), each carrying start/end and
its time-window token ids. The optional model option segment_gap_threshold
(NeMo's unit: encoder FRAMES, default 0=off) adds NeMo's silence-gap split,
converted to seconds via the JSON frame_sec the engine now reports.
Per-segment words are still gated behind timestamp_granularities=["word"];
a zero-word document falls back to a single text segment.
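
A simplified sketch of the two split rules, with hypothetical types and function name. It closes a segment after a word ending in '.', '?' or '!', and, when the gap threshold is non-zero, also splits where the silence between two words reaches `gapFrames * frameSec` seconds. It does not claim to reproduce NeMo's exact branch ordering, which the real splitWordsIntoSegments follows.

```go
package main

import (
	"fmt"
	"strings"
)

type word struct {
	Text       string
	Start, End float64 // seconds
}

type segment struct {
	Text       string
	Start, End float64
}

// splitWords groups words into segments (illustrative simplification).
// gapFrames is in encoder frames; 0 disables the silence-gap rule.
func splitWords(words []word, gapFrames int, frameSec float64) []segment {
	var segs []segment
	var cur []word
	flush := func() {
		if len(cur) == 0 {
			return
		}
		texts := make([]string, len(cur))
		for i, w := range cur {
			texts[i] = w.Text
		}
		segs = append(segs, segment{
			Text:  strings.Join(texts, " "),
			Start: cur[0].Start,
			End:   cur[len(cur)-1].End,
		})
		cur = nil
	}
	gapSec := float64(gapFrames) * frameSec // frames converted to seconds
	for _, w := range words {
		if gapFrames > 0 && len(cur) > 0 && w.Start-cur[len(cur)-1].End >= gapSec {
			flush() // long silence before this word
		}
		cur = append(cur, w)
		if w.Text != "" && strings.ContainsAny(w.Text[len(w.Text)-1:], ".?!") {
			flush() // sentence-ending punctuation closes the segment
		}
	}
	flush()
	return segs
}

func main() {
	words := []word{
		{"Hello", 0, 0.5}, {"world.", 0.5, 1.0},
		{"Next", 1.2, 1.5}, {"one", 4.0, 4.3},
	}
	for _, s := range splitWords(words, 10, 0.08) {
		fmt.Printf("%.1f-%.1f %q\n", s.Start, s.End, s.Text)
	}
}
```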
Streaming: when libparakeet.so exposes the ABI v4 JSON entry points
(probed), drive parakeet_capi_stream_feed_json / _finalize_json and
accumulate the streamed per-word timestamps into per-utterance segments
(EOU stays the boundary), so streaming FinalResult segments now carry
start/end. Falls back to the text-only feed against an older library.
Pure-Go specs cover splitWordsIntoSegments (punctuation + gap rules, NeMo
elif order, fallback), transcriptResultFromDoc (multi-segment, token
windows, word-granularity gate), and the streaming segmenter.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(audio): document parakeet-cpp segment timestamps + segment_gap_threshold
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(parakeet-cpp): update model-gated specs for multi-segment output
The offline AudioTranscription specs asserted the old single synthetic
segment (Segments HaveLen(1), Segments[0].Text == res.Text). With
NeMo-faithful segmentation a multi-sentence clip now yields multiple
punctuation-delimited segments, so assert the new contract instead:
one-or-more time-ordered segments, each with text and (under word
granularity) per-segment words whose span tracks the segment start/end.
Caught by running the model-gated suite on the dgx (GB10) against the
real tdt_ctc-110m + realtime_eou models.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* chore(turboquant): bump TheTom/llama-cpp-turboquant to 7d9715f1
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(turboquant): drop obsolete legacy-spec shim after fork rebased
The TheTom/llama-cpp-turboquant fork (pin c9aa86a) rebased past the
upstream common_params_speculative refactor (ggml-org/llama.cpp
#22397/#22838/#22964), the model_tgt rename (#22838) and get_media_marker
(#21962). The old fork-compat shim forced now-wrong legacy code paths,
breaking the build with errors like 'struct common_params_speculative has
no member named mparams_dft / type' and 'server_context_impl has no member
named model'.
Remove the obsolete LOCALAI_LEGACY_LLAMA_CPP_SPEC branches from the shared
grpc-server.cpp (stock llama-cpp and the modern fork both take the modern
path now), and narrow the one remaining gap (the fork still lacks
common_params::checkpoint_min_step) to a dedicated
LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP guard injected by
patch-grpc-server.sh. The patch script now only adds the turbo2/3/4
KV-cache types and injects that one macro.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(turboquant): HIP-port the fork's CUDA additions (copy2d 3D-peer + cudaEventCreate)
The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs that
ggml's HIP/MUSA shim does not provide, breaking the -gpu-rocm-hipblas-turboquant
build. patches/0001-hip-guard-copy2d-peer-fastpath.patch (applied by
apply-patches.sh) ports them:
- Guard ggml_cuda_copy2d_across_devices's 3D-peer copy fast path with
  #if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) so HIP/MUSA fall through
  to the existing cudaMemcpyAsync staging fallback (HIP genuinely lacks
  cudaMemcpy3DPeerAsync, per the fork's own comment).
- Create the device event in ggml_backend_cuda_device_event_new with the
  HIP-aliased cudaEventCreateWithFlags(.., cudaEventDisableTiming) instead of the
  un-aliased plain cudaEventCreate, matching this file's own usage elsewhere.
CUDA builds are unaffected.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* ci(turboquant): drop the ROCm/hipblas build flavor
The TheTom/llama-cpp-turboquant fork is not ROCm-clean at the current pin:
beyond the CUDA-API gaps already patched (3D-peer copy, cudaEventCreate),
its llama.cpp base fails to compile the flash-attention MMA f16 kernels for
head-dim 640 under HIP (cols_per_warp evaluates to 0 -> division-by-zero /
non-constant static asserts in fattn-mma-f16.cuh). That is a deep
ggml-on-ROCm kernel issue, not something a small fork patch can paper over.
Drop -gpu-rocm-hipblas-turboquant from the build matrix so turboquant still
ships for cpu / cublas / vulkan / sycl. Re-add it once the fork's HIP path
compiles (or upstream ggml fixes the large-head-dim MMA kernels for ROCm).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(stablediffusion-ggml): support Ideogram4 unconditional diffusion model
Bump stable-diffusion.cpp from 1f9ee88 to b9254dd, the upstream commit that
adds Ideogram4 support (leejet/stable-diffusion.cpp#1609). Ideogram4 derives
its classifier-free guidance from a separate unconditional diffusion model,
exposed upstream through the new sd_ctx_params_t.uncond_diffusion_model_path
field.
Wire that field into the gosd wrapper via a new uncond_diffusion_model_path
option. The _path suffix is deliberate: the Go loader only resolves options
whose name contains "path" to an absolute path under the model directory, so
this keeps the option consistent with diffusion_model_path and
high_noise_diffusion_model_path.
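
The naming rule can be sketched like this. The `key:value` option format, the `resolveOption` name and the use of the slash-separated `path` package (chosen so the output is the same on every platform) are assumptions for illustration, not the loader's actual code.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// resolveOption splits a "key:value" option and, only when the key
// contains "path", anchors a relative value under modelDir.
func resolveOption(modelDir, opt string) (key, value string) {
	key, value, _ = strings.Cut(opt, ":")
	if strings.Contains(key, "path") && value != "" && !path.IsAbs(value) {
		value = path.Join(modelDir, value)
	}
	return key, value
}

func main() {
	fmt.Println(resolveOption("/models", "uncond_diffusion_model_path:ideogram/uncond.gguf"))
	fmt.Println(resolveOption("/models", "sampler:euler"))
}
```

Under this rule an option named without the `path` substring would reach the wrapper as a bare relative string, which is why the suffix matters.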
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(gallery): add Ideogram4 stablediffusion-ggml models
Single-file GGUF weights for Ideogram4 are now published
(stduhpf/ideogram-4-gguf), so add the model to the gallery. Ideogram4 is a
text-to-image model with strong, accurate in-image text rendering, driven by
a Qwen3-VL-8B text encoder and real classifier-free guidance from a separate
unconditional diffusion model (the uncond_diffusion_model_path support added
in the preceding commit).
Two index entries, both built on gallery/virtual.yaml with the full config
inlined in overrides (same pattern as the other models, no dedicated template
file):
- ideogram-4-iq4nl-ggml (4-bit, ~11.6GB diffusion)
- ideogram-4-q8_0-ggml (8-bit, ~20GB diffusion)
Each bundles the diffusion + unconditional GGUF (stduhpf), the
Qwen3-VL-8B-Instruct text encoder (unsloth), and the FLUX.2 VAE (Comfy-Org
mirror, non-gated). cfg_scale is 7 to match the upstream Ideogram4 default,
since it performs real CFG unlike the guidance-distilled Flux/Z-Image models.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(parakeet-cpp): honor request language (multilingual nemotron) on batched + streaming paths
Reads opts.GetLanguage() and threads it through to the new
parakeet_capi_transcribe_pcm_batch_json_lang and parakeet_capi_stream_begin_lang
C-API entry points, both probed with Dlsym so the backend still loads against an
older libparakeet.so (falling back to the non-lang paths, i.e. model default).
parakeet.cpp's batched C-API takes a single target_lang for the whole batch, so
the dispatcher only coalesces same-language requests: a request whose language
differs from the batch leader is held as a single carry-over and becomes the
leader of the next batch, never dropped and never left waiting (including on
shutdown). A new batcher test asserts no dispatched batch is ever mixed-language
and that every submitted request still receives a reply.
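
The coalescing rule reduces to the following sketch over an already-ordered queue. The real dispatcher is driven by incoming requests rather than a slice, and `request`, `coalesce` and `maxBatch` are hypothetical names; the point is only that a language mismatch closes the current batch and the mismatching request leads the next one.

```go
package main

import "fmt"

type request struct {
	ID   int
	Lang string
}

// coalesce groups queued requests into batches that each share the
// language of their first (leader) request.
func coalesce(queue []request, maxBatch int) [][]request {
	var batches [][]request
	var cur []request
	for _, r := range queue {
		if len(cur) > 0 && (r.Lang != cur[0].Lang || len(cur) == maxBatch) {
			batches = append(batches, cur)
			cur = nil // r is carried over and leads the next batch
		}
		cur = append(cur, r)
	}
	if len(cur) > 0 {
		batches = append(batches, cur) // never drop the tail, e.g. on shutdown
	}
	return batches
}

func main() {
	queue := []request{{1, "en"}, {2, "en"}, {3, "de"}, {4, "en"}}
	for _, b := range coalesce(queue, 8) {
		fmt.Println(b)
	}
}
```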
Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(gallery): add parakeet-cpp-nemotron-3.5-asr-streaming-0.6b; bump parakeet.cpp pin
Adds the multilingual prompt-conditioned streaming model to the gallery (q8_0
default, OpenMDW-1.1) and bumps the parakeet-cpp backend pin to the parakeet.cpp
commit that ships nemotron support plus batched causal subsampling and the
batched target_lang C-API.
Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
chore(parakeet-cpp): bump pin to banded long-audio attention (843600590)
Update PARAKEET_VERSION to mudler/parakeet.cpp@843600590f
(merge of parakeet.cpp#9). Brings NeMo rel_pos_local_attn banded/Longformer
attention with the chunk-matmul construction: long audio now uses O(T*window)
attention instead of global O(T^2), fixing the encoder OOM on long clips
(~16.6-min clip: 54GB->9.4GB peak, ~4x faster) at NeMo's full [128,128] window.
Short clips are unchanged (global path). No C-ABI change.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat: forward reasoning_effort to the backend so jinja models honor it
reasoning_effort was only mapped to the binary enable_thinking toggle and
otherwise reached Go-side templates; it was never sent to the backend. As a
result, jinja-templated models whose chat template keys on reasoning_effort
(gpt-oss Harmony, LFM2.5) could not be driven by it: LFM2.5 ignores
enable_thinking and kept emitting <think>.
Forward the effective reasoning_effort to the backend as a chat_template_kwarg
(mirroring enable_thinking) in grpc-server.cpp, and put it in PredictOptions
metadata (gRPCPredictOpts). Add a config-level default: ModelConfig.reasoning_effort
and Pipeline.reasoning_effort, resolved by ModelConfig.ApplyReasoningEffort
(request value overrides config default, none->disable / level->enable, an
operator's reasoning.disable wins). request.go now uses that helper.
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): set the pipeline LLM's reasoning_effort
Apply Pipeline.ReasoningEffort to the pipeline's LLM config when the realtime
model is built (per-session copy, overrides the LLM's own reasoning_effort),
and surface the resolved effort on the template input so Go-templated models
get it too. jinja models receive it via the backend metadata. This lets a
realtime pipeline disable thinking on models that only honor reasoning_effort
(e.g. LFM2.5), which enable_thinking can't.
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(distributed): self-heal stale 'model not loaded' routing
In distributed mode the registry can list a model as loaded on a node
while the worker has evicted it (autonomous LRU eviction, an out-of-band
unload, etc.) yet the backend process survives. The router's cached-node
check only verifies the process is alive (probeHealth), so it routes there
and inference fails with "<backend>: model not loaded" — and stays broken
until the controller restarts and rebuilds its registry.
InFlightTrackingClient now reconciles this: when a tracked inference call
returns a model-not-loaded error, it drops the stale replica row
(RemoveNodeModel) so the next request reloads the model on a healthy node
instead of routing back to the evicted one. The original error is returned
unchanged; only the registry is corrected.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(distributed): typed model-not-loaded error via gRPC status code
Replace the controller-side error-string match with a shared, code-aware
helper. Go error types don't survive the gRPC boundary, so the signal is
carried as a status code (FailedPrecondition):
- pkg/grpc/grpcerrors: ModelNotLoaded(backend) constructor +
  IsModelNotLoaded(err) checker (status-code first, message fallback for
  backends not yet migrated).
- InFlightTrackingClient.reconcile now uses grpcerrors.IsModelNotLoaded.
- Migrate the Go backends that emit this error (parakeet-cpp, cloud-proxy,
  rfdetr-cpp) to the typed constructor.
Acting on a false positive is harmless (the model is just reloaded).
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The qwen3-tts.cpp backend honored the request `language` field only via exact lowercase two-letter codes in the C++ language_to_id table, silently defaulting to English for anything else (en-US, EN, english, ...).
Add normalizeLanguage() in the Go handler: lowercase + trim, strip the region/locale suffix (en-US, pt_BR, zh-Hans -> en/pt/zh), and resolve common English full names (english -> en). The canonical codes match the existing C++ table, so no C++ change is needed. Covered by a pure-Go Ginkgo spec. Also document the language field and accepted forms under the Qwen3-TTS docs.
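
A minimal sketch of the normalization steps listed above. Only the `english -> en` mapping comes from this message; the other full-name entries are illustrative assumptions, and the real table may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// fullNames maps English language names to two-letter codes.
// Entries other than "english" are assumptions for illustration.
var fullNames = map[string]string{
	"english":    "en",
	"chinese":    "zh",
	"portuguese": "pt",
}

// normalizeLanguage lowercases and trims the input, strips a region or
// script suffix after '-' or '_', then resolves full names.
func normalizeLanguage(s string) string {
	s = strings.ToLower(strings.TrimSpace(s))
	if i := strings.IndexAny(s, "-_"); i >= 0 {
		s = s[:i]
	}
	if code, ok := fullNames[s]; ok {
		return code
	}
	return s
}

func main() {
	for _, in := range []string{"en-US", " EN ", "pt_BR", "zh-Hans", "english"} {
		fmt.Printf("%q -> %q\n", in, normalizeLanguage(in))
	}
}
```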
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The OpenAI-compatible TTS endpoint accepts an `instructions` field, but it
was silently dropped at the HTTP->gRPC boundary: neither schema.TTSRequest
nor the gRPC TTSRequest proto carried it, so backends could only read such a
value from static YAML options (identical for every request). This blocked
per-line emotion/style and, for Qwen3-TTS VoiceDesign, limited a model config
to a single designed voice.
Plumb a generic per-request instruction string end to end, plus an optional
backend-specific params map:
- proto: add `optional string instructions` and `map<string,string> params`
  to TTSRequest.
- schema: add Instructions (maps OpenAI `instructions`) and Params (LocalAI
  extension) to schema.TTSRequest.
- core: thread both through ModelTTS/ModelTTSStream via a newTTSRequest helper
  that attaches instructions only when non-empty (so backends can fall back to
  YAML when unset); forward them from the /v1/audio/speech handler.
- qwen-tts: prefer the per-request instruction over the YAML `instruct` option
  (used by both mode detection and generation) and merge per-request params.
- chatterbox: merge per-request params (coerced to float/int/bool) over YAML
  options into generate() kwargs.
Fully backward compatible: empty instructions fall back to the YAML option and
backends that don't support style/voice instructions ignore the field.
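
The "attach only when non-empty" rule and the backend-side fallback can be sketched as below. The local `ttsRequest` struct stands in for the generated proto type (an `optional string` field surfaces as a pointer in Go), and `effectiveInstruction` is a hypothetical helper showing how a backend might apply the fallback.

```go
package main

import "fmt"

// ttsRequest is a local stand-in for the generated gRPC message.
type ttsRequest struct {
	Text         string
	Instructions *string // nil when the client sent none
	Params       map[string]string
}

// newTTSRequest leaves Instructions unset for an empty string so the
// backend can tell "not provided" apart from a real value.
func newTTSRequest(text, instructions string, params map[string]string) *ttsRequest {
	req := &ttsRequest{Text: text, Params: params}
	if instructions != "" {
		req.Instructions = &instructions
	}
	return req
}

// effectiveInstruction prefers the per-request value over the YAML option.
func effectiveInstruction(req *ttsRequest, yamlInstruct string) string {
	if req.Instructions != nil {
		return *req.Instructions
	}
	return yamlInstruct
}

func main() {
	withStyle := newTTSRequest("hello", "whisper, slightly amused", nil)
	plain := newTTSRequest("hello", "", nil)
	fmt.Println(effectiveInstruction(withStyle, "calm narrator"))
	fmt.Println(effectiveInstruction(plain, "calm narrator"))
}
```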
Closes #10164
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The direct (non-batched) transcription path handed the original upload
path straight to the C library via parakeet_capi_transcribe_path_json.
That loader only understands 16 kHz mono WAV/PCM, so any other format
(MP3, etc.) failed with "parakeet: failed to load audio: <file>".
Only the batched path converted the input (via decodeWavMono16k ->
utils.AudioToWav). Every other audio backend (whisper, crispasr)
converts unconditionally with utils.AudioToWav before handing the file
to its engine; the parakeet-cpp fallback was the lone exception.
Extract a convertToWavMono16k helper (reused by decodeWavMono16k) that
produces a 16 kHz mono WAV in a temp dir, and run the non-batched path
through it before calling the C loader. WAV inputs already in the target
format are passed through without ffmpeg.
Add specs covering the helper (decodable copy + cleanup, and an error on
a missing input) that need neither the model, the C library, nor ffmpeg.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>