LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-13 03:09:03 -04:00

Author	SHA1	Message	Date
Richard Palethorpe	085fc53bbc	fix(router): production-ready request router + auto-size batch for embedding/rerank (#10104 ) * fix(router): score classifier production-readiness Conversation trimming runs through the classifier model's chat template and trims by exact token count, sized to the model's n_batch which is now scaled to context so long probes can't crash the backend. Missing chat_message templates are a hard error at router build time. Router- facing factories (Embedder/Scorer/Reranker/TokenCounter) re-resolve ModelConfig per call so a model installed post-startup doesn't bind a stub Backend="" config and silently fall into the loader's auto- iterate path. New 'vector_store' backend trace recorded inside localVectorStore on every Search/Insert — including the backend-load-failure path that previously vanished into an xlog.Warn — with outcome tagging (hit/miss/empty_store/backend_load_error/find_error/insert_error/ok). Companion cleanup drops misleading similarity:0 and input_tokens_count:0 from non-hit and text-mode traces. Gallery local-store-development aliases to 'local-store' so the master image satisfies pkg/model.LocalStoreBackend lookups from the embedding cache. Misc: llama-cpp TokenizeString reads the correct 'prompt' JSON key (the original bug); ModelTokenize nil-guard; non-fatal mitm proxy startup; PII 'route_local' renamed to 'allow' with docs/UI in sync; model-editor footer no longer eats the edit area on small screens; several config-editor template/dropdown/section fixes. Tests: e2e router specs (casual/code-hint + long-conversation trim), vector_store trace specs, lazy-factory specs, gallery dev-alias resolution, Playwright trace badge + scroll regression. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(backend): auto-size batch to context for embedding and rerank models Embedding and rerank models pool over the whole input in a single physical batch (n_ubatch). With batch left at the 512 default, the backend rejects longer inputs with "input is too large to process", silently capping a large-context embedder (e.g. 8k/32k) at 512 tokens. Size n_batch to the context for these single-pass usecases, mirroring the existing FLAG_SCORE behaviour; an explicit batch: still wins. Extracts EffectiveContextSize/EffectiveBatchSize from grpcModelOpts so the effective decode window has one home for other callers to reuse. Adds an e2e-aio regression test that embeds a >512-token input. The AIO embedding model is switched to nomic-embed-text-v1.5 (2048 context) because the previous granite model was capped at 512 tokens and could not exercise the larger batch. Assisted-by: claude-code:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(gallery): raise arch-router scoring output cap via parallel:64 Scoring decodes the whole prompt+candidate in a single llama_decode and reads one logit row per candidate token. The vendored llama.cpp server caps causal output rows at n_parallel, so the default of 1 aborts with GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) on multi-token route labels. Set options: [parallel:64] on both arch-router quant entries to lift the cap; kv_unified (the grpc-server default) keeps the full context per sequence, so this does not split the KV cache. Assisted-by: claude-code:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-06-12 16:21:15 +02:00
LocalAI [bot]	892fc49949	feat(realtime): stream the LLM / TTS / transcription pipeline stages (#10176 ) * feat(realtime): pipeline streaming + disable_thinking config Add a nested pipeline.streaming.{llm,tts,transcription} block plus pipeline.disable_thinking, with StreamLLM/StreamTTS/StreamTranscription/ ThinkingDisabled helpers. Pointer-bools so unset keeps the unary path; existing configs are unaffected. Wiring into the realtime handler follows. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): sentence segmenter for streamed LLM->TTS pipelining streamSegmenter accumulates streamed LLM tokens and emits complete sentence/clause segments (terminator+whitespace, or newline) so TTS can synthesize each segment as it completes instead of waiting for the whole reply. Pure helper; the streaming handler wiring consumes it next. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): streaming TTS/transcription methods on Model interface Add TTSStream and TranscribeStream to the realtime Model interface and implement them on wrappedModel (delegating to backend.ModelTTSStream / ModelTranscriptionStream) and transcriptOnlyModel. ttsStream adapts the backend's WAV-framed stream (44-byte header carrying the sample rate, then PCM) into raw PCM + sample rate for the realtime transports. Handler wiring that consumes these (flag-gated) follows. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): emitSpeech with flag-gated streaming TTS emitSpeech synthesizes a piece of text and forwards audio to the client, streaming one output_audio.delta per backend PCM chunk when the pipeline sets streaming.tts, or one delta for the whole utterance otherwise. WebRTC gets raw PCM (it resamples internally); WebSocket gets base64 PCM at the session rate. It emits no transcript/audio-done events so a streamed reply can be split into multiple spoken segments sharing one response. Adds fakeModel/fakeTransport test doubles for the realtime Model/Transport interfaces, driving streaming assertions deterministically. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): route response audio through emitSpeech (streaming TTS) Replace the inline unary TTS block in the response handler with emitSpeech, which streams a response.output_audio.delta per backend PCM chunk when pipeline.streaming.tts is set and otherwise preserves the single-delta unary behaviour. emitSpeech returns the accumulated base64 audio, stored on the conversation item as before. Transcript and audio-done events stay in the handler so later per-segment streaming can reuse emitSpeech. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): streaming transcription text deltas Add emitTranscription and route commitUtterance through it. With pipeline.streaming.transcription set it streams each transcript fragment as a conversation.item.input_audio_transcription.delta via TranscribeStream then a completed event; otherwise it preserves the single completed-event unary behaviour. Returns the final transcript for response generation. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): pipeline disable_thinking maps to enable_thinking off applyPipelineThinking forces the LLM's ReasoningConfig.DisableReasoning when pipeline.disable_thinking is set, which gRPCPredictOpts turns into the enable_thinking=false backend metadata. Applied at newModel construction on the per-session LLM config copy, so it doesn't leak to other model users and needs no realtime-specific request plumbing. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): speechStreamer for token-streamed LLM->TTS emitSpeech now returns raw PCM (caller base64-encodes) so streamed segments accumulate correctly. speechStreamer consumes streamed LLM tokens: it strips reasoning via the streaming ReasoningExtractor, emits a transcript delta per content fragment, and sentence-pipes content into emitSpeech so each sentence is synthesized as soon as it's ready. Handler wiring (plain-content turns) follows. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): wire streamLLMResponse for token-streamed replies triggerResponseAtTurn takes a streamed path when pipeline.streaming.llm is set, the turn has no tools, and audio is requested: streamLLMResponse announces the assistant item, drives the LLM token callback through a speechStreamer (reasoning-stripped transcript deltas + sentence-piped TTS), and emits the terminal events. Tool turns and non-streaming pipelines keep the existing buffered path unchanged, so this is strictly opt-in. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(realtime): document pipeline streaming + disable_thinking Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(realtime): register pipeline streaming/thinking config fields TestAllFieldsHaveRegistryEntries (core/config/meta) requires every config field to have a meta registry entry. The four new pipeline fields (disable_thinking, streaming.{llm,tts,transcription}) had none, failing tests-linux/tests-apple. Add toggle entries for them. Also handle the os.Remove return in realtime_speech_test.go to satisfy errcheck (golangci-lint). Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(realtime): always strip reasoning from spoken output disable_thinking maps to ReasoningConfig.DisableReasoning=true on the LLM config, which the backend reads as enable_thinking=false. But the realtime handler reads that SAME config to drive reasoning extraction, and there DisableReasoning=true means "skip stripping". PredictConfig() returns this LLM config, so both the streamed (speechStreamer) and buffered realtime paths stopped stripping <think>…</think> exactly when disable_thinking was on — leaking raw reasoning to the client whenever the model ignored the enable_thinking hint (e.g. lfm2.5). Add spokenReasoningConfig() which clears DisableReasoning for extraction (keeping custom tokens/tag pairs) and route both realtime paths through it. Spoken output now always strips reasoning, independent of the backend suppression hint. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(realtime): clean TTS temp path before read (gosec G304) emitSpeech reads the WAV file the TTS backend wrote. The read moved here from realtime.go, so code-scanning flagged it as a new G304 alert even though the path is backend-controlled (a temp file), not user input. Wrap it in filepath.Clean — a real path normalization that also clears the alert, keeping with the repo's no-#nosec convention. Assisted-by: Claude:claude-opus-4-8 gosec, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(realtime): buffer whole message for TTS, drop sentence segmenter Per review (richiejp): the sentence segmenter pipelined unary TTS by splitting on ASCII .!?/newline, which does nothing for languages without those boundaries (CJK/Thai) — there it already degraded to buffering the whole message anyway. Replace it with a uniform model: stream the LLM transcript live, buffer the full message, then synthesize it once. emitSpeech already streams the audio chunks when the backend implements TTSStream and falls back to a single unary delta otherwise, so this is real streaming TTS where supported and a clean whole-message synthesis elsewhere — no per-sentence emulation, no language assumptions. speechStreamer becomes transcriptStreamer (transcript deltas only); the whole-message synthesis moves into streamLLMResponse. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): stream tool-call turns via tokenizer-template autoparser Per review (richiejp): tool-call deltas exist, so streaming should work with tools too. It does — for models that use their tokenizer template. The C++ autoparser then clears reply.Message and delivers content + tool calls via ChatDeltas, so the streamed transcript carries only spoken content (no tool-call JSON leak) and the tool calls are parsed from the final response. - Drop the len(tools)==0 gate; stream when no tools OR use_tokenizer_template (grammar-based function calling still buffers, since its call is emitted as JSON in the token stream and would leak into the transcript). - streamLLMResponse takes tools/toolChoice/toolTurn, reads ChatDelta content in the token callback, parses tool calls from the final ChatDeltas, and creates the assistant content item lazily so a content-less tool turn emits only the tool calls. - Extract emitToolCallItems from the buffered path so both paths finalize tool calls, response.done, and server-side assistant-tool follow-ups identically. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): script-aware clause chunking + streamed-reply fixes Opt-in pipeline.streaming.clause_chunking splits the streamed LLM reply into speakable clauses and synthesizes each as soon as it completes, lowering time-to-first-audio instead of buffering the whole message. The splitter is script-aware (rivo/uniseg, pure Go): UAX#29 sentence segmentation handles CJK 。！？ with no whitespace, CJK clause punctuation (，、；：) and Thai/Lao spaces give finer cuts, and a UAX#14 line-break cap bounds an over-long punctuation-less run. Unlike the old ASCII .!?/newline segmenter (dropped in `076dcdbe`) it does not degrade to whole-message buffering for CJK/Thai; scripts needing a dictionary (Khmer/Burmese) stay buffered until a space or end-of-message. Clauses are synthesized synchronously in the token callback (the LLM keeps generating into the gRPC stream meanwhile), so audio still starts mid-generation. Off by default — the whole-message path is unchanged. Also fix the streamed-reply path and the Talk page: - Don't swallow streamed autoparser content as reasoning: the tokenizer-template path already delivers reasoning-free content via ChatDeltas, so prefilling the thinking start token re-tagged it as an unclosed reasoning block, leaving no spoken reply. Disable the prefill on that path; closed tag pairs are still stripped (#9985). - Generate collision-free realtime IDs (16 random bytes) instead of a constant, so per-item bookkeeping (cancel, conversation.item.retrieve) works. - Key the Talk transcript by the server item_id and upsert entries. Realtime events arrive over a WebRTC data channel — outside React's event system — so React defers the setTranscript updaters while synchronous ref writes in handler bodies run first; the old index-tracking ref rendered a duplicate assistant bubble on completion. Upserts by item_id are idempotent and order-independent. - Drop the partial assistant bubble on a cancelled response (barge-in): the server discards the interrupted item and sends response.done with status "cancelled"; mirror that in the UI so the regenerated reply isn't rendered as a second assistant message. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Assisted-by: Claude:claude-fable-5 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Richard Palethorpe <io@richiejp.com> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Richard Palethorpe <io@richiejp.com>	2026-06-11 08:43:12 +01:00
LocalAI [bot]	e837921c2c	feat: forward reasoning_effort to the backend so jinja models honor it (#10184 ) * feat: forward reasoning_effort to the backend so jinja models honor it reasoning_effort was only mapped to the binary enable_thinking toggle and otherwise reached Go-side templates — it was never sent to the backend. So jinja-templated models whose chat template keys on reasoning_effort (gpt-oss Harmony, LFM2.5) could not be driven by it: LFM2.5 ignores enable_thinking and kept emitting <think>. Forward the effective reasoning_effort to the backend as a chat_template_kwarg (mirroring enable_thinking) in grpc-server.cpp, and put it in PredictOptions metadata (gRPCPredictOpts). Add a config-level default: ModelConfig.reasoning_effort and Pipeline.reasoning_effort, resolved by ModelConfig.ApplyReasoningEffort (request value overrides config default, none->disable / level->enable, an operator's reasoning.disable wins). request.go now uses that helper. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): set the pipeline LLM's reasoning_effort Apply Pipeline.ReasoningEffort to the pipeline's LLM config when the realtime model is built (per-session copy, overrides the LLM's own reasoning_effort), and surface the resolved effort on the template input so Go-templated models get it too. jinja models receive it via the backend metadata. This lets a realtime pipeline disable thinking on models that only honor reasoning_effort (e.g. LFM2.5), which enable_thinking can't. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 13:45:43 +00:00
Richard Palethorpe	6a80e23733	feat(middleware): Model routing, PII filtering, Cloud model proxies (#9802 ) Add a routing middleware stack and a cloud-proxy backend. * cloud-proxy: a Go gRPC backend that forwards OpenAI- and Anthropic-shaped chat requests to upstream providers, with an optional translate mode (OpenAI request -> Anthropic /v1/messages -> OpenAI response) and full tool-calling support. * routing: admission control, content-aware model routing (embedding cache + classifier + rerank + Arch-Router score), PII detection/redaction (regex + NER) with streaming filter and OpenAI/Anthropic adapters, and a per-user/per-key billing recorder backed by GORM or in-memory storage. * middleware: UsageMiddleware records usage via the billing recorder, plus admission, route-model, usage-stamp and trace middlewares. * observability: BackendTrace ring buffer stores full request bodies (capped), MITM proxy emits structured trace events, and router classifier decisions surface at /api/router/decide. * gallery: Arch-Router-1.5B (Q4_K_M and Q8_0). * UI: cloud-proxy model-editor fields, classifier system-prompt and score-normalization config, and a Traces page rendering request bodies. Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash] Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-25 09:28:27 +02:00
Richard Palethorpe	9ac1bdc587	feat(ui): Interactive model config editor with autocomplete (#9149 ) * feat(ui): Add dynamic model editor with autocomplete Signed-off-by: Richard Palethorpe <io@richiejp.com> * chore(docs): Add link to longformat installation video Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-04-07 14:42:23 +02:00
Richard Palethorpe	557d0f0f04	feat(api): Allow coding agents to interactively discover how to control and configure LocalAI (#9084 ) Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-04-04 15:14:35 +02:00

6 Commits