* feat(ui): clone a chat into a new conversation (#10645)
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): retry any assistant answer, not just the last (#10645)
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): copy an entire chat to the clipboard (#10645)
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): branch a new chat from any assistant answer (#10645)
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ui): send truncated history on mid-conversation retry (#10645)
Mid-conversation retry regenerated an answer with the downstream turns
still in the model's context. handleRegenerate truncated the DOM history
via updateChatSettings (a scheduled state update), but the synchronous
sendMessage that followed read the stale, pre-truncation history from its
closure to build the outbound API payload. Thread the intended base
history explicitly through sendMessage's options.baseHistory so the
request body matches the truncated view. Backward compatible: the normal
send path (no baseHistory) is unchanged.
Also guard two minor issues in Chat.jsx: the "Branch from here" button now
renders under !isStreaming to match the retry button, and the duplicate
toast only fires when forkChat returns a chat (not on a null result).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(grpc): self-terminate backend workers when LocalAI dies non-gracefully
Symptom: a backend model-worker subprocess (the per-model gRPC server LocalAI
spawns) can be orphaned and linger — holding VRAM and its listen port — if the
LocalAI process is killed non-gracefully (e.g. a supervisor's graceful-shutdown
grace period elapses and LocalAI is SIGKILLed) before its own teardown runs.
Root cause: LocalAI's graceful teardown (pkg/signals/handler.go installs the
SIGINT/SIGTERM handler; core/cli/run.go registers app.Shutdown ->
ModelLoader.StopAllGRPC -> process.Stop in pkg/model/process.go) only runs when
LocalAI receives a catchable signal and survives long enough to run its
handlers. Backends are spawned via github.com/mudler/go-processmanager v0.1.1,
whose getSysProcAttr() sets Setpgid:true (own process group, so the group can be
signalled) but never PR_SET_PDEATHSIG/Pdeathsig, and exposes no Config field or
option for a caller to inject/extend SysProcAttr. LocalAI fully delegates
spawning to that library (it never builds the exec.Cmd itself), so it cannot set
a kernel parent-death signal at the spawn site. If LocalAI is SIGKILLed, nothing
tells the backend to exit and it is reparented to init.
Fix: add a best-effort, backend-side safety net at the one shared choke point
every out-of-process Go backend routes through — grpc.StartServer / RunServer in
pkg/grpc. On startup it captures getppid() and polls; when the process is
reparented (getppid changes / becomes 1 — the standard POSIX signal the original
parent died) it logs and self-terminates. getppid() reparent detection is
portable (Linux + macOS), unlike Linux-only PR_SET_PDEATHSIG. Toggle via
LOCALAI_BACKEND_PARENT_WATCH (default on; off on Windows) and
LOCALAI_BACKEND_PARENT_WATCH_INTERVAL. This is strictly a backstop alongside the
existing graceful SIGTERM->grace->SIGKILL teardown, which is unchanged.
Scope/limitations: covers Go-based backends (everything using pkg/grpc). The
C++ backends (e.g. llama-cpp) and Python backends do not route through
pkg/grpc and are not covered by this mechanism — they would each need an
equivalent parent-death check (follow-up). The fully general fix is for
go-processmanager to expose SysProcAttr injection so LocalAI can set Pdeathsig
at spawn for every backend regardless of language (suggested upstream follow-up;
out of scope for this LocalAI-only PR).
Test: pkg/grpc/parentwatch_test.go builds a real test -> middle -> grandchild
process tree, lets the middle process exit to orphan the grandchild running the
real watchParentDeath, and asserts it detects the reparent and self-terminates.
Unix-only (build-tagged), runs in CI (Linux).
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(process): extend parent-death backstop to C++ and Python backends
The Go parent-death watcher (pkg/grpc/parentwatch.go, commit 772b435d5)
only protects backends that route through pkg/grpc. C++ and Python
backends don't, so the originally-reported case — the llama.cpp gRPC
worker surviving a non-graceful LocalAI death — was still uncovered.
Extend the same best-effort backstop to both languages, reusing the
exact mechanism and semantics:
- capture getppid() at startup, skip if already orphaned (<=1)
- a background thread polls getppid() and self-exits on reparenting
(getppid() != orig || == 1), portable across Linux/macOS, no-op on
Windows
- same env vars: LOCALAI_BACKEND_PARENT_WATCH (default on; falsy
false/0/no/off disable) and LOCALAI_BACKEND_PARENT_WATCH_INTERVAL
(default 2s; accepts Go-style durations like 500ms/2s/1m)
C++: implemented in backend/cpp/llama-cpp (the reported, most-used C++
backend) as a dependency-free header parent_watch.h, wired into
grpc-server.cpp's main() and copied at build time via prepare.sh. C++
backends have no shared server scaffolding, so other C++ backends
(ds4, ik-llama-cpp, privacy-filter, ...) are not yet covered and would
each need the same one-line include+call as follow-ups.
Python: implemented once in the shared common/parent_watch.py and armed
from common/grpc_auth.py's get_auth_interceptors() — the single helper
every one of the 35 Python backends invokes while building its gRPC
server — so all Python backends (and future ones) are covered with no
per-backend edits and no duplicated implementation.
Tests (real process-tree reparent detection, mirroring the Go test):
- backend/cpp/llama-cpp/parent_watch_test.cpp (via run-unit-tests.sh)
- backend/python/common/parent_watch_test.py (python -m unittest)
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Sonnet 5 <noreply@anthropic.com>
* fix(backends): make backend install ops idempotent unless forced
POST /backends/apply hardcoded force=true through
LocalBackendManager.InstallBackend, so applying an already-installed
backend re-downloaded and re-extracted the whole artifact every time.
API clients that ensure a backend exists at startup paid a full OCI
image pull on every boot.
Backend install ops now default to non-forced — an installed, runnable
backend short-circuits (the orphaned-meta reinstall path in
InstallBackendFromGallery is preserved) — and reinstall stays available:
- ManagementOp gains a Force field; the local manager passes it through
instead of hardcoding true.
- /backends/apply accepts an optional "force" boolean in the body.
- The React UI install route keeps forcing, since its button doubles as
the explicit "Reinstall backend" action.
Distributed installs already behaved this way (workers skip when the
binary exists unless force is set); this aligns single-node behavior.
Assisted-by: Claude:claude-fable-5 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(backends): don't force-reinstall LOCALAI_EXTERNAL_BACKENDS on boot
The startup loop for LOCALAI_EXTERNAL_BACKENDS runs
InstallExternalBackend for each listed backend on every boot, and its
gallery-name path hardcoded force=true — so every start re-downloaded
and re-extracted each listed backend's OCI image even when it was
installed and runnable. Supervising apps that list several backends
paid several full OCI pulls per launch.
Give InstallExternalBackend an explicit force parameter (it only
affects the gallery-name fallback; URI installs always write) and pass:
- false from the boot loop and `local-ai backends install` (idempotent
ensure — `backends upgrade` is the refresh path),
- op.Force from the local manager's external-URI op,
- the request's force on the worker install path and true on its
upgrade path (behavior unchanged).
Assisted-by: Claude:claude-fable-5 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Newest cloud reasoning models reject two parameters the cloud-proxy
backend currently sends:
- Anthropic (claude-opus-4-x) and OpenAI (gpt-5.x) return 400 when
temperature is present: "'temperature' is deprecated for this model".
OpenAI-compatible clients typically send only the server-side DEFAULT
sampling values rather than user intent, so the translators now forward
neither temperature nor top_p and let the upstream apply its own
defaults.
- OpenAI gpt-5.x rejects max_tokens ("Unsupported parameter: 'max_tokens'
... Use 'max_completion_tokens' instead"). The OpenAI translator now
serializes the token limit as max_completion_tokens, which current
chat-completions models accept.
Verified live against claude-opus-4-8, gpt-5.5 and gemini-3.1-pro
(Gemini OpenAI-compat endpoint). Tests updated to the new contract.
Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: stefanwalcz <stefan.walcz@walcz.de>
* fix(distributed): don't let a dead worker pin the model-load advisory lock
In distributed mode a chat request could fail with:
failed to route model with internal loader: routing model ...:
loading model ...: advisorylock: acquiring lock <id>:
ERROR: canceling statement due to lock timeout (SQLSTATE 55P03)
Root cause is two independent defects in the cross-replica model-load path:
1. SmartRouter.Route holds a per-model PostgreSQL advisory lock for the whole
cold-load sequence, which includes installBackendOnNode -> InstallBackend,
a NATS request-reply with a 15m deadline (DefaultBackendInstallTimeout) that
ignored ctx. When the chosen worker died mid-install, the holder sat on the
lock for up to 15m. The detached loadCtx (WithoutCancel) had no deadline, so
nothing capped the hold.
2. The acquiring statement, pg_advisory_lock(), is subject to any deployment
global lock_timeout. A common operator setting (e.g. 10s) aborts the wait
with SQLSTATE 55P03, so every other replica's request for that model hard
-errored instead of waiting for the in-progress load and reusing it. For the
~15m window the model was effectively unroutable.
Fixes:
- advisorylock.WithLockCtx (postgres): SET lock_timeout = 0 on its dedicated
connection (RESET before it returns to the pool) so the Go context, not a
deployment-wide GUC, governs how long we wait. Waiters now block and then
re-check, reusing the model another replica just loaded.
- SmartRouter: bound the detached loadCtx with a single ModelLoadCeiling so the
lock is always released in bounded time even if a sub-step wedges. Default is
the configured backend.install deadline + 10m (staging + LoadModel margin),
so a legitimately slow load is never cut.
- installBackendOnNode: use singleflight.DoChan + select on ctx.Done() so the
install wait honors cancellation; the ceiling can then actually free a caller
pinned behind a dead worker. The shared install still coalesces via
singleflight.
Reproduced both defects as failing tests first (a real 55P03 against a
testcontainer with a short lock_timeout; a wedged install that blocks Route)
and confirmed green.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(distributed): bound advisory-lock wait instead of disabling lock_timeout
Setting lock_timeout = 0 to override a deployment's short global lock_timeout
meant "wait forever" server-side. Safe for SmartRouter.Route (its loadCtx now
carries the model-load ceiling) but unsafe for the schema-migration callers
that pass context.Background(): a holder whose session never releases would
hang them indefinitely.
Derive the server-side lock_timeout from the caller's context instead: its
remaining budget plus a margin (so the Go context's cancellation still wins
with a clean error and the server bound is only a backstop), or a finite
30m backstop when the context has no deadline. Never zero - "wait forever"
is no longer possible, while a deployment's hostile short lock_timeout is
still overridden so legitimate cross-replica waits don't fail with 55P03.
Added a spec proving a deadline-less waiter gives up at the (shrunk) backstop
rather than hanging.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
fix(vllm): non-streaming tool-call regression after #10351 (native_streaming is a capability flag, not a state flag)
#10351 introduced native streaming via `parser.extract_tool_calls_streaming`
and gated the post-loop `extract_tool_calls` block on `native_streaming and
not native_streaming_error`. That works for streaming requests, but for
non-streaming requests the same flag is still True (it only means "the
parser can stream", not "we actually streamed"), so the block was skipped
and the `elif` cleared `content = ""` — the tool call was silently lost.
Symptom: non-streaming chat.completions with `tools=[...]` returns
`finish_reason: "stop"` with `content: ""` and no `tool_calls`. Streaming
requests are unaffected.
Fix: gate both branches on `streaming` too, so the extract_tool_calls
block runs for non-streaming requests (and for streaming requests that
fell back to the buffered path).
Reproduction (vLLM 0.24, Qwen3-Coder-Next-NVFP4, qwen3_coder parser):
curl -s -X POST http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"coder","stream":false,
"messages":[{"role":"user","content":"7*8 via calc"}],
"tools":[{"type":"function","function":{"name":"calc",
"parameters":{"type":"object",
"properties":{"expression":{"type":"string"}}}}}]}'
Before: finish_reason: "stop", content: "", tool_calls: []
After: finish_reason: "tool_calls", tool_calls[0].function.name: "calc"
Streaming path re-verified in the same setup: delta.tool_calls arrives
token-by-token, finish_reason: "tool_calls", no raw XML in content.
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* fix(hipblas): symlink amdgpu.ids so ROCm backends find the ASIC ID table
ROCm's bundled libdrm_amdgpu looks up the GPU ASIC ID table at a
hardcoded fallback path, /opt/amdgpu/share/libdrm/amdgpu.ids, which is
only populated by AMD's full amdgpu-install (graphics/DKMS) stack. The
hipblas image is compute-only and doesn't have it, so every model load
logs "No such file or directory" and the GPU can't be identified.
Symlink it to the equivalent file already shipped by Ubuntu's
libdrm-amdgpu1 package.
Fixes#10624
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(hipblas): correct amdgpu.ids source package name in comment
Verified against the real rocm/dev-ubuntu-24.04:7.2.1 image with
hipblas-dev/hipblaslt-dev/rocblas-dev installed: /usr/share/libdrm/amdgpu.ids
is owned by libdrm-common, not libdrm-amdgpu1 as the comment said.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The launcher starts the server with run --models-path/--backends-path but
leaves --data-path and the dynamic config dir unset, so the server falls
back to its /data and /configuration defaults.
is kong.ExpandPath("."), i.e. the launcher process CWD
(commonly the user's home root), producing ~/data and ~/configuration
outside ~/.localai and an agent-pool stateDir under ~/data.
Pass --data-path and --localai-config-dir explicitly, rooted at the
launcher's own data directory (GetDataPath() -> ~/.localai), so data and
config stay consistent with --models-path/--backends-path.
* fix(watchdog): don't log optional Free() as an error when backend returns Unimplemented (#10602)
When the watchdog evicts a model, deleteProcess calls the backend's gRPC
Free() to release VRAM before stopping the process. Free is optional:
backends that don't override it -- the generated UnimplementedBackendServer
stub, many Python/external backends, or a federation proxy in distributed
mode -- return gRPC Unimplemented. That is expected, not a failure: VRAM is
reclaimed when the local process is stopped, or by the remote unloader for
remote backends. Logging it as "WARN Error freeing GPU resources" made a
benign, optional RPC look like a fault (the alarming line in #10602, seen
in distributed mode where the model is remote and Free hits a stub).
Treat gRPC Unimplemented from Free() as a no-op logged at Debug; genuine
failures still Warn. Free() is still attempted for every backend, so any
backend that does implement it is unaffected.
Add a reusable grpcerrors.IsUnimplemented helper following the package's
existing code-based detection idiom (prefer the typed status code, fall
back to the message across non-gRPC boundaries), with table tests.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
* fix(watchdog): log a non-Unimplemented Free() failure at error level
Per review: now that the expected gRPC Unimplemented case is split out and
logged at Debug, any remaining Free() error is a genuine failure to release
VRAM, so surface it at error level instead of warn.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
---------
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
The Q4_K_M quant degraded tool-call reliability for LFM2.5-8B-A1B.
Switch the gallery entry to the Q8_0 GGUF (sha256 verified via HF
x-linked-etag) while keeping the native jinja tool-parsing config.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
The backend.proto AudioTranscriptionLive bidirectional streaming RPC added
new required trait items (AudioTranscriptionLiveStream + audio_transcription_live)
on the generated Backend trait. The kokoros (TTS) backend did not implement
them, breaking its release build with E0046 (missing trait items).
kokoros is text-to-speech and has no live-ASR support, so stub the method to
return UNIMPLEMENTED, mirroring the existing audio_transcription_stream stub.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
LocalAI enables a cross-request prompt-prefix cache (cache_reuse, see
core/config/serving_defaults.go) so repeated prefixes — system prompts,
RAG context, agent scaffolds, multi-turn chat — are not reprocessed every
turn. For sliding-window-attention (SWA) models (Gemma 2/3, Cohere2,
Llama 4, ...) this silently does nothing: llama.cpp defaults to a reduced
SWA KV cache sized to the sliding window, and that reduced cache cannot
preserve a prompt prefix across requests, so every turn reprocesses the
whole prompt anyway.
llama.cpp's --swa-full (params.swa_full, already wired through the
LocalAI llama.cpp backend's `swa_full` option) keeps the full KV cache so
the shared prefix is reused. Enable it automatically, but only for models
that are actually SWA: detection reads the gguf-parser-normalized
`<arch>.attention.sliding_window` metadata (which also applies llama.cpp's
family rules, e.g. Phi-3 → not SWA), right where the GGUF is already
parsed for defaults. It is never applied to dense models (pure memory
waste) and never overrides an explicit user `swa_full`/`n_swa` choice.
Tradeoff: the full SWA cache scales with context_size, so it costs more
memory at large contexts — hence the SWA gating and the documented
`swa_full:false` opt-out.
Assisted-by: Claude:claude-opus-4-8 [Claude Code] golangci-lint
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(watchdog): persist a UI-saved Check Interval across restarts (#10601)
The watchdog Check Interval saved via /api/settings reverted to 500ms on
every restart, while the idle/busy timeouts persisted correctly.
Root cause: NewApplicationConfig baseline-defaulted WatchDogInterval to
500ms, whereas the idle/busy timeouts default to 0. The startup loader
(loadRuntimeSettingsFromFile) applies a persisted runtime_settings.json
value only when the field is still at its zero default - its heuristic
for "this wasn't set by an env var". Because the interval was always
500ms at that point, the loader never read the persisted value back, so
the saved interval was silently discarded on each boot.
Fix: drop the non-zero baseline default so the interval behaves like the
sibling timeouts (0 = unset). The effective 500ms default is now supplied
at the watchdog layer: WithWatchdogInterval ignores a non-positive value
so DefaultWatchDogOptions' 500ms is preserved (and a 0 interval can never
turn the watchdog loop into a busy spin). Also mirror the interval in the
live config file watcher alongside idle/busy, and report the real 500ms
default (not the stale "2s") from ToRuntimeSettings.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Stapling only the dmg leaves the LocalAI.app bundle with no embedded
notarization ticket. Gatekeeper then falls back to an online notarization
check on first launch, so the app fails to open on a Mac that is offline or
behind a firewall, or once it has been copied out of the dmg — while it keeps
working on the (online) build host, which masks the problem.
Notarize and staple the .app before packaging it into the dmg so the bundle
verifies offline. Adds a `notarize-app` subcommand to
contrib/macos/sign-and-notarize.sh (zips the bundle for notarytool, then
staples + validates) and invokes it from dmg-launcher-darwin. Stays a no-op
when notary secrets are unset, so unsigned local/fork builds are unaffected.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: mudler <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
PR #10589 changed repo-root HuggingFace URI imports to name the model after
the selected GGUF file rather than the repository. The Open Responses API
integration test still requested the old repo-derived name
("Qwen3-VL-2B-Instruct-GGUF"), so every request 404'd on an unknown model and
the suite has failed on master since 1a4f68ed4.
Update testModel to the name the importer now registers for the default
q4_k_m quant ("Qwen3-VL-2B-Instruct-Q4_K_M") so the specs resolve the model
again. The #10589 behaviour change is intentional; only the stale test needed
updating.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Importing a model from a local directory (e.g. a HuggingFace checkout or an
LM Studio store) via a file:// URI produced a config whose model field kept
the scheme verbatim, e.g. model: file:///Users/u/.../Qwen3-4bit. The mlx and
vllm backends treat that field as a HuggingFace repo id or local path and
reject the file:// form with "Repo id must be in the form 'repo_name' or
'namespace/repo_name'", so the model imported fine but failed to load (issue
#7461).
Add a shared LocalModelPath helper that reduces a file:// URI to the bare
filesystem path it points at and leaves HuggingFace/HTTP URIs untouched, and
route the mlx, vllm, transformers and diffusers importers (all of which pass
details.URI straight into the model field for from_pretrained-style loading)
through it.
Cover the helper directly plus end-to-end file:// import specs for the mlx and
vllm importers.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
fix(functions): avoid quadratic-time debug logging in CleanupLLMResult/ParseFunctionCall
The streaming chat path (core/http/endpoints/openai/chat_stream_workers.go)
calls CleanupLLMResult / ParseFunctionCall once per delta chunk with the
*full accumulated* LLM result so far. Both functions xlog.Debug the entire
argument on entry and exit, so a single N-chunk stream emits roughly
chunk_size * N^2 bytes of debug output.
Under LOG_LEVEL=debug this was observed in a recent SGLang-via-LocalAI
session on a DGX Spark host (about 50K tokens, long streaming generation)
to drive container logs to ~96 GiB, which interacted with the streaming
hot loop on the same filesystem and contributed to a host-wide hard hang
once disk pressure built up. Workaround was setting LOG_LEVEL=info, but
the quadratic shape remains a foot-gun for anyone intentionally enabling
debug.
Replace the four result-content debug arguments with len(...) plus a
fixed-size head (200 bytes via a new truncForLog helper), bounding per-
call output to a constant. The debug signal stays useful: the first 200
chars are enough to identify which generation is in flight, and the
length lets you observe growth without paying for the payload itself.
No API change. No behaviour change for LOG_LEVEL != debug.
Signed-off-by: Poseidon <philipp.wacker@ibf-solutions.com>
Co-authored-by: Poseidon <philipp.wacker@ibf-solutions.com>
When importing a HuggingFace GGUF model from a repository-root URI (no file
component, e.g. hf://owner/repo) with the Model Name field left blank, the
importer named the model after the repository (filepath.Base(details.URI))
instead of the GGUF file it actually selected from the repo listing (issue
#10587).
Track whether the user supplied an explicit name; the URI base is now only a
fallback. In the HuggingFace branch, once the model group is picked, re-derive
the name from the selected GGUF via a new modelNameFromShardGroup helper that
uses ShardGroup.Base minus the .gguf extension. For sharded models this yields
a clean logical name (e.g. Qwen3-30B-A3B-Q4_K_M) rather than a shard filename
like ...-00001-of-00002. An explicit name preference still always wins, and the
.gguf/URL/OCI paths are unchanged.
Add network-free unit specs covering name-from-GGUF, clean-name-from-shard-base,
and explicit-name precedence, and update the live integration specs that had
encoded the previous repo-name behaviour.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
fix(openai): stop max_tokens streaming retry loop on reasoning models
When a thinking model spends its entire max_tokens budget on the reasoning
block, the C++ autoparser clears the raw Response and delivers reasoning-only
ChatDeltas (no content, no tool calls). ComputeChoices' empty-response retry
then fires and regenerates from scratch up to maxRetries times, each
re-consuming the whole budget, instead of terminating with finish_reason
"length" (issue #9716).
Add a reachedTokenBudget helper and suppress both the built-in and
caller-driven retries when the completion count has reached the configured
max_tokens ceiling. Report finish_reason "length" instead of "stop" in the
streaming and non-streaming chat paths when the budget was exhausted.
Adds a deterministic regression test that counts backend invocations
(previously 6, now 1) plus boundary tests for the helper.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Dennisadira <dennisadira@gmail.com>
* feat(realtime): EOU-driven semantic_vad turn detection
Add a `semantic_vad` turn-detection mode to the realtime API that feeds
the transcription model live and decides "the user finished speaking"
from the `<EOU>` end-of-utterance token rather than from silence alone.
When EOU fires the turn commits immediately (~0.3s); otherwise it falls
back to an eagerness-scaled silence threshold (low/med/high = 8/4/2s).
Plumbing, bottom to top:
- proto: `AudioTranscriptionLive` bidirectional RPC (config-first oneof,
mono float PCM @16k, ready-ack / Unimplemented degrade signal) plus
`TranscriptResult.eou` for the unary retranscribe gate.
- pkg/grpc: client/server/base/embed scaffolding for the bidi stream,
modeled on AudioTransformStream; release stream conns on terminal Recv.
- parakeet-cpp: live transcription RPC with per-C-call engine locking
(one live stream per turn, finalize+free at commit); bump parakeet.cpp
to ABI v5 — incremental StreamingMel (no more quadratic per-feed mel
recompute that delayed EOU on long turns) and the <EOU>/<EOB> split;
strip the literal <EOU>/<EOB> from offline text and set Eou.
- core/backend: LiveTranscriptionSession wrapper + pipeline
`turn_detection:` config block (type/eagerness/retranscribe).
- realtime: semantic_vad integration — live input captions streamed as
transcription deltas while the user speaks, EOU-immediate commit with
eagerness fallback, optional retranscribe gate (batch re-decode must
also end in <EOU> to confirm), clause synthesis off the LLM token
callback, and per-turn live-transcription / model_load telemetry.
- UI: show the realtime pipeline components as a vertical list.
Docs and tests included; opt-in via the pipeline YAML or per-session
`session.update`. Non-streaming STT backends degrade to silence-only.
Assisted-by: Claude Code:claude-opus-4-8 [Read] [Edit] [Write] [Bash]
Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(realtime): explicit formally-verified state machines + parakeet streaming driver
The realtime API had several implicit state machines whose state was inferred
from scattered booleans, channels, and five separate mutexes, leaving
illegal/inconsistent states reachable. Make them explicit and keep the
implementation in step with a formal design; rework the parakeet streaming
backend along the same lines.
Realtime state machines (M1-M5). Each is a sealed sum-type State/Event/Effect
with a total, pure Next(state,event)->(state,[]effect) behind a single-writer
Coordinator:
M1 conncoord connection lifecycle: VAD toggle + once-only teardown
(replaces vadServerStarted + a `done` channel closed from
two sites).
M2 turncoord turn detection: collapses speechStarted and the live-stream
"turn open" flag into one state, so discardTurn can no longer
desync them and suppress the next onset.
M3 respcoord response coordination: serializes the dual-writer
start/cancel so at most one response is live; one
response.done per response.create.
M4 compactcoord conversation compaction: single-flight (replaces the
`compacting atomic.Bool` CAS).
M5 ttscoord TTS pipeline: open->closing->closed, idempotent wait(),
rejects enqueue-after-close (was a silent drop).
The Coordinator/Sink/Next plumbing — only the sealed types and Next differed
per machine — is extracted once into core/http/endpoints/openai/coordinator as
a generic Coordinator[S,E,F]; each machine keeps its public API via type
aliases, so no sink, call-site, or test moved.
Hierarchy. session_lifecycle.fizz models M1 as the parent region with its
children (M2/M3/M4) as one statechart and asserts ChildrenDieWithParent (conn
torn => all children terminal, none start after teardown). respcoord and
compactcoord gain an absorbing Terminated state + Shutdown event; conncoord's
teardown drives the children terminal. This closes a compaction teardown gap: a
fire-and-forget compaction could outlive a torn session — compactionSink now
takes a session-scoped cancellable context + WaitGroup and joins the in-flight
summarize+evict on shutdown.
Formal verification. formal-verification/ holds one authoritative FizzBee spec
per machine plus the composition spec, each with an always-assertion and a
documented one-line edit that makes the checker fail (verified non-vacuous).
scripts/realtime-conformance.sh is fail-closed: all Go conformance suites under
-race AND a model-check of every .fizz spec; a missing FizzBee is a hard error
(only the loud REALTIME_CONFORMANCE_SKIP_FIZZBEE=1 bypasses it, never in CI).
FizzBee is pinned by sha256 and installed via scripts/install-fizzbee.sh into
.tools/ (gitignored). Wired as make test-realtime-conformance, a CI workflow,
and a pre-commit path filter. Go conformance tests are Ginkgo/Gomega (per the
repo's forbidigo lint): transition tables + fixed-seed property walks +
concurrent/-race specs, no rapid dependency. Design map:
docs/design/realtime-state-machines.md.
Parakeet streaming backend. The same treatment applied to the parakeet-cpp
streaming paths:
- AudioTranscriptionStream returns codes.Unimplemented for non-streaming models
instead of decoding offline and emitting it as one delta + final. A client
that asked for streaming learns the model cannot stream rather than receiving
a batch result shaped like a stream. New grpcerrors.StreamTranscriptionUnsupported
carries that signal; the HTTP /v1/audio/transcriptions stream path surfaces it
as an SSE error event. Mirrors AudioTranscriptionLive, which already did this.
- utteranceBoundary (boundary.go): a single definition of the end-of-utterance
latch, replacing three open-coded finalEou toggles. Modelled as a two-valued
type so illegal states are unrepresentable.
- Shared decode driver (driver.go): streamFeedResult (one per-feed event) +
feedChunk (hides the ABI v4 JSON vs text-only split) + feedSlices + flushTail.
The feed loop is written once.
- AudioTranscriptionLive becomes a bidi adapter: it streams the per-feed
{delta,eou,eob,words} the realtime turn detector consumes and a terminal
FinalResult carrying only Text. Segments/duration/eou are offline-only and no
longer produced (nor read) on the live path; liveTraceState drops the terminal
eou and keeps the per-feed eou_events count.
- AudioTranscriptionStream + streamJSON merge into one driver-based function;
streamSegmenter is generalized to the unified event with a text-only fallback
that preserves the legacy (no-words) library's per-utterance segmentation.
Verified: build/vet/gofumpt clean, golangci-lint 0 issues, all coordinator and
parakeet packages under -race, the fail-closed conformance gate green, and
make test-realtime (12 e2e WS+WebRTC).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
OpenAI wire format carries `function.arguments` as a JSON-encoded string,
but chat templates (e.g. Qwen3-Coder) iterate over it as a mapping. The
vllm backend already parses arguments before applying the chat template
(PR #10256); this mirrors that fix in the sglang backend.
Without this fix the second turn of any tool-using session (assistant
returns tool_calls, user posts `role:"tool"` result, model is invoked
with arguments still as a string) crashes inside transformers' Jinja
chat-template rendering with:
TypeError: Can only get item pairs from a mapping.
File ".../transformers/utils/chat_template_utils.py", in render_jinja_template
File ".../jinja2/filters.py", in do_items
raise TypeError("Can only get item pairs from a mapping.")
Reproduced on `lmsysorg/sglang:v0.5.14` via LocalAI v4.5.4 with
`saricles/Qwen3-Coder-Next-NVFP4-GB10` (W4A4 NVFP4 / compressed-tensors)
on NVIDIA DGX Spark (GB10, sm_121).
After the patch, a tool-call roundtrip (assistant tool_calls -> tool
result -> assistant final answer) returns http=200 with the expected
follow-up content; no behaviour change on requests that don't carry
tool_calls.
Signed-off-by: Poseidon <philipp.wacker@ibf-solutions.com>
Co-authored-by: Poseidon <philipp.wacker@ibf-solutions.com>
chore(recon): re-pin voice/face-detect to squashed release commits
The voice-detect.cpp and face-detect.cpp engine repos were squashed to a single
release commit, which orphaned the previous pins (voice 3d51077, face 06914b0).
Re-pin to the new single-commit SHAs (voice 1db1759, face e22260d).
These also fold in a real correctness fix: the persistent graph-cache fingerprint
now includes op_params, so two structurally identical GGML_OP_CUSTOM graphs (a
blocked 3x3 vs a blocked 1x1 strided conv) can no longer false-hit the cache and
replay the wrong kernel. voice CI was failing test_blocked/conv1x1_s2 with an
out-of-bounds write on the GGML_NATIVE=OFF build; both engine repos are now green
and WeSpeaker embed parity is 1.0 vs golden.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>