pkg/utils/path.go provides the security primitives for download paths
(VerifyPath, InTrustedRoot) and the file-naming helpers used by every
import flow (SanitizeFileName, GenerateUniqueFileName). None of them had
test coverage, so a future regression in the traversal check or in the
".." stripping inside SanitizeFileName would land unnoticed.
The new specs pin the lexical contract for each helper:
- VerifyPath accepts strict descendants and inner traversal that stays
inside the base, rejects "..", compound traversal, and the base path
itself. An explicit spec documents that the check is purely lexical
(filepath.Clean, not EvalSymlinks) so any future caller that needs
symlink-aware defence knows to EvalSymlinks first.
- InTrustedRoot rejects the trusted root and sibling directories,
accepts deeply nested descendants.
- SanitizeFileName covers the leading-directory and absolute-prefix
paths plus the embedded ".." case ("foo..bar" -> "foobar") that the
Clean+Base layer alone would leave intact.
- GenerateUniqueFileName covers the no-collision, single-collision,
walk-the-counter, and empty-extension cases using GinkgoT().TempDir()
so the suite stays hermetic.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: TLoE419 <tloemizuchizu@gmail.com>
Exercise the filtered empty-state path in the models gallery and verify
that the clear-filters action restores the list and resets the filter
selection.
Assisted-by: Codex:gpt-5
Signed-off-by: Ching Kao <0980124jim@gmail.com>
fix(functions): validate auto-detected XML tool-call names (#9722)
The XML tool-call auto-detector tries every preset, including glm-4.5 whose
tool block is <tool_call>name...</tool_call>. When a Hermes/NousResearch model
emits <tool_call>{"name":"bash","arguments":{...}}</tool_call>, glm-4.5
mis-claims the block and returns the entire JSON object (or leading prose, or a
JSON array) as the function NAME. The misparse then wins over the JSON parser,
so streaming clients receive a tool call whose name is a JSON blob.
Guard the auto-detect paths in ParseXMLIterative: a returned tool name must look
like a real function name ([A-Za-z0-9_.-]+). Results that don't are dropped so
auto-detection falls through to the next format and ultimately to JSON parsing,
which handles Hermes correctly. An explicitly forced format (format != nil) is
left untouched and trusted verbatim.
This supersedes PR #9940, which dropped only names with a leading "{". That
narrower check misses leading prose ("Sure: {...}"), JSON arrays ("[{...}]")
and brace-less garbage ("name: bash, ..."); the name-shape check rejects all of
them while still accepting legitimate glm-4.5 calls. The fix applies to both the
streaming worker and the non-streaming ParseFunctionCall path, which both call
ParseXMLIterative with auto-detection.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
application.New wires a fire-and-forget goroutine that runs
StopAllGRPC + distributed.Shutdown when the app context is cancelled.
Callers (tests, CLI signal handler) cancel the context and then exit
immediately, so the test binary / process can terminate before that
goroutine kills the spawned backend children. go-processmanager sets no
Pdeathsig, so the orphans are reparented to init and survive — leaving
dozens of stray mock-backend processes after an e2e run.
Add Application.Shutdown(), which runs the same cleanup synchronously on
the caller's stack and is idempotent via sync.Once. The context-cancel
goroutine, the CLI signal handler, and the test suites all call it, so
cleanup is deterministic and the duplicated teardown logic collapses to
one place. The async goroutine remains as a safety net for callers that
forget; sync.Once dedupes the double call.
Wire e2e_suite_test and the two mock-backend Contexts in app_test to
call Shutdown in their AfterSuite/AfterEach.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Streaming /v1/chat/completions could emit the same logical tool call at
multiple `index` values. In processStreamWithTools the Go-side iterative
parser (ParseXMLIterative / ParseJSONIterative) runs on every token and
emits tool-call deltas, while the C++ chat-template autoparser delivers
its own tool calls via ChatDeltas that are flushed at end-of-stream by
ToolCallsFromChatDeltas -> buildDeferredToolCallChunks. With both paths
active the same call is emitted twice at different indices, so OpenAI
clients that accumulate tool calls by `index` dispatch the tool N times.
Skip the Go-side iterative parser once the autoparser is producing tool
calls (hasChatDeltaToolCalls). The deferred flush stays guarded by
lastEmittedCount, so the race where the Go parser emitted before the flag
flipped also remains single-emission. Backends without an autoparser
(e.g. vLLM) keep hasChatDeltaToolCalls=false and are unaffected.
Refs #9722
Signed-off-by: bozhouDev <259759010+bozhouDev@users.noreply.github.com>
Co-authored-by: bozhouDev <259759010+bozhouDev@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(grammars): honor properties_order entry at index 0
The JSON-schema-to-GBNF property sort used `aOrder != 0 && bOrder != 0` as
its "is this key ordered?" guard. That treats index 0 — the first key listed
in properties_order — as unset, so `properties_order: name,arguments` fell
back to alphabetical ordering and still emitted "arguments" before "name".
Use presence in the order map instead: listed keys sort by their index and
ahead of unlisted keys, which keep a stable alphabetical order. This makes
the documented `properties_order: name,arguments` actually produce
name-first tool-call JSON. Relates to #10052.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(functions): defer tool grammar to the backend when the tokenizer template owns templating (#10052)
When use_tokenizer_template delegates templating to the backend (llama.cpp),
the backend also owns tool-call grammar generation and parsing. LocalAI was
still generating its own GBNF grammar and sending it down. With a grammar
present, llama.cpp does not hand the tools to its template, so its native
peg/json tool parser never engages: it streams the grammar-constrained
tool-call JSON back as plain content instead of emitting tool_calls. In
streaming mode the JSON object leaked into the content field, and the
Go-side incremental detector never gated content because the
LocalAI-generated grammar emitted "arguments" before "name".
The GGUF auto-import path already couples use_tokenizer_template with
grammar.disable, but that block is skipped when a template is already
configured, so gallery and hand-written configs (e.g. qwen3) that set the
tokenizer template directly never got the paired grammar.disable.
- SetDefaults now enforces the coupling for every config: when
use_tokenizer_template is set, grammar generation is disabled and tools
flow to the backend's native (name-first) pipeline. This also fixes
already-installed models without editing each config.
- Set function.grammar.disable in the shared gallery/qwen3.yaml, which is
the base config referenced by every qwen3 gallery entry.
Verified end to end against qwen3-4b with stream:true + tools: content no
longer carries the tool-call JSON, reasoning is classified separately, and
tool calls stream as proper name-first tool_calls deltas.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(turboquant): guard upstream-only grpc-server fields for fork build
backend/cpp/llama-cpp/grpc-server.cpp is reused by the turboquant build,
which compiles against an older llama.cpp fork (TheTom/llama-cpp-turboquant).
Two recent changes added references to upstream-only struct fields outside the
existing LOCALAI_LEGACY_LLAMA_CPP_SPEC guards:
- common_params::checkpoint_min_step (default + option handler), added with
the ggml-org/llama.cpp 35c9b1f3 bump (#9998)
- the common_params_speculative::draft tensor_buft_overrides sentinel
termination (#9919), which sat after the guard's #endif
The fork has neither field, so grpc-server.cpp failed to compile for every
turboquant flavor. Wrap the three references in #ifndef
LOCALAI_LEGACY_LLAMA_CPP_SPEC, matching the existing fork-compat guards, so the
stock llama-cpp build is unchanged and the fork build skips them. Update
patch-grpc-server.sh's doc comment to record what the macro now gates out.
Verified by a local fallback-flavor turboquant build: grpc-server.cpp compiles
against the fork and the backend image builds.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* Curate the highlight.js build to ~29 languages (lib/core + the
common set) instead of the full ~190-grammar default: -787 KB raw /
-230 KB gz on the base bundle.
* Code-split every route via React.lazy with a per-layout <Suspense>
in App.jsx so the sidebar stays mounted on navigation. Initial entry
chunk drops from 3194 KB raw / 887 KB gz to 397 KB / 122 KB (-87%).
Warm chunks on sidebar hover/focus/touch via a preload registry so
the click finds the chunk already in flight or cached.
* Migrate Playwright coverage from istanbul (build-time counters) to
native Chromium V8 coverage, with per-worker accumulation +
conversion. Suite drops from 71s to 30s at 20 workers (~58%) at the
non-instrumented floor.
* Keep the coverage gate bundling-invariant: the coverage build inlines
dynamic imports so every shipped source file lands in the denominator
(otherwise untested page chunks silently drop out and inflate the
percentage). Production builds stay code-split.
* Add UI_TEST_WORKERS=N Makefile knob; tighten coverage tolerance to
0.8pp now that jitter sits near istanbul's ~0.5pp again.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(openresponses): populate Content and accept bare {role,content} items (#10039)
Fixesmudler/LocalAI#10039 — `/v1/responses` silently returned empty
output on any model whose YAML doesn't include a Go-side
`template.chat_message` block.
Three cooperating bugs:
* `convertORInputToMessages` populated only `StringContent` for string
input and for the `input.Instructions` system message, leaving the
`Content` (any) field nil.
* `TemplateMessages` gated all fallback content-rendering branches on
`Content != nil && StringContent != ""` — but every branch in that
function consumes `StringContent`, not `Content`. The `&&` silently
dropped messages that had StringContent set and Content nil, producing
an empty prompt that the 5× empty-retry guard then turned into a
200 OK with `output: []`.
* The array-input branch of `convertORInputToMessages` dispatched on
`itemMap["type"]` with no default, dropping bare `{role, content}`
items emitted by the OpenAI Python SDK helper
`client.responses.create(input=[{...}])`.
Fix:
* Set both `Content` and `StringContent` in the two openresponses
message-construction sites that only set one.
* Treat a bare `{role, content}` item (no `type`) as
`type: "message"` for OpenAI-SDK compatibility.
* Gate `TemplateMessages` fallback rendering on `StringContent != ""`,
which is what every downstream branch in that function actually
reads.
Regression test added to `evaluator_test.go` covering the fallback
path (no `ChatMessage` template) with a StringContent-only message,
both with and without a role mapping.
* test(openresponses): guard Content population and ToProto path (#10039)
Add regression tests for the two seams the original fix touched but
left uncovered:
* convertORInputToMessages must populate both Content and StringContent
for plain string input and for bare {role, content} array items (the
OpenAI SDK shape that omits the type discriminator). Both are
functional reds against the pre-fix code.
* Messages.ToProto reads Content, not StringContent — this is the path
UseTokenizerTemplate backends (imported GGUFs) take. The cases pin
that contract so a future regression on the producer side is caught.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(react-ui): force .check() on hidden Toggle input in fits-filter e2e
The polish PR (#10030) swapped the raw <input type=checkbox> for the
shared <Toggle> component, which visually hides its native input via
opacity:0;width:0;height:0. Playwright's .check() waits for visibility
before clicking and times out after 30 s, breaking two UI E2E tests:
- enabling fits filter hides models that exceed available VRAM
- fits filter state persists after reload
Pass { force: true } to skip the visibility check; the input is still
the real focusable checkbox and toggles state on click. The companion
.toBeChecked() assertion only reads state and works unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7
* fix(react-ui): click visible Toggle track in fits-filter e2e
force:true skips the actionability checks but not the viewport check,
and the Toggle's hidden input has width:0;height:0 so Playwright still
reports "Element is outside of the viewport". Click the visible
.toggle__track inside the filter-bar-group__toggle wrapper instead —
that's what a real user clicks, and label-input association toggles
the wrapped checkbox naturally.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(react-ui): polish 'Fits in my GPU' filter to use design-system Toggle
The recently added VRAM-fit filter in the Models page used a raw
<input type="checkbox"> next to the themed range slider, breaking the
visual language of the rest of the row. Swap it for the shared
<Toggle> component (already used by Backends, Settings, Traces,
AgentCreate), adopt the filter-bar-group__toggle class to drop the
duplicated inline styles, add a fa-microchip icon to mirror the
per-row fit indicator, and add a subtle left divider so the filter
reads as separate from the context-size slider on its left.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(react-ui): move 'Fits in GPU' filter to filter row and unify copy
Two follow-ups on the previous polish pass:
1. Move the toggle from the context-slider row into the filter-button
row above. The toggle is a filter on the result set, not a config
for VRAM estimation, so it belongs with the type chips and backend
select. The context slider stays its own thing.
2. Unify the label copy. The same locale file had "Fits in my GPU"
for the filter and "Fits in GPU" for the per-row indicator; pick
the shorter, possessive-free variant everywhere (en/de/es/it/zh-CN).
Update e2e selectors to match.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Adds a Go native gRPC backend that dlopens librfdetrcpp.so (built from
mudler/rf-detr.cpp at the pinned RFDETR_VERSION) via purego and exposes
the rfdetr.cpp inference pipeline through LocalAI's existing Detect RPC.
Supports all 5 RF-DETR detection variants (Nano/Small/Base/Medium/Large)
and 6 segmentation variants (SegNano/SegSmall/SegMedium/SegLarge/
SegXLarge/Seg2XLarge) with F32/F16/Q8_0/Q4_K quantizations. Pre-built
GGUFs ship at mudler/rfdetr-cpp-* on HuggingFace.
Detection returns Bbox + class_name + confidence; segmentation also
returns PNG-encoded per-detection masks via the rfdetr_capi accessor
functions (rfdetr_capi_get_detection_{class_id,box,score,class_name,
mask_png}).
End-to-end verified through POST /v1/detection: HTTP -> gRPC -> purego
dlopen -> rfdetr.cpp -> ggml -> response (9 detections on the detection
model, 21 detections + valid PNG masks on the seg-nano model against
the kitchen fixture).
Wiring:
- backend/go/rfdetr-cpp/{main.go,gorfdetrcpp.go,CMakeLists.txt,
Makefile,run.sh,package.sh,test.sh,.gitignore}
- Top-level Makefile: BACKEND_RFDETR_CPP, docker-build target,
.NOTPARALLEL, prepare-test-extra, test-extra
- backend/go/rfdetr-cpp/Makefile: `test` target invoked by test-extra
- .github/backend-matrix.yml: CPU + CUDA-12/13 + L4T CUDA-12/13
(arm64) + HIP + Vulkan (amd64 + arm64) + SYCL f32/f16
- backend/index.yaml: rfdetr-cpp meta anchor + latest/development
image entries for every matrix tag-suffix
- .github/workflows/bump_deps.yaml: RFDETR_VERSION pin tracking
(mudler/rf-detr.cpp branch main)
- gallery/index.yaml: 11 rfdetr-cpp-* entries (nano + 4 detection
variants + 6 seg variants), all backed by mudler/rfdetr-cpp-*
on HuggingFace with sha256 pinning on the F16 default
- core/gallery/importers/rfdetr.go: GGUF auto-routing for HF imports
(mudler/rfdetr-cpp-* repos route to rfdetr-cpp, Transformer-format
repos stay on the Python rfdetr backend; explicit preferences.backend
overrides both heuristics)
- core/gallery/importers/rfdetr_test.go: table-driven coverage of the
auto-routing + a live mudler/rfdetr-cpp-nano cross-check
scripts/changed-backends.js needs no change: the existing
Dockerfile.golang -> backend/go/${item.backend}/ branch already routes
the 9 rfdetr-cpp matrix entries to the correct backend path.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
useOperations() spun up its own setInterval per hook instance, so on
pages like /app/models the OperationsBar in App.jsx plus the page's
own useOperations() call each polled /api/operations at 1 Hz - 2 RPS
sustained for the whole session, repeated on Backends and Chat.
Lift the poller into an OperationsProvider mounted under AuthProvider
so all consumers (OperationsBar, Models, Backends, Chat) share one
timer. The hook file re-exports from the context to keep call sites
unchanged.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(nemo): extract Hypothesis.text for TDT/RNNT ASR models
CTC models (e.g. Whisper) return List[str] from transcribe(), but
TDT/RNNT models (e.g. parakeet-tdt-0.6b-v3) return List[Hypothesis]
where the decoded text lives in the Hypothesis.text attribute.
Previously, results[0] was assigned directly to the protobuf string
field, causing silent empty output for non-CTC models.
Now checks the return type and extracts .text from Hypothesis objects,
with a safe fallback via getattr().
* refactor: simplify Hypothesis text extraction per Copilot review
Use single getattr() call instead of hasattr() + double access,
and return empty string for unknown types instead of str(result)
to avoid leaking internal repr to clients.
* fix(qwen-asr): enable timestamp output when forced_aligner is configured
Two bugs prevented timestamps from working in the qwen-asr backend:
1. transcribe() was called without return_time_stamps=True, so the
forced aligner was loaded but never invoked. Now we pass
return_time_stamps=True when a forced_aligner is present.
2. The timestamp parsing code expected (list, tuple) items, but the
qwen_asr library returns ForcedAlignItem dataclass instances with
.text, .start_time, .end_time attributes. Added hasattr() check
to handle this correctly, falling back to tuple parsing for
backward compatibility.
* refactor: address Copilot review for qwen-asr timestamps
- Wrap return_time_stamps kwarg in try/except TypeError for safety
- Add defensive float() normalization for timestamp times
- Use str() for text extraction to ensure string type
* fix(qwen-asr): convert seconds to nanoseconds for Go time.Duration
The Go server reads TranscriptSegment.start/end via time.Duration,
which is in nanoseconds. Previously the backend sent milliseconds
(* 1000), causing timestamps to be 1000x too small (e.g. 8e-8
instead of 0.08). Convert seconds → nanoseconds (* 1e9) instead.
Also applies to the legacy tuple path for consistency.
* feat(qwen-asr): respect timestamp_granularities (segment vs word)
Read request.timestamp_granularities from the gRPC request.
- 'word': return one segment per aligned item (character / word)
- 'segment' (default): merge consecutive items at sentence boundaries
Sentence boundaries detected via CJK punctuation (。!?;…)
and Latin endings (. ! ? ;). This matches the OpenAI Whisper API
contract where omitting the parameter defaults to segment-level.
* fix(qwen-asr): escape smart quotes in punctuation set
Unicode curly quotes (U+2018/2019) were being interpreted as Python
string delimiters, causing SyntaxError. Use explicit unicode escapes.
* fix(qwen-asr): use time-gap threshold for segment boundaries
The forced aligner strips punctuation from its output, so text-based
sentence detection doesn't work. Instead, detect segment boundaries
by measuring time gaps between consecutive aligned items.
Threshold = max(median_gap * 4, 0.3s). This cleanly separates
intra-sentence gaps (< 0.24s) from inter-sentence gaps (> 0.3s)
across Chinese, English, and other languages.
* fix(qwen-asr): smart join with spaces for non-CJK tokens
The forced aligner strips whitespace from tokenized text, so English
words like ['hello', 'world'] were joined as 'helloworld'. Add
_smart_join() that inserts spaces between non-CJK tokens while
keeping CJK characters and punctuation unspaced. Works for Chinese,
English, Korean, Japanese, and mixed-language text.
---------
Co-authored-by: fqscfqj <fqsfqj@outlook.com>
- Strict monotonic Go coverage gate (make test-coverage-check, 45% baseline)
run in CI; fixes ginkgo dropping all-but-one coverprofile across multiple
recursive roots, builds with -tags auth, and folds in the in-process
tests/e2e suite via --coverpkg.
- React UI e2e coverage (make test-ui-coverage: vite-plugin-istanbul + nyc,
nix-provided Chromium) plus e2e specs for 6 previously-untested pages, and a
UI coverage gate (make test-ui-coverage-check) with a small tolerance since
e2e line coverage jitters ~0.5pp run-to-run.
- pre-commit hook: lint + coverage on Go changes, Playwright e2e + UI coverage
gate on react-ui changes; install with make install-hooks.
- New Go handler tests (settings, branding), hermetic base64 download test.
- fix(ui): model editor reads vram_display (snake_case), so the VRAM estimate
renders again; covered by a regression test.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The build context shipped to the daemon included several large
untracked directories the image never needs: saved image tarballs
(backend-images), locally-installed backends (local-backends), the
host-built binary (local-ai), the rust target/ build output, and
host node_modules/protoc/tests. This bloated the context to ~23GB.
Exclude them so only the sources the Dockerfile actually copies are
transferred. backend/rust sources stay tracked; only target/ is ignored.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Walk down the release history and add per-release one-liners for 4.3.0,
4.2.0, 4.1.0, and 4.0.0 in the Latest News section, leading with the
headline win for each release. Move Prem into a collapsible "Past
sponsors" block under the active sponsors row.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.7 [claude-code]
* ⬆️ Update ggml-org/llama.cpp
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(llama-cpp): track upstream rename checkpoint_every_nt -> checkpoint_min_step
Upstream llama.cpp renamed common_params::checkpoint_every_nt to
checkpoint_min_step and changed its default from 8192 to 256. The semantics
also shifted: it used to enforce a fixed checkpoint cadence during prefill,
now it sets a minimum spacing between context checkpoints. Track the new
field name in grpc-server.cpp and accept the old option names as backward-
compatible aliases for users with existing configs.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
When the C++ autoparser is in pure-content fallback mode (qwen3-4b after
model emits a tool-call JSON in non-thinking mode, the streaming worker
ended the SSE stream with a spurious
data: {...,"delta":{"reasoning":"{\"name\":\"exec\",\"arguments\":...}"}}
chunk carrying the same JSON that was already in delta.tool_calls.
The Go-side ReasoningExtractor is configured from
DetectThinkingStartToken, which scans the model's jinja chat template
verbatim and finds <think> inside an {% if enable_thinking %} block
without evaluating the conditional. Every output chunk then runs through
PrependThinkingTokenIfNeeded, which synthesizes a <think> in front and
makes ExtractReasoning treat everything after as reasoning. The autoparser
correctly classifies zero reasoning (qwen3's tool format isn't on
llama.cpp's recognized-tool list, so all tokens land in
ChatDelta.Content), but processStreamWithTools then preferred
extractor.Reasoning() over functions.ReasoningFromChatDeltas at the
end-of-stream flush — handing the polluted Go-side state to
buildDeferredToolCallChunks, which emitted it as a trailing reasoning
chunk.
Two changes:
* Add a sticky preferAutoparser flag to processStreamWithTools, mirroring
the analogous flag in processStream from #9985. Once any ChatDelta
carries content or reasoning, the flag stays on for the rest of the
stream and the worker stops falling back to the Go-side extractor for
per-token deltas. This avoids the per-chunk leak path and the cumulative
pollution.
* Extract chooseDeferredReasoning, a small helper that selects the
end-of-stream reasoning source. When preferAutoparser is set, return
functions.ReasoningFromChatDeltas(chatDeltas); otherwise fall back to
extractor.Reasoning() (the correct source for vLLM and other backends
with no autoparser).
The helper has a focused test suite covering both sides of the contract:
autoparser-active with empty reasoning (the qwen3 case — the fix's
purpose), autoparser-active with real reasoning_content
(jinja-with-recognized-format models), and autoparser-not-active with
genuine Go-side reasoning (vLLM-style backends).
E2E with combined #9988 and this fix on qwen3-4b post-#9985 gallery
shape: 18 content chunks of the tool-call JSON, 1 tool_call chunk with
name='exec' and the right arguments, finish_reason=tool_calls, and zero
reasoning chunks — down from one polluted reasoning chunk before this
fix.
Depends on #9999 (the streaming JSON tool-call gating bug for qwen3) to
make the trailing chunk observable end-to-end; the helper unit tests are
independent.
Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>