* fix(streaming/tools): stop healing-marker stubs from gating off content
When the C++ autoparser is in pure-content fallback mode (e.g. qwen3
without --jinja) and the model emits a tool call as JSON, the streaming
worker calls ParseJSONIterative on each new chunk. parseJSONWithStack
heals partial input like `{` into `{"<marker>":1}` where <marker> is a
random integer. removeHealingMarkerFromJSON only stripped the marker
from values, so the synthetic key survived and downstream callers saw
a stub object with a random-looking key.
chat_stream_workers.go's JSON tool-call detector then bumped
lastEmittedCount past the stub even though no real tool call was
emitted, gating off ALL subsequent content chunks. The qwen3 + tools +
streaming case ended up dribbling only the first `{"` to clients and
then nothing, even when the model went on to call the noAction
`answer({"message": "…"})` pseudo-tool.
Three changes, each with its own regression test:
* removeHealingMarkerFromJSON now strips the marker suffix from keys
too, dropping the entry when the truncated key is empty. Inputs like
`{` no longer leak `{"<marker>":1}` to callers; partial keys like
`{ "code` still preserve the model-typed prefix `code`.
* ParseJSONIterative skips empty-after-healing maps so a healed `{`
doesn't surface as a stub result.
* The streaming JSON detector now breaks (not continues) on entries
without a usable `name`, and only bumps lastEmittedCount past
successfully-emitted entries. Defense-in-depth against any future
partial-parse shape.
The parser tests cover eight partial-JSON-prefix shapes and verify no
marker characters leak into keys, plus the two early shapes (`{`,
`{"`) that should not surface a stub at all.
Fixes#9988
Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(streaming/tools): cover the autoparser-correctly-working path
Extract the JSON tool-call streaming emit loop into emitJSONToolCallDeltas
and unit-test it against every shape that can hit the streaming worker:
* the bug case — a healing-marker stub at index 0 must NOT bump
lastEmittedCount, so subsequent content chunks keep flowing;
* the autoparser-correctly-working case — empty jsonResults (because
the C++ autoparser cleared the raw text and delivers tool calls via
TokenUsage.ChatDeltas) is a no-op, leaving the deferred end-of-stream
emitter to ship the autoparser's tool calls;
* a single complete tool call — emit one chunk, advance to 1;
* arguments arriving as a JSON-string vs as a nested object — both
serialize to the wire as JSON-string arguments;
* multiple parallel tool calls — one chunk each;
* a real tool call followed by a partial stub — emit the real one,
stop at the stub, resume on a later chunk once the stub completes.
Locks down the no-regression guarantee the user asked for: this PR's
fix is scoped to the pure-content fallback path; when the autoparser
actually classifies tool calls (jinja-recognized chat format with tool
support), the helper is a no-op and nothing changes.
Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat: add distributed mode (experimental)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix data races, mutexes, transactions
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactorings
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix events and tool stream in agent chat
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* use ginkgo
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(cron): compute correctly time boundaries avoiding re-triggering
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* enhancements, refactorings
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* do not flood of healthy checks
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* do not list obvious backends as text backends
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* tests fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactoring and consolidation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Drop redundant healthcheck
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* enhancements, refactorings
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat: wire min_p
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat: inferencing defaults
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(refactor): re-use iterative parser
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore: generate automatically inference defaults from unsloth
Instead of trying to re-invent the wheel and maintain here the inference
defaults, prefer to consume unsloth ones, and contribute there as
necessary.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore: apply defaults also to models installed via gallery
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore: be consistent and apply fallback to all endpoint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(functions): add peg-based parsing
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat: support returning toolcalls directly from backends
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore: do run PEG only if backend didn't send deltas
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(function): Add XML Tool Call Parsing Support
Extend the function parsing system in LocalAI to support XML-style tool calls, similar to how JSON tool calls are currently parsed. This will allow models that return XML format (like <tool_call><function=name><parameter=key>value</parameter></function></tool_call>) to be properly parsed alongside text content.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* thinking before tool calls, more strict support for corner cases with no tools
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Support streaming tools
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Iterative JSON
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Iterative parsing
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Consume JSON marker
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixup
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* add tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fix pending TODOs
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Don't run other parsing with ParseRegex
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat: initial hook to install elements directly
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP: ui changes
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Move HF api client to pkg
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add simple importer for gguf files
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add opcache
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* wire importers to CLI
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add omitempty to config fields
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fix tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add MLX importer
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Small refactors to star to use HF for discovery
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Common preferences
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add support to bare HF repos
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(importer/llama.cpp): add support for mmproj files
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* add mmproj quants to common preferences
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fix vlm usage in tokenizer mode with llama.cpp
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama.cpp): expose env vars as options for consistency
This allows to configure everything in the YAML file of the model rather
than have global configurations
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama.cpp): respect usetokenizertemplate and use llama.cpp templating system to process messages
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Detect template exists if use tokenizer template is enabled
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Better recognization of chat
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixes to support tool calls while using templates from tokenizer
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Drop template guessing, fix passing tools to tokenizer
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Extract grammar and other options from chat template, add schema struct
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Automatically set use_jinja
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Cleanups, identify by default gguf models for chat
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Update docs
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* wip
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* get rid of panics
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* expose it properly from the config
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Simplify
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* forgot to commit
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Remove focus on test
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Small fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(functions): enhance parsing with broken JSON when we parse the raw results
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* breaking: make function name by default
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(grammar): dynamically generate grammars with mutating keys
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor: simplify condition
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Update docs
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* models(gallery): add mistral-0.3 and command-r, update functions
Add also disable_parallel_new_lines to disable newlines in the JSON
output when forcing parallel tools. Some models (like mistral) might be
very sensible to that when being used for function calling.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* models(gallery): add aya-23-8b
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(functions): relax mixedgrammars
Extend even more the functionalities and when mixed mode is enabled,
tolerate also both strings and JSON in the result - in this case we make
sure that the JSON can be correctly parsed.
This also updates the examples and the gallery model to configure the
grammar.
The changeset also breaks current function/grammar configuration as it
reserves now a stanza in the YAML config.
For example:
```yaml
function:
grammar:
# This allows the grammar to also return messages
mixed_mode: true
# Suffix to add to the grammar
# prefix: '<tool_call>\n'
# Force parallel calls in the grammar
# parallel_calls: true
```
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor, add a way to disable mixed json and freestring
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fix linting issues
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(functions): allow to use JSONRegexMatch unconditionally
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(functions): make json_regex_match a list
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
feat(functions): support mixed JSON BNF grammar
This PR provides new options to control how functions are extracted from
the LLM, and also provides more control on how JSON grammars can be used
(also in conjunction).
New YAML settings introduced:
- `grammar_message`: when enabled, the generated grammar can also decide
to push strings and not only JSON objects. This allows the LLM to pick
to either respond freely or using JSON.
- `grammar_prefix`: Allows to prefix a string to the JSON grammar
definition.
- `replace_results`: Is a map that allows to replace strings in the LLM
result.
As an example, consider the following settings for Hermes-2-Pro-Mistral,
which allow extracting both JSON results coming from the model, and the
ones coming from the grammar:
```yaml
function:
# disable injecting the "answer" tool
disable_no_action: true
# This allows the grammar to also return messages
grammar_message: true
# Suffix to add to the grammar
grammar_prefix: '<tool_call>\n'
return_name_in_function_response: true
# Without grammar uncomment the lines below
# Warning: this is relying only on the capability of the
# LLM model to generate the correct function call.
# no_grammar: true
# json_regex_match: "(?s)<tool_call>(.*?)</tool_call>"
replace_results:
"<tool_call>": ""
"\'": "\""
```
Note: To disable entirely grammars usage in the example above, uncomment the
`no_grammar` and `json_regex_match`.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
When enabling grammar with functions, it might be useful to
allow more flexibility to support models that are fine-tuned against returning
function calls of the form of { "name": "function_name", "arguments" {...} }
rather then { "function": "function_name", "arguments": {..} }.
This might call out to a more generic approach later on, but for the moment being we can easily support both
as we have just to specific different types.
If needed we can expand on this later on
Signed-off-by: mudler <mudler@localai.io>