LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-30 19:37:00 -04:00

Adira 28d7397743 fix(openai): stop max_tokens streaming retry loop on reasoning models (#9716) (#10448)

When a thinking model spends its entire max_tokens budget on the reasoning
block, the C++ autoparser clears the raw Response and delivers reasoning-only
ChatDeltas (no content, no tool calls). ComputeChoices' empty-response retry
then fires and regenerates from scratch up to maxRetries times, each
re-consuming the whole budget, instead of terminating with finish_reason
"length" (issue #9716).

Add a reachedTokenBudget helper and suppress both the built-in and
caller-driven retries when the completion count has reached the configured
max_tokens ceiling. Report finish_reason "length" instead of "stop" in the
streaming and non-streaming chat paths when the budget was exhausted.

Add a deterministic regression test that counts backend invocations
(previously 6, now 1) plus boundary tests for the helper.
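
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: only the name reachedTokenBudget comes from the message, while its signature, the result type, the generate function, and the treatment of a non-positive max_tokens as "no limit" are all assumptions made for the example.

```go
package main

import "fmt"

// reachedTokenBudget reports whether the completion has consumed the
// configured max_tokens ceiling. The signature is an assumption; treating
// maxTokens <= 0 as "no limit" is also an assumption.
func reachedTokenBudget(completionTokens, maxTokens int) bool {
	return maxTokens > 0 && completionTokens >= maxTokens
}

// result is a hypothetical stand-in for one backend response.
type result struct {
	Content          string // user-visible content (empty when only reasoning was produced)
	Reasoning        string // reasoning block emitted by a thinking model
	CompletionTokens int    // tokens generated in this attempt
}

// generate is a hypothetical retry loop: it retries empty responses up to
// maxRetries times, but stops immediately when the token budget is exhausted.
// It returns the last result, the finish reason, and the number of backend calls.
func generate(backend func() result, maxTokens, maxRetries int) (result, string, int) {
	var r result
	calls := 0
	for attempt := 0; attempt <= maxRetries; attempt++ {
		r = backend()
		calls++
		if r.Content != "" {
			break // non-empty response, nothing to retry
		}
		if reachedTokenBudget(r.CompletionTokens, maxTokens) {
			break // retrying would only re-consume the whole budget
		}
	}
	finish := "stop"
	if reachedTokenBudget(r.CompletionTokens, maxTokens) {
		finish = "length"
	}
	return r, finish, calls
}

func main() {
	// A backend that spends the entire 64-token budget on reasoning.
	backend := func() result {
		return result{Reasoning: "thinking...", CompletionTokens: 64}
	}
	_, finish, calls := generate(backend, 64, 5)
	fmt.Println(finish, calls) // prints "length 1"
}
```

With maxRetries set to 5 and the budget check removed, the same loop would call the backend 6 times, which is consistent with the invocation counts quoted in the regression test description.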

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Dennisadira <dennisadira@gmail.com>

2026-06-30 09:01:53 +02:00

anthropic

feat(pii): NER tier engine — privacy-filter.cpp backend + NER-centric PII filter (#10360 )

2026-06-18 11:45:22 +01:00

elevenlabs

feat(tts): support per-request instructions and params (#10172)

2026-06-04 11:45:02 +02:00

explorer

feat: add distributed mode (#9124)

2026-03-30 00:47:27 +02:00

jina

feat(whisper): honor client cancellation via ggml abort_callback (#9710)