mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
fix(openai): stop max_tokens streaming retry loop on reasoning models

When a thinking model spends its entire max_tokens budget on the reasoning block, the C++ autoparser clears the raw Response and delivers reasoning-only ChatDeltas (no content, no tool calls). ComputeChoices' empty-response retry then fires and regenerates from scratch up to maxRetries times, each re-consuming the whole budget, instead of terminating with finish_reason "length" (issue #9716).

Add a reachedTokenBudget helper and suppress both the built-in and caller-driven retries when the completion count has reached the configured max_tokens ceiling. Report finish_reason "length" instead of "stop" in the streaming and non-streaming chat paths when the budget was exhausted.

Adds a deterministic regression test that counts backend invocations (previously 6, now 1) plus boundary tests for the helper.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Dennisadira <dennisadira@gmail.com>
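The retry suppression described in the commit message can be sketched roughly as below. This is an illustration under stated assumptions, not the code from the commit: only the name reachedTokenBudget and the "stop"/"length" values come from the source. The helper's signature, the treatment of a non-positive max_tokens as "no limit", and the shouldRetryEmpty and finishReasonFor functions are hypothetical.

```go
package main

import "fmt"

// Values copied from the finish reason constants in this file.
const (
	FinishReasonStop   = "stop"
	FinishReasonLength = "length"
)

// reachedTokenBudget reports whether the completion used up the configured
// max_tokens ceiling. The signature is an assumption; treating maxTokens <= 0
// as "no limit configured" is also an assumption.
func reachedTokenBudget(completionTokens, maxTokens int) bool {
	return maxTokens > 0 && completionTokens >= maxTokens
}

// shouldRetryEmpty is a hypothetical stand-in for the empty-response retry
// decision: an empty result is retried only while attempts remain and the
// token budget was not exhausted, since a retry would burn the budget again.
func shouldRetryEmpty(content string, completionTokens, maxTokens, attempt, maxRetries int) bool {
	if content != "" {
		return false // there is a usable response, nothing to retry
	}
	if reachedTokenBudget(completionTokens, maxTokens) {
		return false // budget spent (for example on reasoning), stop here
	}
	return attempt < maxRetries
}

// finishReasonFor is a hypothetical helper that picks the finish_reason
// for a completion that ended without tool calls.
func finishReasonFor(completionTokens, maxTokens int) string {
	if reachedTokenBudget(completionTokens, maxTokens) {
		return FinishReasonLength
	}
	return FinishReasonStop
}

func main() {
	// Reasoning consumed all 256 tokens and no content was produced.
	fmt.Println(shouldRetryEmpty("", 256, 256, 0, 5))
	fmt.Println(finishReasonFor(256, 256))
}
```

With these assumptions, a reasoning-only completion that hits the ceiling is not retried and is reported with finish_reason "length", while an empty completion that stopped early is still eligible for a retry.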
12 lines
375 B
Go
package openai

// Finish reason constants for OpenAI API responses
const (
	FinishReasonStop         = "stop"
	FinishReasonToolCalls    = "tool_calls"
	FinishReasonFunctionCall = "function_call"
	// FinishReasonLength is reported when generation stopped because it
	// reached the max_tokens budget rather than a natural stop (issue #9716).
	FinishReasonLength = "length"
)